SW4-003: Observability Extension

Status: Draft
Version: 0.1.0
Date: 2026-01-10
Extends: Core Spec §4 (Architecture - Observability Sink)

Abstract

This extension defines standard observability interfaces for SW4RM implementations, including OpenTelemetry integration, Prometheus metrics, and health check endpoints. Implementations conforming to this extension expose a common set of metrics, trace spans, health checks, and log fields, so operators can monitor them with shared dashboards and alerts.

Motivation

The core specification mentions an "Observability Sink" but does not define:

Standard metric names and labels
Tracing span conventions
Health check interfaces
Log correlation requirements

Without standardization, operators cannot build unified dashboards or alerts across SW4RM deployments.

1. Metrics

1.1. Metric Naming Convention

All SW4RM metrics MUST use the prefix sw4rm_ and follow Prometheus naming conventions:

snake_case names
Suffix describing the unit or type where applicable (_total, _seconds, _bytes)
Labels for dimensions (service, method, status, agent_id)
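
The naming rules above could be checked in a lint step. The following is a minimal sketch; the regular expression and the helper name is_valid_metric_name are illustrative assumptions, not part of this extension.

```python
import re

# Assumed pattern: the sw4rm_ prefix followed by lowercase snake_case segments.
_METRIC_NAME = re.compile(r"^sw4rm_[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the name has the sw4rm_ prefix and is snake_case."""
    return _METRIC_NAME.fullmatch(name) is not None


for candidate in ("sw4rm_rpc_requests_total", "sw4rm_RpcRequests", "rpc_requests_total"):
    print(candidate, is_valid_metric_name(candidate))
```

The check covers only the prefix and casing; whether a unit suffix is appropriate still needs human review.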

1.2. Required Metrics

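Implementations MUST expose the RPC request and duration metrics (sw4rm_rpc_requests_total, sw4rm_rpc_duration_seconds) and SHOULD expose the remaining metrics in this section (see §6.1 and §6.2):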

RPC Metrics

sw4rm_rpc_requests_total{service, method, status}
  Counter: Total RPC requests by service, method, and gRPC status

sw4rm_rpc_duration_seconds{service, method}
  Histogram: RPC latency distribution
  Buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]

sw4rm_rpc_in_flight{service}
  Gauge: Currently executing RPCs per service
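
To illustrate how the bucket list for sw4rm_rpc_duration_seconds is used, the sketch below counts observations the way a Prometheus histogram does, with cumulative buckets and a final +Inf bucket. It uses only the standard library; the function name cumulative_counts is an assumption for illustration, and a real implementation would normally rely on a metrics client library instead.

```python
from bisect import bisect_left

# Upper bounds in seconds, copied from sw4rm_rpc_duration_seconds above.
BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]


def cumulative_counts(durations):
    """Return cumulative bucket counts; the last entry is the +Inf bucket."""
    per_bucket = [0] * (len(BUCKETS) + 1)  # last slot holds values above 30s
    for d in durations:
        # bisect_left finds the first bound that is >= d ("less than or equal").
        per_bucket[bisect_left(BUCKETS, d)] += 1
    totals, running = [], 0
    for count in per_bucket:
        running += count
        totals.append(running)
    return totals


counts = cumulative_counts([0.003, 0.2, 0.2, 45.0])
print(counts)
```

Because the counts are cumulative, the +Inf bucket always equals the total number of observations.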

Agent Metrics

sw4rm_agents_registered_total
  Gauge: Number of currently registered agents

sw4rm_agent_state{agent_id, state}
  Gauge: Agent state (1 = current state, 0 = other states)

sw4rm_agent_heartbeat_age_seconds{agent_id}
  Gauge: Seconds since last heartbeat
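
The two per-agent gauges can be derived as in the sketch below. The state names in STATES are hypothetical placeholders (the actual agent states come from the core specification), and both helper names are assumptions.

```python
import time

# Hypothetical state names, for illustration only.
STATES = ("idle", "running", "suspended")


def agent_state_samples(agent_id, current_state):
    """One sw4rm_agent_state sample per state: 1 for the current state, else 0."""
    return {(agent_id, state): (1 if state == current_state else 0) for state in STATES}


def heartbeat_age_seconds(last_heartbeat, now=None):
    """Value for sw4rm_agent_heartbeat_age_seconds, clamped at zero."""
    now = time.time() if now is None else now
    return max(0.0, now - last_heartbeat)


samples = agent_state_samples("agent-001", "running")
age = heartbeat_age_seconds(last_heartbeat=100.0, now=142.5)
```

Emitting a 0 sample for every non-current state keeps the series set stable, so a query such as sw4rm_agent_state == 1 selects exactly one series per agent.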

Scheduler Metrics

sw4rm_tasks_queued_total{priority}
  Gauge: Tasks currently queued by priority

sw4rm_tasks_completed_total{status}
  Counter: Completed tasks by status (success, failed, preempted)

sw4rm_preemption_total{type}
  Counter: Preemptions by type (cooperative, forced)

Negotiation Metrics

sw4rm_negotiations_active_total
  Gauge: Currently active negotiation rooms

sw4rm_negotiation_decisions_total{outcome}
  Counter: Decisions by outcome (approved, revision_requested, escalated)

sw4rm_negotiation_quorum_failures_total
  Counter: Negotiations that failed to reach quorum (see SW4-001)

sw4rm_negotiation_vote_latency_seconds
  Histogram: Time from proposal to final vote

Handoff Metrics

sw4rm_handoffs_total{status}
  Counter: Handoffs by status (accepted, rejected, timeout)

sw4rm_handoff_duration_seconds
  Histogram: Time from request to completion

1.3. Metric Labels

Standard labels that SHOULD be applied consistently:

Label	Description	Example Values
`service`	gRPC service name	`RegistryService`, `SchedulerService`
`method`	RPC method name	`RegisterAgent`, `SubmitTask`
`status`	gRPC status code on RPC metrics; task or handoff result on other metrics	`OK`, `DEADLINE_EXCEEDED`, `INTERNAL`; `success`, `accepted`
`agent_id`	Agent identifier	`agent-001`
`priority`	Task priority	`-19`, `0`, `20`
`outcome`	Decision outcome	`approved`, `revision_requested`, `escalated`
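
As an illustration of how these labels appear on the wire, the sketch below renders one sample in the Prometheus text exposition format. The helper format_sample is an assumption for illustration; a metrics client library would normally produce this output.

```python
def _escape(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample line such as: name{key="value"} 1"""
    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{k}="{_escape(str(v))}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"


line = format_sample(
    "sw4rm_rpc_requests_total",
    {"service": "RegistryService", "method": "RegisterAgent", "status": "OK"},
    42,
)
print(line)
```

Sorting the labels is a choice made here for stable output; the exposition format does not require any particular label order.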

2. Distributed Tracing

2.1. OpenTelemetry Integration

Implementations SHOULD support OpenTelemetry tracing with:

Automatic span creation for all RPCs
Context propagation via gRPC metadata
Correlation with correlation_id from messages

2.2. Span Naming Convention

sw4rm.{service}.{method}

Examples (service and method names are written in snake_case, without the Service suffix):

sw4rm.registry.register_agent
sw4rm.scheduler.submit_task
sw4rm.negotiation_room.submit_proposal
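
A span name can be derived from the gRPC service and method names, as sketched below. The conversion rule (drop a trailing Service, then convert CamelCase to snake_case) is inferred from the examples above and is an assumption; span_name is a hypothetical helper.

```python
import re


def _snake(name: str) -> str:
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def span_name(service: str, method: str) -> str:
    base = service[: -len("Service")] if service.endswith("Service") else service
    return f"sw4rm.{_snake(base)}.{_snake(method)}"


print(span_name("RegistryService", "RegisterAgent"))
print(span_name("NegotiationRoomService", "SubmitProposal"))
```

Names that contain acronyms (for example a method called GetRPCStatus) would need a more careful conversion than this simple rule provides.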

2.3. Required Span Attributes

Attribute	Type	Description
`sw4rm.correlation_id`	string	Message correlation ID
`sw4rm.agent_id`	string	Requesting agent ID
`sw4rm.task_id`	string	Task ID (if applicable)
`sw4rm.negotiation_room_id`	string	Room ID (if applicable)
`sw4rm.artifact_id`	string	Artifact ID (if applicable)
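
The attribute table can be turned into a small helper that omits the "if applicable" attributes when they do not apply. This is a sketch only; the function name span_attributes and its signature are assumptions.

```python
def span_attributes(correlation_id, agent_id, task_id=None,
                    negotiation_room_id=None, artifact_id=None):
    """Build the sw4rm.* span attributes, leaving out those that do not apply."""
    attrs = {
        "sw4rm.correlation_id": correlation_id,
        "sw4rm.agent_id": agent_id,
        "sw4rm.task_id": task_id,
        "sw4rm.negotiation_room_id": negotiation_room_id,
        "sw4rm.artifact_id": artifact_id,
    }
    # Drop unset attributes rather than recording empty values.
    return {key: value for key, value in attrs.items() if value is not None}


attrs = span_attributes("abc-123", "agent-001", task_id="task-100")
```

The resulting dictionary can be passed to whatever tracing API is in use when the span is created.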

2.4. Span Events

Implementations SHOULD add span events for significant state changes:

# Assumes `span` is the current OpenTelemetry span (Python API).
span.add_event("quorum_evaluated", {
    "votes_received": 3,
    "votes_expected": 5,
    "quorum_met": True
})

span.add_event("decision_rendered", {
    "outcome": "approved",
    "aggregated_score": 8.5
})

3. Health Checks

3.1. Health Check Endpoint

Implementations MUST expose a health check endpoint:

gRPC: grpc.health.v1.Health/Check (standard gRPC health protocol)

HTTP (optional): GET /health returning:

{
  "status": "healthy|degraded|unhealthy",
  "version": "0.5.0",
  "uptime_seconds": 3600,
  "checks": {
    "registry_connection": "healthy",
    "scheduler_connection": "healthy",
    "database_connection": "healthy"
  }
}
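
One way to derive the top-level status is to report the worst result among the individual checks. That aggregation rule is an assumption (this extension does not prescribe one), as is the helper name health_report.

```python
import json

# Assumed ordering from best to worst.
_SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def health_report(version, uptime_seconds, checks):
    """Build the /health body; overall status is the worst individual check."""
    worst = max(checks.values(), key=_SEVERITY.__getitem__, default="healthy")
    return {
        "status": worst,
        "version": version,
        "uptime_seconds": uptime_seconds,
        "checks": dict(checks),
    }


report = health_report("0.5.0", 3600, {
    "registry_connection": "healthy",
    "scheduler_connection": "degraded",
    "database_connection": "healthy",
})
print(json.dumps(report, indent=2))
```

A stricter policy, such as treating any degraded dependency as unhealthy, would only require changing the severity mapping.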

3.2. Readiness vs Liveness

Implementations SHOULD distinguish:

Liveness (/health/live): Process is running
Readiness (/health/ready): Process can accept traffic
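
The distinction can be sketched as a pure function from request path to HTTP response. The 200 and 503 status codes and the probe_response helper are assumptions chosen to match common orchestrator probe behavior; this extension does not mandate them.

```python
def probe_response(path, ready):
    """Return (http_status, body) for the liveness and readiness paths."""
    if path == "/health/live":
        # Answering at all demonstrates that the process is running.
        return 200, {"status": "healthy"}
    if path == "/health/ready":
        if ready:
            return 200, {"status": "healthy"}
        # Alive but not able to accept traffic yet (for example, still registering).
        return 503, {"status": "unhealthy"}
    return 404, {"error": "not found"}


print(probe_response("/health/live", ready=False))
print(probe_response("/health/ready", ready=False))
```

Keeping the liveness path free of dependency checks avoids restarting a process that is merely waiting on a downstream service.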

3.3. Component Health Checks

Each SW4RM component SHOULD check:

Component	Health Checks
Registry	Database connectivity, schema version
Scheduler	Registry connectivity, queue health
Agent	Registry registration, heartbeat success
NegotiationRoom	Store connectivity, pending decisions

4. Structured Logging

4.1. Log Format

Implementations SHOULD use structured logging (JSON), for example:

{
  "timestamp": "2026-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Task completed",
  "service": "scheduler",
  "correlation_id": "abc-123",
  "trace_id": "def-456",
  "span_id": "ghi-789",
  "agent_id": "agent-001",
  "task_id": "task-100",
  "duration_ms": 1500
}

4.2. Required Log Fields

Field	Description
`timestamp`	ISO 8601 timestamp
`level`	Log level (debug, info, warn, error)
`message`	Human-readable message
`service`	Service name
`correlation_id`	Message correlation ID
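
The fields above can be emitted with a custom formatter for Python's standard logging module, as sketched below. The class name JsonFormatter, the mapping of CRITICAL to error, and reading correlation_id from a record attribute are assumptions.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the levels listed in the table above.
_LEVELS = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
           "ERROR": "error", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVELS.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # correlation_id is attached by the caller, e.g. via logging's `extra`.
        correlation_id = getattr(record, "correlation_id", None)
        if correlation_id is not None:
            entry["correlation_id"] = correlation_id
        return json.dumps(entry)


record = logging.LogRecord("sw4rm", logging.INFO, "example.py", 0,
                           "Task completed", None, None)
record.correlation_id = "abc-123"
line = JsonFormatter("scheduler").format(record)
print(line)
```

In application code the formatter would be attached to a handler with handler.setFormatter(JsonFormatter("scheduler")).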

4.3. Log Correlation

Logs MUST include correlation_id when available to enable:

Tracing request flow across services
Correlating logs with distributed traces
Debugging multi-agent interactions
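
One way to make correlation_id available to every log call within a request is to hold it in a context variable and copy it onto each record with a logging filter. This is a sketch under that assumption; correlation_id_var and CorrelationFilter are hypothetical names.

```python
import contextvars
import logging

# Holds the correlation ID of the message currently being processed, if any.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto each log record when one is set."""

    def filter(self, record):
        correlation_id = correlation_id_var.get()
        if correlation_id is not None and not hasattr(record, "correlation_id"):
            record.correlation_id = correlation_id
        return True  # never drop the record


# Usage: set the variable when a message arrives, reset it when handling ends.
token = correlation_id_var.set("abc-123")
record = logging.LogRecord("sw4rm", logging.INFO, "example.py", 0,
                           "Task completed", None, None)
CorrelationFilter().filter(record)
correlation_id_var.reset(token)
```

Context variables are isolated per asyncio task, so concurrent requests in one process do not see each other's IDs.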

5. Alerting Recommendations

5.1. Critical Alerts

Implementations SHOULD alert on:

Condition	Severity	Description
`sw4rm_agents_registered_total == 0`	Critical	No agents registered
`sw4rm_agent_heartbeat_age_seconds > 60`	Warning	Agent may be unhealthy
`increase(sw4rm_rpc_requests_total{status="INTERNAL"}[1m]) > 10`	Critical	Internal errors spike
`increase(sw4rm_negotiation_quorum_failures_total[1h]) > 5`	Warning	Quorum problems
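
The counter-based conditions depend on how much a counter grew over a window. The sketch below computes that growth from ordered samples and treats any drop as a counter reset after a restart. It is a simplification with an assumed helper name; Prometheus's own increase() function additionally extrapolates to the window boundaries.

```python
def counter_increase(samples):
    """Total growth across ordered counter samples, tolerating resets to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A lower value than before means the counter restarted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


# Samples of sw4rm_rpc_requests_total{status="INTERNAL"} within one minute,
# with a process restart between the third and fourth sample.
growth = counter_increase([100, 104, 109, 3, 8])
alert = growth > 10
```

Without reset handling, the restart in this example would look like a large negative change and hide the error spike.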

5.2. SLI/SLO Recommendations

SLI	Recommended SLO
RPC success rate	99.9%
RPC p99 latency	< 1s (varies by operation)
Heartbeat freshness	< 30s
Negotiation quorum success	> 95%
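
A success-rate SLO translates into an error budget: the number of failures that can occur in a window before the objective is missed. The sketch below works through that arithmetic with exact fractions to avoid floating-point rounding; the function names and the convention of passing the SLO as a percentage string are assumptions.

```python
from fractions import Fraction


def allowed_failures(total_requests: int, slo_percent: str) -> int:
    """Failures permitted in a window before a success-rate SLO is missed."""
    budget = 1 - Fraction(slo_percent) / 100
    return int(total_requests * budget)


def budget_consumed(failures: int, total_requests: int, slo_percent: str) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    allowed = allowed_failures(total_requests, slo_percent)
    return failures / allowed if allowed else float("inf")


# With a 99.9% RPC success SLO, one million requests allow 1000 failures.
print(allowed_failures(1_000_000, "99.9"))
print(budget_consumed(250, 1_000_000, "99.9"))
```

The same calculation applies to the quorum SLO: at 95%, 200 negotiations allow 10 quorum failures.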

6. Implementation Requirements

6.1. MUST Requirements

Implementations MUST:

Expose gRPC health check endpoint
Include correlation_id in logs whenever it is available (§4.3)
Expose the basic RPC metrics (sw4rm_rpc_requests_total, sw4rm_rpc_duration_seconds)

6.2. SHOULD Requirements

Implementations SHOULD:

Support OpenTelemetry tracing export
Expose Prometheus-compatible metrics endpoint
Use structured JSON logging
Expose all metrics defined in §1.2

6.3. MAY Requirements

Implementations MAY:

Provide Grafana dashboard templates
Include alerting rule templates
Support custom metric labels

7. Compatibility

This extension is additive. Implementations not conforming to SW4-003:

Will expose only implementation-specific observability data
Cannot be covered by dashboards and alerts built on the metric, span, and log conventions defined here
SHOULD document available observability interfaces

8. References

Core Spec §4: Architecture (Observability Sink)
OpenTelemetry Specification: https://opentelemetry.io/docs/specs/
Prometheus Naming Conventions: https://prometheus.io/docs/practices/naming/
gRPC Health Checking Protocol: https://github.com/grpc/grpc/blob/master/doc/health-checking.md

This extension is part of the SW4RM protocol extension series.