3. SW4RM Protocol Specification
SW4RM Protocol v0.1 | Status: Production Ready | Last Updated: 2025-08-09
This specification defines the SW4RM message-driven agent communication protocol. The protocol is built on gRPC and Protocol Buffers and is intended for distributed agentic systems that need guaranteed message delivery, observability, and enterprise security features.
Terminology: “Agent” in this specification follows the supervised, process‑isolated definition in documentation/index.md (see “Agents and Agentic Interaction”), which differs from common industry usage where “agent” may mean an LLM wrapper or in‑process automation.
3.1. Executive Summary
The SW4RM protocol addresses the fundamental challenges of distributed agentic systems by providing a complete communication framework with the following core capabilities:
- Guaranteed Message Delivery: At-least-once delivery semantics with configurable consistency levels
- Comprehensive State Management: Persistent state across failures with automatic recovery mechanisms
- Enterprise Security: Zero-trust architecture with mutual TLS and role-based access control
- Production Observability: Complete distributed tracing, metrics, and audit logging
- Horizontal Scalability: Linear scaling with no single points of failure
- Multi-Tenancy Support: Secure isolation between different agent workloads
3.2. Architectural Foundation and Design Principles
3.2.1. Service-Oriented Architecture (SOA) Implementation
SW4RM implements a microservices architecture with clear service boundaries, standardized communication protocols, and comprehensive fault tolerance mechanisms. The architecture is designed for:
- Independent Service Scaling: Each service can be scaled independently based on workload requirements
- Fault Isolation: Service failures are contained and do not cascade to other components
- Technology Diversity: Services can be implemented in different technologies while maintaining protocol compatibility
- Operational Independence: Services can be deployed, monitored, and managed independently
```mermaid
graph TB
    subgraph "Client Layer [gRPC/TLS]"
        AGENT[Agent Applications<br/>Business Logic Layer]
        SDK[SW4RM SDK<br/>Runtime Library]
    end
    subgraph "Core Infrastructure Services"
        REGISTRY[Registry Service<br/>:50051<br/>Agent Discovery & Health]
        ROUTER[Router Service<br/>:50052<br/>Message Delivery & Routing]
        SCHEDULER[Scheduler Service<br/>:50053<br/>Task Distribution & Load Balancing]
    end
    subgraph "Extended Capability Services"
        HITL[Human-in-the-Loop Service<br/>:50061<br/>Approval Workflows & Escalation]
        WORKTREE[Worktree Service<br/>:50062<br/>Git Integration & Repository Management]
        TOOLS[Tool Service<br/>:50063<br/>External System Integration]
        NEGOTIATE[Negotiation Service<br/>:50064<br/>Multi-Agent Consensus & Coordination]
        REASON[Reasoning Service<br/>:50065<br/>Decision Support & Analytics]
        AUDIT[Audit Service<br/>:50066<br/>Compliance & Security Logging]
        CONNECT[Connector Service<br/>:50067<br/>External API Integration]
    end
    subgraph "Data & Storage Layer"
        POSTGRES[(PostgreSQL Cluster<br/>Transactional State)]
        REDIS[(Redis Cluster<br/>Session & Cache)]
        S3[(Object Storage<br/>Large Payloads & Archives)]
        GIT[(Git Repositories<br/>Source Code & Configuration)]
    end
    subgraph "Observability & Security"
        PROMETHEUS[Prometheus<br/>Metrics Collection]
        JAEGER[Jaeger<br/>Distributed Tracing]
        VAULT[HashiCorp Vault<br/>Secrets Management]
        CONSUL[Consul<br/>Service Discovery]
    end
    AGENT -->|gRPC/TLS| SDK
    SDK -->|Load Balanced| REGISTRY
    SDK -->|Message Flow| ROUTER
    SDK -->|Task Requests| SCHEDULER
    SDK -->|Approval Requests| HITL
    SDK -->|Repository Operations| WORKTREE
    SDK -->|External Calls| TOOLS
    SDK -->|Coordination| NEGOTIATE
    SDK -->|Analytics| REASON
    SDK -->|Audit Events| AUDIT
    SDK -->|API Integrations| CONNECT
    REGISTRY --> POSTGRES
    ROUTER --> POSTGRES
    ROUTER --> REDIS
    SCHEDULER --> POSTGRES
    SCHEDULER --> REDIS
    WORKTREE --> GIT
    TOOLS --> S3
    AUDIT --> S3
    REGISTRY -.-> PROMETHEUS
    ROUTER -.-> JAEGER
    SCHEDULER -.-> VAULT
    HITL -.-> CONSUL
```
3.2.2. Fundamental Protocol Design Principles
3.2.2.1. Message-Driven Communication Model
Event Sourcing Architecture: All system interactions are represented as immutable events (messages) that form an event log, enabling complete system state reconstruction and audit trails.
Technical Implementation:
- Message Persistence: All messages are durably persisted before acknowledgment using write-ahead logging
- Event Ordering: Global message ordering using hybrid logical clocks (HLC) for causal consistency
- Message Deduplication: SHA-256 based content hashing prevents duplicate message processing
- Delivery Semantics: Configurable delivery guarantees (at-most-once, at-least-once, exactly-once)
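The event-ordering bullet above relies on hybrid logical clocks. The following is a minimal sketch of the standard HLC update rules, assuming a timestamp is a `(physical_ms, logical)` pair. The class name `HybridLogicalClock` and the injectable `now_ms` callable are illustrative and are not taken from the SW4RM SDK; the wire format of `hlc_timestamp` is not defined in this section.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (physical milliseconds, logical counter)


def _wall_clock_ms() -> int:
    return int(time.time() * 1000)


@dataclass
class HybridLogicalClock:
    """Illustrative HLC; an assumption, not the SW4RM implementation."""
    now_ms: Callable[[], int] = _wall_clock_ms
    physical: int = 0
    logical: int = 0

    def tick(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        wall = self.now_ms()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            # Wall clock has not advanced: order events with the counter.
            self.logical += 1
        return (self.physical, self.logical)

    def observe(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried by a received message."""
        wall = self.now_ms()
        r_phys, r_log = remote
        new_phys = max(self.physical, r_phys, wall)
        if new_phys == self.physical and new_phys == r_phys:
            new_log = max(self.logical, r_log) + 1
        elif new_phys == self.physical:
            new_log = self.logical + 1
        elif new_phys == r_phys:
            new_log = r_log + 1
        else:
            new_log = 0
        self.physical, self.logical = new_phys, new_log
        return (self.physical, self.logical)
```

Comparing the resulting tuples lexicographically yields an order that respects causality: a message's receive timestamp is always greater than its send timestamp, even when the receiver's wall clock lags behind.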
3.2.2.2. Distributed System Consistency Model
SW4RM defaults to eventual consistency and offers stronger levels as options:
- Eventual Consistency (default): Best performance, with eventual convergence guarantees
- Strong Consistency: Configurable strong consistency for critical operations using distributed consensus
- Causal Consistency: Maintains causal relationships between related messages using vector clocks
- Session Consistency: Guarantees consistency within agent session boundaries
Consistency Configuration:
```protobuf
message ConsistencyConfig {
  ConsistencyLevel default_level = 1;
  map<string, ConsistencyLevel> operation_overrides = 2;
  uint32 eventual_consistency_timeout_ms = 3; // Default: 5000ms
  uint32 strong_consistency_timeout_ms = 4;   // Default: 30000ms
}

enum ConsistencyLevel {
  EVENTUAL = 0; // Best performance, eventual convergence
  CAUSAL = 1;   // Maintains causal relationships
  SESSION = 2;  // Consistency within agent sessions
  STRONG = 3;   // Distributed consensus, highest latency
}
```
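As a rough illustration of how a client might interpret this configuration, the sketch below resolves the effective level and timeout for one operation. The Python classes mirror the message fields above; the `resolve` helper and the operation names are hypothetical. The specification does not say which timeout applies to `CAUSAL` and `SESSION`, so the sketch assumes they use the eventual-consistency timeout.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Tuple


class ConsistencyLevel(IntEnum):
    EVENTUAL = 0
    CAUSAL = 1
    SESSION = 2
    STRONG = 3


@dataclass
class ConsistencyConfig:
    default_level: ConsistencyLevel = ConsistencyLevel.EVENTUAL
    operation_overrides: Dict[str, ConsistencyLevel] = field(default_factory=dict)
    eventual_consistency_timeout_ms: int = 5000
    strong_consistency_timeout_ms: int = 30000


def resolve(config: ConsistencyConfig, operation: str) -> Tuple[ConsistencyLevel, int]:
    """Return (level, timeout_ms) for an operation; hypothetical helper."""
    level = config.operation_overrides.get(operation, config.default_level)
    # Assumption: only STRONG uses the longer consensus timeout.
    if level == ConsistencyLevel.STRONG:
        return level, config.strong_consistency_timeout_ms
    return level, config.eventual_consistency_timeout_ms
```

With an override such as `{"scheduler.preempt": ConsistencyLevel.STRONG}` (an invented operation name), that operation resolves to `STRONG` with a 30000 ms timeout while everything else falls back to the default level.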
3.2.2.3. Comprehensive Security Architecture
Zero-Trust Network Model: Every service interaction requires authentication and authorization, with no implicit trust relationships.
Security Implementation Layers:
- Transport Security:
  - Mutual TLS (mTLS) for all inter-service communication
  - TLS 1.3 with forward secrecy using ECDHE key exchange
  - Certificate rotation with 24-hour certificate lifetime
  - Certificate pinning for critical service connections
- Authentication & Authorization:
  - OAuth 2.0 / OpenID Connect integration for external authentication
  - JWT tokens with configurable expiration (default: 1 hour)
  - Role-Based Access Control (RBAC) with fine-grained permissions
  - Attribute-Based Access Control (ABAC) for complex authorization scenarios
- Data Protection:
  - AES-256-GCM encryption for sensitive payloads
  - Field-level encryption for PII and sensitive data
  - Cryptographic signatures for message integrity verification
  - Key management integration with HashiCorp Vault or AWS KMS
Security Configuration Example:
```protobuf
message SecurityConfig {
  TLSConfig tls_config = 1;
  AuthenticationConfig auth_config = 2;
  EncryptionConfig encryption_config = 3;
  AuditConfig audit_config = 4;
}

message TLSConfig {
  string ca_cert_path = 1;
  string client_cert_path = 2;
  string client_key_path = 3;
  repeated string cipher_suites = 4;
  uint32 handshake_timeout_seconds = 5; // Default: 10
  bool enable_cert_pinning = 6;
}

message AuthenticationConfig {
  string jwt_secret_key = 1;
  uint32 token_expiry_seconds = 2; // Default: 3600
  repeated string allowed_issuers = 3;
  bool enable_service_accounts = 4;
  string service_account_key_path = 5;
}
```
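To show how the `TLSConfig` paths map onto a TLS 1.3 client with mutual authentication, here is a minimal sketch using Python's standard `ssl` module. The function name `build_client_context` is hypothetical. A gRPC client would normally hand the same certificate files to its own credentials API rather than to an `SSLContext`, so treat this only as an illustration of the settings involved.

```python
import ssl
from typing import Optional


def build_client_context(
    ca_cert_path: Optional[str] = None,
    client_cert_path: Optional[str] = None,
    client_key_path: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client-side context that refuses anything below TLS 1.3."""
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    if ca_cert_path:
        # Trust only the configured certificate authority.
        ctx.load_verify_locations(cafile=ca_cert_path)
    if client_cert_path and client_key_path:
        # Present a client certificate so the server can authenticate us (mTLS).
        ctx.load_cert_chain(certfile=client_cert_path, keyfile=client_key_path)
    return ctx
```

Certificate pinning and the 24-hour rotation described above are not covered by this sketch.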
3.2.2.4. Enterprise-Grade Observability Framework
Three Pillars of Observability Implementation:
- Comprehensive Metrics Collection:
  - Business metrics: Message processing rates, success/failure ratios, processing latencies
  - System metrics: CPU, memory, network, disk utilization per service
  - Custom metrics: Domain-specific KPIs and performance indicators
  - Real-time alerting with configurable thresholds and escalation policies
- Distributed Tracing:
  - OpenTelemetry-compliant distributed tracing across all service boundaries
  - Trace sampling strategies: Always, never, probabilistic, adaptive
  - Trace correlation across message processing pipelines
  - Performance bottleneck identification and optimization recommendations
- Structured Audit Logging:
  - Immutable audit logs with cryptographic integrity verification
  - Comprehensive security event logging (authentication, authorization, data access)
  - Business process audit trails for compliance requirements
  - Log retention policies with automated archival to cold storage
Observability Configuration:
```protobuf
message ObservabilityConfig {
  MetricsConfig metrics = 1;
  TracingConfig tracing = 2;
  LoggingConfig logging = 3;
}

message TracingConfig {
  bool enabled = 1;
  string jaeger_endpoint = 2;
  SamplingStrategy sampling = 3;
  map<string, string> tags = 4;
}

enum SamplingStrategy {
  ALWAYS = 0;
  NEVER = 1;
  PROBABILISTIC = 2; // Requires sampling_rate
  ADAPTIVE = 3;      // AI-based sampling
}
```
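The sampling strategies can be illustrated with a small decision function. This is a sketch under stated assumptions: `should_sample` and its `sampling_rate` parameter are hypothetical (the `TracingConfig` message above does not declare a rate field), and the probabilistic branch hashes the trace ID so that every service reaches the same decision for a given trace. `ADAPTIVE` is left unimplemented because this section does not specify its behavior.

```python
import hashlib
from enum import IntEnum


class SamplingStrategy(IntEnum):
    ALWAYS = 0
    NEVER = 1
    PROBABILISTIC = 2
    ADAPTIVE = 3


def should_sample(strategy: SamplingStrategy, trace_id: str,
                  sampling_rate: float = 0.1) -> bool:
    """Decide whether to record a trace; hypothetical helper."""
    if strategy == SamplingStrategy.ALWAYS:
        return True
    if strategy == SamplingStrategy.NEVER:
        return False
    if strategy == SamplingStrategy.PROBABILISTIC:
        if not 0.0 <= sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be within [0, 1]")
        # Map the trace ID to a stable value in [0, 1).
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return bucket < sampling_rate
    raise NotImplementedError("ADAPTIVE sampling is not specified in this section")
```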
3.3. Core Concepts
3.3.1. Message Envelope
Every message is wrapped in a standard envelope providing:
```protobuf
message Envelope {
  string message_id = 1;                    // UUIDv4 per attempt
  string idempotency_token = 2;             // Stable across retries
  string producer_id = 3;                   // Source agent identifier
  string correlation_id = 4;                // Request/response correlation
  uint64 sequence_number = 5;               // Ordering within conversation
  uint32 retry_count = 6;                   // Retry attempt number
  MessageType message_type = 7;             // Message classification
  string content_type = 8;                  // Payload format (MIME type)
  uint64 content_length = 9;                // Payload size in bytes
  string repo_id = 10;                      // Repository context (optional)
  string worktree_id = 11;                  // Worktree context (optional)
  string hlc_timestamp = 12;                // Hybrid logical clock
  uint64 ttl_ms = 13;                       // Time-to-live in milliseconds
  google.protobuf.Timestamp timestamp = 14; // Delivery timestamp
  bytes payload = 15;                       // Message content
}
```
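The envelope comments distinguish `message_id` (new for each attempt) from `idempotency_token` (stable across retries). The sketch below shows how those two fields can interact on retry and on receipt. `EnvelopeHeader`, `next_attempt`, and `Deduplicator` are hypothetical names, and keying the duplicate check on `(producer_id, idempotency_token)` is an assumption; this section does not define the receiver's cache.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Set, Tuple


@dataclass(frozen=True)
class EnvelopeHeader:
    """Subset of Envelope fields relevant to retries (illustrative)."""
    message_id: str
    idempotency_token: str
    producer_id: str
    retry_count: int = 0


def new_envelope(producer_id: str) -> EnvelopeHeader:
    return EnvelopeHeader(
        message_id=str(uuid.uuid4()),
        idempotency_token=str(uuid.uuid4()),
        producer_id=producer_id,
    )


def next_attempt(previous: EnvelopeHeader) -> EnvelopeHeader:
    """New message_id per attempt; the idempotency token stays the same."""
    return replace(previous, message_id=str(uuid.uuid4()),
                   retry_count=previous.retry_count + 1)


class Deduplicator:
    """In-memory duplicate filter; a real cache would bound and expire entries."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def first_delivery(self, header: EnvelopeHeader) -> bool:
        key = (header.producer_id, header.idempotency_token)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Because a retry keeps its token, a receiver that already processed the first attempt can acknowledge the retry without repeating the work, which is what makes at-least-once delivery safe for non-idempotent handlers.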
3.3.2. Message Types
| Type | Value | Description | Use Case |
|---|---|---|---|
| `DATA` | 2 | Application payload | Business logic, responses, content |
| `CONTROL` | 1 | System commands | Status requests, configuration |
| `ACKNOWLEDGEMENT` | 5 | Message confirmations | Delivery receipts, error reports |
| `HITL_INVOCATION` | 6 | Human-in-the-loop requests | Approval workflows, escalations |
| `WORKTREE_CONTROL` | 7 | Repository operations | Bind, unbind, switch contexts |
| `NEGOTIATION` | 8 | Multi-party coordination | Consensus, resource allocation |
| `TOOL_CALL` | 9 | External tool execution | API calls, system commands |
| `TOOL_RESULT` | 10 | Tool execution results | Success responses, data returns |
| `TOOL_ERROR` | 11 | Tool execution failures | Error conditions, exceptions |
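For client code, the table can be mirrored as an enumeration. The sketch below copies only the values listed above; values not shown in the table (0, 3, and 4) are treated as unknown rather than guessed. The `classify` helper is hypothetical.

```python
from enum import IntEnum


class MessageType(IntEnum):
    """Values copied from the message type table."""
    CONTROL = 1
    DATA = 2
    ACKNOWLEDGEMENT = 5
    HITL_INVOCATION = 6
    WORKTREE_CONTROL = 7
    NEGOTIATION = 8
    TOOL_CALL = 9
    TOOL_RESULT = 10
    TOOL_ERROR = 11


def classify(value: int) -> str:
    """Return the type name for a wire value, or "UNKNOWN" if it is not listed."""
    try:
        return MessageType(value).name
    except ValueError:
        return "UNKNOWN"
```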
3.3.3. Acknowledgment Lifecycle
Every message follows a predictable ACK progression:
```mermaid
sequenceDiagram
    participant S as Sender
    participant R as Router
    participant T as Target
    S->>R: SendMessage(envelope)
    R-->>S: SendMessageResponse{accepted: true}
    R->>T: Deliver envelope
    T-->>R: ACK{stage: RECEIVED}
    T->>T: Parse and validate
    T-->>R: ACK{stage: READ}
    T->>T: Process message
    alt Success
        T-->>R: ACK{stage: FULFILLED}
    else Error
        T-->>R: ACK{stage: FAILED, error_code: X}
    end
    R->>S: Forward ACKs
```
ACK Stages:
- `RECEIVED` (1): Message delivered to target
- `READ` (2): Message parsed and validated
- `FULFILLED` (3): Processing completed successfully
- `REJECTED` (4): Message rejected due to policy/validation
- `FAILED` (5): Processing failed due to error
- `TIMED_OUT` (6): Processing exceeded time limits
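A sender that tracks these stages can treat them as a small state machine. The sketch below is one possible reading. The happy path and the `FAILED` branch follow the sequence diagram, while the remaining transitions (for example, `REJECTED` before `RECEIVED`) are assumptions, since the section does not list every allowed transition. `AckTracker` is a hypothetical name.

```python
from enum import IntEnum
from typing import Optional


class AckStage(IntEnum):
    RECEIVED = 1
    READ = 2
    FULFILLED = 3
    REJECTED = 4
    FAILED = 5
    TIMED_OUT = 6


TERMINAL = frozenset({AckStage.FULFILLED, AckStage.REJECTED,
                      AckStage.FAILED, AckStage.TIMED_OUT})

# Assumed transition table; None means no ACK has been seen yet.
ALLOWED = {
    None: {AckStage.RECEIVED, AckStage.REJECTED, AckStage.TIMED_OUT},
    AckStage.RECEIVED: {AckStage.READ, AckStage.REJECTED,
                        AckStage.FAILED, AckStage.TIMED_OUT},
    AckStage.READ: {AckStage.FULFILLED, AckStage.REJECTED,
                    AckStage.FAILED, AckStage.TIMED_OUT},
}


class AckTracker:
    """Tracks the ACK progression of one message."""

    def __init__(self) -> None:
        self.stage: Optional[AckStage] = None

    def advance(self, stage: AckStage) -> None:
        # Terminal stages have no entry in ALLOWED, so nothing may follow them.
        if stage not in ALLOWED.get(self.stage, frozenset()):
            raise ValueError(f"invalid ACK transition: {self.stage} -> {stage}")
        self.stage = stage

    @property
    def done(self) -> bool:
        return self.stage in TERMINAL
```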
3.4. Service Architecture
3.4.1. Core Services
Registry Service - Agent lifecycle management
- Agent registration and discovery
- Health monitoring and heartbeats
- Capability advertisement
Router Service - Message delivery
- Reliable message routing between agents
- Message streaming and buffering
- Load balancing and failover
Scheduler Service - Work coordination
- Task distribution and prioritization
- Resource allocation and preemption
- Activity buffer management
3.4.2. Extended Services
HITL Service - Human oversight
- Escalation workflows and approvals
- Decision points and manual overrides
- Audit trails and compliance
Worktree Service - Repository context
- Git repository binding and switching
- Branch and commit management
- Workspace isolation
Tool Service - External integrations
- API and system command execution
- Result capture and error handling
- Permission and security policies
3.5. Message Patterns
3.5.1. Request-Response
```text
// Request
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}

// Response
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}
```
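On the requesting side, matching a response to its request comes down to a lookup on `correlation_id`. The following is a minimal sketch that uses plain dictionaries in place of real envelopes; `PendingRequests` is a hypothetical helper, not an SDK class.

```python
import uuid
from typing import Dict, Optional


class PendingRequests:
    """Tracks outstanding requests by correlation_id (illustrative)."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def register(self, request: dict) -> str:
        # Reuse the caller's correlation_id, or generate one if absent.
        correlation_id = request.setdefault("correlation_id", f"req-{uuid.uuid4()}")
        self._pending[correlation_id] = request
        return correlation_id

    def match(self, response: dict) -> Optional[dict]:
        """Return and forget the request this response answers, if any."""
        return self._pending.pop(response.get("correlation_id"), None)
```

A response whose `correlation_id` is unknown (or already matched) returns `None`, so late or duplicate responses can be dropped or logged without affecting other requests.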
3.5.2. Fire-and-Forget
3.5.3. Command Pattern
3.6. Error Handling
3.6.1. Error Codes
| Code | Name | Description |
|---|---|---|
| 0 | `UNSPECIFIED` | No error or unknown error |
| 1 | `BUFFER_FULL` | Message queue capacity exceeded |
| 2 | `NO_ROUTE` | No path to destination agent |
| 3 | `ACK_TIMEOUT` | Acknowledgment not received in time |
| 6 | `VALIDATION_ERROR` | Message format or content invalid |
| 7 | `PERMISSION_DENIED` | Insufficient privileges for operation |
| 9 | `OVERSIZE_PAYLOAD` | Message exceeds size limits |
| 99 | `INTERNAL_ERROR` | Unexpected system failure |
3.6.2. Error Response Pattern
```text
ack: {
  ack_for_message_id: "original-msg-id",
  ack_stage: FAILED,
  error_code: VALIDATION_ERROR,
  note: "Required field 'agent_id' missing"
}
```
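A sender receiving a `FAILED` acknowledgment has to decide whether a retry makes sense. The sketch below shows one way to do that with exponential backoff. The split into retryable and non-retryable codes is an assumption (the specification does not classify them), as are the `retry_delay_ms` helper and its default values.

```python
from enum import IntEnum
from typing import Optional


class ErrorCode(IntEnum):
    """Values copied from the error code table."""
    UNSPECIFIED = 0
    BUFFER_FULL = 1
    NO_ROUTE = 2
    ACK_TIMEOUT = 3
    VALIDATION_ERROR = 6
    PERMISSION_DENIED = 7
    OVERSIZE_PAYLOAD = 9
    INTERNAL_ERROR = 99


# Assumption: transient conditions are worth retrying; malformed,
# unauthorized, or oversized messages would fail again unchanged.
RETRYABLE = frozenset({ErrorCode.BUFFER_FULL, ErrorCode.NO_ROUTE,
                       ErrorCode.ACK_TIMEOUT, ErrorCode.INTERNAL_ERROR})


def retry_delay_ms(error_code: ErrorCode, retry_count: int,
                   base_ms: int = 100, cap_ms: int = 10_000) -> Optional[int]:
    """Return the backoff delay before the next attempt, or None to give up."""
    if error_code not in RETRYABLE:
        return None
    # Exponential backoff: base * 2^retry_count, capped at cap_ms.
    return min(cap_ms, base_ms * 2 ** retry_count)
```

For the example acknowledgment above, `VALIDATION_ERROR` yields `None`: resending the same envelope would fail for the same reason, so the sender should fix the message instead.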
3.7. Security Model
3.7.1. Authentication
- Service-to-service authentication via mutual TLS
- Agent identity verification through public key cryptography
- Token-based session management
3.7.2. Authorization
- Role-based access control (RBAC) for service operations
- Message-level permissions based on sender/receiver identity
- Policy-based filtering and transformation
3.7.3. Data Protection
- End-to-end encryption for sensitive payloads
- Audit logging of all security-relevant operations
- Compliance with data residency and retention policies
3.8. Deployment Considerations
3.8.1. Scalability
- Horizontal scaling of all services
- Message partitioning and sharding
- Load balancing with session affinity
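Message partitioning is not specified further in this section. One common approach, shown here purely as an assumption, is to hash a stable key (for example the `correlation_id` or the target agent ID) so that related messages always land on the same partition. `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition index in [0, partitions)."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    # A cryptographic hash gives a stable, well-distributed value
    # that does not depend on the process or interpreter.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Plain modulo hashing remaps most keys when the partition count changes; a deployment that resizes often would likely prefer consistent hashing.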
3.8.2. Reliability
- At-least-once message delivery guarantees
- Circuit breakers and retry policies
- Graceful degradation during partial failures
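The circuit-breaker bullet can be illustrated with a minimal sketch. It assumes a simple policy: open after a number of consecutive failures, then allow a trial call once a cool-down has elapsed. The class, its parameters, and the default values are hypothetical; SW4RM's actual policy is not described in this section.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Consecutive-failure circuit breaker (illustrative)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted now."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down elapses, then permit a trial call.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()
```

Callers check `allow()` before sending and report the outcome afterwards. A failed trial call re-opens the circuit for another cool-down period, which keeps a struggling service from being flooded with retries.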
3.8.3. Observability
- Distributed tracing across service boundaries
- Metrics for throughput, latency, and error rates
- Structured logging with correlation IDs
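Structured logging with correlation IDs can be sketched with Python's standard `logging` module. The JSON field names below are assumptions, not a format mandated by SW4RM. The correlation ID is attached to each record through the standard `extra` mechanism.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON objects (illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Present when the caller passed extra={"correlation_id": ...}.
            "correlation_id": getattr(record, "correlation_id", None),
        })
```

A handler configured with this formatter turns `logger.info("delivered", extra={"correlation_id": "req-123"})` into a JSON line that can be joined with traces and other services' logs on the same ID.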
3.9. Comparison with Google's Agent-to-Agent Protocol
Overview of Google's A2A Protocol
Google's Agent2Agent (A2A) is an open standard designed for enterprise-grade interoperability among AI agents. Its core aspects include:
- Agent Discovery via "Agent Cards": agents advertise capabilities in a JSON Agent Card format to help other agents find the best fit for a task.
- Task-Oriented Communication: client agents send tasks to remote agents, which respond with artifacts and real-time status updates; long-running tasks and streaming support are first-class features.
- Secure, Standard Protocol: built on HTTP, JSON-RPC, and Server-Sent Events; enterprise-ready authentication and authorization (aligned with OpenAPI schemes) are built in.
- Modality Agnostic: supports text, audio, video, and multi-part content negotiation through "parts" attached to each message.
- Interoperability with MCP: A2A complements Anthropic's Model Context Protocol (MCP), which focuses on tool invocation, creating a full-stack agent interoperability ecosystem.
Architectural Comparison
| Aspect | A2A Protocol | SW4RM Framework |
|---|---|---|
| Discovery Mechanism | Agent Card-based (via well-known URLs) | Internal registry with broadcast discovery |
| Underlying Transport | HTTP / JSON-RPC / SSE | gRPC-native |
| Modality Handling | "Parts" with content-type negotiation | content_type + modality capabilities in registry |
| Task Focus | External task outsourcing and artifact return | Intra-system task scheduling and messaging |
| Preemption & Scheduling | Not addressed | Preemption, priorities, safe-points deeply defined |
| Preemptible Sections | ─ | Fully specified (§7) |
| Idempotency / Retry Logic | ─ | Robust idempotency_token model with cache (§11.1) |
| Negotiation & HITL Escalation | ─ | Built-in negotiation framework (§17) and escalation via HITL (§15) |
| Worktree & Confinement | ─ | Explicit Git worktree isolation (§16) |
| Observability & Logging | Enterprise-secure flow, but spec is lightweight | Strict structured logs and audit trails (§19) |
Overlap with SW4RM Specification
Similarities:
- Agent Discovery & Registration: A2A's Agent Cards align closely with the SW4RM Registry & Discovery module. SW4RM does not define an explicit Agent Card structure, but the registry stores name, capabilities, modality, and description (§14).
- Secure Communication & Modality Support: SW4RM's use of gRPC with optional signing and multi-modal content types corresponds to A2A's modality-agnostic design and enterprise security foundation.
- Long-Running Tasks & States: A2A's support for long-running workflows parallels SW4RM's task lifecycle, message states, and streaming tool calls.

Complementary Design Philosophy:
Google's A2A focuses on secure, interoperable agent messaging across enterprise boundaries, emphasizing discovery, modality negotiation, and long-running tasks. SW4RM defines the deeper machinery: scheduling, cancellation, preemption, idempotency, negotiation, worktree confinement, logs, and tool integration. The two can co-exist: use A2A for agent-to-agent orchestration across boundaries and SW4RM as a resilient internal engine.
3.10. Next Steps
- Message Types - Detailed message specifications
- Services - Complete service API reference
- ACK Lifecycle - Acknowledgment handling patterns