3. SW4RM Protocol Specification

(Link to full RFC)

SW4RM Protocol v0.1 | Status: Production Ready | Last Updated: 2025-08-09

This specification defines the complete SW4RM message-driven agent communication system. The protocol is built on gRPC and Protocol Buffers, providing a robust foundation for distributed agentic systems with guaranteed message delivery, comprehensive observability, and enterprise-grade security.

Terminology: “Agent” in this specification follows the supervised, process‑isolated definition in documentation/index.md (see “Agents and Agentic Interaction”), which differs from common industry usage where “agent” may mean an LLM wrapper or in‑process automation.

3.1. Executive Summary

The SW4RM protocol addresses the fundamental challenges of distributed agentic systems by providing a complete communication framework with the following core capabilities:

  • Guaranteed Message Delivery: At-least-once delivery semantics with configurable consistency levels
  • Comprehensive State Management: Persistent state across failures with automatic recovery mechanisms
  • Enterprise Security: Zero-trust architecture with mutual TLS and role-based access control
  • Production Observability: Complete distributed tracing, metrics, and audit logging
  • Horizontal Scalability: Linear scaling with no single points of failure
  • Multi-Tenancy Support: Secure isolation between different agent workloads

3.2. Architectural Foundation and Design Principles

3.2.1. Service-Oriented Architecture (SOA) Implementation

SW4RM implements a microservices architecture with clear service boundaries, standardized communication protocols, and comprehensive fault tolerance mechanisms. The architecture is designed for:

  • Independent Service Scaling: Each service can be scaled independently based on workload requirements
  • Fault Isolation: Service failures are contained and do not cascade to other components
  • Technology Diversity: Services can be implemented in different technologies while maintaining protocol compatibility
  • Operational Independence: Services can be deployed, monitored, and managed independently

graph TB
    subgraph "Client Layer [gRPC/TLS]"
        AGENT[Agent Applications<br/>Business Logic Layer]
        SDK[SW4RM SDK<br/>Runtime Library]
    end

    subgraph "Core Infrastructure Services"
        REGISTRY[Registry Service<br/>:50051<br/>Agent Discovery & Health]
        ROUTER[Router Service<br/>:50052<br/>Message Delivery & Routing]
        SCHEDULER[Scheduler Service<br/>:50053<br/>Task Distribution & Load Balancing]
    end

    subgraph "Extended Capability Services"
        HITL[Human-in-the-Loop Service<br/>:50061<br/>Approval Workflows & Escalation]
        WORKTREE[Worktree Service<br/>:50062<br/>Git Integration & Repository Management]
        TOOLS[Tool Service<br/>:50063<br/>External System Integration]
        NEGOTIATE[Negotiation Service<br/>:50064<br/>Multi-Agent Consensus & Coordination]
        REASON[Reasoning Service<br/>:50065<br/>Decision Support & Analytics]
        AUDIT[Audit Service<br/>:50066<br/>Compliance & Security Logging]
        CONNECT[Connector Service<br/>:50067<br/>External API Integration]
    end

    subgraph "Data & Storage Layer"
        POSTGRES[(PostgreSQL Cluster<br/>Transactional State)]
        REDIS[(Redis Cluster<br/>Session & Cache)]
        S3[(Object Storage<br/>Large Payloads & Archives)]
        GIT[(Git Repositories<br/>Source Code & Configuration)]
    end

    subgraph "Observability & Security"
        PROMETHEUS[Prometheus<br/>Metrics Collection]
        JAEGER[Jaeger<br/>Distributed Tracing]
        VAULT[HashiCorp Vault<br/>Secrets Management]
        CONSUL[Consul<br/>Service Discovery]
    end

    AGENT -->|gRPC/TLS| SDK
    SDK -->|Load Balanced| REGISTRY
    SDK -->|Message Flow| ROUTER
    SDK -->|Task Requests| SCHEDULER

    SDK -->|Approval Requests| HITL
    SDK -->|Repository Operations| WORKTREE
    SDK -->|External Calls| TOOLS
    SDK -->|Coordination| NEGOTIATE
    SDK -->|Analytics| REASON
    SDK -->|Audit Events| AUDIT
    SDK -->|API Integrations| CONNECT

    REGISTRY --> POSTGRES
    ROUTER --> POSTGRES
    ROUTER --> REDIS
    SCHEDULER --> POSTGRES
    SCHEDULER --> REDIS

    WORKTREE --> GIT
    TOOLS --> S3
    AUDIT --> S3

    REGISTRY -.-> PROMETHEUS
    ROUTER -.-> JAEGER
    SCHEDULER -.-> VAULT
    HITL -.-> CONSUL

3.2.2. Fundamental Protocol Design Principles

3.2.2.1. Message-Driven Communication Model

Event Sourcing Architecture: All system interactions are represented as immutable events (messages) that form an event log, enabling complete system state reconstruction and audit trails.

Technical Implementation:

  • Message Persistence: All messages are durably persisted before acknowledgment using write-ahead logging
  • Event Ordering: Global message ordering using hybrid logical clocks (HLC) for causal consistency
  • Message Deduplication: SHA-256-based content hashing prevents duplicate message processing (see the sketch after this list)
  • Delivery Semantics: Configurable delivery guarantees (at-most-once, at-least-once, exactly-once)
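
A minimal Python sketch of the deduplication rule above, combining the retry-stable idempotency token with a SHA-256 content hash; the store and helper names are hypothetical and not part of the protocol surface.

import hashlib

class DedupStore:
    """In-memory stand-in for the durable deduplication index a router or target would keep."""
    def __init__(self):
        self._seen = set()

    def seen_before(self, key: str) -> bool:
        if key in self._seen:
            return True
        self._seen.add(key)
        return False

def dedup_key(idempotency_token: str, payload: bytes) -> str:
    # SHA-256 over the payload, scoped by the retry-stable idempotency token,
    # so every retry of the same logical message maps to the same key.
    return f"{idempotency_token}:{hashlib.sha256(payload).hexdigest()}"

def should_process(store: DedupStore, idempotency_token: str, payload: bytes) -> bool:
    return not store.seen_before(dedup_key(idempotency_token, payload))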

3.2.2.2. Distributed System Consistency Model

Implementation of Eventual Consistency with Strong Consistency Options:

  • Eventual Consistency (Default): Best-performing mode with eventual convergence guarantees
  • Strong Consistency: Configurable strong consistency for critical operations using distributed consensus
  • Causal Consistency: Maintains causal relationships between related messages using vector clocks
  • Session Consistency: Guarantees consistency within agent session boundaries

Consistency Configuration:

message ConsistencyConfig {
  ConsistencyLevel default_level = 1;
  map<string, ConsistencyLevel> operation_overrides = 2;
  uint32 eventual_consistency_timeout_ms = 3;  // Default: 5000ms
  uint32 strong_consistency_timeout_ms = 4;    // Default: 30000ms
}

enum ConsistencyLevel {
  EVENTUAL = 0;      // Best performance, eventual convergence
  CAUSAL = 1;        // Maintains causal relationships
  SESSION = 2;       // Consistency within agent sessions
  STRONG = 3;        // Distributed consensus, highest latency
}
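
As a usage illustration, a client might keep the eventual default and escalate only specific operations to STRONG. A minimal Python sketch assuming hypothetical generated bindings (sw4rm_pb2); the operation name is illustrative.

import sw4rm_pb2 as pb  # hypothetical generated bindings for the messages above

config = pb.ConsistencyConfig(
    default_level=pb.ConsistencyLevel.EVENTUAL,
    eventual_consistency_timeout_ms=5000,
    strong_consistency_timeout_ms=30000,
)
# Critical operations opt into distributed consensus on a per-operation basis.
config.operation_overrides["worktree_commit"] = pb.ConsistencyLevel.STRONG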

3.2.2.3. Comprehensive Security Architecture

Zero-Trust Network Model: Every service interaction requires authentication and authorization, with no implicit trust relationships.

Security Implementation Layers:

  1. Transport Security:
     • Mutual TLS (mTLS) for all inter-service communication
     • TLS 1.3 with forward secrecy using ECDHE key exchange
     • Certificate rotation with 24-hour certificate lifetime
     • Certificate pinning for critical service connections

  2. Authentication & Authorization:
     • OAuth 2.0 / OpenID Connect integration for external authentication
     • JWT tokens with configurable expiration (default: 1 hour)
     • Role-Based Access Control (RBAC) with fine-grained permissions
     • Attribute-Based Access Control (ABAC) for complex authorization scenarios

  3. Data Protection:
     • AES-256-GCM encryption for sensitive payloads
     • Field-level encryption for PII and sensitive data
     • Cryptographic signatures for message integrity verification
     • Key management integration with HashiCorp Vault or AWS KMS

Security Configuration Example:

message SecurityConfig {
  TLSConfig tls_config = 1;
  AuthenticationConfig auth_config = 2;
  EncryptionConfig encryption_config = 3;
  AuditConfig audit_config = 4;
}

message TLSConfig {
  string ca_cert_path = 1;
  string client_cert_path = 2;
  string client_key_path = 3;
  repeated string cipher_suites = 4;
  uint32 handshake_timeout_seconds = 5;  // Default: 10
  bool enable_cert_pinning = 6;
}

message AuthenticationConfig {
  string jwt_secret_key = 1;
  uint32 token_expiry_seconds = 2;       // Default: 3600
  repeated string allowed_issuers = 3;
  bool enable_service_accounts = 4;
  string service_account_key_path = 5;
}
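
On the client side, the TLSConfig fields above map directly onto standard gRPC channel credentials. A minimal Python sketch; the target address is a placeholder and the certificate paths are taken from the configuration.

import grpc

def open_mtls_channel(ca_cert_path: str, client_cert_path: str, client_key_path: str,
                      target: str = "router.sw4rm.local:50052") -> grpc.Channel:
    with open(ca_cert_path, "rb") as f:
        ca = f.read()
    with open(client_cert_path, "rb") as f:
        cert = f.read()
    with open(client_key_path, "rb") as f:
        key = f.read()
    # Mutual TLS: the client presents its own certificate and validates the server
    # against the shared CA, so neither side is implicitly trusted.
    credentials = grpc.ssl_channel_credentials(
        root_certificates=ca,
        private_key=key,
        certificate_chain=cert,
    )
    return grpc.secure_channel(target, credentials)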

3.2.2.4. Enterprise-Grade Observability Framework

Three Pillars of Observability Implementation:

  1. Comprehensive Metrics Collection:
     • Business metrics: Message processing rates, success/failure ratios, processing latencies
     • System metrics: CPU, memory, network, disk utilization per service
     • Custom metrics: Domain-specific KPIs and performance indicators
     • Real-time alerting with configurable thresholds and escalation policies

  2. Distributed Tracing:
     • OpenTelemetry-compliant distributed tracing across all service boundaries
     • Trace sampling strategies: Always, never, probabilistic, adaptive
     • Trace correlation across message processing pipelines
     • Performance bottleneck identification and optimization recommendations

  3. Structured Audit Logging:
     • Immutable audit logs with cryptographic integrity verification
     • Comprehensive security event logging (authentication, authorization, data access)
     • Business process audit trails for compliance requirements
     • Log retention policies with automated archival to cold storage

Observability Configuration:

message ObservabilityConfig {
  MetricsConfig metrics = 1;
  TracingConfig tracing = 2;
  LoggingConfig logging = 3;
}

message TracingConfig {
  bool enabled = 1;
  string jaeger_endpoint = 2;
  SamplingStrategy sampling = 3;
  map<string, string> tags = 4;
}

enum SamplingStrategy {
  ALWAYS = 0;
  NEVER = 1;
  PROBABILISTIC = 2;  // Requires sampling_rate
  ADAPTIVE = 3;       // AI-based sampling
}
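
The PROBABILISTIC strategy maps naturally onto OpenTelemetry's ratio-based sampler. A minimal Python sketch using the OpenTelemetry SDK; the service name, sampling rate, and console exporter are illustrative stand-ins for a real Jaeger or OTLP exporter.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Probabilistic sampling: keep roughly 10% of traces.
provider = TracerProvider(
    resource=Resource.create({"service.name": "sw4rm-router"}),
    sampler=TraceIdRatioBased(0.1),
)
# Swap ConsoleSpanExporter for a Jaeger/OTLP exporter in a real deployment.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("sw4rm.protocol")
with tracer.start_as_current_span("route_message") as span:
    span.set_attribute("sw4rm.message_type", "DATA")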

3.3. Core Concepts

3.3.1. Message Envelope

Every message is wrapped in a standard envelope that provides the following fields:

message Envelope {
  string message_id = 1;                // UUIDv4 per attempt
  string idempotency_token = 2;         // Stable across retries
  string producer_id = 3;               // Source agent identifier
  string correlation_id = 4;            // Request/response correlation
  uint64 sequence_number = 5;           // Ordering within conversation
  uint32 retry_count = 6;               // Retry attempt number
  MessageType message_type = 7;         // Message classification
  string content_type = 8;              // Payload format (MIME type)
  uint64 content_length = 9;            // Payload size in bytes
  string repo_id = 10;                  // Repository context (optional)
  string worktree_id = 11;              // Worktree context (optional)  
  string hlc_timestamp = 12;            // Hybrid logical clock
  uint64 ttl_ms = 13;                   // Time-to-live in milliseconds
  google.protobuf.Timestamp timestamp = 14; // Delivery timestamp
  bytes payload = 15;                   // Message content
}
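
The key distinction is that message_id is fresh for every attempt while idempotency_token stays stable across retries. A minimal Python sketch assuming hypothetical generated bindings (sw4rm_pb2); the HLC timestamp is treated as an opaque string supplied by the runtime.

import uuid
from google.protobuf.timestamp_pb2 import Timestamp
import sw4rm_pb2 as pb  # hypothetical generated bindings

def build_envelope(payload: bytes, producer_id: str, hlc_now: str,
                   idempotency_token: str, retry_count: int = 0) -> pb.Envelope:
    ts = Timestamp()
    ts.GetCurrentTime()
    return pb.Envelope(
        message_id=str(uuid.uuid4()),          # new UUIDv4 for every delivery attempt
        idempotency_token=idempotency_token,   # unchanged when the send is retried
        producer_id=producer_id,
        retry_count=retry_count,
        message_type=pb.MessageType.DATA,
        content_type="application/json",
        content_length=len(payload),
        hlc_timestamp=hlc_now,
        ttl_ms=30000,
        timestamp=ts,
        payload=payload,
    )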

3.3.2. Message Types

Type              Value  Description                 Use Case
DATA              2      Application payload         Business logic, responses, content
CONTROL           1      System commands             Status requests, configuration
ACKNOWLEDGEMENT   5      Message confirmations       Delivery receipts, error reports
HITL_INVOCATION   6      Human-in-the-loop requests  Approval workflows, escalations
WORKTREE_CONTROL  7      Repository operations       Bind, unbind, switch contexts
NEGOTIATION       8      Multi-party coordination    Consensus, resource allocation
TOOL_CALL         9      External tool execution     API calls, system commands
TOOL_RESULT       10     Tool execution results      Success responses, data returns
TOOL_ERROR        11     Tool execution failures     Error conditions, exceptions

3.3.3. Acknowledgment Lifecycle

Every message follows a predictable ACK progression:

sequenceDiagram
    participant S as Sender
    participant R as Router  
    participant T as Target

    S->>R: SendMessage(envelope)
    R-->>S: SendMessageResponse{accepted: true}

    R->>T: Deliver envelope
    T-->>R: ACK{stage: RECEIVED}

    T->>T: Parse and validate
    T-->>R: ACK{stage: READ}

    T->>T: Process message
    alt Success
        T-->>R: ACK{stage: FULFILLED}
    else Error  
        T-->>R: ACK{stage: FAILED, error_code: X}
    end

    R->>S: Forward ACKs

ACK Stages:

  • RECEIVED (1): Message delivered to target
  • READ (2): Message parsed and validated
  • FULFILLED (3): Processing completed successfully
  • REJECTED (4): Message rejected due to policy/validation
  • FAILED (5): Processing failed due to error
  • TIMED_OUT (6): Processing exceeded time limits
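
A receiver-side Python sketch of this progression; the send_ack helper and the parse/process hooks are hypothetical stand-ins for whatever the SDK exposes, and the numeric constants mirror the stages listed above.

RECEIVED, READ, FULFILLED, REJECTED, FAILED = 1, 2, 3, 4, 5  # ACK stages
VALIDATION_ERROR = 6                                         # error code (see §3.6.1)

def handle_delivery(envelope, send_ack, parse, process):
    """send_ack(stage, error_code=None, note=None) is a hypothetical SDK helper."""
    send_ack(RECEIVED)                        # message arrived at the target
    try:
        request = parse(envelope)             # schema and content validation
    except ValueError as exc:
        send_ack(REJECTED, error_code=VALIDATION_ERROR, note=str(exc))
        return
    send_ack(READ)                            # parsed and validated
    try:
        process(request)
        send_ack(FULFILLED)                   # business logic completed
    except Exception as exc:
        send_ack(FAILED, note=str(exc))       # processing error surfaced to the sender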

3.4. Service Architecture

3.4.1. Core Services

Registry Service - Agent lifecycle management

  • Agent registration and discovery
  • Health monitoring and heartbeats
  • Capability advertisement
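
A client-side sketch of registering with the Registry Service and keeping a heartbeat alive. The stub and request names are hypothetical, since the exact RPC surface is defined by the service protos; the port matches the architecture diagram, and production deployments would use the mTLS channel shown in §3.2.2.3.

import time
import grpc
import sw4rm_pb2 as pb            # hypothetical generated bindings
import sw4rm_pb2_grpc as pb_grpc  # hypothetical generated stubs

def register_and_heartbeat(agent_id: str, capabilities: list[str]) -> None:
    channel = grpc.insecure_channel("localhost:50051")  # Registry port; insecure only for illustration
    registry = pb_grpc.RegistryStub(channel)
    registry.Register(pb.RegisterRequest(agent_id=agent_id, capabilities=capabilities))
    while True:
        registry.Heartbeat(pb.HeartbeatRequest(agent_id=agent_id))
        time.sleep(10)  # heartbeat interval is deployment-specific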

Router Service - Message delivery

  • Reliable message routing between agents
  • Message streaming and buffering
  • Load balancing and failover

Scheduler Service - Work coordination

  • Task distribution and prioritization
  • Resource allocation and preemption
  • Activity buffer management

3.4.2. Extended Services

HITL Service - Human oversight

  • Escalation workflows and approvals
  • Decision points and manual overrides
  • Audit trails and compliance

Worktree Service - Repository context

  • Git repository binding and switching
  • Branch and commit management
  • Workspace isolation

Tool Service - External integrations

  • API and system command execution
  • Result capture and error handling
  • Permission and security policies
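
From a caller's perspective, the tool message types allow a simple dispatch on the outcome. A minimal Python sketch; the send and await_reply helpers are hypothetical, and the numeric values come from the message type table in §3.3.2.

import json

TOOL_CALL, TOOL_RESULT, TOOL_ERROR = 9, 10, 11  # values from §3.3.2

def call_tool(send, await_reply, correlation_id: str, tool: str, args: dict):
    """send and await_reply are hypothetical SDK helpers keyed by correlation_id."""
    send(message_type=TOOL_CALL,
         correlation_id=correlation_id,
         payload=json.dumps({"tool": tool, "arguments": args}).encode())
    reply = await_reply(correlation_id)
    if reply.message_type == TOOL_RESULT:
        return json.loads(reply.payload)
    if reply.message_type == TOOL_ERROR:
        raise RuntimeError(f"tool failed: {reply.payload.decode()}")
    raise RuntimeError(f"unexpected message type {reply.message_type}")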

3.5. Message Patterns

3.5.1. Request-Response

// Request
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}

// Response  
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}
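
On the sender side, correlation typically reduces to a pending-reply map keyed by correlation_id. A minimal asyncio sketch; the transport send hook is a hypothetical placeholder.

import asyncio
import uuid

class Correlator:
    """Matches responses to outstanding requests by correlation_id."""
    def __init__(self, transport_send):
        self._send = transport_send          # hypothetical async send hook
        self._pending: dict[str, asyncio.Future] = {}

    async def request(self, payload: bytes, timeout: float = 30.0) -> bytes:
        correlation_id = f"req-{uuid.uuid4()}"
        future = asyncio.get_running_loop().create_future()
        self._pending[correlation_id] = future
        await self._send(correlation_id, payload)
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(correlation_id, None)

    def on_response(self, correlation_id: str, payload: bytes) -> None:
        future = self._pending.get(correlation_id)
        if future is not None and not future.done():
            future.set_result(payload)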

3.5.2. Fire-and-Forget

message: {
  message_type: NOTIFICATION,
  correlation_id: "",  // No response expected
  payload: {...}
}

3.5.3. Command Pattern

message: {
  message_type: CONTROL,
  payload: {
    "command": "status",
    "parameters": {...}
  }
}

3.6. Error Handling

3.6.1. Error Codes

Code  Name               Description
0     UNSPECIFIED        No error or unknown error
1     BUFFER_FULL        Message queue capacity exceeded
2     NO_ROUTE           No path to destination agent
3     ACK_TIMEOUT        Acknowledgment not received in time
6     VALIDATION_ERROR   Message format or content invalid
7     PERMISSION_DENIED  Insufficient privileges for operation
9     OVERSIZE_PAYLOAD   Message exceeds size limits
99    INTERNAL_ERROR     Unexpected system failure

3.6.2. Error Response Pattern

ack: {
  ack_for_message_id: "original-msg-id",
  ack_stage: FAILED,
  error_code: VALIDATION_ERROR,
  note: "Required field 'agent_id' missing"
}
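
Failure acks combine with the idempotency token to make retries safe. A minimal sender-side sketch under the assumption that BUFFER_FULL and ACK_TIMEOUT are retryable while other codes are permanent; the send_attempt hook is hypothetical.

import random
import time

RETRYABLE = {1, 3}   # BUFFER_FULL, ACK_TIMEOUT (§3.6.1)
ACK_FULFILLED = 3    # ACK stage

def send_with_retry(send_attempt, idempotency_token: str, payload: bytes,
                    max_attempts: int = 5):
    """send_attempt(idempotency_token, payload, retry_count) -> ack is a hypothetical SDK hook."""
    for attempt in range(max_attempts):
        ack = send_attempt(idempotency_token, payload, retry_count=attempt)
        if ack.ack_stage == ACK_FULFILLED:
            return ack
        if ack.error_code not in RETRYABLE:
            raise RuntimeError(f"permanent failure ({ack.error_code}): {ack.note}")
        # Exponential backoff with jitter; the unchanged idempotency_token lets the
        # receiver deduplicate any redelivered attempt.
        time.sleep(min(30, 2 ** attempt) + random.random())
    raise RuntimeError("retries exhausted")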

3.7. Security Model

3.7.1. Authentication

  • Service-to-service authentication via mutual TLS
  • Agent identity verification through public key cryptography
  • Token-based session management

3.7.2. Authorization

  • Role-based access control (RBAC) for service operations
  • Message-level permissions based on sender/receiver identity
  • Policy-based filtering and transformation

3.7.3. Data Protection

  • End-to-end encryption for sensitive payloads (see the sketch after this list)
  • Audit logging of all security-relevant operations
  • Compliance with data residency and retention policies
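
A minimal sketch of payload protection with AES-256-GCM, using the cryptography package; the key is generated inline only for illustration and would come from Vault or a KMS in practice, with the message id used as associated data.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_payload(key: bytes, plaintext: bytes, aad: bytes = b"") -> bytes:
    nonce = os.urandom(12)                     # 96-bit nonce, unique per message
    return nonce + AESGCM(key).encrypt(nonce, plaintext, aad)

def decrypt_payload(key: bytes, blob: bytes, aad: bytes = b"") -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, aad)

key = AESGCM.generate_key(bit_length=256)      # illustration only; fetch from KMS/Vault in practice
sealed = encrypt_payload(key, b'{"ssn": "redacted"}', aad=b"msg-123")
assert decrypt_payload(key, sealed, aad=b"msg-123") == b'{"ssn": "redacted"}'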

3.8. Deployment Considerations

3.8.1. Scalability

  • Horizontal scaling of all services
  • Message partitioning and sharding
  • Load balancing with session affinity

3.8.2. Reliability

  • At-least-once message delivery guarantees
  • Circuit breakers and retry policies (see the sketch after this list)
  • Graceful degradation during partial failures
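
A minimal circuit-breaker sketch referenced from the list above; the failure threshold and cool-down window are illustrative defaults, not values mandated by the specification.

import time

class CircuitBreaker:
    """Fails fast after consecutive errors, then allows a probe after a cool-down."""
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None              # half-open: allow a single probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result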

3.8.3. Observability

  • Distributed tracing across service boundaries
  • Metrics for throughput, latency, and error rates
  • Structured logging with correlation IDs

3.9. Comparison with Google's Agent-to-Agent Protocol

Overview of Google's A2A Protocol

Google's Agent2Agent (A2A) is an open standard designed for enterprise-grade interoperability among AI agents. The core aspects include:

  • Agent Discovery via "Agent Cards"—agents advertise capabilities in a JSON Agent Card format to help other agents find the best fit for a task.

  • Task-Oriented Communication—client agents send tasks to remote agents, which respond with artifacts and real‑time status updates; long‑running tasks and streaming support are first-class features.

  • Secure, Standard Protocol—built on HTTP, JSON-RPC, and Server-Sent Events; enterprise-ready authentication and authorization (aligned with OpenAPI schemes) are built in.

  • Modality Agnostic—supports text, audio, video, and multi-part content negotiation through "parts" attached to each message.

  • Interoperability with MCP—A2A complements Anthropic's Model Context Protocol (MCP), which focuses on tool invocation, creating a full-stack agent interoperability ecosystem.

Architectural Comparison

Aspect                         A2A Protocol                                      SW4RM Framework
Discovery Mechanism            Agent Card-based (via well-known URLs)            Internal registry with broadcast discovery
Underlying Transport           HTTP / JSON-RPC / SSE                             gRPC-native
Modality Handling              "Parts" with content-type negotiation             content_type + modality capabilities in registry
Task Focus                     External task outsourcing and artifact return     Intra-system task scheduling and messaging
Preemption & Scheduling        Not addressed                                     Preemption, priorities, safe-points deeply defined
Preemptible Sections           –                                                 Fully specified (§7)
Idempotency / Retry Logic      –                                                 Robust idempotency_token model with cache (§11.1)
Negotiation & HITL Escalation  –                                                 Built-in negotiation framework (§17) and escalation via HITL (§15)
Worktree & Confinement         –                                                 Explicit Git worktree isolation (§16)
Observability & Logging        Enterprise-secure flow, but spec is lightweight   Strict structured logs and audit trails (§19)

Overlap with SW4RM Specification

Similarities:

  • Agent Discovery & Registration: A2A's Agent Cards align closely with our Registry & Discovery module. Although we do not define an explicit Agent Card structure, the registry supports name, capabilities, modality, and description (§14).

  • Secure Communication & Modality Support: Our use of gRPC with optional signing and multi-modal content types corresponds to A2A's modality-agnostic design and enterprise security foundation.

  • Long-Running Tasks & States: A2A's support for long-running workflows parallels our task lifecycle, message states, and streaming tool calls.

Complementary Design Philosophy:

Google's A2A focuses on secure, interoperable agent messaging across enterprise boundaries, emphasizing discovery, modality negotiation, and long-running tasks. SW4RM defines the deeper machinery—scheduling, cancellation, preemption, idempotency, negotiation, worktree confinement, logs, and tool integration. These two can co-exist: use A2A for agent-to-agent orchestration and SW4RM for a resilient internal engine.

3.10. Next Steps