3. SW4RM Protocol Specification

SW4RM Protocol v0.1 | Status: Production Ready | Last Updated: 2025-08-09

This specification defines the SW4RM message-driven agent communication system. The protocol is built on gRPC and Protocol Buffers and targets distributed agentic systems that need at-least-once message delivery, observability, and security controls.

Terminology: “Agent” in this specification follows the supervised, process‑isolated definition in documentation/index.md (see “Agents and Agentic Interaction”), which differs from common industry usage where “agent” may mean an LLM wrapper or in‑process automation.

3.1. Executive Summary

The SW4RM protocol addresses the fundamental challenges of distributed agentic systems by providing a complete communication framework with the following core capabilities:

Guaranteed Message Delivery: At-least-once delivery semantics with configurable consistency levels
Comprehensive State Management: Persistent state across failures with automatic recovery mechanisms
Enterprise Security: Zero-trust architecture with mutual TLS and role-based access control
Production Observability: Complete distributed tracing, metrics, and audit logging
Horizontal Scalability: Linear scaling with no single points of failure
Multi-Tenancy Support: Secure isolation between different agent workloads

3.2. Architectural Foundation and Design Principles

3.2.1. Service-Oriented Architecture (SOA) Implementation

SW4RM implements a microservices architecture with clear service boundaries, standardized communication protocols, and comprehensive fault tolerance mechanisms. The architecture is designed for:

Independent Service Scaling: Each service can be scaled independently based on workload requirements
Fault Isolation: Service failures are contained and do not cascade to other components
Technology Diversity: Services can be implemented in different technologies while maintaining protocol compatibility
Operational Independence: Services can be deployed, monitored, and managed independently

graph TB
    subgraph "Client Layer [gRPC/TLS]"
        AGENT[Agent Applications<br/>Business Logic Layer]
        SDK[SW4RM SDK<br/>Runtime Library]
    end

    subgraph "Core Infrastructure Services"
        REGISTRY[Registry Service<br/>:50051<br/>Agent Discovery & Health]
        ROUTER[Router Service<br/>:50052<br/>Message Delivery & Routing]
        SCHEDULER[Scheduler Service<br/>:50053<br/>Task Distribution & Load Balancing]
    end

    subgraph "Extended Capability Services"
        HITL[Human-in-the-Loop Service<br/>:50061<br/>Approval Workflows & Escalation]
        WORKTREE[Worktree Service<br/>:50062<br/>Git Integration & Repository Management]
        TOOLS[Tool Service<br/>:50063<br/>External System Integration]
        NEGOTIATE[Negotiation Service<br/>:50064<br/>Multi-Agent Consensus & Coordination]
        REASON[Reasoning Service<br/>:50065<br/>Decision Support & Analytics]
        AUDIT[Audit Service<br/>:50066<br/>Compliance & Security Logging]
        CONNECT[Connector Service<br/>:50067<br/>External API Integration]
    end

    subgraph "Data & Storage Layer"
        POSTGRES[(PostgreSQL Cluster<br/>Transactional State)]
        REDIS[(Redis Cluster<br/>Session & Cache)]
        S3[(Object Storage<br/>Large Payloads & Archives)]
        GIT[(Git Repositories<br/>Source Code & Configuration)]
    end

    subgraph "Observability & Security"
        PROMETHEUS[Prometheus<br/>Metrics Collection]
        JAEGER[Jaeger<br/>Distributed Tracing]
        VAULT[HashiCorp Vault<br/>Secrets Management]
        CONSUL[Consul<br/>Service Discovery]
    end

    AGENT -->|gRPC/TLS| SDK
    SDK -->|Load Balanced| REGISTRY
    SDK -->|Message Flow| ROUTER
    SDK -->|Task Requests| SCHEDULER

    SDK -->|Approval Requests| HITL
    SDK -->|Repository Operations| WORKTREE
    SDK -->|External Calls| TOOLS
    SDK -->|Coordination| NEGOTIATE
    SDK -->|Analytics| REASON
    SDK -->|Audit Events| AUDIT
    SDK -->|API Integrations| CONNECT

    REGISTRY --> POSTGRES
    ROUTER --> POSTGRES
    ROUTER --> REDIS
    SCHEDULER --> POSTGRES
    SCHEDULER --> REDIS

    WORKTREE --> GIT
    TOOLS --> S3
    AUDIT --> S3

    REGISTRY -.-> PROMETHEUS
    ROUTER -.-> JAEGER
    SCHEDULER -.-> VAULT
    HITL -.-> CONSUL

3.2.2. Fundamental Protocol Design Principles

3.2.2.1. Message-Driven Communication Model

Event Sourcing Architecture: All system interactions are represented as immutable events (messages) that form an event log, enabling complete system state reconstruction and audit trails.

Technical Implementation:

Message Persistence: All messages are durably persisted before acknowledgment using write-ahead logging
Event Ordering: Global message ordering using hybrid logical clocks (HLC) for causal consistency
Message Deduplication: SHA-256 based content hashing prevents duplicate message processing
Delivery Semantics: Configurable delivery guarantees (at-most-once, at-least-once, exactly-once)
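
The deduplication bullet above can be illustrated with a minimal sketch. It assumes deduplication keys on a SHA-256 digest of the raw payload bytes and keeps digests in an unbounded in-memory set; the class name `ContentDeduplicator` is hypothetical, and a real deployment would presumably bound or persist the set.

```python
import hashlib


class ContentDeduplicator:
    """Remembers SHA-256 digests of payloads already seen (in memory, unbounded)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, payload: bytes) -> bool:
        """Return True if an identical payload was recorded before; otherwise record it."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

A consumer would call `is_duplicate(payload)` before processing and skip the message when it returns `True`.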

3.2.2.2. Distributed System Consistency Model

Implementation of Eventual Consistency with Strong Consistency Options:

Eventual Consistency (Default): Best performance; replicas are guaranteed to converge eventually
Strong Consistency: Configurable strong consistency for critical operations using distributed consensus
Causal Consistency: Maintains causal relationships between related messages using vector clocks
Session Consistency: Guarantees consistency within agent session boundaries

Consistency Configuration:

message ConsistencyConfig {
  ConsistencyLevel default_level = 1;
  map<string, ConsistencyLevel> operation_overrides = 2;
  uint32 eventual_consistency_timeout_ms = 3;  // Default: 5000ms
  uint32 strong_consistency_timeout_ms = 4;    // Default: 30000ms
}

enum ConsistencyLevel {
  EVENTUAL = 0;      // Best performance, eventual convergence
  CAUSAL = 1;        // Maintains causal relationships
  SESSION = 2;       // Consistency within agent sessions
  STRONG = 3;        // Distributed consensus, highest latency
}
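
As a rough illustration of how `operation_overrides` might be applied, the sketch below mirrors the two messages above as Python types and resolves the effective level and timeout for one operation. The `resolve` helper is hypothetical, and the choice of which timeout applies to `CAUSAL` and `SESSION` is an assumption, since the configuration only names timeouts for eventual and strong consistency.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class ConsistencyLevel(IntEnum):
    EVENTUAL = 0
    CAUSAL = 1
    SESSION = 2
    STRONG = 3


@dataclass
class ConsistencyConfig:
    default_level: ConsistencyLevel = ConsistencyLevel.EVENTUAL
    operation_overrides: dict[str, ConsistencyLevel] = field(default_factory=dict)
    eventual_consistency_timeout_ms: int = 5000
    strong_consistency_timeout_ms: int = 30000


def resolve(config: ConsistencyConfig, operation: str) -> tuple[ConsistencyLevel, int]:
    """Return (level, timeout_ms) for an operation, honoring per-operation overrides."""
    level = config.operation_overrides.get(operation, config.default_level)
    # Assumption: only STRONG uses the longer timeout; every other level
    # falls back to the eventual-consistency timeout.
    if level == ConsistencyLevel.STRONG:
        return level, config.strong_consistency_timeout_ms
    return level, config.eventual_consistency_timeout_ms
```

An operation without an override inherits `default_level`, so only critical operations pay the latency cost of distributed consensus.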

3.2.2.3. Comprehensive Security Architecture

Zero-Trust Network Model: Every service interaction requires authentication and authorization, with no implicit trust relationships.

Security Implementation Layers:

Transport Security:
Mutual TLS (mTLS) for all inter-service communication
TLS 1.3 with forward secrecy using ECDHE key exchange
Certificate rotation with 24-hour certificate lifetime
Certificate pinning for critical service connections
Authentication & Authorization:
OAuth 2.0 / OpenID Connect integration for external authentication
JWT tokens with configurable expiration (default: 1 hour)
Role-Based Access Control (RBAC) with fine-grained permissions
Attribute-Based Access Control (ABAC) for complex authorization scenarios
Data Protection:
AES-256-GCM encryption for sensitive payloads
Field-level encryption for PII and sensitive data
Cryptographic signatures for message integrity verification
Key management integration with HashiCorp Vault or AWS KMS

Security Configuration Example:

message SecurityConfig {
  TLSConfig tls_config = 1;
  AuthenticationConfig auth_config = 2;
  EncryptionConfig encryption_config = 3;
  AuditConfig audit_config = 4;
}

message TLSConfig {
  string ca_cert_path = 1;
  string client_cert_path = 2;
  string client_key_path = 3;
  repeated string cipher_suites = 4;
  uint32 handshake_timeout_seconds = 5;  // Default: 10
  bool enable_cert_pinning = 6;
}

message AuthenticationConfig {
  string jwt_secret_key = 1;
  uint32 token_expiry_seconds = 2;       // Default: 3600
  repeated string allowed_issuers = 3;
  bool enable_service_accounts = 4;
  string service_account_key_path = 5;
}
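
To make the roles of `jwt_secret_key`, `token_expiry_seconds`, and `allowed_issuers` concrete, here is a minimal standard-library sketch of issuing and verifying an HS256-signed token. The function names are hypothetical, and the use of HS256 is an assumption: the specification does not state which signing algorithm SW4RM uses. A deployment would more likely rely on a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, issuer: str, subject: str,
                expiry_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Create a compact HS256 token that expires after expiry_seconds."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"iss": issuer, "sub": subject,
              "iat": issued_at, "exp": issued_at + expiry_seconds}
    signing_input = (_b64url(json.dumps(header).encode()) + "."
                     + _b64url(json.dumps(claims).encode()))
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes, allowed_issuers: list[str],
                 now: Optional[float] = None) -> dict:
    """Return the claims if signature, algorithm, expiry, and issuer all check out."""
    header_b64, claims_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{claims_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    current = time.time() if now is None else now
    if claims["exp"] <= current:
        raise ValueError("token expired")
    if claims["iss"] not in allowed_issuers:
        raise ValueError("issuer not allowed")
    return claims
```

The `now` parameter exists only so that expiry can be exercised deterministically in tests.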

3.2.2.4. Enterprise-Grade Observability Framework

Three Pillars of Observability Implementation:

Comprehensive Metrics Collection:
Business metrics: Message processing rates, success/failure ratios, processing latencies
System metrics: CPU, memory, network, disk utilization per service
Custom metrics: Domain-specific KPIs and performance indicators
Real-time alerting with configurable thresholds and escalation policies
Distributed Tracing:
OpenTelemetry-compliant distributed tracing across all service boundaries
Trace sampling strategies: Always, never, probabilistic, adaptive
Trace correlation across message processing pipelines
Performance bottleneck identification and optimization recommendations
Structured Audit Logging:
Immutable audit logs with cryptographic integrity verification
Comprehensive security event logging (authentication, authorization, data access)
Business process audit trails for compliance requirements
Log retention policies with automated archival to cold storage

Observability Configuration:

message ObservabilityConfig {
  MetricsConfig metrics = 1;
  TracingConfig tracing = 2;
  LoggingConfig logging = 3;
}

message TracingConfig {
  bool enabled = 1;
  string jaeger_endpoint = 2;
  SamplingStrategy sampling = 3;
  map<string, string> tags = 4;
}

enum SamplingStrategy {
  ALWAYS = 0;
  NEVER = 1;
  PROBABILISTIC = 2;  // Requires sampling_rate
  ADAPTIVE = 3;       // AI-based sampling
}
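
One way the sampling strategies could translate into a per-trace decision is sketched below. The `should_sample` function and its `sampling_rate` parameter are hypothetical; the enum comment mentions a sampling rate, but the messages above do not show where it is configured. Hashing the trace ID is an assumed design choice that makes every service reach the same decision for the same trace. `ADAPTIVE` is left unimplemented because its behavior is not specified here.

```python
import hashlib
from enum import IntEnum


class SamplingStrategy(IntEnum):
    ALWAYS = 0
    NEVER = 1
    PROBABILISTIC = 2
    ADAPTIVE = 3


def should_sample(strategy: SamplingStrategy, trace_id: str,
                  sampling_rate: float = 0.0) -> bool:
    """Decide whether a trace is recorded under the given strategy."""
    if strategy == SamplingStrategy.ALWAYS:
        return True
    if strategy == SamplingStrategy.NEVER:
        return False
    if strategy == SamplingStrategy.PROBABILISTIC:
        if not 0.0 <= sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be within [0, 1]")
        # Use the top 53 bits of the digest so the bucket is an exact
        # float in [0, 1) and is stable for a given trace ID.
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
        return bucket < sampling_rate
    raise NotImplementedError("ADAPTIVE sampling is not specified in this sketch")
```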

3.3. Core Concepts

3.3.1. Message Envelope

Every message is wrapped in a standard envelope providing:

message Envelope {
  string message_id = 1;                // UUIDv4 per attempt
  string idempotency_token = 2;         // Stable across retries
  string producer_id = 3;               // Source agent identifier
  string correlation_id = 4;            // Request/response correlation
  uint64 sequence_number = 5;           // Ordering within conversation
  uint32 retry_count = 6;               // Retry attempt number
  MessageType message_type = 7;         // Message classification
  string content_type = 8;              // Payload format (MIME type)
  uint64 content_length = 9;            // Payload size in bytes
  string repo_id = 10;                  // Repository context (optional)
  string worktree_id = 11;              // Worktree context (optional)  
  string hlc_timestamp = 12;            // Hybrid logical clock
  uint64 ttl_ms = 13;                   // Time-to-live in milliseconds
  google.protobuf.Timestamp timestamp = 14; // Delivery timestamp
  bytes payload = 15;                   // Message content
}
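
The comments on `message_id` ("per attempt") and `idempotency_token` ("stable across retries") imply that a retry gets a fresh message ID while keeping its token. A minimal sketch of that behavior follows, using a reduced subset of the envelope fields; the helpers `new_envelope` and `for_retry` are hypothetical, and generating the idempotency token as a random UUID is an assumption, since the specification does not say how it is derived.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Envelope:
    """Reduced subset of the protobuf Envelope, for illustration only."""
    message_id: str
    idempotency_token: str
    producer_id: str
    correlation_id: str
    payload: bytes
    content_type: str = "application/json"
    content_length: int = 0
    retry_count: int = 0


def new_envelope(producer_id: str, correlation_id: str, payload: bytes,
                 content_type: str = "application/json") -> Envelope:
    """Build a first-attempt envelope with fresh identifiers."""
    return Envelope(
        message_id=str(uuid.uuid4()),
        idempotency_token=str(uuid.uuid4()),  # assumption: random token
        producer_id=producer_id,
        correlation_id=correlation_id,
        payload=payload,
        content_type=content_type,
        content_length=len(payload),
    )


def for_retry(envelope: Envelope) -> Envelope:
    """New message_id and incremented retry_count; idempotency_token is unchanged."""
    return replace(envelope, message_id=str(uuid.uuid4()),
                   retry_count=envelope.retry_count + 1)
```

Because the token survives retries, a receiver can use it to recognize a redelivered message even though each attempt carries a different `message_id`.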

3.3.2. Message Types

Type	Value	Description	Use Case
`CONTROL`	1	System commands	Status requests, configuration
`DATA`	2	Application payload	Business logic, responses, content
`ACKNOWLEDGEMENT`	5	Message confirmations	Delivery receipts, error reports
`HITL_INVOCATION`	6	Human-in-the-loop requests	Approval workflows, escalations
`WORKTREE_CONTROL`	7	Repository operations	Bind, unbind, switch contexts
`NEGOTIATION`	8	Multi-party coordination	Consensus, resource allocation
`TOOL_CALL`	9	External tool execution	API calls, system commands
`TOOL_RESULT`	10	Tool execution results	Success responses, data returns
`TOOL_ERROR`	11	Tool execution failures	Error conditions, exceptions

3.3.3. Acknowledgment Lifecycle

Every message follows a predictable ACK progression:

sequenceDiagram
    participant S as Sender
    participant R as Router  
    participant T as Target

    S->>R: SendMessage(envelope)
    R-->>S: SendMessageResponse{accepted: true}

    R->>T: Deliver envelope
    T-->>R: ACK{stage: RECEIVED}

    T->>T: Parse and validate
    T-->>R: ACK{stage: READ}

    T->>T: Process message
    alt Success
        T-->>R: ACK{stage: FULFILLED}
    else Error  
        T-->>R: ACK{stage: FAILED, error_code: X}
    end

    R->>S: Forward ACKs

ACK Stages:

RECEIVED (1): Message delivered to target
READ (2): Message parsed and validated
FULFILLED (3): Processing completed successfully
REJECTED (4): Message rejected due to policy/validation
FAILED (5): Processing failed due to error
TIMED_OUT (6): Processing exceeded time limits
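
The progression can be modeled as a small state machine. The stage values below come from the list above, but the transition table is an assumption inferred from the sequence diagram (for example, that `REJECTED` can follow `RECEIVED` or `READ`, and that `TIMED_OUT` can occur at any non-terminal point); the specification does not enumerate the legal transitions. The `AckTracker` class is hypothetical.

```python
from enum import IntEnum
from typing import Optional


class AckStage(IntEnum):
    RECEIVED = 1
    READ = 2
    FULFILLED = 3
    REJECTED = 4
    FAILED = 5
    TIMED_OUT = 6


# Assumed transitions; None means no ACK has been seen yet.
ALLOWED: dict[Optional[AckStage], set[AckStage]] = {
    None: {AckStage.RECEIVED, AckStage.TIMED_OUT},
    AckStage.RECEIVED: {AckStage.READ, AckStage.REJECTED, AckStage.TIMED_OUT},
    AckStage.READ: {AckStage.FULFILLED, AckStage.REJECTED,
                    AckStage.FAILED, AckStage.TIMED_OUT},
}

TERMINAL = {AckStage.FULFILLED, AckStage.REJECTED,
            AckStage.FAILED, AckStage.TIMED_OUT}


class AckTracker:
    """Tracks the latest ACK stage for one message and rejects illegal jumps."""

    def __init__(self) -> None:
        self.stage: Optional[AckStage] = None

    @property
    def is_terminal(self) -> bool:
        return self.stage in TERMINAL

    def advance(self, next_stage: AckStage) -> None:
        # Terminal stages have no entry in ALLOWED, so nothing may follow them.
        if next_stage not in ALLOWED.get(self.stage, set()):
            raise ValueError(f"illegal ACK transition: {self.stage} -> {next_stage}")
        self.stage = next_stage
```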

3.4. Service Architecture

3.4.1. Core Services

Registry Service - Agent lifecycle management

Agent registration and discovery
Health monitoring and heartbeats
Capability advertisement

Router Service - Message delivery

Reliable message routing between agents
Message streaming and buffering
Load balancing and failover

Scheduler Service - Work coordination

Task distribution and prioritization
Resource allocation and preemption
Activity buffer management

3.4.2. Extended Services

HITL Service - Human oversight

Escalation workflows and approvals
Decision points and manual overrides
Audit trails and compliance

Worktree Service - Repository context

Git repository binding and switching
Branch and commit management
Workspace isolation

Tool Service - External integrations

API and system command execution
Result capture and error handling
Permission and security policies

3.5. Message Patterns

3.5.1. Request-Response

// Request
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}

// Response  
message: {
  message_type: DATA,
  correlation_id: "req-123",
  payload: {...}
}
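
On the requesting side, matching a response to its request by `correlation_id` might look like the following asyncio sketch. The `Correlator` class and its method names are hypothetical and not part of the SW4RM SDK; the sketch only shows the bookkeeping, not the transport.

```python
import asyncio


class Correlator:
    """Maps correlation IDs to futures awaited by the requesting code."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def expect(self, correlation_id: str) -> asyncio.Future:
        """Register interest in a response; must be called inside a running loop."""
        future = asyncio.get_running_loop().create_future()
        self._pending[correlation_id] = future
        return future

    def on_message(self, correlation_id: str, payload: bytes) -> bool:
        """Resolve the matching request; return False if nothing was waiting."""
        future = self._pending.pop(correlation_id, None)
        if future is None or future.done():
            return False
        future.set_result(payload)
        return True
```

A caller would typically wrap the returned future in `asyncio.wait_for` so that a missing response surfaces as a timeout rather than a hang.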

3.5.2. Fire-and-Forget

message: {
  message_type: NOTIFICATION,
  correlation_id: "",  // No response expected
  payload: {...}
}

3.5.3. Command Pattern

message: {
  message_type: CONTROL,
  payload: {
    "command": "status",
    "parameters": {...}
  }
}

3.6. Error Handling

3.6.1. Error Codes

Code	Name	Description
0	`UNSPECIFIED`	No error or unknown error
1	`BUFFER_FULL`	Message queue capacity exceeded
2	`NO_ROUTE`	No path to destination agent
3	`ACK_TIMEOUT`	Acknowledgment not received in time
6	`VALIDATION_ERROR`	Message format or content invalid
7	`PERMISSION_DENIED`	Insufficient privileges for operation
9	`OVERSIZE_PAYLOAD`	Message exceeds size limits
99	`INTERNAL_ERROR`	Unexpected system failure

3.6.2. Error Response Pattern

ack: {
  ack_for_message_id: "original-msg-id",
  ack_stage: FAILED,
  error_code: VALIDATION_ERROR,
  note: "Required field 'agent_id' missing"
}
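
A sender receiving a `FAILED` ACK has to decide whether a retry is worthwhile. The sketch below uses the error codes from the table above, but the split into retryable and non-retryable codes, the `retry_delay_ms` helper, and the backoff parameters are all assumptions for illustration; the specification does not define a retry policy in this section.

```python
from enum import IntEnum
from typing import Optional


class ErrorCode(IntEnum):
    UNSPECIFIED = 0
    BUFFER_FULL = 1
    NO_ROUTE = 2
    ACK_TIMEOUT = 3
    VALIDATION_ERROR = 6
    PERMISSION_DENIED = 7
    OVERSIZE_PAYLOAD = 9
    INTERNAL_ERROR = 99


# Assumption: only transient conditions are worth retrying unchanged.
RETRYABLE = {ErrorCode.BUFFER_FULL, ErrorCode.ACK_TIMEOUT}


def retry_delay_ms(error_code: ErrorCode, retry_count: int,
                   base_ms: int = 100, cap_ms: int = 10_000) -> Optional[int]:
    """Return an exponential backoff delay, or None if the error should not be retried."""
    if error_code not in RETRYABLE:
        return None
    return min(cap_ms, base_ms * 2 ** retry_count)
```

Errors such as `VALIDATION_ERROR` would fail identically on every attempt, so resending the same envelope only adds load.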

3.7. Security Model

3.7.1. Authentication

Service-to-service authentication via mutual TLS
Agent identity verification through public key cryptography
Token-based session management

3.7.2. Authorization

Role-based access control (RBAC) for service operations
Message-level permissions based on sender/receiver identity
Policy-based filtering and transformation

3.7.3. Data Protection

End-to-end encryption for sensitive payloads
Audit logging of all security-relevant operations
Compliance with data residency and retention policies

3.8. Deployment Considerations

3.8.1. Scalability

Horizontal scaling of all services
Message partitioning and sharding
Load balancing with session affinity

3.8.2. Reliability

At-least-once message delivery guarantees
Circuit breakers and retry policies
Graceful degradation during partial failures
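
For readers unfamiliar with the circuit breaker pattern mentioned above, a minimal generic sketch follows. It is not SW4RM's implementation: the class, its thresholds, and the injectable `clock` are hypothetical and chosen only to keep the example small and testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures and allows trial requests after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down elapses, then permit trial requests.
        return self._clock() - self._opened_at >= self._reset_timeout_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            # (Re)open the breaker and restart the cool-down window.
            self._opened_at = self._clock()
```

A caller checks `allow_request()` before contacting a downstream service and reports the outcome with `record_success()` or `record_failure()`, which keeps a failing dependency from being hammered while it recovers.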

3.8.3. Observability

Distributed tracing across service boundaries
Metrics for throughput, latency, and error rates
Structured logging with correlation IDs

3.9. Comparison with Google's Agent-to-Agent Protocol

Overview of Google's A2A Protocol

Google's Agent2Agent (A2A) is an open standard designed for enterprise-grade interoperability among AI agents. The core aspects include:

Agent Discovery via "Agent Cards": Agents advertise capabilities in a JSON Agent Card format to help other agents find the best fit for a task.
Task-Oriented Communication: Client agents send tasks to remote agents, which respond with artifacts and real‑time status updates; long‑running tasks and streaming support are first-class features.
Secure, Standard Protocol: Built on HTTP, JSON-RPC, and Server-Sent Events; enterprise-ready authentication and authorization (aligned with OpenAPI authentication schemes) are built in.
Modality Agnostic: Supports text, audio, video, and multi-part content negotiation through "parts" attached to each message.
Interoperability with MCP: A2A complements Anthropic's Model Context Protocol (MCP), which focuses on tool invocation; together they cover agent-to-agent and agent-to-tool interoperability.

Architectural Comparison

Aspect	A2A Protocol	SW4RM Framework
Discovery Mechanism	Agent Card-based (via well-known URLs)	Internal registry with broadcast discovery
Underlying Transport	HTTP / JSON-RPC / SSE	gRPC-native
Modality Handling	"Parts" with content-type negotiation	content_type + modality capabilities in registry
Task Focus	External task outsourcing and artifact return	Intra-system task scheduling and messaging
Preemption & Scheduling	Not addressed	Preemption, priorities, safe-points deeply defined
Preemptible Sections	─	Fully specified (§7)
Idempotency / Retry Logic	─	Robust idempotency_token model with cache (§11.1)
Negotiation & HITL Escalation	─	Built-in negotiation framework (§17) and escalation via HITL (§15)
Worktree & Confinement	─	Explicit Git worktree isolation (§16)
Observability & Logging	Enterprise-secure flow, but spec is lightweight	Strict structured logs and audit trails (§19)

Overlap with SW4RM Specification

Similarities:

Agent Discovery & Registration: A2A's Agent Cards align closely with our Registry & Discovery module. Although we have not defined an explicit Agent Card structure, the registry supports name, capabilities, modality, and description (§14).
Secure Communication & Modality Support: Our use of gRPC with optional signing and multi-modal content types corresponds to A2A's modality-agnostic design and enterprise security foundation.
Long-Running Tasks & States: A2A's support for long-running workflows parallels our task lifecycle, message states, and streaming tool calls.

Complementary Design Philosophy:

Google's A2A focuses on secure, interoperable agent messaging across enterprise boundaries, emphasizing discovery, modality negotiation, and long-running tasks. SW4RM defines the deeper machinery: scheduling, cancellation, preemption, idempotency, negotiation, worktree confinement, logs, and tool integration. The two can coexist: use A2A for agent-to-agent orchestration across boundaries and SW4RM as a resilient internal engine.

3.10. Next Steps

Message Types - Detailed message specifications
Services - Complete service API reference
ACK Lifecycle - Acknowledgment handling patterns