5. Architecture
Deep dive into the SW4RM SDK architecture, design patterns, and extensibility. This page complements the Protocol Specification and Getting Started guides, and links into deeper sections where appropriate.
5.1. Overview
The SDK follows a layered architecture: application code and message handlers sit on top of the SDK components, which persist runtime state and reach the core services through typed clients built on protobuf stubs and gRPC channels.
Note: “Agent” in this documentation follows the supervised, process‑isolated definition in the main index (see “Agents and Agentic Interaction” in documentation/index.md) rather than the colloquial “LLM wrapper” usage.
graph TB
subgraph "Application Layer"
A[Your Agent Code]
B[Message Handlers]
end
subgraph "SDK Layer"
C[MessageProcessor]
D[ACKLifecycleManager]
E[Policy Hooks]
end
subgraph "Runtime State"
F[Activity Buffer]
G[Worktree State]
end
subgraph "Client Layer"
H[Router Client]
I[Registry Client]
J[Other Clients]
end
subgraph "Protocol Layer"
K[Protobuf Stubs]
L[gRPC Channels]
end
A --> C
B --> C
C --> D
C --> F
C --> G
D --> H
F --> E
G --> E
H --> K
I --> K
J --> K
K --> L
ACK Legend
- RECEIVED: Router durably accepted and persisted the message; origin records receipt and retries if this is not observed before timeout.
- READ: Target parsed and validated the message and accepted it for processing.
- FULFILLED: Target completed processing successfully; origin finalizes lifecycle.
For a system-wide view including services, data stores, and ports, see Detailed System Architecture in the main index.
- Detailed system diagram: Detailed System Architecture
- Persistent state model: Persistent State Management Architecture
- Protocol services and messages: Services and Messages
5.2. Core Principles
- Persistence by design: All stateful runtime components (Activity Buffer, Worktree State, configuration) persist across restarts using pluggable backends. Writes are made durable before emitting downstream ACKs, and on startup the runtime replays the log and reconciles ordering (e.g., vector clocks) to guarantee consistent recovery.
- Policy-driven behavior: Policy hooks intercept messages at defined phases (ingress, pre-dispatch, post-handler, egress) for validation, transformation, routing, and quota enforcement. Hooks execute deterministically with priority, timeouts, and error isolation to ensure predictable outcomes without compromising throughput.
- Explicit ACK lifecycle: Delivery confirmation is explicit (RECEIVED → READ → FULFILLED) with idempotency_token values, bounded retries with exponential backoff, and DLQ on exhaustion. This yields at-least-once delivery while preventing duplicate side effects in handlers via idempotent processing.
- Composability: SDK components are modular and replaceable; use only the MessageProcessor, ACKLifecycleManager, ActivityBuffer, or Worktree integrations you need. Clients are loosely coupled through protobuf contracts, enabling incremental adoption and component swaps behind stable interfaces.
5.3. Runtime Data Flow
The sequence below shows a typical end-to-end message journey with ACK lifecycle and persistence.
sequenceDiagram
autonumber
participant APP as Agent App
participant SDK as SDK Runtime
participant AB as Activity Buffer
participant RTR as Router Service
participant TGT as Target Agent
APP->>SDK: send(message)
SDK->>AB: persist outbound intent
SDK->>RTR: deliver(message)
RTR-->>SDK: ACK: RECEIVED
SDK->>AB: record delivery receipt
RTR->>TGT: forward(message)
TGT-->>RTR: ACK: FULFILLED
RTR-->>SDK: ACK: FULFILLED
SDK->>AB: finalize message state
ACK Legend
- RECEIVED: Router durably accepted and persisted the message at ingress to the Router.
- READ: Target validated and accepted the message for processing (omitted from the diagram for brevity).
- FULFILLED: Emitted by the target after successful handling; Router relays to origin which then finalizes lifecycle state.
Explanation
- send(message): The application calls the SDK, which assigns a message_id and idempotency_token, stamps causal metadata (e.g., vector clock), and prepares delivery options (TTL, priority, retries).
- persist outbound intent: The SDK appends an immutable record to the Activity Buffer before any network I/O so a crash cannot lose intent; on restart, the SDK can safely re-emit the same message_id.
- deliver(message): The SDK transmits to the Router. The Router durably persists and enqueues the message; only after fsync/durable write does it emit ACK: RECEIVED.
- ACK: RECEIVED: Confirms the Router accepted responsibility for delivery. If this isn’t received before timeout, the SDK retries with exponential backoff; duplicates are harmless due to idempotency.
- forward(message): The Router routes to the target agent based on addressing/capabilities. The target SDK pulls/receives and dispatches to the appropriate handler.
- ACK: FULFILLED: After the target handler completes successfully (side effects committed), the target SDK emits FULFILLED back via the Router.
- finalize message state: The origin SDK records completion in the Activity Buffer and closes the lifecycle. On repeated failures, the Router moves the message to a DLQ with reason and trace context.
Notes
- Semantics: RECEIVED = durably accepted by Router; READ = validated by the target; FULFILLED = successfully handled by the target. This supports at‑least‑once delivery with idempotent handlers.
- Ordering: FIFO within a conversation is preserved using causal metadata; cross‑conversation ordering is not guaranteed.
- Recovery: On restart, the origin SDK replays unsatisfied intents from the Activity Buffer and resumes delivery using the same ids, preventing duplicate side effects.
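The outbound steps above can be sketched in a few lines. This is a minimal illustration, not the SDK's API: Envelope, send_with_retries, the in-memory journal list (standing in for the Activity Buffer), and the deliver callable (returning True in place of ACK: RECEIVED) are all hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Envelope:
    # Hypothetical envelope; field names mirror the prose, not a confirmed SDK type.
    payload: str
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    idempotency_token: str = field(default_factory=lambda: str(uuid.uuid4()))


def send_with_retries(
    env: Envelope,
    journal: List[tuple],
    deliver: Callable[[Envelope], bool],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = lambda _s: None,  # inject time.sleep in real use
) -> bool:
    """Persist intent first, then deliver until the router reports RECEIVED."""
    journal.append(("INTENT", env.message_id))        # recorded before any network I/O
    for attempt in range(max_attempts):
        if deliver(env):                              # True stands in for ACK: RECEIVED
            journal.append(("RECEIVED", env.message_id))
            return True
        if attempt + 1 < max_attempts:
            sleep(base_delay * (2 ** attempt))        # exponential backoff between attempts
    journal.append(("EXHAUSTED", env.message_id))     # retries used up; ids never changed
    return False
```

Every attempt reuses the same envelope, so a redelivery carries the same message_id and idempotency_token, which is what lets the receiving side treat duplicates as harmless.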
Inbound Processing Flow
sequenceDiagram
autonumber
participant RTR as Router Service
participant SDK as Target SDK
participant AB as Activity Buffer
participant POL as Policy Hooks
participant HND as Handler
participant ACK as ACKLifecycleManager
RTR->>SDK: deliver(message)
SDK->>AB: persist inbound receipt
SDK->>POL: ingress checks (validate, transform, route)
SDK->>ACK: dedup check (idempotency_token)
SDK->>HND: dispatch(message)
HND-->>SDK: success(side effects committed)
SDK->>POL: post-handler hooks
SDK->>ACK: emit ACK: FULFILLED
ACK-->>RTR: ack(processed)
SDK->>AB: finalize message state
alt handler error or policy violation
SDK->>ACK: retry or DLQ with reason
ACK-->>RTR: negative outcome (requeue/park)
end
Explanation
- Ingress persistence: The target SDK writes a durable receipt before any user code executes, enabling safe resumption after crashes with no message loss or double-processing.
- Dedup/idempotency: The idempotency_token (plus causal metadata) prevents re-applying side effects when redeliveries occur due to retries or network partitions.
- Policy enforcement: Ingress and post-handler hooks enforce validation, transformation, and routing with deterministic priorities and bounded execution to protect throughput.
- Handler execution: Handlers run with full context and should commit side effects atomically or implement compensations, ensuring retry safety and correctness.
- Acknowledgment: The SDK sends FULFILLED only after successful handler completion and post-hooks, preserving at-least-once semantics.
- Failure path: On errors, the ACK manager applies bounded retries with backoff; persistent failures are DLQ’d with structured reason and trace correlation for analysis.
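As a rough sketch of the inbound path described above, the class below records a receipt before running user code, skips the handler on redelivery of an already fulfilled token, and reports FULFILLED only after the handler returns. InboundProcessor and its journal and outcomes attributes are hypothetical stand-ins, kept in memory for brevity; the real runtime persists them.

```python
from typing import Callable, Dict, List


class InboundProcessor:
    """Hypothetical target-side pipeline: receipt, dedup, dispatch, ack."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self.handler = handler
        self.journal: List[tuple] = []       # stand-in for the Activity Buffer
        self.outcomes: Dict[str, str] = {}   # idempotency_token -> final ACK

    def process(self, token: str, payload: str) -> str:
        self.journal.append(("RECEIPT", token))   # receipt is written before user code runs
        if self.outcomes.get(token) == "FULFILLED":
            return "FULFILLED"                    # redelivery: re-ack without re-running handler
        try:
            self.handler(payload)
        except Exception as exc:
            self.journal.append(("FAILED", token, repr(exc)))
            return "RETRY"                        # the ACK manager would back off or DLQ
        self.outcomes[token] = "FULFILLED"        # only after the handler succeeded
        self.journal.append(("FULFILLED", token))
        return "FULFILLED"
```

Policy hooks are left out here to keep the dedup and acknowledgment ordering visible.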
Learn more about message types and services in the Protocol Specification: Messages and Services.
5.4. Components and Services
- MessageProcessor: Registers typed handlers, applies the policy chain, and routes inbound/outbound messages based on conversation context and addressing. It executes handlers in bounded worker pools with timeouts, propagates trace/causal metadata, and enforces backpressure to protect latency targets.
- ACKLifecycleManager: Tracks message state transitions (RECEIVED → READ → FULFILLED) with idempotency_token values and emits retries using exponential backoff with jitter and caps. It deduplicates redeliveries, records DLQ handoff on exhaustion, and exposes per-message outcome metrics for observability.
- Activity Buffer: Provides an append-only, fsync-backed transaction log for outbound intents and inbound processing results. On startup it replays incomplete lifecycles and reconciles order using causal metadata (e.g., vector clocks) to guarantee consistent recovery after crashes.
- Worktree State: Maintains repository/workspace context with deep Git integration, including branch tracking, file diffs, and isolated working directories per conversation or task. It supports background sync, conflict handling with policy-driven strategies, and controlled side effects for code-aware agents.
- Core services: The Router durably persists and routes messages with FIFO ordering within a conversation and delivery retries; the Registry provides agent discovery, health, and capability advertisement; the Scheduler allocates work via priority queues and lease-based dispatch. All services expose gRPC endpoints secured with mTLS; see ../protocol/services.md for API details.
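The state tracking attributed to the ACKLifecycleManager can be illustrated with a small forward-only state machine. AckStage and AckTracker are hypothetical names for this sketch; it assumes that stages may be skipped forward (as when READ is not observed by the origin) but never moved backward, so duplicate or stale ACKs are ignored.

```python
from enum import IntEnum
from typing import Dict


class AckStage(IntEnum):
    PENDING = 0
    RECEIVED = 1
    READ = 2
    FULFILLED = 3


class AckTracker:
    """Hypothetical per-message lifecycle tracker (in memory only)."""

    def __init__(self) -> None:
        self._stage: Dict[str, AckStage] = {}

    def stage(self, message_id: str) -> AckStage:
        return self._stage.get(message_id, AckStage.PENDING)

    def observe(self, message_id: str, ack: AckStage) -> bool:
        """Advance the lifecycle; return False for duplicate or stale ACKs."""
        if ack <= self.stage(message_id):
            return False                 # already at or past this stage
        self._stage[message_id] = ack
        return True
```

Returning a boolean lets a caller count ignored ACKs, which is one way to feed the per-message outcome metrics mentioned above.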
Deeper system context and diagrams: Detailed System Architecture
5.5. State and Persistence
The SDK provides multi-level persistence for robustness and recovery:
- Activity Buffer: Uses atomic, append-only writes with periodic compaction and retention policies to bound disk usage. Recovery reprocesses only incomplete lifecycles and preserves idempotency by resending the same message_id and idempotency_token.
- Worktree State: Persists workspace bindings and file metadata, isolating tenant/conversation state in sandboxed directories. Git operations (fetch/merge/commit) run under policy control with explicit conflict resolution and audit trails.
- Configuration State: Stores versioned configuration with JSON Schema validation, hot-reload, and automatic rollback on failed validations. Changes are recorded with actor, timestamp, and diff to support audit and rapid recovery.
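To make the Activity Buffer behavior concrete, here is a minimal append-only log that fsyncs each record and, on replay, returns only the lifecycles that never reached FULFILLED. The ActivityLog class, the JSON-lines file format, and the state names are assumptions for illustration; compaction, retention, and causal ordering are omitted.

```python
import json
import os
from typing import Dict, List


class ActivityLog:
    """Hypothetical file-backed log: one JSON record per line, append only."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, message_id: str, state: str) -> None:
        record = json.dumps({"message_id": message_id, "state": state})
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(record + "\n")
            fh.flush()
            os.fsync(fh.fileno())   # force the write to disk before returning

    def incomplete(self) -> List[str]:
        """Replay the log and return message_ids whose latest state is not FULFILLED."""
        latest: Dict[str, str] = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        rec = json.loads(line)
                        latest[rec["message_id"]] = rec["state"]
        return [mid for mid, state in latest.items() if state != "FULFILLED"]
```

A fresh ActivityLog opened on the same path after a restart sees the same incomplete set, which is the property recovery depends on.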
Design details and recovery strategies: Persistent State Management Architecture
5.6. Reliability and Failure Modes
- Delivery semantics: Implements at-least-once delivery with bounded retries, exponential backoff with jitter, and DLQ on exhaustion, while preserving the same idempotency_token across attempts. Timeouts are enforced with monotonic clocks and traced so operators can correlate redeliveries with upstream failures.
- Ordering: Guarantees FIFO within a conversation using causal metadata and queue discipline at the Router, but does not impose global ordering across conversations. Handlers must be idempotent and side-effect safe to tolerate retries and reordering in distributed conditions.
- Backpressure: Applies credit- or queue-depth-based flow control and per-handler concurrency limits to prevent receiver overload. When limits are exceeded, the SDK sheds load via timeouts and the circuit breaker, signaling upstream to slow down while preserving system stability.
- Degraded operation: Circuit breakers isolate failing dependencies and switch components into a constrained feature set (e.g., read-only operations or cached responses). Recovery uses exponential probe intervals, and fallbacks are governed by explicit policies to avoid silent data loss.
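The retry schedule mentioned under delivery semantics can be sketched as capped exponential backoff with jitter. The page does not say which jitter strategy the SDK uses; this example assumes the "full jitter" variant (a uniform draw between zero and the capped exponential ceiling), and backoff_delays with its parameters is a hypothetical helper.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.2,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one delay per retry, each uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))   # exponential growth, bounded by the cap
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Passing a seeded random.Random makes the schedule reproducible in tests, while the default draws fresh jitter so that many clients retrying at once do not synchronize.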
See system-wide guarantees and tradeoffs: Enterprise Problem Resolution
5.7. Security
- Transport: Enforces mutual TLS (TLS 1.3) with certificate rotation, strict cipher suites, and optional certificate pinning/SPKI checks. Handshake and session lifetimes are configurable, and all channels require authenticated peers before any data exchange.
- Authn/Authz: Supports JWT/OIDC and service accounts with RBAC/ABAC enforcement at the service boundary and within the SDK. Authorization decisions are policy-driven and evaluated per message, with scoped tokens and expirations to minimize blast radius.
- Audit: Emits structured, immutable audit logs including actor, message IDs, decisions, and payload redaction status. Logs are timestamped, trace-correlated, retained per policy, and exportable to SIEM systems for compliance.
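The transport requirements above (TLS 1.3 only, authenticated peers on both sides) can be illustrated with Python's standard ssl module. This is only a sketch of the settings involved: the services themselves use gRPC, whose credential setup differs, and build_mtls_server_context with its file-path parameters is a hypothetical helper.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side context that accepts only TLS 1.3 and requires client certs."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3   # refuse TLS 1.2 and older
    ctx.verify_mode = ssl.CERT_REQUIRED            # demand a certificate from the client
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CAs trusted to sign client certs
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Certificate rotation and pinning are outside the scope of this sketch; in practice they would be handled by reloading the context or by an additional check on the peer certificate.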
Security and audit considerations appear throughout the protocol and runtime sections.
5.8. Scaling Considerations
- Horizontal scaling: Scale stateless services with additional replicas; partition work by tenant/conversation/shard to maintain locality where appropriate.
- Resource isolation: Use bulkheads, timeouts, and quotas to prevent cascading failures; configure pools and queues per deployment context.
- Caching: Apply bounded caches with explicit TTLs where they simplify repeated lookups.
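One way to partition work by conversation, as suggested in the first bullet, is to hash the conversation identifier to a shard index. The function below is a hypothetical sketch; it uses SHA-256 rather than Python's built-in hash(), because string hashes from hash() are randomized per process and would not agree across replicas.

```python
import hashlib


def shard_for(conversation_id: str, shard_count: int) -> int:
    """Map a conversation to a shard with a hash that is stable across processes."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Because every message in a conversation maps to the same shard, per-conversation FIFO handling can stay local to one worker. Note that simple modulo placement reshuffles most conversations when shard_count changes.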
5.9. Extensibility
- Policy hooks: Extensible hook points run at ingress, pre-dispatch, post-handler, and egress with deterministic priority, deadlines, and failure isolation. Plugins are expected to be side-effect aware and can mutate headers/payloads under explicit policies with full tracing.
- Handler model: Handlers are registered per message type with typed payloads and concurrency controls, supporting sync or async execution. The model encourages idempotent operations, retry-safety, and structured error signaling to integrate cleanly with ACK semantics.
- Versioning: Protobuf evolution guidelines (field reservations, oneof, additive changes) maintain wire compatibility across SDK and services. Semantic versioning signals breaking changes, and schema migration tools assist with staged rollouts.
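The policy hook model described in this section can be sketched as a chain that runs hooks per phase in priority order and isolates failures. HookChain and its methods are hypothetical, not the SDK's interface. The sketch assumes that lower priority numbers run first, that registration order breaks ties, and that a failing hook is skipped and recorded; a real policy might reject the message instead.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]
Hook = Callable[[Message], Message]


class HookChain:
    """Hypothetical hook registry keyed by the four phases named in this section."""

    PHASES = ("ingress", "pre-dispatch", "post-handler", "egress")

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Tuple[int, int, Hook]]] = {p: [] for p in self.PHASES}
        self._seq = 0
        self.errors: List[Tuple[str, str]] = []

    def register(self, phase: str, hook: Hook, priority: int = 100) -> None:
        if phase not in self._hooks:
            raise ValueError(f"unknown phase: {phase}")
        self._seq += 1   # registration order breaks priority ties deterministically
        self._hooks[phase].append((priority, self._seq, hook))

    def run(self, phase: str, message: Message) -> Message:
        for _prio, _seq, hook in sorted(self._hooks[phase], key=lambda h: h[:2]):
            try:
                message = hook(dict(message))   # each hook receives its own copy
            except Exception as exc:            # isolate the failure, keep the chain going
                self.errors.append((phase, repr(exc)))
        return message
```

Deadlines are not enforced here; adding them would typically mean running each hook under a timeout and treating expiry like any other hook failure.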
5.10. Deployment Topologies
- Local development: Runs all core services on localhost with in-memory or file-backed storage for rapid feedback, optional insecure transport, and hot-reload of configuration. Make targets or Compose files bootstrap the stack with seeded data and example agents.
- Single-node production: Consolidates services on one host or VM with persistent volumes, TLS enabled, and systemd/container supervision. Backups, resource limits, and rolling binary restarts are configured to achieve predictable recovery and maintainability.
- Multi-node: Deploys HA replicas behind load balancers with leader election where required, and clustered PostgreSQL/Redis for durable state and coordination. Canary releases and rolling upgrades preserve availability while SLOs are enforced via autoscaling and alerting.
See examples and patterns: Deployment Patterns
5.11. What’s Next
- Protocol Specification: Dive into the gRPC services, message envelopes, and ACK semantics to design custom integrations or services. Start here when you need authoritative contract details and behavior guarantees; see Protocol.
- Examples: Explore end-to-end agent samples that showcase handler registration, persistence, retries, and deployment patterns. Use these as blueprints to bootstrap your own agents; see Examples.
- Quickstart: Install the SDK, run a local stack, and send your first messages with sensible defaults. Ideal for validating your environment and wiring before deeper customization; see Getting Started.