4. Deployment Patterns
This guide is a pragmatic, step-by-step walkthrough for deploying SW4RM-based systems from local development to production. It includes actionable commands for macOS and Debian Linux, Python SDK usage snippets, and Infrastructure-as-Code examples with Docker Compose and Kubernetes. When a detail is ambiguous or environment-specific, we call it out explicitly and suggest a safe default. We also record outstanding ambiguities and assumptions in DEPLOYMENT_PATTERNS.md.
Important defaults used by the Python SDK (see sdks/py_sdk/sw4rm/constants.py):
- Router address env var: SW4RM_ROUTER_ADDR (default localhost:50051)
- Registry address env var: SW4RM_REGISTRY_ADDR (default localhost:50052)
- Optional: AGENT_ID, AGENT_NAME for client identity; sensible defaults apply.
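To make the fallback behavior concrete, here is a minimal sketch of how the two address variables could resolve to the defaults listed above. It does not import the SDK; resolve_endpoints is a hypothetical helper written for illustration, and the SDK's own constants and config modules remain authoritative.

```python
import os

# Defaults as documented above; the SDK's constants module is authoritative.
DEFAULT_ROUTER_ADDR = "localhost:50051"
DEFAULT_REGISTRY_ADDR = "localhost:50052"


def resolve_endpoints(env=None):
    """Return (router_addr, registry_addr), falling back to the defaults.

    Assumption of this sketch: an empty value is treated like an unset one.
    """
    env = os.environ if env is None else env
    router = env.get("SW4RM_ROUTER_ADDR") or DEFAULT_ROUTER_ADDR
    registry = env.get("SW4RM_REGISTRY_ADDR") or DEFAULT_REGISTRY_ADDR
    return router, registry


if __name__ == "__main__":
    print(resolve_endpoints())
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in tests.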
Tip: After setup, validate SDK wiring:
4.1. Prerequisites
Choose your OS to prepare a baseline environment for local development and operator workflows.
macOS (Apple Silicon or Intel):
- Install Homebrew if needed:
- Install core tools and Docker Desktop:
- Enable Docker Desktop and ensure it's running (for Compose/Kubernetes demos).
- Optional docs/dev tooling (for make protos, docs, etc.):
Debian/Ubuntu (root or sudo):
- Install Python and build tools:
- Install Docker Engine and Compose plugin:
- Install kubectl (vendor instructions may vary):
- Use Docker without sudo (new shell required):
Notes and caveats:
- grpcurl is optional but useful for quick gRPC health checks.
- If your environment uses a proxy or corporate CA, ensure Docker and Python trust your CA (see DEPLOYMENT_PATTERNS.md for notes on custom CAs and mTLS).
4.2. Local Development (Single Node)
Run the core control plane services (Router, Registry, Scheduler) on localhost for fast iteration. Use file-backed persistence and allow insecure transport locally unless you are explicitly testing mTLS.
Option A: Python SDK only (connect to an existing control plane):
- Create a virtual environment:
- Install SDK + dev dependencies:
- Generate Python protobuf stubs:
- Sanity check SDK wiring:
Option B: Start local control plane via Docker Compose:
- Create docker-compose.yml with the example below (adjust tags/paths as needed).
- Start the stack:
- Verify ports are open (router and registry):
- Point SDK to local endpoints via environment variables (see the env block below).
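If no dedicated tooling is at hand, the port check in the steps above can be approximated with the Python standard library. The sketch below only attempts a TCP connection; port_open is a hypothetical helper, and a successful connect shows that something is listening, not that the gRPC service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Ports match the defaults used throughout this guide.
    for name, port in (("router", 50051), ("registry", 50052)):
        state = "open" if port_open("localhost", port) else "closed"
        print(f"{name} localhost:{port} is {state}")
```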
Python: connecting to Router and Registry
Ensure stubs are generated before running clients:
The example below demonstrates:
- Reading endpoints from env via sw4rm.config.from_env().
- Creating gRPC channels with an optional correlation-id interceptor.
- Registering an agent, sending a message, and starting an incoming stream.
import os

from sw4rm.config import from_env
from sw4rm.interceptors import channel_with_interceptors, CorrelationIdClientInterceptor
from sw4rm.clients.router import RouterClient
from sw4rm.clients.registry import RegistryClient

# Optionally set env vars before import or in your shell:
# export SW4RM_ROUTER_ADDR=localhost:50051
# export SW4RM_REGISTRY_ADDR=localhost:50052
# export AGENT_ID=agent-1
# export AGENT_NAME="Agent One"
cfg = from_env()

# Create channels (plaintext for local dev). Use secure=True with proper TLS creds in production.
router_ch = channel_with_interceptors(
    cfg.endpoints.router_addr,
    CorrelationIdClientInterceptor(correlation_id="local-dev-123"),
    secure=False,
)
registry_ch = channel_with_interceptors(cfg.endpoints.registry_addr, secure=False)

router = RouterClient(router_ch)
registry = RegistryClient(registry_ch)

# Register the agent (shape must match your .proto definition)
agent_desc = {
    "agent_id": cfg.agent_id,
    "name": cfg.name,
    "description": "Local dev agent",
}
try:
    registry.register(agent_desc)
    print("Registered agent:", agent_desc["agent_id"])
except Exception as e:
    print("Register failed; is Registry running?", e)

# Send a test message to the Router (shape must match Envelope in your .proto)
envelope = {
    "message_type": 1,  # CONTROL (see sw4rm.constants)
    "source": cfg.agent_id,
    "destination": "router",
    "payload": b"hello from local dev",
}
try:
    router.send_message(envelope)
    print("Message sent")
except Exception as e:
    print("Send failed; is Router running?", e)

# Stream incoming messages for this agent (runs until interrupted)
try:
    for msg in router.stream_incoming(cfg.agent_id):
        print("incoming:", msg)
except KeyboardInterrupt:
    pass
except Exception as e:
    print("Stream failed; is Router running?", e)
4.2.1. Build Your Own Images (Local Development)
If you don't have published container images yet, build local images and reference them in Compose. Below are templates; replace paths/names with your implementation.
Example Dockerfile for a service (Python-based):
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -e .
EXPOSE 50051
CMD ["python", "-m", "your_service.entrypoint", "--addr", "0.0.0.0:50051"]
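The CMD above assumes a module that accepts an --addr flag. As a rough sketch of what such an entrypoint might look like, the following parses the flag into a host and port. The module name your_service.entrypoint is the template's placeholder, and parse_addr and main are hypothetical names; a real service would start its gRPC server where the print statement is.

```python
import argparse


def parse_addr(addr):
    """Split 'host:port' into (host, port), validating the port range."""
    host, sep, port_text = addr.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {addr!r}")
    port = int(port_text)  # raises ValueError for non-numeric ports
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical service entrypoint")
    parser.add_argument("--addr", default="0.0.0.0:50051",
                        help="listen address as host:port")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    host, port = parse_addr(args.addr)
    # Placeholder: a real service would bind and serve here.
    print(f"would listen on {host}:{port}")
    return host, port


if __name__ == "__main__":
    main()
```

Binding to 0.0.0.0 inside the container is what lets the published port mapping reach the process; binding to 127.0.0.1 would make the service unreachable from outside the container.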
Compose referencing local builds:
version: '3.9'
services:
  router:
    build: ./examples/reference-services/  # path to your router source
    image: sw4rm-router:dev
    ports: ["50051:50051"]
    volumes: ["router-data:/var/lib/sw4rm"]
    restart: unless-stopped
  registry:
    build: ./examples/reference-services/
    image: sw4rm-registry:dev
    ports: ["50052:50052"]
    restart: unless-stopped
  scheduler:
    build: ./examples/reference-services/
    image: sw4rm-scheduler:dev
    ports: ["50053:50053"]
    restart: unless-stopped
volumes:
  router-data:
Build and start:
Docker Compose: local dev stack (single-node)
The following Compose file builds Router, Registry, and Scheduler from your local sources, publishes standard ports, and persists Router state to a local volume.
version: '3.9'
services:
  router:
    build: ./examples/reference-services/
    image: sw4rm-router:dev
    container_name: sw4rm-router
    ports:
      - "50051:50051"  # Router
    volumes:
      - router-data:/var/lib/sw4rm
    restart: unless-stopped
  registry:
    build: ./examples/reference-services/
    image: sw4rm-registry:dev
    container_name: sw4rm-registry
    ports:
      - "50052:50052"  # Registry
    restart: unless-stopped
  scheduler:
    build: ./examples/reference-services/
    image: sw4rm-scheduler:dev
    container_name: sw4rm-scheduler
    ports:
      - "50053:50053"  # Scheduler
    restart: unless-stopped
volumes:
  router-data:
Environment variables for your Python clients:
export SW4RM_ROUTER_ADDR=localhost:50051
export SW4RM_REGISTRY_ADDR=localhost:50052
export AGENT_ID=my-agent
export AGENT_NAME="My Agent"
macOS specific notes:
- Use Docker Desktop; docker compose is available as part of it.
- For grpcurl installs via Homebrew, omit sudo. Use grpcurl -plaintext against local (non-TLS) endpoints.
Debian specific notes:
- If the docker compose subcommand is not present, use docker-compose or install the Compose plugin (docker-compose-plugin).
- If grpcurl is not available via apt, use the binary release or go install.
4.3. Single-Node Production (VM/Bare metal)
Consolidate services on one host/VM for a small production footprint. Add supervision, persistence, and security hardening.
Recommended practices:
- Security: enable TLS/mTLS across service-to-service traffic. Store certs/keys in a secure location and mount read-only. Rotate credentials regularly.
- Persistence: configure volumes for Router state and any local WAL/logs. Set up periodic backups and verify restores.
- Resource isolation: set per-service CPU/memory limits. Use systemd (or Docker restart policies) for restarts.
- Observability: ship logs to your SIEM; export metrics/traces. Establish alerting on SLOs.
Example: systemd unit for Router (non-containerized)
[Unit]
Description=SW4RM Router
After=network.target
[Service]
User=sw4rm
Group=sw4rm
ExecStart=/usr/local/bin/sw4rm-router --addr 0.0.0.0:50051 --state-dir /var/lib/sw4rm
Restart=always
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Note: Binary names and flags may differ depending on your distribution or packaging. If you rely on containers even on single-node hosts, prefer the Docker Compose approach with pinned image tags.
4.4. Multi-Node, Highly Available
Scale out the control plane horizontally behind L4/L7 load balancers; use HA backends for shared state where applicable.
Checklist:
- State: If Router/Scheduler depend on external stores (e.g., Postgres/Redis), deploy them in HA (primary/replica or cluster) with backups.
- Upgrades: perform rolling or canary deployments; validate with synthetic checks before 100% rollout.
- Networking: enforce mTLS for all east-west traffic; segment networks by environment/tenant. Use security groups and network policies.
- Capacity: set HPA or autoscaling policies and budget headroom for failover.
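The "validate with synthetic checks" item can be turned into a simple promotion gate. The sketch below is one possible shape, not a prescribed mechanism: run_checks, failure_rate, and should_promote are hypothetical names, the probe is any callable you supply, and the 1% failure budget is an arbitrary example value.

```python
def run_checks(probe, attempts):
    """Call probe() repeatedly; a falsy result or an exception counts as a failure."""
    results = []
    for _ in range(attempts):
        try:
            results.append(bool(probe()))
        except Exception:
            results.append(False)
    return results


def failure_rate(results):
    """Fraction of failed checks in a list of booleans (True = passed)."""
    if not results:
        raise ValueError("no synthetic check results")
    return results.count(False) / len(results)


def should_promote(results, max_failure_rate=0.01):
    """Promote the canary only if the observed failure rate is within budget."""
    return failure_rate(results) <= max_failure_rate


if __name__ == "__main__":
    # Stand-in probe that always passes; replace with a real request to the canary.
    outcome = run_checks(lambda: True, attempts=20)
    print("promote" if should_promote(outcome) else "hold")
```

Refusing to decide on an empty result set is deliberate: a gate that promotes when no checks ran would hide a broken probe.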
4.5. Kubernetes Reference Manifests
The snippet below targets a minimal, non-mTLS dev cluster. Replace image tags with your approved versions. For production, add PodSecurity, NetworkPolicy, Secrets for TLS, resource quotas, and PodDisruptionBudgets.
apiVersion: v1
kind: Namespace
metadata:
  name: sw4rm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
  namespace: sw4rm
spec:
  replicas: 1
  selector: { matchLabels: { app: router } }
  template:
    metadata:
      labels: { app: router }
    spec:
      containers:
        - name: router
          image: localhost:5000/sw4rm-router:dev
          ports:
            - containerPort: 50051
          volumeMounts:
            - name: router-data
              mountPath: /var/lib/sw4rm
          readinessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 2
            periodSeconds: 5
          livenessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
      volumes:
        - name: router-data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: router
  namespace: sw4rm
spec:
  selector: { app: router }
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry
  namespace: sw4rm
spec:
  replicas: 1
  selector: { matchLabels: { app: registry } }
  template:
    metadata:
      labels: { app: registry }
    spec:
      containers:
        - name: registry
          image: localhost:5000/sw4rm-registry:dev
          ports:
            - containerPort: 50052
          readinessProbe:
            grpc:
              port: 50052
            initialDelaySeconds: 2
            periodSeconds: 5
          livenessProbe:
            grpc:
              port: 50052
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests: { cpu: "50m", memory: "64Mi" }
            limits: { cpu: "250m", memory: "256Mi" }
---
apiVersion: v1
kind: Service
metadata:
  name: registry
  namespace: sw4rm
spec:
  selector: { app: registry }
  ports:
    - name: grpc
      port: 50052
      targetPort: 50052
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler
  namespace: sw4rm
spec:
  replicas: 1
  selector: { matchLabels: { app: scheduler } }
  template:
    metadata:
      labels: { app: scheduler }
    spec:
      containers:
        - name: scheduler
          image: localhost:5000/sw4rm-scheduler:dev
          ports:
            - containerPort: 50053
          readinessProbe:
            grpc:
              port: 50053
            initialDelaySeconds: 2
            periodSeconds: 5
          livenessProbe:
            grpc:
              port: 50053
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests: { cpu: "50m", memory: "64Mi" }
            limits: { cpu: "250m", memory: "256Mi" }
---
apiVersion: v1
kind: Service
metadata:
  name: scheduler
  namespace: sw4rm
spec:
  selector: { app: scheduler }
  ports:
    - name: grpc
      port: 50053
      targetPort: 50053
      protocol: TCP
Client configuration inside the cluster (Python SDK):
# If your app runs in the cluster, target the ClusterIP services
export SW4RM_ROUTER_ADDR=router.sw4rm.svc.cluster.local:50051
export SW4RM_REGISTRY_ADDR=registry.sw4rm.svc.cluster.local:50052
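The addresses above follow the usual Kubernetes service DNS pattern of service, namespace, svc, and cluster domain. As a small sketch, a client could build them programmatically instead of hard-coding them. service_addr and in_cluster_env are hypothetical helpers, and cluster.local is the common default cluster domain, which your cluster may override.

```python
def service_addr(service, namespace, port, cluster_domain="cluster.local"):
    """Build the fully qualified in-cluster address for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


def in_cluster_env(namespace="sw4rm"):
    """Return the SDK env vars for services named as in the manifests above."""
    return {
        "SW4RM_ROUTER_ADDR": service_addr("router", namespace, 50051),
        "SW4RM_REGISTRY_ADDR": service_addr("registry", namespace, 50052),
    }


if __name__ == "__main__":
    for key, value in in_cluster_env().items():
        print(f"export {key}={value}")
```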
Health checks:
- If your images expose gRPC health, prefer gRPC health probes over tcpSocket.
- Otherwise, use tcpSocket or HTTP-based probes where available.
TLS/mTLS in Kubernetes (outline; non-normative, for example only):
- Create a Secret containing certs/keys; mount read-only into Pods.
- Configure services to use TLS flags/env. The protocol does not mandate specific env names; example below shows one possible convention.
- Distribute CA bundle to clients; set secure=True when creating channels and provide credentials if needed.
Example TLS/mTLS env convention:
# Container-level fields (under spec.template.spec.containers[])
env:
  - name: SW4RM_TLS_ENABLED
    value: "true"
  - name: SW4RM_TLS_MODE
    value: "mtls"  # tls|mtls
  - name: SW4RM_TLS_CERT_FILE
    value: "/etc/sw4rm/tls/tls.crt"
  - name: SW4RM_TLS_KEY_FILE
    value: "/etc/sw4rm/tls/tls.key"
  - name: SW4RM_TLS_CA_FILE
    value: "/etc/sw4rm/tls/ca.crt"
volumeMounts:
  - name: tls
    mountPath: /etc/sw4rm/tls
    readOnly: true
# Pod-level field (under spec.template.spec)
volumes:
  - name: tls
    secret:
      secretName: router-tls
Client-side (Python SDK) would set secure=True when creating the channel and load CA/cert/key as needed.
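To illustrate how a process might consume the example convention above, the sketch below reads the SW4RM_TLS_* variables and validates them before any channel is created. These variable names are only the illustrative convention from this section, not something the protocol or SDK mandates, and TlsSettings and load_tls_settings are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TlsSettings:
    mode: str  # "tls" or "mtls"
    ca_file: Optional[str]
    cert_file: Optional[str]
    key_file: Optional[str]


def load_tls_settings(env=None):
    """Return TlsSettings, or None when TLS is not enabled."""
    env = os.environ if env is None else env
    if env.get("SW4RM_TLS_ENABLED", "false").lower() != "true":
        return None
    mode = env.get("SW4RM_TLS_MODE", "tls")
    if mode not in ("tls", "mtls"):
        raise ValueError(f"unsupported SW4RM_TLS_MODE: {mode!r}")
    settings = TlsSettings(
        mode=mode,
        ca_file=env.get("SW4RM_TLS_CA_FILE"),
        cert_file=env.get("SW4RM_TLS_CERT_FILE"),
        key_file=env.get("SW4RM_TLS_KEY_FILE"),
    )
    # Mutual TLS needs a certificate and key in addition to the CA bundle.
    if mode == "mtls" and not (settings.cert_file and settings.key_file):
        raise ValueError("mtls requires SW4RM_TLS_CERT_FILE and SW4RM_TLS_KEY_FILE")
    return settings
```

If you build channels with grpcio directly, the bytes read from these files are what grpc.ssl_channel_credentials accepts as root_certificates, private_key, and certificate_chain. Failing fast on an incomplete mTLS configuration is preferable to discovering it at the first handshake.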
4.5.1. Build Your Own Images (Kubernetes)
For dev clusters without published images, build and push images to a local registry, then reference them in the manifests:
# Example: build and tag
docker build -t localhost:5000/sw4rm-router:dev ./examples/reference-services/
docker push localhost:5000/sw4rm-router:dev
# Update Deployment image: localhost:5000/sw4rm-router:dev
4.6. Environment Configuration Cheatsheet
Common environment variables used by the Python SDK and examples:
# Endpoints
export SW4RM_ROUTER_ADDR=localhost:50051
export SW4RM_REGISTRY_ADDR=localhost:50052
# Agent identity (used by sdks/py_sdk/sw4rm/config.py)
export AGENT_ID=agent-1
export AGENT_NAME="Agent"
# Example: set higher log verbosity for your app (if applicable)
export LOG_LEVEL=DEBUG
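LOG_LEVEL is an application-level convention rather than something the SDK is documented to read, so your app has to apply it. A minimal sketch using the standard logging module follows; resolve_log_level is a hypothetical helper, and falling back to INFO for unknown values is a choice made for this example.

```python
import logging
import os

# Level names this sketch accepts, mapped to the standard logging constants.
_LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def resolve_log_level(env=None, default=logging.INFO):
    """Map LOG_LEVEL (case-insensitive) to a logging level, else the default."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "").strip().upper()
    return _LEVELS.get(name, default)


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger("app").debug("debug logging is enabled")
```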
4.7. Rollout and Operations Checklist
- Identity and transport: TLS/mTLS configured, cert rotation scheduled, authorization policies validated.
- Persistence: backups tested end-to-end; retention and compaction policies defined.
- Capacity planning: concurrency limits, queue depths, and connection pools tuned for peak load with headroom.
- Observability: traces/metrics/logs wired; actionable alerts on SLOs and error budgets.
- Resilience: retry, DLQ, and backoff policies verified; chaos tests performed; runbooks documented and discoverable.
- Release hygiene: pinned image tags; reproducible environment; automated rollbacks and change audit.
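When verifying retry and backoff policies from the checklist, it helps to have a reference for what the delays should look like. The sketch below shows capped exponential backoff with optional full jitter; backoff_delays and retry are hypothetical helpers, and the base and cap values are arbitrary examples rather than SW4RM defaults.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, jitter=False, rng=None):
    """Yield one delay in seconds per attempt: min(cap, base * 2**n)."""
    rng = rng or random.Random()
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        # Full jitter picks a uniform delay in [0, delay] to spread retries out.
        yield rng.uniform(0, delay) if jitter else delay


def retry(operation, attempts=5, sleep=time.sleep, **backoff):
    """Call operation() until it succeeds or attempts are exhausted.

    Sleeps between attempts and re-raises the last error after the final one.
    """
    delays = list(backoff_delays(attempts, **backoff))
    for index, delay in enumerate(delays):
        try:
            return operation()
        except Exception:
            if index == len(delays) - 1:
                raise
            sleep(delay)
```

Injecting the sleep function keeps the policy testable without real waiting, which is useful when asserting the delay schedule in CI.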
4.8. Troubleshooting Quick Wins
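- Python SDK errors about missing protobuf stubs: generate the Python stubs before running clients (see 4.2, Option A).
- Connection errors: check that the SW4RM_* env vars point to reachable addresses.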
- gRPC SSL/TLS failures: verify cert chain, hostname/SANs, and that clients trust your CA. For local dev, use plaintext.
- Kubernetes readiness flaps: relax initial delays, check CPU throttling, and ensure probes match actual listening ports.
- Compose + macOS: ensure Docker Desktop is running and file sharing is permitted for the volume mount path.
4.9. Storage Backends (Pluggable)
Implementers may choose storage backends per component using URI-style configuration. Examples (non-normative):
- Router store: SW4RM_ROUTER_STORE_URL
  - file:///var/lib/sw4rm (local dev; single-node)
  - postgres://user:pass@host:5432/dbname
  - s3://bucket/prefix (object store, large payloads)
  - nfs:///mnt/share/sw4rm (shared POSIX)
- Scheduler store/queue: SW4RM_SCHEDULER_STORE_URL
  - redis://host:6379/0
  - postgres://...
Document which backends each implementation supports; the protocol remains storage-agnostic.
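An implementation that accepts URI-style store configuration typically dispatches on the URL scheme. The following sketch shows that step with the standard library only; describe_store is a hypothetical helper, and the variable names and schemes are the non-normative examples from this section.

```python
from urllib.parse import urlsplit


def describe_store(url):
    """Split a store URL into the parts a backend factory would need.

    Credentials are deliberately left out of the result so it is safe to log.
    """
    parts = urlsplit(url)
    if not parts.scheme:
        raise ValueError(f"store URL has no scheme: {url!r}")
    return {
        "backend": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


if __name__ == "__main__":
    for example in ("file:///var/lib/sw4rm", "redis://host:6379/0"):
        print(describe_store(example))
```

Selecting the backend from the scheme keeps the rest of the service storage-agnostic, in line with the protocol's stance.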
4.10. Observability (Optional)
OpenTelemetry exporters (recommended neutral default):
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # or grpc
export OTEL_SERVICE_NAME=sw4rm-router
export OTEL_RESOURCE_ATTRIBUTES=env=dev,service.role=router
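OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs. OpenTelemetry SDKs parse it themselves, so the sketch below is only a way to sanity-check a value in a deploy script before it reaches a Pod; parse_resource_attributes is a hypothetical helper and does not handle escaping or encoding.

```python
def parse_resource_attributes(value):
    """Parse 'k1=v1,k2=v2' into a dict, rejecting malformed pairs."""
    attrs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, val = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"malformed resource attribute: {item!r}")
        attrs[key] = val.strip()
    return attrs


if __name__ == "__main__":
    print(parse_resource_attributes("env=dev,service.role=router"))
```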
Prometheus metrics (optional; if your implementation exposes /metrics):
- Scrape target example (Kubernetes ServiceMonitor or Pod annotations).
- Keep disabled by default unless explicitly enabled.