SW4-001: Failure Semantics Extension¶
Status: Draft
Version: 0.1.0
Date: 2026-01-10
Extends: Core Spec §17.5 (Negotiation Room Pattern)
Abstract¶
This extension defines normative failure semantics for SW4RM coordination primitives, specifically addressing quorum rules, partition handling, and timeout behavior for the Negotiation Room pattern. These semantics are OPTIONAL for implementations but REQUIRED for production deployments requiring deterministic behavior under partial failure.
Motivation¶
The core specification (§17.5) defines the Negotiation Room pattern but leaves failure behavior undefined:
- What happens when a requested critic is unreachable?
- How long does the Coordinator wait for votes before deciding?
- What constitutes a valid decision when not all critics respond?
- How should implementations handle network partitions?
Without defined answers to these questions, implementations behave nondeterministically under real-world failure conditions.
Terminology¶
Quorum: The minimum number of votes required to render a valid decision.
Vote Collection Window: The maximum duration the Coordinator waits for votes after proposal submission.
Abstain: A synthetic vote indicating a critic did not respond within the collection window.
Fail-Closed: Behavior where insufficient votes result in automatic rejection/escalation rather than proceeding with partial information.
Fail-Open: Behavior where insufficient votes result in proceeding with available votes (not recommended for production).
1. Quorum Semantics¶
1.1. Quorum Configuration¶
Implementations conforming to this extension MUST support configurable quorum rules per negotiation room:
    message QuorumPolicy {
      oneof rule {
        uint32 minimum_votes = 1;     // Absolute minimum vote count
        float minimum_fraction = 2;   // Fraction of requested critics (0.0-1.0)
        bool require_all = 3;         // All requested critics must vote
      }
      QuorumFailureAction on_failure = 4;
    }

    enum QuorumFailureAction {
      QUORUM_FAIL_CLOSED = 0;          // Escalate to HITL if quorum not met
      QUORUM_FAIL_WITH_ABSTAIN = 1;    // Treat missing votes as abstain (score=0, confidence=0)
      QUORUM_FAIL_WITH_AVAILABLE = 2;  // Decide with available votes (fail-open, NOT RECOMMENDED)
    }
1.2. Default Quorum Behavior¶
Implementations MUST default to:
- minimum_fraction = 0.5 (majority of requested critics)
- on_failure = QUORUM_FAIL_CLOSED
1.3. Quorum Evaluation¶
The Coordinator MUST evaluate quorum after the vote collection window closes:
- Count received votes from requested critics
- Compare against quorum threshold
- If quorum met: proceed to decision
- If quorum not met: apply the configured QuorumFailureAction
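The evaluation steps above can be sketched in Python as follows. This is a minimal illustration, not a normative algorithm. The class and function names and the returned step labels (decide, escalate_to_hitl, and so on) are hypothetical. The sketch also assumes that a fractional threshold is rounded up and compared inclusively, which this extension does not specify.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Tuple


class QuorumFailureAction(Enum):
    FAIL_CLOSED = 0          # escalate to HITL
    FAIL_WITH_ABSTAIN = 1    # fill missing votes with abstains
    FAIL_WITH_AVAILABLE = 2  # fail-open, not recommended


@dataclass(frozen=True)
class QuorumPolicy:
    # Mirrors the proto oneof: set at most one of the three rule fields.
    minimum_votes: Optional[int] = None
    minimum_fraction: Optional[float] = None
    require_all: bool = False
    on_failure: QuorumFailureAction = QuorumFailureAction.FAIL_CLOSED


def required_votes(policy: QuorumPolicy, requested_count: int) -> int:
    """Translate the policy rule into an absolute vote count."""
    if policy.require_all:
        return requested_count
    if policy.minimum_votes is not None:
        return policy.minimum_votes
    # No rule set: fall back to the default minimum_fraction = 0.5.
    fraction = 0.5 if policy.minimum_fraction is None else policy.minimum_fraction
    # Assumption: round up and treat the threshold as inclusive.
    return math.ceil(fraction * requested_count)


def evaluate_quorum(
    policy: QuorumPolicy, requested: Iterable[str], received: Iterable[str]
) -> Tuple[bool, str]:
    """Return (quorum_met, next_step). Call after the collection window closes."""
    requested_set = set(requested)
    # Only votes from requested critics count toward quorum.
    counted = len(requested_set & set(received))
    if counted >= required_votes(policy, len(requested_set)):
        return True, "decide"
    next_step = {
        QuorumFailureAction.FAIL_CLOSED: "escalate_to_hitl",
        QuorumFailureAction.FAIL_WITH_ABSTAIN: "decide_with_abstains",
        QuorumFailureAction.FAIL_WITH_AVAILABLE: "decide_with_available",
    }[policy.on_failure]
    return False, next_step
```

With the default policy and four requested critics, two on-time votes satisfy quorum under the rounding assumption above. A single vote does not, and the proposal escalates to HITL.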
2. Vote Collection Timeout¶
2.1. Collection Window Configuration¶
Implementations MUST support a configurable vote collection window:
    message NegotiationProposal {
      // ... existing fields ...

      // Extension fields (SW4-001)
      uint32 vote_collection_timeout_s = 100;  // Default: 300 (5 minutes)
      QuorumPolicy quorum_policy = 101;
    }
2.2. Timeout Behavior¶
When the vote collection window expires:
- The Coordinator MUST stop accepting new votes for the proposal
- The Coordinator MUST evaluate quorum with votes received
- Late-arriving votes MUST be logged but NOT included in the decision
- The decision MUST include collection_timeout_reached = true if the window expired
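One way to picture the timeout rules above is a small collector that takes the current time as an argument, so the cutoff is explicit and testable. The VoteCollector class and its method names are hypothetical. The sketch assumes that a vote arriving exactly at the deadline counts as late and that a repeated on-time vote from the same critic replaces the earlier one. Neither point is defined by this extension.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VoteCollector:
    opened_at: float      # seconds on any monotonic clock
    timeout_s: int = 300  # default for vote_collection_timeout_s
    votes: Dict[str, float] = field(default_factory=dict)
    late_votes: List[Tuple[str, float, float]] = field(default_factory=list)

    @property
    def deadline(self) -> float:
        return self.opened_at + self.timeout_s

    def submit(self, critic_id: str, score: float, now: float) -> bool:
        """Record a vote. Returns True only if it will count toward the decision."""
        if now >= self.deadline:
            # Late votes are logged for audit but never join the decision inputs.
            self.late_votes.append((critic_id, score, now))
            return False
        # Assumption: a repeated on-time vote overwrites the previous one.
        self.votes[critic_id] = score
        return True

    def summary(self, now: float) -> dict:
        """Timeout-related fields to carry into the decision record."""
        return {
            "votes_received": len(self.votes),
            "collection_timeout_reached": now >= self.deadline,
        }
```

Passing the clock value in, rather than reading it inside submit, keeps the example deterministic. A real Coordinator would more likely read a monotonic clock at the point of receipt.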
2.3. Recommended Timeout Values¶
| Artifact Type | Recommended Timeout | Rationale |
|---|---|---|
| REQUIREMENTS | 600s (10 min) | Complex review, multiple stakeholders |
| PLAN | 600s (10 min) | Architectural review requires thought |
| CODE | 300s (5 min) | Automated analysis + human spot-check |
| DEPLOYMENT | 180s (3 min) | Often automated validation |
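The table above could be encoded as a simple lookup, sketched below. The function name and the override parameter are hypothetical. Falling back to 300 seconds for an unlisted artifact type is an assumption based on the documented default for vote_collection_timeout_s.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 300  # default for vote_collection_timeout_s

# Recommended values from the table, keyed by artifact type.
RECOMMENDED_TIMEOUT_S = {
    "REQUIREMENTS": 600,
    "PLAN": 600,
    "CODE": 300,
    "DEPLOYMENT": 180,
}


def timeout_for(artifact_type: str, override_s: Optional[int] = None) -> int:
    """Pick a vote collection timeout in seconds for a proposal."""
    if override_s is not None:
        if override_s <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        return override_s
    # Assumption: unlisted artifact types use the spec default.
    return RECOMMENDED_TIMEOUT_S.get(artifact_type.upper(), DEFAULT_TIMEOUT_S)
```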
3. Partition Handling¶
3.1. Critic Unavailability Detection¶
Implementations MUST detect critic unavailability through:
- Registration check: Critic not registered in Registry
- Health check failure: Critic registered but unhealthy
- Vote timeout: Critic registered and healthy but did not respond
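The three detection paths above can be read as an ordered classification, sketched below. The enum, the function name, and the use of plain sets in place of a Registry and a health service are all hypothetical. The vote-timeout branch is only meaningful once the collection window has closed.

```python
from enum import Enum
from typing import Optional, Set


class UnavailabilityReason(Enum):
    NOT_REGISTERED = "not_registered"
    UNHEALTHY = "unhealthy"
    VOTE_TIMEOUT = "vote_timeout"


def classify_unavailability(
    critic_id: str,
    registered: Set[str],
    healthy: Set[str],
    voted: Set[str],
) -> Optional[UnavailabilityReason]:
    """Return why a requested critic is unavailable, or None if it voted in time."""
    if critic_id in voted:
        return None
    if critic_id not in registered:
        return UnavailabilityReason.NOT_REGISTERED
    if critic_id not in healthy:
        return UnavailabilityReason.UNHEALTHY
    # Registered and healthy, but no vote arrived before the window closed.
    return UnavailabilityReason.VOTE_TIMEOUT
```

Checking registration before health mirrors the order of the list above, so each critic gets the most specific reason available for the audit trail.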
3.2. Unavailability Response¶
For each unavailable critic, the Coordinator MUST:
- Log the unavailability with reason and timestamp
- Apply the configured QuorumFailureAction
- Include unavailability details in the decision audit trail
3.3. Partition Recovery¶
If a critic becomes available after the collection window closes:
- The critic MAY submit a vote (for audit purposes)
- The vote MUST NOT affect the rendered decision
- The vote SHOULD be logged as late_vote for post-hoc analysis
4. Decision Consistency Guarantees¶
4.1. Coordinator Guarantees¶
The Coordinator MUST provide the following consistency guarantees:
- At-most-once decision: Each proposal receives at most one decision
- Decision immutability: Once rendered, a decision MUST NOT change
- Vote inclusion: All votes received before the collection window closes MUST be included
- Audit completeness: The decision record MUST include all inputs and the decision rationale
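The at-most-once and immutability guarantees above can be illustrated with a write-once store. This is a sketch with hypothetical names (Decision, DecisionStore, render). Returning the existing decision on a repeated call, rather than raising an error, is a design assumption that makes Coordinator retries idempotent.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Decision:
    proposal_id: str
    outcome: str
    votes: Tuple[Tuple[str, float], ...]  # (critic_id, score) pairs used as inputs
    rationale: str


class DecisionStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._decisions: Dict[str, Decision] = {}

    def render(self, decision: Decision) -> Decision:
        """Store the decision unless one already exists for the proposal.

        Always returns the decision of record, so a second attempt never
        replaces the first.
        """
        with self._lock:
            return self._decisions.setdefault(decision.proposal_id, decision)
```

The lock only protects a single process. A deployment with several Coordinator replicas would need the same write-once property from its shared storage.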
4.2. Decision Record Extension¶
    // Requires: import "google/protobuf/timestamp.proto";

    message NegotiationDecision {
      // ... existing fields ...

      // Extension fields (SW4-001)
      bool quorum_met = 100;
      uint32 votes_received = 101;
      uint32 votes_expected = 102;
      bool collection_timeout_reached = 103;
      repeated string unavailable_critics = 104;  // IDs of critics that did not vote in time
      repeated LateVote late_votes = 105;         // Logged for audit; never part of the decision
    }

    message LateVote {
      string critic_id = 1;
      NegotiationVote vote = 2;
      google.protobuf.Timestamp received_at = 3;
    }
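As a rough illustration of how the extension fields might be populated, the sketch below derives them from the requested critic list and the votes collected. The function name and argument shapes are hypothetical. Listing every requested critic without an on-time vote under unavailable_critics is an assumption, since this extension does not define that field precisely.

```python
from typing import Dict, List, Sequence, Tuple


def build_decision_extension(
    requested: Sequence[str],
    on_time_votes: Dict[str, float],
    late_votes: List[Tuple[str, float, float]],  # (critic_id, score, received_at)
    votes_needed: int,
    timeout_reached: bool,
) -> dict:
    """Assemble the SW4-001 fields of a NegotiationDecision as a plain dict."""
    # Only votes from requested critics count as received.
    received = [c for c in requested if c in on_time_votes]
    return {
        "quorum_met": len(received) >= votes_needed,
        "votes_received": len(received),
        "votes_expected": len(requested),
        "collection_timeout_reached": timeout_reached,
        # Assumption: any requested critic without an on-time vote is listed.
        "unavailable_critics": [c for c in requested if c not in on_time_votes],
        "late_votes": [
            {"critic_id": cid, "vote": score, "received_at": ts}
            for cid, score, ts in late_votes
        ],
    }
```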
5. Implementation Requirements¶
5.1. MUST Requirements¶
Implementations conforming to SW4-001 MUST:
- Support configurable quorum policies
- Enforce vote collection timeouts
- Detect and handle critic unavailability
- Provide complete audit trails for decisions
- Default to fail-closed behavior
5.2. SHOULD Requirements¶
Implementations SHOULD:
- Expose metrics for quorum success/failure rates
- Alert operators when quorum failures exceed thresholds
- Support dynamic quorum adjustment based on critic availability
- Provide health endpoints indicating coordination readiness
5.3. MAY Requirements¶
Implementations MAY:
- Support weighted quorum (some critics' votes count more)
- Implement optimistic early decision (if quorum met before timeout)
- Support hierarchical escalation (fail to senior critic before HITL)
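Weighted quorum is left open by this extension. One possible interpretation, sketched below, compares the summed weight of the critics that voted against the total weight of all requested critics. The function name, the weight mapping, and the inclusive comparison are all assumptions made for illustration.

```python
from typing import Iterable, Mapping


def weighted_quorum_met(
    weights: Mapping[str, float],  # weight per requested critic
    voters: Iterable[str],         # critics whose votes arrived in time
    minimum_fraction: float = 0.5,
) -> bool:
    """True if the voting critics carry at least minimum_fraction of total weight."""
    total = sum(weights.values())
    if total <= 0:
        # No requested critics or no weight: treat as quorum not met (fail closed).
        return False
    # Ignore duplicate voters and voters that were not requested.
    voted = sum(weights[c] for c in set(voters) if c in weights)
    return voted / total >= minimum_fraction
```

For example, with weights of 2.0 for one senior critic and 1.0 each for two others, the senior critic alone reaches a 0.5 threshold, while either of the others alone does not.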
6. Compatibility¶
This extension is backward-compatible with core spec §17.5. Implementations not conforming to SW4-001:
- Will exhibit undefined behavior under partial failure
- SHOULD document their failure behavior explicitly
- MUST NOT claim SW4-001 conformance
7. Security Considerations¶
Quorum manipulation attacks:
- Malicious actors could attempt to prevent quorum by DoS-ing critics
- Implementations SHOULD monitor for availability patterns indicating attack
- Fail-closed behavior limits damage from quorum manipulation
8. References¶
- Core Spec §17.5: Negotiation Room Pattern
- Core Spec §6: Identity and Security
- Core Spec §15: Human-In-The-Loop
This extension is part of the SW4RM protocol extension series.