Agentic AI Is Distributed Systems. Except Where It Isn't.
You cannot mathematically prove an LLM correct. But you can architect around it.
Part 2: Building on Uncertain Ground — Practical Semantic Correctness for Agentic Systems
Part 2 surveys the three research communities approaching semantic correctness for LLM outputs — formal verification, runtime guardrails, and post-condition verification — maps their results honestly, and delivers a concrete engineering playbook: four patterns engineers can implement today to build agentic systems that fail visibly, recover gracefully, and escalate correctly.
The System Was Working Fine
The monitoring dashboard was green. Latency within SLA. Error rate below threshold. The agent serving the company's internal knowledge base had processed 4,000 queries that week without a single infrastructure failure.
The problem surfaced in a quarterly review, not in an alert. A senior engineer noticed that three teams had made decisions based on the agent's answers — answers that were confidently phrased, internally consistent, and wrong. The agent had been hallucinating policy details that did not exist, synthesizing plausible-sounding procedures from fragments of real documents. Every fabricated answer had returned HTTP 200. Every request had succeeded. The system had been failing for six weeks without triggering a single alarm.
The post-mortem concluded with a question nobody in the room knew how to answer: how do you monitor for this?
That question is the subject of this article.
The Problem Stated Precisely
Part 1 established that the infrastructure layer of agentic AI maps almost entirely onto solved distributed systems patterns. But three problems have no distributed systems ancestor. The most urgent of these — the one with the most direct production consequences — is semantic failure invisibility.
In a traditional distributed system, failure is structural. A node crashes, a timeout fires, an exception propagates, an error code returns. Your observability stack — metrics, logs, traces — is designed to catch structural failures. It works because the failure signal and the infrastructure signal are the same type of thing.
In an LLM-based system, failure can be purely semantic. The infrastructure executes correctly. The model processes the request. A response is generated. HTTP 200 is returned. And the answer is wrong. Not wrong in a way that throws an exception. Wrong in the sense that a human reading it carefully would recognize the error — but the infrastructure layer has no mechanism to read carefully, or at all.
This is the non-deterministic compute boundary problem: you have a component in your system whose correctness cannot be verified by the same mechanisms you use to verify the rest of the system. To be precise about what makes this hard — stochasticity alone is not the issue. Distributed systems handle non-determinism well. The challenge is opacity combined with semantic ambiguity: you cannot inspect the model's reasoning, and the space of correct outputs is not formally definable. If you want the flexibility LLMs provide, you must accept some degree of this uncertainty — though constrained domains with structured outputs can reduce it substantially.
The question is: how do you engineer around it?
Traditional observability answers one question: is the system functioning structurally? Semantic observability asks a different one: is the system producing outputs that remain behaviorally correct relative to user intent and system goals? The monitoring stack for the opening story's agent answered the first question perfectly — green dashboards, healthy latency, zero errors. It had no way to answer the second. Building semantic observability into agentic systems is not optional. It is the central engineering challenge this article addresses.
What the Research Shows: Three Approaches, Honest Results
Before building, you need to know what the field has actually achieved. There are three distinct research communities working this problem. They do not talk to each other much. Each has made real progress and each has a real ceiling.
Camp 1: Formal Verification
The most academically rigorous approach applies formal methods — theorem provers, SMT solvers, model checkers — to LLM outputs. The goal is mathematical proof of correctness.
ProofWright (2025) represents the state of the art for code generation. It wraps LLM-generated CUDA kernels with a formal verification pipeline using Rocq and VerCors, establishing memory safety, thread safety, and semantic equivalence guarantees. The results are honest: memory and thread safety guaranteed for 74% of programs tested. Semantic equivalence — full functional correctness — established for 14%.
SEVerA (2026) takes a different angle: it couples stochastic LLM generation with formal behavioral constraints, using GRPO-style fine-tuning to improve performance while preserving formal correctness guarantees. Across constrained symbolic reasoning and policy-compliant tool use benchmarks, it achieves zero constraint violations.
The ceiling: formal verification requires a formal specification. You need to define, precisely and completely, what correct looks like. For code generation, this is tractable — a correct CUDA kernel has mathematically definable properties. For general-purpose natural language tasks — answer a customer's question, summarize this document, plan this workflow — there is no formal specification. Intent is ambiguous, correct answers are not unique, and the space of valid outputs cannot be enumerated.
Formal verification is the right tool for formalizable domains. It is not a general solution.
Camp 2: Runtime Guardrails
The production engineering community took a more pragmatic approach: wrap the LLM with validation layers that enforce structural and policy constraints on outputs.
NVIDIA NeMo Guardrails uses Colang — a purpose-built programming language — to define dialogue flows and constraints. It acts as an intermediary between the user and the LLM, using semantic similarity to detect off-topic or policy-violating inputs and outputs, then redirecting or blocking them. The constraints are user-defined, independent of the underlying model, and interpretable.
Guardrails AI implements Pydantic-style validation of LLM outputs. If you expect a structured JSON response with specific fields and types, Guardrails validates that the output conforms — and can trigger a retry if it does not. It also supports semantic validators: checking for bias in generated text, verifying that code compiles, confirming that a response is topically relevant.
The ceiling: guardrails are pattern filters. They are excellent at catching known bad outputs — policy violations, structural malformations, off-topic responses, outputs that fail explicit validation rules. They cannot catch semantically wrong but syntactically valid outputs. A factually incorrect answer in well-formed prose, a plausible but wrong procedure, a subtly misdirected recommendation — these pass through guardrail systems cleanly. The guardrail does not know the answer is wrong. It only knows the format is correct.
Guardrails solve the structural correctness problem. They do not solve the semantic correctness problem.
Camp 3: Post-Condition Verification (Most Aligned with Systems Reliability Discipline)
The approach that aligns most closely with classical systems reliability discipline frames the problem differently: do not try to verify the LLM. Treat it as an unreliable component and verify the system around it.
The Dual-State Action Pair (DSAP) framework, formalized in 2025, couples stochastic LLM generation with deterministic post-condition verification. The core insight: decompose system state into finite observable workflow state (what you can verify deterministically) and infinite opaque environment state (what the LLM produces). Every LLM action produces an output that is checked against a deterministic post-condition before the system advances. If the post-condition fails, the system does not proceed — it enters a structured recovery path.
The recovery hierarchy has three levels: context refinement (retry within the same step with additional context), informed backtracking (detect stagnation, invalidate downstream state, inject context into upstream steps), and human escalation (route to a human when automated recovery is exhausted). Three major jurisdictions are now requiring exactly this escalation path for high-risk AI systems — see The Global AI Risk Assessment Convergence for the regulatory landscape. Across 13 LLMs on diagnostic benchmarks, this approach produced reliability gains of up to 66 percentage points.
At the systems level, this behaves similarly to a circuit breaker applied to semantic correctness rather than infrastructure health. The difference from standard circuit breakers: the trip condition is not latency or error rate. The trip condition is post-condition failure — a semantic signal, not a structural one.
The ceiling: post-condition verification requires defining the post-condition. For structured tasks — code execution, data transformation, workflow steps with clear success criteria — this is tractable. For open-ended tasks, defining a post-condition requires reducing the problem to something verifiable, which means accepting that some semantic correctness is still probabilistic.
The Engineering Playbook: Four Patterns for Today
Given what the research shows, here is what engineers can implement now. These patterns are not theoretical. They are direct adaptations of distributed systems primitives to the semantic failure domain.
Pattern 1: The Verification Wrapper
The foundational pattern. Every LLM call is wrapped in a deterministic verification layer. The LLM generates. The verifier checks. The system only proceeds if the check passes.
from dataclasses import dataclass
from typing import Callable, TypeVar, Optional
import logging
T = TypeVar('T')
@dataclass
class VerificationResult:
passed: bool
output: any
failure_reason: Optional[str] = None
attempt: int = 1
class VerificationWrapper:
"""
Treats the LLM as an unreliable node.
Verification is deterministic even when generation is not.
Pattern: DSAP (Dual-State Action Pair)
"""
def __init__(
self,
llm_call: Callable,
post_condition: Callable[[any], tuple[bool, str]],
max_attempts: int = 3,
context_enricher: Optional[Callable[[str, int], str]] = None
):
self.llm_call = llm_call
self.post_condition = post_condition
self.max_attempts = max_attempts
self.context_enricher = context_enricher
def execute(self, prompt: str) -> VerificationResult:
current_prompt = prompt
for attempt in range(1, self.max_attempts + 1):
# Enrich prompt with failure context on retry
if attempt > 1 and self.context_enricher:
current_prompt = self.context_enricher(
current_prompt, attempt
)
output = self.llm_call(current_prompt)
passed, reason = self.post_condition(output)
if passed:
return VerificationResult(
passed=True,
output=output,
attempt=attempt
)
logging.warning(
f"Post-condition failed (attempt {attempt}/{self.max_attempts}): "
f"{reason}"
)
# All attempts exhausted — return failure for escalation
return VerificationResult(
passed=False,
output=output,
failure_reason=reason,
attempt=self.max_attempts
)
# Example: Structured output verification
def verify_json_schema(output: str) -> tuple[bool, str]:
"""Deterministic post-condition: output must parse and contain required fields."""
import json
try:
parsed = json.loads(output)
required = {"action", "confidence", "reasoning"}
missing = required - parsed.keys()
if missing:
return False, f"Missing required fields: {missing}"
if not 0.0 <= parsed["confidence"] <= 1.0:
return False, f"Confidence out of range: {parsed['confidence']}"
return True, ""
except json.JSONDecodeError as e:
return False, f"Invalid JSON: {e}"
def enrich_with_failure_context(prompt: str, attempt: int) -> str:
"""Reflexion pattern: inject failure context into retry prompt."""
return (
f"{prompt}\n\n"
f"[RETRY CONTEXT — attempt {attempt}]\n"
f"Previous response did not meet the required format. "
f"Ensure the response is valid JSON with fields: "
f"action (string), confidence (float 0-1), reasoning (string)."
)
Key design decision: the post-condition is always deterministic. It may be a schema check, a business rule validation, a unit test execution, a range check — whatever is appropriate for the domain. The LLM generates probabilistically. The verifier checks deterministically. This is the boundary.
Pattern 2: The Semantic Circuit Breaker
Standard circuit breakers trip on latency percentiles and error rates. This circuit breaker trips on semantic quality degradation — when a measurable proxy for output quality drops below a threshold.
import time
from collections import deque
from enum import Enum
from dataclasses import dataclass, field
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Blocking calls — quality too low
HALF_OPEN = "half_open" # Testing recovery
@dataclass
class SemanticCircuitBreaker:
"""
Trips not on latency or error rate, but on semantic quality score.
Quality score is domain-specific: verification pass rate,
eval score, human feedback signal, or post-condition success rate.
"""
failure_threshold: float = 0.3 # Trip if quality drops below 30%
recovery_threshold: float = 0.7 # Reset if quality recovers to 70%
window_size: int = 20 # Rolling window of last N calls
half_open_probe_count: int = 3 # Probe attempts before full reset
state: CircuitState = field(default=CircuitState.CLOSED, init=False)
quality_window: deque = field(init=False)
opened_at: Optional[float] = field(default=None, init=False)
probe_successes: int = field(default=0, init=False)
reset_timeout: float = 60.0 # Seconds before attempting half-open
def __post_init__(self):
self.quality_window = deque(maxlen=self.window_size)
def record(self, quality_score: float) -> None:
"""
Record a quality score (0.0 = failure, 1.0 = perfect).
For post-condition verification: 0 if failed, 1 if passed.
For eval-based scoring: the normalized eval score.
"""
self.quality_window.append(quality_score)
self._evaluate_state()
def _current_quality(self) -> float:
if not self.quality_window:
return 1.0
return sum(self.quality_window) / len(self.quality_window)
def _evaluate_state(self) -> None:
quality = self._current_quality()
if self.state == CircuitState.CLOSED:
if quality < self.failure_threshold:
self.state = CircuitState.OPEN
self.opened_at = time.time()
logging.error(
f"Semantic circuit breaker OPENED. "
f"Quality score: {quality:.2f} below threshold {self.failure_threshold}"
)
elif self.state == CircuitState.OPEN:
if time.time() - self.opened_at > self.reset_timeout:
self.state = CircuitState.HALF_OPEN
self.probe_successes = 0
logging.info("Semantic circuit breaker entering HALF_OPEN state.")
elif self.state == CircuitState.HALF_OPEN:
if quality >= self.recovery_threshold:
self.probe_successes += 1
if self.probe_successes >= self.half_open_probe_count:
self.state = CircuitState.CLOSED
logging.info("Semantic circuit breaker CLOSED — quality recovered.")
else:
self.state = CircuitState.OPEN
self.opened_at = time.time()
logging.warning("Semantic circuit breaker re-OPENED during probe.")
def allow_request(self) -> bool:
if self.state == CircuitState.CLOSED:
return True
if self.state == CircuitState.OPEN:
return False
# HALF_OPEN: allow probe requests
return True
@property
def current_quality(self) -> float:
return self._current_quality()
What trips this breaker: not infrastructure metrics, but semantic signal. The quality score can come from post-condition pass rates (Pattern 1), from a lightweight eval model running inline, from structured output conformance rates, or from any domain-specific correctness signal. The circuit breaker does not care what the signal is — it cares about the trend. In practice, the usefulness of this pattern depends on the stability of the underlying quality signal; noisy evaluators can themselves become a source of operational instability.
When the breaker opens: route to a fallback (a simpler, more constrained model, a rule-based system, or human handling). Do not silently degrade.
Pattern 3: Structured Retry with Context Injection
This is the Reflexion pattern operationalized for production. Failed verification does not trigger a blind retry — it triggers a context-enriched retry. The failure reason becomes input to the next attempt.
@dataclass
class RetryContext:
original_prompt: str
attempt: int
failure_reasons: list[str] = field(default_factory=list)
def build_retry_prompt(self) -> str:
"""
Inject failure history into the retry prompt.
This is the Reflexion pattern: verbal reflection on failures
as input to the next attempt.
"""
if not self.failure_reasons:
return self.original_prompt
failure_summary = "\n".join(
f" Attempt {i+1}: {reason}"
for i, reason in enumerate(self.failure_reasons)
)
return (
f"{self.original_prompt}\n\n"
f"[PREVIOUS ATTEMPTS FAILED — FAILURE CONTEXT]\n"
f"{failure_summary}\n\n"
f"Correct the above issues in your response."
)
class StructuredRetryOrchestrator:
"""
Bounded retry with semantic context injection and escalation.
Three levels (matching DSAP recovery hierarchy):
1. Context refinement: retry with injected failure context
2. Informed backtracking: reduce task scope, retry simplified version
3. Human escalation: route to dead letter queue for human review
"""
def __init__(
self,
wrapper: VerificationWrapper,
escalation_queue: 'SemanticDeadLetterQueue',
max_refinement_attempts: int = 3
):
self.wrapper = wrapper
self.escalation_queue = escalation_queue
self.max_refinement_attempts = max_refinement_attempts
def execute(self, prompt: str, task_id: str) -> dict:
context = RetryContext(original_prompt=prompt, attempt=0)
for attempt in range(1, self.max_refinement_attempts + 1):
context.attempt = attempt
retry_prompt = context.build_retry_prompt()
result = self.wrapper.llm_call(retry_prompt)
passed, reason = self.wrapper.post_condition(result)
if passed:
return {
"status": "success",
"output": result,
"attempts": attempt,
"task_id": task_id
}
context.failure_reasons.append(reason)
logging.warning(f"Task {task_id} attempt {attempt} failed: {reason}")
# Exhausted refinement — escalate to human
logging.error(
f"Task {task_id} exhausted {self.max_refinement_attempts} attempts. "
f"Escalating to dead letter queue."
)
self.escalation_queue.enqueue(
task_id=task_id,
original_prompt=prompt,
failure_history=context.failure_reasons
)
return {
"status": "escalated",
"task_id": task_id,
"attempts": self.max_refinement_attempts,
"failure_history": context.failure_reasons
}
Critical design constraint: the retry count must be bounded. Unbounded retry on a non-deterministic system produces retry storms. Agentic systems introduce a nastier variant: semantic retry amplification — each retry generates a longer corrective context, longer corrective context increases reasoning drift, drift increases semantic variance, and variance causes more retries. This is a positive feedback loop unique to LLM systems, with no equivalent in traditional distributed retry logic. If the LLM cannot produce a passing output after three attempts with full failure context, adding more attempts is unlikely to help and will compound token costs, latency, and reasoning instability. Escalate.
Pattern 4: The Semantic Dead Letter Queue
In message queue systems, messages that cannot be processed after exhausting retries go to a dead letter queue — a holding area for human review and intervention. Agentic systems need the same primitive for semantic failures.
import json
from datetime import datetime
from dataclasses import dataclass, field
@dataclass
class FailedTask:
task_id: str
original_prompt: str
failure_history: list[str]
timestamp: str = field(
default_factory=lambda: datetime.utcnow().isoformat()
)
metadata: dict = field(default_factory=dict)
class SemanticDeadLetterQueue:
"""
Failed tasks that exhaust automated recovery go here.
Not a graveyard — an escalation path.
In production: back this with a real queue (SQS, RabbitMQ,
Azure Service Bus) and a human review interface.
This implementation is the contract, not the storage.
"""
def __init__(self, storage_backend=None):
self.storage = storage_backend or []
self._handlers = []
def enqueue(
self,
task_id: str,
original_prompt: str,
failure_history: list[str],
metadata: dict = None
) -> None:
task = FailedTask(
task_id=task_id,
original_prompt=original_prompt,
failure_history=failure_history,
metadata=metadata or {}
)
# Write to storage
if isinstance(self.storage, list):
self.storage.append(task)
else:
self.storage.put(task)
# Notify registered handlers (alerting, ticketing, etc.)
for handler in self._handlers:
handler(task)
logging.error(
f"[SEMANTIC DLQ] Task {task_id} escalated after "
f"{len(failure_history)} failed attempts. "
f"Last failure: {failure_history[-1] if failure_history else 'unknown'}"
)
def register_handler(self, handler: Callable[[FailedTask], None]) -> None:
"""
Register escalation handlers: PagerDuty, JIRA ticket creation,
Slack alert, human review dashboard notification.
"""
self._handlers.append(handler)
def review_and_reprocess(
self,
task_id: str,
human_correction: str
) -> None:
"""
Human reviews the failed task, provides correction or guidance,
and requeues for processing with enriched context.
This closes the feedback loop.
"""
# Implementation depends on storage backend
pass
Why this matters: silent discard is the worst possible failure mode for semantic errors. If a task fails verification and is silently dropped, the user never knows, the system appears healthy, and trust erodes invisibly — exactly what happened in the opening story. The semantic DLQ makes failure visible, routes it to humans, and creates a feedback loop. Crucially, the tasks in the DLQ are training data for improving your post-conditions and your prompts over time.
Assembling the Stack
These four patterns compose into a complete semantic correctness layer that wraps your LLM calls:
User Request
│
▼
┌─────────────────────────────────┐
│ Semantic Circuit Breaker │ ← Is quality above threshold?
│ (Pattern 2) │ If open → fallback / queue
└────────────────┬────────────────┘
│ CLOSED or HALF_OPEN
▼
┌─────────────────────────────────┐
│ Structured Retry Orchestrator │ ← Bounded retry with context
│ (Pattern 3) │ injection on each failure
└────────────────┬────────────────┘
│
▼
┌─────────────────────────────────┐
│ Verification Wrapper │ ← LLM call + deterministic
│ (Pattern 1) │ post-condition check
└────────────────┬────────────────┘
│
┌──────┴──────┐
│ │
PASSED FAILED (exhausted)
│ │
▼ ▼
Return Semantic DLQ
Output (Pattern 4)
│
▼
Human Review
+ Feedback Loop
This stack does not make your LLM correct. It makes your system fail visibly, recover gracefully, and escalate predictably. That is the correct engineering objective when the compute node is non-deterministic.
The Missing Primitive: What Still Needs to Be Invented
The four patterns above are what engineers can do today. They require one thing to work: a post-condition. You must be able to define, for each task, what a correct output looks like in a form that can be checked deterministically.
For structured outputs — JSON schemas, code that compiles, SQL that executes, API calls with defined contracts — this is tractable. You write the post-condition and the system works.
For open-ended natural language tasks — answer this question correctly, summarize this document accurately, give sound advice — a general-purpose post-condition does not exist yet. This is the unsolved primitive.
What would it look like? A runtime semantic contract — a specification that captures the intent of a natural language task in a form that is both human-interpretable and machine-checkable, without requiring full formalization. Something between a type signature and a formal spec. Lightweight enough to write for every task. Strong enough to catch the failures that matter.
The research directions that may converge on this: LLM-as-judge evaluation inline (using a second model to check the first — expensive but increasingly feasible; model selection for the verifier involves its own cost/accuracy tradeoff — see The Precision Paradox for the framework on matching model size to task demands), structured semantic constraints expressed in domain-specific languages (Colang is an early attempt), and behavioral contracts defined over example input-output pairs rather than formal specifications.
One critical caution on LLM-as-judge approaches: a verifier built on another stochastic model shifts rather than eliminates uncertainty, introducing the additional problems of verifier reliability and correlated failure — where generator and judge hallucinate in the same direction because they share similar training distributions. A stochastic verifier is better than no verifier, but it is not a deterministic post-condition.
How to act today: for every task your agentic system performs, ask: what is the minimal verifiable property I can check on the output? Start there. Even a weak post-condition — the response is not empty, the response references the correct entity, the response does not contain prohibited content — is better than no post-condition. Build the habit of defining verification before building generation. The practice will improve as the tooling improves.
The Call to Action
Two concrete things to do Monday morning.
One: audit your current agentic system for silent semantic failure. Find every LLM call that is not wrapped in a post-condition check. For each one, ask: if this call returns a plausible-sounding wrong answer, will any monitoring system catch it? If the answer is no, you have a silent failure path. Prioritize the ones where wrong answers have downstream consequences.
Two: implement the Verification Wrapper for your highest-stakes LLM calls first. Define the simplest post-condition you can for each. Even a schema check, a length check, or a required-field check is a start. Wire failures to a logging pipeline with structured output — not just an exception — so you can see the pattern of what fails and why. That data will tell you where to invest in stronger post-conditions next.
The infrastructure layer of agentic AI largely maps onto already-solved distributed systems patterns. Use the decades of distributed systems work that exists. The semantic layer is not solved. Do not pretend it is, do not ignore it, and do not wait for the field to converge on a general solution before building production systems. Build with the patterns that exist, instrument your semantic failure surface, and contribute to the feedback loop that will produce the primitives the field still needs.
The meeting room from Part 1 had a greybeard who recognized the solved problems. What the field needs next are engineers who also recognize the unsolved ones — and build accordingly.
References
| Source | URL |
|---|---|
| ProofWright: Agentic Formal Verification of CUDA (2025) | https://arxiv.org/abs/2511.12294 |
| SEVerA: Verified Synthesis of Self-Evolving Agents (2026) | https://arxiv.org/abs/2603.25111 |
| Reflexion: Language Agents with Verbal Reinforcement Learning | https://arxiv.org/abs/2303.11366 |
| Dual-State Action Pair Framework (DSAP, 2025) | https://arxiv.org/abs/2512.20660 |
| NeMo Guardrails (NVIDIA, 2023) | https://arxiv.org/abs/2310.10501 |
| Guardrails AI (open source) | https://github.com/guardrails-ai/guardrails |
| LLM+Formal Methods Roadmap (2026) | https://arxiv.org/abs/2412.06512 |
| RvLLM: Runtime Verification with Domain Knowledge | https://arxiv.org/abs/2505.18585 |
| Building Guardrails for LLMs | https://arxiv.org/abs/2402.01822 |
| ZenML: What 1,200 Production Deployments Reveal (2025) | https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025 |
| Dapr Actor Model | https://docs.dapr.io/developing-applications/building-blocks/actors/ |
| Temporal Workflow Engine | https://temporal.io |
Shaped in collaboration with Claude, an AI assistant by Anthropic, during sunny Pacific Northwest afternoons where engineering problems meet philosophical questions.
