

Security teams have spent years building detection and response capabilities around a failure mode they understood well enough to instrument for. A service misbehaves, an alert fires and an engineer investigates. That model worked because the systems producing the failures were deterministic enough that misbehavior was visible, measurable and attributable to a cause that a runbook could address.
Agentic systems have introduced into that environment a category of failure that looks nothing like the one the detection infrastructure was built to catch: a failure that completes successfully, logs nothing unusual, returns a clean status code and disappears into the transaction history while the damage it caused propagates quietly through every system the agent touched.
“The governance gap this creates is not a configuration problem that a new tool can close,” says Shahid Ali Khan, principal engineer – DevOps at TestMu AI, an AI-native software testing platform. It is structural, rooted in the assumption that reliability failures and security events are categorically distinct, happen through different mechanisms and require different response processes.
Agentic systems break that assumption, Khan explains, because the same root cause, a manipulated input, a drifted model, a misconfigured capability boundary, can produce either outcome depending on context. Organizations that route reliability and security to different teams with different runbooks will keep discovering that gap through incidents rather than through the architectural decisions that could have prevented them.
Testing Infrastructure That Was Built for the Wrong Assumption
The testing problem that agentic systems create is not a harder version of the testing problem that deterministic systems create. It is a structurally different problem that requires a different kind of answer. Ihor Zakutynskyi, chief technology officer at FORMA by Universe Group, describes the shift his team made when they encountered the limits of deterministic test assertions against probabilistic systems.
“Rather than expecting exact outputs, we moved to constraint-based and statistical validation, asserting invariants and measuring distributions instead of matching outputs,” Zakutynskyi explains. “Hard guarantees, safety and schema contracts, monotonic side-effect rules, idempotency of repeated calls and bounded response times remain pass-fail invariants. Everything above that baseline moves to statistical validation, running Monte Carlo-style test suites over representative inputs and computing stability metrics from the semantic embeddings of responses rather than comparing strings.”
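A minimal sketch can make the two layers Zakutynskyi describes concrete. The `run_agent`, `embed` and `cosine` functions below are stand-ins, not real APIs: hard invariants such as schema and status stay pass-fail, while everything above that baseline is scored statistically via embedding similarity across many runs.

```python
import math
import random

def run_agent(prompt: str) -> dict:
    # Stand-in for a real agent call; outputs vary slightly between runs.
    return {"status": "ok", "text": f"answer {random.gauss(0, 0.1):.3f}"}

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: hash characters into a unit vector.
    vec = [0.0] * 8
    for i, ch in enumerate(text):
        vec[i % 8] += ord(ch) / 100.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def validate(prompt: str, runs: int = 50, min_stability: float = 0.9):
    outputs = []
    for _ in range(runs):
        out = run_agent(prompt)
        # Hard guarantees remain pass-fail invariants.
        assert set(out) == {"status", "text"}, "schema contract violated"
        assert out["status"] == "ok", "safety invariant violated"
        outputs.append(out["text"])
    # Statistical layer: semantic stability across the run distribution,
    # rather than exact string matches against a golden output.
    base = embed(outputs[0])
    sims = [cosine(base, embed(o)) for o in outputs[1:]]
    stability = sum(sims) / len(sims)
    return stability >= min_stability, stability
```

In a real suite the embedding call would hit an actual model and the stability threshold would be calibrated per task, but the structure, invariants below, distributions above, is the point.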
The shift from exact match to distribution-based validation is not a concession to imprecision. It is a more accurate representation of what reliability actually means for a probabilistic system. Moreover, teams that resist it in favor of deterministic assertions will find themselves maintaining tests that pass consistently while missing the regressions that matter most.
Ronak Desai, CEO and founder of Ciroos and formerly SVP and GM at Cisco, frames the same shift in terms that engineering leaders will find immediately actionable. “The question isn’t whether your test passed,” he notes. “It’s whether your system is reliably capable.” That reframe demands moving from assertion-based testing to distribution-based testing, asking across many independent runs on the same task how many produce a correct outcome and treating that ratio as the reliability signal rather than the result of any individual run. Variance stability, the consistency of an agent’s output distribution across runs, tells you whether an agent is reliable. A single passing run tells you almost nothing about the system’s actual capability envelope.
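Desai's reframe reduces to a small harness. This is a hedged sketch, assuming a hypothetical `run_task` callable that returns a correctness flag and a score per run; the reliability signal is the pass ratio and score spread across runs, not any single result.

```python
import statistics

def reliability_profile(run_task, task, n_runs: int = 100) -> dict:
    """The signal is the ratio of correct outcomes across many independent
    runs on the same task, not the result of any individual run."""
    results = [run_task(task) for _ in range(n_runs)]
    pass_rate = sum(1 for r in results if r["correct"]) / n_runs
    scores = [r["score"] for r in results]
    return {
        "pass_rate": pass_rate,
        # Variance stability: a tight score spread across runs indicates a
        # consistent output distribution, not one lucky sample.
        "score_stdev": statistics.pstdev(scores),
    }
```

A gate built on this profile would compare both numbers against thresholds before promotion, rather than checking whether the last run happened to pass.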
Before any new agentic component reaches production, it should be tested against real production sessions in parallel, not to match exact outputs but to measure how consistent the new agent’s output distribution is compared to the established baseline. That comparison is the test. Everything else is preparation for it.
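One simple way to score that baseline comparison, offered here as an illustrative assumption rather than a prescribed method, is the total variation distance between the categorized outputs of the established agent and the candidate over the same production sessions.

```python
from collections import Counter

def distribution_shift(baseline_outputs, candidate_outputs) -> float:
    """Total variation distance between two output distributions: 0.0 means
    identical distributions, 1.0 means completely disjoint ones."""
    base = Counter(baseline_outputs)
    cand = Counter(candidate_outputs)
    n_base, n_cand = sum(base.values()), sum(cand.values())
    keys = set(base) | set(cand)
    return 0.5 * sum(abs(base[k] / n_base - cand[k] / n_cand) for k in keys)
```

A promotion gate could then block any candidate whose shift against the baseline exceeds an agreed threshold.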
Arun Anbumani, principal cloud infrastructure engineer at Oracle, adds the infrastructure dimension to the testing picture that pure model-focused approaches miss. “Replay-style testing against captured production traffic patterns and fault injection introducing controlled disruptions, resource contention, device resets and driver mismatches give teams visibility into how systems respond when the hardware paths underneath the models start behaving differently,” Anbumani explains. “The broader challenge is that most SRE tooling was built for predictable services, and as infrastructure becomes more heterogeneous and development workflows incorporate AI-assisted tooling, the testing and observability platforms are still evolving to keep up.”
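A replay-with-fault-injection harness can be sketched in a few lines. The fault types and rates here are illustrative stand-ins for the disruptions Anbumani describes, such as resource contention and device resets, wrapped around whatever handler serves the captured traffic.

```python
import random

def inject_faults(handler, fault_rate: float = 0.2):
    """Wrap a handler so a controlled fraction of calls hit injected faults."""
    def wrapped(request):
        roll = random.random()
        if roll < fault_rate / 2:
            raise TimeoutError("injected: simulated resource contention")
        if roll < fault_rate:
            raise ConnectionResetError("injected: simulated device reset")
        return handler(request)
    return wrapped

def replay(captured_requests, handler):
    """Replay captured production traffic and record how each call fared."""
    outcomes = []
    for request in captured_requests:
        try:
            handler(request)
            outcomes.append("ok")
        except (TimeoutError, ConnectionResetError) as exc:
            outcomes.append(f"fault: {exc}")
    return outcomes
```

The interesting output is not whether faults occurred, they were injected deliberately, but how the system's behavior under them compares to its behavior on the clean replay.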
Testing infrastructure for agentic systems, therefore, has to be built with the assumption that variability is not the exception but the operating condition. Fault injection and replay testing are not edge case preparation but the core of a testing regime designed for an environment where the normal operating envelope is wider and less well-defined than in any previous generation of infrastructure.
The Governance Layer Nobody Built
The hardest part of running agentic systems in production is not building them. Khan, speaking from his experience running agentic infrastructure at TestMu AI (formerly LambdaTest), identifies a difficulty that practitioners who have not yet operated agentic systems at scale may not have encountered. “Traditional runbooks assume failures are obvious,” he observes. “A service crashes, latency spikes, errors propagate. Agents fail subtly. They might complete successfully while doing something completely unintended.”
Detecting that failure mode without triggering false positives on every creative decision the agent makes is where existing governance frameworks fall short, and building the controls that close that gap is the work most organizations have not done yet.
Khan’s approach is to build governance controls at the platform level that operate on behavioral boundaries rather than system metrics. Every agent has an explicitly defined capability envelope covering what tools it can invoke, what data it can access, what output formats are valid and what actions require human approval. These are not permissions in the traditional sense. These are runtime assertions, checked at the moment of execution rather than granted at deployment time and assumed to hold thereafter. When an agent invokes a tool outside its envelope or generates output that does not match expected schemas, the event is captured with full context — the input that triggered it, the reasoning chain the agent followed and the attempted action — and routed to a dedicated anomaly pipeline separate from standard incident management.
Building behavioral boundaries as runtime assertions rather than deployment-time permissions is the architectural decision that makes the governance layer enforceable rather than advisory. Permissions that are granted at deployment and never checked again are assumptions, and in an agentic system operating at machine speed, unverified assumptions are where the most consequential failures begin.
“It is also worth implementing circuit breakers specifically for agent autonomy,” Khan notes. If an agent exceeds a threshold of envelope violations within a time window, it is automatically downgraded to a supervised mode where all actions require human approval. This limits the blast radius of a compromised or misbehaving agent while the team investigates, and it does so without requiring a human to notice the problem first. The circuit breaker fires on the pattern, not on a human’s recognition of it, which is the only response mechanism fast enough to be meaningful when the agent is operating at machine speed.
When a Reliability Failure Is Also a Security Event
The boundary between a reliability failure and a security event in an agentic system is not always a boundary at all. Khan explains how his team encountered this directly when building the classification system for envelope violations. An agent accessing an unauthorized API could be a misconfiguration, which is a reliability problem, or a prompt injection attack, which is a security problem. From the outside, the two events look identical. “We have adopted a classify later, capture now approach,” he explains. “Every envelope violation is logged with enough context for both SRE and security review.” A secondary classification system then tags events based on input source. If the anomaly correlates with user-provided content, it is flagged for security review. If it correlates with a model update or configuration change, it routes to SRE.
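The secondary classification step can be sketched as a simple router. The `correlates_with` field and queue names are hypothetical; the design point is that routing happens after capture, based on what the anomaly correlates with, rather than forcing a guess at the moment of the event.

```python
def classify_violation(event: dict) -> str:
    """Tag a captured envelope violation for the right responder based on
    the input source the anomaly correlates with."""
    source = event.get("correlates_with")
    if source == "user_content":
        return "security_review"   # possible prompt injection
    if source in ("model_update", "config_change"):
        return "sre_review"        # likely drift or misconfiguration
    return "ambiguous_queue"       # captured now, classified later
```

Events landing in the ambiguous queue retain their full context, so either team can pick them up once correlation data arrives.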
The classify-later, capture-now approach resolves a real organizational tension that most incident response processes are not designed to handle. Forcing immediate classification of an event whose root cause is ambiguous leads either to misrouting, where a security event gets treated as a reliability problem until the damage is done, or to alert fatigue, where every ambiguous event gets escalated to both teams until neither team takes the escalation seriously.