{"id":3736,"date":"2026-03-30T13:14:33","date_gmt":"2026-03-30T13:14:33","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/03\/30\/agentic-systems-are-breaking-reliability-frameworks\/"},"modified":"2026-03-30T13:14:33","modified_gmt":"2026-03-30T13:14:33","slug":"agentic-systems-are-breaking-reliability-frameworks","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/03\/30\/agentic-systems-are-breaking-reliability-frameworks\/","title":{"rendered":"Agentic\u00a0Systems\u00a0are\u00a0Breaking\u00a0Reliability\u00a0Frameworks\u00a0"},"content":{"rendered":"<div><img data-opt-id=723960471  fetchpriority=\"high\" decoding=\"async\" width=\"770\" height=\"330\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2025\/06\/AI-model.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"\" \/><\/div>\n<p><img data-opt-id=1667573176  fetchpriority=\"high\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2025\/06\/AI-model-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" \/><\/p>\n<p>Security\u00a0teams\u00a0have\u00a0spent\u00a0years\u00a0building\u00a0<a href=\"https:\/\/devops.com\/how-devops-can-benefit-from-a-managed-detection-and-response-mdr-solution\/\" target=\"_blank\" 
rel=\"noopener\">detection\u00a0and\u00a0response\u00a0capabilities<\/a>\u00a0around\u00a0a\u00a0failure\u00a0mode\u00a0they\u00a0understood\u00a0well\u00a0enough\u00a0to\u00a0instrument\u00a0for.\u00a0Typically,\u00a0a\u00a0service\u00a0misbehaves,\u00a0an\u00a0alert\u00a0fires\u00a0and\u00a0an\u00a0engineer\u00a0investigates.\u00a0This\u00a0kind\u00a0of\u00a0model\u00a0worked\u00a0because\u00a0the\u00a0systems\u00a0producing\u00a0the\u00a0failures\u00a0were\u00a0deterministic\u00a0enough\u00a0that\u00a0misbehavior\u00a0was\u00a0visible,\u00a0measurable\u00a0and\u00a0attributable\u00a0to\u00a0a\u00a0cause\u00a0that\u00a0a\u00a0runbook\u00a0could\u00a0address.<\/p>\n<p><span data-contrast=\"auto\">However,\u00a0what\u00a0agentic\u00a0systems\u00a0have\u00a0introduced\u00a0into\u00a0that\u00a0environment\u00a0is\u00a0a\u00a0category\u00a0of\u00a0failure\u00a0that\u00a0looks\u00a0nothing\u00a0like\u00a0the\u00a0one\u00a0the\u00a0detection\u00a0infrastructure\u00a0was\u00a0built\u00a0to\u00a0catch\u00a0\u2014\u00a0a\u00a0failure\u00a0that\u00a0completes\u00a0successfully,\u00a0logs\u00a0nothing\u00a0unusual,\u00a0returns\u00a0a\u00a0clean\u00a0status\u00a0code\u00a0and\u00a0disappears\u00a0into\u00a0the transaction\u00a0history\u00a0while\u00a0the\u00a0damage\u00a0it\u00a0caused\u00a0propagates\u00a0quietly\u00a0through\u00a0every\u00a0system\u00a0the agent\u00a0touched.<\/span><\/p>\n<p><span data-contrast=\"auto\">\u201cThe\u00a0governance\u00a0gap\u00a0this\u00a0creates\u00a0is\u00a0not\u00a0a\u00a0configuration\u00a0problem\u00a0that\u00a0a\u00a0new\u00a0tool\u00a0can\u00a0close,\u201d\u00a0says\u00a0<\/span><a href=\"https:\/\/in.linkedin.com\/in\/shahid-ali-khan-2aa084a3\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Shahid\u00a0Ali\u00a0Khan<\/span><\/a><span data-contrast=\"auto\">, principal engineer \u2013 DevOps at <\/span><a href=\"https:\/\/www.testmuai.com\/\" target=\"_blank\" rel=\"noopener\"><span 
data-contrast=\"none\">TestMu\u00a0AI,<\/span><\/a><span data-contrast=\"none\">\u00a0<\/span><span data-contrast=\"auto\">an AI-native software testing platform. It is structural, rooted in the assumption that reliability failures and security events are categorically distinct, happen through different mechanisms and require different response processes.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":244,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Agentic systems break that assumption, Khan explains, because the same root cause, a manipulated input, a drifted model, a misconfigured capability boundary, can produce either outcome depending on context. Organizations that route reliability and security to different teams with different runbooks will keep discovering that gap through incidents rather than through the architectural decisions that could have prevented them.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":243,\"335559740\":276}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Testing\u00a0Infrastructure\u00a0That\u00a0was\u00a0Built\u00a0for\u00a0the\u00a0Wrong\u00a0Assumption<\/span><span data-ccp-props='{\"335559685\":23,\"335559738\":241}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The\u00a0testing\u00a0problem\u00a0that\u00a0agentic\u00a0systems\u00a0create\u00a0is\u00a0not\u00a0a\u00a0harder\u00a0version\u00a0of\u00a0the\u00a0testing\u00a0problem\u00a0that\u00a0deterministic\u00a0systems\u00a0create.\u00a0It\u00a0is\u00a0a\u00a0structurally\u00a0different\u00a0problem\u00a0that\u00a0requires\u00a0a\u00a0different\u00a0kind\u00a0of\u00a0answer.\u00a0<\/span><a href=\"https:\/\/ua.linkedin.com\/in\/zakutynskyi\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Ihor\u00a0Zakutynskyi<\/span><\/a><span 
data-contrast=\"auto\">,\u00a0chief\u00a0technology\u00a0officer\u00a0at\u00a0FORMA\u00a0by\u00a0<\/span><a href=\"https:\/\/uni.tech\/en\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Universe\u00a0Group<\/span><\/a><span data-contrast=\"auto\">, describes the shift his team made when they encountered the limits of deterministic test assertions against probabilistic systems.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":302,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u201cRather than expecting exact outputs, we moved to constraint-based and statistical validation, asserting invariants and measuring distributions instead of matching outputs,\u201d Zakutynskyi explains. \u201cHard guarantees, safety and schema contracts, monotonic side-effect rules, idempotency of repeated calls and bounded response times remain pass-fail invariants. Everything above that baseline moves to statistical validation, running Monte Carlo-style test suites over representative inputs and computing stability metrics from the semantic embeddings of responses rather than comparing strings.\u201d<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":11,\"335559738\":244,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The shift from exact match to distribution-based validation is not a concession to imprecision. It is a more accurate representation of what reliability actually means for a probabilistic system. 
Moreover, teams that resist it in favor of deterministic assertions will find themselves maintaining tests that pass consistently while missing the regressions that matter most.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":11,\"335559738\":244,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/rodesai-ln\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Ronak\u00a0Desa<\/span><span data-contrast=\"none\">i<\/span><\/a><span data-contrast=\"auto\">,\u00a0CEO\u00a0and\u00a0founder\u00a0of\u00a0<\/span><a href=\"https:\/\/ciroos.ai\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Ciroos<\/span><\/a><span data-contrast=\"none\">\u00a0<\/span><span data-contrast=\"auto\">and formerly SVP and GM at Cisco, frames the same shift in terms that engineering leaders will find immediately actionable. \u201cThe question isn\u2019t whether your test passed,\u201d he notes. \u201cIt\u2019s whether your system is reliably capable.\u201d That reframe demands moving from assertion-based testing to distribution-based testing, asking across many independent runs on the same task how many produce a correct outcome and treating that ratio as the reliability signal rather than the result of any individual run. Variance stability, the consistency of an agent\u2019s output distribution across runs, tells you whether an agent is reliable. A single passing run tells you almost nothing about the system\u2019s actual capability envelope.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":243,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Before any new agentic component reaches production, it should be tested against real production sessions in parallel, not to match exact outputs but to measure how consistent the new agent\u2019s output distribution is compared to the established baseline. That comparison is the test. 
Everything else is preparation for it.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":246,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/arunanbumani\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Arun\u00a0Anbumani<\/span><\/a><span data-contrast=\"auto\">,\u00a0principal\u00a0cloud\u00a0infrastructure\u00a0engineer\u00a0at\u00a0<\/span><a href=\"https:\/\/www.oracle.com\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Oracle<\/span><\/a><span data-contrast=\"auto\">, adds the infrastructure dimension to the testing picture that pure model-focused approaches miss. \u201cReplay-style testing against captured production traffic patterns and fault injection introducing controlled disruptions, resource contention, device resets and driver mismatches give teams visibility into how systems respond when the hardware paths underneath the models start behaving differently,\u201d Anbumani explains. \u201cThe broader challenge is that most SRE tooling was built for predictable services, and as infrastructure becomes more heterogeneous and development workflows incorporate AI-assisted tooling, the testing and observability platforms are still evolving to keep up.\u201d<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":243,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Testing infrastructure for agentic systems, therefore, has to be built on the assumption that variability is not the exception but the operating condition. 
Fault injection and replay testing are not edge-case preparation but the core of a testing regime designed for an environment where the normal operating envelope is wider and less well-defined than that of any previous generation of infrastructure.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":247,\"335559740\":276}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">The\u00a0Governance\u00a0Layer\u00a0Nobody\u00a0Built<\/span><span data-ccp-props='{\"335559685\":23,\"335559738\":241}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The hardest part of running agentic systems in production is not building them. Khan, speaking from his experience running agentic infrastructure at TestMu AI (formerly LambdaTest), identifies a difficulty that practitioners who have not yet operated agentic systems at scale may not have encountered. \u201cTraditional runbooks assume failures are obvious,\u201d he observes. \u201cA service crashes, latency spikes, errors propagate. Agents fail subtly. They might complete successfully while doing something completely unintended.\u201d<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":302,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Detecting that failure mode without triggering false positives on every creative decision the agent makes is where existing governance frameworks fall short, and building the controls that close that gap is the work most organizations have not done yet.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":11,\"335559738\":73,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Khan\u2019s approach is to build governance controls at the platform level that operate on behavioral boundaries rather than system metrics. 
Every agent has an explicitly defined capability envelope covering what tools it can invoke, what data it can access, what output formats are valid and what actions require human approval. These are not permissions in the traditional sense. These are runtime assertions, checked at the moment of execution rather than granted at deployment time and assumed to hold thereafter. When an agent invokes a tool outside its envelope or generates output that does not match expected schemas, the event is captured with full context \u2014 the input that triggered it, the reasoning chain the agent followed and the attempted action \u2014 and routed to a dedicated anomaly pipeline separate from standard incident management.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":242,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Building behavioral boundaries as runtime assertions rather than deployment-time permissions is the architectural decision that makes the governance layer enforceable rather than advisory. Permissions that are granted at deployment and never checked again are assumptions, and in an agentic system operating at machine speed, unverified assumptions are where the most consequential failures begin.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":117,\"335559738\":247,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u201cIt is also worth implementing circuit breakers specifically for agent autonomy,\u201d Khan notes. If an agent exceeds a threshold of envelope violations within a time window, it is automatically downgraded to a supervised mode where all actions require human approval. This limits the blast radius of a compromised or misbehaving agent while the team investigates, and it does so without requiring a human to notice the problem first. 
The circuit breaker fires on the pattern, not on a human\u2019s recognition of it, which is the only response mechanism fast enough to be meaningful at the speed agentic systems operate.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":11,\"335559738\":244,\"335559740\":276}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">When\u00a0a\u00a0Reliability\u00a0Failure\u00a0is\u00a0Also\u00a0a\u00a0Security\u00a0Event<\/span><span data-ccp-props='{\"335559685\":23,\"335559738\":242}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The boundary between a reliability failure and a security event in an agentic system is not always a boundary at all. Khan explains how his team encountered this directly when building the classification system for envelope violations. An agent accessing an unauthorized API could be a misconfiguration, which is a reliability problem, or a <\/span><a href=\"https:\/\/infosecrelations.com\/prompt-injection-is-now-a-geopolitical-problem\/\"><span data-contrast=\"none\">prompt\u00a0injection\u00a0attack<\/span><\/a><span data-contrast=\"auto\">, which is a security problem. From the outside, the two events look identical. \u201cWe have adopted a classify later, capture now approach,\u201d he explains. \u201cEvery envelope violation is logged with enough context for both SRE and security review.\u201d A secondary classification system then tags events based on input source. If the anomaly correlates with user-provided content, it is flagged for security review. If it correlates with a model update or configuration change, it routes to SRE.<\/span><span data-ccp-props='{\"201341983\":0,\"335559685\":23,\"335559737\":24,\"335559738\":302,\"335559740\":276}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The classify-later, capture-now approach resolves a real organizational tension that most incident response processes are not designed to handle. 
Forcing immediate classification of an event whose root cause is ambiguous leads either to misrouting, where a security event gets treated as a reliability problem until the damage is done, or to alert fatigue, where every ambiguous event gets escalated to both teams until neither team takes the escalation seriously.<\/span><\/p>\n<p><a href=\"https:\/\/devops.com\/agentic-systems-are-breaking-reliability-frameworks\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>\n<p>\u200b<\/p>","protected":false},"excerpt":{"rendered":"<p>Security\u00a0teams\u00a0have\u00a0spent\u00a0years\u00a0building\u00a0detection\u00a0and\u00a0response\u00a0capabilities\u00a0around\u00a0a\u00a0failure\u00a0mode\u00a0they\u00a0understood\u00a0well\u00a0enough\u00a0to\u00a0instrument\u00a0for.\u00a0Typically,\u00a0a\u00a0service\u00a0misbehaves,\u00a0an\u00a0alert\u00a0fires\u00a0and\u00a0an\u00a0engineer\u00a0investigates.\u00a0This\u00a0kind\u00a0of\u00a0model\u00a0worked\u00a0because\u00a0the\u00a0systems\u00a0producing\u00a0the\u00a0failures\u00a0were\u00a0deterministic\u00a0enough\u00a0that\u00a0misbehavior\u00a0was\u00a0visible,\u00a0measurable\u00a0and\u00a0attributable\u00a0to\u00a0a\u00a0cause\u00a0that\u00a0a\u00a0runbook\u00a0could\u00a0address. However,\u00a0what\u00a0agentic\u00a0systems\u00a0have\u00a0introduced\u00a0into\u00a0that\u00a0environment\u00a0is\u00a0a\u00a0category\u00a0of\u00a0failure\u00a0that\u00a0looks\u00a0nothing\u00a0like\u00a0the\u00a0one\u00a0the\u00a0detection\u00a0infrastructure\u00a0was\u00a0built\u00a0to\u00a0catch\u00a0\u2014\u00a0a\u00a0failure\u00a0that\u00a0completes\u00a0successfully,\u00a0logs\u00a0nothing\u00a0unusual,\u00a0returns\u00a0a\u00a0clean\u00a0status\u00a0code\u00a0and\u00a0disappears\u00a0into\u00a0the transaction\u00a0history\u00a0while\u00a0the\u00a0damage\u00a0it\u00a0caused\u00a0propagates\u00a0quietly\u00a0through\u00a0every\u00a0system\u00a0the agent\u00a0touched. 
\u201cThe\u00a0governance\u00a0gap\u00a0this\u00a0creates\u00a0is\u00a0not\u00a0a\u00a0configuration\u00a0problem\u00a0that\u00a0a\u00a0new\u00a0tool\u00a0can\u00a0close,\u201d\u00a0says\u00a0Shahid\u00a0Ali\u00a0Khan, principal engineer \u2013 DevOps at TestMu\u00a0AI,\u00a0an AI-native software testing platform. It is structural, rooted in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3737,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-3736","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=3736"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3736\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/3737"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=3736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=3736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=3736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}