{"id":4468,"date":"2026-06-30T12:17:59","date_gmt":"2026-06-30T12:17:59","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/06\/30\/your-ai-testing-framework-might-be-passing-tests-it-should-be-failing\/"},"modified":"2026-06-30T12:17:59","modified_gmt":"2026-06-30T12:17:59","slug":"your-ai-testing-framework-might-be-passing-tests-it-should-be-failing","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/06\/30\/your-ai-testing-framework-might-be-passing-tests-it-should-be-failing\/","title":{"rendered":"Your AI Testing Framework Might Be Passing Tests It Should Be Failing"},"content":{"rendered":"<div><img data-opt-id=723960471  fetchpriority=\"high\" decoding=\"async\" width=\"770\" height=\"330\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2025\/06\/AI-model.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"\" \/><\/div>\n<p><img data-opt-id=1667573176  fetchpriority=\"high\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2025\/06\/AI-model-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" \/><\/p>\n<p>The <a href=\"https:\/\/devops.com\/making-the-case-for-ai-in-testing-security-and-delivery-not-just-coding\/\" target=\"_blank\" rel=\"noopener\">integration of AI into software testing pipelines<\/a> has moved past the experimental phase in most engineering organizations that were going to try it.<\/p>\n<p>Engineering teams are not evaluating whether AI belongs in their testing workflows anymore, but are discovering, sometimes painfully, that the architectural decisions they made when integrating it are the ones that will determine whether it improves their release confidence or quietly erodes it.<\/p>\n<p>This is because the failure modes that are flaring up from production deployments are not coming from bad models or insufficient compute but from teams that did not draw a clear enough 
line between where AI is allowed to influence the pipeline and where deterministic controls have to stay in charge.<\/p>\n<p>The testing pipeline is one of the few places in software delivery where ambiguity is genuinely costly, comments Mayank Bhola, co-founder and head of products at TestMu AI, an AI-native software testing platform. \u201cA pass has to mean something; or a fail has to be trustworthy enough to block a deployment. The moment a non-deterministic system is allowed to make gate-keeping calls directly, without validation and without a deterministic layer sitting between the AI\u2019s judgment and the deployment decision, the pipeline loses the reliability guarantee that makes it worth running in the first place.\u201d<\/p>\n<p>This distinction, between where AI can assist and where deterministic controls must govern, is what separates engineering teams getting durable value from those accumulating hidden risks.<\/p>\n<h3>The False Promise of Replacing Locators<\/h3>\n<p>Visual regression testing was one of the first areas where multimodal AI entered the testing conversation with real credibility. The pitch was straightforward: replace brittle DOM-based locators, the XPath and CSS selectors that break every time a developer rearranges a component, with models that understand what a button does rather than where it sits in the tree. That\u2019s not wrong in principle. However, the implementation choices teams make to get there determine whether they end up with a more resilient test suite or a more expensive source of false positives.<\/p>\n<p>Otso Virtanen, SQS product lead at Qt Group, takes a position that cuts against the default impulse to reach for the largest available foundation model. 
Qt\u2019s strategy prioritizes purpose-built computer vision and object tree analysis over general-purpose multimodal models, and the reasoning is specific rather than philosophical.<\/p>\n<p>General-purpose models introduce latency, cost and hallucination risk that purpose-built approaches avoid by design. However, the confidence threshold problem, preventing false positives when a button simply changes shade rather than changes function, is better solved through semantic understanding of elements than through raw pixel comparison or foundation model inference. \u201cOur models use a semantic understanding of elements rather than raw pixel comparison,\u201d Virtanen explains, \u201cwhich prevents false positives when minor aesthetic changes occur.\u201d A button that changes shade is still the same button performing the same function, and a model with semantic understanding of what that button does can make that judgment reliably and cheaply in ways that a general-purpose multimodal model cannot without careful threshold tuning that reintroduces the fragility you were trying to eliminate.<\/p>\n<p>Srikumar Ramanathan, chief solutions officer at Mphasis, has approached the same problem from the infrastructure side rather than the model selection side, and arrived at an implementation that demonstrates what semantic discovery actually looks like in practice. His team uses Playwright to capture the DOM and accessibility tree to provide the full structural context of a page to an LLM via Bedrock, enabling the automation to interact with elements based on their functional intent rather than their exact position in the code.<\/p>\n<p>As a consequence, the test framework remains resilient through layout shifts and brand updates because it is not coupled to the visual or positional properties of the elements it tests. 
\u201cSince we are not tied to pixel-perfect matching or specific CSS paths,\u201d Ramanathan shares, \u201cthe system focuses purely on whether the application is functionally sound.\u201d That reframe, from visual fidelity to functional intent, is the architectural shift that makes AI-driven visual testing practical at scale rather than impressive in a demo.<\/p>\n<h3>Self-Healing Tests and the Verification Question Nobody Asks<\/h3>\n<p>Self-healing test frameworks promise to keep pipelines green by automatically finding and updating broken element references when the underlying code changes. The promise is real, not exaggerated. The risk that receives less attention is that a self-healing framework can latch onto the wrong element just as confidently as the right one. If that fix is committed to the repository without verification, the test now validates different behavior than it was originally written to check. The framework keeps the pipeline green while the bug it was supposed to catch goes uncaught.<\/p>\n<p>Ramanathan\u2019s framework for addressing this, which he shared in our email interview, operates across three verification layers that together prevent self-healing from becoming a mechanism for silently lowering the quality bar. The first is contextual runtime checks that cross-reference visual attributes and semantic context to ensure the framework found the intended element and not merely an element with a matching ID. The second is outcome validation that confirms the expected state change actually occurred, so that a delete action is verified to have removed the item rather than just verified to have clicked a button. 
The third is shift left for the fix itself, where the framework automatically generates pull requests to update the source code rather than applying a runtime patch that only the CI environment knows about.<\/p>\n<p>\u201cThis approach moves validation from did we interact with an element to did we achieve the intended outcome,\u201d Ramanathan argues, and that shift is what turns self-healing from a reliability crutch into a genuine quality feedback loop.<\/p>\n<p>Bhola, speaking from his experience building the product architecture at TestMu AI, identifies where this verification architecture has to connect to product design to be sustainable at scale. The approval model for self-heals cannot be universal without eliminating the productivity gain that makes self-healing worth building, but it also cannot be absent without losing the ability to catch the class of mistake the AI cannot recognize as a mistake. \u201cAt TestMu AI, we have found that the approval model needs to be risk-tiered rather than universal,\u201d he underscores. \u201cLow-risk self-heals, where the semantic intent of the original test is clearly preserved, can go straight to a pull request. Higher-risk changes, where a test now validates a different user flow than it was originally written for, need a human to confirm the intent before the fix becomes permanent.\u201d Building that risk classification into the framework itself rather than leaving it to individual engineer judgment is where the real product architecture work sits, and most teams that have implemented self-healing without it are discovering the gap through incidents rather than through design.<\/p>\n<h3>Synthetic Data and the Fidelity Debate<\/h3>\n<p>Staging environments that do not mirror production reality are testing environments that do not test what they claim to test. 
Synthetic data generation has emerged as the mechanism for closing that gap without exposing real user data in non-production systems, and the architectural decisions around how to generate it reveal a genuine and productive disagreement between practitioners who have thought about the problem carefully.<\/p>\n<p>Ramanathan\u2019s team has made a deliberate choice to build synthetic data generation on deep learning architectures, specifically GANs and VAEs, rather than LLM-based workflows, and the reasoning addresses failure modes that LLM-based approaches introduce at scale. LLMs operating at the token level tend to hallucinate character-level patterns and violate strict schema boundaries, producing format drift in generated data that does not match production schemas with the precision that reliability testing requires. At large scale, LLMs lose variance and default to repetitive, statistically safe patterns, which means the synthetic data set converges on the average case rather than preserving the full entropy of the source distribution. GANs and VAEs avoid both failure modes by design because the adversarial training process naturally converges on the sample data\u2019s probability distribution, and post-generation rejection sampling filters any generated points that deviate from the target density before they enter the data set.<\/p>\n<p>Ankit Awasthi, director of engineering at Twilio, takes a position that reframes what synthetic data is actually for and arrives at a different architectural choice as a result. A mirror image of production is not always the goal, and treating statistical fidelity as the primary objective misses the most valuable use case for synthetic data in testing environments. \u201cWe often intentionally skew synthetic data toward edge cases and outliers rather than just the happy path of average traffic,\u201d he explains. 
When the goal is to find the failure modes that normal production traffic will never exercise, deliberately breaking from the production distribution is not a shortcoming of the synthetic data. That is the point. When statistical fidelity is genuinely required, his team trains generative models on anonymized production data to reproduce the shape and distribution of real-world traffic without exposing content, but that is a specific use case rather than a universal principle.<\/p>\n<p>Both positions are defensible because they are solving for different objectives. Ramanathan\u2019s fidelity-first approach is right when the staging environment needs to reproduce production behavior accurately for reliability testing. Awasthi\u2019s edge-case skew is right when the goal is stress-testing the failure modes that production traffic would never surface on its own. The mistake is treating one as universally correct and the other as a workaround.<\/p>\n<h3>Determinism in a Non-Deterministic Pipeline<\/h3>\n<p>Integrating AI into a CI\/CD pipeline where flaky tests block deployment creates a tension that temperature controls and seed pinning cannot fully resolve. LLMs are non-deterministic by nature, and the mechanisms available to reduce variance, temperature set to zero, fixed prompts and model version pinning, reduce it without eliminating it. Model updates and context sensitivity introduce drift that these controls cannot anticipate. A pipeline that is supposed to give binary, trustworthy gate-keeping decisions cannot rest its reliability on probabilistic controls alone.<\/p>\n<p>The architectural answer that has emerged from teams who have worked through this is a clean separation between where AI is allowed to operate and where it is not. Virtanen\u2019s approach at Qt Group draws the line with particular clarity in regulated industries where zero-tolerance CI\/CD requirements leave no room for ambiguity. 
Non-deterministic foundation models are used for test script generation and maintenance, not for the execution runtime. The execution engine stays deterministic. \u201cA pass truly means the software is stable from a testing perspective,\u201d Virtanen argues, \u201cbecause the component of the pipeline making the gate-keeping decision is not the component with non-deterministic behavior.\u201d<\/p>\n<p>Awasthi pushes that architectural principle further into the evaluation model itself. Rather than brittle exact-match assertions that a non-deterministic system will inevitably fail at some rate, his team has adopted statistical quality control for AI agent evaluation. Batched evaluations establish baselines with specific error bars, and deployments halt only if aggregate metrics regress beyond the statistical tolerance rather than at the first deviation from an exact expected output. \u201cWhen testing agents, we validate the reasoning trace and internal state changes,\u201d Awasthi explains, \u201censuring the decision-making logic remains sound even if the specific generative text naturally drifts.\u201d That reframe separates what matters (the reasoning process staying consistent) from what does not (the exact wording of a generated response varying slightly between runs).<\/p>\n<p>Shahid Ali Khan, principal engineering DevOps at TestMu AI, frames the infrastructure boundary that makes this separation sustainable in production. \u201cAt TestMu AI, we draw a hard line between the generation layer and the execution layer,\u201d he highlights. \u201cAI can generate test scripts for our customers, suggest fixes for broken selectors and surface anomalies in test results, but the execution runtime has to stay deterministic. A pass or fail from the pipeline needs to mean something absolute.\u201d Every AI-generated output goes through schema validation and deterministic downstream logic before it can influence a pipeline decision. 
Temperature controls and model version pinning reduce variance, but they do not eliminate it, and any team relying on them as the primary stability mechanism is one model update away from an incident that those controls will not prevent.<\/p>\n<h3>The Bottleneck That Moved<\/h3>\n<p>Khan names a production reality that most coverage of AI in software testing does not address directly. Coding agents have become capable enough that developers are leaning on them heavily, and the consequence is a PR volume that the review and testing infrastructure was not built to handle. Developers who used to open two PRs per week are now opening two to five per day, each containing code that a coding agent generated at a pace no human reviewer can match. The quality problem compounds because AI-generated code does not produce fewer bugs per line than human-written code. It produces them at the same rate but at a dramatically higher volume.<\/p>\n<p>\u201cYou are in a problem,\u201d Khan argues. \u201cYou are a technical leader, and you are seeing your team leaning on AI and spitting out a lot of features, but then you are at risk of having code rot, having quality rot. The first day you have a CrowdStrike moment in your software, who do you blame? Do you blame the AI, or do you blame yourself for not making it clear that quality matters?\u201d The answer the testing pipeline gives to that question depends entirely on whether the AI integration was built with discipline around the layers that have to stay deterministic, or whether speed and convenience drove the architectural choices and the structural gaps are waiting to surface.<\/p>\n<p>The teams getting durable value from AI in their testing pipelines have accepted a straightforward premise. AI belongs in the generation and analysis layers. Deterministic logic belongs in the gate-keeping layer. The boundary between them is not a product feature that can be configured. 
It is an architectural decision that has to be made deliberately at the start and defended consistently as the system scales.<\/p>\n<p><a href=\"https:\/\/devops.com\/your-ai-testing-framework-might-be-passing-tests-it-should-be-failing\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>\n<p>\u200b<\/p>","protected":false},"excerpt":{"rendered":"<p>The integration of AI into software testing pipelines has moved past the experimental phase in most engineering organizations that were [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4469,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-4468","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4468","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=4468"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4468\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/4469"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=4468"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=4468"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=4468"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}