{"id":4428,"date":"2026-06-26T08:15:08","date_gmt":"2026-06-26T08:15:08","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/06\/26\/the-bottleneck-isnt-coding-anymore-its-verification\/"},"modified":"2026-06-26T08:15:08","modified_gmt":"2026-06-26T08:15:08","slug":"the-bottleneck-isnt-coding-anymore-its-verification","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/06\/26\/the-bottleneck-isnt-coding-anymore-its-verification\/","title":{"rendered":"The Bottleneck Isn\u2019t Coding Anymore. It\u2019s Verification"},"content":{"rendered":"<div><img data-opt-id=46422152  fetchpriority=\"high\" decoding=\"async\" width=\"770\" height=\"330\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2026\/06\/devops_ai_integration_bottleneck_770x330.jpeg\" class=\"attachment-large size-large wp-post-image\" alt=\"\" \/><\/div>\n<p>Last month, one of our autonomous coding agents (not a copilot suggesting inline completions, but a system that reads a ticket, plans a multi-file implementation and opens a PR without a human touching the keyboard) analyzed a ticket, touched 37 files, updated two database migrations and opened a PR in 11 minutes flat. The diff looked clean. Tests passed. The reviewer approved it.<\/p>\n<p>We found the problem at 2:47 a.m. on a Thursday, three days later, during an unrelated log audit. One of our SREs was tailing canary logs trying to trace an intermittent 401, and there it was: A staging secret, printed in plaintext, sitting in a log line the agent had added while \u2018fixing\u2019 a failing test. The agent had also introduced a token audience mismatch. 
The canary environment expected one audience claim; the agent had hardcoded another. Traffic was routing. Nothing was failing. The tokens were just being validated against the wrong audience, which meant our pre-production slice of live traffic had been running with a quietly broken auth contract for 72 hours.<\/p>\n<p>I remember staring at the log line and thinking: This wasn\u2019t written by a junior engineer who forgot to scrub a credential. This was written by a system that has no concept of \u2018forgetting\u2019. It did exactly what it was optimized to do. It made the tests green. The fact that it leaked a secret along the way wasn\u2019t a bug in the agent. It was a gap in our process.<\/p>\n<p>That incident rewired how I think about <a href=\"https:\/\/github.com\/resources\/whitepapers\/enterprise-octoverse\" target=\"_blank\" rel=\"noopener\">engineering leadership<\/a> because correctness has shifted into gaps our tests and reviews were never designed to cover.<\/p>\n<h3>The Craft Transition: Most Teams Stuck Midway<\/h3>\n<p>Software engineering is shifting from a code-centric craft to an intent-centric operating model. Humans describe outcomes. Agents execute multi-step changes. This is happening whether your roadmap accounts for it or not. The only real question is whether it happens inside your governance framework or outside of it.<\/p>\n<p>I see three stages in how organizations are adapting.<\/p>\n<ol>\n<li>The first is the classic SDLC everyone knows: Humans write code, humans review code and humans ship code.<\/li>\n<li>The second is what people are calling \u2018vibe coding\u2019. Prompt-driven, fast, informal. It feels incredibly productive because output goes up. But review capacity stays the same. Engineering incentives stay the same. 
What actually happens is you start accumulating verification debt: The growing gap between how fast you can generate change and how confidently you can prove that change is safe.<\/li>\n<li>The third stage, which very few teams have reached, is spec-driven development: Structured, testable inputs that constrain what the agent can do before it does it. When diffs become cheap to produce, specs become the only scalable way to express intent and control blast radius.<\/li>\n<\/ol>\n<p>Most organizations I talk to are stuck in stage two. They\u2019ve adopted the tools. They haven\u2019t redesigned the process.<\/p>\n<h3>The Data Tells a Story Nobody Wants to Hear<\/h3>\n<p>The DORA 2025 State of AI-assisted Software Development report surveyed nearly 5,000 technology professionals and landed on a finding that should make every engineering leader uncomfortable: AI adoption now correlates with higher throughput, but it also correlates with higher instability. AI acts as what the researchers called a \u2018mirror and a multiplier\u2019. In cohesive organizations with strong platforms and fast feedback loops, AI boosts efficiency. In fragmented ones, it exposes and amplifies every existing weakness. 90% of developers now use AI in their daily work. Over 80% say it has improved their productivity. However, 30% still report little or no trust in the code AI generates.<\/p>\n<p>The Stack Overflow 2025 Developer Survey, which drew 49,000 responses across 166 countries, makes the trust problem even starker. 84% of developers now use or plan to use AI tools, up from 76% in 2024, with 51% using them daily. But trust in AI accuracy has dropped from 40% to 29% in a single year. More developers now actively distrust AI output (46%) than trust it (33%). Only 3% report \u2018highly trusting\u2019 what these tools produce. 
Experienced developers are the most skeptical: Their \u2018highly distrust\u2019 rate is 20%.<\/p>\n<p>Here\u2019s the number that keeps me up at night: 66% of developers say their biggest frustration is AI solutions that are \u2018almost right, but not quite\u2019. 45% say debugging AI-generated code takes longer than writing it themselves would have. Three-quarters of developers say that even in a future where AI can handle most coding tasks, the top reason they\u2019d still ask a human for help is: \u2018When I don\u2019t trust AI\u2019s answers\u2019.<\/p>\n<p>This isn\u2019t a tooling problem. It\u2019s an arithmetic problem. Generation cost has dropped to near zero. Verification cost is fixed or increasing. If you haven\u2019t redesigned your review process, you haven\u2019t gained velocity. You\u2019ve relocated the work. What used to be \u2018writing code\u2019 is now \u2018hunting for subtle, agentic regressions\u2019, and the second job is harder than the first.<\/p>\n<h2>Why Your PR Process is Already Broken<\/h2>\n<p>Here\u2019s the arithmetic that should worry every engineering leader: Generating a thousand lines of code now has near-zero marginal cost. An agent can touch 40 files in minutes. Your senior reviewer, the one you\u2019re counting on to catch problems, is now dealing with diffs they can\u2019t fully reason about in any realistic timeframe.<\/p>\n<p>So they start sampling. Reading some files carefully. Skimming others. Trusting the test suite to catch the rest.<\/p>\n<p>Sampling is not a review strategy. It\u2019s a coping mechanism.<\/p>\n<p>Three things need to change to make agentic workflows survivable.<\/p>\n<ol>\n<li>First, the review has to shift from syntax to intent. You can no longer audit every line, so the question becomes: Did the agent\u2019s output actually satisfy the original constraints? The failure mode you need to worry about now isn\u2019t \u2018tests failed\u2019. 
It\u2019s \u2018tests passed, but intent failed\u2019.<\/li>\n<li>Second, the spec review becomes your primary gate. You review the instructions before the agent runs, not after. By the time you\u2019re reading the diff, half the damage is already done.<\/li>\n<li>Third, command approval becomes your new privilege escalation. You don\u2019t need to approve every command the agent runs. But you absolutely need gates on the high-risk ones: Migrations, secrets access, IAM changes, destructive operations, deploys.<\/li>\n<\/ol>\n<h2>The New Loop<\/h2>\n<p>Here\u2019s what the SDLC looks like when you account for agents:<\/p>\n<p>Spec \u2192 Agent Executes \u2192 CI Verifies \u2192 Human Approves \u2192 Deploy \u2192 Observe \u2192 Spec Updated<\/p>\n<p>That loop only works if you can measure verification capacity, and you can start measuring it next week with three metrics.<\/p>\n<p>Track the median PR review time, but segment it: Agent-generated PRs versus human-written ones. If there\u2019s no meaningful difference, either your agents are producing trivial changes or your reviewers aren\u2019t actually reviewing.<\/p>\n<p>Track the agent failure rate separately. How often do agent-assisted changes get rolled back or cause regressions? This number should be going down. If it isn\u2019t, your specs are too loose.<\/p>\n<p>Stop tracking generic code coverage. Start tracking critical path coverage: Auth, payments, data writes. These are the systems where a missed edge case costs you money, trust or both.<\/p>\n<h2>Specs as the Control Plane<\/h2>\n<p>The spec needs to be a versioned artifact that lives in your repo. It gets reviewed before the agent runs. It gets updated after every change. If you\u2019re familiar with the Model Context Protocol (MCP) or NIST SP 800-218A (the AI community profile for the Secure Software Development Framework), these provide useful scaffolding for connecting agent tooling to policy engines. 
MCP is particularly interesting here because it allows your verification engines to feed live system state back to the agent, closing the gap between the IDE and the runtime. Instead of the agent operating on a stale snapshot of your infrastructure, it can query the actual state of your canary environment, your schema version, and your secrets vault permissions before it writes a single line of code.<\/p>\n<p>Let me give two concrete examples of what spec-driven prevention looks like in practice. For auth constraints: You specify that token audience must match the environment. Dev, QA, canary, prod. Then you write an automated test that fails if the audience differs. If that spec had existed before our canary incident, the agent would have been constrained from introducing the mismatch in the first place, and the test would have caught it if it did anyway.<\/p>\n<p>For migration safety: You require every database migration to ship with an automated rollback script and a pre-migration check that validates against a schema snapshot. The agent doesn\u2019t get to run the migration without those artifacts present.<\/p>\n<h3>A Template You Can Actually Use<\/h3>\n<p>I\u2019ve been iterating on a spec format with my team. 
Here\u2019s roughly where we\u2019ve landed:<\/p>\n<p>Feature: [Name]\n<\/p>\n<p>Context: [Intent and Business Logic]\n<\/p>\n<p>Governance and Risk<\/p>\n<p>Risk Tier: [Prototype\/Customer-Facing\/Regulated]\n<\/p>\n<p>Change Budget: [Max Files\/LOC\/Commands Allowed Without Escalation]\n<\/p>\n<p>Data Boundaries: [PII, PCI, GDPR or Specific Files the Agent Must Not Read]\n<\/p>\n<p>Security Controls: [Automatic Secrets Redaction in Logs, Scoped Time-Bound Tokens]\n<\/p>\n<p>Contracts and Constraints<\/p>\n<p>API\/Data Contracts: [Endpoints, Inputs, Outputs, Error States]\n<\/p>\n<p>Security Policy: [Auth Requirements, Secrets Scanning, Dependency Rules]\n<\/p>\n<p>Verification and Observability<\/p>\n<p>Acceptance Criteria: [Given\/When\/Then Scenarios]\n<\/p>\n<p>Test Plan: [Unit, Integration and Negative Tests Required]\n<\/p>\n<p>Rollback Verification: [What Specific Metric Proves a Rollback Succeeded]\n<\/p>\n<p>Agent Execution<\/p>\n<p>Files to Touch: [Scope]\n<\/p>\n<p>Rollout Plan: [Feature Flags, Canary Steps, Manual Gates]\n<\/p>\n<p>The \u2018change budget\u2019 line is the one most teams skip, and it\u2019s the one that saves you. Think of it as the SRE error budget concept applied to agent execution: Just as an error budget caps how much reliability you\u2019re willing to burn in exchange for velocity, a change budget caps how much codebase surface area an agent can touch before a human has to re-authorize. If you tell the agent it can touch a maximum of 12 files without escalation, you\u2019ve put a ceiling on blast radius before anything runs. Just as an error budget is enforced by the platform (not by hoping engineers remember to check), a change budget should be enforced by a pre-submit hook that counts modified files and blocks the PR if the agent exceeds its allowance. If it\u2019s not automated, it\u2019s not a budget. 
It\u2019s a suggestion.<\/p>\n<h3>What \u2018Senior Engineer\u2019 Means Now<\/h3>\n<p>Writing clever syntax is losing value fast. Increasingly, the differentiator for senior engineers is verification-driven design: The ability to write specs that constrain failure, critique agent output for intent mismatches and harden systems against the kinds of regressions that pass all your tests.<\/p>\n<p>This needs to reach your hiring process. I\u2019m not saying throw out your fundamentals interviews. Keep them. But add an exercise that actually tests agent-era thinking. Here\u2019s what I mean. Give the candidate a feature: \u2018Add a password reset flow to the user service\u2019. Then show them two specs.<\/p>\n<p>The first spec looks like this:<\/p>\n<p>Feature: Password Reset<\/p>\n<p>Context: Users need to reset passwords via email link.<\/p>\n<p>Contracts:<\/p>\n<p>POST \/reset-request \u2192 sends email with reset token<\/p>\n<p>POST \/reset-confirm \u2192 accepts token + new password<\/p>\n<p>Test Plan:<\/p>\n<p>Happy Path: Valid token resets password<\/p>\n<p>Integration Test: Email delivery<\/p>\n<p>That spec would have seemed perfectly adequate two years ago. An agent given this spec would produce working code. The tests would pass, and you\u2019d have no idea whether the agent had introduced a token that never expires or stored the reset token in plaintext or allowed unlimited reset attempts from a single IP.<\/p>\n<p>Now show them the senior spec:<\/p>\n<p>Feature: Password Reset<\/p>\n<p>Context: Users need to reset passwords via email link.<\/p>\n<p>Governance<\/p>\n<p>Risk Tier: Customer-Facing\/Regulated (handles auth credentials)<\/p>\n<p>Change Budget: Max eight files. 
Agent must not modify auth_provider.go or session_store.go.<\/p>\n<p>Data Boundaries: Agent must not log, print or persist the reset token in any form outside the tokens table.<\/p>\n<p>Security Controls: All new log statements must pass secrets-pattern scan before merge.<\/p>\n<p>Contracts<\/p>\n<p>POST \/reset-request \u2192 sends email with reset token<\/p>\n<p>Token: Cryptographically random, 256-bit, stored as bcrypt hash<\/p>\n<p>Expiry: 15 minutes from creation, enforced server-side<\/p>\n<p>Rate Limit: Max 3 requests per email per hour; return 429<\/p>\n<p>POST \/reset-confirm \u2192 accepts token + new password<\/p>\n<p>Token is single-use; delete on consumption regardless of outcome<\/p>\n<p>Password must meet policy (min 12 chars, breach-list check)<\/p>\n<p>On Success: Invalidate all existing sessions for that user<\/p>\n<p>Test Plan<\/p>\n<p>Happy Path: Valid token resets password, old sessions invalidated<\/p>\n<p>Expired token returns 410<\/p>\n<p>Reused token returns 410<\/p>\n<p>Brute Force: 4th reset request within 1 hour returns 429<\/p>\n<p>Negative: Agent did not modify auth_provider.go or session_store.go<\/p>\n<p>Negative: Grep codebase for plaintext token outside tokens table<\/p>\n<p>Integration: Email delivery, token expiry enforcement<\/p>\n<p>The gap between those two specs is exactly where every agentic regression hides. The first spec tells the agent what to build. The second spec tells the agent what it must not break. What you\u2019re really testing for is negative space thinking: A junior engineer reads code and sees what is there. A senior engineer reads a spec and sees what is missing. The constraints that weren\u2019t written. The failure modes that weren\u2019t named. The files that should have been marked off-limits but weren\u2019t. Ask your candidate which failure modes the first spec misses, and why the constraints in the second one exist. 
You\u2019ll learn more about their engineering judgement from that conversation than from watching them invert a binary tree.<\/p>\n<h3>A 30\/60\/90-Day Plan for Leaders<\/h3>\n<p>If you\u2019re an engineering leader reading this and wondering where to start, here\u2019s a sequence that\u2019s worked for me.<\/p>\n<p>In the first 30 days, focus on guardrails. Define which domains agents are allowed to operate in (internal tooling is usually a good starting point) and which are off-limits (billing and payments). Standardize an AGENTS.md file for every repository that makes the rules explicit, and mandate audit logging for every terminal command an agent runs. You need the paper trail before you need anything else.<\/p>\n<p>In the next 30 days, start piloting. Pick high-toil work that nobody wants to do: Dependency upgrades, boilerplate generation, repetitive refactors. Measure the verification gap, which is the time difference between reviewing an agent-generated PR and a human-written one. Put command approval gates on migrations and IAM changes. See what breaks.<\/p>\n<p>By day 90, you should be ready to rewire the process properly. Integrate agent execution into your CI pipelines. Update your performance review criteria to reward spec quality and verification strength. Publish the spec-driven standard as the default for all new services.<\/p>\n<h3>The Bet You Are Making<\/h3>\n<p>The Stack Overflow numbers paint a picture of an industry that has adopted the tools but hasn\u2019t solved the trust problem: 84% adoption, 29% trust. That gap isn\u2019t closing on its own. The DORA data confirms that speed without stability is just accelerated chaos.<\/p>\n<p>The teams that come out ahead in the next two years won\u2019t be the ones that adopted agents fastest. They\u2019ll be the ones that invested in verification capacity with the same seriousness they invest in production capacity: Tests, policies, audit logs, rollout controls.<\/p>\n<p>The bottleneck has moved. 
Your process needs to move with it.<\/p>\n<p><a href=\"https:\/\/devops.com\/the-bottleneck-isnt-coding-anymore-its-verification\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Last month, one of our autonomous coding agents (not a copilot suggesting inline completions, but a system that reads a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4429,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-4428","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4428","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=4428"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4428\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/4429"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=4428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=4428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=4428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}