The Bottleneck Isn’t Coding Anymore. It’s Verification

Last month, one of our autonomous coding agents (not a copilot suggesting inline completions, but a system that reads a ticket, plans a multi-file implementation and opens a PR without a human touching the keyboard) analyzed a ticket, touched 37 files, updated two database migrations and opened a PR in 11 minutes flat. The diff looked clean. Tests passed. The reviewer approved it.

We found the problem at 2:47 a.m. on a Thursday, three days later, during an unrelated log audit. One of our SREs was tailing canary logs trying to trace an intermittent 401, and there it was: A staging secret, printed in plaintext, sitting in a log line the agent had added while ‘fixing’ a failing test. The agent had also introduced a token audience mismatch. The canary environment expected one audience claim; the agent had hardcoded another. Traffic was routing. Nothing was failing. The tokens were just being validated against the wrong audience, which meant our pre-production slice of live traffic had been running with a quietly broken auth contract for 72 hours.

I remember staring at the log line and thinking: This wasn’t written by a junior engineer who forgot to scrub a credential. This was written by a system that has no concept of ‘forgetting’. It did exactly what it was optimized to do. It made the tests green. The fact that it leaked a secret along the way wasn’t a bug in the agent. It was a gap in our process.

That incident rewired how I think about engineering leadership because correctness has shifted into gaps our tests and reviews were never designed to cover.

The Craft Transition: Most Teams Stuck Midway

Software engineering is shifting from a code-centric craft to an intent-centric operating model. Humans describe outcomes. Agents execute multi-step changes. This is happening whether your roadmap accounts for it or not. The only real question is whether it happens inside your governance framework or outside of it.

I see three stages in how organizations are adapting.

  1. The first is the classic SDLC everyone knows: Humans write code, humans review code and humans ship code.
  2. The second is what people are calling ‘vibe coding’. Prompt-driven, fast, informal. It feels incredibly productive because output goes up. But review capacity stays the same. Engineering incentives stay the same. What actually happens is you start accumulating verification debt: The growing gap between how fast you can generate change and how confidently you can prove that change is safe.
  3. The third stage, which very few teams have reached, is spec-driven development. Structured, testable inputs that constrain what the agent can do before it does it. When diffs become cheap to produce, specs become the only scalable way to express intent and control blast radius.

Most organizations I talk to are stuck in stage two. They’ve adopted the tools. They haven’t redesigned the process.

The Data Tells a Story Nobody Wants to Hear

The DORA 2025 State of AI-assisted Software Development report surveyed nearly 5,000 technology professionals and landed on a finding that should make every engineering leader uncomfortable: AI adoption now correlates with higher throughput, but it also correlates with higher instability. AI acts as what the researchers called a ‘mirror and a multiplier’. In cohesive organizations with strong platforms and fast feedback loops, AI boosts efficiency. In fragmented ones, it exposes and amplifies every existing weakness. 90% of developers now use AI in their daily work. Over 80% say it has improved their productivity. However, 30% still report little or no trust in the code AI generates.

The Stack Overflow 2025 Developer Survey, which drew more than 49,000 responses across 177 countries, makes the trust problem even starker. 84% of developers now use or plan to use AI tools, up from 76% in 2024, with 51% using them daily. But trust in AI accuracy has dropped from 40% to 29% in a single year. More developers now actively distrust AI output (46%) than trust it (33%). Only 3% report ‘highly trusting’ what these tools produce. Experienced developers are the most skeptical: Their ‘highly distrust’ rate is 20%.

Here’s the number that keeps me up at night: 66% of developers say their biggest frustration is AI solutions that are ‘almost right, but not quite’. 45% say debugging AI-generated code takes longer than writing it themselves would have. Three-quarters of developers say that even in a future where AI can handle most coding tasks, the top reason they’d still ask a human for help is: ‘When I don’t trust AI’s answers’.

This isn’t a tooling problem. It’s an arithmetic problem. Generation cost is approaching zero. Verification cost is fixed or increasing. If you haven’t redesigned your review process, you haven’t gained velocity. You’ve relocated the work. What used to be ‘writing code’ is now ‘hunting for subtle, agentic regressions’, and the second job is harder than the first.

Why Your PR Process is Already Broken

Here’s the arithmetic that should worry every engineering leader: Generating a thousand lines of code now has near-zero marginal cost. An agent can touch 40 files in minutes. Your senior reviewer, the one you’re counting on to catch problems, is now dealing with diffs they can’t fully reason about in any realistic timeframe.

So they start sampling. Reading some files carefully. Skimming others. Trusting the test suite to catch the rest.

Sampling is not a review strategy. It’s a coping mechanism.

Three things need to change to make agentic workflows survivable.

  1. First, the review has to shift from syntax to intent. You can no longer audit every line, so the question becomes: Did the agent’s output actually satisfy the original constraints? The failure mode you need to worry about now isn’t ‘tests failed’. It’s ‘tests passed, but intent failed’.
  2. Second, the spec review becomes your primary gate. You review the instructions before the agent runs, not after. By the time you’re reading the diff, half the damage is already done.
  3. Third, command approval becomes your new privilege escalation. You don’t need to approve every command the agent runs. But you absolutely need gates on the high-risk ones: Migrations, secrets access, IAM changes, destructive operations, deploys.
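A command approval gate can start as a small classifier that sits in front of the agent’s shell. The following is a minimal sketch, not a complete policy engine: the category names and regular expressions are illustrative assumptions, and a real deployment would load them from reviewed policy rather than hardcode them.

```python
import re

# Hypothetical high-risk patterns, one per category named in the list above.
# These are illustrative only; tune them to the tools your agents can reach.
HIGH_RISK_PATTERNS = {
    "migration": re.compile(r"\b(migrate|alembic|flyway)\b"),
    "secrets": re.compile(r"\b(vault|secretsmanager)\b"),
    "iam": re.compile(r"\biam\b"),
    "destructive": re.compile(r"\brm\s+-rf?\b|\bdrop\s+table\b", re.IGNORECASE),
    "deploy": re.compile(r"\b(kubectl\s+apply|terraform\s+apply|helm\s+upgrade)\b"),
}


def risk_categories(command: str) -> list[str]:
    """Return the high-risk categories a shell command falls into."""
    return [name for name, pattern in HIGH_RISK_PATTERNS.items()
            if pattern.search(command)]


def requires_approval(command: str) -> bool:
    """True when a human must approve the command before the agent runs it."""
    return bool(risk_categories(command))
```

Under these assumptions, `requires_approval("pytest tests/unit")` is false and the agent proceeds, while `alembic upgrade head` is held for a human. Pattern matching like this is a floor, not a ceiling: it will miss commands wrapped in scripts, so it belongs alongside scoped credentials rather than in place of them.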

The New Loop

Here’s what the SDLC looks like when you account for agents:

Spec → Agent Executes → CI Verifies → Human Approves → Deploy → Observe → Spec Updated

That loop only works if you can measure verification capacity, and you can start measuring it next week with three metrics.

Track the median PR review time, but segment it: Agent-generated PRs versus human-written ones. If there’s no meaningful difference, either your agents are producing trivial changes or your reviewers aren’t actually reviewing.
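The segmentation itself is a few lines once PRs carry a label for who wrote them. This sketch assumes each PR is a record with a hypothetical `author_type` field (`"agent"` or `"human"`) and a `review_hours` field; how you derive those from your code host is up to you.

```python
from statistics import median


def median_review_hours(prs):
    """Group PRs by author type and return the median review time for each group."""
    by_author = {}
    for pr in prs:
        by_author.setdefault(pr["author_type"], []).append(pr["review_hours"])
    return {author: median(hours) for author, hours in by_author.items()}


# Made-up sample data, for illustration only.
sample_prs = [
    {"author_type": "agent", "review_hours": 0.5},
    {"author_type": "agent", "review_hours": 1.0},
    {"author_type": "agent", "review_hours": 0.25},
    {"author_type": "human", "review_hours": 2.0},
    {"author_type": "human", "review_hours": 4.0},
]
medians = median_review_hours(sample_prs)
```

The median is used rather than the mean so that one enormous PR does not hide what a typical review looks like.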

Track the agent failure rate separately. How often do agent-assisted changes get rolled back or cause regressions? This number should be going down. If it isn’t, your specs are too loose.

Stop tracking generic code coverage. Start tracking critical path coverage: Auth, payments, data writes. These are the systems where a missed edge case costs you money, trust or both.
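Critical path coverage can be computed from the same per-file numbers your coverage tool already emits, filtered to the paths that matter. This is a sketch under assumptions: the glob patterns and directory layout are hypothetical, and the input is a plain mapping of file path to (covered lines, total lines).

```python
from fnmatch import fnmatch

# Hypothetical locations of auth, payments and data-write code.
CRITICAL_GLOBS = ["services/auth/*", "services/payments/*", "*/writers/*"]


def critical_path_coverage(file_coverage, globs=CRITICAL_GLOBS):
    """Return line coverage (0.0 to 1.0) over critical files, or None if none match."""
    covered = total = 0
    for path, (hit_lines, total_lines) in file_coverage.items():
        if any(fnmatch(path, pattern) for pattern in globs):
            covered += hit_lines
            total += total_lines
    return covered / total if total else None


# Made-up numbers: the low-coverage CLI tool does not count toward the metric.
report = {
    "services/auth/tokens.py": (80, 100),
    "services/payments/charge.py": (50, 100),
    "tools/cli.py": (10, 100),
}
ratio = critical_path_coverage(report)
```

Returning `None` when nothing matches is deliberate: a missing critical path should show up as a configuration problem, not as 0% or 100%.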

Specs as the Control Plane

The spec needs to be a versioned artifact that lives in your repo. It gets reviewed before the agent runs. It gets updated after every change. If you’re familiar with the Model Context Protocol (MCP) or NIST’s SP 800-218A (the Secure Software Development Framework community profile for generative AI), these provide useful scaffolding for connecting agent tooling to policy engines. MCP is particularly interesting here because it allows your verification engines to feed live system state back to the agent, closing the gap between the IDE and the runtime. Instead of the agent operating on a stale snapshot of your infrastructure, it can query the actual state of your canary environment, your schema version and your secrets vault permissions before it writes a single line of code.

Let me give two concrete examples of what spec-driven prevention looks like in practice. For auth constraints: You specify that token audience must match the environment. Dev, QA, canary, prod. Then you write an automated test that fails if the audience differs. If that spec had existed before our canary incident, the agent would have been constrained from introducing the mismatch in the first place, and the test would have caught it if it did anyway.
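The audience check described above can be very small. In this sketch the audience strings are invented placeholders and `configured` stands in for however your services expose their token settings; the point is only that the expected value per environment lives in one reviewed place and anything else fails.

```python
# Hypothetical audience values; the real ones come from your identity provider.
EXPECTED_AUDIENCE = {
    "dev": "api.dev.example.internal",
    "qa": "api.qa.example.internal",
    "canary": "api.canary.example.internal",
    "prod": "api.example.com",
}


def audience_violations(configured):
    """Compare configured audiences with the spec and describe every mismatch."""
    problems = []
    for env, expected in EXPECTED_AUDIENCE.items():
        actual = configured.get(env)
        if actual != expected:
            problems.append(f"{env}: expected {expected!r}, got {actual!r}")
    return problems
```

Wired into CI as `assert audience_violations(load_config()) == []` (with `load_config` being whatever loader you already have), a hardcoded audience in the wrong environment fails the build instead of routing quietly for three days.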

For migration safety: You require every database migration to ship with an automated rollback script and a pre-migration check that validates against a schema snapshot. The agent doesn’t get to run the migration without those artifacts present.
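One way to enforce the artifact requirement is a check over the changed files in a PR. The naming convention here (`<name>.up.sql` paired with `<name>.down.sql` and `<name>.precheck.sql` inside a `migrations` directory) is an assumption made for the sketch, not a standard.

```python
from pathlib import PurePosixPath

# Assumed companions for every forward migration: a rollback and a pre-check.
REQUIRED_SUFFIXES = (".down.sql", ".precheck.sql")


def missing_migration_artifacts(changed_files):
    """List the companion files that each forward migration in the change set lacks."""
    files = set(changed_files)
    missing = []
    for path in sorted(files):
        p = PurePosixPath(path)
        if p.parent.name != "migrations" or not p.name.endswith(".up.sql"):
            continue
        stem = p.name[: -len(".up.sql")]
        for suffix in REQUIRED_SUFFIXES:
            required = str(p.parent / f"{stem}{suffix}")
            if required not in files:
                missing.append(required)
    return missing
```

A non-empty result blocks the agent before the migration runs. Checking for presence says nothing about whether the rollback script is correct, so this gate complements the schema snapshot validation rather than replacing it.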

A Template You Can Actually Use

I’ve been iterating on a spec format with my team. Here’s roughly where we’ve landed:

Feature: [Name]

Context: [Intent and Business Logic]

Governance and Risk

Risk Tier: [Prototype/Customer-Facing/Regulated]

Change Budget: [Max Files/LOC/Commands Allowed Without Escalation]

Data Boundaries: [PII, PCI, GDPR or Specific Files the Agent Must Not Read]

Security Controls: [Automatic secrets redaction in logs, scoped time-bound tokens]

Contracts and Constraints

API/Data Contracts: [Endpoints, Inputs, Outputs, Error States]

Security Policy: [Auth Requirements, Secrets Scanning, Dependency Rules]

Verification and Observability

Acceptance Criteria: [Given/When/Then Scenarios]

Test Plan: [Unit, Integration and Negative Tests Required]

Rollback Verification: [What Specific Metric Proves a Rollback Succeeded]

Agent Execution

Files to Touch: [Scope]

Rollout Plan: [Feature Flags, Canary Steps, Manual Gates]

The ‘change budget’ line is the one most teams skip, and it’s the one that saves you. Think of it as the SRE error budget concept applied to agent execution: Just as an error budget caps how much reliability you’re willing to burn in exchange for velocity, a change budget caps how much codebase surface area an agent can touch before a human has to re-authorize. If you tell the agent it can touch a maximum of 12 files without escalation, you’ve put a ceiling on blast radius before anything runs. Just as an error budget is enforced by the platform (not by hoping engineers remember to check), a change budget should be enforced by a pre-submit hook that counts modified files and blocks the PR if the agent exceeds its allowance. If it’s not automated, it’s not a budget. It’s a suggestion.
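The core of such a hook is a pure function over the list of changed files. This is a sketch: the function name and the `forbidden` parameter are illustrative, and it assumes the caller supplies the file list, for example from `git diff --name-only` against the target branch.

```python
def budget_violations(changed_files, max_files, forbidden=frozenset()):
    """Return human-readable reasons the change exceeds its budget (empty if OK)."""
    unique = sorted(set(changed_files))
    violations = []
    if len(unique) > max_files:
        violations.append(f"{len(unique)} files changed; budget is {max_files}")
    for path in unique:
        if path in forbidden:
            violations.append(f"forbidden file modified: {path}")
    return violations
```

The hook exits non-zero whenever the returned list is non-empty and prints the reasons, so the escalation request to a human already says what was exceeded. The same function covers off-limits files, which keeps the file ceiling and the no-touch list in one enforced place.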

What ‘Senior Engineer’ Means Now

Writing clever syntax is losing value fast. The differentiator for senior engineers is increasingly verification-driven design: The ability to write specs that constrain failure, critique agent output for intent mismatches and harden systems against the kinds of regressions that pass all your tests.

This needs to reach your hiring process. I’m not saying throw out your fundamentals interviews. Keep them. But add an exercise that actually tests agent-era thinking. Here’s what I mean. Give the candidate a feature: ‘Add a password reset flow to the user service’. Then show them two specs.

The first spec looks like this:

Feature: Password Reset

Context: Users need to reset passwords via email link.

Contracts:

POST /reset-request → sends email with reset token

POST /reset-confirm → accepts token + new password

Test Plan:

Happy Path: Valid token resets password

Integration Test: Email delivery

That spec would have seemed perfectly adequate two years ago. An agent given this spec would produce working code. The tests would pass, and you’d have no idea whether the agent had introduced a token that never expires or stored the reset token in plaintext or allowed unlimited reset attempts from a single IP.

Now show them the senior spec:

Feature: Password Reset

Context: Users need to reset passwords via email link.

Governance

Risk Tier: Customer-Facing/Regulated (handles auth credentials)

Change Budget: Max eight files. Agent must not modify auth_provider.go or session_store.go.

Data Boundaries: Agent must not log, print or persist the reset token in any form outside the tokens table.

Security Controls: All new log statements must pass secrets-pattern scan before merge.

Contracts

POST /reset-request → sends email with reset token

Token: Cryptographically random, 256-bit, stored as bcrypt hash

Expiry: 15 minutes from creation, enforced server-side

Rate Limit: Max 3 requests per email per hour; return 429

POST /reset-confirm → accepts token + new password

Token is single-use; delete on consumption regardless of outcome

Password must meet policy (min 12 chars, breach-list check)

On Success: Invalidate all existing sessions for that user

Test Plan

Happy Path: Valid token resets password, old sessions invalidated

Expired token returns 410

Reused token returns 410

Brute Force: 4th reset request within 1 hour returns 429

Negative: Agent did not modify auth_provider.go or session_store.go

Negative: Grep codebase for plaintext token outside tokens table

Integration: Email delivery, token expiry enforcement

The gap between those two specs is exactly where every agentic regression hides. The first spec tells the agent what to build. The second spec tells the agent what it must not break. What you’re really testing for is negative space thinking: A junior engineer reads code and sees what is there. A senior engineer reads a spec and sees what is missing. The constraints that weren’t written. The failure modes that weren’t named. The files that should have been marked off-limits but weren’t. Ask your candidate which failure modes the first spec misses, and why the constraints in the second one exist. You’ll learn more about their engineering judgement from that conversation than from watching them invert a binary tree.
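The ‘secrets-pattern scan’ control in the senior spec is the kind of constraint that can be automated cheaply. Here is a rough sketch that scans only the lines a diff adds. The two patterns are heuristic assumptions (a log or print call that mentions a sensitive name, and a private key header); a production scanner would use a maintained ruleset.

```python
import re

# Heuristic, illustrative patterns. They will produce false positives and negatives.
SECRET_PATTERNS = [
    # A log/print call whose arguments mention a sensitive-sounding name.
    re.compile(r"(?i)\b(log|logger|logging|print\w*)\s*[.(].*(secret|token|password)"),
    # The header of a PEM-encoded private key.
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def flagged_added_lines(diff_text):
    """Return (diff line number, text) for each added line matching a secret pattern."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        if any(pattern.search(added) for pattern in SECRET_PATTERNS):
            findings.append((number, added.strip()))
    return findings
```

Fed a unified diff, this flags an added line such as `log.Printf("reset token: %s", resetToken)` and ignores unchanged context. Scanning only added lines keeps the check fast and points the reviewer at what the agent introduced.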

A 30/60/90-Day Plan for Leaders

If you’re an engineering leader reading this and wondering where to start, here’s a sequence that’s worked for me.

In the first 30 days, focus on guardrails. Define which domains agents are allowed to operate in (internal tooling is usually a good starting point) and which are off-limits (billing and payments). Standardize an AGENTS.md file for every repository that makes the rules explicit, and mandate audit logging for every terminal command an agent runs. You need the paper trail before you need anything else.
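Audit logging for agent commands can begin as a thin wrapper around process execution. This is a sketch under assumptions: the record fields, the JSON Lines file and the `agent_id` label are all illustrative choices, and a real setup would ship records to an append-only store the agent cannot edit.

```python
import json
import subprocess
from datetime import datetime, timezone


def run_with_audit(argv, audit_path, agent_id):
    """Run a command for an agent and append one JSON audit record per invocation."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "agent_id": agent_id,
        "argv": list(argv),
        "started_at": started,
        "exit_code": result.returncode,
    }
    # Command output is deliberately left out of the record so that
    # the audit trail does not become another place secrets can leak.
    with open(audit_path, "a", encoding="utf-8") as audit_file:
        audit_file.write(json.dumps(record) + "\n")
    return result
```

One line per command is enough to answer the first question in any incident: what did the agent actually run, and when.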

In the next 30 days, start piloting. Pick high-toil work that nobody wants to do: Dependency upgrades, boilerplate generation, repetitive refactors. Measure the verification gap, which is the time difference between reviewing an agent-generated PR and a human-written one. Put command approval gates on migrations and IAM changes. See what breaks.

By day 90, you should be ready to rewire the process properly. Integrate agent execution into your CI pipelines. Update your performance review criteria to reward spec quality and verification strength. Publish the spec-driven standard as the default for all new services.

The Bet You Are Making

The Stack Overflow numbers paint a picture of an industry that has adopted the tools but hasn’t solved the trust problem: 84% adoption, 29% trust. That gap isn’t closing on its own. The DORA data confirms that speed without stability is just accelerated chaos.

The teams that come out ahead in the next two years won’t be the ones that adopted agents fastest. They’ll be the ones that invested in verification capacity with the same seriousness they invest in production capacity: Tests, policies, audit logs, rollout controls.

The bottleneck has moved. Your process needs to move with it.
