

A survey of 700 developers and engineering leaders published this week finds 89% have seen an improvement in the productivity metrics their organization tracks following the adoption of artificial intelligence (AI) tools and platforms, with 81% noting that the amount of time spent reviewing code has increased. However, respondents also report that just under a third of their day is now consumed by AI-related tasks that existing metrics don’t track.
Conducted by Harness, the survey finds a full 94% said technical debt, validation time and developer burnout are not being tracked by existing productivity metrics. Specific activities not being tracked include time spent reviewing AI code for accuracy (53%), fixing subtle bugs in AI code (52%), explaining AI code to teammates (48%) and context switching between tools (45%). Only 38% of respondents said their organization is tracking the time spent reviewing code generated by AI tools.
Trevor Stuart, general manager and senior vice president for Harness, said the survey makes it clear that in the age of AI many organizations need to revisit the metrics used to evaluate the productivity of software engineering teams. For example, organizations need to better understand how many tokens are being consumed, and at what cost, to automate a software engineering task, he added.
There also needs to be a greater appreciation for the cognitive load that managing a small army of AI agents inevitably adds, noted Stuart.
Additionally, software engineering teams need to consider which AI model can be used most cost-effectively to automate a given task, rather than always defaulting to the latest version of an AI model that, as it becomes more advanced, also becomes more costly to employ, noted Stuart.
Software engineering teams should also compare the various prompts being used to determine which ones consistently work best for specific tasks, he added.
Unfortunately, too many organizations are simply tracking the total number of tokens consumed by application developers. That so-called “token-maxxing” approach to tracking AI productivity typically results in creating a set of incentives that ultimately can prove counterproductive, added Stuart. In general, organizations would be well advised to spend more time tracking ship rates to better understand how much of the code being created is actually making its way into a production environment, he added.
No matter what metrics are tracked, organizations should make sure they are designed to encourage deeper adoption of AI tools rather than solely to assess individual developer performance. In fact, more than half of respondents (54%) said they fear individual performance evaluations based on AI data, with most (55%) wanting more transparency into the metrics being used to assess performance and half (50%) wanting involvement in defining those metrics.
It’s not clear what balance is being struck between using metrics simply to assess performance from a retention perspective versus identifying additional opportunities to provide better training. Regardless of the approach to measuring productivity, the code generated by AI tools tends to be denser than what humans normally create, noted Stuart. In general, there’s just a lot more so-called “AI slop” that needs to be removed from code bases, he added.
It’s still early days so far as adoption of AI coding tools is concerned. The one certain thing, however, is that even in the age of AI, the things that get measured are, for better or worse, the things that tend to get done.