{"id":4180,"date":"2026-05-29T11:06:01","date_gmt":"2026-05-29T11:06:01","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/29\/why-your-ai-agent-is-a-black-box-and-how-to-fix-it-with-opentelemetry\/"},"modified":"2026-05-29T11:06:01","modified_gmt":"2026-05-29T11:06:01","slug":"why-your-ai-agent-is-a-black-box-and-how-to-fix-it-with-opentelemetry","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/29\/why-your-ai-agent-is-a-black-box-and-how-to-fix-it-with-opentelemetry\/","title":{"rendered":"Why Your AI Agent is a Black Box and How to fix it With OpenTelemetry\u00a0"},"content":{"rendered":"<div><img data-opt-id=748832225  fetchpriority=\"high\" decoding=\"async\" width=\"770\" height=\"330\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2021\/04\/Observability-DeepFactor.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"\" \/><\/div>\n<p><img data-opt-id=23095151  fetchpriority=\"high\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2021\/04\/Observability-DeepFactor-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"\" \/><\/p>\n<p><span data-contrast=\"auto\">You built the agent. It works in testing. Then it hits production and starts giving wrong answers, timing out or burning through your token budget, and you have no idea why.\u00a0This is when developers discover that print statements and log files weren\u2019t\u00a0designed\u00a0for this.\u00a0<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">LLM applications fail in ways\u00a0that\u00a0traditional tooling can\u2019t see. A hallucination doesn\u2019t throw an exception. A slow retrieval step doesn\u2019t show up in CPU metrics. 
A prompt that worked yesterday silently degrades today.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The fix is observability,\u00a0and the standard for doing it right is\u00a0OpenTelemetry\u00a0(OTel).<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">What\u00a0OpenTelemetry\u00a0Actually\u00a0Is<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">OTel\u00a0isn\u2019t a monitoring product;\u00a0it\u2019s a vendor-neutral specification under the CNCF that defines a standard way to collect observability data:\u00a0What gets collected, what it\u2019s called\u00a0and\u00a0how it\u2019s\u00a0shipped. You instrument your application once and can send that data to Grafana, Datadog, Jaeger or a purpose-built LLM platform without rewriting your instrumentation.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">That portability matters more than people realize\u00a0early on. Your observability investment is in your instrumentation code, not in the back\u00a0end you happen to be using today.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">The\u00a0Semantic\u00a0Conventions\u00a0Problem\u00a0Nobody\u00a0Talks\u00a0About<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Every LLM observability platform claims\u00a0OTel\u00a0compatibility. 
Technically, most\u00a0are\u00a0\u2014\u00a0they\u2019ll accept an OTLP payload without crashing.\u00a0However,\u00a0protocol-level compatibility says nothing about whether your spans will actually mean anything on the other side.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The problem is semantic conventions.\u00a0OTel\u00a0defines how to send data but doesn\u2019t fully define what to name LLM-specific attributes. Three competing standards have emerged:\u00a0OTel\u2019s\u00a0own GenAI conventions (still evolving, not fully ratified),\u00a0Arize\u2019s\u00a0OpenInference\u00a0conventions (used by LlamaIndex, structurally different) and whatever each vendor decided to call things before any standard existed.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In practice, this means your LlamaIndex pipeline emits\u00a0OpenInference, your custom LLM wrapper emits GenAI conventions and your framework\u2019s built-in tracing emits\u00a0something proprietary. All three land as valid OTLP. None use the same attribute names. Your token usage dashboard is reading three different fields depending on which span it hits.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This is the real state of LLM observability tooling in 2026. The protocol works. The conventions are a mess.\u00a0It\u00a0mirrors what happened with APM a decade ago:\u00a0Fragmentation precedes consolidation. 
The difference is that the LLM space is moving faster, and teams that pick a coherent instrumentation strategy now will avoid painful migrations later.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Why\u00a0Traces\u00a0Matter\u00a0More\u00a0Than\u00a0Logs for LLM\u00a0Work<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">OTel\u00a0defines three core signals:\u00a0<a href=\"https:\/\/devops.com\/debugging-in-production-leveraging-logs-metrics-and-traces\/\" target=\"_blank\" rel=\"noopener\">Logs, metrics and traces.<\/a> For LLM applications, traces do the heavy lifting.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">A plain log line is fire-and-forget. Something happens, you write a line. No duration, no parent-child relationship, no shared context. Metrics tell you that something is wrong, but not what went wrong with request #a7b3c9 at 2:32\u00a0p.m.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">A trace represents the complete life\u00a0cycle of a single request. 
Inside that trace are spans, each wrapping one unit of work with a start time, end time, attributes and a known relationship to other\u00a0spans:<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Trace:\u00a0user_query\u00a0[total: 1.2s]<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u251c\u2500\u2500 Span: routing [12ms]<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u2502\u00a0\u00a0 \u2514\u2500\u2500\u00a0router.model=gpt-4o-mini,\u00a0routing.result=rag<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u251c\u2500\u2500 Span: retrieval [180ms]<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u2502\u00a0\u00a0 \u2514\u2500\u2500\u00a0retrieval.doc_count=5,\u00a0retrieval.top_score=0.87<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u251c\u2500\u2500 Span:\u00a0llm.completion\u00a0[890ms]<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u2502\u00a0\u00a0 \u2514\u2500\u2500\u00a0llm.model=gpt-4o,\u00a0llm.tokens.prompt=1240,\u00a0llm.cost_usd=0.0094<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">\u2514\u2500\u2500 Span:\u00a0post_processing\u00a0[8ms]<\/span><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-ccp-props='{\"335559739\":200}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">You see exactly where time went, what each step cost and what decisions were made. When a user reports a wrong answer, you pull up the trace\u00a0\u2014\u00a0already assembled,\u00a0already timed, already attributed. 
No grepping through log files trying to reconstruct a sequence.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">One trend worth calling out:\u00a0As model providers add more built-in capabilities (function calling, structured outputs, vision), the surface area of what needs to be traced per request keeps expanding. Teams that treat observability as an afterthought are accumulating blind spots faster than they realize.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">What\u00a0Makes\u00a0Agents\u00a0Especially\u00a0Hard to\u00a0Observe<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Simple chat completions are easy\u00a0\u2014 one request, one LLM call, one response. Agents are different. An agent might route to one of several tools, loop multiple times before settling on an answer, run subtasks in parallel or hand off to another agent. Each step is a span, and several things break naive\u00a0approaches.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Context propagation across tool calls requires trace IDs to travel with outgoing requests. Multi-agent traces need explicit context passing across agent boundaries, which most frameworks don\u2019t do automatically. Loops need clearly labeled iterations, not identical span names you untangle manually. Routing decisions should be recorded as attributes so you can see why the agent chose RAG over a direct LLM call.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">One useful mental model is representing agents as state machines,\u00a0where each state transition is a span. 
This makes control flow visible in the trace rather than implied by span ordering. If you can see the states an agent moved through, debugging a wrong decision becomes tractable.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This also reflects something happening across the industry:\u00a0As agent architectures grow more complex (multi-agent orchestration, dynamic tool selection, self-correction loops), the gap between\u00a0\u2018it ran\u2019\u00a0and\u00a0\u2018I understand what it did\u2019\u00a0keeps widening. Observability isn\u2019t just debugging infrastructure anymore;\u00a0it\u2019s becoming the feedback loop that teams use to actually improve agent behavior over time.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">What it\u00a0Looks\u00a0Like\u00a0When it\u00a0Works<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">A user reports that your research agent gave outdated information. With traces, you open the specific request and see that the retrieval step returned documents from 2023, the LLM generated a confident response from stale content and no error was thrown because technically nothing failed. The only signal is a warning event flagging that the content age exceeded a threshold.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This is the class of failure that defines LLM observability: Silent quality problems rather than system errors. The trace caught it because the instrumentation measured the content age and flagged it. 
That\u2019s the difference between finding the problem in a trace and having a user report it three days later.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Where\u00a0This is Heading<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The LLM observability space is still early, but a few directions are becoming clear.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">First, evaluation and observability are converging. Today,\u00a0most teams treat evals as a CI\/CD concern and observability as a production concern.\u00a0However,\u00a0the same trace data that helps you debug a bad response can also feed automated quality scoring in production. Teams that connect these two loops will iterate faster than those running them separately.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Second, cost observability is becoming a first-class requirement, not an afterthought. As teams scale from prototypes to production workloads with thousands of daily requests across multiple models, understanding cost per request, per feature and per user segment is table stakes. Token-level attribution across a multi-step pipeline is something most teams currently do in spreadsheets. It belongs in the trace.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Third, the tooling will consolidate. Right now, the space has dozens of startups, several open-source projects and the major APM vendors all building LLM-specific features. History says this shakes out to a few winners within 2\u20133 years. 
The teams that instrument on OTel now, regardless of which back end they pick, will be the ones who navigate that consolidation without pain.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">The\u00a0Bottom\u00a0Line<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":120}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">OTel\u00a0is the right foundation because it\u00a0is vendor-neutral,\u00a0and you instrument once. But\u00a0OTel\u00a0is infrastructure. What matters is what you build on top of it:\u00a0The semantic understanding of token costs, the alerting that knows when retrieval quality degrades\u00a0and\u00a0the trace view that makes an agent\u2019s decision-making visible. If you\u2019re moving past simple chat completions into agents and RAG, the observability requirements go up fast. Traces are how you keep up.<\/span><span data-ccp-props='{\"335559738\":240,\"335559739\":240}'>\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/devops.com\/why-your-ai-agent-is-a-black-box-and-how-to-fix-it-with-opentelemetry\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>You built the agent. It works in testing. 
Then it hits production and starts giving wrong answers, timing out or [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4181,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-4180","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=4180"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4180\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/4181"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=4180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=4180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=4180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}