{"id":4168,"date":"2026-05-28T11:11:40","date_gmt":"2026-05-28T11:11:40","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/28\/more-signal-less-clarity-the-observability-paradox-no-one-wants-to-talk-about\/"},"modified":"2026-05-28T11:11:40","modified_gmt":"2026-05-28T11:11:40","slug":"more-signal-less-clarity-the-observability-paradox-no-one-wants-to-talk-about","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/28\/more-signal-less-clarity-the-observability-paradox-no-one-wants-to-talk-about\/","title":{"rendered":"More Signal, Less Clarity: The Observability Paradox No One Wants to Talk About\u00a0"},"content":{"rendered":"<div><img data-opt-id=825851060  fetchpriority=\"high\" decoding=\"async\" width=\"769\" height=\"330\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2024\/04\/binoculars-observability-details-tito-pixel-KBPzLjwdFbI-unsplash-1.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"observability, 2.0, developers, observability, datadog, your, observability, customers, blind spots, telemetry, New Relic, Observe, Gen AI, Generative AI, modern, applications, risk, observability, AI, unified observability, binoculars\" \/><\/div>\n<p><img data-opt-id=1367499258  fetchpriority=\"high\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2024\/04\/binoculars-observability-details-tito-pixel-KBPzLjwdFbI-unsplash-1-150x150.jpg\" class=\"attachment-thumbnail size-thumbnail wp-post-image\" alt=\"observability, 2.0, developers, observability, datadog, your, observability, customers, blind spots, telemetry, New Relic, Observe, Gen AI, Generative AI, modern, applications, risk, observability, AI, unified observability, binoculars\" \/><\/p>\n<p><span data-contrast=\"auto\">The amount of time it takes engineering teams to get back to work after an incident\u00a0is\u00a0getting worse every year, even though spending on 
observability tools has reached record highs. This should worry everyone in this field.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The\u00a0Logz.io Observability Pulse\u00a0followed teams with a <a href=\"https:\/\/devops.com\/when-customer-facing-systems-fail-how-incident-response-and-observability-reduce-mttr\/\" target=\"_blank\" rel=\"noopener\">mean time to resolution<\/a> (MTTR) of more than one hour: 47% in 2021, 64% in 2022, 74% in 2023 and 82% in 2024. Four years in a row of going backwards. During the same time, the average number of tools used by a team rose to eight or nine different platforms.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The answer from the industry has always been the same:\u00a0More\u00a0\u2014 more tools, more dashboards, more signals.\u00a0The working assumption is that the problem is visibility; that if engineers could see more of what was going on, they would be able to fix problems faster.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This article asserts that the contrary is accurate:\u00a0Beyond\u00a0a specific limit, excessive observability data results in cognitive overload that hinders\u00a0root\u00a0cause analysis\u00a0(RCA)\u00a0and increases MTTR. More signals can, surprisingly, mean less clarity.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><i><span data-contrast=\"auto\">The data suggests this assumption is wrong,\u00a0and\u00a0I\u2019ve watched it play out firsthand.<\/span><\/i><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">While I was\u00a0on-call\u00a0for a big cloud provider, at one of my previous jobs, I had to deal with a problem where a customer was seeing packet drops during a performance test. The war room got going quickly. 
We looked at all our dashboards, which showed things\u00a0such as\u00a0average CPU, per-node CPU, soft IRQ and memory use across the fleet. Everything seemed fine. There\u00a0was\u00a0no problem anywhere. We spent a long time in that space, methodically going through the stack, sure that the answer was in there somewhere if we just looked harder.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">However, it\u00a0wasn\u2019t. We finally did a packet capture, which is a simple, old-school way to diagnose a problem, and we found the real problem right away. Bursty traffic had pushed the use of the SNAT port past its limit. There were too many connections happening at the same time. The solution was simple:\u00a0Add\u00a0nodes.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">There was a metric for the SNAT port. It just wasn\u2019t on any of our dashboards. We had set up a complicated observability system that covered almost everything, but it confidently kept us from focusing on the one thing that really mattered.\u00a0Not only did the complexity not help,\u00a0but\u00a0it also wasted our time by making us look for a complicated answer instead of the simple one.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<h3 data-ccp-border-between=\"0px none #000000\" data-ccp-padding-between=\"0px\"><span data-contrast=\"none\">We Built More; Things Got\u00a0Slower<\/span><span data-ccp-props='{\"335572071\":0,\"335572072\":0,\"335572073\":4278190080,\"335572075\":0,\"335572076\":0,\"335572077\":4278190080,\"335572079\":0,\"335572080\":0,\"335572081\":4278190080,\"335572083\":0,\"335572084\":0,\"335572085\":4278190080,\"335572087\":0,\"335572088\":0,\"335572089\":4278190080,\"469789798\":\"nil\",\"469789802\":\"nil\",\"469789806\":\"nil\",\"469789810\":\"nil\",\"469789814\":\"nil\"}'>\u00a0<\/span><\/h3>\n<p><span 
data-contrast=\"auto\">The\u00a0Grafana Labs Observability Survey 2025\u00a0found that companies now use an average of eight different observability tools, down from nine the year before. Teams are starting to consolidate, but not because they have rethought what observability is really for;\u00a0cost is driving it.\u00a0Nearly 74%\u00a0of businesses say that cost is the most important factor when choosing tools.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The CNCF\u2019s own research backs this up: 72% of teams use up to nine different observability tools, and tool sprawl is the most common operational problem, cited by half of all respondents. This is a neutral finding from a foundation that has no vendor interest in the answer.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Meanwhile,\u00a039% of engineering teams say that complexity and operational overhead are their biggest problems. Not a lack of information. Not a lack of tools.\u00a0Complexity.\u00a0The thing that was supposed to fix the problem has become the problem.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">One engineering team documented this well:\u00a0They\u00a0audited their setup and found 47 dashboards, each of which made sense on its own. During incidents, engineers opened them at random and looked at panels that told conflicting stories. They deleted 28 with no archiving plan. Within two weeks, clarity had improved.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">We all confused data coverage with operational clarity. 
They feel the same at purchase time.\u00a0At 2 a.m., they feel very different.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<h3 data-ccp-border-between=\"0px none #000000\" data-ccp-padding-between=\"0px\"><span data-contrast=\"none\">There\u00a0is\u00a0a Cognitive Wall<\/span><span data-ccp-props='{\"335572071\":0,\"335572072\":0,\"335572073\":4278190080,\"335572075\":0,\"335572076\":0,\"335572077\":4278190080,\"335572079\":0,\"335572080\":0,\"335572081\":4278190080,\"335572083\":0,\"335572084\":0,\"335572085\":4278190080,\"335572087\":0,\"335572088\":0,\"335572089\":4278190080,\"469789798\":\"nil\",\"469789802\":\"nil\",\"469789806\":\"nil\",\"469789810\":\"nil\",\"469789814\":\"nil\"}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">Gary Klein\u2019s research on how experts make decisions under pressure found that experienced professionals don\u2019t solve problems by weighing all the information they have. They match patterns. They look for a shape they know, run a quick mental simulation and then act. The skill is knowing what to ignore.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">An engineer can\u2019t synthesize eight dashboards across four platforms when an incident happens at 2 a.m. They\u00a0look\u00a0for a signal that resembles one they have seen before. Every extra chart and every alert that isn\u2019t directly related to the main issue is friction.\u00a0In a high-cardinality environment, there\u2019s a lot of friction.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Studies on MTTR reduction showed that three things consistently shorten recovery time:\u00a0Fast\u00a0and accurate detection, low-cardinality instrumentation and easy-to-follow diagnostic paths. More data does not always help. 
There is a\u00a0<\/span><i><span data-contrast=\"auto\">sweet spot<\/span><\/i><span data-contrast=\"auto\">\u00a0in the number of metrics, beyond which adding more signal starts to slow down resolution instead of speeding it up. This is what the Google SRE Book has said since 2016. Most people in the industry have ignored it.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<h3 data-ccp-border-between=\"0px none #000000\" data-ccp-padding-between=\"0px\"><span data-contrast=\"none\">AI\u00a0is\u00a0Repeating the Same Mistake<\/span><span data-ccp-props='{\"335572071\":0,\"335572072\":0,\"335572073\":4278190080,\"335572075\":0,\"335572076\":0,\"335572077\":4278190080,\"335572079\":0,\"335572080\":0,\"335572081\":4278190080,\"335572083\":0,\"335572084\":0,\"335572085\":4278190080,\"335572087\":0,\"335572088\":0,\"335572089\":4278190080,\"469789798\":\"nil\",\"469789802\":\"nil\",\"469789806\":\"nil\",\"469789810\":\"nil\",\"469789814\":\"nil\"}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">AI-assisted observability is the current\u00a0industry answer\u00a0to cognitive overload. If engineers cannot handle the amount of data, teach a model to find the important signals. It\u2019s a good guess.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">However,\u00a0the DORA 2024 report found something that made people uncomfortable. A 25% rise in the use of AI was linked to a 1.5% drop in delivery throughput and a 7.2% drop in stability. The mechanism: AI speeds up code production,\u00a0which raises the risk of deployment,\u00a0leading\u00a0to more incidents, which uses up the time that was saved.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This is the same way that tool proliferation fails, but one level higher. 
More capability was added to solve a human problem, but this made the problem worse by adding more complexity.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"none\">Measure Where the Engineer\u2019s Attention Actually Goes<\/span><span data-ccp-props='{\"335559738\":400,\"335559739\":180}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The solution is to measure something more honest if the industry has been measuring the wrong thing, like tool adoption instead of incident effectiveness.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Take the on-call engineer\u2019s attention as a sign. What did they actually open during the event? What were they looking at for more than\u00a030\u00a0seconds? What did they do? What did they completely miss?<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">It\u2019s not hard:\u00a0Incident\u00a0timeline data, tooling interaction logs and one question after the incident\u00a0\u2014\u00a0<\/span><i><span data-contrast=\"auto\">What\u00a0actually helped\u00a0you diagnose this?<\/span><\/i><span data-contrast=\"auto\">\u00a0\u2014\u00a0give you a surprisingly clear picture over time. You could call it incident attention mapping. The data is like an audit of your observability stack done by the people who need it to work.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Most teams would find that a small number of signals do most of the work on most incidents. 
A few dashboards, two or three types of alerts and one recurring log query pattern.\u00a0Then there is a long tail of tools that were built for completeness, are checked from time to time out of habit and have never once helped solve a real problem.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">That tail isn\u2019t neutral. It costs money in infrastructure, and it costs an engineer under pressure the time to rule it out before finding what they really\u00a0need. If an engineer doesn\u2019t open a tool during a real incident, it\u2019s not an observability asset. It\u2019s organizational debt that comes with a bill every month.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"none\">What Teams Should Do Instead<\/span><span data-ccp-props='{\"335559738\":400,\"335559739\":180}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">The goal isn\u2019t less instrumentation; it\u2019s intentional instrumentation. Three changes that consistently improve response time:<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":720,\"335559991\":360,\"469769226\":\"Symbol\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\uf0b7\",\"469777815\":\"hybridMultilevel\"}' data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">First, build dashboards around incidents, not coverage. Every panel should answer a specific question that an engineer would ask during a real incident. 
If you can\u2019t name the question, remove the panel.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":720,\"335559991\":360,\"469769226\":\"Symbol\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\uf0b7\",\"469777815\":\"hybridMultilevel\"}' data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"auto\">Second, prioritize signal over volume. One alert that fires correctly and points directly to the problem is worth more than\u00a050\u00a0alerts that need to be interpreted. Review your alert-to-action ratio regularly. If engineers routinely ignore or silence alerts, that\u2019s the system telling you something.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":720,\"335559991\":360,\"469769226\":\"Symbol\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\uf0b7\",\"469777815\":\"hybridMultilevel\"}' data-aria-posinset=\"3\" data-aria-level=\"1\"><span data-contrast=\"auto\">Third, run attention audits after major incidents. Ask your engineers what they opened, what helped and what didn\u2019t. If you do this consistently for three months, you\u2019ll have a better idea of how valuable your observability stack really is than any vendor dashboard can show you.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/li>\n<\/ul>\n<h3><span data-contrast=\"none\">The Reframe<\/span><span data-ccp-props='{\"335559738\":400,\"335559739\":180}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">A system with good observability isn\u2019t one that lets you see everything. It\u2019s a system that lets an engineer who gets paged at 2 a.m. 
understand what\u2019s wrong in the first three minutes and take the right first step. Anything that helps clear that bar is worth keeping. Anything that adds noise should be cut, no matter how complete it makes your coverage look on a vendor dashboard.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The way to tell\u00a0which is which isn\u2019t to ask your vendors. It\u2019s to watch your engineers. Let the people who are fixing things show you, through their behavior, what is actually useful.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The measure of a good observability system isn\u2019t how much it can show you. It is how quickly it helps someone make a confident decision under pressure. Start there and work backward.<\/span><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><i><span data-contrast=\"auto\">The tool count going from nine to eight is a start. 
The next step is finding out which eight\u00a0\u2014\u00a0and being honest about what the answer reveals.<\/span><\/i><span data-ccp-props='{\"335559739\":260}'>\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/devops.com\/more-signal-less-clarity-the-observability-paradox-no-one-wants-to-talk-about\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>\n<p>\u200b<\/p>","protected":false},"excerpt":{"rendered":"<p>The amount of time it takes engineering teams to get back to work after an incident\u00a0is\u00a0getting worse every year, even [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4169,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-4168","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4168","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=4168"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4168\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/4169"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=4168"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=4168"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=4168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}