{"id":4176,"date":"2026-05-29T09:04:01","date_gmt":"2026-05-29T09:04:01","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/29\/agentic-sre-the-next-frontier-of-reliability\/"},"modified":"2026-05-29T09:04:01","modified_gmt":"2026-05-29T09:04:01","slug":"agentic-sre-the-next-frontier-of-reliability","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/05\/29\/agentic-sre-the-next-frontier-of-reliability\/","title":{"rendered":"Agentic SRE: The Next Frontier of Reliability"},"content":{"rendered":"<div><img data-opt-id=473574031  fetchpriority=\"high\" decoding=\"async\" width=\"770\" height=\"516\" src=\"https:\/\/devops.com\/wp-content\/uploads\/2020\/12\/SRE-e1719572198678.png\" class=\"attachment-large size-large wp-post-image\" alt=\"reliability, SRE, practices, Site reliability engineering, operations, SRE, SREs, software,\" \/><\/div>\n<p><span data-contrast=\"none\"><a href=\"https:\/\/devops.com\/mcp-powered-agentic-ai-in-devops-building-secure-scalable-multi-agent-pipelines-for-autonomous-sre-and-observability\/\" target=\"_blank\" rel=\"noopener\">Agentic SRE is the evolution of site reliability engineering<\/a> where AI agents help\u00a0observe\u00a0systems, reason over\u00a0telemetry\u00a0and take bounded operational actions under human-defined guardrails. 
The goal is not to replace SREs, but to reduce toil, speed up diagnosis\u00a0and make incident response more consistent and scalable.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Why\u00a0This\u00a0Matters<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Modern systems are too distributed,\u00a0noisy\u00a0and fast-moving for purely manual operations to keep up. Engineers spend\u00a0significant\u00a0time\u00a0correlating dashboards, reading logs, checking recent\u00a0deploys\u00a0and\u00a0hunting for\u00a0context before they can even start fixing the problem.\u00a0Agentic SRE addresses this by turning telemetry into actionable context and automating safe parts of the response loo<\/span><span data-contrast=\"auto\">p<\/span><span data-contrast=\"none\">.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This shift is especially important because reliability work is full of repetitive, high-pressure tasks that are easy to standardize but hard to execute perfectly at 2\u00a0a.m. That makes it a\u00a0perfect\u00a0fit for agents that can summarize, correlate, recommend\u00a0and execute within policy boundaries.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">What\u00a0Agentic SRE\u00a0Looks\u00a0Like<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">A practical agentic SRE workflow usually starts with signals from OpenTelemetry, logs, traces, metrics, deployment events and incident history. 
The agent then enriches the alert, asks follow-up questions if needed, identifies the likely blast radius and proposes next actions based on runbooks or prior incidents.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">The important distinction is between assistive and autonomous behavior.\u00a0Many current systems, including vendor offerings, emphasize bounded\u00a0assistance\u00a0rather than unrestricted production changes, because trust and safety are central to operational use. In other words, the agent should be useful enough to accelerate the human but constrained enough that it does not create new failure modes.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Core\u00a0Tool\u00a0Stack<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">A solid agentic SRE stack can be built from the following layers:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"none\">Telemetry:\u00a0<\/span><span data-contrast=\"none\">OpenTelemetry<\/span><span data-contrast=\"none\">\u00a0for logs,\u00a0metrics\u00a0and traces<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans 
Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"none\">Observability\u00a0Back\u00a0End:\u00a0<\/span><span data-contrast=\"none\">Datadog<\/span><span data-contrast=\"none\">,\u00a0<\/span><span data-contrast=\"none\">ObserveNow\u00a0by\u00a0StackGen<\/span><span data-contrast=\"auto\">,\u00a0<\/span><span data-contrast=\"none\">Grafana<\/span><span data-contrast=\"none\">, New Relic, Elastic,\u00a0Prometheus\u00a0or cloud-native telemetry systems<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"3\" data-aria-level=\"1\"><span data-contrast=\"none\">Orchestration: MCP servers, internal\u00a0APIs\u00a0or workflow engines to expose safe tools to the agent.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"4\" data-aria-level=\"1\"><span data-contrast=\"none\">Agent\u00a0Runtime: LLM-based agent frameworks with function calling, planning\u00a0and tool use<\/span><span 
data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"5\" data-aria-level=\"1\"><span data-contrast=\"none\">Incident\u00a0Workflow: PagerDuty,\u00a0Opsgenie, Slack, Jira, ServiceNow\u00a0or internal incident systems<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"1\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"6\" data-aria-level=\"1\"><span data-contrast=\"none\">Safety\u00a0Layer: RBAC, approval gates, audit logs, action allowlists\u00a0and rollback paths<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"none\">A good rule is that the agent should only be able to do what an on-call engineer could\u00a0reasonably do\u00a0after approval. That keeps the system practical while reducing the risk of accidental damage.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Use\u00a0Case 1: Alert\u00a0Triage<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">One of the best first use cases is alert triage. 
When an alert fires, the agent can pull related traces, check recent deploys,\u00a0identify\u00a0matching log spikes\u00a0and summarize the most probable cause in plain English.\u00a0This reduces\u00a0the\u00a0<\/span><i><span data-contrast=\"none\">where\u00a0do I even start?<\/span><\/i><span data-contrast=\"none\">\u00a0problem that burns time during incidents.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Here is a simple logic flow:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<ol>\n<li><span data-contrast=\"none\">Alert arrives.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<li><span data-contrast=\"none\">Agent fetches service health, recent\u00a0deploys\u00a0and correlated traces.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<li><span data-contrast=\"none\">Agent groups related alerts into one incident.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<li><span data-contrast=\"none\">Agent ranks likely\u00a0causes\u00a0by confidence.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<li><span data-contrast=\"none\">Agent posts a summary to Slack or PagerDuty.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ol>\n<p><span data-contrast=\"none\">Example pseudocode:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">def\u00a0triage_alert(alert):<\/span><br \/>\n<span 
data-contrast=\"none\">\u00a0\u00a0\u00a0 context = {<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 # Look-back offsets (15, 30) are assumed to be minutes before the alert fired.<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"recent_deploys\": get_recent_deploys(alert.service, hours=6),<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"traces\": query_traces(alert.service, since=alert.time - 15),<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"logs\": query_logs(alert.service, since=alert.time - 15),<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"metrics\": query_metrics(alert.service, since=alert.time - 30),<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 }<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0 summary = llm.summarize({<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"alert\": alert.message,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"context\": context,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"task\": \"Identify likely root cause and next best actions\"<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 })<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0 return {<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"incident_summary\": summary.text,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"confidence\": summary.confidence,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"recommended_actions\": summary.actions<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 }<\/span><br \/>\n<span 
data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335557856\":16447736,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This kind of workflow is often more valuable than full automation because it improves speed without taking dangerous actions too early.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Use\u00a0Case 2: Incident\u00a0Copilot<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Another high-value use case is an incident copilot that joins the response channel and acts like a second brain. It can generate timelines, summarize what has happened so far, pull links to\u00a0dashboards\u00a0and keep track of hypotheses as responders test them.\u00a0This is especially useful when multiple engineers are\u00a0involved\u00a0and context gets fragmented.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">A simple implementation might use structured prompts plus tool access:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">tools = [<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 \"search_incident_history\",<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 \"fetch_service_dashboard\",<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 \"query_logs\",<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 \"query_traces\",<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 \"open_runbook\"<\/span><br \/>\n<span data-contrast=\"none\">]<\/span><\/p>\n<p><span data-contrast=\"none\">prompt = 
\"\"\"<\/span><br \/>\n<span data-contrast=\"none\">You are an SRE incident copilot.<\/span><br \/>\n<span data-contrast=\"none\">Summarize\u00a0current status,\u00a0identify\u00a0likely cause,<\/span><br \/>\n<span data-contrast=\"none\">and suggest the next safe diagnostic step.<\/span><br \/>\n<span data-contrast=\"none\">Do not recommend risky changes without human approval.<\/span><br \/>\n<span data-contrast=\"none\">\"\"\"<\/span><br \/>\n<span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335557856\":16447736,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">The value here is not magical reasoning; it is disciplined coordination. A copilot reduces duplicate effort and helps teams move from noise to signal faster.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Use\u00a0Case 3: Automated RCA\u00a0Drafts<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Root cause analysis is another strong fit, especially for post-incident review preparation. The agent can compare the incident timeline against recent changes,\u00a0identify\u00a0likely\u00a0triggers\u00a0and draft a first-pass RCA document with evidence links. 
Human engineers still\u00a0validate\u00a0the write-up, but the time savings can be\u00a0substantial.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">A useful RCA pipeline is:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"3\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"none\">Collect\u00a0incident\u00a0timeline.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"3\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"none\">Pull deploy diffs,\u00a0config\u00a0changes\u00a0and feature flag updates.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"3\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"3\" data-aria-level=\"1\"><span data-contrast=\"none\">Map symptoms to\u00a0changes\u00a0in time order.<\/span><span 
data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"3\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"4\" data-aria-level=\"1\"><span data-contrast=\"none\">Generate a draft with supporting evidence.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"3\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"5\" data-aria-level=\"1\"><span data-contrast=\"none\">Ask the engineer to review and\u00a0correct.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"none\">Example snippet:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">def\u00a0draft_rca(incident_id):<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 timeline = get_incident_timeline(incident_id)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 # Assumption: the timeline object exposes the affected service.<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 changes = get_recent_changes(timeline.service, before=timeline.start)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 evidence = correlate(timeline, changes)<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0 draft = llm.write({<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
\"timeline\": timeline,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"changes\": changes,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"evidence\": evidence,<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"format\": \"RCA with impact, trigger, contributing factors, corrective actions\"<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 })<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0 return draft<\/span><br \/>\n<span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335557856\":16447736,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This is a good example of agentic SRE because the agent accelerates documentation and analysis while humans\u00a0retain\u00a0final ownership.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Use\u00a0Case 4: Safe\u00a0Remediation<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">The next step is\u00a0bounded\u00a0remediation. 
For well-understood incidents, an agent can recommend or execute low-risk actions such as restarting a failed worker, scaling a\u00a0deployment\u00a0or disabling a broken feature flag.\u00a0However,\u00a0these actions should be tied to policy checks, confidence\u00a0thresholds\u00a0and human approval for anything that affects customer-facing production behavior.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">A safe remediation decision tree might look like this:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">def remediate(issue):<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 confidence = assess_confidence(issue)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 risk = assess_risk(issue.action)<\/span><\/p>\n<p><span data-contrast=\"none\">\u00a0\u00a0\u00a0 if confidence &gt; 0.9 and risk == \"low\":<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return execute(issue.action)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 elif confidence &gt; 0.7:<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return request_approval(issue.action)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0 else:<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return escalate_to_human(issue)<\/span><br \/>\n<span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335557856\":16447736,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This approach aligns with how responsible vendors frame operational AI:\u00a0Assistive, controlled and grounded in observability rather than free-form autonomy.<\/span><span 
data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Use\u00a0Case 5: Learning\u00a0From\u00a0Incidents<\/span><span data-ccp-props='{\"134245418\":true,\"134245529\":true,\"335559738\":360,\"335559739\":80}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Agentic SRE is also useful after the incident is over. The agent can extract patterns across incidents,\u00a0identify\u00a0recurring root\u00a0causes\u00a0and suggest where to improve observability or reduce toil. Over time, this creates a feedback loop between production pain and platform investment.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This\u00a0can lead to concrete improvements such as better alerts, richer traces, previously missing\u00a0dashboards\u00a0or new runbook steps. In many teams, this is where the highest long-term ROI shows up because the agent helps the organization become more reliable, not just faster at firefighting.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Guardrails and\u00a0Risks<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559685\":-30,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">The biggest risk in agentic SRE is not that the agent is too smart; it is that it is too confident. LLMs can produce plausible but wrong explanations, so every recommendation must be traceable to real\u00a0telemetry, and every action must run under explicitly scoped permissions. 
Security,\u00a0auditability\u00a0and rollback are non-negotiable in production operations.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Good guardrails include:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"none\">Action allowlists<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"none\">Approval gates for risky changes<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"3\" data-aria-level=\"1\"><span data-contrast=\"none\">Full audit logs<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li 
data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"4\" data-aria-level=\"1\"><span data-contrast=\"none\">Read-only mode during early rollout<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"5\" data-aria-level=\"1\"><span data-contrast=\"none\">Per-service or per-team permissions<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<ul>\n<li data-leveltext=\"\u25cf\" data-font=\"Noto Sans Symbols\" data-listid=\"4\" data-list-defn-props='{\"335552541\":1,\"335559685\":540,\"335559991\":360,\"469769226\":\"Noto Sans Symbols\",\"469769242\":[8226],\"469777803\":\"left\",\"469777804\":\"\u25cf\",\"469777815\":\"multilevel\"}' data-aria-posinset=\"6\" data-aria-level=\"1\"><span data-contrast=\"none\">Human override at every step<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"none\">If you treat the agent like an intern with excellent recall but no judgment, you will design much safer systems.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Example\u00a0Architecture<\/span><span 
data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559685\":-30,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">A practical architecture for agentic SRE could look like this:<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">OpenTelemetry\u00a0\/ Logs \/ Metrics \/ Traces<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">Observability Platform<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">Incident Context Builder<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">LLM Agent + Policy Engine<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">Tool Layer (dashboards, tickets, runbooks,\u00a0ChatOps)<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">Human Approval \/ Automatic Safe Actions<\/span><br \/>\n<span data-contrast=\"none\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2193<\/span><br \/>\n<span data-contrast=\"none\">Audit Logs + RCA + Learning Loop<\/span><br \/>\n<span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335557856\":16447736,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">You can implement this with\u00a0OpenTelemetry, Prometheus or Grafana, Slack or Teams, PagerDuty, a vector store for incident knowledge and an LLM orchestration layer with tool calling. 
The architecture\u00a0matters more than the model choice because most reliability\u00a0value comes\u00a0from context,\u00a0constraints\u00a0and execution discipline.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<h3><span data-contrast=\"auto\">Closing\u00a0Perspective<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559685\":-30,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Agentic SRE is the next frontier of reliability because it changes how teams investigate,\u00a0decide\u00a0and act during operational events. The real promise is not full autonomy, but faster understanding, safer automation\u00a0and better human-machine collaboration in the moments that matter most.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559685\":-30,\"335559739\":105,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">If you build it well, the outcome is a stronger reliability practice:\u00a0Fewer wasted cycles, shorter\u00a0incidents\u00a0and more time for engineers to work on systemic fixes instead of repetitive toil. 
That is the real value of agentic SRE \u2014 not replacing the\u00a0SRE but\u00a0giving the SRE a far more capable operating model.<\/span><span data-ccp-props='{\"134233117\":true,\"201341983\":0,\"335559739\":210,\"335559740\":360}'>\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/devops.com\/agentic-sre-the-next-frontier-of-reliability\/\" target=\"_blank\" class=\"feedzy-rss-link-icon\">Read More<\/a><\/p>\n<p>\u200b<\/p>","protected":false},"excerpt":{"rendered":"<p>Agentic SRE is the evolution of site reliability engineering where AI agents help\u00a0observe\u00a0systems, reason over\u00a0telemetry\u00a0and take bounded operational actions under [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4177,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[5],"tags":[],"class_list":["post-4176","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4176","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=4176"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/4176\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/4177"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=4176"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=4176"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=4176"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}