The shift: from dashboards to agents

Operations is undergoing its most fundamental transformation since the move from on-prem to cloud. The human operator staring at dashboards and running playbooks is being replaced by AI agents that monitor, reason, and act autonomously.

Three waves are converging simultaneously:

1. AI coding agents become the primary interface

Developers have largely stopped writing infrastructure code by hand. Claude Code, Cursor, Copilot, Windsurf, and Codex now mediate nearly all infrastructure interactions — Terraform changes, Kubernetes configs, database migrations, CI/CD pipelines. By early 2026, 93% of developers use AI tools regularly, with agent mode adoption at 55% and climbing toward 70%+ by year-end.

The implication for operations: the coding agent is now the primary consumer of infrastructure knowledge. When a developer asks Claude Code "why is my PostgreSQL query slow?", the quality of the answer depends entirely on what structured knowledge the agent can access.

2. Cloud providers ship always-on SRE agents

Azure SRE Agent went GA as an always-on reliability service with MCP-based subagents. AWS followed with DevOps Agent — an autonomous on-call engineer. PagerDuty's SRE Agent evolved from copilot to virtual responder, heading toward fully autonomous operations.

These are persistent agents that continuously monitor health, correlate alerts across services, execute remediation runbooks autonomously, and communicate with each other via MCP.

3. Agent frameworks enable custom orchestration

Enterprises aren't just consuming off-the-shelf agents — they're building their own. CrewAI, LangGraph, and Semantic Kernel let platform teams compose multi-agent workflows: an alert-triage agent hands off to a root-cause agent, which invokes a remediation agent, which escalates to a human only if its confidence falls below a threshold.
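The handoff pattern described above can be sketched in framework-agnostic Python. Everything below (the function names, the 0.8 threshold, the pgbouncer runbook step) is invented for illustration and is not any framework's actual API:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # below this, escalate to a human (illustrative value)

@dataclass
class Finding:
    summary: str
    confidence: float  # 0.0-1.0, reported by the diagnosing agent

def triage(alert: dict) -> dict:
    """Alert-triage agent: enrich and classify the incoming alert."""
    return {**alert, "severity": "high" if "prod" in alert["service"] else "low"}

def diagnose(alert: dict) -> Finding:
    """Root-cause agent: correlate signals and propose a likely cause."""
    return Finding(summary=f"connection pool exhaustion on {alert['service']}",
                   confidence=0.92)

def remediate(finding: Finding) -> str:
    """Remediation agent: execute the matching runbook step."""
    return f"restarted pgbouncer; cause: {finding.summary}"

def handle(alert: dict) -> str:
    """Triage -> diagnose -> remediate, with a human escalation gate."""
    finding = diagnose(triage(alert))
    if finding.confidence < CONFIDENCE_THRESHOLD:
        return f"escalated to on-call: {finding.summary}"
    return remediate(finding)

print(handle({"service": "prod-postgres", "alert": "too many connections"}))
```

The gate in `handle` is the key design point: autonomy is bounded by a confidence threshold, so low-confidence findings route to a person instead of a runbook.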

The problem all these agents share

Every agent — whether it's Claude Code helping a developer tune a Redis cache, Azure SRE Agent diagnosing a database failover, or a custom CrewAI workflow triaging alerts — needs to answer the same fundamental questions:

  1. What should I monitor? Which metrics matter? What are normal ranges?
  2. What does this symptom mean? Given these patterns, what's the likely root cause?
  3. What should I do? What are the recommended remediation steps?
  4. Where's the authoritative source? Can I link the operator to official docs for verification?
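One way to picture the structured knowledge that answers all four questions is a single record per component. The shape below is purely illustrative; the field names, thresholds, and symptom entries are assumptions, not any product's actual schema (the PostgreSQL docs link, however, is real):

```python
# Hypothetical knowledge record; every field name here is illustrative.
knowledge = {
    "component": "postgresql",
    "monitor": {  # 1. what to monitor, with normal ranges
        "cache_hit_ratio": {"normal": ">= 0.99"},
        "connections_used": {"normal": "< 80% of max_connections"},
    },
    "symptoms": {  # 2. observed pattern -> likely root cause
        "slow queries + low cache_hit_ratio": "working set exceeds shared_buffers",
    },
    "remediation": [  # 3. recommended steps
        "run EXPLAIN (ANALYZE, BUFFERS) on the slow query",
        "consider raising shared_buffers or adding an index",
    ],
    "sources": [  # 4. authoritative links for verification
        "https://www.postgresql.org/docs/current/runtime-config-resource.html",
    ],
}

# An agent answers "why is my PostgreSQL query slow?" by matching the
# observed pattern against the symptom table:
cause = knowledge["symptoms"]["slow queries + low cache_hit_ratio"]
print(cause)
```

A record like this is what "structured knowledge" means in practice: small enough to fit in a context window, and every claim traceable to a source link.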

Today, agents cobble this together from web searches (slow, noisy), training data (stale, no source links), and user-provided context (defeats automation). The results are inconsistent, unverifiable, and often wrong.

What operations execs are going through right now

  1. Tooling explosion. Teams adopt AI coding agents without central coordination. Cloud providers push SRE agents. Incident management vendors add AI copilots. No coherent strategy.
  2. Knowledge fragmentation. Each tool has its own context window, its own training data. When Azure SRE Agent and Claude Code look at the same PostgreSQL instance, they may give contradictory advice.
  3. Trust deficit. Operations teams can't trust AI recommendations that don't cite sources. An SRE making a production change needs to see the docs link, the metric thresholds, and the risk assessment.
  4. Cost pressure. $300K+/hour incident costs mean faster resolution has direct financial impact. But agents that hallucinate don't reduce MTTR — they add noise.

The operations landscape is changing

The tools your teams use to manage infrastructure are being transformed by AI. Coding agents like Claude Code and Cursor now mediate most infrastructure changes. Cloud providers are shipping always-on SRE agents — Azure SRE Agent, AWS DevOps Agent, PagerDuty SRE Agent — that monitor, diagnose, and remediate autonomously. And platform teams are building custom agent workflows with frameworks like CrewAI and LangGraph.

All of these agents share the same problem: they need deep, accurate infrastructure knowledge to make good decisions. What metrics matter for your PostgreSQL cluster? What does a Redis eviction spike mean? What's the right remediation for Kafka consumer lag?

Today, they cobble that knowledge together from web searches, stale training data, and whatever context the user happens to paste in. The results are inconsistent, unverifiable, and often wrong.

Schema is the intelligence layer

Schema extracts structured knowledge from authoritative sources — official documentation, changelogs, source code, benchmarks — and delivers it through the interfaces your agents already use: MCP, Skills, API, CLI, and the web.

Every answer is grounded in original sources with direct links, so your team can verify before acting. The knowledge is structured for speed and token efficiency, so agents get precise answers instead of pages of prose. And it stays current — as infrastructure evolves, Schema's extraction pipeline keeps the knowledge base up to date.

Schema doesn't replace your agents or your observability platform. It makes them smarter. Whether it's Claude Code helping a developer optimize a query, Azure SRE Agent diagnosing a failover, or a custom agent triaging alerts at 3am — Schema provides the ground truth they need to act with confidence.
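As a concrete sketch of what "grounded, verify before acting" could look like from a consuming agent's side, here is a hypothetical client. The endpoint URL, request shape, and response fields are all invented for illustration and are not Schema's actual API:

```python
import json
import urllib.request

def ask_schema(component: str, question: str) -> dict:
    """Query a hypothetical knowledge endpoint (URL and shape are invented)."""
    req = urllib.request.Request(
        "https://api.schema.example/v1/answers",  # illustrative URL only
        data=json.dumps({"component": component, "question": question}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def verified(answer: dict) -> bool:
    """Refuse to act on any answer that doesn't cite at least one HTTPS source."""
    return bool(answer.get("sources")) and all(
        s.startswith("https://") for s in answer["sources"]
    )
```

The `verified` gate captures the trust-deficit point above: an agent (or its human operator) can mechanically reject any recommendation that arrives without source links.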

How it fits together

Schema occupies a unique position in the stack — the intelligence layer between raw infrastructure documentation and the agents that need to act on that knowledge.

Consumers
  Claude Code · Cursor · Copilot · Windsurf
  Azure SRE Agent · PagerDuty SRE Agent · AWS DevOps Agent
          ▼
Surfaces
  Skills · MCP · API · CLI · Web
          ▼
Schema — infrastructure intelligence
          ▲
Sources
  Docs · Changelogs · Source Code · Benchmarks · Forums

Visitors scanning top to bottom read: "All these agents I know consume infrastructure knowledge through Schema's surfaces, powered by Schema's extraction of authoritative sources." The hourglass shape — broad ecosystem narrowing to Schema, then broadening again to sources — communicates that Schema is the pinch point where raw knowledge becomes agent-ready intelligence.

  • Consumers — the coding agents, SRE agents, and frameworks your teams already use
  • Surfaces — Schema's delivery interfaces (MCP, Skills, API, CLI, Web)
  • Schema — the intelligence layer that extracts, structures, and grounds knowledge
  • Sources — official docs, changelogs, source code, benchmarks, forums

The market is moving fast

In January 2026, Gartner published its inaugural Market Guide for AI Site Reliability Engineering Tooling, formally recognizing AI SRE as a distinct market category. The report projects that 85% of enterprises will use AI SRE tooling by 2029, up from less than 5% in 2025 — a 17x expansion in four years.

Gartner notes that traditional SRE teams "cannot keep up with the technology and operational demands required of them" and recommends AI SRE tools as the on-ramp for organizations that couldn't previously justify full SRE practices. The guide calls out that effective AI SRE must work across telemetry, event correlations, and root cause analysis — and warns that tools focused solely on reactive operations "will not improve system reliability."

Meanwhile, pure-play AI SRE startups are attracting significant capital: Resolve AI raised $125M at a $1B valuation, the broader observability market reached $3.3B in 2025, and Gartner's emerging capability roadmap includes proactive incident avoidance, SLO protection, and multi-agent architectures. Every one of these agents needs structured infrastructure knowledge to reason about the telemetry they observe — which is exactly what Schema provides.