Built with GitHub Copilot · Shipped via GitHub Actions

Agentic Ops Advisor

Ops teams drown in alerts — but alerts don't explain why. The Agentic Ops Advisor is an AI agent that merges infrastructure telemetry with human intent — change events, decisions, and ownership — to deliver root-cause diagnoses with confidence levels and governed remediation plans.

Built end-to-end on the GitHub platform — coded with GitHub Copilot, tested and shipped through GitHub Actions, hosted on GitHub Pages, and deployed to Azure AI Foundry.

AI-Coded with GitHub Copilot
CI/CD via GitHub Actions
Automated via GitHub Issues

Traditional monitoring sees metrics — not meaning

Your dashboards show what happened. But nobody can tell you why — until someone manually pieces it together.

Sound familiar?

  • 📉 Alert fatigue — hundreds of alerts, no prioritization, same noise every day
  • 🔍 Manual log trawling — hours digging through logs, metrics, and dashboards to find root cause
  • 💬 "Who owns this?" — Slack threads spiraling while the incident burns

The Agentic Ops Advisor was built to close this gap — an AI agent that doesn't just see the metrics, it understands the organizational context behind them.

Every layer of this project runs on GitHub

From the first line of code to the live deployment, GitHub is the hero of this story. Four products power the entire lifecycle.

GitHub Copilot

Every Python file, every tool, every evaluator — written with GitHub Copilot's AI pair programming. Copilot accelerated development from scaffolding to production-ready agent code in record time.

Learn about Copilot ↗
GitHub Actions

The full CI/CD pipeline runs on GitHub Actions: lint → test → eval gates → Docker build → Azure deployment. Every pull request triggers automated quality and regression checks before anything ships.

View Workflows ↗
GitHub Pages

This very site is hosted on GitHub Pages — deployed automatically from the docs/ folder on every merge to main. Zero infrastructure, zero cost, zero ops burden. The demo site ships with the code.

Learn about Pages ↗
GitHub Issues

Squad automation workflows automatically triage, label, and assign incoming issues. Pull request CI results are posted back as issue comments — issue tracking is fully integrated into the delivery pipeline.

View Issues ↗

GitHub builds it · Azure AI Foundry runs it

The GitHub platform owns the entire build pipeline — code, CI/CD, security, packaging. Azure AI Foundry provides the production runtime for the agent.

[Architecture diagram: Agentic Ops Advisor, GitHub Build Layer + Azure Runtime Layer (synthetic demo)]

GitHub Platform (build layer): 🤖 GitHub Copilot (AI pair programmer) · 🐛 GitHub Issues (squad triage, auto-assign) · 📄 GitHub Pages (this site, docs/ folder) · ⚙️ GitHub Actions (lint · test · eval · deploy) · 🔀 Pull Requests (CI checks, eval gates) · 📁 GitHub Repository (branch protection, CODEOWNERS). Squad automation workflows: squad-triage, squad-issue-assign, sync-squad-labels.

Azure (runtime layer): 👤 Ops Engineer asks a natural language query → 🤖 Agentic Ops Advisor on Azure AI Foundry Agent Service (GPT-4.1 · system prompt · tools · 🐳 Docker image built by GitHub Actions and pushed to ACR). Tools: 📊 SQL Telemetry (GPU, network, cost, incidents) · 🔗 Work IQ Context (⚠️ simulated) · 🛡️ Action Stub (propose · approve · simulate). 📡 OpenTelemetry exports to Azure Application Insights / Azure Monitor, tracing every agent, tool, and LLM call end-to-end.

The deploy pipeline story

git push → GitHub Actions (Lint + Test + Eval) → Docker Build → ACR → Foundry Agent Service

Every push triggers the full pipeline: code quality checks, 346 unit tests, evaluation regression gates, then a Docker container build pushed to Azure Container Registry (ACR). The containerized agent is deployed to Azure AI Foundry Agent Service using Bicep IaC templates — infrastructure as code, fully reproducible.

Container Deployment

Containerized agent built by GitHub Actions and deployed to Azure Container Registry (ACR), then served via Azure AI Foundry Agent Service using Bicep IaC.

Bicep Infrastructure as Code

All Azure resources — Foundry project, AI Hub, Application Insights, Azure SQL — are defined in Bicep templates. One deployment, fully reproducible.

Foundry Playground

Once deployed, the agent is accessible through the Azure AI Foundry Playground — a browser-based interface for interacting with the agent in real time.

Work IQ — The missing context layer

The agent is deployed and running on Azure. Now here's what makes it fundamentally different: organizational context. Traditional AIOps sees metrics. Agentic ops sees metrics plus the human decisions that caused them.

Why telemetry alone isn't enough

A GPU utilization drop is just a number. Was it a planned model-serving rollout? An accidental config change? A cost-optimization policy kicking in? Without organizational context, every anomaly is a mystery. Work IQ bridges that gap — correlating infrastructure telemetry with the change events, decisions, ownership, and runbooks that explain what happened and who to talk to next.

Four context surfaces the agent consults

Change Events

Recent deployments, config changes, policy updates, and rollout windows that coincide with anomaly timestamps. The agent asks: "What changed near this incident?"

Decisions

Approval records, AI Factory planning decisions, budget sign-offs, and architecture choices. The agent correlates "who decided this" with "what broke."

Ownership

Service owners, on-call rosters, and escalation paths. When the agent identifies a root cause, it knows exactly who to page — no more "who owns this?" Slack threads.

Runbooks

Documented remediation procedures linked to specific failure modes. The agent surfaces the right runbook for the diagnosed issue — governance through institutional knowledge.

Change Correlation

The agent connects infrastructure events to change context — approvals, policy updates, rollout windows, and AI Factory planning decisions — using the Work IQ pattern. This is how the agent moves from "something broke" to "here's what caused it and who owns it."
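The correlation step described above can be sketched as a simple time-window join. This is a minimal illustration with invented field names and timestamps, not the project's actual implementation:

```python
from datetime import datetime, timedelta

def correlate_changes(anomaly_ts, change_events, window=timedelta(hours=2)):
    """Return change events within ±window of the anomaly, nearest first."""
    nearby = [e for e in change_events if abs(e["ts"] - anomaly_ts) <= window]
    return sorted(nearby, key=lambda e: abs(e["ts"] - anomaly_ts))

# Example: a GPU-utilization drop at 14:40 UTC against two recorded changes.
anomaly = datetime(2024, 6, 1, 14, 40)
events = [
    {"ts": datetime(2024, 6, 1, 14, 32), "type": "deployment",
     "summary": "model-serving rollout", "owner": "team-inference"},
    {"ts": datetime(2024, 6, 1, 9, 5), "type": "policy",
     "summary": "cost-cap update", "owner": "finops"},
]
print(correlate_changes(anomaly, events)[0]["summary"])  # model-serving rollout
```

Only the 14:32 rollout falls inside the two-hour window, so it surfaces as the candidate cause along with its owning team.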

The MCP Pattern — Model Context Protocol

Work IQ data can be surfaced via the Model Context Protocol (MCP) — an open standard for connecting AI agents to external context sources. In this architecture, the agent calls an MCP-wrapped endpoint to retrieve change events, decisions, and ownership context, keeping the tool interface clean and the context pipeline extensible. The MCP wrapper is gated behind the ENABLE_MCP feature flag.
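A feature-flag gate like ENABLE_MCP can be pictured with a stub like the following. The function names, payload shape, and fallback behavior are illustrative assumptions; the real MCP client transport is omitted:

```python
import os

# ENABLE_MCP matches the feature flag named in the docs; everything else
# here (function names, payload fields) is hypothetical.
ENABLE_MCP = os.getenv("ENABLE_MCP", "false").lower() == "true"

def get_change_context(service: str, window_hours: int = 24) -> list[dict]:
    """Fetch change events near the anomaly window for a service.

    With ENABLE_MCP set, the call would route through the MCP-wrapped
    Work IQ endpoint; otherwise it falls back to the synthetic stub.
    """
    if ENABLE_MCP:
        raise NotImplementedError("wire the MCP client transport here")
    return _synthetic_stub(service, window_hours)

def _synthetic_stub(service: str, window_hours: int) -> list[dict]:
    # Mirrors the demo's simulated Work IQ outputs.
    return [{"service": service, "type": "deployment",
             "summary": "model-serving rollout", "age_hours": 3}]

print(get_change_context("prod-east-01")[0]["type"])  # deployment
```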

Agent → MCP Protocol → Work IQ Context → Diagnosis + Action

The hybrid advantage

This is what makes agentic ops fundamentally different from traditional monitoring. The agent doesn't just detect anomalies — it explains them by merging two signal streams:

📊 Telemetry

GPU drops, latency spikes, cost anomalies, incidents — the what.

🔗 Intent

Change events, decisions, ownership, runbooks — the why and who.

⚙️ Governed Diagnosis

Root-cause with evidence, confidence, owner, and a safe remediation plan.

We're simulating Work IQ outputs in this demo. Work IQ is in public preview and requires Microsoft 365 Copilot licensing + admin consent. All change events, decisions, ownership, and runbook data shown here are synthetic. Work IQ is presented as a pattern, not a live integration.

Insights from the running agent

You can't govern what you can't see. Every agent call, tool invocation, and LLM request is instrumented with OpenTelemetry and exported to Azure Application Insights.

End-to-end trace pipeline

Agent Call → Tool Invocation → LLM Request → OpenTelemetry → Application Insights

Every span is captured — from the initial user query through tool dispatch to LLM token generation. Traces, metrics, and logs flow through the OpenTelemetry SDK to Azure Application Insights for real-time analysis.
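The span nesting can be pictured with a toy tracer. This is a conceptual, stdlib-only sketch of parent/child spans and durations; the project itself uses the OpenTelemetry SDK, whose API differs:

```python
import contextlib
import time

SPANS = []  # completed spans, appended innermost-first

@contextlib.contextmanager
def span(name, parent=None):
    """Record a named span with its parent and wall-clock duration."""
    start = time.monotonic()
    try:
        yield name
    finally:
        SPANS.append({"name": name, "parent": parent,
                      "duration_s": time.monotonic() - start})

# Agent call → tool invocation → LLM request, nested as in the trace pipeline.
with span("agent_call") as agent:
    with span("tool_invocation", parent=agent):
        pass  # e.g. sql_telemetry query
    with span("llm_request", parent=agent):
        pass  # e.g. GPT-4.1 completion

print([s["name"] for s in SPANS])  # inner spans close first
```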

Telemetry Analysis

Queries synthetic GPU utilization, network latency, cost, and incident data via SQL. Surfaces anomalies, trends, and root-cause signals in seconds.

Observability Instrumentation

OpenTelemetry traces for every agent invocation, tool call, and LLM request. Exports to Application Insights with Azure Monitor Workbook dashboards.

Azure Monitor Workbooks

Pre-built dashboards tracking token usage, response latency, tool call patterns, and error rates — giving operators a live view of agent health and behavior.

Privacy by Default

Content recording is OFF by default (AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED=false). Trace structure and latency are captured — prompt/response content is not.

Quality gates in the pipeline

You don't just build agents — you continuously evaluate them. Four custom evaluators run on every PR in GitHub Actions, gating deployment to ensure the agent never regresses.

Correctness

Does the agent's diagnosis match the expected root cause? Evaluates factual accuracy of the agent's conclusions against ground-truth test sets.

Evidence Quality

Does the agent cite specific telemetry data and change events? Measures grounding — the agent must show its work, not just guess.

Safety

Are the proposed remediation actions safe? Checks for harmful, risky, or out-of-scope actions. Human-in-the-loop gates must be preserved.

Groundedness

Is the agent hallucinating? Detects fabricated data, invented metrics, or claims not supported by the tool outputs the agent actually received.
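A custom evaluator in this style can be as small as a numeric-grounding check. The sketch below is illustrative only, with made-up strings, and is not one of the project's azure-ai-evaluation evaluators:

```python
import re

def groundedness_score(answer: str, tool_outputs: list[str]) -> float:
    """Fraction of numeric claims in the answer found in tool output.

    A crude proxy for hallucination detection: every number the agent
    cites should be traceable to data a tool actually returned.
    """
    claims = re.findall(r"\d+(?:\.\d+)?", answer)
    if not claims:
        return 1.0
    evidence = " ".join(tool_outputs)
    supported = sum(1 for c in claims if c in evidence)
    return supported / len(claims)

answer = "GPU utilization dropped to 41.5 percent after the 14:32 rollout."
tools = ["avg(utilization_pct)=41.5 window=24h", "change: rollout at 14:32 UTC"]
print(groundedness_score(answer, tools))  # 1.0 — every figure is backed by evidence
```

A fabricated number (one that appears in no tool output) would pull the score below 1.0 and, past a threshold, fail the gate.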

The responsible AI story

PR Opened → GitHub Actions CI → Eval Suite Runs → Pass / Fail Gate → Deploy (or Block)

Evaluations run in GitHub Actions CI on every pull request. The eval runner supports --save-baseline to snapshot current scores and --compare-baseline to detect regressions. Failed evals block deployment — the agent can't ship if quality drops. This is continuous evaluation as a first-class CI/CD citizen.
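The regression-gate logic behind --compare-baseline can be sketched roughly as follows. The metric names, file location, and threshold value are assumptions for illustration:

```python
import json
import tempfile
from pathlib import Path

BASELINE = Path(tempfile.gettempdir()) / "eval_baseline.json"
THRESHOLD = 0.02  # max tolerated per-metric drop; an assumed value

def save_baseline(scores: dict) -> None:
    """Snapshot current eval scores (the --save-baseline step)."""
    BASELINE.write_text(json.dumps(scores))

def compare_baseline(scores: dict) -> list[str]:
    """Return metrics that regressed beyond THRESHOLD (empty list = pass)."""
    baseline = json.loads(BASELINE.read_text())
    return [m for m, v in scores.items() if v < baseline.get(m, 0.0) - THRESHOLD]

save_baseline({"correctness": 0.91, "groundedness": 0.95})
print(compare_baseline({"correctness": 0.92, "groundedness": 0.88}))  # ['groundedness']
```

In CI, a non-empty regression list exits non-zero, which is what blocks the deploy job.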

From question to governed action

Developed with GitHub Copilot, validated on every PR by GitHub Actions — here's what the agent actually does at runtime.

Ask a natural language question

The ops engineer poses a question directly — no dashboards, no manual log queries. The agent selects the right tools autonomously.

User > Why did GPU utilization drop in the last 24h on cluster prod-east-01?

Agent queries telemetry + change context

The agent calls the SQL Telemetry Tool to pull GPU metrics and the Work IQ Context Tool to surface recent change events near the anomaly window.

Tool > sql_telemetry: SELECT avg(utilization_pct) … WHERE ts > NOW() - INTERVAL 24h
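The telemetry tool's query pattern can be reproduced against a local SQLite table. The schema and values below are invented for illustration and are not the demo's seeded data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE gpu_metrics (
    ts TEXT, cluster TEXT, utilization_pct REAL)""")
conn.executemany(
    "INSERT INTO gpu_metrics VALUES (?, ?, ?)",
    [("2024-06-01T13:00", "prod-east-01", 92.0),
     ("2024-06-01T14:00", "prod-east-01", 90.5),
     ("2024-06-01T15:00", "prod-east-01", 41.5)],  # the post-rollout drop
)

# Aggregate over the window, as the sql_telemetry tool would.
row = conn.execute(
    """SELECT MIN(utilization_pct), AVG(utilization_pct)
       FROM gpu_metrics WHERE cluster = ?""",
    ("prod-east-01",),
).fetchone()
print(row[0])  # 41.5 — the anomalous low-water mark
```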

Root-cause diagnosis with evidence

The agent returns a concise diagnosis citing telemetry data and correlated change events, with a confidence level and "next best question" when needed.

Agent > Confidence: High · Cause: model-serving rollout at 14:32 UTC reduced batch size

Safe remediation with human approval gate

The Action Stub Tool proposes a change plan with risk level and tradeoffs. No action executes until the operator explicitly approves — governance built-in.

Action > propose_change(plan) → risk:LOW · approval:PENDING · simulate_only:true
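The approval gate behaves roughly like this stub. The function signatures and field names here are illustrative, not the tool's actual interface:

```python
def propose_change(plan: str, risk: str = "LOW") -> dict:
    """Propose a remediation plan; nothing executes until approved."""
    return {"plan": plan, "risk": risk,
            "approval": "PENDING", "simulate_only": True}

def approve(proposal: dict, operator: str) -> dict:
    """Human-in-the-loop gate: only an explicit approval flips the state."""
    return {**proposal, "approval": f"APPROVED by {operator}"}

p = propose_change("restore batch size on prod-east-01 serving tier")
assert p["approval"] == "PENDING" and p["simulate_only"]
print(approve(p, "oncall-sre")["approval"])  # APPROVED by oncall-sre
```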

Getting started in 60 seconds

Clone the repo, seed the synthetic database, and run the agent locally — no Azure credentials required for demo mode.

Terminal
# 1. Clone & install
git clone https://github.com/tammym-demos/Agentic-Ops-Advisor.git
cd Agentic-Ops-Advisor
pip install -r requirements.txt

# 2. Seed the synthetic telemetry database
python -m data.seed_telemetry

# 3. Run in demo mode (no Azure creds needed)
python scripts/run_local.py

# 4. Run the test suite
python -m pytest tests/ -q

GitHub-first, Azure-powered

The GitHub platform is the build and delivery backbone. Azure provides the production runtime for the AI agent.

🤖 GitHub Copilot ⚙️ GitHub Actions 📄 GitHub Pages 🐛 GitHub Issues 🐍 Python 3.11+ 🧠 GPT-4.1 ☁️ Azure AI Foundry 🔵 Azure OpenAI 📡 OpenTelemetry 📊 Application Insights 🗄️ SQLite / Azure SQL 🏗️ Bicep IaC 🧪 Pytest + azure-ai-evaluation
Layer | Technology | Purpose
AI Coding | GitHub Copilot | AI pair programmer — authored all Python, Bicep, and workflow files
CI/CD | GitHub Actions | Lint, test, eval regression gates, Docker build, Azure deployment
Site Hosting | GitHub Pages | This brochure site — deployed from docs/ on every merge to main
Issue Tracking | GitHub Issues | Squad workflows auto-triage, label, and assign issues; CI results posted as PR comments
Agent Framework | azure-ai-projects (Foundry SDK) | Agent lifecycle, tool dispatch, thread management
LLM | GPT-4.1 via Azure OpenAI | Reasoning, diagnosis, natural language generation
Telemetry DB | SQLite (local) / Azure SQL (prod) | Synthetic GPU, network, cost, incident tables
Observability | OpenTelemetry → Application Insights | Trace every agent, tool, and LLM call
Evaluation | azure-ai-evaluation + custom evaluators | Diagnosis accuracy, grounding, safety, hallucination
IaC | Bicep | Foundry project, Azure SQL, App Insights provisioning
Context (simulated) | Work IQ pattern stub / MCP wrapper | Change events, decisions, runbooks (synthetic)

Disclaimers

Synthetic Data Only

All telemetry, incidents, change events, decisions, and ownership data in this demo are entirely synthetic. No real infrastructure, customer data, or internal Microsoft data is used at any point.

Work IQ Simulation

Work IQ outputs shown here are simulated patterns, not a live integration. Work IQ is in public preview and requires Microsoft 365 Copilot licensing plus explicit admin consent for tenant access.

No Real Actions Taken

The Action Stub Tool never modifies external systems. All propose_change and request_approval calls are simulated and return synthetic payloads only.