Built with GitHub Copilot · Shipped via GitHub Actions

Agentic Ops Advisor

Ops teams drown in alerts — but alerts don't explain why. The Agentic Ops Advisor is an AI agent that merges infrastructure telemetry with human intent — change events, decisions, and ownership — to deliver root-cause diagnoses with confidence levels and governed remediation plans.

Built end-to-end on the GitHub platform — coded with GitHub Copilot, tested and shipped through GitHub Actions, hosted on GitHub Pages, and deployed to Azure AI Foundry.

AI-Coded with GitHub Copilot
CI/CD via GitHub Actions
Automated via GitHub Issues

Traditional monitoring sees metrics — not meaning

Your dashboards show what happened. But nobody can tell you why — until someone manually pieces it together.

Sound familiar?

  • 📉 Alert fatigue — hundreds of alerts, no prioritization, same noise every day
  • 🔍 Manual log trawling — hours digging through logs, metrics, and dashboards to find root cause
  • 💬 "Who owns this?" — Slack threads spiraling while the incident burns

The Agentic Ops Advisor was built to close this gap — an AI agent that doesn't just see the metrics, it understands the organizational context behind them.

Every layer of this project runs on GitHub

From the first line of code to the live deployment, GitHub is the hero of this story. Four products power the entire lifecycle.

GitHub Copilot

Every Python file, every tool, every evaluator — written with GitHub Copilot's AI pair programming. Copilot accelerated development from scaffolding to production-ready agent code in record time.

Learn about Copilot ↗
GitHub Actions

The full CI/CD pipeline runs on GitHub Actions: lint → test → eval gates → Docker build → Azure deployment. Every pull request triggers automated quality and regression checks before anything ships.

View Workflows ↗
GitHub Pages

This very site is hosted on GitHub Pages — deployed automatically from the docs/ folder on every merge to main. Zero infrastructure, zero cost, zero ops burden. The demo site ships with the code.

Learn about Pages ↗
GitHub Issues

Squad automation workflows automatically triage, label, and assign incoming issues. Pull request CI results are posted back as issue comments — issue tracking is fully integrated into the delivery pipeline.

View Issues ↗

GitHub builds it · Azure AI Foundry runs it

The GitHub platform owns the entire build pipeline — code, CI/CD, security, packaging. Azure AI Foundry provides the production runtime for the agent.

[Architecture diagram: Agentic Ops Advisor, GitHub Build Layer + Azure Runtime Layer (synthetic demo)]

GitHub Platform (build layer): 🤖 GitHub Copilot (AI pair programmer) · 🐛 GitHub Issues (squad triage, auto-assign) · 📄 GitHub Pages (this site, docs/ folder) · ⚙️ GitHub Actions (lint · test · eval · deploy) · 🔀 Pull Requests (CI checks, eval gates) · 📁 GitHub Repository (branch protection, CODEOWNERS). Squad automation workflows: squad-triage, squad-issue-assign, sync-squad-labels.

Azure (runtime layer): 👤 Ops Engineer asks a natural language query → 🤖 Agentic Ops Advisor on Azure AI Foundry Agent Service (GPT-4.1 · system prompt · tools · 🐳 Docker image built by GitHub Actions and pushed to ACR). Tools: 📊 SQL Telemetry (GPU, network, cost, incidents) · 🔗 Work IQ Context (⚠️ simulated) · 🛡️ Action Stub (propose · approve · simulate). 📡 OpenTelemetry exports to Azure Application Insights / Azure Monitor, tracing every agent, tool, and LLM call end-to-end.

The deploy pipeline story

git push → GitHub Actions (Lint + Test + Eval) → Docker Build → ACR → Foundry Agent Service

Every push triggers the full pipeline: code quality checks, 346 unit tests, evaluation regression gates, then a Docker container build pushed to Azure Container Registry (ACR). The containerized agent is deployed to Azure AI Foundry Agent Service using Bicep IaC templates — infrastructure as code, fully reproducible.

Container Deployment

Containerized agent built by GitHub Actions and deployed to Azure Container Registry (ACR), then served via Azure AI Foundry Agent Service using Bicep IaC.

Bicep Infrastructure as Code

All Azure resources — Foundry project, AI Hub, Application Insights, Azure SQL — are defined in Bicep templates. One deployment, fully reproducible.

Foundry Playground

Once deployed, the agent is accessible through the Azure AI Foundry Playground — a browser-based interface for interacting with the agent in real time.

Work IQ — The missing context layer

The agent is deployed and running on Azure. Now here's what makes it fundamentally different: organizational context. Traditional AIOps sees metrics. Agentic ops sees metrics plus the human decisions that caused them.

Why telemetry alone isn't enough

A GPU utilization drop is just a number. Was it a planned model-serving rollout? An accidental config change? A cost-optimization policy kicking in? Without organizational context, every anomaly is a mystery. Work IQ bridges that gap — correlating infrastructure telemetry with the change events, decisions, ownership, and runbooks that explain what happened and who to talk to next.

Four context surfaces the agent consults

Change Events

Recent deployments, config changes, policy updates, and rollout windows that coincide with anomaly timestamps. The agent asks: "What changed near this incident?"

Decisions

Approval records, AI Factory planning decisions, budget sign-offs, and architecture choices. The agent correlates "who decided this" with "what broke."

Ownership

Service owners, on-call rosters, and escalation paths. When the agent identifies a root cause, it knows exactly who to page — no more "who owns this?" Slack threads.

Runbooks

Documented remediation procedures linked to specific failure modes. The agent surfaces the right runbook for the diagnosed issue — governance through institutional knowledge.

Change Correlation

The agent connects infrastructure events to change context — approvals, policy updates, rollout windows, and AI Factory planning decisions — using the Work IQ pattern. This is how the agent moves from "something broke" to "here's what caused it and who owns it."
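The correlation step described above can be sketched as a simple time-window join. This is a minimal illustration with invented field names and timestamps, not the project's actual implementation:

```python
from datetime import datetime, timedelta

def correlate_changes(anomaly_ts, change_events, window=timedelta(hours=2)):
    """Return change events within ±window of the anomaly, nearest first."""
    nearby = [e for e in change_events if abs(e["ts"] - anomaly_ts) <= window]
    return sorted(nearby, key=lambda e: abs(e["ts"] - anomaly_ts))

# Example: a GPU-utilization drop at 14:40 UTC against two recorded changes.
anomaly = datetime(2024, 6, 1, 14, 40)
events = [
    {"ts": datetime(2024, 6, 1, 14, 32), "type": "deployment",
     "summary": "model-serving rollout", "owner": "team-inference"},
    {"ts": datetime(2024, 6, 1, 9, 5), "type": "policy",
     "summary": "cost-cap update", "owner": "finops"},
]
print(correlate_changes(anomaly, events)[0]["summary"])  # model-serving rollout
```

Only the 14:32 rollout falls inside the two-hour window, so it surfaces as the candidate cause along with its owning team.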

The MCP Pattern — Model Context Protocol

Work IQ data can be surfaced via the Model Context Protocol (MCP) — an open standard for connecting AI agents to external context sources. In this architecture, the agent calls an MCP-wrapped endpoint to retrieve change events, decisions, and ownership context, keeping the tool interface clean and the context pipeline extensible. The MCP wrapper is gated behind the ENABLE_MCP feature flag.
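A feature-flag gate like ENABLE_MCP can be pictured with a stub like the following. The function names, payload shape, and fallback behavior are illustrative assumptions; the real MCP client transport is omitted:

```python
import os

# ENABLE_MCP matches the feature flag named in the docs; everything else
# here (function names, payload fields) is hypothetical.
ENABLE_MCP = os.getenv("ENABLE_MCP", "false").lower() == "true"

def get_change_context(service: str, window_hours: int = 24) -> list[dict]:
    """Fetch change events near the anomaly window for a service.

    With ENABLE_MCP set, the call would route through the MCP-wrapped
    Work IQ endpoint; otherwise it falls back to the synthetic stub.
    """
    if ENABLE_MCP:
        raise NotImplementedError("wire the MCP client transport here")
    return _synthetic_stub(service, window_hours)

def _synthetic_stub(service: str, window_hours: int) -> list[dict]:
    # Mirrors the demo's simulated Work IQ outputs.
    return [{"service": service, "type": "deployment",
             "summary": "model-serving rollout", "age_hours": 3}]

print(get_change_context("prod-east-01")[0]["type"])  # deployment
```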

Agent → MCP Protocol → Work IQ Context → Diagnosis + Action

The hybrid advantage

This is what makes agentic ops fundamentally different from traditional monitoring. The agent doesn't just detect anomalies — it explains them by merging two signal streams:

📊 Telemetry

GPU drops, latency spikes, cost anomalies, incidents — the what.

🔗 Intent

Change events, decisions, ownership, runbooks — the why and who.

⚙️ Governed Diagnosis

Root-cause with evidence, confidence, owner, and a safe remediation plan.

We're simulating Work IQ outputs in this demo. Work IQ is in public preview and requires Microsoft 365 Copilot licensing + admin consent. All change events, decisions, ownership, and runbook data shown here are synthetic. Work IQ is presented as a pattern, not a live integration.

Insights from the running agent

You can't govern what you can't see. Every agent call, tool invocation, and LLM request is instrumented with OpenTelemetry and exported to Azure Application Insights.

End-to-end trace pipeline

Agent Call → Tool Invocation → LLM Request → OpenTelemetry → Application Insights

Every span is captured — from the initial user query through tool dispatch to LLM token generation. Traces, metrics, and logs flow through the OpenTelemetry SDK to Azure Application Insights for real-time analysis.
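The span nesting can be pictured with a toy tracer. This is a conceptual, stdlib-only sketch of parent/child spans and durations; the project itself uses the OpenTelemetry SDK, whose API differs:

```python
import contextlib
import time

SPANS = []  # completed spans, appended innermost-first

@contextlib.contextmanager
def span(name, parent=None):
    """Record a named span with its parent and wall-clock duration."""
    start = time.monotonic()
    try:
        yield name
    finally:
        SPANS.append({"name": name, "parent": parent,
                      "duration_s": time.monotonic() - start})

# Agent call → tool invocation → LLM request, nested as in the trace pipeline.
with span("agent_call") as agent:
    with span("tool_invocation", parent=agent):
        pass  # e.g. sql_telemetry query
    with span("llm_request", parent=agent):
        pass  # e.g. GPT-4.1 completion

print([s["name"] for s in SPANS])  # inner spans close first
```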

Telemetry Analysis

Queries synthetic GPU utilization, network latency, cost, and incident data via SQL. Surfaces anomalies, trends, and root-cause signals in seconds.

Observability Instrumentation

OpenTelemetry traces for every agent invocation, tool call, and LLM request. Exports to Application Insights with Azure Monitor Workbook dashboards.

Azure Monitor Workbooks

Pre-built dashboards tracking token usage, response latency, tool call patterns, and error rates — giving operators a live view of agent health and behavior.

Privacy by Default

Content recording is OFF by default (AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED=false). Trace structure and latency are captured — prompt/response content is not.

Quality gates in the pipeline

You don't just build agents — you continuously evaluate them. Four custom evaluators run on every PR in GitHub Actions, gating deployment to ensure the agent never regresses.

Correctness

Does the agent's diagnosis match the expected root cause? Evaluates factual accuracy of the agent's conclusions against ground-truth test sets.

Evidence Quality

Does the agent cite specific telemetry data and change events? Measures grounding — the agent must show its work, not just guess.

Safety

Are the proposed remediation actions safe? Checks for harmful, risky, or out-of-scope actions. Human-in-the-loop gates must be preserved.

Groundedness

Is the agent hallucinating? Detects fabricated data, invented metrics, or claims not supported by the tool outputs the agent actually received.
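A custom evaluator in this style can be as small as a numeric-grounding check. The sketch below is illustrative only, with made-up strings, and is not one of the project's azure-ai-evaluation evaluators:

```python
import re

def groundedness_score(answer: str, tool_outputs: list[str]) -> float:
    """Fraction of numeric claims in the answer found in tool output.

    A crude proxy for hallucination detection: every number the agent
    cites should be traceable to data a tool actually returned.
    """
    claims = re.findall(r"\d+(?:\.\d+)?", answer)
    if not claims:
        return 1.0
    evidence = " ".join(tool_outputs)
    supported = sum(1 for c in claims if c in evidence)
    return supported / len(claims)

answer = "GPU utilization dropped to 41.5 percent after the 14:32 rollout."
tools = ["avg(utilization_pct)=41.5 window=24h", "change: rollout at 14:32 UTC"]
print(groundedness_score(answer, tools))  # 1.0 — every figure is backed by evidence
```

A fabricated number (one that appears in no tool output) would pull the score below 1.0 and, past a threshold, fail the gate.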

The responsible AI story

PR Opened → GitHub Actions CI → Eval Suite Runs → Pass / Fail Gate → Deploy (or Block)

Evaluations run in GitHub Actions CI on every pull request. The eval runner supports --save-baseline to snapshot current scores and --compare-baseline to detect regressions. Failed evals block deployment — the agent can't ship if quality drops. This is continuous evaluation as a first-class CI/CD citizen.
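The regression-gate logic behind --compare-baseline can be sketched roughly as follows. The metric names, file location, and threshold value are assumptions for illustration:

```python
import json
import tempfile
from pathlib import Path

BASELINE = Path(tempfile.gettempdir()) / "eval_baseline.json"
THRESHOLD = 0.02  # max tolerated per-metric drop; an assumed value

def save_baseline(scores: dict) -> None:
    """Snapshot current eval scores (the --save-baseline step)."""
    BASELINE.write_text(json.dumps(scores))

def compare_baseline(scores: dict) -> list[str]:
    """Return metrics that regressed beyond THRESHOLD (empty list = pass)."""
    baseline = json.loads(BASELINE.read_text())
    return [m for m, v in scores.items() if v < baseline.get(m, 0.0) - THRESHOLD]

save_baseline({"correctness": 0.91, "groundedness": 0.95})
print(compare_baseline({"correctness": 0.92, "groundedness": 0.88}))  # ['groundedness']
```

In CI, a non-empty regression list exits non-zero, which is what blocks the deploy job.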

From question to governed action

Developed with GitHub Copilot, validated on every PR by GitHub Actions — here's what the agent actually does at runtime.

Ask a natural language question

The ops engineer poses a question directly — no dashboards, no manual log queries. The agent selects the right tools autonomously.

User > Why did GPU utilization drop in the last 24h on cluster prod-east-01?

Agent queries telemetry + change context

The agent calls the SQL Telemetry Tool to pull GPU metrics and the Work IQ Context Tool to surface recent change events near the anomaly window.

Tool > sql_telemetry: SELECT avg(utilization_pct) … WHERE ts > NOW() - INTERVAL 24h
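The telemetry tool's query pattern can be reproduced against a local SQLite table. The schema and values below are invented for illustration and are not the demo's seeded data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE gpu_metrics (
    ts TEXT, cluster TEXT, utilization_pct REAL)""")
conn.executemany(
    "INSERT INTO gpu_metrics VALUES (?, ?, ?)",
    [("2024-06-01T13:00", "prod-east-01", 92.0),
     ("2024-06-01T14:00", "prod-east-01", 90.5),
     ("2024-06-01T15:00", "prod-east-01", 41.5)],  # the post-rollout drop
)

# Aggregate over the window, as the sql_telemetry tool would.
row = conn.execute(
    """SELECT MIN(utilization_pct), AVG(utilization_pct)
       FROM gpu_metrics WHERE cluster = ?""",
    ("prod-east-01",),
).fetchone()
print(row[0])  # 41.5 — the anomalous low-water mark
```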

Root-cause diagnosis with evidence

The agent returns a concise diagnosis citing telemetry data and correlated change events, with a confidence level and "next best question" when needed.

Agent > Confidence: High · Cause: model-serving rollout at 14:32 UTC reduced batch size

Safe remediation with human approval gate

The Action Stub Tool proposes a change plan with risk level and tradeoffs. No action executes until the operator explicitly approves — governance built-in.

Action > propose_change(plan) → risk:LOW · approval:PENDING · simulate_only:true
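The approval gate behaves roughly like this stub. The function signatures and field names here are illustrative, not the tool's actual interface:

```python
def propose_change(plan: str, risk: str = "LOW") -> dict:
    """Propose a remediation plan; nothing executes until approved."""
    return {"plan": plan, "risk": risk,
            "approval": "PENDING", "simulate_only": True}

def approve(proposal: dict, operator: str) -> dict:
    """Human-in-the-loop gate: only an explicit approval flips the state."""
    return {**proposal, "approval": f"APPROVED by {operator}"}

p = propose_change("restore batch size on prod-east-01 serving tier")
assert p["approval"] == "PENDING" and p["simulate_only"]
print(approve(p, "oncall-sre")["approval"])  # APPROVED by oncall-sre
```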

Getting started in 60 seconds

Clone the repo, seed the synthetic database, and run the agent locally — no Azure credentials required for demo mode.

Terminal
# 1. Clone & install
git clone https://github.com/tammym-demos/Agentic-Ops-Advisor.git
cd Agentic-Ops-Advisor
pip install -r requirements.txt

# 2. Seed the synthetic telemetry database
python -m data.seed_telemetry

# 3. Run in demo mode (no Azure creds needed)
python scripts/run_local.py

# 4. Run the test suite
python -m pytest tests/ -q

GitHub-first, Azure-powered

The GitHub platform is the build and delivery backbone. Azure provides the production runtime for the AI agent.

🤖 GitHub Copilot ⚙️ GitHub Actions 📄 GitHub Pages 🐛 GitHub Issues 🐍 Python 3.11+ 🧠 GPT-4.1 ☁️ Azure AI Foundry 🔵 Azure OpenAI 📡 OpenTelemetry 📊 Application Insights 🗄️ SQLite / Azure SQL 🏗️ Bicep IaC 🧪 Pytest + azure-ai-evaluation
Layer | Technology | Purpose
AI Coding | GitHub Copilot | AI pair programmer — authored all Python, Bicep, and workflow files
CI/CD | GitHub Actions | Lint, test, eval regression gates, Docker build, Azure deployment
Site Hosting | GitHub Pages | This brochure site — deployed from docs/ on every merge to main
Issue Tracking | GitHub Issues | Squad workflows auto-triage, label, and assign issues; CI results posted as PR comments
Agent Framework | azure-ai-projects (Foundry SDK) | Agent lifecycle, tool dispatch, thread management
LLM | GPT-4.1 via Azure OpenAI | Reasoning, diagnosis, natural language generation
Telemetry DB | SQLite (local) / Azure SQL (prod) | Synthetic GPU, network, cost, incident tables
Observability | OpenTelemetry → Application Insights | Trace every agent, tool, and LLM call
Evaluation | azure-ai-evaluation + custom evaluators | Diagnosis accuracy, grounding, safety, hallucination
IaC | Bicep | Foundry project, Azure SQL, App Insights provisioning
Context (simulated) | Work IQ pattern stub / MCP wrapper | Change events, decisions, runbooks (synthetic)

Disclaimers

Synthetic Data Only

All telemetry, incidents, change events, decisions, and ownership data in this demo are entirely synthetic. No real infrastructure, customer data, or internal Microsoft data is used at any point.

Work IQ Simulation

Work IQ outputs shown here are simulated patterns, not a live integration. Work IQ is in public preview and requires Microsoft 365 Copilot licensing plus explicit admin consent for tenant access.

No Real Actions Taken

The Action Stub Tool never modifies external systems. All propose_change and request_approval calls are simulated and return synthetic payloads only.