De-Risk Your Network Using CML, Splunk and MCP Server

Going live at Cisco Live
Abhay and I have been working on something that kept us up way too many nights. On Tuesday, we're presenting it at Cisco Live: a session called "De-Risk Your Network using CML, Splunk and MCP Server." The core idea is simple. Stop testing network changes in production. The execution, though, that's where it gets interesting.
If you've ever pushed a routing change on a Friday evening and watched your phone light up with alerts, you know the feeling. The cold sweat. The "let me just rollback real quick" moment that never goes as quickly as you hoped. We've been there. Multiple times.
This post is a walkthrough of what we built, why we built it, and what you'll see in the session.
The problem nobody wants to admit
Network engineers are still pushing changes to production with varying degrees of confidence. Some teams have staging environments. Most don't. Even those who do often find that their staging doesn't match production closely enough to catch real issues.
The numbers tell the story. A significant portion of outages traces back to configuration changes. Not hardware failures. Not software bugs. Human-initiated changes that didn't behave the way someone expected.
The traditional workflow looks like this:
- Engineer designs a change
- Maybe runs it through a peer review
- Schedules a maintenance window
- Pushes the change to production
- Monitors and hopes
Step 5 is the problem. "Monitor and hope" isn't a strategy. It's a prayer.
What we built
We combined three tools that, individually, are powerful. Together, they create something that fundamentally changes how you validate network operations.
Cisco Modeling Labs (CML)
CML gives you a virtual replica of your production network. Real Cisco images. Real routing protocols. Real behavior. You can spin up a topology that mirrors your data center, campus, or WAN and run changes against it before touching anything in production.
But here's what most people miss about CML. It's not just a lab tool for training. It's a validation engine. You can programmatically create topologies, inject configurations, simulate failures, and collect the results. All through APIs.
We use CML to:
- Mirror production topologies automatically
- Pre-validate configuration changes against the model
- Simulate failure scenarios (link down, node failure, BGP peer loss)
- Run regression tests after every proposed change
The key word there is "automatically." Nobody opens a GUI and clicks around. Everything is API-driven.
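To make "API-driven" concrete, here's a minimal sketch of topology creation and failure injection using virl2_client, the Python client for the CML API. The server URL, credentials, node definitions, and config paths are placeholders, and error handling is omitted.

```python
# Minimal sketch: build, configure, boot, and break a small topology
# through the CML API. URL, credentials, and file paths are placeholders.
from virl2_client import ClientLibrary

client = ClientLibrary("https://cml.example.com", "admin", "changeme",
                       ssl_verify=False)

# Create a two-router lab mirroring a slice of production
lab = client.create_lab("prod-mirror-core")
r1 = lab.create_node("core-rtr-01", "iosv", x=0, y=0)
r2 = lab.create_node("core-rtr-02", "iosv", x=200, y=0)
link = lab.connect_two_nodes(r1, r2)

# Inject day-0 configs pulled from your source of truth
r1.config = open("configs/core-rtr-01.cfg").read()
r2.config = open("configs/core-rtr-02.cfg").read()

# Boot the topology and wait for the nodes to come up
lab.start(wait=True)

# Failure simulation: drop the core link and let the protocols reconverge
link.stop()
```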
Splunk for operational intelligence
Splunk sits on the other side of the equation. While CML models what should happen, Splunk tells you what is happening. Syslog, SNMP traps, streaming telemetry, NetFlow. All of it flows into Splunk where you can correlate events, detect anomalies, and build dashboards that actually mean something.
For our workflow, Splunk serves two purposes:
Pre-change intelligence. Before we validate a change in CML, we pull operational data from Splunk. What does the current state look like? Are there existing issues? What's the baseline for interface utilization, BGP session stability, or error counters? This context matters because you can't validate a change in isolation.
Post-change verification. After a change goes live (validated through CML first), Splunk monitors the real network for deviations from expected behavior. If something drifts from the validated model, you know immediately.
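As a rough sketch of that pre-change pull, here's what a baseline query looks like with splunklib from the Splunk Python SDK. The host, credentials, index, sourcetype, and field names are assumptions; yours will differ.

```python
# Sketch of a pre-change baseline pull via the Splunk Python SDK.
# Index, sourcetype, and field names below are illustrative assumptions.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="splunk.example.com", port=8089,
                         username="admin", password="changeme")

# 24h average interface utilization per device/interface
query = ("search index=network sourcetype=snmp earliest=-24h "
         "| stats avg(util_percent) AS avg_util BY host, interface")
job = service.jobs.create(query, exec_mode="blocking")

baseline = {}
for row in results.JSONResultsReader(job.results(output_mode="json")):
    if isinstance(row, dict):  # skip diagnostic messages
        baseline[(row["host"], row["interface"])] = float(row["avg_util"])
```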
The MCP Server (the glue that makes it work)
This is where it gets fun. The Model Context Protocol (MCP) server is what connects large language models to both CML and Splunk. It exposes topology data, telemetry feeds, and operational state to an LLM through a standardized protocol.
Why does this matter? Because it means you can interact with your network infrastructure using natural language.
Instead of writing custom scripts for every validation scenario, you can ask:
- "What would happen if I lose the BGP session between core-rtr-01 and core-rtr-02?"
- "Show me the current utilization on all WAN links and compare it to last week"
- "Validate this OSPF area migration plan against the CML topology"
- "Are there any anomalies in Splunk that correlate with the last maintenance window?"
The MCP server translates these queries into API calls against CML and Splunk, aggregates the results, and presents them in a way that makes sense. No more jumping between three different consoles.
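To show the shape of that bridge, here's a toy MCP server exposing a single Splunk-backed tool, built with the FastMCP helper from the official Python MCP SDK. The tool, its query, and the run_splunk_search stub are illustrative assumptions; the real server exposes far more than this.

```python
# Toy MCP server with one Splunk-backed tool, using the FastMCP helper
# from the official Python MCP SDK. The tool and its query are
# illustrative; run_splunk_search is a hypothetical stand-in for the
# splunklib pattern shown earlier.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("network-derisk")

def run_splunk_search(query: str) -> list[dict]:
    # Hypothetical helper: wrap the splunklib job pattern shown above.
    return []

@mcp.tool()
def bgp_session_states(site: str) -> str:
    """Current BGP session state per peer at a site, pulled from Splunk."""
    rows = run_splunk_search(
        f"search index=network sourcetype=bgp site={site} "
        f"| stats latest(state) AS state BY peer"
    )
    return "\n".join(f"{r['peer']}: {r['state']}" for r in rows)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the LLM invokes tools from here
```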
The workflow in practice
Here's what a real change validation looks like end to end:
Step 1: Pull current state
The MCP server queries Splunk for the current operational baseline. Interface states, routing tables, error counters, CPU/memory utilization. This becomes the "before" snapshot.
Query: "Pull the current operational baseline for the DC fabric"
MCP Server → Splunk API:
- Interface utilization (last 24h average)
- BGP session states
- Error counters trending
- CPU/memory baseline per device
Step 2: Model the change in CML
The proposed configuration change gets applied to the CML topology. This happens programmatically. The engineer defines the change, and the system pushes it to the virtual environment.
Query: "Apply the proposed VXLAN EVPN migration to the CML lab and validate convergence"
MCP Server → CML API:
- Push configuration delta to virtual topology
- Wait for convergence
- Collect routing tables, MAC tables, ARP tables
- Run connectivity tests between all leaf pairs
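Under the hood, step 2 might look something like the sketch below. The lab title and node label are assumptions, apply_config_delta and get_routes stand in for however you push config and read state (console, pyATS, Netmiko), and "convergence" here just means the routing table stops changing between polls.

```python
# Sketch of step 2: apply a change to an existing CML lab, then poll
# until the routing table is stable. apply_config_delta and get_routes
# are hypothetical stand-ins for your config-push and state-read paths.
import time
from virl2_client import ClientLibrary

client = ClientLibrary("https://cml.example.com", "admin", "changeme",
                       ssl_verify=False)
lab = client.find_labs_by_title("prod-mirror-dc-fabric")[0]
leaf = lab.get_node_by_label("leaf-01")

apply_config_delta(leaf, open("changes/vxlan-evpn.cfg").read())

# Declare convergence once two consecutive polls return identical routes
prev = None
while True:
    cur = get_routes(leaf)
    if cur == prev:
        break
    prev = cur
    time.sleep(10)
```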
Step 3: Validate and compare
The LLM, through the MCP server, compares the CML results against the Splunk baseline. It identifies any discrepancies, unexpected behavior, or potential issues.
Query: "Compare the CML validation results with the production baseline. Flag any concerns."
MCP Server → Analysis:
- Routing table diff: 3 new routes, 0 missing routes ✓
- BGP sessions: All established ✓
- Convergence time: 2.3 seconds ✓
- Warning: Interface Eth1/3 on leaf-03 shows higher utilization in model
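The comparison itself can be as simple as set arithmetic over the two snapshots. A sketch, assuming routes are normalized to prefix strings and utilization to fractions; the 20% drift threshold is an arbitrary example:

```python
# Sketch of step 3: diff model results against the production baseline.
# Input shapes and the 20% drift threshold are illustrative assumptions.
def diff_routes(baseline: set[str], model: set[str]) -> dict:
    return {"new_routes": sorted(model - baseline),
            "missing_routes": sorted(baseline - model)}

def flag_utilization_drift(baseline: dict, model: dict,
                           threshold: float = 0.20) -> list[str]:
    warnings = []
    for iface, before in baseline.items():
        after = model.get(iface, 0.0)
        if before and (after - before) / before > threshold:
            warnings.append(f"{iface}: {before:.0%} -> {after:.0%} in model")
    return warnings

print(diff_routes({"10.0.0.0/24", "10.0.1.0/24"},
                  {"10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"}))
# {'new_routes': ['10.0.2.0/24'], 'missing_routes': []}
```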
Step 4: Execute with confidence
Only after CML validation passes do you push to production. And Splunk keeps watching.
Step 5: Continuous monitoring
Post-change, the MCP server keeps comparing real telemetry from Splunk against the validated model from CML. If behavior deviates from what was predicted, you get an alert. Not a generic "something is wrong" alert. A specific "the validated model predicted X but production is showing Y" alert.
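A sketch of that watch loop, assuming the predicted values come out of the CML run and that poll_splunk_metric and alert are hypothetical wrappers around the patterns above:

```python
# Sketch of step 5: compare live telemetry against what the validated
# model predicted. poll_splunk_metric and alert are hypothetical hooks;
# the point is the specific, model-aware alert text.
import time

predicted = {"leaf-03:Eth1/3": 0.52}  # utilization the CML run predicted

def watch(predicted: dict, tolerance: float = 0.10, interval: int = 300):
    while True:
        for iface, expected in predicted.items():
            actual = poll_splunk_metric(iface)
            if abs(actual - expected) > tolerance:
                alert(f"Validated model predicted {expected:.0%} on {iface}, "
                      f"production is showing {actual:.0%}")
        time.sleep(interval)
```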
AI-driven troubleshooting
The MCP server doesn't just help with planned changes. It becomes a troubleshooting companion.
When an issue occurs, the traditional approach involves logging into devices, running show commands, checking multiple dashboards, correlating timestamps manually. It's slow and depends heavily on the engineer's experience level.
With the MCP server connected to both CML and Splunk, troubleshooting looks different:
"We're seeing packet loss between site A and site B. What's going on?"
The MCP server can:
- Pull relevant telemetry from Splunk (interface errors, drops, utilization)
- Check the CML model for the expected path between sites
- Compare expected vs. actual routing
- Identify where the deviation occurs
- Suggest remediation based on similar past incidents in Splunk
All of that in seconds, not hours.
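Orchestrated through MCP tools, that whole flow is a handful of calls. A hypothetical sketch, where every helper stands in for one of the tool calls described above:

```python
# Hypothetical sketch of the troubleshooting flow; each helper stands
# in for an MCP tool call described above, and all names are illustrative.
def diagnose_packet_loss(site_a: str, site_b: str) -> str:
    telemetry = pull_splunk_telemetry(site_a, site_b)  # errors, drops, utilization
    expected = expected_path_from_cml(site_a, site_b)  # path per validated model
    actual = traceroute_production(site_a, site_b)     # live forwarding path

    if actual != expected:
        hop = next((a for e, a in zip(expected, actual) if e != a), actual[-1])
        return f"Path deviates from the model starting at {hop}; check recent changes there."

    noisy = [i for i, t in telemetry.items() if t["errors_per_min"] > 0]
    return f"Path matches the model; interfaces with errors: {noisy or 'none'}"
```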
What you'll see in the session
During the talk on Tuesday, Abhay and I will walk through a live demo. Not slides with screenshots. Actual live interaction with CML, Splunk, and the MCP server.
You'll see:
- Topology mirroring: How we replicate a production-like environment in CML automatically
- Change validation: A real configuration change validated end to end
- Natural language queries: Asking the LLM questions about the network and getting actionable answers
- Failure simulation: Breaking things in CML on purpose and watching the system detect and diagnose the issue
- Splunk correlation: How operational data enriches the AI's understanding of what's happening
We'll also show some of the edge cases we ran into during development: the times the LLM confidently gave the wrong answer, and the guardrails we built around that. Because if there's one thing I've learned building MCP servers, it's that trust needs verification.
Why this matters for your team
The shift here isn't just technical. It's operational.
Junior engineers get superpowers. A less experienced engineer connected to this system can troubleshoot at a level that used to require a decade of experience. The AI brings the context. The engineer brings the judgment.
Change management becomes evidence-based. Instead of "I think this will work," you have "CML validated this change against a production mirror, and here are the results." Bring that to a CAB meeting. It's a different conversation entirely.
Mean time to resolution drops. When an issue hits, you're not starting from scratch. The system already has context about recent changes, current state, and historical patterns.
Innovation accelerates. When testing is cheap and fast (spin up a CML topology, validate, tear down), teams experiment more. They try new designs. They optimize. Because the cost of being wrong in a lab is zero.
The tech stack at a glance
| Component | Role | Integration |
|---|---|---|
| Cisco Modeling Labs | Virtual network topology, change validation | REST API |
| Splunk | Operational telemetry, event correlation | REST API, HEC |
| MCP Server | Protocol bridge between LLM and infrastructure tools | Model Context Protocol |
| LLM (Claude) | Natural language interface, analysis, recommendations | Via MCP Server |
Come find us
If you're at Cisco Live, come to the session on Tuesday. Bring your skepticism. We built this because we were tired of the "test in production" culture that somehow became normalized in networking.
If you can't make it to the session, I'll be posting a follow-up with more technical details, architecture diagrams, and links to the code after the event. Abhay and I are also happy to chat during the conference, so don't hesitate to reach out.
The era of "monitor and hope" is over. It's time to validate, verify, and deploy with confidence.






