
EVPN in a Box Part 3: CI/CD, an LLM That Talks to Your Lab, and the Memory Wall

Tags: VXLAN, EVPN, CI/CD, MCP, LLM, Docker, Home-Lab

Part 3 of the EVPN in a Box series. Part 1: ND and CML on Proxmox | Part 2: Building the fabric deployer

Where we left off

In Part 1, I got Nexus Dashboard and CML running on a single Proxmox NUC, hacking around ND's assumptions about ESXi. In Part 2, I built a Python deployer that provisions a full VXLAN/EVPN fabric from a single YAML file through NDFC. The fabric works end-to-end with two VRFs and full multi-tenancy. But every change still means SSH-ing into the automation box and running commands by hand.

This final part covers three things: adding a CI/CD pipeline so the fabric deploys on every git push, wiring an LLM into the lab so I can query the infrastructure through natural language, and hitting a memory wall that forced me to rebuild the entire lab.

Part I: The CI/CD Pipeline

The EVPN fabric deployer worked. You could run python deploy.py --all and get a working VXLAN/EVPN fabric from a single YAML file. But every change still meant SSH-ing into the automation box. So I added Phase 0c: a Docker-based CI/CD pipeline on GitLab that validates and deploys the fabric config on every push.

The idea was simple. Push a YAML change. GitLab runs a validate stage (schema check, diff preview). If it looks good, hit the manual deploy button. Ansible runs against NDFC. Done.

It took four pipeline runs and 12 issues to get there.

The architecture

VM 102 runs Ubuntu 24.04 with GitLab CE on HTTPS (self-signed TLS chain), the GitLab Container Registry on port 5050, Docker CE, and a GitLab Runner with the Docker executor. The runner pulls a custom ansible-runner image from the local registry, built from a multi-stage Dockerfile with Python 3.11, a pinned ansible-core, the cisco.nac_dc_vxlan and cisco.dcnm collections, all pip dependencies, the dcnm bug patch from Part 2 pre-applied, and the self-signed CA cert for registry trust.

Credentials come from GitLab CI/CD masked variables, with an Ansible Vault-encrypted vault.yml as fallback.

The whole setup is automated. python deploy.py --phase 0 provisions the VM from scratch, generates the TLS cert chain, installs GitLab and Docker, builds the runner image, configures CI/CD variables, pushes the repo. You can destroy VM 102 and rebuild from nothing in about 30 minutes.

The issues that cost the most time

sudo only applies to the first command. The deployer's SSH helper prepends sudo to the command string. But when you pipe commands like sudo curl ... | gpg --dearmor -o /etc/apt/keyrings/docker.gpg, the sudo only covers curl. The gpg runs as the regular user. Permission denied. This bit me four separate times with different piped commands. Fix: wrap everything in sudo bash -c '...'.
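The fix generalizes: have the SSH helper quote the whole command line and hand it to sudo bash -c, so root applies to every stage of the pipeline. A minimal Python sketch of such a wrapper (the function name is mine, not the actual deployer's helper):

```python
import shlex

def sudo_wrap(command: str) -> str:
    """Wrap a full command line in `sudo bash -c '...'` so sudo covers
    every stage of a pipe, not just the first word. Illustrative sketch."""
    # shlex.quote protects embedded quotes and pipes in the command
    return f"sudo bash -c {shlex.quote(command)}"

# Without the wrapper, only curl would run as root here:
cmd = ("curl -fsSL https://download.docker.com/linux/ubuntu/gpg "
       "| gpg --dearmor -o /etc/apt/keyrings/docker.gpg")
print(sudo_wrap(cmd))
```

The whole pipeline, including the gpg write to /etc/apt/keyrings, now runs inside a root shell.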

GitLab said no, four different ways. The health check polled an endpoint that returns 401 without auth (fix: treat 401 as "up"). Personal access tokens with expires_at: nil stopped working in GitLab 18.9 (fix: 1.year.from_now). Masked variables reject passwords with ! in them (fix: fallback to unmasked). Runner registration with glrt-* tokens rejects --tag-list and --description flags (fix: set via API instead).

The TOML merge problem. The GitLab Runner config.toml needed a volume mount for SSL certificates. I used sed to insert the volume. But the file already had volumes = ["/cache"]. Two volumes keys in the same TOML section. The runner silently used one and ignored the other. Rewrote it with Python-based TOML parsing that merges both volume paths into a single array. Don't use sed for structured config files.

Four pipeline runs to green

Run #1: Missing pip dependencies. The cisco.nac_dc_vxlan collection needs nac-yaml, nac-validate, jmespath, macaddress, netaddr, and packaging. None were in the Docker image. Added all six, rebuilt.

Run #2: ansible-core too new. The Dockerfile installed the latest ansible-core (2.19.7). The collection requires <2.19.0. The internal prep plugin broke silently with a 'dict object' has no attribute 'mgmt_ip_address' error. Fix: pin to "ansible-core>=2.15.0,<2.19.0".

Run #3: The prep plugin's hostname-to-IP resolution didn't work. The fix was including mgmt_ip_address directly in the generated YAML alongside hostname in the attach groups. The collection uses whatever fields are present.

Run #4: Validate: passed, 21 seconds. Deploy: passed, 62 seconds. 250 Ansible tasks, 33 changes, 0 failures.

VRF-Red:  DEPLOYED
VRF-Blue: DEPLOYED
Net-Red:  DEPLOYED
Net-Blue: DEPLOYED

Push a commit that changes the fabric YAML. GitLab picks it up. Validate stage runs a schema check and diff against current NDFC state. Manual gate. Click deploy. 62 seconds later, the fabric is updated.
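The Run #3 fix is worth spelling out: the generated YAML simply carries both identifiers, so the prep plugin never has to resolve a hostname itself. A minimal sketch of the generator logic (the hostname and mgmt_ip_address field names come from the error above; the rest of the entry's schema is simplified, not the collection's exact format):

```python
def attach_group_entry(hostname: str, mgmt_ip: str, ports: list[str]) -> dict:
    """Emit both hostname and mgmt_ip_address in an attach-group entry
    so the collection uses the IP directly. Simplified sketch."""
    return {
        "hostname": hostname,
        "mgmt_ip_address": mgmt_ip,  # the field the prep plugin failed to derive
        "switch_ports": ports,       # illustrative key name
    }

entry = attach_group_entry("Leaf-1", "192.168.1.11", ["Ethernet1/3"])
```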

Part II: An LLM That Talks to Your Lab

At this point the lab was fully automated. Push a YAML change, pipeline deploys it, done. But every time I wanted to check something, I was still jumping between browser tabs. Is the CML lab running? What VRFs are deployed? Which switches are online?

I wanted to just ask. Type "is my lab running?" into a chat and get an answer from the actual infrastructure.

That's what MCP (Model Context Protocol) does. It lets LLMs call external tools. There are already MCP servers for both platforms I care about: my own Nexus Dashboard MCP server, and a community CML MCP by xorrkaz.

The architecture

VM 102 (192.168.1.252)
+-------------------------------------------------------------------+
|                                                                   |
|  ND MCP Server (self-contained repo)                              |
|  +-------------------------------------------------------------+ |
|  | PostgreSQL :15432  |  MCP Server  |  Web API :8444 (HTTPS)  | |
|  | Web UI :7443       |  638 tools auto-discovered from NDFC    | |
|  +-------------------------------------------------------------+ |
|                                                                   |
|  MCP Platform (CML MCP + LibreChat)                               |
|  +-------------------------------------------------------------+ |
|  | CML MCP :9000 (Streamable HTTP) -> CML at 192.168.1.251     | |
|  | LibreChat :3080 -> CML MCP + OpenAI                         | |
|  | MongoDB :27018                                               | |
|  +-------------------------------------------------------------+ |
|                                                                   |
+-------------------------------------------------------------------+

Two separate docker-compose files. The deployment is fully scripted. scripts/deploy-mcp-platform.sh reads credentials from config.yaml, generates crypto keys, SSHes to VM 102, and stands up both stacks. Idempotent. Re-run it and it updates configs and restarts containers.

LibreChat was chosen over Open WebUI because it natively supports both SSE and Streamable HTTP MCP transports. No proxy layer needed.

The issues worth mentioning

MongoDB can't do math. First docker compose up -d. MongoDB crashes immediately. Exit code 132: SIGILL (illegal instruction). MongoDB 5+ requires AVX. VM 102 was running with Proxmox's default kvm64 CPU type, which doesn't have AVX. One command fix: qm set 102 --cpu host. You lose live migration. In a single-node lab, irrelevant.
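The quick way to see whether this will bite you is to look for the avx flag in /proc/cpuinfo before starting the container. A small check, sketched in Python:

```python
def cpu_has_avx(cpuinfo_text: str) -> bool:
    """Check /proc/cpuinfo 'flags' lines for AVX support. MongoDB 5+
    dies with SIGILL (exit 132) on CPUs without it, which includes
    Proxmox's default kvm64 vCPU type."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx" in line.split(":", 1)[1].split()
    return False

# On the VM itself:
# with open("/proc/cpuinfo") as f:
#     print(cpu_has_avx(f.read()))
```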

LibreChat blocks its own MCP connections. All containers running, MCP tools never connected. LibreChat has an mcpSettings.allowedDomains security feature that whitelists MCP server hosts. If you don't set it, everything is blocked. Had to explicitly allow both MCP server addresses.

ND MCP SSE returns 401. The ND MCP requires an API token in the Authorization header. Three places need to agree on the same token: the ND MCP container, the LibreChat docker-compose environment, and the server config in librechat.yaml.

685 tools walk into a context window

This was the big one. Both MCP servers connected. The ND MCP registered 638 tools. The CML MCP registered 47. Total: 685.

I typed "is my CML lab running?" and got:

{"type":"empty_messages","info":"Message pruning removed all messages
as none fit in the context window."}

Every MCP tool gets sent to the LLM as part of the system prompt. At roughly 300-500 tokens per tool, 685 tools consume 200,000-340,000 tokens before you even type a word. GPT-4o has a 128,000 token context window. The tool definitions alone were two to three times larger than the entire context.
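The arithmetic is worth making explicit (the 300-500 tokens-per-tool figure is my rough estimate from above, not a measured value):

```python
def tool_prompt_tokens(n_tools: int, per_tool=(300, 500)) -> tuple[int, int]:
    """Rough lower/upper bound on tokens the MCP tool definitions
    consume in the system prompt before the user types anything."""
    return n_tools * per_tool[0], n_tools * per_tool[1]

low, high = tool_prompt_tokens(685)
# Compare against GPT-4o's 128,000-token context window:
print(low, high, low > 128_000)
```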

The fix was removing the ND MCP from LibreChat's config. 638 tools from a single MCP server is too many for any current LLM context window. With only the CML MCP's 47 tools, everything worked.

Typed "is my CML lab running?" again. GPT-4o called get_cml_labs, got the response, and told me my evpn-lab was running with all nodes active.

Fun fact: the ND MCP server is my own project. I built it to auto-discover every API endpoint NDFC exposes and turn them into MCP tools. Great for exploring the API programmatically. Not so great when 638 tools blow up the context window of every LLM you connect it to.

So I went back and added tool filtering. You can now configure which NDFC services to expose (fabric management only, or just read operations, or a custom subset). Took a few iterations to get the filtering granular enough without losing the auto-discovery that makes it useful. The ND MCP still runs on port 7443 with its management dashboard for direct NDFC access. With filtering enabled, it's actually usable in a chat session now.
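I won't reproduce the ND MCP's actual config here, but the shape of the filtering is simple: keep tools whose names match an allowed service prefix, optionally only read operations. A generic sketch (the tool names and the (name, HTTP method) representation are hypothetical, not the server's real data model):

```python
def filter_tools(tools, allowed_prefixes=(), read_only=False):
    """Filter auto-discovered tools down to a chat-sized subset:
    by service-name prefix and optionally to GET-style reads only.
    Generic sketch of the approach described above."""
    kept = []
    for name, method in tools:
        if allowed_prefixes and not name.startswith(tuple(allowed_prefixes)):
            continue
        if read_only and method != "GET":
            continue
        kept.append(name)
    return kept

tools = [
    ("fabric_list_fabrics", "GET"),    # hypothetical tool names
    ("fabric_create_fabric", "POST"),
    ("image_mgmt_upload", "POST"),
]
print(filter_tools(tools, allowed_prefixes=["fabric_"], read_only=True))
```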

Part III: The Memory Wall

The pipeline was green. Validate passed. Deploy passed. NDFC showed both VRFs and both networks as DEPLOYED. Everything looked perfect on paper. Then I checked BGP EVPN on the spine.

Spine-1# show bgp l2vpn evpn summary
Neighbor  V  AS    MsgRcvd  MsgSent  InQ  OutQ  Up/Down  State/PfxRcd
10.0.0.1  4  65000 0        0        0    0     00:00:00 Idle (NoMem)

Idle (NoMem). BGP couldn't allocate memory to establish the peering session. On a switch with 6 GB of RAM. Only 675 MB free. The NX-OS "Severe Alert" threshold had kicked in.

The root cause: I was running the nxosv9300-lite-10-4-7 image. The lite image ships with reduced feature sets and tighter memory defaults. It handles basic L2/L3 fine, but BGP EVPN with route reflectors, VNI tables, and Type-2/Type-5 routes needs more headroom than it has.

Shrinking the topology

First move: cut from four switches to two. One spine, one leaf, two hosts. Minimum viable EVPN fabric.

Before:                          After:

  Spine-1    Spine-2               Spine-1
   /    \   /    \                   |
Leaf-1  Leaf-2                     Leaf-1
  |  |    |  |                      |   |
 R1  B1  R2  B2                    R1   B1

Still showed Idle (NoMem). The problem wasn't VM-level allocation. It was inside NX-OS itself. The lite image has hard limits on BGP process memory below what EVPN needs.

Switching to the full image

Time for nxosv9300-10-5-3-f. Full NX-OS with all features. Default 12 GB RAM per switch. I figured 8 GB might work.

First attempt: change the RAM and image on the existing node. CML said no.

400 Bad Request: "Cannot modify node attributes cpus, ram in state STOPPED"

CML's state machine: you can only modify hardware attributes in DEFINED_ON_CORE state. Stopping a node puts it in STOPPED. You have to wipe disks first to get back to DEFINED_ON_CORE.
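The observed state machine boils down to a small lookup. A sketch (DEFINED_ON_CORE, STOPPED, STARTED, and BOOTED are CML node states, but the exact transition set here is my reading of the behavior, not a documented API guarantee):

```python
MUTABLE_STATE = "DEFINED_ON_CORE"  # the only state where cpus/ram can change

def steps_to_modify(node_state: str) -> list[str]:
    """Return the operations needed before changing hardware attributes
    on a CML node in the given state. Sketch of observed behavior."""
    if node_state == MUTABLE_STATE:
        return ["modify"]
    if node_state == "STOPPED":
        return ["wipe_disks", "modify"]  # wipe returns it to DEFINED_ON_CORE
    if node_state in ("STARTED", "BOOTED"):
        return ["stop", "wipe_disks", "modify"]
    raise ValueError(f"unknown node state: {node_state}")

print(steps_to_modify("STOPPED"))
```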

Wiped, changed image, started. The switch boot-looped:

This image is not compatible with the current hardware platform. Rebooting system.

CML provisions the virtual disk and hardware profile for the original image definition. The lite and full images use different disk layouts. You can change the image_definition field via API, CML accepts it, but the existing virtual disk is incompatible. There's no in-place upgrade path.

The nuclear option

Couldn't just delete and recreate the node either. CML 2.9.1 doesn't support creating interfaces via REST API. No interfaces means no links. Dead end.

Only path: regenerate the entire lab topology from scratch. This is where the topology builder paid for itself. Update config.yaml to use the full image:

switches:
  image: "nxosv9300-10-5-3-f"
  boot_image: "nxos64-cs.10.5.3.F.bin"
  ram_mb: 8192
  vcpus: 4

Generate, delete old lab, import new topology, start, collect new serial numbers, push to GitLab, pipeline deploys.

Spine-1# show system resources
Memory usage:   Total: 8145612K   Used: 4784232K   Free: 3361380K

3.3 GB free. No more "Severe Alert."

Spine-1# show bgp l2vpn evpn summary
Neighbor  V  AS    MsgRcvd  MsgSent  InQ  OutQ  Up/Down  State/PfxRcd
10.0.0.1  4  65000 42       38       0    0     00:15:32 4

Four prefixes received. BGP EVPN established. VRF isolation confirmed. The EVPN control plane finally works.

Final resource allocation

+-------------------------------------------------------------+
|  Proxmox Host: 96 GB DDR5                                   |
|                                                              |
|  VM 100: Nexus Dashboard           64 GB  (NDFC minimum)    |
|  VM 101: CML                       20 GB                    |
|    - Spine-1 (nxosv9300-10-5-3-f)   8 GB  (3.3 GB free)    |
|    - Leaf-1  (nxosv9300-10-5-3-f)   8 GB  (3.3 GB free)    |
|    - host-red-1 (Alpine)           512 MB                   |
|    - host-blue-1 (Alpine)          512 MB                   |
|    - CML system overhead           ~3 GB                    |
|  VM 102: Automation/GitLab          8 GB                    |
|    + GitLab CE, CI/CD pipeline                              |
|    + MCP platform (ND MCP + CML MCP + LibreChat)            |
|                                                              |
|  Total:                            ~92 GB                    |
+-------------------------------------------------------------+

What I'd do differently

Use the full NX-OS image from day one. The lite image saves about 2 GB per switch but can't run BGP EVPN reliably. For any fabric automation project, that's a non-starter.

Check how many tools an MCP server registers before connecting it to a chat UI. The ND MCP's 638 auto-discovered tools are great for API exploration but need a filtering layer for chat use. 47 tools from CML works fine.

Skip sed for any structured config file. TOML, YAML, JSON, whatever. Parse it properly, modify the data structure, write it back. I hit silent key collisions twice in this project.

Read the vendor collection source code early. Both the ansible-core version issue and the hostname resolution issue would have been obvious from reading 200 lines of Python in the prep plugin.

Don't assume CML image swaps work. If you need to change the NX-OS image family, plan to rebuild the lab. There's no in-place upgrade path through the API.

92 GB of RAM on a single NUC, and the most important decisions were which NX-OS image to pick and how many tools to give the LLM.

The entire project is open source: github.com/beye91/evpn-in-a-box




About the Author

Chris Beye

Network automation enthusiast and technology explorer sharing practical insights on Cisco technologies, infrastructure automation, and home lab experiments. Passionate about making complex networking concepts accessible and helping others build better systems.
