EVPN in a Box Part 2: Building an Automated VXLAN/EVPN Fabric Deployer

Part 2 of the EVPN in a Box series. Part 1: ND and CML on Proxmox
Where we left off
In Part 1, I moved my entire Cisco lab from ESXi to Proxmox after VMware's licensing changes. Getting Nexus Dashboard to run meant injecting systemd .link files directly into the initrd to fix interface naming, and writing an expect script to automate the setup wizard that forgets everything on reboot. CML is running with a 5-node N9Kv spine-leaf topology. The infrastructure is ready. Now it needs automation.
I spent the last few weeks building a tool that deploys a full VXLAN/EVPN data center fabric from a single YAML file and a python3 deploy.py --all command. Four Nexus 9000v switches in a spine-leaf topology, two VRFs, two overlay networks, OSPF underlay, BGP EVPN overlay, all orchestrated by NDFC. The whole thing runs in my home lab on Proxmox.
This is not a tutorial. This is a war story.
The Setup
Everything runs on a single Proxmox VE host with 96 GB of RAM. Three VMs:
- Nexus Dashboard (64GB RAM) running NDFC for SDN fabric management
- Cisco CML (48GB RAM) hosting the virtual N9Kv switches
- An automation VM running GitLab CE and Ansible
That's 120 GB allocated on a 96 GB host. Yes, it's overprovisioned. In practice, neither ND nor CML uses its full allocation at the same time. Proxmox handles memory ballooning, and I configured swap on the host as a safety net for the occasional spike. It's not something I'd recommend for production, but for a lab that runs a few hours at a time, it works.
+----------------------------------------------------------------------+
| Proxmox VE Host                                                      |
|                                                                      |
|  +--------------------+  +--------------------+  +-----------------+ |
|  | ND (VM 100)        |  | CML (VM 101)       |  | Auto (VM 102)   | |
|  | 64 GB RAM          |  | 48 GB RAM          |  | 8 GB RAM        | |
|  | Nexus Dashboard    |  | Virtual Switches   |  | GitLab+Ansible  | |
|  |                    |  |                    |  |                 | |
|  | mgmt0   > vmbr0    |  | bridge0 > vmbr0    |  | net0 > vmbr0    | |
|  | fabric0 > vmbr1    |  | bridge1 > vmbr1    |  | net1 > vmbr1    | |
|  +--------------------+  +--------------------+  +-----------------+ |
|            |                       |                      |          |
|      vmbr0 (LAN)            vmbr1 (internal)        vmbr0 + vmbr1    |
|    192.168.1.0/24           172.16.1.0/24                            |
+----------------------------------------------------------------------+
The deployer itself is a three-phase Python orchestrator:
Phase 0: Infrastructure     Phase 1: CML Lab            Phase 2: Fabric
=======================     ================            ===============
Create vmbr1 bridge         Generate topology YAML      Ansible validate (schema)
Move ND NIC to vmbr1        Import lab into CML         Ansible create (NDFC state)
Add CML second NIC          Start N9Kv switches         Ansible deploy (to switches)
Create automation VM        Bootstrap switch configs
Install GitLab + Ansible    Collect serial numbers
                            Generate NAC YAML files
The config file drives everything. Switch names, IPs, VRFs, VLANs, VNIs, BGP ASN. Change a value, re-run the phase, it reconciles.
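To make that concrete, here's the shape of such a config file. This is a sketch only: the key names may differ from the repo's actual config.yaml, and the BGP ASN is a placeholder; the VLAN/VNI/gateway values are the ones used in this lab.

```yaml
# Illustrative shape of the driving config -- key names are assumptions,
# bgp_asn is a placeholder; VRF/VLAN/VNI values are from this lab.
fabric:
  name: evpn-in-a-box
  bgp_asn: 65001          # placeholder
switches:
  spines: [Spine-1, Spine-2]
  leafs:  [Leaf-1, Leaf-2]
vrfs:
  - name: VRF-Red
    vlan: 2301
    vni: 130001
    gateway: 10.10.10.1/24
  - name: VRF-Blue
    vlan: 2302
    vni: 130002
    gateway: 10.20.20.1/24
```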
The fabric topology looks like this:
   +-----------+       +-----------+
   |  Spine-1  |       |  Spine-2  |
   |   (RR)    |       |   (RR)    |
   +-----+-----+       +-----+-----+
         | \               / |
         |   \           /   |
         |     \       /     |
         |       \   /       |
         |         X         |
         |       /   \       |
         |     /       \     |
         |   /           \   |
         | /               \ |
   +-----+-----+       +-----+-----+
   |  Leaf-1   |       |  Leaf-2   |
   +--+-----+--+       +--+-----+--+
      |     |             |     |
    host   host         host   host
   red-1  blue-1        red-2  blue-2
  VRF-Red VRF-Blue     VRF-Red VRF-Blue
VRF-Red: VLAN 2301 / VNI 130001 / GW 10.10.10.1/24
VRF-Blue: VLAN 2302 / VNI 130002 / GW 10.20.20.1/24
The Five Things That Actually Broke
1. CML Alpine host configuration
For the Alpine Linux hosts at the edges of the fabric, the network config goes directly into the node's configuration field in CML. No cloud-init, just raw shell commands that get sourced as root on boot:
hostname host-red-1
ip address add 10.10.10.11/24 dev eth0
ip link set dev eth0 up
ip route add default via 10.10.10.1
Four lines per host. Simple once you know where to put it.
2. N9Kv stuck in NDFC "Migration" mode
After the switches booted and NDFC discovered them, they showed up in the inventory. But they wouldn't transition from "Migration" mode to "Normal" mode. NDFC kept retrying, the switches kept resetting.
I spent hours checking SNMP configs, POAP settings, management reachability. Everything looked fine.
The root cause: the bootstrap config was missing a boot statement.
boot nxos bootflash:nxos64-cs.10.5.3.F.bin
Without this, NDFC can't persist the running config. Every time it tries to write, the switch comes back without a boot variable, and NDFC treats it as a migration scenario. One missing line, hours of debugging.
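For reference, a minimal bootstrap along these lines. The hostname and mgmt addressing below are placeholders; the boot line is the one quoted above, and it's the one that matters.

```
hostname Leaf-1
boot nxos bootflash:nxos64-cs.10.5.3.F.bin
interface mgmt0
  vrf member management
  ip address 192.168.1.21/24   ! placeholder mgmt IP
```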
3. cisco.dcnm Ansible collection bug
Phase 2 uses the cisco.nac_dc_vxlan collection (version 0.6.0), which depends on cisco.dcnm (version 3.10.0). During fabric deployment, the obtain_federated_fabric_associations() method in the dcnm collection would crash with:
'str' object has no attribute 'get'
The NDFC API returns a string instead of a dict for certain federation responses. The collection code assumes it always gets a dict and calls .get() on it. Classic API response handling bug.
There's no configuration-level workaround; I had to patch the vendor code directly in ~/.ansible/collections/ansible_collections/cisco/dcnm/. Not ideal, but it works. The fix is a three-line type check before the .get() call.
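The shape of that type check, as a standalone sketch. The real patch sits inside the collection's obtain_federated_fabric_associations() method; the function and key names here are illustrative, not the vendor's.

```python
import json

def safe_associations(response):
    # NDFC sometimes returns a JSON string (or something else entirely)
    # instead of a parsed dict for federation responses. Normalize the
    # type before ever calling .get() on it.
    if isinstance(response, str):
        try:
            response = json.loads(response)
        except ValueError:
            return []
    if not isinstance(response, dict):
        return []
    return response.get("associations", [])
```

The defensive pattern generalizes: any time a REST client hands a payload to code that assumes a dict, check the type first and fail soft.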
4. The access port vs. trunk port trap
This one was subtle. The NetAsCode collection lets you attach networks to switches with a ports field in the attachment group. I used it to specify which interface each host was on:
attach:
  - switch_name: "Leaf-1"
    ports:
      - "Ethernet1/3"

Looks reasonable. But what this actually creates is a trunk interface. The Alpine hosts send untagged traffic, which hits native VLAN 1 instead of the configured overlay VLAN 2301. Traffic enters the switch and goes nowhere.
The fix was to separate concerns completely. The access interface configuration (switchport mode access, switchport access vlan) goes into the topology bootstrap config that gets applied during Phase 1. The network attachment in Phase 2 just declares which switches carry the network, without specifying ports. NDFC handles the SVI and VXLAN mapping. The physical port config stays in the switch running-config from bootstrap.
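The corrected split then looks something like this. The YAML mirrors the snippet above minus the ports key (exact schema details beyond switch_name may differ); the interface lines go into the Phase 1 bootstrap config, with Ethernet1/3 as the example port.

```yaml
# Phase 2: declare which switches carry the network -- no ports key
attach:
  - switch_name: "Leaf-1"
  - switch_name: "Leaf-2"
```

```
! Phase 1 bootstrap (per leaf): pin the host port as an access port
interface Ethernet1/3
  switchport mode access
  switchport access vlan 2301
```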
5. Anycast gateway and ping verification
One thing to keep in mind when building automated verification: pinging from a switch SVI to a remote host behind another leaf won't work with anycast gateways. Both leafs share the same gateway MAC and IP, so the ICMP reply goes to the local SVI instead of crossing the VXLAN tunnel back. This is expected EVPN behavior, not a fault. If your automation runs connectivity checks, always ping host-to-host, not from the switch SVI.
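A verification-plan builder along those lines: same-VRF host pairs must pass, cross-VRF pairs must be blocked, and nothing ever pings from an SVI. The .12 addresses for the second hosts are assumed; only host-red-1's .11 appears earlier.

```python
def connectivity_checks(hosts):
    """hosts: list of (name, vrf, ip) tuples.
    Returns (expect_ok, expect_blocked) ping pairs: same-VRF pairs
    should succeed over VXLAN, cross-VRF pairs should be blocked by
    tenant isolation. All checks are host-to-host, never from an SVI."""
    expect_ok, expect_blocked = [], []
    for i, (src, src_vrf, _) in enumerate(hosts):
        for dst, dst_vrf, dst_ip in hosts[i + 1:]:
            pair = (src, dst, dst_ip)
            if src_vrf == dst_vrf:
                expect_ok.append(pair)
            else:
                expect_blocked.append(pair)
    return expect_ok, expect_blocked
```

Feeding it the four lab hosts yields two expected-pass pairs and four expected-blocked pairs, which is exactly the result matrix in the next section.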
The Result
After working through all of this, the fabric works end-to-end:
host-red-1 (Leaf-1, VRF-Red) <--VXLAN--> host-red-2 (Leaf-2, VRF-Red) : 0% loss
host-blue-1 (Leaf-1, VRF-Blue) <--VXLAN--> host-blue-2 (Leaf-2, VRF-Blue) : 0% loss
host-red-1 -> host-blue-1 (cross-VRF) : blocked
Two VRFs, two overlay networks, full multi-tenancy with VRF isolation. All deployed from a single config file.
The deployer runs in about 25-40 minutes end-to-end, depending on how fast the N9Kv images boot. Most of that time is waiting for NX-OS. The actual automation runs in a few minutes.
The Tech Stack
For anyone looking to build something similar:
- Proxmox VE on an ASUS NUC 14 Pro with 96 GB RAM
- Nexus Dashboard + NDFC for SDN fabric management
- Cisco CML running N9Kv 10.5.3
- Python 3.11 with virl2-client (CML API), netmiko, and paramiko
- Ansible with cisco.nac_dc_vxlan 0.6.0 and cisco.dcnm 3.10.0
- GitLab CE on the automation VM for CI/CD (validate before deploy)
- Everything config-driven from a single YAML file, secrets in env vars
What I'd Do Differently
If I were starting over:
- Always include the boot variable in N9Kv bootstrap. This should be in every NX-OS virtual switch config template, period. Without it, NDFC can't persist configs and treats the switch as a migration scenario.
- Test with a single leaf first. I deployed the full four-switch topology every time. A single spine + single leaf would have been enough to catch most issues in a fraction of the time.
- Read the vendor collection source code early. The cisco.dcnm bug would have been found faster if I'd been reading the module code instead of just the error messages.
- Always verify connectivity host-to-host, not from the switch. With anycast gateways, pinging from the SVI to a remote host will fail by design. Build your automated verification around host-to-host checks from the start.
The deployer code is structured in phases specifically so you can re-run individual steps without tearing down the whole lab. That decision alone probably saved me more time than anything else in this project.
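The phase structure boils down to something like this sketch. The flag names mirror the --all command mentioned above; the phase names and internals are illustrative, not deploy.py's actual code.

```python
import argparse

# Ordered phases, keyed by the number you pass on the CLI.
PHASES = {
    "0": "infrastructure",   # bridges, VMs, GitLab + Ansible
    "1": "cml_lab",          # topology, boot, bootstrap, serials
    "2": "fabric",           # NDFC validate / create / deploy
}

def run_phase(name):
    # Stand-in for the real phase logic.
    print(f"running phase: {name}")

def main(argv=None):
    parser = argparse.ArgumentParser(description="phased fabric deployer")
    parser.add_argument("--phase", choices=list(PHASES), help="re-run one phase")
    parser.add_argument("--all", action="store_true", help="run every phase")
    args = parser.parse_args(argv)
    selected = list(PHASES) if args.all else [p for p in [args.phase] if p]
    for p in selected:
        run_phase(PHASES[p])
    return selected
```

Because each phase is idempotent against the config file, `--phase 2` alone re-reconciles the fabric without touching the VMs or the CML lab.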
The entire project is open source: github.com/beye91/evpn-in-a-box. The repo contains the full deployer, the config.yaml that drives everything, the topology builder for CML, the NAC data model generator, and the CI/CD pipeline config. Clone it, adjust the config to your environment, and run python deploy.py --all.