Building an AI-Powered Webhook Platform for Splunk and ServiceNow

Back in 2025, Jörg and I were building lab content for Cisco Live APJC. The session was LTRATO-2600, a cross-domain automation lab covering everything from HashiCorp Vault to NDFC pipelines. One of the optional tasks, Scenario 3, focused on something I'd been wanting to build for a while: using AI to make ServiceNow tickets actually useful.
The idea was straightforward. Syslog errors hit Splunk, Splunk fires a webhook, a Python service catches it, asks an LLM "what's going on here and how do I fix it?", then creates a ServiceNow incident with the AI-generated analysis baked in. No more copy-pasting raw syslog output into a ticket and hoping the next engineer knows what DUP_SRC_IP means at 3am.
The lab version worked. A single Python file, hardcoded credentials, one mnemonic it could handle. Good enough for a 20-minute lab exercise. Not good enough for anything real.
The gap between "demo" and "usable"
After the session, I kept thinking about it. The concept was solid. The implementation was a prototype at best. Here's what bugged me:
- Hardcoded everything. The LLM API key, ServiceNow credentials, SMTP settings, all sitting in a Python file. Change one thing and you're editing source code.
- Single mnemonic support. The lab only handled DUP_SRC_IP. Real networks throw hundreds of different error types. Each one might need different notification routing.
- No visibility. Did the webhook fire? Did the LLM respond? Did ServiceNow accept the ticket? No logs, no audit trail, nothing.
- No management interface. Want to add a new alert type? SSH in and edit Python. Want to change the LLM provider? Same thing.
I decided to rebuild it properly. Not as a script, but as a platform. One Python file became four containers, 11 database tables, and a full admin interface.
The architecture
┌──────────────────────────────────────────────────────────┐
│                      Docker Network                      │
│                                                          │
│  ┌──────────┐    ┌──────────┐    ┌───────────────┐       │
│  │ Admin UI │    │Config API│    │Webhook Service│       │
│  │ (Next.js)│    │ (FastAPI)│    │    (Flask)    │       │
│  │ Port 3000│    │ Port 8000│    │   Port 5001   │       │
│  └────┬─────┘    └────┬─────┘    └───────┬───────┘       │
│       │               │                  │               │
│       └───────────────┼──────────────────┘               │
│                       │                                  │
│               ┌───────▼────────┐                         │
│               │   PostgreSQL   │                         │
│               │   Port 5432    │                         │
│               └────────────────┘                         │
│                                                          │
└──────────────────────────────────────────────────────────┘
External Integrations:
Splunk ──webhook──> Webhook Service
Webhook Service ──> OpenAI / Ollama (LLM Analysis)
Webhook Service ──> ServiceNow (Incident Creation)
Webhook Service ──> SMTP (Email Notifications)
Four containers. Each one does one thing.
Webhook Service (Flask)
This is what Splunk talks to. It receives the JSON payload, looks up the alert type by mnemonic, decides what to do with it, and executes. The processing pipeline looks like this:
- Splunk fires a POST to /webhook with the alert payload
- Service parses the JSON, extracts the mnemonic, host, vendor, severity
- Looks up the alert type in the database. If there's no match, it logs it and moves on
- If an LLM provider is configured for that alert type, it sends the error context and asks for analysis
- Based on the notification routing, it creates a ServiceNow incident, sends an email, or both
- Everything gets logged with timestamps and processing duration
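The pipeline above can be sketched as plain Python. This is a simplified, illustrative version: the payload shape, field names, and in-memory alert-type table are assumptions for the example, not the project's actual API (the real lookup hits PostgreSQL).

```python
# Minimal sketch of the webhook processing pipeline. Field names and the
# in-memory ALERT_TYPES table are illustrative stand-ins for the real
# database-backed lookup.

def extract_alert(payload: dict) -> dict:
    """Step 2: pull the fields the router cares about out of the webhook JSON."""
    result = payload.get("result", {})
    return {
        "mnemonic": result.get("mnemonic", "UNKNOWN"),
        "host": result.get("host", ""),
        "vendor": result.get("vendor", ""),
        "severity": result.get("severity", ""),
        "message": result.get("_raw", ""),
    }

# Step 3: stand-in for the database table keyed by mnemonic.
ALERT_TYPES = {
    "DUP_SRC_IP": {"llm_provider": "ollama", "route": "servicenow+email"},
    "LINK_DOWN": {"llm_provider": None, "route": "email"},
}

def process_webhook(payload: dict) -> str:
    alert = extract_alert(payload)
    alert_type = ALERT_TYPES.get(alert["mnemonic"])
    if alert_type is None:
        return "logged: no matching alert type"  # step 3: no match, log and move on
    if alert_type["llm_provider"]:
        pass  # step 4: send the error context to the configured LLM for analysis
    return f"routed via {alert_type['route']}"   # step 5: notification routing

payload = {"result": {"mnemonic": "DUP_SRC_IP", "host": "sw-01",
                      "vendor": "Cisco IOS", "severity": "4",
                      "_raw": "%IP-4-DUPADDR: Duplicate address ..."}}
print(process_webhook(payload))  # routed via servicenow+email
```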
The LLM integration supports both OpenAI and Ollama. I run Ollama locally for testing because sending lab syslog messages to OpenAI's API gets expensive fast when you're iterating.
Config API (FastAPI)
This handles all the configuration CRUD. LLM providers, ServiceNow instances, SMTP servers, alert types, notification routing. Everything that was hardcoded in the original script now lives in the database and is managed through REST endpoints.
It also handles authentication. JWT tokens, bcrypt password hashing, role-based access. The API docs are auto-generated at /docs thanks to FastAPI's OpenAPI integration.
One thing I'm happy with: credential encryption. All API keys, passwords, and tokens are encrypted with Fernet before they hit the database. The encryption key lives in the environment, not in the database. So even if someone gets a database dump, they get encrypted blobs, not passwords.
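The pattern looks roughly like this. A sketch using the `cryptography` package's Fernet; the function names are illustrative, and in a real deployment the key is generated once and injected via the environment, never generated at import time as it is here for demonstration.

```python
import os
from cryptography.fernet import Fernet  # pip install cryptography

# Demo only: generate a key if the environment doesn't provide one.
# In production the key is provisioned out of band and lives in the
# environment, never in the database.
os.environ.setdefault("ENCRYPTION_KEY", Fernet.generate_key().decode())

fernet = Fernet(os.environ["ENCRYPTION_KEY"].encode())

def encrypt_credential(plaintext: str) -> bytes:
    """What gets written to the database: an opaque Fernet token."""
    return fernet.encrypt(plaintext.encode())

def decrypt_credential(token: bytes) -> str:
    """Only a service holding the environment key can reverse it."""
    return fernet.decrypt(token).decode()

token = encrypt_credential("sk-example-api-key")
assert decrypt_credential(token) == "sk-example-api-key"
```

A database dump without the environment key yields only those opaque tokens.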
Admin UI (Next.js)
Because nobody wants to manage webhook configurations through curl commands. The UI gives you:
- A dashboard with webhook processing statistics
- LLM provider management (add/test/switch between OpenAI and Ollama)
- ServiceNow configuration with connection testing
- SMTP server setup
- Alert type management with notification routing
- A webhook log viewer with filtering and search
- A test webhook page where you can fire test payloads and watch them flow through the pipeline
PostgreSQL
Eleven tables, covering users, LLM providers, ServiceNow configs, SMTP configs, alert types, notification routing, email recipients, webhook logs, and audit logs. The schema gets initialized automatically on first boot through init scripts.
How alert routing works
This is where the platform gets more interesting than the original script. Instead of one hardcoded mnemonic, you can configure any number of alert types and route them independently.
Say you want DUP_SRC_IP errors to create a ServiceNow ticket with LLM analysis AND send an email to the network team. But LINK_DOWN should only send an email (no need for LLM analysis on something that obvious). And BGP_PEER_RESET should create a ServiceNow ticket with LLM analysis but no email.
Each alert type (identified by mnemonic) can have multiple notification channels. Each channel specifies:
- Whether to use a ServiceNow instance (and which one)
- Whether to send email (and to whom, with cc/bcc support)
- Which LLM provider to use for analysis (or none)
All configurable through the admin UI. No code changes.
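The three examples above translate into a routing table like this sketch. The structure and field names are hypothetical, chosen to mirror the description rather than the platform's actual schema:

```python
# Hypothetical routing table mirroring the three examples above.
# Field names are illustrative, not the platform's actual schema.
ROUTING = {
    "DUP_SRC_IP":     {"servicenow": "snow-prod", "email": ["netops@example.com"], "llm": "openai"},
    "LINK_DOWN":      {"servicenow": None,        "email": ["netops@example.com"], "llm": None},
    "BGP_PEER_RESET": {"servicenow": "snow-prod", "email": [],                     "llm": "openai"},
}

def channels_for(mnemonic: str) -> list[str]:
    """Resolve a mnemonic to the actions its routing config calls for."""
    route = ROUTING.get(mnemonic, {})
    actions = []
    if route.get("llm"):
        actions.append(f"analyze with {route['llm']}")
    if route.get("servicenow"):
        actions.append(f"ticket in {route['servicenow']}")
    if route.get("email"):
        actions.append(f"mail {', '.join(route['email'])}")
    return actions

print(channels_for("LINK_DOWN"))  # ['mail netops@example.com']
```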
The LLM integration
The LLM prompt construction is deliberate. I don't just throw the raw syslog message at the model and hope for the best. The service builds a context block that includes:
- The error message text
- The originating host
- The vendor (Cisco IOS, NX-OS, etc.)
- The mnemonic identifier
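Assembled, the context block looks something like this. The exact prompt wording here is invented for illustration; the real template lives in the service:

```python
def build_llm_context(message: str, host: str, vendor: str, mnemonic: str) -> str:
    """Assemble the context block sent to the LLM.
    Prompt wording is illustrative, not the service's actual template."""
    return (
        "You are a network operations assistant. Analyze this syslog error "
        "and suggest remediation steps.\n\n"
        f"Error message: {message}\n"
        f"Host: {host}\n"
        f"Vendor: {vendor}\n"
        f"Mnemonic: {mnemonic}\n"
    )

ctx = build_llm_context("%IP-4-DUPADDR: Duplicate address ...",
                        "sw-01", "Cisco IOS", "DUP_SRC_IP")
```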
The LLM's response gets appended to the ServiceNow ticket alongside the raw error information. So the support engineer sees both: the actual syslog data and the AI's analysis with suggested remediation steps.
For OpenAI, it uses the chat completions API. For Ollama, it hits the local inference endpoint. The model, temperature, and max tokens are all configurable per provider through the admin interface.
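The two providers want differently shaped request bodies. A sketch of the payload construction only (no network call); the model names are placeholders, and the per-provider settings map onto what the admin interface exposes:

```python
def openai_request_body(model: str, prompt: str,
                        temperature: float = 0.2, max_tokens: int = 500) -> dict:
    """Request body for OpenAI's chat completions API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def ollama_request_body(model: str, prompt: str,
                        temperature: float = 0.2) -> dict:
    """Request body for Ollama's local /api/generate endpoint,
    which takes generation settings under 'options'."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # wait for the full analysis, not a token stream
        "options": {"temperature": temperature},
    }
```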
What I learned building it
Encryption matters from day one. The original lab script had API keys in plain text. When I started building the platform, I added Fernet encryption for all credentials before writing a single CRUD endpoint. It's so much harder to retrofit encryption into an existing data model.
Gunicorn timeout settings will bite you. LLM inference can be slow, especially with Ollama running on modest hardware. The default Gunicorn timeout is 30 seconds. A complex error analysis on a slower model can take 60-90 seconds. I bumped the timeout to 120 seconds and added 4 workers to handle concurrent requests while one worker waits for an LLM response.
Test webhooks save hours. Building the test webhook page in the admin UI was one of the first things I did. Being able to fire a synthetic payload and watch it flow through the entire pipeline (LLM call, ServiceNow ticket creation, email) without waiting for Splunk to trigger an alert made development so much faster.
Docker networking between services is its own debugging adventure. The webhook service needs to talk to the config API's database. The admin UI needs to talk to the config API. Everything needs to resolve each other by container name, not localhost. Getting the Docker Compose networking right with proper health checks and startup ordering took more time than I'd like to admit.
From the lab to production
The Cisco Live lab task still exists if you want to see where this all started. That single-file webhook handler taught the concept. This platform makes it operational.
The entire project is open source: github.com/beye91/splunk_webhook_service. Clone it, run docker-compose up, and you've got a working webhook platform with an admin interface, LLM integration, and ServiceNow ticket creation. All the configuration happens through the UI after first boot.
What's next
There are a few things on my list:
- Webhook templates. Right now, the service expects Splunk's specific JSON format. I want to support configurable payload parsers so it can handle webhooks from other monitoring tools.
- Response playbooks. Instead of just creating tickets, trigger automated remediation through Ansible or a CI/CD pipeline based on the LLM's analysis.
- Multi-tenant support. Different teams with different LLM providers, different ServiceNow instances, different alert routing.
- Metrics and alerting. The webhook logs are there, but proper Prometheus metrics for processing latency, failure rates, and LLM response times would make operational monitoring easier.
How are you handling the gap between "Splunk alert fired" and "someone actually fixes the problem"? I'm curious what other teams are building in that space.