Dear Future Me: How to Fix Splunk Better Webhook SSL Certificate Errors (Again)


Chris Beye

Systems Architect @ Cisco Systems


The Pain of Rediscovering Your Own Solutions
Have you ever spent hours troubleshooting an issue, only to realize you’ve solved this exact problem before? That was me last week, watching my closed-loop network automation completely fail to… well, automate.

Network device hitting bandwidth limits? Splunk should detect it. GitLab pipeline should trigger. Ansible should deploy QoS policies. The whole beautiful self-healing cycle. Except none of it was happening.

This isn’t just a technical write-up. This is a letter to my future self—and anyone else who might stumble into this SSL certificate maze with Splunk’s Better Webhook plugin.
The Use Case: Closed-Loop Network Automation

The Automation Flow:

┌─────────────────┐      ┌──────────┐      ┌─────────┐
│ Catalyst 8000V  │─MDT─>│ Telegraf │─HEC─>│ Splunk  │
│  (Bandwidth >   │      │          │      │Analytics│
│    5 Mbps)      │      └──────────┘      └────┬────┘
└─────────────────┘                              │
                                                 │ Alert Fires
                                                 │
                                         ┌───────▼──────┐
                                         │   Webhook    │
                                         │   Trigger    │ ❌ FAILS HERE
                                         └───────┬──────┘   (SSL Cert)
                                                 │
                                         ┌───────▼──────┐
                                         │GitLab Pipeline│
                                         │              │
                                         └───────┬──────┘
                                                 │
                           ┌─────────────────────┼─────────────────────┐
                           │                     │                     │
                      ┌────▼────┐          ┌─────▼─────┐        ┌─────▼──────┐
                      │ NetBox  │          │  Ansible  │        │   Device   │
                      │ Update  │          │ Playbook  │        │QoS Applied │
                      └─────────┘          └───────────┘        └────────────┘
Step-by-step breakdown:
  1. Telemetry Collection: Catalyst 8000V routers transmit interface bandwidth metrics via Model-Driven Telemetry to Telegraf
  2. Data Ingestion: Telegraf forwards metrics to Splunk via HTTP Event Collector (HEC)
  3. Intelligent Detection: Splunk analytics detect when interface transmission exceeds 5 Mbps threshold
  4. Automated Response: Alert triggers webhook to GitLab CI/CD pipeline with HOSTNAME_TRIGGER variable
  5. Self-Healing Execution:
    • GitLab runs an Ansible playbook – Updates the NetBox interface custom field qos5mbit=true (a rough equivalent is sketched after this list)
    • A second Ansible playbook then deploys the QoS policy “LIMIT-5MBPS” to the affected device
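
To make step 5 concrete, here is a minimal sketch of what the first playbook's NetBox change boils down to, written with the pynetbox client purely for illustration; the URL, token, device, and interface names are placeholders, and the real update is performed by Ansible inside the pipeline.

import pynetbox

# Illustrative values only - the real pipeline gets these from the alert payload
NETBOX_URL = "https://netbox.example.local"
NETBOX_TOKEN = "<netbox-api-token>"
hostname = "cat8kv-01"              # would come from HOSTNAME_TRIGGER
interface_name = "GigabitEthernet2"

nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)

# Flag the congested interface so the follow-up playbook knows to apply the QoS policy
iface = nb.dcim.interfaces.get(device=hostname, name=interface_name)
iface.custom_fields["qos5mbit"] = True
iface.save()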

The Goal: Zero-touch remediation. Network detects congestion, automatically applies QoS policies, minimizes human intervention.

The Reality: My webhook wasn’t firing. SSL certificate verification failed silently. The entire automation chain was broken at step 4.

The Problem: Silent Failures and Missing Pipeline Triggers

The Setup:

  • Splunk Enterprise monitoring telemetry from network infrastructure
  • Better Webhook plugin configured to trigger GitLab pipelines via https://198.18.133.99/api/v4/projects/<id>/pipeline (see the sketch after this list)
  • Alert configured with 1-minute frequency and 4-hour throttle to prevent duplicate executions
  • GitLab instance with self-signed SSL certificates (common in lab/internal environments)
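
For reference, the POST the alert action should be sending looks roughly like this sketch against GitLab's pipeline-creation endpoint; the token, ref, and hostname value are placeholders, and the exact headers and body depend on how the Better Webhook action is configured.

import requests

# Placeholders: project ID, token, and hostname come from your environment / the alert
url = "https://198.18.133.99/api/v4/projects/<id>/pipeline"
headers = {"PRIVATE-TOKEN": "<gitlab-access-token>"}
payload = {
    "ref": "main",
    "variables": [{"key": "HOSTNAME_TRIGGER", "value": "cat8kv-01"}],
}

# With a self-signed certificate, this is exactly where verification fails
r = requests.post(url, json=payload, headers=headers)
print(r.status_code, r.text)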

 

The Symptom: Complete radio silence. Bandwidth exceeds threshold. Splunk alert fires. But the GitLab pipeline? Nowhere to be seen. No triggers. No error messages. Nothing.

The automation that should heal the network was dead on arrival.

The Investigation: SSH, Bash History, and Déjà Vu

After checking all the usual suspects (network connectivity, webhook URL, authentication tokens), I had a nagging feeling I’d seen this before.

I SSH’d into our old Splunk instance—the one we decommissioned back in February. Started digging through bash history like an archaeologist searching for ancient wisdom:

cat ~/.bash_history | grep -i webhook
cat ~/.bash_history | grep -i better_webhook

And there it was. A trail of breadcrumbs from February. The same investigation. The same solution.

The realization hit hard: We had already fixed this. Months ago. But when we updated the Splunk instance and reinstalled the Better Webhook plugin, our changes were wiped out. No documentation. No comments in the code. Just… gone.

The Root Cause: Python Requests and SSL Verification

The Better Webhook plugin is a Python script that uses the requests library to send HTTP POST requests to your webhook endpoints. By default, requests validates SSL certificates—which is the right thing to do in production.

But in lab environments with self-signed certificates? It fails silently.

Here’s the relevant code section from better_webhook.py:

def send_webhook_request(
    url: str, body: bytes, headers: dict, auth: Union[tuple, None],
    user_agent: str, proxy: str
):
    """
    Send the webhook and attempt to log as much information as possible if it fails.
    """
    # ... setup code ...

    try:
        r = requests.post(
            url,
            data=body,
            headers=headers,
            auth=auth,
            proxies=proxies
            # ❌ Missing: verify=False for self-signed certs
        )

The problem? No verify=False parameter. The request fails SSL verification, throws an exception, and your pipeline trigger vanishes into the void.
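
You can reproduce the behaviour outside Splunk in a couple of lines (same placeholder URL as above); the second call will still be rejected by GitLab without a token, but the point is that the TLS handshake now completes.

import requests

url = "https://198.18.133.99/api/v4/projects/<id>/pipeline"  # placeholder

try:
    requests.post(url, json={"ref": "main"}, timeout=5)
except requests.exceptions.SSLError as err:
    # Self-signed certificate -> "certificate verify failed" during the TLS handshake
    print(f"SSL verification failed: {err}")

# The same call with verify=False gets past TLS (and emits an InsecureRequestWarning)
r = requests.post(url, json={"ref": "main"}, timeout=5, verify=False)
print(r.status_code)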

The Solution: One Parameter, Two Minutes, Zero Headaches

Navigate to your Splunk Better Webhook plugin directory:

cd $SPLUNK_HOME/etc/apps/BetterWebhooks/bin/

Find the send_webhook_request function (around line 65-75) and modify the requests.post() call:

def send_webhook_request(
    url: str, body: bytes, headers: dict, auth: Union[tuple, None],
    user_agent: str, proxy: str
):
    # ... existing code ...

    try:
        r = requests.post(
            url,
            data=body,
            headers=headers,
            auth=auth,
            proxies=proxies,
            verify=False  # ✅ Add this line
        )

Important: If you’re working with production systems using valid SSL certificates, DO NOT disable verification. This solution is specifically for lab/internal environments with self-signed certificates.
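
Two related notes, both assumptions about your environment rather than behaviour of the plugin itself: verify=False makes urllib3 print an InsecureRequestWarning on every call, which you can silence explicitly, and if you can export the GitLab certificate, pointing verify at that file keeps verification enabled (the file path below is hypothetical).

import urllib3
import requests

# Option A: keep verify=False but stop urllib3 warning on every webhook call
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Option B: trust the exported self-signed certificate instead of disabling verification
# (only works if the certificate's CN/SAN matches the host you connect to)
r = requests.post(
    "https://198.18.133.99/api/v4/projects/<id>/pipeline",
    json={"ref": "main"},
    verify="/opt/splunk/etc/auth/gitlab-selfsigned.pem",  # hypothetical path
)

Option B keeps verification on, at the cost of managing one extra file on the Splunk host.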

Restart Splunk

sudo $SPLUNK_HOME/bin/splunk restart

Test Your Webhook

Trigger a test alert and watch your GitLab pipeline spring to life.
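
If you want confirmation beyond the GitLab UI, a quick query against the project's pipelines API (project ID and token are placeholders) shows whether the trigger actually arrived.

import requests

# Placeholders: project ID and access token
url = "https://198.18.133.99/api/v4/projects/<id>/pipelines"
headers = {"PRIVATE-TOKEN": "<gitlab-access-token>"}

r = requests.get(url, headers=headers, verify=False)  # lab only: self-signed cert
for pipeline in r.json()[:5]:
    print(pipeline["id"], pipeline["status"], pipeline["created_at"])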

Lessons Learned: Document Everything

This cost me several hours in February. It cost me several hours again last week.

Why?

  1. Plugin updates/reinstalls overwrite custom modifications
  2. No inline code comments explaining the change
  3. No documentation in our runbooks
  4. No knowledge sharing with the team

What I’m doing differently now:

  1. This blog post – Searchable documentation for future me
  2. Inline comments – Clear explanation of why we disabled SSL verification
  3. Automation – A post-install step that re-applies (or at least flags) this change automatically, sketched below
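
As a starting point for that automation, a post-install check could be as small as the sketch below; the path assumes a default /opt/splunk install, and a real version might patch the file or fail a CI job instead of just printing.

import pathlib

# Assumes a default install location; adjust for your $SPLUNK_HOME
plugin = pathlib.Path("/opt/splunk/etc/apps/BetterWebhooks/bin/better_webhook.py")

if "verify=False" in plugin.read_text():
    print("OK: SSL verification is disabled for self-signed certs")
else:
    print("WARNING: plugin was updated/reinstalled - re-apply the verify=False fix")

Run it right after any Splunk app update and the "fixed in February, broken again in autumn" pattern can't silently repeat.
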
Troubleshooting Tips

If you’re experiencing similar issues, here’s your diagnostic checklist:

  1. Check Splunk logs:

    tail -f $SPLUNK_HOME/var/log/splunk/splunkd.log | grep -i webhook
    
  2. Test the webhook URL manually:

    curl -X POST https://your-gitlab-instance.com/api/v4/projects/1/trigger/pipeline \
      -F token=YOUR_TOKEN \
      -F ref=main
    
  3. Verify Python can reach your endpoint:

    import requests
    r = requests.post('https://your-gitlab-instance.com/...', verify=False)
    print(r.status_code)
    
  4. Check certificate issues:

    openssl s_client -connect your-gitlab-instance.com:443
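
If you would rather stay in Python, ssl.get_server_certificate fetches the certificate the server presents so you can inspect it, or save it and reuse it with the verify parameter mentioned earlier (host and filename are placeholders).

import ssl

host, port = "198.18.133.99", 443  # placeholder GitLab host

# Fetch the certificate the server presents, in PEM form
pem = ssl.get_server_certificate((host, port))
print(pem)

# Optionally save it for use with requests' verify= parameter
with open("gitlab-selfsigned.pem", "w") as f:
    f.write(pem)
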
Conclusion

SSL certificate validation is important. In production, never disable it without good reason and proper security review.

But in lab environments and training scenarios? Sometimes you just need things to work. And when a plugin update wipes out your fix, having it documented somewhere you’ll actually find it is priceless.

This tiny SSL verification issue broke an entire automation chain: telemetry → analytics → orchestration → remediation. Hours of work building the pipeline, debugging Ansible playbooks… all rendered useless by a missing parameter.

So here it is, future me. When you’re troubleshooting this in 2026, I hope you find this post before spending another 4 hours in bash history.


Have you experienced similar “I’ve fixed this before” moments? Share your stories in the comments. Let’s help each other avoid rediscovering the same solutions.