Monitoring Internals

Observability Stack

Three parallel signal pipelines across 2 Kubernetes clusters · 9 nodes · 222 pods: Metrics (Prometheus → Alertmanager → Grafana), Logs (Fluent Bit → Loki / Splunk HEC), and Detection (Suricata IDS/IPS · Wazuh SIEM · Cilium eBPF · KubeArmor). Every signal ends in an actionable alert or AWX job.

Signal Pipelines

Metrics · Logs · Detection — Internals

📈 METRICS
PROMETHEUS → ALERTMANAGER → GRAFANA
Node Exporter
HOST METRICS · DAEMONSET · ALL NODES
CPU · MEM · DISK · Net I/O per interface
Filesystem inode + mount usage
Scrape: :9100 per node (soc1–6 · sec1–3) · 9 nodes total
scrape :9100
kube-state-metrics
K8S OBJECT STATE EXPORTER
Pod restarts · Deploy replicas · PVC state
HPA current/desired · StatefulSet health
Scrape: :8080/metrics
pull every 30s
Prometheus
TSDB · SCRAPE ENGINE · RULES
ServiceMonitor auto-discover CRD
Recording Rules pre-aggregate heavy queries
Remote Write → Thanos long-term
Retention: 15d local
firing rules
Alertmanager
ROUTE · DEDUP · GROUP · SILENCE
Group by cluster/namespace/severity
Inhibit node down → all pod alerts
Silence maintenance windows
Routes → Discord · AWX job on critical
datasource
Grafana
DASHBOARDS · SOC VIEWS · OIDC SSO
SOC · Infra · Network flow dashboards
Loki log explore integrated
grafana.onelabs.work
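As a minimal sketch of the scrape inventory described in the Node Exporter card above, the snippet below enumerates the nine expected `:9100` targets and reports any that are missing from a discovered list. The node names follow the "soc1–6 · sec1–3" pattern from the card. The optional `domain` suffix and the idea of comparing against a discovered list are assumptions for illustration, not part of the documented setup.

```python
from typing import List

NODE_EXPORTER_PORT = 9100  # port stated in the Node Exporter card


def expected_targets(domain: str = "") -> List[str]:
    """Build host:port targets for soc1-6 and sec1-3 (9 nodes total)."""
    hosts = [f"soc{i}" for i in range(1, 7)] + [f"sec{i}" for i in range(1, 4)]
    # `domain` is a hypothetical DNS suffix; empty means bare hostnames.
    suffix = f".{domain}" if domain else ""
    return [f"{host}{suffix}:{NODE_EXPORTER_PORT}" for host in hosts]


def missing_targets(discovered: List[str], domain: str = "") -> List[str]:
    """Return expected targets that are absent from a discovered target list."""
    seen = set(discovered)
    return [target for target in expected_targets(domain) if target not in seen]


if __name__ == "__main__":
    targets = expected_targets()
    print(len(targets))
    print(missing_targets(targets[:-1]))
```

A check like this could run against the target list that Prometheus reports, so that a node that silently drops out of discovery is noticed even if no scrape ever fails.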
🪵 LOGS
FLUENT BIT → LOKI / SPLUNK HEC
Log Sources
RAW INPUT STREAMS
containerd pod stdout/stderr
Suricata eve.json alerts
Wazuh Agent syslog + auditd
K8s API server audit logs
ingress-nginx access/error
Talos kernel + machined
tail + parse
Fluent Bit
COLLECTOR · PARSER · ROUTER · DAEMONSET
INPUT tail · systemd · tcp · syslog
FILTER K8s metadata inject
FILTER grep — drop health-check noise
PARSER JSON · regex · CRI multiline
OUTPUT 1 → Loki (label stream)
OUTPUT 2 → Splunk HEC :8088
Buffer: memory + filesystem
Instances 6 SOC · 3 SEC = 9 DaemonSet pods
push to backends
Loki
LOG AGGREGATION · LABEL INDEX
Labels namespace · pod · container · node
LogQL filter + metric queries
Chunks → MinIO S3 backend
Ruler alert on log patterns
Compactor retention policy
Retention: 30d · backend: minio.onelabs.work
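To illustrate how the Fluent Bit grep filter and the Loki label stream fit together, here is a rough sketch that drops health-check lines and groups the rest into streams keyed by the four labels listed in the Loki card. The payload shape follows Loki's JSON push format (`streams` → `stream` + `values`). The noise markers and the record field names (`log`, `namespace`, and so on) are hypothetical; the real filter patterns are not given in this document.

```python
import json
import time
from typing import Dict, List, Optional, Tuple

LABELS = ("namespace", "pod", "container", "node")  # labels from the Loki card

# Hypothetical noise markers; the actual grep patterns are not documented here.
HEALTH_CHECK_MARKERS = ("/healthz", "/readyz", "kube-probe")


def is_health_check(line: str) -> bool:
    """True if the log line looks like health-check noise."""
    return any(marker in line for marker in HEALTH_CHECK_MARKERS)


def build_loki_push(records: List[Dict[str, str]],
                    now_ns: Optional[int] = None) -> dict:
    """Drop noise, then group records into one Loki stream per label set."""
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    streams: Dict[Tuple[str, ...], List[List[str]]] = {}
    for record in records:
        if is_health_check(record["log"]):
            continue
        key = tuple(record[name] for name in LABELS)
        streams.setdefault(key, []).append([timestamp, record["log"]])
    return {
        "streams": [
            {"stream": dict(zip(LABELS, key)), "values": values}
            for key, values in streams.items()
        ]
    }


if __name__ == "__main__":
    sample = [
        {"namespace": "soc", "pod": "api-0", "container": "api",
         "node": "soc1", "log": "GET /healthz 200"},
        {"namespace": "soc", "pod": "api-0", "container": "api",
         "node": "soc1", "log": "login failed for user admin"},
    ]
    print(json.dumps(build_loki_push(sample), indent=2))
```

Keeping the label set to these four low-cardinality fields is what keeps the number of streams, and therefore the index, small; anything more specific is left to LogQL filters at query time.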
HEC index k8s_onelabs_sec
Splunk
SIEM · CORRELATION · SEARCH
Index k8s_onelabs_sec
SPL alerts brute force · priv-esc
Correlation multi-source event join
Webhook → TheHive case create
HEC: sim.saza.com.au:8088
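The second Fluent Bit output can be pictured as a plain HTTP Event Collector call. The sketch below only builds the request and does not send it. The host, port, and index come from the cards above; the `https` scheme, the `/services/collector/event` path (Splunk's standard HEC event endpoint), the `_json` sourcetype, and the token value are assumptions.

```python
import json
import urllib.request

# Host and port from the Splunk card; scheme and path are assumed.
HEC_URL = "https://sim.saza.com.au:8088/services/collector/event"
HEC_INDEX = "k8s_onelabs_sec"  # index named in the Splunk card


def build_hec_request(event: dict, token: str, host: str,
                      sourcetype: str = "_json") -> urllib.request.Request:
    """Wrap one event in the HEC envelope and return an unsent POST request."""
    body = {
        "index": HEC_INDEX,
        "host": host,
        "sourcetype": sourcetype,
        "event": event,
    }
    return urllib.request.Request(
        HEC_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # "REPLACE_ME" is a placeholder, not a real token.
    request = build_hec_request({"msg": "ssh login failed"}, "REPLACE_ME", "soc1")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Sending would be a matter of passing the request to `urllib.request.urlopen`; that step is left out so the example has no network side effects.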
🔍 DETECTION
SURICATA · WAZUH · CILIUM · KUBEARMOR
Suricata IDS/IPS
DEEP PACKET INSPECT · DAEMONSET
AF_PACKET zero-copy packet capture
Ruleset ET Open + custom SAZA rules
eve.json alert · flow · dns · tls · http
IPS mode NFQueue inline drop
Protocols HTTP · DNS · TLS · SMB · SSH
hostNetwork: true · all 6 nodes
eve.json → Fluent Bit
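Since eve.json carries several event types on the same stream (alert, flow, dns, tls, http), a consumer usually has to pick out the alerts first. A small sketch of that step follows. The field names (`event_type`, `alert.signature`, `alert.severity`, `src_ip`, `dest_ip`) are standard EVE fields; the severity cut-off and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_alerts(lines: Iterable[str], max_severity: int = 2) -> Iterator[dict]:
    """Yield compact alert summaries from EVE JSON lines.

    In Suricata, a lower severity number means a more severe alert,
    so max_severity=2 keeps severities 1 and 2.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if event.get("event_type") != "alert":
            continue
        alert = event.get("alert", {})
        severity = alert.get("severity")
        if severity is None or severity > max_severity:
            continue
        yield {
            "timestamp": event.get("timestamp"),
            "src_ip": event.get("src_ip"),
            "dest_ip": event.get("dest_ip"),
            "signature": alert.get("signature"),
            "severity": severity,
        }


if __name__ == "__main__":
    # Illustrative lines only; addresses are from documentation ranges.
    sample = [
        '{"timestamp": "2025-01-01T00:00:00.000000+0000", "event_type": "alert", '
        '"src_ip": "192.0.2.10", "dest_ip": "198.51.100.5", '
        '"alert": {"signature_id": 1000001, "signature": "EXAMPLE custom rule", "severity": 1}}',
        '{"timestamp": "2025-01-01T00:00:01.000000+0000", "event_type": "dns", '
        '"src_ip": "192.0.2.10"}',
    ]
    for summary in iter_alerts(sample):
        print(summary)
```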
Cilium 1.19.2 / Hubble
L3–L7 NETWORK OBSERVE · EBPF
eBPF kernel-level flow capture
L7 decode HTTP · gRPC · Kafka headers
NetworkPolicy default-deny + allow
mTLS transparent pod encryption
DNS policy enforcement by FQDN
hub.onelabs.work
compliance + FIM
Wazuh
SIEM + COMPLIANCE + FIM + AGENTS
Agents 9/9 DaemonSet (6 SOC · 3 SEC) + every VM
FIM /etc · /bin · /usr/bin integrity
Auditd syscall + privilege change
CIS Benchmark Level 1/2 auto-score
MITRE alert TTP tagging
wazuh.onelabs.work
policy enforce
KubeArmor + Kyverno
RUNTIME POLICY + ADMISSION
LSM AppArmor/BPF hooks per workload
Syscall per-workload allowlist
File block /proc · /sys writes
Kyverno validate · mutate · generate
PSA Pod Security Admission
Replaced Falco (incompatible with Talos kernel 6.18.x)
Alert Routing

Alertmanager Routes — 4 Levels

Alerts are deduplicated, grouped by cluster, namespace, and severity, and then routed by severity level. Critical alerts page Discord and also trigger automatic AWX playbooks for self-healing.
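The dedup-and-group step can be sketched in a few lines. This is a simplified model, not Alertmanager's implementation: it treats two alerts with identical label sets as the same alert and buckets alerts by the three grouping labels named above. The alert dictionaries and label values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GROUP_BY = ("cluster", "namespace", "severity")  # grouping keys from the text


def group_alerts(alerts: List[dict]) -> Dict[Tuple[str, ...], List[dict]]:
    """Deduplicate alerts by label set, then bucket them by GROUP_BY labels."""
    groups: Dict[Tuple[str, ...], Dict[frozenset, dict]] = defaultdict(dict)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        # Identical label sets collapse into a single entry (deduplication).
        fingerprint = frozenset(labels.items())
        groups[key][fingerprint] = alert
    return {key: list(members.values()) for key, members in groups.items()}


if __name__ == "__main__":
    crash = {"labels": {"alertname": "CrashLoop", "cluster": "soc",
                        "namespace": "wazuh", "severity": "critical",
                        "pod": "wazuh-0"}}
    cpu = {"labels": {"alertname": "HighCPU", "cluster": "soc",
                      "namespace": "wazuh", "severity": "warning"}}
    grouped = group_alerts([crash, dict(crash), cpu])
    for key, members in grouped.items():
        print(key, len(members))
```

Each resulting group corresponds to one notification, which is why a burst of related alerts in one namespace arrives as a single message rather than one per pod.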

🔴 CRITICAL
Discord #soc-alerts + AWX Job
Group wait: 30s · Group interval: 5m
Repeat: 1h · AWX trigger on fire
Examples: Node down · CrashLoop · Cert expire
🟡 WARNING
Discord #soc-warnings
Group wait: 2m · Group interval: 10m
Repeat: 4h · Stage-color: yellow embed
Examples: High CPU · PVC 80% · Cert <30d
🔵 INFO
Spool JSON audit log
Audit-standard JSON · AWX Job ID
SAZA branding · delegate_to: localhost
ansible.builtin.copy to spool file
⚫ INHIBITED
Suppressed by inhibit rules
Node down → inhibit all pod alerts
Cluster unreachable → inhibit node alerts
Maintenance window → silence all
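The routing levels above can be modelled as a small lookup plus an inhibition check. The timer values and receivers for CRITICAL and WARNING are taken from the cards. The alert names (`NodeDown`, `ClusterUnreachable`), the `scope` label, and the choice of matching labels are hypothetical stand-ins, since the actual rule definitions are not shown here. INFO is left out of the table because the cards give no timers for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    receiver: str
    group_wait_s: int
    group_interval_s: int
    repeat_interval_s: int
    triggers_awx: bool = False


# Timers and receivers from the CRITICAL and WARNING cards.
ROUTES = {
    "critical": Route("discord:#soc-alerts", 30, 5 * 60, 60 * 60, True),
    "warning": Route("discord:#soc-warnings", 2 * 60, 10 * 60, 4 * 60 * 60),
}

# (source alertname, inhibited scope, labels that must be equal).
# All names in this table are assumptions for illustration.
INHIBIT_RULES = [
    ("NodeDown", "pod", ("cluster", "node")),
    ("ClusterUnreachable", "node", ("cluster",)),
]


def is_inhibited(target: dict, firing: List[dict]) -> bool:
    """True if a firing source alert suppresses the target alert."""
    for source_name, scope, equal in INHIBIT_RULES:
        if target.get("scope") != scope:
            continue
        for source in firing:
            if source.get("alertname") != source_name:
                continue
            if all(source.get(label) == target.get(label) for label in equal):
                return True
    return False


def route(alert: dict, firing: List[dict]) -> Optional[Route]:
    """Return the route for an alert, or None if inhibited or unrouted."""
    if is_inhibited(alert, firing):
        return None
    return ROUTES.get(alert.get("severity", ""))


if __name__ == "__main__":
    node_down = {"alertname": "NodeDown", "scope": "node",
                 "cluster": "soc", "node": "soc3", "severity": "critical"}
    crash = {"alertname": "CrashLoop", "scope": "pod",
             "cluster": "soc", "node": "soc3", "severity": "critical"}
    print(route(crash, [node_down]))  # inhibited by NodeDown on the same node
    print(route(node_down, [node_down]))
```

Requiring the `cluster` and `node` labels to match keeps a single failed node from muting pod alerts elsewhere, which is the main risk with an overly broad inhibit rule.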
Grafana

Dashboard Registry

🛡️ SOC Overview
Alert heatmap by severity
Incident timeline · Active cases
MITRE ATT&CK coverage map
Wazuh compliance score
📈 Cluster Health
Node CPU/MEM/DISK per VM
Pod restarts · OOMKill rate
API server latency p99
etcd leader election health
🌐 Network Flows
Hubble L7 top talkers (hub.onelabs.work)
DNS query rate per namespace
Dropped flows heatmap
mTLS handshake rate
💾 Storage
Longhorn volume health (stog.onelabs.work)
PVC usage per namespace
MinIO bucket size trend
Snapshot success rate
🚀 DevSecOps
GitLab CI pipeline duration
Argo CD sync health · drift
AWX job success/fail rate
Trivy CVE count trend
🔑 Identity & Secrets
Vault token lease count
Failed Authentik auth attempts
Certificate expiry countdown
Vault HA raft leader status
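The certificate expiry countdown ties back to the alert levels: "Cert <30d" is listed as a WARNING example and "Cert expire" as a CRITICAL one. A rough sketch of that mapping is below. Treating CRITICAL as "already expired" is an assumption, since the cards do not state the exact critical threshold.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_DAYS = 30  # from the WARNING card ("Cert <30d")


def days_remaining(not_after: datetime, now: datetime) -> float:
    """Days until the certificate's notAfter time (negative if expired)."""
    return (not_after - now).total_seconds() / 86400.0


def cert_severity(not_after: datetime, now: datetime) -> Optional[str]:
    """Map remaining lifetime to an alert level, or None if healthy."""
    remaining = days_remaining(not_after, now)
    if remaining <= 0:
        return "critical"  # assumed meaning of "Cert expire"
    if remaining < WARNING_DAYS:
        return "warning"
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for offset in (-1, 10, 90):
        print(offset, cert_severity(now + timedelta(days=offset), now))
```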