Monitoring

Observability for openDesk is split into two independently toggled concerns, both deployed by the opendesk-monitoring Go microservice on port 8093:

  • Central (monitoring.central.enabled, default ON): forward this instance's metrics to the central platform (cmon-prod / VictoriaMetrics) via a vmagent + a handful of standalone exporters.
  • Local (monitoring.local.enabled, default OFF): deploy a FULL in-cluster kube-prometheus-stack (Prometheus + a customer-facing Grafana at grafana.<domain> + Alertmanager + node-exporter) plus Blackbox Exporter, dashboards, ServiceMonitors, SLO + infra alerts.

monitoring.enabled is now only a hard kill-switch: an explicit false skips everything, and in every other case the two sub-toggles decide what deploys. Coverage extends from base K8s utilization up to Tier-0 SLOs with multi-window multi-burn-rate alerts (Google SRE Workbook ch. 5).
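The toggle precedence described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the Toggles type, the boolOr helper and the Resolve method are hypothetical names, and the use of pointers to tell "unset" apart from an explicit false is an assumption about how the config is modelled.

```go
package main

import "fmt"

// Toggles is a hypothetical view of the three settings; nil means "not set".
type Toggles struct {
	Enabled        *bool // monitoring.enabled (hard kill-switch)
	CentralEnabled *bool // monitoring.central.enabled (default ON)
	LocalEnabled   *bool // monitoring.local.enabled (default OFF)
}

// boolOr returns *v, or def when v is nil.
func boolOr(v *bool, def bool) bool {
	if v == nil {
		return def
	}
	return *v
}

// Resolve reports which of the two concerns would deploy for an instance.
func (t Toggles) Resolve() (central, local bool) {
	// Only an explicit false on the kill-switch skips everything.
	if t.Enabled != nil && !*t.Enabled {
		return false, false
	}
	return boolOr(t.CentralEnabled, true), boolOr(t.LocalEnabled, false)
}

func main() {
	off, on := false, true
	cases := []struct {
		name string
		t    Toggles
	}{
		{"defaults", Toggles{}},
		{"local opted in", Toggles{LocalEnabled: &on}},
		{"kill-switch", Toggles{Enabled: &off, LocalEnabled: &on}},
	}
	for _, c := range cases {
		central, local := c.t.Resolve()
		fmt.Printf("%-15s central=%t local=%t\n", c.name, central, local)
	}
}
```

Leaving monitoring.enabled unset (or true) has no effect of its own in this sketch; only the sub-toggles and their defaults matter.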

State of this topic (2026-06-23 — central/local split)

Active. The runner carries the full step set and gates each step per-instance on the central/local toggles (monitoringStepShouldRun). The artefacts below are produced only by the Local stack (monitoring.local.enabled); the Central path deploys just the shared exporters + a vmagent that remote_writes everything to the central platform, which hosts storage, dashboards and SLO/alert rules. When Local is on, the runner produces:

Layer | Artefact
Base stack | kube-prometheus-stack Helm release in cfg.Monitoring.Namespace; Prometheus TSDB tier-sized (storage 20Gi→200Gi + retentionSize ≈85%, memory per tier) and every component lifted out of BestEffort with Burstable floors (kill-switch OPENDESK_INCLUSTER_TIER_SIZING)
Synthetic | Blackbox Exporter + Probe CRs for Keycloak OIDC, Nextcloud WebDAV, portal, Matrix federation
Dashboards | 10 ConfigMap-discovered Grafana dashboards (5 base infra + 5 SRE-grade); 7 thin per-app + duplicate dashboards retired and actively pruned
ServiceMonitors | ingress-nginx, cert-manager, ArgoCD ApplicationSet controller, dovecot, haproxy-ingress (ESO ships its own)
Exporters | mysqld_exporter (OX MariaDB; info_schema.tables collector off at ≥10k users), redis_exporter (Redis proxy); resource floors on all; postgres_exporter is a follow-up
Alerting | opendesk-alerts (availability, incl. the OpenDeskProbesAbsent watchdog, which fires when blackbox probes go absent, the case OpenDeskEndpointDown can't see) + opendesk-infra-alerts (Galera PVC / Redis proxy) + 6 SLOs with fast (1h/5m, 14.4×) and slow (6h/30m, 6×) burn-rate alerts over HAProxy ingress metrics
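The arithmetic behind the fast and slow burn-rate alerts in the Alerting row can be sketched as below. The burn-rate factors and windows come from the table; the 99.9% SLO target and the 30-day budget period are assumptions chosen for illustration (the actual targets are documented in concept-slo-burn-rate-alerts), and the BurnAlert type is hypothetical.

```go
package main

import "fmt"

// BurnAlert is a hypothetical description of one multi-window burn-rate alert.
type BurnAlert struct {
	Name         string
	LongHours    float64 // long evaluation window, in hours
	ShortMinutes float64 // short confirmation window, in minutes
	Factor       float64 // burn-rate multiplier
}

// ErrorRatioThreshold is the error ratio that must be exceeded in BOTH
// windows for the alert to fire: factor × (1 − SLO target).
func (a BurnAlert) ErrorRatioThreshold(sloTarget float64) float64 {
	return a.Factor * (1 - sloTarget)
}

// BudgetConsumed is the share of the period's error budget spent if the
// burn rate is sustained for the whole long window.
func (a BurnAlert) BudgetConsumed(periodHours float64) float64 {
	return a.Factor * a.LongHours / periodHours
}

func main() {
	const sloTarget = 0.999       // ASSUMPTION: illustrative 99.9% target
	const periodHours = 30 * 24.0 // ASSUMPTION: 30-day budget period

	alerts := []BurnAlert{
		{Name: "fast", LongHours: 1, ShortMinutes: 5, Factor: 14.4},
		{Name: "slow", LongHours: 6, ShortMinutes: 30, Factor: 6},
	}
	for _, a := range alerts {
		fmt.Printf("%s: error ratio > %.4f over %gh and %gm, %.1f%% of budget per long window\n",
			a.Name, a.ErrorRatioThreshold(sloTarget), a.LongHours, a.ShortMinutes,
			100*a.BudgetConsumed(periodHours))
	}
}
```

Under a 30-day period, a 14.4× burn sustained for 1h spends 2% of the error budget and a 6× burn sustained for 6h spends 5%, which is the pairing the SRE Workbook recommends for paging alerts.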

The service also exposes 4 module modes for scoped re-deploys, re-grouped around the split: module:exporters (shared collectors), module:local-stack (full local Prometheus + Grafana + alerts), module:agent (central vmagent forwarder) and module:safety-net (local alert safety net). It also provides a destroy plan enumerator (2026-06-12): destroy removes ONLY the kube-prometheus-stack Helm release; the namespace, TSDB PVC, exporters, PrometheusRules, ServiceMonitors and dashboard ConfigMaps are preserved.
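The destroy semantics above can be pictured as a plan that lists every monitoring resource and marks whether destroy touches it. This is only a sketch: PlanItem and DestroyPlan are hypothetical names, and the real enumerator's types and output format are not described here.

```go
package main

import "fmt"

// PlanItem is a hypothetical entry in a destroy plan.
type PlanItem struct {
	Resource string
	Removed  bool // true if destroy deletes it, false if it is preserved
}

// DestroyPlan enumerates the resources named in the text above.
// Only the Helm release is removed; everything else is kept.
func DestroyPlan() []PlanItem {
	return []PlanItem{
		{Resource: "kube-prometheus-stack Helm release", Removed: true},
		{Resource: "monitoring namespace", Removed: false},
		{Resource: "Prometheus TSDB PVC", Removed: false},
		{Resource: "exporters", Removed: false},
		{Resource: "PrometheusRules", Removed: false},
		{Resource: "ServiceMonitors", Removed: false},
		{Resource: "dashboard ConfigMaps", Removed: false},
	}
}

func main() {
	for _, item := range DestroyPlan() {
		action := "preserve"
		if item.Removed {
			action = "remove"
		}
		fmt.Printf("%-8s %s\n", action, item.Resource)
	}
}
```

Preserving the TSDB PVC and the rule and dashboard objects means a later re-deploy of the local stack can pick up existing metric history and configuration.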

What's still a follow-up (not policy-blocked): postgres_exporter wiring with per-app DSNs, and metrics.enabled on the deployer's own ingress-nginx / cert-manager Helm releases.

Out of scope by policy: patching the upstream openDesk app Helm charts (Keycloak, Synapse, Jitsi, Nextcloud) to enable internal /metrics. See concept-servicemonitors#Boundary: openDesk app Helm charts are off-limits.

Pages

Components

  • component-monitoring-service — opendesk-monitoring Go service on port 8093: the per-step-gated central/local runner, 4 module modes, destroy plan

Concepts

  • concept-sre-dashboards — the 10-dashboard catalogue (files numbered to 17; 7 retired) with audience, panels, and which DashboardsConfig toggle controls each
  • concept-slo-burn-rate-alerts — the six SLOs over haproxy_backend_http_responses_total, recording-rule shapes, source-presence gating, and the multi-window burn-rate methodology
  • concept-prometheus-rules — the three PrometheusRule CRs (opendesk-alerts, opendesk-slo-rules, opendesk-infra-alerts), what fires when, and how to tune thresholds
  • concept-servicemonitors — which platform components get explicit ServiceMonitor CRs and why the upstream chart's defaults aren't enough
  • deployment — most operational debugging during install still goes through the deployer SSE stream
  • infrastructure — STACKIT-side observability (SKE control-plane, managed Postgres) is provided by STACKIT
  • apps — per-app readiness probes
  • idp — Keycloak and Nubus metrics enablement (currently a follow-up)

When to add a page here

  • A new Grafana dashboard is curated → concept-dashboard-<name> and update concept-sre-dashboards
  • A new SLO is adopted → update concept-slo-burn-rate-alerts
  • A new PrometheusRule is shipped → update concept-prometheus-rules
  • An exporter or ServiceMonitor is added → update concept-servicemonitors
  • A monitoring-related incident → incident-*
  • A logs / tracing pipeline (Loki, OTel) → component-* or concept-*