Monitoring
Observability for openDesk is split into two independently-toggled concerns, both deployed by the
opendesk-monitoring Go microservice on port 8093:
- Central (monitoring.central.enabled, default ON): forward this instance's metrics to the central platform (cmon-prod / VictoriaMetrics) via a vmagent + a handful of standalone exporters.
- Local (monitoring.local.enabled, default OFF): deploy a FULL in-cluster kube-prometheus-stack (Prometheus + a customer-facing Grafana at grafana.<domain> + Alertmanager + node-exporter) plus Blackbox Exporter, dashboards, ServiceMonitors, SLO + infra alerts.
monitoring.enabled is now only a hard kill-switch (explicit false => skip ALL); the two sub-toggles drive what deploys. Coverage extends from base K8s utilization up to Tier-0 SLOs with multi-window multi-burn-rate alerts (Google SRE Workbook ch. 5).
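The toggle semantics above can be sketched in Go as follows. This is a minimal illustration, not the service's actual code: the `MonitoringConfig` struct, its field names, and the `Resolve` function are hypothetical, and only the defaults (Central ON, Local OFF) and the explicit-false kill-switch come from the description above.

```go
package main

import "fmt"

// MonitoringConfig mirrors the three toggles. The struct and field names are
// hypothetical; a nil pointer means "not set in the instance config".
type MonitoringConfig struct {
	Enabled *bool // monitoring.enabled (hard kill-switch)
	Central *bool // monitoring.central.enabled (default ON)
	Local   *bool // monitoring.local.enabled (default OFF)
}

// boolOr returns *p, or def when p is nil.
func boolOr(p *bool, def bool) bool {
	if p == nil {
		return def
	}
	return *p
}

// Resolve reports which of the two concerns should deploy.
func Resolve(c MonitoringConfig) (central, local bool) {
	// Only an explicit false on the master toggle skips everything.
	if c.Enabled != nil && !*c.Enabled {
		return false, false
	}
	return boolOr(c.Central, true), boolOr(c.Local, false)
}

func main() {
	on, off := true, false

	central, local := Resolve(MonitoringConfig{})
	fmt.Println(central, local) // true false

	central, local = Resolve(MonitoringConfig{Local: &on})
	fmt.Println(central, local) // true true

	central, local = Resolve(MonitoringConfig{Enabled: &off, Local: &on})
	fmt.Println(central, local) // false false
}
```

Using pointers keeps "unset" distinguishable from "explicitly false", which is what lets the kill-switch trigger only on an explicit false.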
State of this topic (2026-06-23 — central/local split)
Active. The runner carries the full step set and gates each step per-instance on the central/local toggles (monitoringStepShouldRun). The artefacts below are produced only by the Local stack (monitoring.local.enabled); the Central path deploys just the shared exporters + a vmagent that remote_writes everything to the central platform, where storage, dashboards, SLO/alert rules live centrally. When Local is on, the runner produces:
| Layer | Artefact |
|---|---|
| Base stack | kube-prometheus-stack Helm release in cfg.Monitoring.Namespace — Prometheus TSDB tier-sized (storage 20Gi→200Gi + retentionSize ≈85%, memory per tier) and every component lifted out of BestEffort with Burstable floors (kill-switch OPENDESK_INCLUSTER_TIER_SIZING) |
| Synthetic | Blackbox Exporter + Probe CRs for Keycloak OIDC, Nextcloud WebDAV, portal, Matrix federation |
| Dashboards | 10 ConfigMap-discovered Grafana dashboards (5 base infra + 5 SRE-grade); 7 thin per-app + duplicate dashboards retired and actively pruned |
| ServiceMonitors | ingress-nginx, cert-manager, ArgoCD ApplicationSet controller, dovecot, haproxy-ingress (ESO ships its own) |
| Exporters | mysqld_exporter (OX MariaDB; info_schema.tables collector off at ≥10k users), redis_exporter (Redis proxy); resource floors on all; postgres_exporter is a follow-up |
| Alerting | opendesk-alerts (availability incl. the OpenDeskProbesAbsent watchdog — fires when blackbox probes go absent, the case OpenDeskEndpointDown can't see) + opendesk-infra-alerts (Galera PVC / Redis proxy) + 6 SLOs with fast (1h/5m, 14.4×) and slow (6h/30m, 6×) burn-rate alerts over HAProxy ingress metrics |
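To make the burn-rate factors in the Alerting row concrete, the sketch below works through the arithmetic behind the fast (1h/5m, 14.4×) and slow (6h/30m, 6×) pairs. It assumes a 30-day SLO period, as in the SRE Workbook examples; the period is not stated on this page. The `BurnWindow` type, the `Fires` helper, and the 99.9% target in `main` are illustrative assumptions, not the shipped rules.

```go
package main

import (
	"fmt"
	"time"
)

// sloPeriod is an assumption: a 30-day rolling SLO period.
const sloPeriod = 30 * 24 * time.Hour

// BurnWindow is one long/short window pair with its burn-rate factor.
type BurnWindow struct {
	Name   string
	Long   time.Duration
	Short  time.Duration
	Factor float64
}

var (
	fastBurn = BurnWindow{"fast", time.Hour, 5 * time.Minute, 14.4}
	slowBurn = BurnWindow{"slow", 6 * time.Hour, 30 * time.Minute, 6}
)

// BudgetConsumed returns the fraction of the period's error budget spent if
// errors burn at Factor for the whole long window.
func (w BurnWindow) BudgetConsumed() float64 {
	return w.Factor * w.Long.Hours() / sloPeriod.Hours()
}

// Fires reports whether both windows exceed Factor x (1 - slo). Requiring the
// short window too lets the alert reset soon after the errors stop.
func (w BurnWindow) Fires(slo, longErrRatio, shortErrRatio float64) bool {
	threshold := w.Factor * (1 - slo)
	return longErrRatio > threshold && shortErrRatio > threshold
}

func main() {
	for _, w := range []BurnWindow{fastBurn, slowBurn} {
		fmt.Printf("%s: %.0f%% of budget in %v\n", w.Name, 100*w.BudgetConsumed(), w.Long)
	}
	// Hypothetical 99.9% SLO: the fast threshold is 14.4 x 0.1% = 1.44% errors.
	fmt.Println(fastBurn.Fires(0.999, 0.02, 0.02))  // true: both windows above 1.44%
	fmt.Println(fastBurn.Fires(0.999, 0.02, 0.001)) // false: short window has recovered
}
```

Under the 30-day assumption, the fast pair corresponds to spending 2% of the budget in one hour and the slow pair to 5% in six hours.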
The service also exposes 4 module modes re-grouped around the split — module:exporters (shared collectors), module:local-stack (full local Prometheus + Grafana + alerts), module:agent (central vmagent forwarder), module:safety-net (local alert safety net) — for scoped re-deploys, and a destroy plan enumerator (2026-06-12): destroy removes ONLY the kube-prometheus-stack Helm release; the namespace, TSDB PVC, exporters, PrometheusRules, ServiceMonitors and dashboard ConfigMaps are preserved.
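The destroy behaviour described above can be pictured as a plan with two lists. The following Go sketch is illustrative only: the `DestroyPlan` type, the `EnumerateDestroyPlan` function, and the `monitoring` namespace value are hypothetical names, and the actual enumerator's output format is not documented on this page. Only the split between what is removed and what is preserved is taken from the text.

```go
package main

import "fmt"

// DestroyPlan lists what a destroy would remove and what it leaves in place.
// The type is a hypothetical stand-in for the service's plan enumerator.
type DestroyPlan struct {
	Remove   []string
	Preserve []string
}

// EnumerateDestroyPlan builds the plan for one monitoring namespace.
func EnumerateDestroyPlan(namespace string) DestroyPlan {
	return DestroyPlan{
		// Destroy removes ONLY the Helm release.
		Remove: []string{
			"Helm release kube-prometheus-stack (namespace " + namespace + ")",
		},
		// Everything else survives, including the metrics history on the PVC.
		Preserve: []string{
			"namespace " + namespace,
			"Prometheus TSDB PVC",
			"exporters",
			"PrometheusRules",
			"ServiceMonitors",
			"dashboard ConfigMaps",
		},
	}
}

func main() {
	// "monitoring" is an assumed namespace name for illustration.
	plan := EnumerateDestroyPlan("monitoring")
	for _, r := range plan.Remove {
		fmt.Println("remove:  ", r)
	}
	for _, p := range plan.Preserve {
		fmt.Println("preserve:", p)
	}
}
```

Listing the preserved resources explicitly makes it visible to an operator that a destroy followed by a re-deploy keeps the existing TSDB data and rules.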
What's still a follow-up (not policy-blocked): postgres_exporter wiring with per-app DSNs, and metrics.enabled on the deployer's own ingress-nginx / cert-manager Helm releases.
Out of scope by policy: patching the upstream openDesk app Helm charts (Keycloak, Synapse, Jitsi, Nextcloud) to enable internal /metrics. See concept-servicemonitors#Boundary: openDesk app Helm charts are off-limits.
Pages
Components
- component-monitoring-service: the opendesk-monitoring Go service on port 8093, covering the per-step-gated central/local runner, the 4 module modes, and the destroy plan
Concepts
- concept-sre-dashboards: the 10-dashboard catalogue (files numbered to 17; 7 retired) with audience, panels, and which DashboardsConfig toggle controls each
- concept-slo-burn-rate-alerts: the six SLOs over haproxy_backend_http_responses_total, recording-rule shapes, source-presence gating, and the multi-window burn-rate methodology
- concept-prometheus-rules: the three PrometheusRule CRs (opendesk-alerts, opendesk-slo-rules, opendesk-infra-alerts), what fires when, and how to tune thresholds
- concept-servicemonitors: which platform components get explicit ServiceMonitor CRs and why the upstream chart's defaults aren't enough
Related topics
- deployment — most operational debugging during install still goes through the deployer SSE stream
- infrastructure — STACKIT-side observability (SKE control-plane, managed Postgres) is provided by STACKIT
- apps — per-app readiness probes
- idp — Keycloak and Nubus metrics enablement (currently a follow-up)
When to add a page here
- A new Grafana dashboard is curated → concept-dashboard-<name> and update concept-sre-dashboards
- A new SLO is adopted → update concept-slo-burn-rate-alerts
- A new PrometheusRule is shipped → update concept-prometheus-rules
- An exporter or ServiceMonitor is added → update concept-servicemonitors
- A monitoring-related incident → incident-*
- A logs / tracing pipeline (Loki, OTel) → component-* or concept-*