Monitoring

Observability for openDesk is split into two independently toggled concerns, both deployed by the opendesk-monitoring Go microservice (port 8093):

Central (monitoring.central.enabled, default ON): forwards this instance's metrics to the central platform (cmon-prod / VictoriaMetrics) through a vmagent and a handful of standalone exporters.

Local (monitoring.local.enabled, default OFF): deploys a full in-cluster kube-prometheus-stack (Prometheus, a customer-facing Grafana at grafana.<domain>, Alertmanager and node-exporter) plus Blackbox Exporter, dashboards, ServiceMonitors, and SLO and infra alerts.

monitoring.enabled now acts only as a hard kill-switch: an explicit false skips every monitoring step, and otherwise the two sub-toggles decide what deploys. Coverage extends from base Kubernetes utilization up to Tier-0 SLOs with multi-window, multi-burn-rate alerts (Google SRE Workbook, ch. 5).
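
The toggle semantics above can be sketched in Go as follows. This is a minimal illustration, not the service's actual code: the MonitoringToggles type, its fields and the Resolve method are hypothetical names, and the use of pointer fields to tell "unset" apart from an explicit false is an assumption about how the kill-switch might be modelled.

```go
package main

import "fmt"

// MonitoringToggles mirrors the three settings described above.
// Pointer fields distinguish "unset" from an explicit false.
// Type and field names are hypothetical, not the service's real config.
type MonitoringToggles struct {
	Enabled *bool // monitoring.enabled: kill-switch only
	Central *bool // monitoring.central.enabled, default ON
	Local   *bool // monitoring.local.enabled, default OFF
}

// orDefault returns the pointed-to value, or def when the setting is unset.
func orDefault(v *bool, def bool) bool {
	if v == nil {
		return def
	}
	return *v
}

// Resolve applies the kill-switch first, then the per-path defaults.
func (t MonitoringToggles) Resolve() (central, local bool) {
	if t.Enabled != nil && !*t.Enabled {
		return false, false // explicit false skips everything
	}
	return orDefault(t.Central, true), orDefault(t.Local, false)
}

// ptr is a small helper for building optional booleans.
func ptr(b bool) *bool { return &b }

func main() {
	cases := []struct {
		name string
		t    MonitoringToggles
	}{
		{"nothing set", MonitoringToggles{}},
		{"local opted in", MonitoringToggles{Local: ptr(true)}},
		{"kill-switch", MonitoringToggles{Enabled: ptr(false), Local: ptr(true)}},
	}
	for _, c := range cases {
		central, local := c.t.Resolve()
		fmt.Printf("%-15s central=%v local=%v\n", c.name, central, local)
	}
}
```

In this sketch an unset or true monitoring.enabled changes nothing; only an explicit false overrides the two sub-toggles.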

State of this topic (2026-06-23 — central/local split)

Active. The runner carries the full step set and gates each step per instance on the central/local toggles (monitoringStepShouldRun). The artefacts below are produced only by the Local stack (monitoring.local.enabled); the Central path deploys just the shared exporters and a vmagent that uses remote_write to send everything to the central platform, where storage, dashboards and SLO/alert rules live. When Local is on, the runner produces:

| Layer | Artefact |
| --- | --- |
| Base stack | `kube-prometheus-stack` Helm release in `cfg.Monitoring.Namespace`: Prometheus TSDB tier-sized (storage 20Gi→200Gi + `retentionSize` ≈85%, memory per tier) and every component lifted out of BestEffort with Burstable floors (kill-switch `OPENDESK_INCLUSTER_TIER_SIZING`) |
| Synthetic | Blackbox Exporter + `Probe` CRs for Keycloak OIDC, Nextcloud WebDAV, portal, Matrix federation |
| Dashboards | 10 ConfigMap-discovered Grafana dashboards (5 base infra + 5 SRE-grade); 7 thin per-app + duplicate dashboards retired and actively pruned |
| ServiceMonitors | ingress-nginx, cert-manager, ArgoCD ApplicationSet controller, dovecot, haproxy-ingress (ESO ships its own) |
| Exporters | mysqld_exporter (OX MariaDB; `info_schema.tables` collector off at ≥10k users), redis_exporter (Redis proxy); resource floors on all; postgres_exporter is a follow-up |
| Alerting | `opendesk-alerts` (availability, incl. the `OpenDeskProbesAbsent` watchdog, which fires when blackbox probes go absent, the case `OpenDeskEndpointDown` can't see) + `opendesk-infra-alerts` (Galera PVC / Redis proxy) + 6 SLOs with fast (1h/5m, 14.4×) and slow (6h/30m, 6×) burn-rate alerts over HAProxy ingress metrics |

The service also exposes 4 module modes, re-grouped around the split, for scoped re-deploys:

module:exporters → shared collectors
module:local-stack → full local Prometheus + Grafana + alerts
module:agent → central vmagent forwarder
module:safety-net → local alert safety net

It also provides a destroy plan enumerator (2026-06-12). Destroy removes only the kube-prometheus-stack Helm release; the namespace, TSDB PVC, exporters, PrometheusRules, ServiceMonitors and dashboard ConfigMaps are preserved.
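
The remove/preserve split of the destroy plan described above can be sketched as a simple enumeration. Only the split itself comes from this page; the PlanItem type, the DestroyPlan function and the output format are hypothetical and do not reflect the service's real API.

```go
package main

import "fmt"

// PlanItem is one resource the destroy plan reports.
// Type and field names are hypothetical.
type PlanItem struct {
	Resource string
	Remove   bool
}

// DestroyPlan enumerates what a destroy touches and what it leaves behind.
func DestroyPlan() []PlanItem {
	return []PlanItem{
		{"kube-prometheus-stack Helm release", true},
		{"monitoring namespace", false},
		{"Prometheus TSDB PVC", false},
		{"exporters", false},
		{"PrometheusRules", false},
		{"ServiceMonitors", false},
		{"dashboard ConfigMaps", false},
	}
}

func main() {
	// Print the plan so an operator can review it before anything runs.
	for _, item := range DestroyPlan() {
		action := "preserve"
		if item.Remove {
			action = "remove"
		}
		fmt.Printf("%-8s %s\n", action, item.Resource)
	}
}
```

Listing preserved resources explicitly, rather than only the removed one, makes it visible that the TSDB PVC and its data survive a destroy.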

What's still a follow-up (not policy-blocked): postgres_exporter wiring with per-app DSNs, and metrics.enabled on the deployer's own ingress-nginx / cert-manager Helm releases.

Out of scope by policy: patching the upstream openDesk app Helm charts (Keycloak, Synapse, Jitsi, Nextcloud) to enable internal /metrics. See concept-servicemonitors#Boundary: openDesk app Helm charts are off-limits.

When to add a page here

A new Grafana dashboard is curated → concept-dashboard-<name> and update concept-sre-dashboards
A new SLO is adopted → update concept-slo-burn-rate-alerts
A new PrometheusRule is shipped → update concept-prometheus-rules
An exporter or ServiceMonitor is added → update concept-servicemonitors
A monitoring-related incident → incident-*
A logs / tracing pipeline (Loki, OTel) → component-* or concept-*
