openDesk Operator Runbooks
52 runbooks, grouped by subsystem: 14 idP, 29 infrastructure, 9 openDesk apps, 0 monitoring, 0 backup.
This is the operator handbook for openDesk on STACKIT. Each runbook is a self-contained HTML page with the exact kubectl / helm / curl commands an SRE needs. The orchestrator (opendesk-deployer) automates everything documented here; these runbooks exist for the cases when the orchestrator is unavailable, when operators need to verify what's wrong, or when manual recovery is the right tool.
Jump to: idP | Infrastructure | openDesk apps | Monitoring | Backup | Pipeline diagram | Related docs
Each runbook card carries a kind badge: Manual = pipeline-step walkthrough, Issue = failure verification + recovery, Ops = day-2 routine, Brief = customer-facing knowledge.
The 28-step pipeline → manual runbook mapping
The orchestrator runs 28 steps in this order. Each manual deployment runbook (left column) covers one or more steps; they map to the deployer's source files (right column).
flowchart TD
    A["Steps 01-02<br/>Preflight + Kubeconfig<br/>(manual-01)"] --> B["Steps 03-06<br/>In-cluster Infra<br/>(manual-02)"]
    B --> C["Steps 07-10<br/>Clone fork + Render<br/>(manual-03)"]
    C --> D["Step 11<br/>Push to Git<br/>(manual-04)"]
    D --> E["Steps 12-13<br/>Install ArgoCD + ESO<br/>(manual-05)"]
    E --> F["Steps 14-15<br/>Prepare + Create Apps<br/>(manual-06)"]
    F --> G["Step 16<br/>TLS Certificate<br/>(manual-07)"]
    G --> H["Step 17<br/>Ensure 2FA Browser Flow<br/>(manual-08)"]
    H --> I["Steps 18-21<br/>OX + Nextcloud Prereqs<br/>(manual-09, 10)"]
    I --> J["Steps 22-23<br/>Wait + Validate<br/>(manual-11)"]
    J --> K["Steps 24-27<br/>Post-sync Keycloak / OX / Matrix<br/>(manual-08, 09)"]
    K --> L["Step 28<br/>Cleanup Failed Pods<br/>(manual-12)"]
    classDef done fill:#c6f6d5,stroke:#2f855a,color:#1a1a1a
    classDef sync fill:#feebc8,stroke:#c05621,color:#1a1a1a
    classDef post fill:#bee3f8,stroke:#2c5282,color:#1a1a1a
    class A,B,C,D,E,F,G,H done
    class I,J sync
    class K,L post
Start with manual-00-overview.html for the end-to-end walkthrough with placeholders, or jump to a specific step's runbook below.
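The step-to-runbook mapping in the diagram above can also be read as a lookup table. The sketch below transcribes the step ranges and runbook IDs from the diagram; the table layout and the helper name `runbooks_for_step` are illustrative assumptions, not part of opendesk-deployer.

```python
# Step ranges and runbook IDs transcribed from the pipeline diagram.
# PIPELINE and runbooks_for_step are hypothetical names used for illustration.
PIPELINE = [
    (range(1, 3), ["manual-01"]),                 # Preflight + Kubeconfig
    (range(3, 7), ["manual-02"]),                 # In-cluster Infra
    (range(7, 11), ["manual-03"]),                # Clone fork + Render
    (range(11, 12), ["manual-04"]),               # Push to Git
    (range(12, 14), ["manual-05"]),               # Install ArgoCD + ESO
    (range(14, 16), ["manual-06"]),               # Prepare + Create Apps
    (range(16, 17), ["manual-07"]),               # TLS Certificate
    (range(17, 18), ["manual-08"]),               # Ensure 2FA Browser Flow
    (range(18, 22), ["manual-09", "manual-10"]),  # OX + Nextcloud Prereqs
    (range(22, 24), ["manual-11"]),               # Wait + Validate
    (range(24, 28), ["manual-08", "manual-09"]),  # Post-sync Keycloak / OX / Matrix
    (range(28, 29), ["manual-12"]),               # Cleanup Failed Pods
]


def runbooks_for_step(step: int) -> list[str]:
    """Return the manual runbook IDs that cover a given pipeline step."""
    for steps, runbooks in PIPELINE:
        if step in steps:
            return runbooks
    raise ValueError(f"step {step} is outside the 28-step pipeline")
```

For example, `runbooks_for_step(20)` points at both manual-09 and manual-10, because the diagram groups steps 18-21 under the two prerequisite runbooks.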
idP — Identity Providers
Keycloak / Nubus realm, IdP federation (SAML / OIDC), Active Directory and ADFS integration, 2FA flow, identity-related failure modes, and customer-call briefs.
IdP Configuration — Canonical Step-by-Step Runbook Brief
Master IdP configuration runbook (~1055 lines). Created outside the manual/issue/ops convention; may overlap with the AD-integration briefs below.
All_idP — Master Identity-Provider Runbook Brief
Master IdP runbook (~845 lines). Companion to All_idP_manual.html.
All_idP Manual — Manual IdP Federation & Directory Importer Brief
Manual federation + Directory Importer setup (~690 lines).
Question/answer reference (~783 lines).
AD Integration Playbook — Discovery + All Paths Brief
Start here for AD customers. Eight discovery questions in Part 1, four self-contained execution paths (ADFS-SAML, customer OIDC bridge, Directory Importer only, one-shot UDM script) in Part 2.
IdP Integration — Customer-Call Briefing Brief
What openDesk supports and doesn't for IdP federation, per-IdP patterns (Entra / ADFS / generic OIDC / on-prem AD), discovery checklist, top 15 FAQs, and the "what we don't do" honesty list.
AD Integration — Full Manual (reference) Brief
Deep version of the Playbook. Customer-conversation framing, decision tree with clickable boxes, failure-mode triage, rollback, AD-specific pitfalls in symptom/spot/fix form.
AD Integration — 12-step Cheatsheet Brief
Linear single-path recipe (ADFS-SAML + Directory Importer). For when you already know the customer has ADFS and just need the 12 commands.
Keycloak Bootstrap Chain Manual 08
Steps 17, 24, 26, 27: 2fa-browser flow, IdP federation REST patch, Matrix bot registration, Vault KV v2 credential write.
Get Keycloak Admin Password Ops
Three retrieval paths: in-cluster Secret, STACKIT Secrets Manager, Administrator account. Mermaid path diagram.
UMS hook Job stuck "Pending deletion" Issue High
Manual recovery with mandatory hook ordering; explicit "deployer does not auto-recover" warning.
Element degraded after Keycloak ingress change Issue Medium
Four-path discovery tree based on /.well-known/openid-configuration response; nubusKeycloakExtensions.proxy.ingress protection.
StoreKeycloakCredentials can't read admin secret Issue Medium
Fail-loud step 27 behaviour (audit C-8); cluster-fallback for pre-fix silent skip; race diagram.
UMS REST API loops / pod evidence Issue Medium
Distinguished from keycloak-bootstrap-deadlock; passive-recovery rationale; anti-patterns preventing kubectl delete regressions.
Infrastructure
STACKIT provisioning (terraform, Redis, MariaDB Galera, Secrets Manager, ACME / TLS), ArgoCD + ESO, render pipeline, customer-isolation, and deployer-pipeline meta-runbooks.
Full-pipeline diagram, runbook index, common placeholders. Start here for the end-to-end walkthrough.
01 Prerequisites and Kubeconfig Manual
Steps 1–2: preflight checks, mint a STACKIT SKE kubeconfig, conditional upgrade-preflight.
02 In-cluster Infrastructure Manual
Steps 3–6: MariaDB Galera, MariaDB credentials cross-namespace, coturn LoadBalancer, Redis ghostunnel proxy.
03 Render with Helmfile Manual
Steps 7–10: clone the openDesk fork, write the four .yaml.gotmpl env files, run helmfile template + runtime-patches/apply.py.
04 Push Rendered Manifests Manual
Step 11: write one directory per helmfile release, generate per-release Application YAMLs (sync-wave + finalizer), commit and push.
05 Install ArgoCD and ESO Manual
Steps 12–13: helm install argo-cd (no CMP sidecar), External Secrets Operator + Vault SecretStore.
06 Prepare and Create Applications Manual
Steps 14–15: clean legacy CMP artifacts, register the repo Secret, create the nginx-fake-ca ConfigMap, parent + per-release Apps.
Step 16: wait for the opendesk-certificates Cert to be Ready; manual ACME recovery via label-scoped delete.
11 Wait for Sync and Validate Manual
Steps 22–23: poll customer Apps, force-sync ComparisonErrors, post-sync sanity checks.
Step 28: delete status.phase=Failed pods left behind by Job retries.
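Before deleting anything in Step 28, it can help to list exactly which pods match. The following is a minimal sketch that assumes its input is the JSON printed by `kubectl get pods -A -o json`; the function name is hypothetical, and it only reports pods without deleting them.

```python
import json


def failed_pod_names(pod_list_json: str) -> list[str]:
    """Return namespace/name for every pod whose status.phase is Failed."""
    pods = json.loads(pod_list_json).get("items", [])
    return [
        f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
        for p in pods
        if p.get("status", {}).get("phase") == "Failed"
    ]
```

Reviewing this list first keeps the cleanup scoped to pods that really are in the Failed phase.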
Cross-instance contamination (SEV-1) Issue High
SEV-1 triage; verifying the atomic-admission fix; anti-pattern table for tenant isolation.
Redis ghostunnel proxy unhealthy Issue High
Three-bucket probe-failure classification; STACKIT UI "Health" trap (PING is the source of truth).
ACME / TLS certificate stuck pending Issue High
Scoped delete recipe (label selector, never --all); LE rate-limit caveats; sibling-cert recovery.
Master password problems Issue High
Whitespace rejection (audit A-2) and v1.14 strict-required mode.
Token or domain with embedded whitespace Issue High
All four token fields and the three domain fields. od -c triage.
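The `od -c` triage boils down to finding whitespace characters that are invisible in a terminal. A rough Python equivalent, assuming the token or domain value is already in a string (the helper name is illustrative and this is not the deployer's own validation):

```python
import unicodedata


def find_whitespace(value: str) -> list[tuple[int, str]]:
    """Return (index, description) for each whitespace character in value."""
    hits = []
    for i, ch in enumerate(value):
        if ch.isspace():
            # Control characters such as \n and \t have no Unicode name,
            # so fall back to their repr.
            hits.append((i, unicodedata.name(ch, repr(ch))))
    return hits
```

An empty result means the value contains no characters that Python classifies as whitespace; a trailing newline from a copy-paste shows up as the last entry.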
Inventory guard blocks deploy: orphan sidecars Issue High
Element-OFF orphan-sidecar case (with explicit list of stale _apps/ files) and the vanished-stack-data-ums inventory-preflight case.
customerId rejected by Kubernetes Issue Medium
DNS-1123 reject/accept table; recovery when bad value already shipped.
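A customerId can be checked locally before it reaches the cluster. The sketch below applies the Kubernetes DNS-1123 label rule (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters). That customerId is validated as a label rather than as a subdomain is an assumption here; the runbook's reject/accept table is authoritative.

```python
import re

# DNS-1123 label pattern as used by Kubernetes for label-style names.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns1123_label(value: str) -> bool:
    """True if value is a valid DNS-1123 label (max 63 characters)."""
    return len(value) <= 63 and DNS1123_LABEL.fullmatch(value) is not None
```

Under this rule `acme-01` is accepted, while `Acme`, `-acme`, and `acme_eu` are rejected for uppercase, a leading hyphen, and an underscore respectively.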
ArgoCD Application stuck Terminating Issue Medium
Finalizer-strip recipe; namespace unwedge with last-resort replace --raw /finalize.
Compact + full + conditions + cached HTTP probe + argocd CLI views with status-pattern table.
Force-Sync ArgoCD Application Ops
Force-sync via operation patch + hard-refresh; SSA apply.force incompatibility warning.
Force ESO to Sync a Secret Ops
force-sync annotation trigger; 30-sec ESO mental model + flow diagram; escalation ladder; pod-restart reminder for envFrom mounts.
Remove ArgoCD Cascade Finalizer Ops
Strip resources-finalizer.argocd.argoproj.io before deletion; cascade vs survival diagram.
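Stripping the cascade finalizer amounts to a merge patch on `metadata.finalizers`. A minimal sketch, assuming the Application manifest is available as a parsed dict (for example from `kubectl get application ... -o json`); the function name is hypothetical, and only the exact finalizer string named above is removed.

```python
import json

ARGOCD_CASCADE_FINALIZER = "resources-finalizer.argocd.argoproj.io"


def finalizer_strip_patch(application: dict) -> str:
    """Build a JSON merge patch that drops the ArgoCD cascade finalizer."""
    current = application.get("metadata", {}).get("finalizers") or []
    kept = [f for f in current if f != ARGOCD_CASCADE_FINALIZER]
    # In a JSON merge patch, null removes the field entirely; a non-empty
    # list replaces the existing list wholesale.
    return json.dumps({"metadata": {"finalizers": kept or None}})
```

The resulting string is the kind of payload that `kubectl patch ... --type merge -p` accepts. Any other finalizers on the Application are preserved.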
List Customer Applications Ops
List all Apps for a customerId; documents the full label vocabulary.
Inspect / copy / clean state on the controller cluster's shared PVC; what lives where; sa-key.json security warning.
Manual ACME recovery via labeled delete of Challenge / Order / CertificateRequest; --all warning.
Delete deployer-checkpoint.json to force fresh run; runner skip-logic diagram.
Safe Destroy (Blast Radius Preview) Ops
Plan-then-execute token flow; UI + API forms; 409-on-prod-token-missing troubleshooting.
Three rollback paths (git-revert, prior-tag re-deploy, targeted argocd app rollback); explicit kubectl rollout undo ban.
Choose an openDesk Version (1.13 vs 1.14) Ops
Parallel-lines model, what actually differs (5/20 patches diverge, component versions, feature scope), decision checklist, upgrade cost.
openDesk Apps
The actual openDesk applications — Nextcloud, Open-Xchange (OX), OpenProject, XWiki, Element/Matrix — their per-app bootstrap chains, app-specific failure modes, and license / Enterprise-Edition handling.
Steps 18, 19, 25: pre-create PRIMARYDB_9, run initconfigdb, register filestore/server/database, ensure context 1 exists.
10 Nextcloud Init and Restart Manual
Steps 20–21: build init Job from Deployment spec, wait for management Job, rollout-restart for trusted-domain fix.
License / Enterprise Edition issues Issue High
EE registry creds, XWiki / OpenProject license whitespace and YAML-folding. Decision tree across four sub-incidents.
OX initconfigdb / bootstrap deadlock Issue High
PRIMARYDB_9 chicken-and-egg; INFRA_NS vs deployment NS; Path A (DB missing) and Path B (DB exists, bootstrap still failing).
Nextcloud "untrusted domain" / trusted-domain race Issue High
Three failure modes (schema missing, management-Job race, config drift); per-pod config.php triage.
OpenProject /auth/keycloak 404 Issue High
Backoff exhaustion + EE-token YAML-folded; full inline-seeder stopgap with selfHeal disable; token/domain mismatch detection.
Replace a License in STACKIT Secrets Manager Ops
Update the value in Secrets Manager (Cockpit UI or stackit CLI) → force ESO to sync → restart the consuming app. Reference table mapping each license to its SM path.
Recover OpenProject OIDC Seeder Ops
Inline rails db:seed bypassing TokenSeeder; selfHeal disable + verify pinning; stopgap framing.
Check Nextcloud Trusted Domains Ops
Per-pod config.php grep loop (essential because v1.14 pods can disagree); rollout-restart fix.
Monitoring
No monitoring runbooks yet. This is where Prometheus / Grafana / alerting / observability runbooks would live — file new ones with the monitoring- filename prefix.
Backup
No backup runbooks yet. This is where backup_and_restore service runbooks would live — snapshot, restore, retention checks. File new ones with the backup- filename prefix.
Related docs
../../wiki/: the project's Obsidian LLMWiki (~155 atomic pages). Browse wiki/MOC.md for the full index. The wiki carries the depth; these runbooks are the streamlined operator surface.
../../CLAUDE.md: single-file always-loaded reference for Claude Code. Concise summary of architecture, the 28 steps, and the patch model.
../../docs/: long-form runbooks, ADRs, reliability reports.
Conventions: all commands use ${PLACEHOLDER} for values you must substitute. Each runbook lists its placeholders at the top. Code blocks are copy-paste-ready — no shell-prompt prefix.
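The `${PLACEHOLDER}` convention lends itself to a fail-fast substitution step before a command is run. The sketch below is one way to do that. The helper name and the `NAMESPACE` placeholder in the usage note are illustrative assumptions, not names taken from any specific runbook.

```python
import re

# Matches ${NAME}; other shell constructs such as $(...) are left untouched.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_placeholders(command: str, values: dict[str, str]) -> str:
    """Substitute every ${NAME} in command, refusing if any value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(command)) - set(values))
    if missing:
        raise KeyError(f"unset placeholders: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], command)
```

Calling `fill_placeholders("kubectl get pods -n ${NAMESPACE}", {"NAMESPACE": "opendesk"})` returns the command with the namespace filled in, while an empty `values` dict raises instead of letting a literal `${NAMESPACE}` reach the shell.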