
openDesk Operator Runbooks

52 runbooks, grouped by subsystem: 14 IdP, 29 infrastructure, 9 openDesk apps, 0 monitoring, 0 backup.

This is the operator handbook for openDesk on STACKIT. Each runbook is a self-contained HTML page with the exact kubectl / helm / curl commands an SRE needs. The orchestrator (opendesk-deployer) automates everything documented here; these runbooks exist for the cases when the orchestrator is unavailable, when operators need to verify what's wrong, or when manual recovery is the right tool.

Jump to: IdP, Infrastructure, openDesk apps, Monitoring, Backup, Pipeline diagram, Related docs

Each runbook card carries a kind badge: Manual = pipeline-step walkthrough, Issue = failure verification + recovery, Ops = day-2 routine, Brief = customer-facing knowledge.

The 28-step pipeline → manual runbook mapping

The orchestrator runs 28 steps in this order. Each manual deployment runbook (left column) covers one or more steps; they map to the deployer's source files (right column).

flowchart TD
    A["Steps 01-02<br/>Preflight + Kubeconfig<br/>(manual-01)"] --> B["Steps 03-06<br/>In-cluster Infra<br/>(manual-02)"]
    B --> C["Steps 07-10<br/>Clone fork + Render<br/>(manual-03)"]
    C --> D["Step 11<br/>Push to Git<br/>(manual-04)"]
    D --> E["Steps 12-13<br/>Install ArgoCD + ESO<br/>(manual-05)"]
    E --> F["Steps 14-15<br/>Prepare + Create Apps<br/>(manual-06)"]
    F --> G["Step 16<br/>TLS Certificate<br/>(manual-07)"]
    G --> H["Step 17<br/>Ensure 2FA Browser Flow<br/>(manual-08)"]
    H --> I["Steps 18-21<br/>OX + Nextcloud Prereqs<br/>(manual-09, 10)"]
    I --> J["Steps 22-23<br/>Wait + Validate<br/>(manual-11)"]
    J --> K["Steps 24-27<br/>Post-sync Keycloak / OX / Matrix<br/>(manual-08, 09)"]
    K --> L["Step 28<br/>Cleanup Failed Pods<br/>(manual-12)"]
    classDef done fill:#c6f6d5,stroke:#2f855a,color:#1a1a1a
    classDef sync fill:#feebc8,stroke:#c05621,color:#1a1a1a
    classDef post fill:#bee3f8,stroke:#2c5282,color:#1a1a1a
    class A,B,C,D,E,F,G,H done
    class I,J sync
    class K,L post

Start with manual-00-overview.html for the end-to-end walkthrough with placeholders, or jump to a specific step's runbook below.

IdP: Identity Providers

Keycloak / Nubus realm, IdP federation (SAML / OIDC), Active Directory and ADFS integration, 2FA flow, identity-related failure modes, and customer-call briefs.

IdP Configuration — Canonical Step-by-Step Runbook Brief

Master IdP configuration runbook (~1055 lines). Created outside the manual/issue/ops convention; may overlap with the AD-integration briefs below.

All_idP — Master Identity-Provider Runbook Brief

Master IdP runbook (~845 lines). Companion to All_idP_manual.html.

All_idP Manual — Manual IdP Federation & Directory Importer Brief

Manual federation + Directory Importer setup (~690 lines).

IdP Integration — Q&A Brief

Question/answer reference (~783 lines).

AD Integration Playbook — Discovery + All Paths Brief

Start here for AD customers. Eight discovery questions in Part 1, four self-contained execution paths (ADFS-SAML, customer OIDC bridge, Directory Importer only, one-shot UDM script) in Part 2.

IdP Integration — Customer-Call Briefing Brief

What openDesk does and does not support for IdP federation, per-IdP patterns (Entra / ADFS / generic OIDC / on-prem AD), discovery checklist, top 15 FAQs, and the "what we don't do" honesty list.

AD Integration — Full Manual (reference) Brief

Deep version of the Playbook. Customer-conversation framing, decision tree with clickable boxes, failure-mode triage, rollback, AD-specific pitfalls in symptom/spot/fix form.

AD Integration — 12-step Cheatsheet Brief

Linear single-path recipe (ADFS-SAML + Directory Importer). For when you already know the customer has ADFS and just need the 12 commands.

Keycloak Bootstrap Chain Manual 08

Steps 17, 24, 26, 27: 2fa-browser flow, IdP federation REST patch, Matrix bot registration, Vault KV v2 credential write.

Get Keycloak Admin Password Ops

Three retrieval paths: in-cluster Secret, STACKIT Secrets Manager, Administrator account. Mermaid path diagram.
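The in-cluster Secret path can be sketched roughly as follows. This is only an illustration: the function name `read_secret_key` is hypothetical, and the Secret name and key are passed as arguments because this index does not state them; the runbook itself is the source of truth.

```shell
# Hypothetical helper: print one decoded key from a Kubernetes Secret.
# Secret name and key are arguments; NAMESPACE is a placeholder you must set.
read_secret_key() {
  local secret="$1" key="$2"
  : "${NAMESPACE:?set NAMESPACE first}"
  # Secret data is base64-encoded, so decode it after extraction.
  # Keys containing dots or dashes would need jsonpath escaping.
  kubectl get secret "$secret" \
    --namespace "$NAMESPACE" \
    -o "jsonpath={.data.${key}}" | base64 --decode
}
```

A call would look like `read_secret_key "${SECRET_NAME}" "${SECRET_KEY}"`, with both values taken from the runbook.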

UMS hook Job stuck "Pending deletion" Issue High

Manual recovery with mandatory hook ordering; explicit "deployer does not auto-recover" warning.

Element degraded after Keycloak ingress change Issue Medium

Four-path discovery tree based on /.well-known/openid-configuration response; nubusKeycloakExtensions.proxy.ingress protection.

StoreKeycloakCredentials can't read admin secret Issue Medium

Fail-loud step 27 behaviour (audit C-8); cluster-fallback for pre-fix silent skip; race diagram.

UMS REST API loops / pod evidence Issue Medium

Distinguished from keycloak-bootstrap-deadlock; passive-recovery rationale; anti-patterns preventing kubectl delete regressions.

Infrastructure

STACKIT provisioning (terraform, Redis, MariaDB Galera, Secrets Manager, ACME / TLS), ArgoCD + ESO, render pipeline, customer-isolation, and deployer-pipeline meta-runbooks.

00 Overview Manual

Full-pipeline diagram, runbook index, common placeholders. Start here for the end-to-end walkthrough.

01 Prerequisites and Kubeconfig Manual

Steps 1–2: preflight checks, mint a STACKIT SKE kubeconfig, conditional upgrade-preflight.

02 In-cluster Infrastructure Manual

Steps 3–6: MariaDB Galera, MariaDB credentials cross-namespace, coturn LoadBalancer, Redis ghostunnel proxy.

03 Render with Helmfile Manual

Steps 7–10: clone the openDesk fork, write the four .yaml.gotmpl env files, run helmfile template + runtime-patches/apply.py.

04 Push Rendered Manifests Manual

Step 11: write one directory per helmfile release, generate per-release Application YAMLs (sync-wave + finalizer), commit and push.

05 Install ArgoCD and ESO Manual

Steps 12–13: helm install argo-cd (no CMP sidecar), External Secrets Operator + Vault SecretStore.

06 Prepare and Create Applications Manual

Steps 14–15: clean legacy CMP artifacts, register the repo Secret, create the nginx-fake-ca ConfigMap, parent + per-release Apps.

07 TLS Certificate Manual

Step 16: wait for the opendesk-certificates Cert to be Ready; manual ACME recovery via label-scoped delete.
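A minimal sketch of the wait, assuming `opendesk-certificates` is a cert-manager Certificate in the deployment namespace. The function name, the `NAMESPACE` placeholder and the default timeout are illustrative choices, not values taken from the deployer.

```shell
# Sketch: block until the Certificate reports condition Ready.
# NAMESPACE is a placeholder; TIMEOUT defaults to an arbitrary 600s.
wait_for_certificate() {
  kubectl wait certificate/opendesk-certificates \
    --namespace "${NAMESPACE:?set NAMESPACE first}" \
    --for=condition=Ready \
    --timeout="${TIMEOUT:-600s}"
}
```

`kubectl wait` exits non-zero on timeout, which is the signal to move on to the manual ACME recovery described in the runbook.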

11 Wait for Sync and Validate Manual

Steps 22–23: poll customer Apps, force-sync ComparisonErrors, post-sync sanity checks.

12 Cleanup Failed Pods Manual

Step 28: delete status.phase=Failed pods left behind by Job retries.
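As a rough sketch, the cleanup amounts to a field-selector delete. The function name and the single-namespace scope are assumptions; the deployer may target namespaces differently.

```shell
# Sketch: delete pods whose status.phase is Failed in one namespace.
# NAMESPACE is a placeholder you must set before calling.
cleanup_failed_pods() {
  kubectl delete pods \
    --namespace "${NAMESPACE:?set NAMESPACE first}" \
    --field-selector=status.phase=Failed
}
```

Running `kubectl get pods` with the same field selector first shows what would be removed.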

Cross-instance contamination (SEV-1) Issue High

SEV-1 triage; verifying the atomic-admission fix; anti-pattern table for tenant isolation.

Redis ghostunnel proxy unhealthy Issue High

Three-bucket probe-failure classification; STACKIT UI "Health" trap (PING is the source of truth).

ACME / TLS certificate stuck pending Issue High

Scoped delete recipe (label selector, never --all); LE rate-limit caveats; sibling-cert recovery.

Master password problems Issue High

Whitespace rejection (audit A-2) and v1.14 strict-required mode.

Token or domain with embedded whitespace Issue High

All four token fields and the three domain fields. od -c triage.
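The `od -c` triage can be illustrated with two small bash helpers. Their names are hypothetical; the idea is only to make tabs, newlines and carriage returns visible before a value is submitted.

```shell
# Sketch (bash): succeed if the value contains any whitespace character.
has_whitespace() {
  [[ "$1" =~ [[:space:]] ]]
}

# Sketch: dump the value byte by byte; od -c prints escapes such as \t, \n, \r.
show_bytes() {
  printf '%s' "$1" | od -c
}
```

For example, `has_whitespace "${TOKEN}" && show_bytes "${TOKEN}"` prints the byte dump only when something suspicious is present.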

Inventory guard blocks deploy: orphan sidecars Issue High

Element-OFF orphan-sidecar case (with explicit list of stale _apps/ files) and the vanished-stack-data-ums inventory-preflight case.

customerId rejected by Kubernetes Issue Medium

DNS-1123 reject/accept table; recovery when bad value already shipped.
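A pre-flight check along these lines could catch a bad value before it ships. This sketch assumes the customerId must satisfy the DNS-1123 label rules (as Kubernetes requires for namespace names); the runbook's reject/accept table remains authoritative, and the function name is hypothetical.

```shell
# Sketch (bash): validate a value as a DNS-1123 label.
# Rules: 1-63 characters, lowercase alphanumerics or '-',
# starting and ending with an alphanumeric character.
is_dns1123_label() {
  local value="$1"
  local re='^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
  [ "${#value}" -le 63 ] && [[ "$value" =~ $re ]]
}
```

Under these rules `acme-01` passes, while `Acme`, `acme_01` and `-acme` are rejected.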

ArgoCD Application stuck Terminating Issue Medium

Finalizer-strip recipe; namespace unwedge with last-resort replace --raw /finalize.

Check ArgoCD Sync Status Ops

Compact + full + conditions + cached HTTP probe + argocd CLI views with status-pattern table.
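The compact view might look something like the sketch below. It assumes ArgoCD Applications live in a namespace called `argocd` unless `ARGOCD_NAMESPACE` says otherwise; both the variable and the function name are illustrative.

```shell
# Sketch: one line per Application with its sync and health status.
# ARGOCD_NAMESPACE is a hypothetical override; "argocd" is an assumed default.
argocd_sync_compact() {
  kubectl get applications.argoproj.io \
    --namespace "${ARGOCD_NAMESPACE:-argocd}" \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
}
```

Using the fully qualified resource name avoids ambiguity if another CRD in the cluster is also called `application`.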

Force-Sync ArgoCD Application Ops

Force-sync via operation patch + hard-refresh; SSA apply.force incompatibility warning.
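The hard-refresh half of this procedure can be sketched with ArgoCD's refresh annotation. The operation patch is deliberately not reproduced here, since its exact payload belongs to the runbook. The function name and the `argocd` namespace default are assumptions.

```shell
# Sketch: ask ArgoCD to hard-refresh one Application (re-read Git, drop cache).
# ARGOCD_NAMESPACE is a hypothetical override; "argocd" is an assumed default.
hard_refresh_app() {
  local app="$1"
  kubectl annotate applications.argoproj.io "$app" \
    --namespace "${ARGOCD_NAMESPACE:-argocd}" \
    --overwrite \
    argocd.argoproj.io/refresh=hard
}
```

A hard refresh only recomputes the comparison; it does not by itself start a sync.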

Force ESO to Sync a Secret Ops

force-sync annotation trigger; 30-sec ESO mental model + flow diagram; escalation ladder; pod-restart reminder for envFrom mounts.
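The annotation trigger can be sketched as below. ESO reconciles an ExternalSecret when its annotations change, so writing a fresh timestamp forces a sync. The function name and the `NAMESPACE` placeholder are illustrative.

```shell
# Sketch: force External Secrets Operator to reconcile one ExternalSecret
# by changing its force-sync annotation to the current Unix time.
force_eso_sync() {
  local name="$1"
  kubectl annotate externalsecret "$name" \
    --namespace "${NAMESPACE:?set NAMESPACE first}" \
    --overwrite \
    "force-sync=$(date +%s)"
}
```

Pods that consume the Secret through envFrom still need a restart to pick up the new value.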

Remove ArgoCD Cascade Finalizer Ops

Strip resources-finalizer.argocd.argoproj.io before deletion; cascade vs survival diagram.
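One way to strip the finalizer is a merge patch, sketched here under two assumptions: Applications live in the `argocd` namespace unless overridden, and the cascade finalizer is the only finalizer on the object, because the patch clears the whole list. The function name is hypothetical.

```shell
# Sketch: clear metadata.finalizers on one Application so that deleting it
# does not cascade to the resources it manages.
# Caution: this removes every finalizer on the object, not just
# resources-finalizer.argocd.argoproj.io.
remove_cascade_finalizer() {
  local app="$1"
  kubectl patch applications.argoproj.io "$app" \
    --namespace "${ARGOCD_NAMESPACE:-argocd}" \
    --type merge \
    --patch '{"metadata":{"finalizers":null}}'
}
```

Inspecting `.metadata.finalizers` beforehand confirms whether anything other than the cascade finalizer would be lost.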

List Customer Applications Ops

List all Apps for a customerId; documents the full label vocabulary.

Access Controller PVC Ops

Inspect / copy / clean state on the controller cluster's shared PVC; what lives where; sa-key.json security warning.

Recover Stuck ACME Order Ops

Manual ACME recovery via labeled delete of Challenge / Order / CertificateRequest; --all warning.
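A label-scoped delete could be sketched as follows. The label selector is left as a placeholder because this index does not state it; the function name is hypothetical. The guard refuses to run with an empty selector, in keeping with the runbook's `--all` warning.

```shell
# Sketch: delete cert-manager Challenge / Order / CertificateRequest objects
# that match a label selector in one namespace.
# NAMESPACE and LABEL_SELECTOR are placeholders; both must be non-empty.
delete_acme_objects() {
  : "${NAMESPACE:?set NAMESPACE first}"
  : "${LABEL_SELECTOR:?set LABEL_SELECTOR first; never fall back to --all}"
  kubectl delete \
    challenges.acme.cert-manager.io,orders.acme.cert-manager.io,certificaterequests.cert-manager.io \
    --namespace "$NAMESPACE" \
    --selector "$LABEL_SELECTOR"
}
```

Running `kubectl get` with the same resource list and selector first gives a preview of what will be deleted.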

Reset Deployer Checkpoint Ops

Delete deployer-checkpoint.json to force fresh run; runner skip-logic diagram.

Safe Destroy (Blast Radius Preview) Ops

Plan-then-execute token flow; UI + API forms; 409-on-prod-token-missing troubleshooting.

Roll Back a Deploy Ops

Three rollback paths (git-revert, prior-tag re-deploy, targeted argocd app rollback); explicit kubectl rollout undo ban.

Choose an openDesk Version (1.13 vs 1.14) Ops

Parallel-lines model, what actually differs (5/20 patches diverge, component versions, feature scope), decision checklist, upgrade cost.

openDesk Apps

The actual openDesk applications — Nextcloud, Open-Xchange (OX), OpenProject, XWiki, Element/Matrix — their per-app bootstrap chains, app-specific failure modes, and license / Enterprise-Edition handling.

09 OX Bootstrap Manual

Steps 18, 19, 25: pre-create PRIMARYDB_9, run initconfigdb, register filestore/server/database, ensure context 1 exists.

10 Nextcloud Init and Restart Manual

Steps 20–21: build init Job from Deployment spec, wait for management Job, rollout-restart for trusted-domain fix.

License / Enterprise Edition issues Issue High

EE registry creds, XWiki / OpenProject license whitespace and YAML-folding. Decision tree across four sub-incidents.

OX initconfigdb / bootstrap deadlock Issue High

PRIMARYDB_9 chicken-and-egg; INFRA_NS vs deployment NS; Path A (DB missing) and Path B (DB exists, bootstrap still failing).

Nextcloud "untrusted domain" / trusted-domain race Issue High

Three failure modes (schema missing, management-Job race, config drift); per-pod config.php triage.

OpenProject /auth/keycloak 404 Issue High

Backoff exhaustion + EE-token YAML-folded; full inline-seeder stopgap with selfHeal disable; token/domain mismatch detection.

Replace a License in STACKIT Secrets Manager Ops

Update the value in Secrets Manager (Cockpit UI or stackit CLI) → force ESO to sync → restart the consuming app. Reference table mapping each license to its SM path.

Recover OpenProject OIDC Seeder Ops

Inline rails db:seed bypassing TokenSeeder; selfHeal disable + verify pinning; stopgap framing.

Check Nextcloud Trusted Domains Ops

Per-pod config.php grep loop (essential because v1.14 pods can disagree); rollout-restart fix.
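The per-pod loop might be sketched like this. Everything specific is a placeholder: `NC_SELECTOR` (the pod label selector) and `NC_CONFIG_PATH` (the location of config.php inside the container) are hypothetical names whose real values come from the runbook.

```shell
# Sketch: print the trusted_domains block from config.php in every matching pod,
# so disagreement between pods becomes visible.
# NAMESPACE, NC_SELECTOR and NC_CONFIG_PATH are placeholders you must set.
check_trusted_domains() {
  : "${NAMESPACE:?set NAMESPACE first}"
  : "${NC_SELECTOR:?set NC_SELECTOR first}"
  : "${NC_CONFIG_PATH:?set NC_CONFIG_PATH first}"
  local pod
  for pod in $(kubectl get pods --namespace "$NAMESPACE" -l "$NC_SELECTOR" -o name); do
    echo "== ${pod}"
    kubectl exec --namespace "$NAMESPACE" "$pod" -- \
      grep -A 5 "trusted_domains" "$NC_CONFIG_PATH"
  done
}
```

Comparing the output blocks pod by pod is the point of the loop; a single pod's answer is not enough when pods can disagree.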

Monitoring

No monitoring runbooks yet. This is where Prometheus / Grafana / alerting / observability runbooks would live — file new ones with the monitoring- filename prefix.

Backup

No backup runbooks yet. This is where backup_and_restore service runbooks would live — snapshot, restore, retention checks. File new ones with the backup- filename prefix.

Related docs


  • ../../wiki/ — the project's Obsidian LLMWiki (~155 atomic pages). Browse wiki/MOC.md for the full index. The wiki carries the depth; these runbooks are the streamlined operator surface.
  • ../../CLAUDE.md — single-file always-loaded reference for Claude Code. Concise summary of architecture, the 28 steps, and the patch model.
  • ../../docs/ — long-form runbooks, ADRs, reliability reports.

Conventions: all commands use ${PLACEHOLDER} for values you must substitute. Each runbook lists its placeholders at the top. Code blocks are copy-paste-ready — no shell-prompt prefix.
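As an illustration of the placeholder convention, a small guard can refuse to continue while a required value is still unset. The helper name `require_vars` and the variable names in the usage line are hypothetical and do not come from the runbooks; the sketch assumes bash.

```shell
# Hypothetical helper (bash): fail if any named variable is unset or empty.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing value for \${${name}}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Placed at the top of a pasted command sequence, `require_vars CUSTOMER_ID NAMESPACE || exit 1` stops before any command runs with an empty value.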