Deployment

The end-to-end story of how openDesk gets onto a STACKIT cluster: the two Go microservices that drive it, the conceptual model (rendered-manifests mode, per-release Applications + sync-wave, helmfile, two-process model), the 35-step pipeline, the source patches (fork pipeline) and render-time YAML fixes (declarative runtime-patches/) that make the open-source charts actually deploy, the architectural decisions, and the runbooks/incidents that pile up around all of it. This is the bulk of the wiki because it's where the project's complexity lives.

Answers: "What does step N do?", "Why is patch X applied?", "How do I force-sync an Application?", "Why was CMP mode removed?", "Where do source patches live vs render-time fixes?", "Where does the deployer record its progress?".

Audience: anyone working on the deployer itself, debugging a deployment in flight, or onboarding to the project.

Pages

Components

  • component-deployer-service — opendesk-deployer Go service on port 8092; runs the 35-step pipeline
  • component-web-service — opendesk-web Go service on port 8090; Vue 3 UI, project/instance CRUD, EventSource fan-out

Concepts

  • concept-rendered-manifests-mode — pre-render YAML, push to git, ArgoCD syncs without CMP — the only supported mode today
  • concept-applicationset — historical: ApplicationSet model, since superseded by per-release Applications + sync-waves; kept for context
  • concept-helmfile — declarative Helm release manager; how helmfile template -e prod is invoked
  • concept-rendering-pipeline — the render step's full pipeline: helmfile template --output-dir → patches → JIT → overrides → dedup → lint → inventory → ESO → redact → guards (SCIM, orphan-sidecars, namespace markers, required-patch fatal)
  • concept-component-split — historical 8-bucket split, since superseded by one directory per helmfile release
  • concept-rolling-sync — historical 4-phase RollingSync, since superseded by sync-wave annotations from release-graph.yaml
  • concept-argocd-hooks — PreSync / Sync / PostSync; helm.sh/hook vs argocd.argoproj.io/hook
  • concept-checkpointing — HISTORICAL: checkpointing removed; every step runs every deploy (liveness = inflight marker + K8s Lease)
  • concept-deploy-modules — per-component deploys; a module is a mode (module:<key>, 10 deployer + 6 monitoring modules)
  • concept-deploy-runtime-switches — the 2026-06-12 fix-series kill-switches (mid-run kubeconfig refresh, run budget, loud-failure switches, skew guard)
  • concept-fork-repo — the upstream fork branch model + the manifest-driven source patches in upstream-patches/ (manifest.yaml + lib.py, run by the Apply-Patches CI pipeline; scripts/create-fork-branch.py is the legacy predecessor)
  • concept-fixes-repo — the shared runtime-patches/rendered-patches/*.yaml declarative patch model + apply.py primitives DSL (per-patch version scoping via appliesTo)
  • concept-two-process-model — server vs --mode-runner subprocess, keyed per-instance sessions, graceful stop, inflight admission
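
The move from RollingSync phases to sync-wave annotations derived from a release graph can be illustrated with a minimal sketch. It assumes the graph is a plain mapping from each release to the releases it depends on; the real release-graph.yaml schema is not shown on this page, and the release names below (database, idp, files) are hypothetical. Only the annotation key argocd.argoproj.io/sync-wave is standard Argo CD.

```python
from graphlib import TopologicalSorter


def sync_waves(needs: dict[str, list[str]]) -> dict[str, int]:
    """Return a wave per release: 0 when it has no dependencies,
    otherwise 1 + the highest wave among its dependencies."""
    waves: dict[str, int] = {}
    # static_order() yields every dependency before its dependents and
    # raises graphlib.CycleError if the graph contains a cycle.
    for release in TopologicalSorter(needs).static_order():
        deps = needs.get(release, [])
        waves[release] = 1 + max((waves[d] for d in deps), default=-1)
    return waves


def wave_annotation(wave: int) -> dict[str, str]:
    """Build the Argo CD annotation for a wave (annotation values are strings)."""
    return {"argocd.argoproj.io/sync-wave": str(wave)}


if __name__ == "__main__":
    # Hypothetical graph: "files" needs "idp" and "database", "idp" needs "database".
    graph = {"idp": ["database"], "files": ["idp", "database"]}
    for name, wave in sorted(sync_waves(graph).items(), key=lambda kv: kv[1]):
        print(name, wave_annotation(wave))
```

Deriving the wave from dependency depth keeps independent releases in the same wave, so Argo CD can sync them in parallel while still ordering dependents after their prerequisites.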

Decisions

  • decision-rendered-manifests-only — why CMP mode was removed; rendered-manifests is the only path
  • decision-applicationset-vs-app — historical context for the per-release Application model that replaced both single-Application and ApplicationSet

Steps (35)

Authoritative order is in runner.go (NewDeployerRunner). Neither file-number prefixes (00_, 01b_, 05_, 07_…) nor the page filenames' step numbers match the current execution positions — page filenames are historical; the [NN] below is the current 0-based position. There is no checkpointing — every step runs every deploy (concept-checkpointing); the per-instance deploy Lease is taken before step [00].
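
The relationship between execution position and page filename can be sketched as follows. This is an illustration only, not the logic in runner.go: it assumes the order is simply a list, derives the [NN] label from the list index, and flags pages whose filename number no longer matches. The helper name positions is hypothetical; the page names are taken from the step list on this page.

```python
import re


def positions(pages: list[str], start: int = 0) -> list[tuple[str, str, bool]]:
    """Pair each page with its 0-based execution position.

    The third tuple element is True when the number embedded in the
    page filename happens to equal the current position.
    """
    out = []
    for pos, page in enumerate(pages, start=start):
        match = re.match(r"step-(\d+)-", page)
        file_no = int(match.group(1)) if match else None
        out.append((f"[{pos:02d}]", page, file_no == pos))
    return out


if __name__ == "__main__":
    # Pages at positions 06 to 10, copied from the list on this page.
    excerpt = [
        "step-06-clone-opendesk",
        "step-07-ensure-jit-pull-secret",
        "step-07-generate-helmfile-values",
        "step-08-setup-enterprise",
        "step-09-render-manifests",
    ]
    for label, page, matches in positions(excerpt, start=6):
        print(label, page, "" if matches else "(filename number is historical)")
```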

In-Cluster Infrastructure:

  • [00] step-00-preflight-external-deps — verify SKE / Postgres / S3 / Redis / DNS / SM (OPENDESK_SKIP_PREFLIGHT=1 bypasses)
  • [01] step-01-refresh-kubeconfig — fetch kubeconfig from STACKIT SKE (3h TTL; MaybeRefreshKubeconfig re-fetches mid-run)
  • [02] step-02-deploy-mariadb-galera — in-cluster Galera (gated by terraform.provisionGalera; tier-sized memory/buffer-pool/max_connections/storage)
  • [03] step-03-register-mariadb-credentials — read passwords, copy Secrets cross-namespace, wire cfg.ExternalServices.MariaDB
  • [04] step-04-deploy-coturn — TURN/STUN LoadBalancer for Element 1-on-1 calls (single replica by design; one-sided-TURN limitation)
  • [05] step-05-deploy-redis-proxy — ghostunnel TLS termination + upstream reachability probe (tier-sized, bounded --max-concurrent-conns)

Deployment Setup:

  • [06] step-06-clone-opendesk — clone fork, resolve __NAMESPACE__, neutralise post-renderers
  • [07] step-07-ensure-jit-pull-secret — pull Secret for the JIT plugin image (no-op unless externalIdp.userSync)
  • [08] step-07-generate-helmfile-values — write cluster/secrets/policy/sizing.yaml.gotmpl
  • [09] step-08-setup-enterprise — apply enterprise license keys, registry, OpenKruise, Dovecot keys
  • [10] step-09-render-manifests — helmfile template --output-dir + the full rendering pipeline
  • [11] step-10-push-rendered-manifests — one directory per helmfile release + one Application YAML per release + render-meta.yaml
  • [12] step-11-install-argocd — helm upgrade --install argo-cd (no CMP)
  • [13] step-12-install-eso — External Secrets Operator (skipped if SecretsManager unset)
  • [14] step-13-prepare-argocd-secrets — register repo creds, create nginx-fake-ca, clean legacy CMP artifacts
  • [15] step-14-create-argocd-application — create the per-release Applications (Application-of-Applications)
  • [16] step-15-ensure-certificate-issued — wait for opendesk-certificates Cert; auto-recover stuck ACME

Prerequisites (run BEFORE the ArgoCD sync wait):

  • [17] step-17-ensure-ox-mariadb-database — pre-create PRIMARYDB_9 to break OX bootstrap deadlock
  • [18] step-18-ox-bootstrap-fix — initconfigdb, register filestore/server/database
  • [19] step-24-ensure-ox-context — create OX context 1 if missing; restart ox-connector (PRE-sync — the sync wedges without it)
  • [20] step-20-reconcile-ox-admin-password — roll OX master-admin hash forward across master rotations
  • [21] step-21-ensure-scim-token — per-customer SCIM bearer-token Secret (no-op unless scim.enabled)
  • [22] step-19-nextcloud-init — create fs_config_store schema if NC pods crash-loop
  • [23] step-20-restart-nextcloud — rollout restart to fix trusted-domain race
  • [24] step-24-openproject-migrate — run pending OpenProject migrations via one-off Job (starved-Sync-hook fix)
  • [25] step-25-invalidate-keycloak-bootstrap — delete keycloak-bootstrap Jobs when realm-config inputs changed

ArgoCD sync:

  • [26] step-21-wait-for-argocd-sync — poll every 30s, max 90 min; fatal on timeout
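
The poll-until-synced behaviour of this step (30s interval, 90 min ceiling, fatal on timeout) can be sketched as a small loop. This is a hedged illustration rather than the deployer's Go code: wait_for_sync, SyncTimeout, and the is_synced callback are hypothetical names, and how sync status is actually queried is not shown here. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


class SyncTimeout(RuntimeError):
    """Raised when the sync does not complete within the budget (fatal)."""


def wait_for_sync(
    is_synced: Callable[[], bool],
    interval_s: float = 30.0,
    timeout_s: float = 90 * 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_synced() every interval_s seconds; return the number of polls.

    Raises SyncTimeout once timeout_s seconds have elapsed without success.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        polls += 1
        if is_synced():
            return polls
        if clock() >= deadline:
            raise SyncTimeout(f"not synced after {timeout_s:.0f}s ({polls} polls)")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated clock: sleeping just advances a counter.
    now = [0.0]
    answers = iter([False, False, True])
    polls = wait_for_sync(
        lambda: next(answers),
        clock=lambda: now[0],
        sleep=lambda s: now.__setitem__(0, now[0] + s),
    )
    print(f"synced after {polls} polls, {now[0]:.0f}s simulated")
```

Using a monotonic clock for the deadline avoids surprises if the wall clock is adjusted during a long wait.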

Post-fixes (need a synced cluster; run BEFORE health validation since 2026-06-12):

  • [27] step-16-ensure-2fa-browser-flow — Keycloak safety net: copy built-in browser flow into 2fa-browser if empty/missing
  • [28] step-23-configure-idp-federation — verify sso-federation-idp IdP entry; (when set) kcadm-patch UMS LDAP usernameAttribute
  • [29] step-29-surface-wire-saml-metadata — surface the Wire SAML IdP artifacts (gated by wireSSO.enabled)
  • [30] step-25-register-matrix-accounts — register UVS + neodatefix-bot Synapse accounts
  • [31] step-26-store-keycloak-credentials — write Keycloak + Administrator creds to STACKIT SM

Health checks:

  • [32] step-22-validate-deployment-health — OX DB, Nextcloud domain, OpenProject bootstrap, Notes/Impress schema; moved after the post-fixes (2026-06-12)

Finalisation:

  • [33] step-27-ensure-keycloak-admin-ingress — manages the keycloak-admin Ingress (gated by keycloak.exposeAdminConsole) AND the opendesk-force-login Ingress (created when externalIdp.enabled)
  • [34] step-28-cleanup-failed-pods — delete status.phase=Failed pods (cosmetic, non-fatal)

Source patches — applied by the Apply-Patches CI pipeline (scripts/apply-patches.py + upstream-patches/manifest.yaml + upstream-patches/lib.py; baked into the fork's stable/v1.X.Y branch, manifest-gated per version via applies_to). scripts/create-fork-branch.py is the legacy/superseded predecessor, called by no pipeline. The migration renamed/split the older patches:

  • patch-nubus-values — SPLIT into patch_nubus_data_loader (dataLoader.enabled: true) + patch_nubus_security_context (drop securityContext.enabled + duplicate seccompProfile, < v1.14.0). AWS_DEFAULT_REGION dedupe is NO longer a source patch — handled post-render by render patch 021-dedup-env-vars.
  • patch-secrets-file — now patch_sha1sum: remove | sha1sum from LDAP password derivation (cracklib rejects the hex output)
  • patch-helmfile-children — now patch_wait_for_jobs: waitForJobs: true fails when the batch/job CRD is not yet installed
  • patch-chart-verification — now patch_chart_verify: verify: true needs .prov; OCI registries don't serve them
  • patch-namespace-refs — patch_namespace_refs (same name): .Release.Namespace resolving to argocd instead of deployment ns (deployer-side resolveNamespacePlaceholder is the runtime backstop)
  • patch-haproxy-rewrite-target — NOT in upstream-patches/; HAProxy ingress rewrite-target is now handled by a Go stage in modules/opendesk/internal/steps/07b_render_manifests.go
  • patch-nextcloud-php-ca — patch_nextcloud_php_ca: Nextcloud PHP custom CA bundle wiring (restored to upstream-patches/ 2026-06-23 after being dropped in the migration)
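
Per-version gating of a source patch, such as the "< v1.14.0" scope on patch_nubus_security_context, can be sketched like this. The constraint syntax accepted by the real applies_to field in upstream-patches/manifest.yaml is not documented on this page, so the single "operator version" form below is an assumption, and parse_version and applies are hypothetical helper names.

```python
import operator
import re

_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'v1.14.0' or '1.14.0' into a tuple of ints for numeric comparison."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if not match:
        raise ValueError(f"unsupported version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def applies(constraint: str, version: str) -> bool:
    """Return True when version satisfies a constraint such as '< v1.14.0'."""
    match = re.fullmatch(r"\s*(<=|>=|==|<|>)\s*(\S+)\s*", constraint)
    if not match:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op, bound = match.groups()
    return _OPS[op](parse_version(version), parse_version(bound))


if __name__ == "__main__":
    for candidate in ("v1.9.0", "v1.13.2", "v1.14.0"):
        print(candidate, applies("< v1.14.0", candidate))
```

Comparing integer tuples rather than strings matters here: as text, "v1.9.0" sorts after "v1.14.0", which would gate the patch incorrectly.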

Removed earlier as redundant (pages kept as historical records): patch-ox-core-mw-values (runtime patches 004/009), patch-post-renderers (now source patch patch_post_renderers; the deployer's neutralizePostRenderers is the clone-step backstop), patch-intercom-redis-username (runtime patches 018/019).

Render-time fixes — applied as declarative YAML in the shared runtime-patches/rendered-patches/*.yaml (per-patch version scoping via appliesTo)

  • fix-subpath-casing — subpath: → subPath: (SSA rejects lowercase)
  • fix-spurious-port — remove spurious port: in containerPort blocks
  • fix-probe-enabled — remove enabled: from probes (ox-connector chart bug)
  • fix-empty-protocol — protocol: (empty) → protocol: TCP (SSA merge-key conflict)
  • fix-nginx-mountpath — fix matrix widget chart mountPath / ConfigMap key placement
  • fix-ox-initconfigdb-root-pwd — fix initconfigdb root password and --skip-ssl-verify-server-cert
  • fix-mysql-root-pwd-injection — inject MYSQL_ROOT_PASSWD into OX bootstrap Job
  • fix-bootstrap-ssl-certs — inject self-signed CA into Keycloak/OpenProject/Nextcloud bootstrap Jobs
  • fix-hook-deletion-policy — HookSucceeded → BeforeHookCreation,HookSucceeded on Jobs (SSA immutability)
  • fix-helm-hook-to-argo-hook — inject argocd.argoproj.io/hook: PostSync for helm.sh/hook-only Jobs
  • fix-jgroups-timeout — Keycloak JGroups sock_conn_timeout=1000 for fast stale-member detection
  • fix-resource-dedup — deduplicate resource documents (some helmfile child-charts render twice)
  • fix-secrets-to-eso — convert plaintext Secrets to ExternalSecret CRs (when SM configured)
  • fix-openproject-bootstrap-backoff — raise OpenProject bootstrap backoffLimit from 6 to 20
  • fix-inventory-guard-orphan-sidecars — guard against orphan Synapse sidecars without homeserver
  • fix-ums-probes — inject probes into ums-udm-rest-api Deployment
  • fix-ox-appsuite-ingress — (legacy <2.28) OX appsuite-api ingress regex
  • fix-intercom-redis-username — username Secret stringData + REDIS_USER env
  • fix-synapse-federation-whitelist — Synapse federation_domain_whitelist
  • fix-dedup-env-vars — workload env-var dedup
  • fix-jobs-remove-hook-ttl — strip ttlSecondsAfterFinished on hook Jobs
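
Three of the fixes above (fix-subpath-casing, fix-empty-protocol, fix-dedup-env-vars) amount to small transformations on a container spec, sketched below on plain dictionaries. This is not the apply.py primitives DSL, whose syntax is not shown on this page; the function names are hypothetical, and keeping the last occurrence of a duplicated env var is an assumption about which entry should win.

```python
def fix_subpath_casing(container: dict) -> dict:
    """Rename the misspelled volumeMounts key 'subpath' to 'subPath'."""
    for mount in container.get("volumeMounts", []):
        if "subpath" in mount:
            mount["subPath"] = mount.pop("subpath")
    return container


def fix_empty_protocol(container: dict) -> dict:
    """Replace an empty or null port protocol with 'TCP'."""
    for port in container.get("ports", []):
        if "protocol" in port and not port["protocol"]:
            port["protocol"] = "TCP"
    return container


def dedup_env(container: dict) -> dict:
    """Drop duplicate env names, keeping only the last occurrence of each."""
    if "env" not in container:
        return container
    last_index = {var["name"]: idx for idx, var in enumerate(container["env"])}
    container["env"] = [
        var for idx, var in enumerate(container["env"])
        if last_index[var["name"]] == idx
    ]
    return container


def apply_fixes(container: dict) -> dict:
    """Run the three sketches in sequence; the container is modified in place."""
    for fix in (fix_subpath_casing, fix_empty_protocol, dedup_env):
        fix(container)
    return container


if __name__ == "__main__":
    spec = {
        "name": "app",
        "ports": [{"containerPort": 80, "protocol": ""}],
        "env": [
            {"name": "AWS_DEFAULT_REGION", "value": "a"},
            {"name": "AWS_DEFAULT_REGION", "value": "b"},
        ],
        "volumeMounts": [{"name": "cfg", "mountPath": "/etc/app", "subpath": "app.conf"}],
    }
    print(apply_fixes(spec))
```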

Runbooks

  • runbook-local-dev-stack — build/run/stop/status the 5-service dev stack locally with start.sh (start/stop/status commands + --container-based / --background / --git-token / --reset-admin)
  • runbook-force-sync-application — force-sync an ArgoCD Application
  • runbook-hard-refresh-application — hard-refresh (clear cache) an Application
  • runbook-reset-checkpoint — HISTORICAL (checkpointing removed); now points to the Lease/--force-lease escape hatch
  • runbook-list-customer-apps — list every Application for a customerId
  • runbook-debug-deployer-stream — tail the deployer's SSE event stream
  • runbook-check-argocd-sync-status — query sync/health/operationPhase for an Application
  • runbook-remove-finalizer — remove resources-finalizer.argocd.argoproj.io before delete
  • runbook-get-argocd-password — extract the ArgoCD initial admin password

Incidents

  • incident-force-sync-vs-ssa — force-sync interacting badly with Server-Side Apply
  • incident-comparisonerror-duplicate-env-vars — ComparisonError from duplicate env vars in init containers
  • incident-helm-template-no-release-name — helm template fails without --release-name; root cause
  • incident-applicationset-finalizer-wedge — ApplicationSet wedge from leftover resources-finalizer (legacy era)
  • incident-cmp-mode-removed — context for the CMP-mode removal; what cleanup remains
  • incident-sev1-cross-instance-contamination — sev1 where one instance's state leaked into another
  • incident-synapse-sidecars-without-homeserver — orphan Synapse sidecars rendered without their homeserver
  • incident-logs-menu-bug — Cockpit UI Logs page hidden when changing menu

Related sections

  • infrastructure — runtime-state.json is the input gate for the deployer; STACKIT-side recovery lives there
  • apps — the apps these steps actually deploy; per-app incidents and runbooks
  • idp — Keycloak/Nubus realm bootstrap and IdP-related incidents
  • security — concept-master-password and ESO setup are referenced by the rendering pipeline
  • config — input fields the deployer consumes
  • sizing — platformSizing is read by step-07-generate-helmfile-values

When to add a page here

  • A new deployer step, helmfile feature, or ArgoCD mechanism — concept-*, step-*
  • A new upstream-chart workaround — patch-* (fork pipeline) or fix-* (render-time YAML)
  • A new operational procedure for the deployer / ArgoCD / git-side artifacts — runbook-*
  • A deployer or ArgoCD incident with a distinct root cause — incident-*
  • A deployer-architecture decision — decision-*

App-specific runtime issues that don't change the deployer's behavior belong in apps. STACKIT/infra-side issues belong in infrastructure. IdP-specific deployment issues belong in idp (cross-link from here when the deployer code is involved).