Backup & Restore

openDesk's enterprise backup feature uses two engines: K8up + Restic for the Restic-backed sources (PostgreSQL, MariaDB, LDAP, Cassandra, K8s Secrets, namespace YAML, STACKIT Secrets Manager, PVC snapshots, control-plane forensic snapshot) and rclone + crypt for application S3 buckets. It is end-to-end encrypted, opt-in, per-instance, and multi-tenant-safe.

Audience: operators implementing or auditing the backup story for one or many openDesk clusters.

State of this topic (2026-06-03)

Active and fully wired, with two independent cluster verifications:

  • 2026-05-08 — instance ta8ce612e / cluster od-timi / namespace playground-timi. 37 distinct snapshots produced.
  • 2026-06-03 — instance tbcd30494 / cluster od-timy-3 / namespace playground-timi-3-c-timy. 33 distinct snapshots produced; a per-snapshot content audit verified the bytes for each source. This cluster has a different namespace shape (the stage is embedded in the namespace), which showed that the dynamic-discovery contract works across cluster layouts.

The original 3,500-line spec (enterprise-backup-s3-postgresql.md) has been fully decomposed into the page list below. Anyone working on backup should start from this README or MOC; the spec file is being archived.

Pages

Concepts (design)

  • concept-backup-architecture — architecture overview, principles, scheduling
  • concept-source-methods — per-source backup methods (one section per source)
  • concept-restic-vs-rclone — why two engines, disjoint source sets
  • concept-bucket-and-encryption — bucket structure + full encryption matrix
  • concept-per-tier-retention — per-source 30-day defaults, all tunable
  • concept-two-tier-credentials — backup-credentials.json (file) ↔ K8s Secrets
  • concept-multi-tenant-discovery — dynamic per-instance discovery from runtime-state.json
  • concept-monitoring-alerting — health checks + the snapshot-size-not-just-CR-status pattern
  • concept-resources-portability — compute/network/storage budget + provider migration
  • concept-security-hardening — TLS audit, secretKeyRef migration, verify-ca, checklist
  • concept-limitations-and-alternatives — known limitations + considered + rejected alternatives
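
The per-source retention idea from concept-per-tier-retention (a 30-day default that every source can override) can be sketched as below. This is a minimal illustration, not the renderer's actual configuration: the source labels, the `resolve_retention` function, and the override mapping are all hypothetical names chosen for the example.

```python
from typing import Dict, Mapping, Optional

# Default taken from the page list above: 30 days per source.
DEFAULT_RETENTION_DAYS = 30

# Hypothetical labels for the backup sources; the real keys may differ.
SOURCES = (
    "postgresql",
    "mariadb",
    "ldap",
    "cassandra",
    "k8s-secrets",
    "namespace-yaml",
    "secrets-manager",
    "pvc-snapshots",
    "s3-buckets",
)


def resolve_retention(overrides: Optional[Mapping[str, int]] = None) -> Dict[str, int]:
    """Return retention in days for every source, applying operator overrides."""
    overrides = dict(overrides or {})

    # Reject typos early: an override for an unknown source would otherwise
    # be silently ignored and the default would apply.
    unknown = sorted(set(overrides) - set(SOURCES))
    if unknown:
        raise ValueError(f"unknown backup source(s): {', '.join(unknown)}")

    for source, days in overrides.items():
        if days < 1:
            raise ValueError(f"retention for {source} must be at least 1 day")

    return {source: overrides.get(source, DEFAULT_RETENTION_DAYS) for source in SOURCES}


if __name__ == "__main__":
    # Keep PostgreSQL dumps for 90 days, everything else at the default.
    for source, days in resolve_retention({"postgresql": 90}).items():
        print(f"{source}: {days}d")
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled source name fails loudly instead of quietly falling back to the default.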

Component

  • component-backup-service — port 8094, 9-step pipeline, plan_enumerator

Runbooks

  • runbook-deploy-backup — first-time deploy + snapshot-size content verification
  • runbook-destroy-backup — destroy semantics + bucket rotation
  • runbook-rotate-bucket — manual key/bucket rotation paths
  • runbook-restore-preparation — what to keep offline + restore inputs matrix + manual restore flow
  • runbook-reimplement-from-scratch — fork-master phase order with all gotchas pre-fixed

Fixes / gotchas

  • fix-kubeconfig-anchor — anchor InstanceDir before LoadRuntimeState
  • fix-k8up-crds-not-bundled — kubectl apply CRDs first
  • fix-container-image-pins — image pins + alpine/k8s kubectl path quirk
  • fix-backup-command-wrapper — set +e + exit 0 pattern (trade-off: silent 0-byte snapshots)
  • fix-unique-container-names — silent data loss if not unique per source
  • fix-keycloak-extensions-0bytes — the one legitimate 0-byte snapshot (schema-ownership gap)
  • fix-rwo-multiattach — scratch PVC auto-annotation
  • fix-destroy-three-layer-contract — 5 layers, 3 cryptic errors if one is missing
  • fix-multi-namespace-footprint — K8up Schedules are namespace-scoped
  • fix-busybox-date-in-alpine — date -d "@<epoch>" only in rclone CronJobs
  • fix-pod-executor-sa-quirk — K8up auto-creates pod-executor only in target namespaces
  • fix-skip-if-running-lock — ConfigMap-backed lock-guard for rclone CronJobs
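
Several of the fixes above (fix-backup-command-wrapper, fix-unique-container-names, fix-keycloak-extensions-0bytes) point at the same check: a snapshot can exist and still be empty. The sketch below shows one way to express that check, assuming snapshot sizes have already been collected into a mapping of source name to bytes. The function name, the input shape, and the source labels are hypothetical and not taken from the backup service.

```python
from typing import Dict, Iterable, List

# Sources that are allowed to produce an empty snapshot. The label is a
# placeholder for the case described in fix-keycloak-extensions-0bytes.
ALLOWED_EMPTY = frozenset({"keycloak-extensions"})


def find_suspicious_snapshots(
    sizes: Dict[str, int],
    allowed_empty: Iterable[str] = ALLOWED_EMPTY,
) -> List[str]:
    """Return the sources whose snapshot is empty and not allow-listed."""
    allowed = set(allowed_empty)
    return sorted(
        source
        for source, size in sizes.items()
        if size <= 0 and source not in allowed
    )


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from a real cluster.
    observed = {
        "postgresql": 48_213_004,
        "ldap": 0,
        "keycloak-extensions": 0,
    }
    suspicious = find_suspicious_snapshots(observed)
    if suspicious:
        print("empty snapshots need investigation:", ", ".join(suspicious))
```

Keeping the allow-list explicit means a new 0-byte snapshot is treated as a failure until someone documents why it is legitimate.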

Incidents

  • incident-2026-06-03-silent-content-bugs — LDAP + SM dump silent-content audit + canonical "verify snapshot SIZE" lesson

Related topics

  • infrastructure — provides the runtime-state.json the backup deployer reads for DB/SM/S3 discovery
  • security — STACKIT Secrets Manager dump is the identity-continuity layer the backup captures
  • deployment — the deployer's cfg.Namespace is what the backup renderer uses as the application namespace
  • monitoring — backupMissedMax alert threshold lives in monitoring/Alertmanager config

When to add a page here

  • A backup-related incident occurs (incident-*)
  • A new backup source is added (concept-* + renderer update)
  • A new fix is applied to the renderer or a runtime patch (fix-*)
  • A new operator runbook is exercised — restore drills, bucket migration, provider migration (runbook-*)