Infrastructure

Everything STACKIT-side: what the infra service provisions before the deployer runs, where its outputs land, and how to recover when external dependencies (cert-manager / ACME, Redis, controller PVCs) misbehave.

Questions this section answers:

  • "What's in runtime-state.json and who writes it?"
  • "How do I recover from a stuck ACME order?"
  • "How do I safely destroy a cluster without orphaning STACKIT resources?"
  • "Why does the STACKIT Redis UI say healthy when PING fails?"
  • "How do I roll back a partial deploy without losing the underlying infra?"

Audience: operators provisioning, recovering, or destroying STACKIT-side resources for an openDesk deployment.

What lives here

  • The infra service itself — the Go binary that drives Terraform and talks to STACKIT.
  • The runtime-state.json contract — the file that bridges infra and deployer.
  • Incidents with STACKIT services or cert-manager.
  • Runbooks for recovery, controller inspection, safe destroy, and rollback.

What does NOT live here

  • The deployer-side steps that consume runtime-state.json — see deployment.
  • The Secrets Manager usage model (master password, ESO, derivation) — see security.
  • Per-app data-store recovery — see apps.
  • The instance.yaml schema — see config.

Pages

Components

  • component-infra-service — opendesk-infra Go service on port 8091; runs Terraform, talks to STACKIT, writes runtime-state.json; 5 module modes; tier-sized HAProxy ingress (+ tune.bufsize 65536 / tune.http.maxhdr 256) and cert-manager QoS floors
  • component-mariadb-setup — standalone mariadb-setup sub-module: 7-step Galera deploy with backup/restore, invoked by the deployer

Concepts

  • concept-runtime-state — runtime-state.json schema; the infra-output gate that the deployer reads at startup; overwrite guards + atomicity
  • concept-infra-runner — the 14-step infra sequence with one-liners and retry policy
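
The "overwrite guards + atomicity" note on concept-runtime-state can be illustrated with a minimal Go sketch. This is not the infra service's actual code: the RuntimeState fields, the allowOverwrite flag, and the function names are assumptions made for illustration, and the real schema is the one documented in concept-runtime-state. The sketch writes to a temporary file in the same directory and renames it into place, so a reader such as the deployer sees either the previous file or the complete new one, never a partial write.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// RuntimeState is a hypothetical stand-in for the real schema,
// which is documented in concept-runtime-state.
type RuntimeState struct {
	ClusterName    string `json:"cluster_name"`    // assumed field
	KubeconfigPath string `json:"kubeconfig_path"` // assumed field
}

// ErrExists is returned when the overwrite guard refuses to replace a file.
var ErrExists = errors.New("runtime-state.json exists; refusing to overwrite")

// WriteRuntimeState serializes s and moves it into place with a rename.
func WriteRuntimeState(path string, s RuntimeState, allowOverwrite bool) error {
	if !allowOverwrite {
		// Guard: best effort only, a concurrent writer could still race this check.
		if _, err := os.Stat(path); err == nil {
			return ErrExists
		} else if !errors.Is(err, os.ErrNotExist) {
			return err
		}
	}
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	// The temp file lives in the target directory so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".runtime-state-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// ReadRuntimeState is what a consumer could call at startup; an error here
// would be the signal to stop before doing any work.
func ReadRuntimeState(path string) (RuntimeState, error) {
	var s RuntimeState
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

func main() {
	dir, err := os.MkdirTemp("", "infra-demo")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "runtime-state.json")

	fmt.Println("first write:", WriteRuntimeState(path, RuntimeState{ClusterName: "demo"}, false))
	fmt.Println("guarded rewrite:", WriteRuntimeState(path, RuntimeState{ClusterName: "other"}, false))
	s, err := ReadRuntimeState(path)
	fmt.Println("read back:", s.ClusterName, err)
}
```

The second write in main is refused by the guard, so the file read back still holds the first state. Whether the real service guards by file existence, by content comparison, or by something else is described in concept-runtime-state, not here.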

Reference

  • reference-storage-classes — which storage class each PVC uses; STACKIT SKE eu01 class ladder (premium-perfN-stackit, default premium-perf1-stackit); storage class is opt-in per-tier sizing (capacity is always tier-driven)

Incidents

  • incident-acme-stuck — cert-manager ACME order stuck in pending; root cause and detection
  • incident-cert-delete-unbounded — Certificate delete operation runs unbounded; safety implications
  • incident-stackit-redis-health-misleading — STACKIT Redis UI "Health" reflects last_operation, NOT instance availability — PING is source of truth
  • incident-haproxy-publish-service-crashloop — HAProxy controller crash-loops on publish-service startup; fix = drop publishService + pin chart 0.16.1
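
For incident-stackit-redis-health-misleading, "PING is source of truth" can be checked directly rather than through the UI. The following Go sketch sends a single RESP-encoded PING over plain TCP and accepts only a +PONG reply. It is an illustration under assumptions: the function names and the command-line address are hypothetical, and TLS and AUTH, which a managed instance may require, are left out. An error reply such as -NOAUTH is reported as a failure here even though it shows the server is reachable.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// CheckPingReply interprets the first reply line to a PING.
// Only the RESP simple string "+PONG" counts as success.
func CheckPingReply(line string) error {
	reply := strings.TrimSpace(line)
	if reply == "+PONG" {
		return nil
	}
	return fmt.Errorf("unexpected reply %q", reply)
}

// PingRedis sends one PING on an open connection and checks the answer.
func PingRedis(conn net.Conn, timeout time.Duration) error {
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	// RESP array containing one bulk string: PING.
	if _, err := conn.Write([]byte("*1\r\n$4\r\nPING\r\n")); err != nil {
		return fmt.Errorf("write PING: %w", err)
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("read reply: %w", err)
	}
	return CheckPingReply(line)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: redisping host:port")
		return
	}
	// The address comes from the command line; no STACKIT endpoint is assumed.
	conn, err := net.DialTimeout("tcp", os.Args[1], 3*time.Second)
	if err != nil {
		fmt.Println("unavailable:", err)
		return
	}
	defer conn.Close()
	if err := PingRedis(conn, 3*time.Second); err != nil {
		fmt.Println("unavailable:", err)
		return
	}
	fmt.Println("PONG received: instance is answering")
}
```

Keeping the reply check separate from the network code makes it easy to exercise without a live instance.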

Runbooks

  • runbook-recover-stuck-acme — delete Challenges / Order / CertificateRequest to unstick ACME (paired with step 11's auto-recovery)
  • runbook-access-controller-pvc — get into the controller's PVC for inspection
  • runbook-safe-destroy — ordered destroy procedure that doesn't orphan STACKIT resources
  • runbook-rollback-deploy — roll back a partial / failed deploy without wiping infra

Related sections

  • deployment — step-15-ensure-certificate-issued is the in-deployer auto-recovery for ACME; step-01-refresh-kubeconfig reads SKE state; step-00-preflight-external-deps validates infra preconditions
  • config — runtime-state.json schema is also covered from the config-hierarchy angle in concept-runtime-state
  • security — STACKIT Secrets Manager is provisioned here; secret-storage semantics are documented there
  • backup — STACKIT-managed backups (Postgres flexible-server, S3) are part of the infrastructure layer
  • apps — per-app data stores (Nextcloud PVC, OX MariaDB, Synapse Postgres) sit on top of the managed services provisioned here

When to add a page here

  • A new STACKIT service or Terraform module is added (component-*, concept-*)
  • A new external-dependency failure mode is discovered — managed-DB outage, network policy, DNS, ACME (incident-*)
  • A new infra recovery / destroy / rollback procedure is documented (runbook-*)
  • An infra-side decision is recorded (decision-*) — e.g., switching managed-DB providers, changing the SKE node-pool topology
  • A change to the runtime-state.json schema (add a field, deprecate one) — update concept-runtime-state and add a decision-* if non-trivial

Per-app data-store recovery (Nextcloud files, OX MariaDB schemas) belongs in apps. Anything that happens inside the cluster after the kubeconfig is fetched belongs in deployment. Anything about the credential values that Secrets Manager holds (rotation, derivation) belongs in security.