Version: 0.61.0-rc.33

CI/CD & Release Engineering

This page walks through the full journey of a code change — from the moment a developer opens a pull request, all the way to production. There are three main systems involved:

  • GitHub Actions validates code on every pull request.
  • Concourse builds release images, validates deploy manifests, and deploys environments.
  • Cepler controls which environments get updated and in what order.

The design philosophy is simple: every step must pass before the next one starts, and there are human checkpoints at the most important moments. Nothing reaches production by accident.

High-Level Overview

Now let's walk through each step in detail.


Step 1: Pull Request Checks (GitHub Actions)

When a developer opens a pull request against main, GitHub Actions kicks off a suite of checks that all run in parallel. Every single one must pass before the PR can be merged — there are no exceptions.

What gets checked

| Workflow | What it does | Why it matters |
|---|---|---|
| nextest | Runs all Rust unit and integration tests via nix run .#nextest | Catches logic bugs and regressions in the backend |
| bats | Spins up the full application stack and runs BATS end-to-end tests against it | Verifies that the whole system works together, not just individual pieces |
| cypress | Runs Cypress browser tests against the admin panel | Makes sure the UI actually works; also generates screenshots for regulatory manuals |
| check-code-apps | Lints, type-checks, and builds the frontend apps | Catches TypeScript errors, lint violations, and broken builds in the frontend |
| flake-check | Runs nix flake check to validate the Nix flake | Ensures the build system itself is healthy |
| kustomize | Renders the Kubernetes deploy tree and validates it with kubeconform | Catches invalid Kubernetes manifests before release |
| pnpm-audit | Audits npm dependencies for known vulnerabilities | Blocks PRs that introduce dependencies with high-severity CVEs |
| data-pipeline | Parses the Postgres dbt project and runs data pipeline checks against public event schemas | Validates that the Rust/dbt reporting pipeline still works with schema changes |
| cocogitto | Checks that commit messages follow the conventional commits format | Needed because the version number and changelog are generated automatically from commit messages |
| spelling | Runs the typos tool to catch common misspellings | Simple but catches embarrassing typos in code and docs |
| lana-bank-docs | Builds the full documentation site (API docs, versioned docs, screenshot validation) | Catches broken doc builds, missing API descriptions, and invalid doc site configuration |

How Nix caching makes this fast

Compiling the Rust codebase from scratch takes a long time. To avoid doing that on every single PR, all the GitHub Actions workflows pull pre-built binaries from a shared Cachix binary cache called galoymoney.

Here's the pattern you'll see in every workflow file:

- uses: DeterminateSystems/nix-installer-action@v16
- uses: cachix/cachix-action@v15
  with:
    name: galoymoney
    authToken: ${{ secrets.CACHIX_AUTH_TOKEN }}
    skipPush: true

The skipPush: true setting is key: GitHub Actions only reads from the cache and never writes to it.

Most workflows also reclaim 10-20 GB of disk space at the start by removing pre-installed software (Docker images, Android SDK, etc.) that GitHub runners ship with. The large Rust compilations need that breathing room.

What happens when a check fails

If any check fails, the PR is blocked from merging. The developer fixes the issue, pushes again, and the checks re-run. There's no way to bypass a failing check.


Step 2: Building a Release (Concourse, lana-bank repo)

Once a PR is merged into main, the Concourse release pipeline takes over. This pipeline lives in the ci/release/ directory of the lana-bank repo and is written using YTT templates.

The pipeline has a clear dependency chain: each stage described below runs only after the stage before it has succeeded.

2a. Re-running the tests on main

You might wonder: we already ran tests in GitHub Actions on the PR, so why run them again? Because the PR was tested against a potentially stale version of main. Between the time the PR was opened and the time it was merged, other PRs may have landed. Running tests again on the actual merged commit catches integration issues that only appear when multiple changes combine.

Three jobs run in parallel:

  • test-integration runs cargo nextest — the same Rust test suite from the PR checks.
  • test-bats runs the BATS end-to-end tests (with up to 2 attempts, since E2E tests can be flaky).
  • flake-check validates the Nix flake.

All three must pass before anything gets built.

2b. Building the Release Candidate (build-rc)

Once tests pass, the pipeline builds a release candidate (RC). The idea behind RCs is that you can build and test multiple candidates before committing to a final release. Here's what happens:

  1. Figure out the version number. A Nix script called next-version uses cocogitto to scan the conventional commit messages since the last release and determine the next semantic version. For example, if the last release was 0.41.0 and there's been a feat: commit, the next version becomes 0.42.0-rc.1. If another RC was already built for this version, it increments to rc.2, rc.3, and so on.

  2. Set the release version. The pipeline exports RELEASE_BUILD_VERSION=0.42.0-rc.1 before building the Nix image. The Rust binary embeds this version in its build metadata and exposes it through the admin GraphQL buildInfo query.

  3. Compile the Rust binary and admin panel assets. nix build --impure .#lana-bank-image builds the statically-linked lana-cli binary and the Vite admin panel assets, then packages both into the lana-bank image.

  4. Build one Docker image and push it to Google Artifact Registry (gcr.io/galoyorg):

    • lana-bank: the server image built by Nix, containing the statically-linked binary, dbt runtime, report project, and packaged admin panel assets

  5. Tag the image. The image gets both an edge tag (meaning "latest RC") and a version-specific tag like 0.42.0-rc.1.
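The RC numbering from step 1 is easy to picture as code. The sketch below is not the next-version script itself; it assumes the set of existing tags is available as a list of strings, and the function name next_rc_tag is made up for illustration.

```python
import re
from typing import Iterable


def next_rc_tag(next_version: str, existing_tags: Iterable[str]) -> str:
    """Return the next RC tag for `next_version`, given the tags that already exist."""
    pattern = re.compile(rf"^{re.escape(next_version)}-rc\.(\d+)$")
    used = [int(m.group(1)) for tag in existing_tags if (m := pattern.match(tag))]
    # No RC yet for this version -> rc.1; otherwise one past the highest RC seen.
    return f"{next_version}-rc.{max(used, default=0) + 1}"


print(next_rc_tag("0.42.0", ["0.41.0", "0.41.0-rc.3"]))        # 0.42.0-rc.1
print(next_rc_tag("0.42.0", ["0.42.0-rc.1", "0.42.0-rc.2"]))   # 0.42.0-rc.3
```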

2c. Updating the deploy bundle (bump-image-in-deploy-rc)

After an RC image is built, the pipeline updates the Kubernetes deploy bundle in this repository:

  1. It reads the immutable SHA256 digest for the newly built image. Digests are used instead of mutable tags for deployed environments.
  2. It updates the images entries in deploy/**/kustomization.yaml files that pin released images.
  3. It writes deploy metadata files:
    • deploy/VERSION
    • deploy/is-release
    • deploy/METADATA
  4. It commits the change back to lana-bank with a chore(deploy): bump lana-bank image ... commit.

deploy/METADATA is the breadcrumb that links a running image back to the exact source commit and version that produced it. galoy-deployments vendors this deploy bundle directly from lana-bank, so no intermediate packaging repository is involved.
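To make the digest pinning in step 2 concrete, here is a minimal sketch that operates on an already-parsed kustomization.yaml (a plain dictionary). It assumes the standard Kustomize images fields (name, newTag, digest); the helper name, the sample entry, and the placeholder digest are illustrative and not taken from the pipeline.

```python
import copy
from typing import Any, Dict


def pin_image_digest(kustomization: Dict[str, Any], image_name: str, digest: str) -> Dict[str, Any]:
    """Return a copy of a parsed kustomization with `image_name` pinned to `digest`."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a digest of the form sha256:<hex>")
    updated = copy.deepcopy(kustomization)
    for image in updated.get("images", []):
        if image.get("name") == image_name:
            image.pop("newTag", None)  # drop the mutable tag so only the digest is recorded
            image["digest"] = digest
            return updated
    raise KeyError(f"no images entry named {image_name!r}")


# Illustrative input; the digest is a placeholder, not a real image digest.
before = {"images": [{"name": "lana-bank", "newTag": "edge"}]}
after = pin_image_digest(before, "lana-bank", "sha256:" + "0" * 64)
print(after["images"][0])
```

Returning a copy keeps the original structure untouched, which makes it easy to diff the before and after states before committing.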

2d. Testflight: a throwaway deployment

The RC deploy bundle is validated with a testflight on the staging GKE cluster. The testflight uses the same OpenTofu modules and Kustomize components that real environments use, but with temporary instance names.

Here's what happens during a testflight:

  1. Prepare a testflight bundle. The pipeline copies deploy/, tf/modules/lana-instance, and tf/modules/keycloak-realms into a temporary testflight workspace, then generates two instance overlays.
  2. OpenTofu creates prerequisites. It creates namespaces, secrets, databases, Keycloak realms, storage resources, and other prerequisites for the temporary instances.
  3. Kustomize applies workloads. The pipeline runs kubectl apply -k for each generated overlay and waits for the lana-bank-server Deployment to roll out.
  4. Smoketests run against the deployed services.
  5. Cleanup runs regardless of result. The pipeline deletes the Kustomize resources and destroys the OpenTofu-managed prerequisites.

If the smoketest fails, the pipeline stops and someone needs to investigate what went wrong.
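The "cleanup runs regardless of result" behaviour is the important part of this flow. As a rough sketch (the step callables here are hypothetical placeholders, not the real pipeline tasks), it maps naturally onto a try/finally:

```python
from typing import Callable, List


def run_testflight(steps: List[Callable[[], None]], cleanup: Callable[[], None]) -> bool:
    """Run steps in order, stop at the first failure, and always run cleanup."""
    try:
        for step in steps:
            step()
        return True
    except Exception as error:
        print(f"testflight failed: {error}")
        return False
    finally:
        cleanup()  # runs on success and on failure


log: List[str] = []


def failing_smoketest() -> None:
    raise RuntimeError("smoketest failed")


ok = run_testflight(
    steps=[
        lambda: log.append("tofu apply"),
        lambda: log.append("kubectl apply -k"),
        failing_smoketest,
    ],
    cleanup=lambda: log.append("cleanup"),
)
print(ok, log)  # False ['tofu apply', 'kubectl apply -k', 'cleanup']
```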

2e. Opening the Promote-RC PR (open-promote-rc-pr)

After the RC image is built and testflight has a candidate to validate, the pipeline automatically opens a pull request back in the lana-bank repo. This PR does a few things:

  • It generates a CHANGELOG entry using git-cliff, which reads the conventional commit messages and groups them into categories (features, bug fixes, etc.).
  • It regenerates API documentation and event schemas, and creates a versioned snapshot of the docs site.
  • It pushes everything to a branch called bot-promote-rc and opens a draft PR labeled promote-rc.

This PR is the human gate in the pipeline. An engineer reviews the changelog to make sure it looks right, checks that the RC looks good in any ad-hoc testing, and then merges the PR when they're ready to cut a release. Nothing happens automatically from here — the release only proceeds when a human says "go."

There's also a safety check: the promote-rc-file-check GitHub Action verifies that this PR only touches CHANGELOG.md and docs-site/** files. If the bot accidentally included other changes, the check fails and blocks the merge.
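The rule that check enforces is simple enough to express directly. The following is a sketch of the rule as described here, not the actual workflow code; the function name is hypothetical and the list of changed paths is assumed to come from the PR diff.

```python
from typing import Iterable, List


def disallowed_paths(changed: Iterable[str]) -> List[str]:
    """Return the changed paths that fall outside CHANGELOG.md and docs-site/."""
    return [
        path
        for path in changed
        if path != "CHANGELOG.md" and not path.startswith("docs-site/")
    ]


changed_files = ["CHANGELOG.md", "docs-site/versions.json", "core/src/lib.rs"]  # example paths
offenders = disallowed_paths(changed_files)
print(offenders)  # ['core/src/lib.rs']
```

An empty result means the PR is safe to merge as far as this check is concerned; any entry in the list would block it.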

2f. Cutting the Final Release (release)

When someone merges the promote-rc PR, the release job triggers. It does four things:

  1. Builds the final Docker image. This is the same image as the RC, but now tagged with the clean version number (e.g., 0.42.0) and also latest. The lana-bank image is built by Nix (dockerTools.buildImage), producing a deterministic image.
  2. Creates a GitHub Release. This includes the lana-cli binary as a downloadable artifact and the changelog as the release notes. The release is tagged with the version number.
  3. Updates the version counter. The pipeline stores the current version in a dedicated git branch called version (just a text file with the version number in it). This gets bumped so the next RC starts from the right base.
  4. Updates the deploy bundle again. bump-image-in-deploy pins deploy/ to the final release image digest and records release=true in deploy/METADATA.

How version numbers work

Versions follow Semantic Versioning and are derived automatically from conventional commit messages using cocogitto:

  • feat: commits produce a minor bump (e.g., 0.41.0 -> 0.42.0)
  • fix: commits produce a patch bump (e.g., 0.42.0 -> 0.42.1)
  • feat!: or BREAKING CHANGE produce a major bump (e.g., 0.42.0 -> 1.0.0)

This is why the cocogitto GitHub Action enforces conventional commit format on every PR — if commit messages don't follow the convention, the version can't be computed automatically.

The current version is stored in a git branch called version as a plain text file. It's managed by the Concourse semver resource.
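To see how the bump rules combine, here is a small Python sketch that applies the three rules listed above to a batch of commit messages. It mirrors the rules as stated on this page rather than cocogitto's implementation, so cocogitto's own behaviour and configuration remain authoritative; the function names are illustrative.

```python
import re
from typing import Iterable, Optional

# Conventional commit header: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: .+")


def bump_kind(messages: Iterable[str]) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a batch of commit messages."""
    kind = None
    for message in messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if not match:
            continue
        if match["bang"] or "BREAKING CHANGE:" in message:
            return "major"  # the strongest bump wins immediately
        if match["type"] == "feat":
            kind = "minor"
        elif match["type"] == "fix" and kind is None:
            kind = "patch"
    return kind


def next_version(current: str, messages: Iterable[str]) -> str:
    """Apply the bump implied by `messages` to a MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(part) for part in current.split("."))
    kind = bump_kind(messages)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current


print(next_version("0.41.0", ["fix: handle empty ledger", "feat: add report export"]))  # 0.42.0
```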


Step 3: Vendoring the Deploy Bundle into galoy-deployments

The release pipeline pushes the updated lana-bank git ref into galoy-deployments. galoy-deployments then uses Vendir to copy the exact deployment inputs it needs from that lana-bank commit:

  • deploy/ — Kustomize base, components, overlays, and deploy metadata
  • tf/modules/lana-instance — OpenTofu module for per-instance prerequisites
  • tf/modules/keycloak-realms — OpenTofu module for Keycloak realms and clients
  • tf/modules/postgres — local/test PostgreSQL module
  • tf/honeycomb — observability configuration used by the monitoring deployment

The config looks like this:

# vendir.yml (simplified)
directories:
- path: modules/lana-bank/vendor
  contents:
  - path: lana-bank/deploy
    git:
      url: git@github.com:GaloyMoney/lana-bank.git
      ref: <lana-bank-git-ref>
    includePaths:
    - deploy/*
    - deploy/**/*
    newRootPath: deploy
  - path: lana-instance
    git:
      url: git@github.com:GaloyMoney/lana-bank.git
      ref: <lana-bank-git-ref>
    includePaths:
    - tf/modules/lana-instance/**/*
    newRootPath: tf/modules/lana-instance

This keeps galoy-deployments self-contained at deploy time while preserving a direct trace back to the lana-bank source commit.
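If the interaction between path and newRootPath is not obvious, this sketch shows where a file from lana-bank ends up after vendoring. It models only the re-rooting step (it does not apply includePaths filtering), and the helper name is hypothetical; the directory names come from the config above.

```python
from pathlib import PurePosixPath
from typing import Optional


def vendored_path(source: str, new_root: str, destination: str) -> Optional[str]:
    """Map a path in the source repo to its vendored location, or None if it lies outside new_root."""
    try:
        relative = PurePosixPath(source).relative_to(new_root)
    except ValueError:
        return None  # not under new_root, so it is not copied by this entry
    return str(PurePosixPath(destination) / relative)


print(vendored_path("deploy/VERSION", "deploy", "modules/lana-bank/vendor/lana-bank/deploy"))
# modules/lana-bank/vendor/lana-bank/deploy/VERSION
```

In other words, newRootPath strips the leading directory so the vendored copy does not end up nested as deploy/deploy.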


Step 4: Environment Deployment (galoy-deployments + Cepler)

The galoy-deployments repository is where the rubber meets the road. It contains OpenTofu configurations for every environment (staging, QA, production), vendored lana-bank deploy inputs, environment-specific Kustomize overlays, and the Cepler configuration that controls environment progression.

What is Cepler and why do we need it?

Cepler is a deployment promotion tool. The problem it solves is straightforward: when you have multiple environments (staging, QA, production), you don't want a change to reach production until it's been validated in the earlier environments first.

Cepler tracks which files have changed and which environments have successfully deployed those changes. It enforces rules like "QA can only deploy changes that have already succeeded in staging." This prevents the classic mistake of accidentally deploying untested code to production.

Cepler has a few core concepts:

  • Deployment: A named unit of work (e.g., lana-bank). Each deployment has its own configuration file and its own set of environments.
  • Environment: A deploy target like gcp-galoy-staging or gcp-volcano-qa. Each environment defines which file patterns it watches for changes.
  • latest: A list of glob patterns. When files matching these patterns change, Cepler considers this environment "out of date" and triggers a deployment.
  • passed: The name of another environment that must have successfully deployed the same changes first. This is how you create a promotion chain (staging -> QA -> production).
  • propagated: Files that should be inherited from the passed environment rather than tracked independently. This is how shared module code flows from staging to QA without QA needing to independently track those files.
  • State files: Cepler keeps state files in the .cepler/ directory that record exactly which commit and file versions have been deployed to each environment.

Cepler configuration in practice

Here's a simplified version of the lana-bank Cepler config:

# cepler/lana-bank.yml
deployment: lana-bank
environments:
  gcp-galoy-staging:
    latest:
    - modules/lana-bank-gcp-pg/**
    - modules/lana-bank-postgres-mcp/**
    - modules/lana-bank/config/**
    - modules/lana-bank/vendor/**
    - modules/infra/vendor/tf/postgresql/**
    - gcp/galoy-staging/lana-bank/*

  gcp-volcano-qa:
    passed: gcp-galoy-staging
    propagated:
    - modules/lana-bank-gcp-pg/**
    - modules/lana-bank-postgres-mcp/**
    - modules/lana-bank/config/**
    - modules/lana-bank/vendor/**
    - modules/infra/vendor/tf/postgresql/**
    latest:
    - gcp/volcano-qa/lana-bank/*

Reading this from top to bottom:

  • Staging (gcp-galoy-staging) watches the vendored lana-bank deploy inputs, the GCP PostgreSQL module, the Postgres MCP module, and its own environment-specific config. Whenever any of those files change, staging gets a new deployment.
  • QA (gcp-volcano-qa) has passed: gcp-galoy-staging, which means it will only deploy changes that have already been successfully deployed to staging. The propagated section lists the shared modules — Cepler inherits the staging-tested versions of these files rather than tracking them independently. QA also watches its own environment-specific config in latest, so changes to QA-only settings deploy immediately without waiting for staging.
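As a rough mental model of how latest patterns decide whether an environment is out of date, the sketch below matches changed file paths against a list of globs. It assumes shell-style semantics where * stays within one directory and ** crosses directories; Cepler's actual matching may differ, and the function names are invented for this example.

```python
import re
from typing import Iterable, List


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex: '**' crosses directories, '*' does not (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def triggering_changes(latest: Iterable[str], changed: Iterable[str]) -> List[str]:
    """Return the changed paths that match at least one `latest` pattern."""
    regexes = [glob_to_regex(pattern) for pattern in latest]
    return [path for path in changed if any(rx.fullmatch(path) for rx in regexes)]


staging_latest = ["modules/lana-bank/vendor/**", "gcp/galoy-staging/lana-bank/*"]
changed = [
    "modules/lana-bank/vendor/lana-bank/deploy/VERSION",
    "gcp/volcano-qa/lana-bank/main.tf",
]
print(triggering_changes(staging_latest, changed))
# ['modules/lana-bank/vendor/lana-bank/deploy/VERSION']
```

Here a new vendored deploy bundle makes staging out of date, while a QA-only change does not touch staging at all.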

How Cepler works with Concourse

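The galoy-deployments Concourse pipeline integrates with Cepler through two custom resources, cepler-in and cepler-out, which bracket each deployment job. A deployment run goes through the following steps: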

  1. cepler-in is a Concourse resource that periodically checks the cepler-gates git branch. When it detects that there are pending changes for a given environment (based on the rules in the cepler config), it triggers a deployment job.
  2. The deployment job runs OpenTofu first. OpenTofu provisions prerequisites such as databases, database roles, Keycloak realms, storage resources, Kubernetes namespaces, and Kubernetes Secrets.
  3. The deployment job then applies each environment overlay with kubectl apply -k. One-shot bootstrap/simulation Jobs are deleted before apply so they can run again when the overlay includes them.
  4. The job waits for Kustomize-managed bootstrap and simulation Jobs when those components are enabled.
  5. cepler-out is called after a successful deployment. It updates the cepler state file, recording that this environment is now at the new version. This is what unblocks downstream environments — when staging's state is updated, Cepler knows that QA can now proceed.

If a deployment fails, the state is not updated, and downstream environments remain blocked. This is the safety net that prevents broken code from cascading through environments.
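The way state updates unblock (or keep blocking) downstream environments can be modelled in a few lines. This is a toy model of the behaviour described above, not Cepler's implementation: versions are plain strings, and the class and method names are hypothetical.

```python
from typing import Callable, Dict, Optional


class PromotionChain:
    """Toy model: an environment may deploy a version only after its upstream has."""

    def __init__(self, passed: Dict[str, Optional[str]]) -> None:
        self.passed = passed              # environment -> upstream environment (or None)
        self.state: Dict[str, str] = {}   # environment -> last successfully deployed version

    def eligible(self, env: str, version: str) -> bool:
        upstream = self.passed[env]
        return upstream is None or self.state.get(upstream) == version

    def deploy(self, env: str, version: str, run: Callable[[], bool]) -> bool:
        if not self.eligible(env, version):
            return False
        if not run():
            return False  # failure: state is untouched, downstream stays blocked
        self.state[env] = version  # success: record it, which unblocks downstream
        return True


chain = PromotionChain({"gcp-galoy-staging": None, "gcp-volcano-qa": "gcp-galoy-staging"})

chain.deploy("gcp-galoy-staging", "abc123", run=lambda: False)   # staging deploy fails
print(chain.eligible("gcp-volcano-qa", "abc123"))                # False

chain.deploy("gcp-galoy-staging", "abc123", run=lambda: True)    # staging deploy succeeds
print(chain.eligible("gcp-volcano-qa", "abc123"))                # True
```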

Repository structure

galoy-deployments/
├── modules/
│   ├── lana-bank/
│   │   ├── config/                   # Shared bootstrap/module config JSON
│   │   └── vendor/                   # Vendored deploy + tf modules from lana-bank
│   │       ├── lana-bank/deploy/     # Kustomize base/components/metadata
│   │       ├── lana-instance/        # OpenTofu instance prerequisite module
│   │       ├── keycloak-realms/      # OpenTofu Keycloak realm module
│   │       └── postgres/             # Local/test PostgreSQL module
│   ├── lana-bank-gcp-pg/             # Provisions shared PostgreSQL instances on GCP
│   └── lana-bank-postgres-mcp/       # Optional Postgres MCP Deployment
├── gcp/
│   ├── galoy-staging/
│   │   ├── shared/                   # GCP project config shared by all modules
│   │   └── lana-bank/
│   │       ├── main.tf               # Staging prerequisite wiring
│   │       └── deploy/overlays/      # Staging Kustomize overlays
│   └── volcano-qa/
│       └── lana-bank/
│           ├── main.tf               # QA prerequisite wiring
│           └── deploy/overlays/      # QA Kustomize overlays
├── cepler/
│   ├── lana-bank.yml                 # Environment progression rules
│   └── .cepler/lana-bank/            # State files (one per environment)
└── vendir.yml                        # Vendir config for lana-bank deploy inputs

How environment overrides work

The vendored deploy/ tree provides common Kubernetes resources:

  • deploy/base contains always-on resources such as the lana-bank-server Deployment and Services.
  • deploy/components contains optional features such as ingress, workload identity, bootstrap Jobs, and simulation Jobs.
  • deploy/overlays/validation/all is a CI-only overlay that renders every component for schema validation.

Real environment overlays live in galoy-deployments. They compose the shared base and components, then patch environment-specific values such as hosts, Keycloak realm names, storage settings, tracing service names, and image digests.

OpenTofu and Kustomize intentionally split responsibilities:

  • OpenTofu owns long-lived prerequisites and secrets.
  • Kustomize owns workload manifests and one-shot operational Jobs.

Step 5: Production Promotion

Production deployments follow the same Cepler-driven pattern as staging and QA, but with an extra layer of human oversight.

  1. Changes must pass staging first. The cepler config's passed: field ensures that any change headed for production has already been successfully deployed to staging (and potentially QA). If staging is broken, production won't even be attempted.
  2. The cepler-gates branch adds a manual gate. Even after staging succeeds, production doesn't deploy automatically. The galoy-deployments repo has a special git branch called cepler-gates that contains promotion controls. Cepler checks this branch to determine whether a production deployment is "allowed."
  3. A human approves the promotion. To release to production, an engineer updates the cepler-gates branch to indicate that the current staging version is approved for production. This is an explicit, auditable action.
  4. Cepler detects the gate is open and the Concourse pipeline deploys to production using OpenTofu and Kustomize, just like it does for staging. After a successful deployment, the cepler state is updated.

This design means that you always know exactly what's running in production, and you can always trace it back through QA, staging, the vendored lana-bank deploy metadata, the Docker image digest, and the GitHub Release.


Putting It All Together

Here's the complete journey one more time, but now you should understand what's happening at each step and why:

  1. Developer opens a PR. GitHub Actions runs 10+ parallel checks (tests, lint, security scans), using the shared Cachix binary cache when derivations are available.
  2. PR is merged to main. Concourse re-runs the tests against the actual merged commit to catch integration issues. On success, it builds a release candidate image tagged with an RC version.
  3. The deploy bundle is updated and testflight runs. The pipeline pins deploy/ to the RC image digest, provisions throwaway prerequisites, applies generated Kustomize overlays, runs smoketests, and cleans up.
  4. The pipeline opens a promote-rc PR. This PR contains the generated CHANGELOG and updated docs. An engineer reviews it and merges when they're ready to release. This is the first human checkpoint.
  5. The release job runs. It builds the final Docker image, creates a GitHub Release, and pins deploy/ to the final image digest.
  6. galoy-deployments vendors the lana-bank deploy inputs. Vendir copies deploy/ and the OpenTofu modules from the exact lana-bank commit.
  7. Cepler picks up the change in galoy-deployments. It deploys to staging first. Only after staging succeeds does QA become eligible.
  8. An engineer approves the production gate. They update the cepler-gates branch, Cepler detects the change, and Concourse deploys to production. This is the second human checkpoint.

At any point, you can trace what's running in an environment all the way back to the source commit — through the cepler state, the vendir config, deploy/METADATA, the Docker image digest, and the GitHub Release.


Quick Reference

| Tool | What it does | Where it's configured |
|---|---|---|
| GitHub Actions | Runs PR validation checks | .github/workflows/ in lana-bank |
| Concourse | Builds releases, runs testflight, deploys to environments | ci/ in lana-bank and galoy-deployments |
| Cachix | Stores pre-built Nix binaries (galoymoney cache) | GitHub Actions workflows and CI jobs configured with Cachix |
| YTT | Templates Concourse pipeline YAML | ci/release/ and other Concourse pipeline configs in lana-bank |
| Cocogitto | Computes the next version from conventional commits | cog.toml in lana-bank |
| git-cliff | Generates the CHANGELOG from conventional commits | ci/config/git-cliff.toml in lana-bank |
| Vendir | Vendors lana-bank deploy inputs into galoy-deployments | vendir.yml in galoy-deployments |
| Kustomize | Composes and patches Kubernetes workload manifests | deploy/ in lana-bank and environment overlays in galoy-deployments |
| Cepler | Controls environment promotion (staging -> QA -> production) | cepler/*.yml in galoy-deployments |
| OpenTofu | Provisions databases, secrets, identity, storage, and other prerequisites | tf/ in lana-bank and modules/ in galoy-deployments |