CI/CD & Release Engineering
This page walks through the full journey of a code change — from the moment a developer opens a pull request, all the way to production. There are three main systems involved:
- GitHub Actions validates code on every pull request.
- Concourse builds release images, validates deploy manifests, and deploys environments.
- Cepler controls which environments get updated and in what order.
The design philosophy is simple: every step must pass before the next one starts, and there are human checkpoints at the most important moments. Nothing reaches production by accident.
High-Level Overview
Now let's walk through each step in detail.
Step 1: Pull Request Checks (GitHub Actions)
When a developer opens a pull request against main, GitHub Actions kicks off a suite of checks that all run in parallel. Every single one must pass before the PR can be merged — there are no exceptions.
What gets checked
| Workflow | What it does | Why it matters |
|---|---|---|
| nextest | Runs all Rust unit and integration tests via nix run .#nextest | Catches logic bugs and regressions in the backend |
| bats | Spins up the full application stack and runs BATS end-to-end tests against it | Verifies that the whole system works together, not just individual pieces |
| cypress | Runs Cypress browser tests against the admin panel | Makes sure the UI actually works; also generates screenshots for regulatory manuals |
| check-code-apps | Lints, type-checks, and builds the frontend apps | Catches TypeScript errors, lint violations, and broken builds in the frontend |
| flake-check | Runs nix flake check to validate the Nix flake | Ensures the build system itself is healthy |
| kustomize | Renders the Kubernetes deploy tree and validates it with kubeconform | Catches invalid Kubernetes manifests before release |
| pnpm-audit | Audits npm dependencies for known vulnerabilities | Blocks PRs that introduce dependencies with high-severity CVEs |
| data-pipeline | Parses the Postgres dbt project and runs data pipeline checks against public event schemas | Validates that the Rust/dbt reporting pipeline still works with schema changes |
| cocogitto | Checks that commit messages follow the conventional commits format | Needed because the version number and changelog are generated automatically from commit messages |
| spelling | Runs the typos tool to catch common misspellings | Simple but catches embarrassing typos in code and docs |
| lana-bank-docs | Builds the full documentation site (API docs, versioned docs, screenshot validation) | Catches broken doc builds, missing API descriptions, and invalid doc site configuration |
How Nix caching makes this fast
Compiling the Rust codebase from scratch takes a long time. To avoid doing that on every single PR, all the GitHub Actions workflows pull pre-built binaries from a shared Cachix binary cache called galoymoney.
Here's the pattern you'll see in every workflow file:
- uses: DeterminateSystems/nix-installer-action@v16
- uses: cachix/cachix-action@v15
  with:
    name: galoymoney
    authToken: ${{ secrets.CACHIX_AUTH_TOKEN }}
    skipPush: true
The skipPush: true part is key — GitHub Actions only reads from the cache, it never writes to it.
Most workflows also reclaim 10-20 GB of disk space at the start by removing pre-installed software (Docker images, Android SDK, etc.) that GitHub runners ship with. The large Rust compilations need that breathing room.
What happens when a check fails
If any check fails, the PR is blocked from merging. The developer fixes the issue, pushes again, and the checks re-run. There's no way to bypass a failing check.
Step 2: Building a Release (Concourse, lana-bank repo)
Once a PR is merged into main, the Concourse release pipeline takes over. This pipeline lives in the ci/release/ directory of the lana-bank repo and is written using YTT templates.
The pipeline has a clear dependency chain, covered stage by stage in sections 2a through 2f below.
2a. Re-running the tests on main
You might wonder: we already ran tests in GitHub Actions on the PR, so why run them again? Because the PR was tested against a potentially stale version of main. Between the time the PR was opened and the time it was merged, other PRs may have landed. Running tests again on the actual merged commit catches integration issues that only appear when multiple changes combine.
Three jobs run in parallel:
- test-integration runs cargo nextest, the same Rust test suite from the PR checks.
- test-bats runs the BATS end-to-end tests (with up to 2 attempts, since E2E tests can be flaky).
- flake-check validates the Nix flake.
All three must pass before anything gets built.
2b. Building the Release Candidate (build-rc)
Once tests pass, the pipeline builds a release candidate (RC). The idea behind RCs is that you can build and test multiple candidates before committing to a final release. Here's what happens:
- Figure out the version number. A Nix script called next-version uses cocogitto to scan the conventional commit messages since the last release and determine the next semantic version. For example, if the last release was 0.41.0 and there's been a feat: commit, the next version becomes 0.42.0-rc.1. If another RC was already built for this version, it increments to rc.2, rc.3, and so on.
- Set the release version. The pipeline exports RELEASE_BUILD_VERSION=0.42.0-rc.1 before building the Nix image. The Rust binary embeds this version in its build metadata and exposes it through the admin GraphQL buildInfo query.
- Compile the Rust binary and admin panel assets. nix build --impure .#lana-bank-image builds the statically-linked lana-cli binary and the Vite admin panel assets, then packages both into the lana-bank image.
- Build one Docker image and push it to Google Artifact Registry (gcr.io/galoyorg): lana-bank, the server image built by Nix, containing the statically-linked binary, dbt runtime, report project, and packaged admin panel assets.
- Tag the image. The image gets both an edge tag (meaning "latest RC") and a version-specific tag like 0.42.0-rc.1.
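The RC numbering from the first step can be sketched as follows. This is a minimal illustration, not the actual next-version script: the function name next_rc_version and the idea of passing the existing tags in as a list are assumptions made for the example.

```python
import re


def next_rc_version(target: str, existing_tags: list[str]) -> str:
    """Return the next release-candidate tag for `target` (e.g. "0.42.0").

    Hypothetical helper: the real next-version script is not shown on this page.
    """
    # Only tags of the form "<target>-rc.<n>" count towards the RC counter.
    pattern = re.compile(rf"^{re.escape(target)}-rc\.(\d+)$")
    taken = [int(m.group(1)) for tag in existing_tags if (m := pattern.match(tag))]
    return f"{target}-rc.{max(taken, default=0) + 1}"
```

With no earlier RC for the target version the counter starts at rc.1; RC tags that belong to other versions are ignored.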
2c. Updating the deploy bundle (bump-image-in-deploy-rc)
After an RC image is built, the pipeline updates the Kubernetes deploy bundle in this repository:
- It reads the immutable SHA256 digest for the newly built image. Digests are used instead of mutable tags for deployed environments.
- It updates the images entries in deploy/**/kustomization.yaml files that pin released images.
- It writes deploy metadata files: deploy/VERSION, deploy/is-release, and deploy/METADATA.
- It commits the change back to lana-bank with a chore(deploy): bump lana-bank image ... commit.
deploy/METADATA is the breadcrumb that links a running image back to the exact source commit and version that produced it. galoy-deployments vendors this deploy bundle directly from lana-bank, so no intermediate packaging repository is involved.
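Pinning by digest can be sketched as below. This is an illustration only: the helper name pinned_image_entry and the use of lana-bank as the entry name are assumptions, while name and digest are standard fields of a Kustomize images entry.

```python
import re

# A digest is "sha256:" followed by 64 lowercase hex characters.
SHA256_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_image_entry(name: str, digest: str) -> dict[str, str]:
    """Build a Kustomize `images` entry that pins `name` to an immutable digest.

    Hypothetical helper; the pipeline's own update logic is not shown here.
    """
    if not SHA256_DIGEST.match(digest):
        # Reject mutable references such as "latest" or "edge".
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return {"name": name, "digest": digest}
```

Rejecting anything that is not a full digest keeps a mutable tag from slipping into a deployed overlay.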
2d. Testflight: a throwaway deployment
The RC deploy bundle is validated with a testflight on the staging GKE cluster. The testflight uses the same OpenTofu modules and Kustomize components that real environments use, but with temporary instance names.
Here's what happens during a testflight:
- Prepare a testflight bundle. The pipeline copies deploy/, tf/modules/lana-instance, and tf/modules/keycloak-realms into a temporary testflight workspace, then generates two instance overlays.
- OpenTofu creates prerequisites. It creates namespaces, secrets, databases, Keycloak realms, storage resources, and other prerequisites for the temporary instances.
- Kustomize applies workloads. The pipeline runs kubectl apply -k for each generated overlay and waits for the lana-bank-server Deployment to roll out.
- Smoketests run against the deployed services.
- Cleanup runs regardless of result. The pipeline deletes the Kustomize resources and destroys the OpenTofu-managed prerequisites.
If the smoketest fails, the pipeline stops and someone needs to investigate what went wrong.
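The "cleanup runs regardless of result" behaviour maps onto a try/finally shape. The sketch below is a model of the control flow only; run_testflight and its four callables are hypothetical stand-ins for the real Concourse tasks.

```python
from typing import Callable


def run_testflight(
    provision: Callable[[], None],
    apply_overlays: Callable[[], None],
    smoketest: Callable[[], None],
    cleanup: Callable[[], None],
) -> None:
    """Run the testflight steps in order and always clean up afterwards."""
    try:
        provision()       # OpenTofu prerequisites (stand-in)
        apply_overlays()  # kubectl apply -k per overlay (stand-in)
        smoketest()       # raises on failure
    finally:
        # Runs on success and on failure; a failure still propagates
        # to the caller so the pipeline stops.
        cleanup()
```

A failing smoketest therefore still tears down the temporary instances, and the error surfaces afterwards for someone to investigate.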
2e. Opening the Promote-RC PR (open-promote-rc-pr)
After the RC image is built and testflight has a candidate to validate, the pipeline automatically opens a pull request back in the lana-bank repo. This PR does a few things:
- It generates a CHANGELOG entry using git-cliff, which reads the conventional commit messages and groups them into categories (features, bug fixes, etc.).
- It regenerates API documentation and event schemas, and creates a versioned snapshot of the docs site.
- It pushes everything to a branch called bot-promote-rc and opens a draft PR labeled promote-rc.
This PR is the human gate in the pipeline. An engineer reviews the changelog to make sure it looks right, checks that the RC looks good in any ad-hoc testing, and then merges the PR when they're ready to cut a release. Nothing happens automatically from here — the release only proceeds when a human says "go."
There's also a safety check: the promote-rc-file-check GitHub Action verifies that this PR only touches CHANGELOG.md and docs-site/** files. If the bot accidentally included other changes, the check fails and blocks the merge.
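The rule that the file check enforces is small enough to sketch. This is not the workflow's actual implementation; disallowed_paths is a hypothetical name, and only the two allowed locations come from the description above.

```python
def disallowed_paths(changed: list[str]) -> list[str]:
    """Return the changed paths a promote-rc PR is not allowed to touch."""
    return [
        path
        for path in changed
        # Allowed: the changelog itself and anything under docs-site/.
        if path != "CHANGELOG.md" and not path.startswith("docs-site/")
    ]
```

An empty result means the PR is clean; any returned path would fail the check and block the merge.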
2f. Cutting the Final Release (release)
When someone merges the promote-rc PR, the release job triggers. It does four things:
- Builds the final Docker image. This is the same image as the RC, but now tagged with the clean version number (e.g., 0.42.0) and also latest. The lana-bank image is built by Nix (dockerTools.buildImage), producing a deterministic image.
- Creates a GitHub Release. This includes the lana-cli binary as a downloadable artifact and the changelog as the release notes. The release is tagged with the version number.
- Updates the version counter. The pipeline stores the current version in a dedicated git branch called version (just a text file with the version number in it). This gets bumped so the next RC starts from the right base.
- Updates the deploy bundle again. bump-image-in-deploy pins deploy/ to the final release image digest and records release=true in deploy/METADATA.
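The difference between RC and final image tags can be summarized in a few lines. The function image_tags is a hypothetical helper written for this page; the tag names themselves (edge, latest, and the version tag) are the ones described above.

```python
def image_tags(version: str) -> list[str]:
    """Tags applied to the lana-bank image for an RC or a final version."""
    if "-rc." in version:
        # Release candidates: moving "edge" tag plus the exact RC version.
        return ["edge", version]
    # Final releases: clean version number plus "latest".
    return [version, "latest"]
```

Both tags in each pair are mutable pointers or version labels, which is why deployed environments pin the image by digest instead.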
How version numbers work
Versions follow Semantic Versioning and are derived automatically from conventional commit messages using cocogitto:
- feat: commits produce a minor bump (e.g., 0.41.0 -> 0.42.0)
- fix: commits produce a patch bump (e.g., 0.42.0 -> 0.42.1)
- feat!: or BREAKING CHANGE produce a major bump (e.g., 0.42.0 -> 1.0.0)
This is why the cocogitto GitHub Action enforces conventional commit format on every PR — if commit messages don't follow the convention, the version can't be computed automatically.
The current version is stored in a git branch called version as a plain text file. It's managed by the Concourse semver resource.
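The bump rules above can be expressed as a short function. This is a simplified model of the rules as stated on this page, not cocogitto's implementation, whose handling of edge cases may differ; bump_kind and next_version are hypothetical names.

```python
import re
from typing import Optional

# Conventional-commit header: type, optional (scope), optional "!", then ": ".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_kind(messages: list[str]) -> Optional[str]:
    """Return "major", "minor", "patch", or None for a list of commit messages."""
    kind = None
    for msg in messages:
        m = HEADER.match(msg)
        if m is None:
            continue  # not a conventional commit; ignored in this sketch
        if m.group("bang") or "BREAKING CHANGE" in msg:
            return "major"
        if m.group("type") == "feat":
            kind = "minor"
        elif m.group("type") == "fix" and kind is None:
            kind = "patch"
    return kind


def next_version(current: str, messages: list[str]) -> str:
    """Apply the strongest bump found in `messages` to `current` (X.Y.Z)."""
    major, minor, patch = (int(part) for part in current.split("."))
    kind = bump_kind(messages)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

The strongest bump wins: a batch containing both fix: and feat: commits yields a minor bump, and commits of other types leave the version unchanged.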
Step 3: Vendoring the Deploy Bundle into galoy-deployments
The release pipeline pushes the updated lana-bank git ref into galoy-deployments. galoy-deployments then uses Vendir to copy the exact deployment inputs it needs from that lana-bank commit:
- deploy/: Kustomize base, components, overlays, and deploy metadata
- tf/modules/lana-instance: OpenTofu module for per-instance prerequisites
- tf/modules/keycloak-realms: OpenTofu module for Keycloak realms and clients
- tf/modules/postgres: local/test PostgreSQL module
- tf/honeycomb: observability configuration used by the monitoring deployment
The config looks like this:
# vendir.yml (simplified)
directories:
  - path: modules/lana-bank/vendor
    contents:
      - path: lana-bank/deploy
        git:
          url: git@github.com:GaloyMoney/lana-bank.git
          ref: <lana-bank-git-ref>
        includePaths:
          - deploy/*
          - deploy/**/*
        newRootPath: deploy
      - path: lana-instance
        git:
          url: git@github.com:GaloyMoney/lana-bank.git
          ref: <lana-bank-git-ref>
        includePaths:
          - tf/modules/lana-instance/**/*
        newRootPath: tf/modules/lana-instance
This keeps galoy-deployments self-contained at deploy time while preserving a direct trace back to the lana-bank source commit.
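To see where a vendored file ends up, the combined effect of includePaths and newRootPath can be approximated as a path mapping. The sketch below is a simplification that treats the include patterns as "everything under the new root"; vendored_path is a hypothetical helper and the file names in the usage note are illustrative.

```python
from typing import Optional


def vendored_path(source_path: str, new_root: str, dest: str) -> Optional[str]:
    """Map a path in the source repo to its location after vendoring.

    Returns None when the file lies outside `new_root` and is not copied.
    """
    prefix = new_root.rstrip("/") + "/"
    if not source_path.startswith(prefix):
        return None
    # newRootPath strips the prefix; the remainder lands under `dest`.
    return f"{dest.rstrip('/')}/{source_path[len(prefix):]}"
```

For example, deploy/base/kustomization.yaml with new root deploy and destination modules/lana-bank/vendor/lana-bank/deploy maps to modules/lana-bank/vendor/lana-bank/deploy/base/kustomization.yaml.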
Step 4: Environment Deployment (galoy-deployments + Cepler)
The galoy-deployments repository is where the rubber meets the road. It contains OpenTofu configurations for every environment (staging, QA, production), vendored lana-bank deploy inputs, environment-specific Kustomize overlays, and the Cepler configuration that controls environment progression.
What is Cepler and why do we need it?
Cepler is a deployment promotion tool. The problem it solves is straightforward: when you have multiple environments (staging, QA, production), you don't want a change to reach production until it's been validated in the earlier environments first.
Cepler tracks which files have changed and which environments have successfully deployed those changes. It enforces rules like "QA can only deploy changes that have already succeeded in staging." This prevents the classic mistake of accidentally deploying untested code to production.
Cepler has a few core concepts:
- Deployment: A named unit of work (e.g., lana-bank). Each deployment has its own configuration file and its own set of environments.
- Environment: A deploy target like gcp-galoy-staging or gcp-volcano-qa. Each environment defines which file patterns it watches for changes.
- latest: A list of glob patterns. When files matching these patterns change, Cepler considers this environment "out of date" and triggers a deployment.
- passed: The name of another environment that must have successfully deployed the same changes first. This is how you create a promotion chain (staging -> QA -> production).
- propagated: Files that should be inherited from the passed environment rather than tracked independently. This is how shared module code flows from staging to QA without QA needing to independently track those files.
- State files: Cepler keeps state files in the .cepler/ directory that record exactly which commit and file versions have been deployed to each environment.
Cepler configuration in practice
Here's a simplified version of the lana-bank Cepler config:
# cepler/lana-bank.yml
deployment: lana-bank
environments:
  gcp-galoy-staging:
    latest:
      - modules/lana-bank-gcp-pg/**
      - modules/lana-bank-postgres-mcp/**
      - modules/lana-bank/config/**
      - modules/lana-bank/vendor/**
      - modules/infra/vendor/tf/postgresql/**
      - gcp/galoy-staging/lana-bank/*
  gcp-volcano-qa:
    passed: gcp-galoy-staging
    propagated:
      - modules/lana-bank-gcp-pg/**
      - modules/lana-bank-postgres-mcp/**
      - modules/lana-bank/config/**
      - modules/lana-bank/vendor/**
      - modules/infra/vendor/tf/postgresql/**
    latest:
      - gcp/volcano-qa/lana-bank/*
Reading this from top to bottom:
- Staging (gcp-galoy-staging) watches the vendored lana-bank deploy inputs, the GCP PostgreSQL module, the Postgres MCP module, and its own environment-specific config. Whenever any of those files change, staging gets a new deployment.
- QA (gcp-volcano-qa) has passed: gcp-galoy-staging, which means it will only deploy changes that have already been successfully deployed to staging. The propagated section lists the shared modules; Cepler inherits the staging-tested versions of these files rather than tracking them independently. QA also watches its own environment-specific config in latest, so changes to QA-only settings deploy immediately without waiting for staging.
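How the latest patterns decide which environment reacts to a change can be approximated with Python's fnmatch. This is only a rough model: fnmatch lets * cross directory separators, so it does not reproduce Cepler's own glob matching exactly, and the triggers helper and the shortened pattern lists are assumptions for the example.

```python
from fnmatch import fnmatch

# Shortened copies of the `latest` lists from the config above.
STAGING_LATEST = [
    "modules/lana-bank/vendor/**",
    "gcp/galoy-staging/lana-bank/*",
]
QA_LATEST = ["gcp/volcano-qa/lana-bank/*"]


def triggers(changed_files: list[str], patterns: list[str]) -> bool:
    """True when any changed file matches one of the environment's patterns."""
    return any(fnmatch(path, pat) for path in changed_files for pat in patterns)
```

A change under modules/lana-bank/vendor/ matches staging's latest list but not QA's, so QA only sees it once it has propagated from staging, while a change to gcp/volcano-qa/lana-bank/main.tf matches QA's own list directly.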
How Cepler works with Concourse
The galoy-deployments Concourse pipeline uses two custom resource types to integrate with Cepler:
- cepler-in is a Concourse resource that periodically checks the cepler-gates git branch. When it detects that there are pending changes for a given environment (based on the rules in the cepler config), it triggers a deployment job.
- The deployment job runs OpenTofu first. OpenTofu provisions prerequisites such as databases, database roles, Keycloak realms, storage resources, Kubernetes namespaces, and Kubernetes Secrets.
- The deployment job then applies each environment overlay with kubectl apply -k. One-shot bootstrap/simulation Jobs are deleted before apply so they can run again when the overlay includes them.
- The job waits for Kustomize-managed bootstrap and simulation Jobs when those components are enabled.
- cepler-out is called after a successful deployment. It updates the cepler state file, recording that this environment is now at the new version. This is what unblocks downstream environments: when staging's state is updated, Cepler knows that QA can now proceed.
If a deployment fails, the state is not updated, and downstream environments remain blocked. This is the safety net that prevents broken code from cascading through environments.
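The way state updates unblock downstream environments can be modelled in a few lines. This is a conceptual sketch, not Cepler's data model: the PromotionState class, its methods, and the use of a single version string per environment are all assumptions.

```python
from typing import Optional


class PromotionState:
    """Tracks the last successfully deployed version per environment."""

    def __init__(self) -> None:
        self.deployed: dict[str, str] = {}

    def eligible(self, version: str, passed: Optional[str]) -> bool:
        """May `version` deploy to an environment whose upstream is `passed`?"""
        # With no upstream environment, any new version may deploy.
        return passed is None or self.deployed.get(passed) == version

    def record(self, env: str, version: str, succeeded: bool) -> None:
        """Mirror cepler-out: state only advances after a successful deploy."""
        if succeeded:
            self.deployed[env] = version
```

A failed staging deploy leaves the recorded state untouched, so a QA environment with passed: gcp-galoy-staging stays ineligible until staging succeeds on that same version.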
Repository structure
galoy-deployments/
├── modules/
│ ├── lana-bank/
│ │ ├── config/ # Shared bootstrap/module config JSON
│ │ └── vendor/ # Vendored deploy + tf modules from lana-bank
│ │ ├── lana-bank/deploy/ # Kustomize base/components/metadata
│ │ ├── lana-instance/ # OpenTofu instance prerequisite module
│ │ ├── keycloak-realms/ # OpenTofu Keycloak realm module
│ │ └── postgres/ # Local/test PostgreSQL module
│ ├── lana-bank-gcp-pg/ # Provisions shared PostgreSQL instances on GCP
│ └── lana-bank-postgres-mcp/ # Optional Postgres MCP Deployment
├── gcp/
│ ├── galoy-staging/
│ │ ├── shared/ # GCP project config shared by all modules
│ │ └── lana-bank/
│ │ ├── main.tf # Staging prerequisite wiring
│ │ └── deploy/overlays/ # Staging Kustomize overlays
│ └── volcano-qa/
│ └── lana-bank/
│ ├── main.tf # QA prerequisite wiring
│ └── deploy/overlays/ # QA Kustomize overlays
├── cepler/
│ ├── lana-bank.yml # Environment progression rules
│ └── .cepler/lana-bank/ # State files (one per environment)
└── vendir.yml # Vendir config for lana-bank deploy inputs
How environment overrides work
The vendored deploy/ tree provides common Kubernetes resources:
- deploy/base contains always-on resources such as the lana-bank-server Deployment and Services.
- deploy/components contains optional features such as ingress, workload identity, bootstrap Jobs, and simulation Jobs.
- deploy/overlays/validation/all is a CI-only overlay that renders every component for schema validation.
Real environment overlays live in galoy-deployments. They compose the shared base and components, then patch environment-specific values such as hosts, Keycloak realm names, storage settings, tracing service names, and image digests.
OpenTofu and Kustomize intentionally split responsibilities:
- OpenTofu owns long-lived prerequisites and secrets.
- Kustomize owns workload manifests and one-shot operational Jobs.
Step 5: Production Promotion
Production deployments follow the same Cepler-driven pattern as staging and QA, but with an extra layer of human oversight.
- Changes must pass staging first. The cepler config's passed: field ensures that any change headed for production has already been successfully deployed to staging (and potentially QA). If staging is broken, production won't even be attempted.
- The cepler-gates branch adds a manual gate. Even after staging succeeds, production doesn't deploy automatically. The galoy-deployments repo has a special git branch called cepler-gates that contains promotion controls. Cepler checks this branch to determine whether a production deployment is "allowed."
- A human approves the promotion. To release to production, an engineer updates the cepler-gates branch to indicate that the current staging version is approved for production. This is an explicit, auditable action.
- Cepler detects the gate is open and the Concourse pipeline deploys to production using OpenTofu and Kustomize, just like it does for staging. After a successful deployment, the cepler state is updated.
This design means that you always know exactly what's running in production, and you can always trace it back through QA, staging, the vendored lana-bank deploy metadata, the Docker image digest, and the GitHub Release.
Putting It All Together
Here's the complete journey one more time, but now you should understand what's happening at each step and why:
- Developer opens a PR. GitHub Actions runs 10+ parallel checks (tests, lint, security scans), using the shared Cachix binary cache when derivations are available.
- PR is merged to main. Concourse re-runs the tests against the actual merged commit to catch integration issues. On success, it builds a release candidate image tagged with an RC version.
- The deploy bundle is updated and testflight runs. The pipeline pins deploy/ to the RC image digest, provisions throwaway prerequisites, applies generated Kustomize overlays, runs smoketests, and cleans up.
- The pipeline opens a promote-rc PR. This PR contains the generated CHANGELOG and updated docs. An engineer reviews it and merges when they're ready to release. This is the first human checkpoint.
- The release job runs. It builds the final Docker image, creates a GitHub Release, and pins deploy/ to the final image digest.
- galoy-deployments vendors the lana-bank deploy inputs. Vendir copies deploy/ and the OpenTofu modules from the exact lana-bank commit.
- Cepler picks up the change in galoy-deployments. It deploys to staging first. Only after staging succeeds does QA become eligible.
- An engineer approves the production gate. They update the cepler-gates branch, Cepler detects the change, and Concourse deploys to production. This is the second human checkpoint.
At any point, you can trace what's running in an environment all the way back to the source commit — through the cepler state, the vendir config, deploy/METADATA, the Docker image digest, and the GitHub Release.
Quick Reference
| Tool | What it does | Where it's configured |
|---|---|---|
| GitHub Actions | Runs PR validation checks | .github/workflows/ in lana-bank |
| Concourse | Builds releases, runs testflight, deploys to environments | ci/ in lana-bank and galoy-deployments |
| Cachix | Stores pre-built Nix binaries (galoymoney cache) | GitHub Actions workflows and CI jobs configured with Cachix |
| YTT | Templates Concourse pipeline YAML | ci/release/ and other Concourse pipeline configs in lana-bank |
| Cocogitto | Computes the next version from conventional commits | cog.toml in lana-bank |
| git-cliff | Generates the CHANGELOG from conventional commits | ci/config/git-cliff.toml in lana-bank |
| Vendir | Vendors lana-bank deploy inputs into galoy-deployments | vendir.yml in galoy-deployments |
| Kustomize | Composes and patches Kubernetes workload manifests | deploy/ in lana-bank and environment overlays in galoy-deployments |
| Cepler | Controls environment promotion (staging -> QA -> production) | cepler/*.yml in galoy-deployments |
| OpenTofu | Provisions databases, secrets, identity, storage, and other prerequisites | tf/ in lana-bank and modules/ in galoy-deployments |