Operations

The day-to-day mechanics of running Galley in production. Most of this is “do what you’d do for any compose project” with a few specifics around the master key and the agent fleet.

Upgrades

Releases ship as versioned Docker tags (ghcr.io/utibeabasi6/galley-server:1.4.0). The compose file pulls the major-version tag (1) by default; switch to a pinned version (1.4.0) if you want to control upgrades explicitly.
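If you would rather pin without editing the shipped compose file, one option is a Compose override file. The sketch below assumes the stack uses Compose's default file names (so docker-compose.override.yml is merged automatically) and that the service is named galley-server, as in the log command in this section. pin_galley_server is a hypothetical helper, not part of Galley, and it overwrites any existing override file.

```shell
# Hypothetical helper: write a Compose override that pins galley-server to one tag.
# Assumption: the stack directory uses Compose's default file names, so the
# override file is picked up without an explicit -f flag.
pin_galley_server() {
  version="$1"              # e.g. 1.4.0
  dir="${2:-/opt/galley}"   # stack directory used in the examples on this page
  cat > "$dir/docker-compose.override.yml" <<EOF
services:
  galley-server:
    image: ghcr.io/utibeabasi6/galley-server:$version
EOF
}
```

After pin_galley_server 1.4.0, the usual docker compose pull and docker compose up -d pick up the pinned tag; deleting the override returns the stack to the major-version tag.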

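cd /opt/galley
docker compose pull                    # fetch newer images for the tags in the compose file
docker compose up -d                   # recreate containers whose image changed
docker compose logs -f galley-server   # watch migrations run and the server come up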

The server runs migrations on boot. Migrations are forward-only in v1, so there’s no automatic rollback. To revert, restore Postgres from a backup taken before the upgrade and pin the previous image tag.
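Because migrations cannot be rolled back, a dump taken immediately before the upgrade is the practical undo. The following is a minimal sketch, assuming Postgres runs as a service in the same compose project. The service name postgres and the user and database name galley are placeholders rather than documented Galley values, and backup_before_upgrade is a hypothetical helper.

```shell
# ASSUMPTIONS (not from the Galley docs): compose service "postgres",
# database user and database name "galley". Override them via the environment.
PG_SERVICE="${PG_SERVICE:-postgres}"
PG_USER="${PG_USER:-galley}"
PG_DB="${PG_DB:-galley}"

# Dump the database to a timestamped file and print the file's path.
backup_before_upgrade() {
  out_dir="${1:-/var/backups/galley}"
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  out="$out_dir/galley-$stamp.dump"
  mkdir -p "$out_dir" || return 1
  # -T disables TTY allocation so the binary dump is not mangled;
  # -Fc selects pg_dump's custom format, which pg_restore can read.
  if docker compose exec -T "$PG_SERVICE" pg_dump -U "$PG_USER" -Fc "$PG_DB" > "$out"; then
    echo "$out"
  else
    rm -f "$out"   # do not leave a truncated dump behind
    return 1
  fi
}
```

Run it from the compose project directory before docker compose pull. Reverting is then the reverse path: pg_restore the dump into an empty database and start the previous image tag.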

Agents reconnect automatically when the server comes back. In-flight builds running on agents are unaffected by a server restart; events queue on the message bus and replay when the server returns.

For a planned upgrade window:

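Pause: docker compose stop galley-server (the agent keeps running, but webhooks return 502; GitHub does not redeliver failed webhooks automatically, so redeliver anything that arrived during the window from the App’s delivery log afterwards).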
Pull: docker compose pull galley-server galley-agent.
Start: docker compose up -d.
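The three steps can be wrapped so that a failed pull does not leave the server stopped. This is a sketch only: upgrade_window is a hypothetical name, and it assumes you run it from the compose project directory.

```shell
# Planned-window sequence; never leaves the server down after a failed pull.
upgrade_window() {
  # Webhooks start returning 502 once the server is stopped.
  docker compose stop galley-server || return 1
  docker compose pull galley-server galley-agent || {
    echo "pull failed; bringing the stack back up with the images already on disk" >&2
    docker compose up -d
    return 1
  }
  docker compose up -d
}
```

A non-zero exit status means the upgrade did not complete, so check which image versions are actually running before retrying.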

Backups

Three things to back up, in order of importance:

The master key. If you lose this, secrets are unrecoverable. Treat it like a TLS private key — multiple offline copies, separate from the database.
Postgres. Use whatever you already use (pg_dump, WAL archiving, provider snapshots). All control-plane state lives here: projects, environments, deployments, audit log, encrypted secrets.
TLS volume. The galley-tls volume holds the ACME account + issued certs. Losing it means the next boot re-issues from scratch, with a few minutes of a self-signed certificate in the meantime: annoying but recoverable.

Restore drill: bring up a fresh host, pull the compose file, restore Postgres, set the master key, and start the stack. Run the drill once a quarter.

What you don’t need to back up: agent data dirs (worktrees + caches — disposable), the message bus volume (in-flight events only).

Adding an agent

The fast path:

Dashboard → Admin → Agents → New agent → name it, copy the bootstrap token.
On the new host, follow Agents — systemd unit or compose service.
The agent registers with the bootstrap token, exchanges it for a long-lived credential, and shows up online in the dashboard within one heartbeat (~30s).

Removing an agent: drain it (mark offline manually), stop the service, delete the agent record. Containers tagged galley.managed=true on that host can be removed with docker rm -f $(docker ps -aq --filter label=galley.managed=true) once you’re sure no other agent is on the same host.
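When no containers match the filter, the inner docker ps prints nothing and docker rm fails for lack of arguments, which is harmless by hand but noisy in a script. A hedged variant that tolerates the empty case is below; the label is the one from the text above, and remove_managed_containers is a hypothetical name.

```shell
# Remove Galley-managed containers on this host; a no-op when there are none.
remove_managed_containers() {
  ids=$(docker ps -aq --filter label=galley.managed=true) || return 1
  if [ -z "$ids" ]; then
    echo "no galley.managed containers found"
    return 0
  fi
  # $ids is intentionally unquoted: one container ID per word.
  docker rm -f $ids
}
```

The same caveat applies as for the one-liner: run it only once you are sure no other agent shares the host.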

Rotating secrets

What	How
GitHub App private key	Regenerate in GitHub → paste into Admin → Instance → GitHub App → save.
GitHub webhook secret	Settings → Git connection → Rotate webhook secret → update on the App’s webhook config.
Project bypass token	Settings → Preview access → Rotate.
Project basic auth password	Settings → Preview access → Save (with a new password).
Agent bootstrap token	Delete + re-add the agent in the dashboard.
Master key	See the master key.

Audit log

Every admin action and auth event lands in audit_log, queryable at Admin → Audit log. Filter by actor, action, or time range. The list is exportable as CSV or JSON lines.

Things that get audited:

Logins (success + failure).
Project create / update / archive / delete.
Project member add / remove / role change.
Env var read / write.
Bypass token rotate / revoke.
Agent register / revoke.
API token create / revoke.
Preview access mode + credentials change.

The log is append-only at the schema level. Rows age out only when a retention period is configured; the default keeps them forever.

Ubuntu unprivileged user namespaces

If your host runs Ubuntu 23.10, 24.04 LTS, or newer, you’ll see the galley-buildkit-v2 container crash-looping with:

[rootlesskit:parent] error: failed to start the child: fork/exec /proc/self/exe: permission denied

Ubuntu 23.10 changed the default of kernel.apparmor_restrict_unprivileged_userns to 1, which blocks unprofiled binaries from creating user namespaces. Rootless BuildKit needs that capability — the user namespace is what isolates PR build code from the host.

The fix is one sysctl flip:

sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0    # apply to the running kernel
echo "kernel.apparmor_restrict_unprivileged_userns=0" | sudo tee /etc/sysctl.d/60-apparmor-userns.conf    # persist across reboots
docker restart galley-buildkit-v2    # let BuildKit retry under the new setting
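To confirm the restriction is actually active on a host, you can read the sysctl through /proc. A minimal sketch follows; userns_restricted is a hypothetical helper, and its optional argument exists only so it can be pointed at a different file.

```shell
# Succeeds (exit 0) only when the AppArmor user-namespace restriction is on.
userns_restricted() {
  f="${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}"
  # On kernels without this sysctl the file is absent, so there is nothing to flip.
  [ -r "$f" ] || return 1
  [ "$(cat "$f")" = "1" ]
}
```

For example: userns_restricted && echo "restriction active". If it reports inactive while BuildKit is still crash-looping, the cause is likely something else.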

Why this is the right fix

The threat scenario the sysctl was added to mitigate (“an attacker who already has unprivileged code execution on the host parlays it into a kernel exploit via unshare”) sits a layer behind container isolation, not in front of it. Setting it back to 0 returns the host to the posture Ubuntu itself shipped before 23.10; isolation for PR build code still comes from the user namespace + seccomp profile inside the BuildKit container itself.

References:

Moby BuildKit — Running BuildKit without root privileges — upstream BuildKit docs. The “Distribution-specific hint → Ubuntu, 24.04 or later” section spells out the exact kernel.apparmor_restrict_unprivileged_userns=0 instruction, and the troubleshooting entry for fork/exec /proc/self/exe: permission denied matches the failure shape you’ll see.
Ubuntu — Restricted unprivileged user namespaces in 23.10 — Ubuntu’s own writeup of the kernel change. Worth knowing: Ubuntu’s preferred remediation is per-binary AppArmor profiles. That path doesn’t work cleanly for BuildKit’s rootlesskit because the binary lives inside a container image at a non-stable overlayfs path; the sysctl flip is the practical answer for containerized rootless runtimes.

When not to use this

If your Galley host is shared with untrusted local users, or if compliance forbids relaxing host-level kernel controls, the sysctl flip isn’t appropriate. In that case the path is:

Lock the host down so only Galley runs on it (no shells, SSH-keys-only, fail2ban).
Track the v1.1+ kaniko-only build path on the roadmap, which removes the BuildKit + user-namespace dependency entirely. The trade-off is roughly 20–40% slower warm-cache builds.

Resource limits

Per-project knobs in the dashboard:

Default TTL: how long a preview lives without a new commit before auto-teardown. Default 72h.
Build forked PRs: off by default; enable at your own risk.

Per-agent knobs:

GALLEY_MAX_PARALLEL_BUILDS: concurrent builds per agent. Default 4.
GALLEY_DEFAULT_CPUS / GALLEY_DEFAULT_MEMORY: defaults for services that don’t pin resources.
GALLEY_DATA_DIR retention: log chunks older than 7 days get swept, as do webhook deliveries older than 30 days.

Troubleshooting

Webhook deliveries failing

Dashboard → Admin → GitHub → Deliveries. Each row shows the result and any error. Common causes:

rejected (signature) — webhook secret mismatch. Update the secret on the App side or rotate it in Galley (Settings → Git connection → Rotate webhook secret).
rejected (secret unavailable) — Galley has no webhook secret stored for the connection. The git-connection picker (and the project wizard) shows a no webhook secret badge on the affected connection so you’ll see this before deliveries land. Rotate one in via the same path, then paste the same value into the GitHub repo’s webhook config.
rejected (parse) — payload doesn’t decode as a known event type. Surfaces when GitHub adds new optional fields; the server falls back to “ignore”.
accepted but no env appears — the project isn’t connected to that repo, or the connection’s installation ID has drifted (re-install the App).

Galley always requires a webhook secret on both sides. The URL pattern includes a ULID, but ULIDs aren’t unguessable, and an unsigned-but-valid webhook would let anyone who scrapes the URL fire fake pull_request opened events that provision real env containers and burn build minutes. Strict signature verification is a deliberate posture, not a missing feature.
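When chasing a rejected (signature) row, it can help to recompute the signature by hand. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result as X-Hub-Signature-256: sha256=<hex>. The sketch below reproduces that with openssl; sign_payload and verify_signature are hypothetical names, and this is a debugging aid, not Galley's verification code.

```shell
# Compute the GitHub-style signature for a saved request body.
sign_payload() {
  # $1 = webhook secret, $2 = file holding the raw request body
  printf 'sha256=%s\n' "$(openssl dgst -sha256 -hmac "$1" < "$2" | awk '{print $NF}')"
}

# Compare against the header value GitHub sent.
verify_signature() {
  # $1 = secret, $2 = body file, $3 = X-Hub-Signature-256 value as received
  # Plain string comparison is fine for debugging; a server should compare in constant time.
  [ "$(sign_payload "$1" "$2")" = "$3" ]
}
```

Save the body exactly as delivered, because any re-encoding or pretty-printing changes the digest. Passing the secret as an argument also exposes it in the process list, so prefer a throwaway shell.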

Agent shows offline

Heartbeat hasn’t landed in 60s.
Check journalctl -u galley-agent (systemd) or docker logs galley-agent.
The most common causes: control-plane URL unreachable, bootstrap token already exchanged, clock skew (breaks TLS).

Build hangs

The agent’s data dir is full. Clear /var/lib/galley/builds/ (or whatever GALLEY_DATA_DIR points at).
A previous build’s container leaked. docker ps -a --filter label=galley.managed=true and clean up.
Image registry is rate-limiting. Watch docker logs galley-agent for 429 Too Many Requests.
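The first cause is quick to rule out from a shell. The sketch below checks how full the filesystem behind a directory is; data_dir_nearly_full is a hypothetical helper, and the 90% default threshold is an arbitrary example rather than a Galley setting.

```shell
# Exit 0 when the filesystem holding $1 is at or above the threshold,
# 1 when it is below, 2 when the path cannot be inspected.
data_dir_nearly_full() {
  # $1 = directory to check, $2 = threshold percent (default 90)
  # df -P prints a header plus one data line; column 5 is capacity, e.g. "42%".
  used=$(df -P "$1" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  [ -n "$used" ] || return 2
  [ "$used" -ge "${2:-90}" ]
}
```

For example: data_dir_nearly_full /var/lib/galley && echo "data dir is nearly full". For the second cause, adding --filter status=exited to the docker ps command above narrows the list to containers that have already stopped.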

galley-buildkit-v2 is in Restarting state

Almost always Ubuntu’s unprivileged-user-namespace restriction. Symptom in docker logs galley-buildkit-v2:

[rootlesskit:parent] error: failed to start the child: fork/exec /proc/self/exe: permission denied

See Ubuntu unprivileged user namespaces above for the one-line fix and why it’s the right call.

Cert issuance pending

DNS-01 challenges take 60–120s with most DNS providers. If issuance is still pending after a few minutes:

Check the proxy’s logs (docker compose logs traefik) for ACME errors.
Wrong DNS provider env vars — typo in CLOUDFLARE_DNS_API_TOKEN is the most common.
Token scope too narrow — Cloudflare needs Zone:Read + DNS:Edit on the specific zone.
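A typo in the token can be ruled out directly against Cloudflare's token verification endpoint. This is a minimal sketch for user-owned API tokens, assuming CLOUDFLARE_DNS_API_TOKEN is exported in your shell with the same value the proxy uses; cloudflare_token_active is a hypothetical helper. It confirms the token is valid and active, not that it carries the right scopes on the right zone.

```shell
# Exit 0 when Cloudflare reports the token as active, 1 when it does not,
# 2 when the variable is unset or empty.
cloudflare_token_active() {
  [ -n "${CLOUDFLARE_DNS_API_TOKEN:-}" ] || { echo "CLOUDFLARE_DNS_API_TOKEN is empty" >&2; return 2; }
  # -f makes curl fail on HTTP errors such as 401 for an invalid token.
  curl -fsS -H "Authorization: Bearer $CLOUDFLARE_DNS_API_TOKEN" \
    https://api.cloudflare.com/client/v4/user/tokens/verify \
    | grep -q '"status" *: *"active"'
}
```

If this passes and issuance still stalls, the token's permissions or zone restriction are the next thing to check in the Cloudflare dashboard.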

Preview shows the wrong service

When the env has multiple routable services, every non-bare one gets <svc>-<env>. The service name is joined with a hyphen rather than a dot, so the hostname stays a single label and a single-level wildcard still covers it. With a web and an api service in the same env: <env> → web, api-<env> → api. With two web services (admin and frontend): admin-<env>, frontend-<env>, and the bare <env> resolves to one of them deterministically. Rename to flip which one wins, or hit each service at its own hostname.

Lost contact with control plane mid-deploy

Agents finish their current build locally; cancels are advisory. The reconciler on the server marks deployments failed with agent_lost_contact after the agent’s heartbeat times out. When the agent reconnects, the failed deployment stays failed (already terminal) — you re-trigger from the dashboard or push another commit.

Getting help

Email utibeabasiumanah6@gmail.com for bugs, install help, or anything that needs eyes on it.
Security disclosures use the same address — see the Security page.
For paying customers (when the hosted version exists), email or in-product chat.