Operations
Upgrades, backups, agent fleet, troubleshooting.
The day-to-day mechanics of running Galley in production. Most of this is “do what you’d do for any compose project” with a few specifics around the master key and the agent fleet.
Upgrades
Releases ship as versioned Docker tags (ghcr.io/utibeabasi6/galley-server:1.4.0). The compose file pulls the major-version tag (1) by default; switch to a pinned version (1.4.0) if you want to control upgrades explicitly.
cd /opt/galley
docker compose pull
docker compose up -d
docker compose logs -f galley-server
The server runs migrations on boot. Migrations are forward-only in v1, so there is no automatic rollback. If you need to revert, restore Postgres from a backup taken before the upgrade and pin the previous image tag.
Agents reconnect automatically when the server comes back. In-flight builds running on agents are unaffected by a server restart; events queue on the message bus and replay when the server returns.
For a planned upgrade window:
- Pause: docker compose stop galley-server (the agent keeps running, but webhooks 502; GitHub retries with backoff).
- Pull: docker compose pull galley-server galley-agent.
- Start: docker compose up -d.
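The three steps above can be wrapped in a small script. This is a minimal sketch, not an official Galley tool: GALLEY_DIR, DRY_RUN, and the function names are assumptions made for illustration, while the docker compose commands are the ones listed above. With DRY_RUN=1 it only prints what it would run.

```shell
#!/bin/sh
# Sketch of the planned-upgrade steps. GALLEY_DIR, DRY_RUN, run and
# planned_upgrade are illustrative names, not part of Galley.
GALLEY_DIR="${GALLEY_DIR:-/opt/galley}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Subshell body: the cd does not leak into the caller's shell.
planned_upgrade() (
  if [ "${DRY_RUN:-0}" != "1" ]; then
    cd "$GALLEY_DIR" || exit 1
  fi
  # Stop at the first failing step so a failed pull never leads to a restart.
  run docker compose stop galley-server &&
  run docker compose pull galley-server galley-agent &&
  run docker compose up -d
)
```

Source the file, preview with `DRY_RUN=1 planned_upgrade`, then run `planned_upgrade` inside the maintenance window.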
Backups
Three things to back up, in order of importance:
- The master key. If you lose this, secrets are unrecoverable. Treat it like a TLS private key — multiple offline copies, separate from the database.
- Postgres. Use whatever you already use (pg_dump, WAL archiving, provider snapshots). All control-plane state lives here: projects, environments, deployments, audit log, encrypted secrets.
- TLS volume. The galley-tls volume holds the ACME account + issued certs. Losing it means the next boot re-issues from scratch (a few minutes of self-signed), which is annoying but recoverable.
Restore drill: bring up a fresh host, pull the compose, restore Postgres, set the master key, start. Test once a quarter.
What you don’t need to back up: agent data dirs (worktrees + caches — disposable), the message bus volume (in-flight events only).
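For the Postgres and TLS-volume items, a backup job might look like the sketch below. It rests on assumptions that are not stated in this page: a compose service named postgres, a database and user both named galley, and a backup directory of /var/backups/galley. Check your compose file for the real values. Compose may also prefix the volume name with the project name, so confirm it with `docker volume ls`. The master key is deliberately left out; it belongs in separate offline storage.

```shell
#!/bin/sh
# Backup sketch. PG_SERVICE, PG_USER, PG_DB, BACKUP_DIR and TLS_VOLUME
# defaults are assumptions; adjust them to match your compose file.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/galley}"
PG_SERVICE="${PG_SERVICE:-postgres}"
PG_USER="${PG_USER:-galley}"
PG_DB="${PG_DB:-galley}"
TLS_VOLUME="${TLS_VOLUME:-galley-tls}"

# UTC timestamp used in backup file names.
stamp() { date -u +%Y%m%dT%H%M%SZ; }

# Dump Postgres to a gzipped SQL file and print its path.
backup_postgres() {
  out="$BACKUP_DIR/postgres-$(stamp).sql"
  mkdir -p "$BACKUP_DIR" || return 1
  # Write the dump first, then compress, so a failed pg_dump is not
  # hidden behind a successful gzip.
  docker compose exec -T "$PG_SERVICE" pg_dump -U "$PG_USER" "$PG_DB" > "$out" || return 1
  gzip "$out" && echo "$out.gz"
}

# Archive the TLS volume (ACME account + certs) and print the archive path.
backup_tls_volume() {
  name="tls-$(stamp).tar.gz"
  mkdir -p "$BACKUP_DIR" || return 1
  docker run --rm -v "$TLS_VOLUME":/data:ro -v "$BACKUP_DIR":/backup \
    alpine tar czf "/backup/$name" -C /data . || return 1
  echo "$BACKUP_DIR/$name"
}
```

Run `backup_postgres` from the compose directory (for example /opt/galley) so docker compose finds the project, and copy the resulting files off the host.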
Adding an agent
The fast path:
- Dashboard → Admin → Agents → New agent → name it, copy the bootstrap token.
- On the new host, follow Agents — systemd unit or compose service.
- The agent registers with the bootstrap token, exchanges it for a long-lived credential, and shows up online in the dashboard within a heartbeat (~30s).
Removing an agent: drain it (mark offline manually), stop the service, delete the agent record. Containers tagged galley.managed=true on that host can be removed with docker rm -f $(docker ps -aq --filter label=galley.managed=true) once you’re sure no other agent is on the same host.
Rotating secrets
| What | How |
|---|---|
| GitHub App private key | Regenerate in GitHub → paste into Admin → Instance → GitHub App → save. |
| GitHub webhook secret | Settings → Git connection → Rotate webhook secret → update on the App’s webhook config. |
| Project bypass token | Settings → Preview access → Rotate. |
| Project basic auth password | Settings → Preview access → Save (with a new password). |
| Agent bootstrap token | Delete + re-add the agent in the dashboard. |
| Master key | See the master key. |
Audit log
Every admin action and auth event lands in audit_log, queryable at Admin → Audit log. Filter by actor, action, or time range. The list is exportable as CSV or JSON lines.
Things that get audited:
- Logins (success + failure).
- Project create / update / archive / delete.
- Project member add / remove / role change.
- Env var read / write.
- Bypass token rotate / revoke.
- Agent register / revoke.
- API token create / revoke.
- Preview access mode + credentials change.
The log is append-only at the schema level. Old rows age out per the configured retention (default: forever).
Ubuntu unprivileged user namespaces
If your host runs Ubuntu 23.10, 24.04 LTS, or newer, you’ll see the galley-buildkit-v2 container crash-looping with:
[rootlesskit:parent] error: failed to start the child: fork/exec /proc/self/exe: permission denied
Ubuntu 23.10 changed the default of kernel.apparmor_restrict_unprivileged_userns to 1, which blocks unprofiled binaries from creating user namespaces. Rootless BuildKit needs that capability — the user namespace is what isolates PR build code from the host.
The fix is one sysctl flip:
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
echo "kernel.apparmor_restrict_unprivileged_userns=0" | sudo tee /etc/sysctl.d/60-apparmor-userns.conf
docker restart galley-buildkit-v2
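To confirm that the restriction is actually what the host is enforcing, before or after the flip, the current value can be read straight from /proc. A minimal sketch; the function name and its optional path argument are illustrative, and the /proc path is simply the file behind the sysctl key named above.

```shell
#!/bin/sh
# Report the state of kernel.apparmor_restrict_unprivileged_userns.
# The optional argument overrides the path, which is handy for testing.
userns_status() {
  f="${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}"
  if [ ! -r "$f" ]; then
    # Kernels without this AppArmor knob do not expose the file.
    echo "absent"
    return 0
  fi
  case "$(cat "$f")" in
    0) echo "unrestricted" ;;
    1) echo "restricted" ;;
    *) echo "unknown" ;;
  esac
}
```

A result of restricted on a host where galley-buildkit-v2 is crash-looping points at this issue; absent or unrestricted means the cause is elsewhere.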
Why this is the right fix
The threat scenario the sysctl was added to mitigate (“an attacker who already has unprivileged code execution on the host parlays into a kernel CVE via unshare”) is a layer behind container isolation, not in front of it. Setting it back to 0 returns to the same posture every Linux distribution shipped before late 2023; isolation for PR build code still comes from the user namespace + seccomp profile inside the BuildKit container itself.
References:
- Moby BuildKit, “Running BuildKit without root privileges”: upstream BuildKit docs. The “Distribution-specific hint → Ubuntu, 24.04 or later” section spells out the exact kernel.apparmor_restrict_unprivileged_userns=0 instruction, and the troubleshooting entry for fork/exec /proc/self/exe: permission denied matches the failure shape you’ll see.
- Ubuntu, “Restricted unprivileged user namespaces in 23.10”: Ubuntu’s own writeup of the kernel change. Worth knowing: Ubuntu’s preferred remediation is per-binary AppArmor profiles. That path doesn’t work cleanly for BuildKit’s rootlesskit because the binary lives inside a container image at a non-stable overlayfs path; the sysctl flip is the practical answer for containerized rootless runtimes.
When not to use this
If your Galley host is shared with untrusted local users, or if compliance forbids relaxing host-level kernel controls, the sysctl flip isn’t appropriate. In that case the path is:
- Lock the host down so only Galley runs on it (no shells, SSH-keys-only, fail2ban).
- Track the v1.1+ kaniko-only build path on the roadmap, which removes the BuildKit + user-namespace dependency entirely. Trade-off there is roughly 20–40% slower warm-cache builds.
Resource limits
Per-project knobs in the dashboard:
- Default TTL: how long a preview lives without a new commit before auto-teardown. Default 72h.
- Build forked PRs: off by default; turn it on at your own risk.
Per-agent knobs:
- GALLEY_MAX_PARALLEL_BUILDS: concurrent builds per agent. Default 4.
- GALLEY_DEFAULT_CPUS / GALLEY_DEFAULT_MEMORY: defaults for services that don’t pin resources.
- GALLEY_DATA_DIR retention: log chunks beyond 7 days get swept; webhook deliveries beyond 30 days too.
Troubleshooting
Webhook deliveries failing
Dashboard → Admin → GitHub → Deliveries. Each row shows the result and any error. Common causes:
- rejected (signature): webhook secret mismatch. Update the secret on the App side or rotate it in Galley (Settings → Git connection → Rotate webhook secret).
- rejected (secret unavailable): Galley has no webhook secret stored for the connection. The git-connection picker (and the project wizard) shows a “no webhook secret” badge on the affected connection so you’ll see this before deliveries land. Rotate one in via the same path, then paste the same value into the GitHub repo’s webhook config.
- rejected (parse): payload doesn’t decode as a known event type. Surfaces when GitHub adds new optional fields; the server falls back to “ignore”.
- accepted but no env appears: the project isn’t connected to that repo, or the connection’s installation ID has drifted (re-install the App).
Galley always requires a webhook secret on both sides — the URL pattern includes a ULID, but ULIDs aren’t unguessable, and an unsigned-but-valid webhook would let anyone who scrapes the URL fire fake pull_request opened events that provision real env containers and burn build minutes. Strict signature verification is a deliberate posture, not a missing feature.
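When chasing a rejected (signature) delivery, it can help to recompute the signature locally and compare it with what GitHub sent. GitHub signs the raw request body with HMAC-SHA256 using the webhook secret and sends the result in the X-Hub-Signature-256 header as sha256=<hex>. The sketch below reproduces that scheme with openssl. The function names are illustrative, and this says nothing about how Galley implements the check internally.

```shell
#!/bin/sh
# hmac_sha256_hex SECRET < payload
# Prints the hex HMAC-SHA256 of stdin. Note: the secret is passed as a
# command-line argument, so it is visible in the process list while
# openssl runs; use this on a trusted machine only.
hmac_sha256_hex() {
  # openssl prints "...= <hex>"; keep only the last field.
  openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# github_signature SECRET < payload
# Prints the value in the format of the X-Hub-Signature-256 header.
github_signature() {
  printf 'sha256=%s\n' "$(hmac_sha256_hex "$1")"
}

# signature_matches SECRET HEADER_VALUE < payload
# Exit status 0 when the header value matches the recomputed signature.
# Plain string comparison: fine for offline debugging, whereas a server
# should use a constant-time compare.
signature_matches() {
  [ "$(github_signature "$1")" = "$2" ]
}
```

Save the raw payload of a failed delivery to a file and run `signature_matches "$SECRET" "$HEADER_VALUE" < payload.json`. The body must be byte-for-byte what GitHub sent; reformatted JSON produces a different signature. A mismatch with the secret you believe is configured means the two sides have drifted and the secret needs rotating.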
Agent shows offline
- Heartbeat hasn’t landed in 60s.
- Check journalctl -u galley-agent (systemd) or docker logs galley-agent.
- Most failures: control-plane URL unreachable, bootstrap token already exchanged, clock skew (TLS).
Build hangs
- The agent’s data dir is full. Clear /var/lib/galley/builds/ (or whatever GALLEY_DATA_DIR points at).
- A previous build’s container leaked. Run docker ps -a --filter label=galley.managed=true and clean up.
- Image registry is rate-limiting. Watch docker logs galley-agent for 429 Too Many Requests.
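The three checks above can be scripted for a quick triage pass. A minimal sketch; the function names and the one-hour log window are assumptions chosen for illustration, and the docker commands are the ones from the list.

```shell
#!/bin/sh
# Triage sketch for a hanging build. Function names are illustrative.

# disk_used_pct DIR
# Percentage in use on the filesystem holding DIR (integer, no % sign).
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# List Galley-managed containers with their status so leaked ones stand out.
managed_containers() {
  docker ps -a --filter label=galley.managed=true \
    --format '{{.ID}} {{.Names}} {{.Status}}'
}

# Count registry rate-limit responses in the last hour of agent logs.
rate_limit_hits() {
  docker logs --since 1h galley-agent 2>&1 | grep -c '429 Too Many Requests'
}
```

For example, `disk_used_pct /var/lib/galley/builds` reports how full the filesystem under the build directory is; substitute whatever GALLEY_DATA_DIR points at on your agent.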
galley-buildkit-v2 is in Restarting state
Almost always Ubuntu’s unprivileged-user-namespace restriction. Symptom in docker logs galley-buildkit-v2:
[rootlesskit:parent] error: failed to start the child: fork/exec /proc/self/exe: permission denied
See Ubuntu unprivileged user namespaces above for the one-line fix and why it’s the right call.
Cert issuance pending
DNS-01 challenges take 60-120s with most registrars. Past a few minutes:
- Check the proxy’s logs (docker compose logs traefik) for ACME errors.
- Wrong DNS provider env vars: a typo in CLOUDFLARE_DNS_API_TOKEN is the most common.
- Token scope too narrow: Cloudflare needs Zone:Read + DNS:Edit on the specific zone.
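For the Cloudflare case, a quick check of the token itself can rule out the typo before digging through ACME logs. The sketch below first looks for common copy-paste damage, then asks Cloudflare's token verification endpoint whether the token is valid. The function names are illustrative. The verify call only confirms that the token is active; it does not confirm the Zone:Read + DNS:Edit scope, which still has to be checked in the Cloudflare dashboard.

```shell
#!/bin/sh
# token_looks_sane TOKEN
# Returns 0 if the value has no obvious copy-paste damage; otherwise
# prints the reason and returns 1.
token_looks_sane() {
  if [ -z "$1" ]; then
    echo "empty"
    return 1
  fi
  stripped="$(printf '%s' "$1" | tr -d '[:space:]')"
  if [ "$stripped" != "$1" ]; then
    echo "contains whitespace"
    return 1
  fi
  case "$1" in
    \"*|*\"|\'*|*\')
      echo "wrapped in quotes"
      return 1
      ;;
  esac
  return 0
}

# Ask Cloudflare whether the token in CLOUDFLARE_DNS_API_TOKEN is valid.
# Prints the JSON response; curl -f makes HTTP errors a non-zero exit.
verify_cloudflare_token() {
  token_looks_sane "${CLOUDFLARE_DNS_API_TOKEN:-}" || return 1
  curl -fsS -H "Authorization: Bearer ${CLOUDFLARE_DNS_API_TOKEN}" \
    https://api.cloudflare.com/client/v4/user/tokens/verify
}
```

Run `verify_cloudflare_token` in a shell that has the same environment as the proxy container, so the value being tested is the one the proxy actually sees.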
Preview shows the wrong service
When the env has multiple routable services, every non-bare one gets <svc>-<env> (a single label deeper than the bare env URL — single-level wildcard friendly). With a web and an api service in the same env: <env> → web, api-<env> → api. With two web services (admin and frontend): admin-<env>, frontend-<env>, and the bare <env> resolves to one of them deterministically. Rename to flip which one wins, or hit each service at its own hostname.
Lost contact with control plane mid-deploy
Agents finish their current build locally; cancels are advisory. The reconciler on the server marks deployments failed with agent_lost_contact after the agent’s heartbeat times out. When the agent reconnects, the failed deployment stays failed (already terminal) — you re-trigger from the dashboard or push another commit.
Getting help
- Email utibeabasiumanah6@gmail.com for bugs, install help, or anything that needs eyes on it.
- Security disclosures use the same address — see the Security page.
- For paying customers (when the hosted version exists), email or in-product chat.