Docs / Operate
Operations
Upgrades, backups, agent fleet, troubleshooting.
The day-to-day mechanics of running Galley in production. Most of this is “do what you’d do for any compose project” with a few specifics around the master key and the agent fleet.
Upgrades
Releases ship as versioned Docker tags (galleysh/server:v1.4.0). The compose file pulls a major-version tag (v1) by default; switch to a pinned version if you want to control upgrades explicitly.
cd /opt/galley
docker compose pull
docker compose up -d
docker compose logs -f galley-server
The server runs migrations on boot. Migrations are forward-only in v1 — there’s no automatic rollback. Pin a backed-up Postgres if you need to revert.
Agents reconnect automatically when the server comes back. In-flight builds running on agents are unaffected by a server restart; events queue on the message bus and replay when the server returns.
For a planned upgrade window:
- Pause:
docker compose stop galley-server(the agent keeps running, but webhooks 502 — GitHub retries with backoff). - Pull:
docker compose pull galley-server galley-agent. - Start:
docker compose up -d.
Backups
Three things to back up, in order of importance:
- The master key. If you lose this, secrets are unrecoverable. Treat it like a TLS private key — multiple offline copies, separate from the database.
- Postgres. Use whatever you already use (
pg_dump, WAL archiving, provider snapshots). All control-plane state lives here: projects, environments, deployments, audit log, encrypted secrets. - TLS volume. The
galley-tlsvolume holds the ACME account + issued certs. Losing it means the next boot re-issues from scratch (a few minutes of self-signed) — annoying but recoverable.
Restore drill: bring up a fresh host, pull the compose, restore Postgres, set the master key, start. Test once a quarter.
What you don’t need to back up: agent data dirs (worktrees + caches — disposable), the message bus volume (in-flight events only).
Adding an agent
The fast path:
- Dashboard → Admin → Agents → New agent → name it, copy the bootstrap token.
- On the new host, follow Agents — systemd unit or compose service.
- The agent registers with the bootstrap, exchanges for a long-lived credential, and shows up online in the dashboard within a heartbeat (~30s).
Removing an agent: drain it (mark offline manually), stop the service, delete the agent record. Containers tagged galley.managed=true on that host can be removed with docker rm -f $(docker ps -aq --filter label=galley.managed=true) once you’re sure no other agent is on the same host.
Rotating secrets
| What | How |
|---|---|
| GitHub App private key | Regenerate in GitHub → paste into Admin → Instance → GitHub App → save. |
| GitHub webhook secret | Settings → Git connection → Rotate webhook secret → update on the App’s webhook config. |
| Project bypass token | Settings → Preview access → Rotate. |
| Project basic auth password | Settings → Preview access → Save (with a new password). |
| Agent bootstrap token | Delete + re-add the agent in the dashboard. |
| Master key | See the master key. |
Audit log
Every admin action and auth event lands in audit_log, queryable at Admin → Audit log. Filter by actor, action, or time range. The list is exportable as CSV or JSON lines.
Things that get audited:
- Logins (success + failure).
- Project create / update / archive / delete.
- Project member add / remove / role change.
- Env var read / write.
- Bypass token rotate / revoke.
- Agent register / revoke.
- API token create / revoke.
- Preview access mode + credentials change.
The log is append-only at the schema level. Old rows age out per the configured retention (default: forever).
Resource limits
Per-project knobs in the dashboard:
- Default TTL — how long a preview lives without a new commit before auto-teardown. Default 72h.
- Build forked PRs — off by default; flips to on at your risk.
Per-agent knobs:
GALLEY_MAX_PARALLEL_BUILDS— concurrent builds per agent. Default 4.GALLEY_DEFAULT_CPUS/GALLEY_DEFAULT_MEMORY— defaults for services that don’t pin resources.GALLEY_DATA_DIRretention — log chunks beyond 7 days get swept; webhook deliveries beyond 30 days too.
Troubleshooting
Webhook deliveries failing
Dashboard → Admin → GitHub → Deliveries. Each row shows the result and any error. Common causes:
rejected (signature)— webhook secret mismatch. Update the secret on the App side or rotate it in Galley.rejected (parse)— payload doesn’t decode as a known event type. Surfaces when GitHub adds new optional fields; the server falls back to “ignore”.acceptedbut no env appears — the project isn’t connected to that repo, or the connection’s installation ID has drifted (re-install the App).
Agent shows offline
- Heartbeat hasn’t landed in 60s.
- Check
journalctl -u galley-agent(systemd) ordocker logs galley-agent. - Most failures: control-plane URL unreachable, bootstrap token already exchanged, clock skew (TLS).
Build hangs
- The agent’s data dir is full. Clear
/var/lib/galley/builds/(or whateverGALLEY_DATA_DIRpoints at). - A previous build’s container leaked.
docker ps -a --filter label=galley.managed=trueand clean up. - Image registry is rate-limiting. Watch
docker logs galley-agentfor429 Too Many Requests.
Cert issuance pending
DNS-01 challenges take 60-120s with most registrars. Past a few minutes:
- Check the proxy’s logs (
docker compose logs traefik) for ACME errors. - Wrong DNS provider env vars — typo in
CLOUDFLARE_DNS_API_TOKENis the most common. - Token scope too narrow — Cloudflare needs Zone:Read + DNS:Edit on the specific zone.
Preview shows the wrong service
Two routable services in the same env split the wildcard at named subdomains: web.<env> and api.<env>. Hitting the bare <env> resolves to one of them deterministically (web wins if there is one, else api). If you expected to see the API at the bare domain, either rename your web service or hit the API at api.<env>.
Lost contact with control plane mid-deploy
Agents finish their current build locally; cancels are advisory. The reconciler on the server marks deployments failed with agent_lost_contact after the agent’s heartbeat times out. When the agent reconnects, the failed deployment stays failed (already terminal) — you re-trigger from the dashboard or push another commit.
Getting help
- Email utibeabasiumanah6@gmail.com for bugs, install help, or anything that needs eyes on it.
- Security disclosures use the same address — see the Security page.
- For paying customers (when the hosted version exists), email or in-product chat.