
Operations

Upgrades, backups, agent fleet, troubleshooting.

The day-to-day mechanics of running Galley in production. Most of this is “do what you’d do for any compose project” with a few specifics around the master key and the agent fleet.

Upgrades

Releases ship as versioned Docker tags (galleysh/server:v1.4.0). The compose file pulls a major-version tag (v1) by default; switch to a pinned version if you want to control upgrades explicitly.

cd /opt/galley
docker compose pull
docker compose up -d
docker compose logs -f galley-server

The server runs migrations on boot. Migrations are forward-only in v1 — there’s no automatic rollback. Take a Postgres backup before upgrading; restoring that backup is the only way to revert.
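
Since migrations are forward-only, a pre-upgrade snapshot is cheap insurance. The service, user, and database names below are assumptions — match them to your compose file:

```shell
# Dump the control-plane database before pulling the new image.
# "postgres" service and "galley" user/db are assumptions -- check your compose file.
cd /opt/galley
docker compose exec -T postgres pg_dump -U galley galley \
  | gzip > "galley-pre-upgrade-$(date +%F).sql.gz"
```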

Agents reconnect automatically when the server comes back. In-flight builds running on agents are unaffected by a server restart; events queue on the message bus and replay when the server returns.

For a planned upgrade window:

  1. Pause: docker compose stop galley-server (agents keep running, but webhooks 502 — GitHub retries with backoff).
  2. Pull: docker compose pull galley-server galley-agent.
  3. Start: docker compose up -d.

Backups

Three things to back up, in order of importance:

  1. The master key. If you lose this, secrets are unrecoverable. Treat it like a TLS private key — multiple offline copies, separate from the database.
  2. Postgres. Use whatever you already use (pg_dump, WAL archiving, provider snapshots). All control-plane state lives here: projects, environments, deployments, audit log, encrypted secrets.
  3. TLS volume. The galley-tls volume holds the ACME account + issued certs. Losing it means the next boot re-issues from scratch (a few minutes of self-signed) — annoying but recoverable.
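
A minimal nightly sketch covering items 2 and 3 (the master key should already live offline, so it isn’t copied here; service and volume names are assumptions — match them to your compose file):

```shell
#!/bin/sh
# Nightly Galley backup sketch. "postgres" service, "galley" user/db, and the
# "galley-tls" volume name are assumptions -- verify before relying on this.
set -eu
DEST="/backups/galley/$(date +%F)"
mkdir -p "$DEST"

# 2. Postgres: logical dump of all control-plane state.
docker compose -f /opt/galley/docker-compose.yml exec -T postgres \
  pg_dump -U galley galley | gzip > "$DEST/postgres.sql.gz"

# 3. TLS volume: ACME account + issued certs.
docker run --rm -v galley-tls:/tls:ro -v "$DEST":/out alpine \
  tar czf /out/galley-tls.tar.gz -C /tls .
```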

Restore drill: bring up a fresh host, pull the compose, restore Postgres, set the master key, start. Test once a quarter.

What you don’t need to back up: agent data dirs (worktrees + caches — disposable), the message bus volume (in-flight events only).

Adding an agent

The fast path:

  1. Dashboard → Admin → Agents → New agent → name it, copy the bootstrap token.
  2. On the new host, follow Agents — systemd unit or compose service.
  3. The agent registers with the bootstrap token, exchanges it for a long-lived credential, and shows up online in the dashboard within a heartbeat (~30s).

Removing an agent: drain it (mark offline manually), stop the service, delete the agent record. Containers tagged galley.managed=true on that host can be removed with docker rm -f $(docker ps -aq --filter label=galley.managed=true) once you’re sure no other agent is on the same host.

Rotating secrets

| What | How |
| --- | --- |
| GitHub App private key | Regenerate in GitHub → paste into Admin → Instance → GitHub App → save. |
| GitHub webhook secret | Settings → Git connection → Rotate webhook secret → update on the App’s webhook config. |
| Project bypass token | Settings → Preview access → Rotate. |
| Project basic auth password | Settings → Preview access → Save (with a new password). |
| Agent bootstrap token | Delete + re-add the agent in the dashboard. |
| Master key | See the master key. |

Audit log

Every admin action and auth event lands in audit_log, queryable at Admin → Audit log. Filter by actor, action, or time range. The list is exportable as CSV or JSON lines.
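
Once exported as JSON lines, the file greps cleanly. The field names and action value below (`actor`, `action`, `login.failure`) are assumptions — check a sample row from your own export first:

```shell
# Count failed logins per actor in an exported audit log (JSON lines).
# Field names and the "login.failure" action value are assumptions.
grep '"action":"login.failure"' audit.jsonl \
  | grep -o '"actor":"[^"]*"' \
  | sort | uniq -c | sort -rn
```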

Things that get audited:

  • Logins (success + failure).
  • Project create / update / archive / delete.
  • Project member add / remove / role change.
  • Env var read / write.
  • Bypass token rotate / revoke.
  • Agent register / revoke.
  • API token create / revoke.
  • Preview access mode + credentials change.

The log is append-only at the schema level. Old rows age out per the configured retention (default: forever).

Resource limits

Per-project knobs in the dashboard:

  • Default TTL — how long a preview lives without a new commit before auto-teardown. Default 72h.
  • Build forked PRs — off by default; enable at your own risk.

Per-agent knobs:

  • GALLEY_MAX_PARALLEL_BUILDS — concurrent builds per agent. Default 4.
  • GALLEY_DEFAULT_CPUS / GALLEY_DEFAULT_MEMORY — defaults for services that don’t pin resources.
  • GALLEY_DATA_DIR retention — log chunks older than 7 days are swept; webhook deliveries older than 30 days.
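
Pulled together, an agent tuned for a small host might carry an environment file like this. The values and the CPU/memory formats are illustrative assumptions — verify the exact syntax against the agent reference:

```shell
# /etc/galley/agent.env -- example tuning for a 2-CPU / 4 GB host.
# Value formats (fractional CPUs, "512m") are assumptions.
GALLEY_MAX_PARALLEL_BUILDS=2
GALLEY_DEFAULT_CPUS=0.5
GALLEY_DEFAULT_MEMORY=512m
GALLEY_DATA_DIR=/var/lib/galley
```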

Troubleshooting

Webhook deliveries failing

Dashboard → Admin → GitHub → Deliveries. Each row shows the result and any error. Common causes:

  • rejected (signature) — webhook secret mismatch. Update the secret on the App side or rotate it in Galley.
  • rejected (parse) — payload doesn’t decode as a known event type. Surfaces when GitHub adds new optional fields; the server falls back to “ignore”.
  • accepted but no env appears — the project isn’t connected to that repo, or the connection’s installation ID has drifted (re-install the App).

Agent shows offline

  • Offline means no heartbeat has landed in 60s.
  • Check journalctl -u galley-agent (systemd) or docker logs galley-agent.
  • Most failures: control-plane URL unreachable, bootstrap token already exchanged, clock skew (TLS).
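
A quick triage pass from the agent host, in the order those failures usually occur (substitute your own control-plane URL — the request only tests TCP/TLS reachability, not any particular endpoint):

```shell
# 1. Can we reach the control plane at all? (URL is a placeholder)
curl -fsSIo /dev/null -w '%{http_code}\n' https://galley.example.com/ || echo "unreachable"
# 2. Clock skew breaks TLS handshakes -- NTP should be active.
timedatectl | grep -E 'System clock|NTP'
# 3. The agent's own view of the failure.
journalctl -u galley-agent --since '10 min ago' --no-pager | tail -n 20
```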

Build hangs

  • The agent’s data dir is full. Clear /var/lib/galley/builds/ (or whatever GALLEY_DATA_DIR points at).
  • A previous build’s container leaked. docker ps -a --filter label=galley.managed=true and clean up.
  • Image registry is rate-limiting. Watch docker logs galley-agent for 429 Too Many Requests.
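
The first two causes can be checked and cleaned in a few commands — the filter only targets Galley-managed containers, and only exited ones are removed:

```shell
# Disk pressure on the data dir (adjust the path if GALLEY_DATA_DIR differs).
df -h /var/lib/galley
# List managed containers with their state before touching anything.
docker ps -a --filter label=galley.managed=true \
  --format '{{.ID}}  {{.Status}}  {{.Names}}'
# Remove only exited ones -- running containers may be live previews.
docker ps -aq --filter label=galley.managed=true --filter status=exited \
  | xargs -r docker rm
```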

Cert issuance pending

DNS-01 challenges take 60-120s to propagate with most DNS providers. If issuance is still pending past a few minutes:

  • Check the proxy’s logs (docker compose logs traefik) for ACME errors.
  • Wrong DNS provider env vars — typo in CLOUDFLARE_DNS_API_TOKEN is the most common.
  • Token scope too narrow — Cloudflare needs Zone:Read + DNS:Edit on the specific zone.
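
Beyond the proxy logs, you can check whether the challenge record ever landed — substitute your own zone for the placeholder:

```shell
# Did the TXT challenge record actually publish? (example.com is a placeholder)
dig +short TXT _acme-challenge.example.com
# Compare against an external resolver to rule out split-horizon DNS.
dig +short TXT _acme-challenge.example.com @1.1.1.1
```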

Preview shows the wrong service

Two routable services in the same env split the wildcard at named subdomains: web.<env> and api.<env>. Hitting the bare <env> resolves to one of them deterministically (web wins if there is one, else api). If you expected to see the API at the bare domain, either rename your web service or hit the API at api.<env>.

Lost contact with control plane mid-deploy

Agents finish their current build locally; cancels are advisory. The reconciler on the server marks deployments failed with agent_lost_contact after the agent’s heartbeat times out. When the agent reconnects, the failed deployment stays failed (already terminal) — you re-trigger from the dashboard or push another commit.

Getting help

  • Email utibeabasiumanah6@gmail.com for bugs, install help, or anything that needs eyes on it.
  • Security disclosures use the same address — see the Security page.
  • For paying customers (when the hosted version exists), email or in-product chat.