Skip to content

restart: fail loudly when managed Dolt does not come back up#5

Open
austinborn wants to merge 2 commits into
mainfrom
fix-gc-restart-dolt-healthcheck
Open

restart: fail loudly when managed Dolt does not come back up#5
austinborn wants to merge 2 commits into
mainfrom
fix-gc-restart-dolt-healthcheck

Conversation

@austinborn

Copy link
Copy Markdown

Summary

  • After gc restart, the supervisor can report a city as Running while managed Dolt is still unreachable, because the post-start health probe in prepareCityForSupervisor is marked non-fatal. With Dolt down, every bd-backed alerting path is silently blinded — agents can't nudge each other and the operator only notices the outage by realizing nothing is happening.
  • Add a post-restart Dolt healthcheck in cmdRestartJSON. It polls healthBeadsProvider against the city until managed Dolt is queryable or a configurable budget expires; on timeout, the command exits non-zero with a clear error naming the cause and pointing at recovery.
  • The check is a no-op for cities where gc does not own the Dolt lifecycle: file provider, postgres backend, external Dolt.

Behavior

  • Default budget: 30 seconds. Override per-invocation with the GC_RESTART_DOLT_HEALTH_TIMEOUT env var (any time.ParseDuration value, e.g. 45s, 2m). Invalid / zero / negative values fall back to the default.
  • On success (Dolt healthy): unchanged — gc restart continues to exit 0 and (with --json) emits the same lifecycle JSON.
  • On timeout (Dolt unreachable): exits 1, writes a message to stderr that names the timeout, the last probe error, the recovery command (gc start <city>), and the env var to extend the budget.
  • No-op cases: file-provider cities, postgres-backend cities, and externally-managed Dolt skip the probe entirely.

Test plan

  • go test ./cmd/gc/ -run 'TestVerifyDoltHealthyAfterRestart|TestRestartDoltHealthTimeoutFromEnv|TestCmdRestartJSON_HealthcheckFailureExitsNonZero' — all pass.
  • go test ./cmd/gc/ -run 'TestDoRigRestart' — existing rig-restart tests still pass.
  • go test ./cmd/gc/... — the only failures (TestRunStartDriftCheck_RestartReturnsContinue, TestDoStartJSONAlreadyRunningSupervisorKeepsStdoutJSONOnly) pre-exist on origin/main and are macOS-only (the tests assume Linux /proc/<pid>/exe). Confirmed by running the same selector on a fresh origin/main worktree.
  • Reviewer: confirm on a Linux machine that the env-var override works (GC_RESTART_DOLT_HEALTH_TIMEOUT=10s gc restart).
  • Reviewer: confirm the no-op branch covers your external-Dolt / postgres city configurations (no probe should fire, no behavior change).

Generated by the operator's software factory.
• City: `factory-main` · Agent: `local-core.builder-1`
• On behalf of: @austinborn

`gc restart` previously returned success even when the supervisor's
post-start health probe for managed Dolt failed. That probe is marked
non-fatal in prepareCityForSupervisor, so a "Running" city can sit with
the bead-store backend unreachable. Every bd-backed alerting path then
silently fails: agents can't nudge each other, escalation beads can't
be created, and the operator's only signal is "nothing is working."

Add a post-restart Dolt healthcheck in cmdRestartJSON. After the start
step reports success, poll healthBeadsProvider against the city until
managed Dolt is queryable or a configurable budget (default 30s; env
override GC_RESTART_DOLT_HEALTH_TIMEOUT) expires. On timeout, write a
clear error naming the cause and pointing at recovery (`gc start
<city>`), and exit non-zero. The check is a no-op for cities where gc
does not own the Dolt lifecycle (file provider, postgres backend,
external Dolt).

Tests cover: no-op for file provider, no-op for external Dolt,
success on first probe, success after retry, timeout produces a loud
error message naming cause and recovery, env-var parsing handles
invalid/zero/negative durations, and an integration view through
cmdRestartJSON that confirms the command surfaces the failure to
stderr and exits non-zero.

Generated by the operator's software factory.
City: factory-main · Agent: local-core.builder-1
On behalf of: @austinborn
Co-Authored-By: <operator-factory-bot> <factory-bot@<operator-domain>.invalid>
Two golangci-lint findings on the previous commit:
- restart_dolt_health.go: stderr parameter was unused (revive
  unused-parameter). Use it for a one-line "Verifying managed Dolt is
  healthy (budget %s)..." message on entry so operators watching
  `gc restart` see progress before the full budget elapses.
- restart_dolt_health_test.go: %v in fmt.Errorf where the formatted
  value is an error (errorlint non-wrapping). Switch to %w; the
  produced message string is unchanged.

The production verifyDoltHealthyAfterRestart timeout error already
uses %w for the probe error in the same commit's change.

Generated by the operator's software factory.
City: factory-main · Agent: local-core.builder-1
On behalf of: @austinborn
Co-Authored-By: <operator-factory-bot> <factory-bot@<operator-domain>.invalid>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant