restart: fail loudly when managed Dolt does not come back up #5
Open
austinborn wants to merge 2 commits into
`gc restart` previously returned success even when the supervisor's post-start health probe for managed Dolt failed. That probe is marked non-fatal in prepareCityForSupervisor, so a "Running" city can sit with the bead-store backend unreachable. Every bd-backed alerting path then silently fails: agents can't nudge each other, escalation beads can't be created, and the operator's only signal is "nothing is working."

Add a post-restart Dolt healthcheck in cmdRestartJSON. After the start step reports success, poll healthBeadsProvider against the city until managed Dolt is queryable or a configurable budget (default 30s; env override GC_RESTART_DOLT_HEALTH_TIMEOUT) expires. On timeout, write a clear error naming the cause and pointing at recovery (`gc start <city>`), and exit non-zero. The check is a no-op for cities where gc does not own the Dolt lifecycle (file provider, postgres backend, external Dolt).

Tests cover: no-op for file provider, no-op for external Dolt, success on first probe, success after retry, timeout produces a loud error message naming cause and recovery, env-var parsing handles invalid/zero/negative durations, and an integration view through cmdRestartJSON that confirms the command surfaces the failure to stderr and exits non-zero.

Generated by the operator's software factory.
City: factory-main · Agent: local-core.builder-1
On behalf of: @austinborn
Co-Authored-By: <operator-factory-bot> <factory-bot@<operator-domain>.invalid>
Two golangci-lint findings on the previous commit:

- restart_dolt_health.go: stderr parameter was unused (revive unused-parameter). Use it for a one-line "Verifying managed Dolt is healthy (budget %s)..." message on entry so operators watching `gc restart` see progress before the full budget elapses.
- restart_dolt_health_test.go: %v in fmt.Errorf where the formatted value is an error (errorlint non-wrapping). Switch to %w; the produced message string is unchanged. The production verifyDoltHealthyAfterRestart timeout error already uses %w for the probe error in the same commit's change.

Generated by the operator's software factory.
City: factory-main · Agent: local-core.builder-1
On behalf of: @austinborn
Co-Authored-By: <operator-factory-bot> <factory-bot@<operator-domain>.invalid>
Summary
- After `gc restart`, the supervisor can report a city as Running while managed Dolt is still unreachable, because the post-start health probe in `prepareCityForSupervisor` is marked non-fatal. With Dolt down, every bd-backed alerting path is silently blinded: agents can't nudge each other, and the operator only notices the outage by realizing nothing is happening.
- This change adds a post-restart Dolt healthcheck in `cmdRestartJSON`. It polls `healthBeadsProvider` against the city until managed Dolt is queryable or a configurable budget expires; on timeout, the command exits non-zero with a clear error naming the cause and pointing at recovery.

Behavior
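- The budget defaults to 30s and can be overridden with the `GC_RESTART_DOLT_HEALTH_TIMEOUT` env var (any `time.ParseDuration` value, e.g. `45s`, `2m`). Invalid / zero / negative values fall back to the default.
- When managed Dolt comes back healthy, `gc restart` continues to exit 0 and (with `--json`) emits the same lifecycle JSON.
- On timeout, the error names the cause, the recovery command (`gc start <city>`), and the env var to extend the budget.

Test plan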
- `go test ./cmd/gc/ -run 'TestVerifyDoltHealthyAfterRestart|TestRestartDoltHealthTimeoutFromEnv|TestCmdRestartJSON_HealthcheckFailureExitsNonZero'`: all pass.
- `go test ./cmd/gc/ -run 'TestDoRigRestart'`: existing rig-restart tests still pass.
- `go test ./cmd/gc/...`: the only failures (`TestRunStartDriftCheck_RestartReturnsContinue`, `TestDoStartJSONAlreadyRunningSupervisorKeepsStdoutJSONOnly`) pre-exist on `origin/main` and are macOS-only (the tests assume Linux `/proc/<pid>/exe`). Confirmed by running the same selector on a fresh `origin/main` worktree.
- [...] (`GC_RESTART_DOLT_HEALTH_TIMEOUT=10s gc restart`).