
fix(maintenance): dedupe + label ESCALATION mails from reaper and jsonl-export #4

Open
austinborn wants to merge 1 commit into main from maintenance-escalation-dedupe


@austinborn

Summary

  • Adds a shared bash helper examples/gastown/packs/maintenance/assets/scripts/escalation.sh that wraps gc mail send with per-(subject, body) sha256 dedupe + a configurable cooldown (default 6h, GC_ESCALATION_COOLDOWN_SECONDS env override).
  • Sources the helper from reaper.sh (anomaly escalation) and jsonl-export.sh (spike alert + push-failure escalation), so both stop flooding the mayor inbox on a persistent condition.
  • Best-effort labels the resulting ESCALATION beads with wisp_type:escalation by parsing the new bead id from gc mail send's "Sent message <id> to <to>" stdout line, so wisp-compact's 7d retention class actually applies. Untagged escalation mails were falling into the 24h default bucket.
  • Adds an occurrence-counter footer ([Suppressed N time(s) since <ts>; cooldown Xs.]) to released-after-suppression mails so the operator sees cadence, not just the latest sample.
  • Adds clear_escalation_state(state_file, [subject]) so callers can wipe dedupe entries when the underlying condition resolves. Wired into reaper.sh (no-anomaly tick) and jsonl-export.sh's record_archive_push_success.

The behavior change is conservative: when a single anomaly condition is reported on every tick (the failure mode this fixes), the helper suppresses everything but the first send per cooldown window. When the condition's body text changes (different counts, different anomaly list), the sha256 changes and the new variant escalates fresh.
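
A minimal sketch of the dedupe key and cooldown check described above. The function names `escalation_key` and `cooldown_elapsed` are hypothetical and are not taken from escalation.sh. The sketch also assumes a send is allowed once the elapsed time is greater than or equal to the cooldown.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names are illustrative, not the real helper's API.

# Hash subject + newline + body. Falls back to `shasum -a 256` where
# sha256sum is not installed (e.g. stock macOS).
escalation_key() {
  local subject=$1 body=$2
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s\n%s' "$subject" "$body" | sha256sum | cut -d' ' -f1
  else
    printf '%s\n%s' "$subject" "$body" | shasum -a 256 | cut -d' ' -f1
  fi
}

# Succeeds (exit 0) when a send is allowed: either nothing was sent before
# (last_sent is empty) or the cooldown has fully elapsed.
# $1 = last-sent epoch seconds (may be empty), $2 = current epoch seconds
cooldown_elapsed() {
  local last_sent=$1 now=$2
  local cooldown=${GC_ESCALATION_COOLDOWN_SECONDS:-21600}  # 6h default
  [ -z "$last_sent" ] || [ $((now - last_sent)) -ge "$cooldown" ]
}
```

Because the key covers the full body, a mail whose anomaly counts differ from the previous one hashes to a new key and is treated as a first send.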

Why

The mol-dog-reaper order's anomaly escalation on reaper.sh line 555 sends a fresh mail to the mayor on every 30-minute tick where $ANOMALIES is non-empty. With no per-message dedupe, an unchanging condition (e.g. the hq Dolt schema gap where the dependencies table lacks the split target columns) generates dozens of identical ESCALATION mails over a few days. The most recent observed window produced 20+ duplicate mails over 44 hours. An earlier 158-mail JSONL flood from the same gap class motivated the wisp-compact retention work that landed for jsonl-export.sh, but that fix did not extend to reaper.sh. An even earlier labelling attempt for jsonl-export.sh (branch label-jsonl-escalation-wisp-type, commit 61678a5) was never merged.

This change closes both gaps in one PR by factoring out the helper and applying it to both scripts. It supersedes the unmerged label-jsonl-escalation-wisp-type branch: the helper subsumes that branch's labelling behavior and adds dedupe on top. Reviewers can close that PR in favor of this one.

Test plan

  • go test ./examples/gastown/... -run TestReaperEscalation -v — 4 new reaper-side tests covering: dedupe within cooldown, state-clearing on condition-resolved tick, release-after-suppression footer, and bd-label-add on a successful send.
  • go test ./examples/gastown/... -run TestJsonlSpikeEscalationSuppressesRepeats -v — dedupe end-to-end through send_spike_alert in jsonl-export.sh.
  • go test ./examples/gastown/... (full package) — confirms no regressions in the existing maintenance-script tests (passed locally in 113s).
  • Manual: tail the operator's mayor inbox over one reaper tick window after merge + gc import update. The first tick with anomalies should produce one mail; subsequent ticks within 6h should be silent; the first tick after the cooldown should produce a mail with the "Suppressed N time(s)" footer.

Out of scope

  • The underlying hq schema gap that drives the anomaly text on every tick. The dedupe layer here removes the FLOOD symptom regardless; the migration that would clear the anomaly condition itself is a separate operator decision.
  • Bundle refresh and ephemeral-bead housekeeping. Those happen after this PR merges and the operator runs gc import update against the city's maintenance pack.

Generated by the operator's software factory.
• City: `factory-main` · Agent: `local-core__builder-fm-9nrnx4`
• On behalf of: @austinborn

…jsonl-export.sh

The mol-dog-reaper order (reaper.sh) sends an ESCALATION mail to the
mayor on every 30-minute tick that finds non-empty $ANOMALIES, with no
per-message-key dedupe, no cooldown, and no wisp_type label. Persistent
conditions — e.g. an hq schema gap where the dependencies table lacks
the split target columns — therefore flood the mayor inbox. The most
recent incident produced 20+ identical "ESCALATION: Reaper anomalies
detected [MEDIUM]" mails over a 44-hour window; the earlier 158-mail
JSONL-spike flood that motivated the prior wisp-compact retention work
was the same class of problem in a sibling code path that the prior
fix did not cover.

This commit factors a shared bash helper, escalation.sh, sourced by
both reaper.sh and jsonl-export.sh. The helper:

- Computes sha256(subject + "\n" + body) as a dedupe key and stores
  the last-sent timestamp + a suppressed-while-in-cooldown counter
  in a caller-provided JSON state file (atomic mv writes).
- Suppresses repeat sends within a configurable cooldown window
  (default 6 hours, GC_ESCALATION_COOLDOWN_SECONDS env override).
- On the first send after a suppression streak, appends a one-line
  "[Suppressed N time(s) since <last_sent_at>; cooldown <s>s.]"
  footer so the operator sees the cadence rather than just the
  latest report.
- Best-effort labels the resulting bead with wisp_type:escalation by
  parsing the new bead id from `gc mail send`'s "Sent message <id>
  to <to>" stdout line and calling `bd label add`. wisp-compact
  already honors wisp_type:escalation as the 7d retention class;
  untagged escalation mails were falling into the 24h default bucket.
- Exposes clear_escalation_state(state_file, [subject]) so callers
  can wipe dedupe entries when the underlying condition resolves
  and the next firing should escalate fresh.
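
The bead-id parsing and the release footer from the list above can be sketched as follows. Both function names are hypothetical, and the exact parsing in escalation.sh may differ. The sketch relies only on the "Sent message <id> to <to>" line shape and the footer format quoted in this message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names are illustrative.

# Read command output on stdin and print <id> from the first
# "Sent message <id> to <to>" line. Prints nothing when no such line is
# present, so the caller can skip labelling (best-effort).
parse_sent_bead_id() {
  sed -n 's/^Sent message \([^ ]*\) to .*$/\1/p' | head -n 1
}

# Build the one-line footer appended to the first mail after a
# suppression streak.
# $1 = suppressed count, $2 = last_sent_at timestamp, $3 = cooldown seconds
suppression_footer() {
  local count=$1 last_sent_at=$2 cooldown=$3
  printf '[Suppressed %s time(s) since %s; cooldown %ss.]' \
    "$count" "$last_sent_at" "$cooldown"
}
```

A caller would pass a non-empty id to `bd label add <id> wisp_type:escalation` and skip that call when the id is empty, which keeps labelling from ever blocking the send itself.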

Wiring:

- reaper.sh sources escalation.sh, gets a new
  $PACK_STATE_DIR/reaper-state.json state file (reaper had no state
  file before), and routes its anomaly send through
  send_escalation_mail. When $ANOMALIES is empty on a tick,
  clear_escalation_state wipes the dedupe entry so a future genuine
  anomaly is not suppressed by the resolved one's cooldown.

- jsonl-export.sh's send_spike_alert and the push-failure escalation
  now both call send_escalation_mail. record_archive_push_success
  additionally calls clear_escalation_state for the push-failure
  subject so a re-introduced push failure escalates fresh. This
  subsumes the unmerged label-jsonl-escalation-wisp-type branch
  (commit 61678a5), which added labelling but no dedupe; that
  branch can be closed in favor of this PR.
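
The clear-on-resolve wiring can be illustrated with a deliberately simplified tick handler. This is a sketch under stated assumptions, not the real reaper.sh logic. `reaper_tick` is a hypothetical name. It also keeps a single marker file holding the last-sent time, rather than the per-key JSON state file the helper uses, purely to stay dependency-free.

```shell
#!/usr/bin/env bash
# Hypothetical tick handler. Prints "send" or "suppress" for an anomaly
# tick, and "clear" after wiping state on a no-anomaly tick.
# $1 = state dir, $2 = anomaly text (may be empty), $3 = now (epoch secs)
reaper_tick() {
  local state_dir=$1 anomalies=$2 now=$3
  local cooldown=${GC_ESCALATION_COOLDOWN_SECONDS:-21600}  # 6h default
  local marker=$state_dir/last_sent

  if [ -z "$anomalies" ]; then
    rm -f "$marker"   # condition resolved: forget the cooldown
    echo clear
    return 0
  fi

  if [ -f "$marker" ] && [ $((now - $(cat "$marker"))) -lt "$cooldown" ]; then
    echo suppress
  else
    printf '%s\n' "$now" > "$marker"
    echo send
  fi
}
```

With ticks 30 minutes apart, the sequence anomaly, anomaly, no anomaly, anomaly yields send, suppress, clear, send. Without the clear step, the final tick would still fall inside the first mail's cooldown and be suppressed.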

Tests (examples/gastown/maintenance_scripts_test.go):

- TestReaperEscalationSuppressesRepeatAnomalyWithinCooldown — two
  reaper ticks with identical anomalies result in exactly one mail
  and a suppressed_count=1 dedupe entry in the state file.
- TestReaperEscalationClearsStateWhenAnomaliesResolve — anomaly
  tick, then no-anomaly tick (which must clear dedupe state), then
  anomaly tick again, must produce 2 fresh mails total.
- TestReaperEscalationLabelsBeadAfterSend — when the gc stub mirrors
  real `gc mail send`'s "Sent message <id> to <to>" stdout, the
  follow-up `bd label add <id> wisp_type:escalation` lands in the
  bd log.
- TestReaperEscalationReleaseFooterReportsSuppressedCount — backdate
  the state file's last_sent_at and bump suppressed_count, then run
  again; the released mail's body must carry the
  "Suppressed N time(s) since <ts>" footer.
- TestJsonlSpikeEscalationSuppressesRepeats — two back-to-back
  jsonl-export runs with a 400%-delta spike emit exactly one
  ESCALATION mail.

Out of scope:

- The underlying hq schema gap that drives the anomaly text on every
  tick. The dedupe layer here removes the FLOOD symptom regardless;
  the migration that would clear the anomaly condition itself is a
  separate operator decision.
- Bundle refresh and wisp-bead housekeeping. Those happen after this
  PR merges and the operator runs `gc import update` against the
  city's maintenance pack.

Generated by the operator's software factory.
City: factory-main · Agent: local-core__builder-fm-9nrnx4
On behalf of: @austinborn
Co-Authored-By: <operator-factory-bot> <factory-bot@actual-software.invalid>
austinborn pushed a commit that referenced this pull request May 26, 2026
…gastownhall#2082) (gastownhall#2559)

Thanks to @mike-matchpoint for the clear repro and the structured
options menu in gastownhall#2082 — this PR implements the smallest set of
fail-closed mitigations (options #2 + #3 from the issue body) that stop
the silent strand at the polecat boundary.

## Summary

The polecat formula's `workspace-setup` step creates a per-bead branch
`polecat/<bead-id>` and the refinery's standard scan only discovers
those. Providers that skip `workspace-setup` (the reported codex case)
commit on whatever branch happens to be checked out in the agent's home
worktree, and the refinery silently never finds the work — beads end up
"assigned to refinery, no merge target", requiring manual recovery.

Two mitigations, both fail-closed:

### 1. Branch-shape gate in `mol-polecat-work.toml` (`submit-and-exit`)

A new **step 1** runs before the push and before the refinery reassign:

- Reads `git branch --show-current`.
- Refuses to proceed if the current branch is not `polecat/{{issue}}`:
prints recovery instructions, signals `gc runtime drain-ack`, and exits 1.
- Also reconciles `metadata.branch` so the refinery's metadata view
matches what is about to be pushed (in case `workspace-setup` recorded a
divergent value).

This stops the strand at the polecat boundary instead of after the bead
has already advanced to "assigned to refinery". Existing step numbers
shift by one (3→4 cleanup, 4→5 metadata, 5→6 reassign, 6→7 signal, 7→8
reconciler+exit); the existing
`TestPolecatFormulaSignalsRefineryAfterReassign` assertions are updated
to match.
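
A minimal sketch of the branch-shape check, assuming a hypothetical function name `check_polecat_branch`. The real step lives in the formula and also signals `gc runtime drain-ack` and reconciles `metadata.branch`, neither of which is reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical gate. Returns 0 when the branch is polecat/<bead-id>.
# $1 = bead id, $2 = current branch (optional; defaults to what git reports)
check_polecat_branch() {
  local bead_id=$1
  local current=${2-$(git branch --show-current 2>/dev/null)}
  local expected="polecat/$bead_id"
  if [ "$current" != "$expected" ]; then
    echo "refusing to submit: on '$current', expected '$expected'" >&2
    return 1
  fi
}
```

A caller would run this before the push and stop with a non-zero exit on failure, so the bead never advances to the refinery from the wrong branch.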

### 2. `CRITICAL: Branch Convention` section in `prompt.template.md`

The original prompt deferred all branch detail to the formula
description. A provider that skips reading the formula now still sees
the `polecat/<bead-id>` rule **inline** in the prompt, with a worked
example table.

## Tests

- **New** `TestPolecatFormulaSubmitHasBranchShapeGate` — asserts the
gate body appears in order before the push and before the refinery
reassign, AND that the `metadata.branch` reconciliation is present.
- **New** `TestPolecatPromptInlinesBranchConvention` — asserts the
CRITICAL section names the convention and references gastownhall#2082.
- **Updated** `TestPolecatFormulaSignalsRefineryAfterReassign` — step
renumber.

## Out of scope (deferred from the issue body)

- **Option #1** (move per-bead worktree+branch creation into a pre-claim
hook) — the more robust structural fix, but touches the
supervisor/dispatch layer and warrants a maintainer design call. Belongs
in a separate PR.
- **Option #4** (codex-specific memory instructions) — provider-side and
out of the formula+prompt surface this PR addresses.

This change covers the prompt + formula layer where polecats already
operate; the supervisor-side hook is its own PR.

## Files

`examples/gastown/packs/gastown/formulas/mol-polecat-work.toml`,
`examples/gastown/packs/gastown/agents/polecat/prompt.template.md`,
`examples/gastown/gastown_test.go` (+141 / -11)

Closes gastownhall#2082

---------

Co-authored-by: sjarmak <sjarmak@users.noreply.github.com>