docs: Mythos-style bug-hunt pipeline + spar ranking by avrabe · Pull Request #133 · pulseengine/spar

avrabe · 2026-04-21T18:35:32Z

Summary

Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude Mythos (red.anthropic.com, April 2026) and adapted for spar, plus the ranking output from its first end-to-end run.

The architecture: let an agent reason freely about code, but require a machine-checkable oracle (Kani harness + PoC test) for every reported bug so hallucinations don't ship.
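
As a rough illustration of that gate, the sketch below accepts a finding only when both oracle halves back it. The `Finding` and `Verdict` types, their field names, and `oracle_gate` are hypothetical and not taken from the spar repository; each oracle half is reduced to a boolean for brevity.

```rust
/// Hypothetical record of a reported finding; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Finding {
    title: String,
    /// True if a Kani harness demonstrates the bug.
    kani_harness_demonstrates_bug: bool,
    /// True if a proof-of-concept test demonstrates the bug.
    poc_test_demonstrates_bug: bool,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Accepted,
    Rejected(&'static str),
}

/// Accept a finding only when both oracle halves support it.
fn oracle_gate(finding: &Finding) -> Verdict {
    match (
        finding.kani_harness_demonstrates_bug,
        finding.poc_test_demonstrates_bug,
    ) {
        (true, true) => Verdict::Accepted,
        (false, _) => Verdict::Rejected("missing Kani evidence"),
        (true, false) => Verdict::Rejected("missing PoC test evidence"),
    }
}
```

A finding with only one half present is rejected with a reason, so an unverified hypothesis never reaches the emit step.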

What's in `scripts/mythos/`

| File | Purpose |
| --- | --- |
| `HOWTO.md` | Pipeline overview, prerequisites, worked example (sigil case study) |
| `rank.md` | Tier-1-to-5 rubric adapted to spar's AADL-pipeline attack surface |
| `discover.md` | Mythos-verbatim discovery prompt + spar context + AADL hypothesis priors |
| `validate.md` | Fresh-session validator enforcing both oracle halves + interestingness |
| `emit.md` | Converts a confirmed finding into a draft STPA artifact under `safety/stpa/` |
| `rank.json` | 144-file ranking produced by running `rank.md` on the spar workspace |
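
To show how a ranking like this might feed the discovery step, here is a minimal sketch that picks the files at or above a tier threshold. The `RankedFile` struct and `discovery_targets` function are hypothetical; the actual schema of `rank.json` is not shown in this PR, so the sketch works on in-memory values rather than parsing the file.

```rust
/// Hypothetical in-memory form of one ranking entry; the real rank.json
/// schema is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct RankedFile {
    path: String,
    tier: u8,
}

/// Return the files at or above `min_tier`, highest tier first.
fn discovery_targets(ranking: &[RankedFile], min_tier: u8) -> Vec<&RankedFile> {
    let mut targets: Vec<&RankedFile> = ranking
        .iter()
        .filter(|file| file.tier >= min_tier)
        .collect();
    // sort_by is stable, so files within the same tier keep their input order.
    targets.sort_by(|a, b| b.tier.cmp(&a.tier));
    targets
}
```

Calling `discovery_targets(&ranking, 4)` would yield the tier-5 and tier-4 entries, one discovery agent per returned file.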

Proof of life

The first parallel run of discover on the top-5 tier-5 files surfaced five findings:

- 3 confirmed, now fixed in PR #132: unary-sign regression, duplicate component sections, overflowing range-min.
- 2 confirmed-but-no-uca: validators correctly filtered them as over-claimed severity (AADL `+=>` is append, not override; bare-array applies-to is a missing feature, not a resolver bug).

So the oracle gate and validation pass both earned their keep on run one.

How to run it

```
# Step 1: rank
$ claude
> Read scripts/mythos/rank.md
# → produces JSON ranking (see rank.json for an example)

# Step 2: discover in parallel, one agent per tier-≥4 file
# (the parallelism is what makes Mythos effective)
> discover.md with {{file}} = crates/spar-parser/src/grammar/properties.rs

# Step 3: validate in fresh sessions
> validate.md with {{report}} = <finding>

# Step 4: emit a draft tracking artifact
> emit.md with {{confirmed_report}} + {{uca_id}}
```
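
The `{{file}}`, `{{report}}`, `{{confirmed_report}}` and `{{uca_id}}` placeholders are filled per invocation. If the prompts were rendered by a script rather than by hand, the substitution could look roughly like the sketch below. `render_prompt` is a hypothetical helper, not part of `scripts/mythos/`, and it assumes substituted values never contain `{{` themselves.

```rust
/// Fill `{{name}}` placeholders in a prompt template.
/// Hypothetical helper; assumes substituted values contain no "{{" of their own.
fn render_prompt(template: &str, vars: &[(&str, &str)]) -> Result<String, String> {
    let mut out = template.to_string();
    for (name, value) in vars {
        // Doubled braces in format! are escapes, so this builds "{{name}}".
        let placeholder = format!("{{{{{}}}}}", name);
        out = out.replace(&placeholder, value);
    }
    // Refuse to hand an agent a prompt that still has an unfilled placeholder.
    if let Some(index) = out.find("{{") {
        return Err(format!("unfilled placeholder at byte {}", index));
    }
    Ok(out)
}
```

Failing on a leftover placeholder is a deliberate choice here: a prompt sent with a literal `{{report}}` in it would waste a validation session.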

Gotchas called out in HOWTO

- Failing tests in source break CI: mark them `#[ignore]` and put the rerun command in the ignore reason.
- The rubric is wrong the first time; expect to patch it after pass 1.
- Validators must be fresh sessions. Reusing discovery context lets the agent defend its own hypothesis.
- One agent per file, not per codebase.
- Keep the discovery prompt minimal. Elaborate CWE checklists underperform the plain "find a correctness bug" prompt because the agent has tools (oracle, debugger, runtime) and the environment filters out false hypotheses.
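
The first gotcha can be illustrated with a minimal sketch of a PoC test parked behind `#[ignore]`. The function `range_width` is a hypothetical stand-in with a deliberately planted overflow bug, not spar code, and the test name is likewise made up for illustration.

```rust
/// Hypothetical stand-in with a deliberate bug: it wraps on overflow
/// instead of reporting it.
fn range_width(min: i64, max: i64) -> Option<i64> {
    Some(max.wrapping_sub(min))
}

#[cfg(test)]
mod poc {
    use super::*;

    // The PoC fails until the bug is fixed, so it is ignored by default
    // and the ignore reason carries the rerun command.
    #[test]
    #[ignore = "PoC for an open finding; rerun with: cargo test poc_range_width_overflow -- --ignored"]
    fn poc_range_width_overflow() {
        // Desired behavior: an overflowing width is reported as None.
        assert_eq!(range_width(i64::MIN, 1), None);
    }
}
```

A plain `cargo test` skips the ignored PoC and prints the reason next to it, so CI stays green while the reproduction stays in the tree.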

Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude Mythos (red.anthropic.com, April 2026) adapted for spar. The architecture: let an agent reason freely about code but require a machine-checkable oracle (Kani harness + PoC test) for every reported bug so hallucinations don't ship.

Contents:

- HOWTO.md: pipeline overview, prerequisites, worked example
- rank.md: tier-1-through-5 rubric specific to spar's pipeline (parser/syntax/hir-def/annex = tier 5; analysis/solver = tier 4; …)
- discover.md: Mythos-verbatim discovery prompt + spar context + AADL-specific hypothesis priors (parser edge cases, HIR drift, proof-code drift from the 180 Kani harnesses)
- validate.md: fresh-session validator enforcing both oracle halves
- emit.md: converts a confirmed finding into a draft STPA artifact entry under safety/stpa/*.yaml
- rank.json: ranking of 144 spar source files produced by running rank.md (tier 5: 43 files; tier 4: 51; tier 3: 35; tier 2: 14; tier 1: 1). Kept as evidence the pipeline works end-to-end.

First parallel run on the top-5 tier-5 files surfaced three shipped bugs (fixed in PR #132) and two confirmed-but-no-uca findings that were filtered by validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-21T18:47:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


Same patch as fix/mythos-batch-1: `thin-vec 0.2.14` has a UAF/double free in `IntoIter::drop` reachable from safe Rust (RustSec 2026-0103); 0.2.16 ships the fix. The dependency is indirect, pulled in through salsa. Unblocks the `Cargo Deny` and `Security Audit (RustSec)` CI checks on this branch too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

avrabe force-pushed the docs/mythos-pipeline branch from 4a65867 to 29b157b on April 21, 2026 19:35

avrabe merged commit 7b0e373 into main Apr 22, 2026
11 checks passed

avrabe deleted the docs/mythos-pipeline branch April 22, 2026 04:04

avrabe mentioned this pull request Apr 22, 2026

fix: 4 Mythos-hardening fixes (parser spec-conformance + helper footguns) #139

Merged

4 tasks

avrabe merged 2 commits into main from docs/mythos-pipeline
