docs: Mythos-style bug-hunt pipeline + spar ranking#133
Merged
Conversation
Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude Mythos (red.anthropic.com, April 2026) adapted for spar. The architecture: let an agent reason freely about code but require a machine-checkable oracle (Kani harness + PoC test) for every reported bug so hallucinations don't ship. Contents: - HOWTO.md — pipeline overview, prerequisites, worked example - rank.md — tier-1-through-5 rubric specific to spar's pipeline (parser/syntax/hir-def/annex = tier 5; analysis/solver = tier 4; …) - discover.md — Mythos-verbatim discovery prompt + spar context + AADL-specific hypothesis priors (parser edge cases, HIR drift, proof-code drift from the 180 Kani harnesses) - validate.md — fresh-session validator enforcing both oracle halves - emit.md — converts a confirmed finding into a draft STPA artifact entry under safety/stpa/*.yaml - rank.json — ranking of 144 spar source files produced by running rank.md (tier 5: 43 files; tier 4: 51; tier 3: 35; tier 2: 14; tier 1: 1). Kept as evidence the pipeline works end-to-end. First parallel run on the top-5 tier-5 files surfaced three shipped bugs (fixed in PR #132) and two confirmed-but-no-uca findings that were filtered by validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Same patch as fix/mythos-batch-1 — `thin-vec 0.2.14` has a UAF/Double Free in `IntoIter::drop` reachable from safe Rust (RustSec 2026-0103), 0.2.16 ships the fix. Indirect through salsa. Unblocks `Cargo Deny` + `Security Audit (RustSec)` CI checks on this branch too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4a65867 to
29b157b
Compare
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude Mythos (red.anthropic.com, April 2026) adapted for spar, plus the ranking output from its first end-to-end run.
The architecture: let an agent reason freely about code, but require a machine-checkable oracle (Kani harness + PoC test) for every reported bug so hallucinations don't ship.
What's in
scripts/mythos/HOWTO.mdrank.mddiscover.mdvalidate.mdemit.mdsafety/stpa/rank.jsonrank.mdon the spar workspaceProof of life
The first parallel run of discover on the top-5 tier-5 files surfaced five findings:
+=>is append not override; bare-array applies-to is a missing feature, not a resolver bug).So the oracle gate and validation pass both earned their keep on run one.
How to run it
Gotchas called out in HOWTO
#[ignore]with rerun command in ignore reason.Related
🤖 Generated with Claude Code