
docs: Mythos-style bug-hunt pipeline + spar ranking#133

Merged
avrabe merged 2 commits into main from docs/mythos-pipeline
Apr 22, 2026


@avrabe (Contributor) commented Apr 21, 2026

Summary

Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude Mythos (red.anthropic.com, April 2026) adapted for spar, plus the ranking output from its first end-to-end run.

The architecture: let an agent reason freely about code, but require a machine-checkable oracle (Kani harness + PoC test) for every reported bug so hallucinations don't ship.
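As a rough illustration of the two oracle halves, here is a minimal sketch built around a hypothetical helper, `scale_range_min`, which is not taken from spar. The Kani harness is only compiled under `cargo kani`; the PoC is an ordinary unit test that pins one concrete input.

```rust
/// Hypothetical helper, for illustration only: scales a range minimum by a
/// unit factor and returns `None` instead of overflowing.
pub fn scale_range_min(min: i64, factor: i64) -> Option<i64> {
    min.checked_mul(factor)
}

// Oracle half 1: a Kani proof harness. The `kani` cfg is set only by
// `cargo kani`, so plain `cargo build` and `cargo test` skip this module.
#[cfg(kani)]
mod proofs {
    use super::*;

    #[kani::proof]
    fn scale_range_min_never_wraps() {
        let min: i64 = kani::any();
        let factor: i64 = kani::any();
        // Property: whenever a value is returned, it equals the exact
        // product computed in wider arithmetic (no silent wrap-around).
        if let Some(scaled) = scale_range_min(min, factor) {
            assert_eq!(scaled as i128, (min as i128) * (factor as i128));
        }
    }
}

// Oracle half 2: a concrete proof-of-concept test for one input.
#[cfg(test)]
mod poc {
    use super::*;

    #[test]
    fn range_min_overflow_is_reported_not_wrapped() {
        assert_eq!(scale_range_min(i64::MIN, -1), None);
        assert_eq!(scale_range_min(-3, 1_000), Some(-3_000));
    }
}
```

The point of pairing the two is that the harness covers the whole input space while the test gives reviewers a single input they can rerun by hand.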

What's in scripts/mythos/

| File | Purpose |
| --- | --- |
| HOWTO.md | Pipeline overview, prerequisites, worked example (sigil case study) |
| rank.md | Tier-1-to-5 rubric adapted to spar's AADL-pipeline attack surface |
| discover.md | Mythos-verbatim discovery prompt + spar context + AADL hypothesis priors |
| validate.md | Fresh-session validator enforcing both oracle halves + interestingness |
| emit.md | Converts confirmed finding → draft STPA artifact under safety/stpa/ |
| rank.json | 144-file ranking produced by running rank.md on the spar workspace |
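The commit message for this PR gives part of the rubric (parser/syntax/hir-def/annex at tier 5, analysis/solver at tier 4, the rest elided). A minimal sketch of that excerpt as a lookup, assuming crate directories follow the `crates/spar-<area>/` pattern seen in `crates/spar-parser/`; every crate name other than `spar-parser` is an assumption, and `tier_hint` is a hypothetical helper, not part of the pipeline:

```rust
/// Returns a starting tier for a source path, or `None` where the quoted
/// rubric excerpt says nothing (rank.md itself remains the authority).
pub fn tier_hint(path: &str) -> Option<u8> {
    // Assumption: paths look like `crates/spar-<area>/src/...`.
    let crate_dir = path.strip_prefix("crates/")?.split('/').next()?;
    let area = crate_dir.strip_prefix("spar-")?;
    match area {
        "parser" | "syntax" | "hir-def" | "annex" => Some(5),
        "analysis" | "solver" => Some(4),
        // Lower tiers are elided in the source, so they are not guessed here.
        _ => None,
    }
}
```

In the pipeline itself the ranking comes from the agent reading rank.md; a lookup like this would at most serve as a sanity check on its output.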

Proof of life

The first parallel run of discover on the top-5 tier-5 files surfaced five findings:

  • 3 confirmed, now fixed in PR #132: a unary-sign regression, duplicate component sections, and an overflowing range-min.
  • 2 confirmed-but-no-uca (no unsafe control action attached): validators correctly filtered these out because the reports over-claimed severity. AADL +=> appends rather than overrides, and bare-array applies-to is a missing feature, not a resolver bug.

So the oracle gate and validation pass both earned their keep on run one.
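The gate described above can be sketched as a small decision function. This is an illustration under assumptions: the field names, the `Rejected` variant, and the exact order of checks are hypothetical, since validate.md itself is not quoted here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Confirmed,
    ConfirmedButNoUca,
    /// Assumed name for findings that fail either oracle half.
    Rejected,
}

/// Hypothetical summary of what a validator session observed.
#[derive(Debug, Clone, Copy)]
pub struct Evidence {
    pub kani_shows_violation: bool,
    pub poc_reproduces: bool,
    pub maps_to_uca: bool,
}

/// Both oracle halves must hold before severity is even considered.
pub fn validate(e: Evidence) -> Verdict {
    if !(e.kani_shows_violation && e.poc_reproduces) {
        return Verdict::Rejected;
    }
    if e.maps_to_uca {
        Verdict::Confirmed
    } else {
        Verdict::ConfirmedButNoUca
    }
}

/// Counts (confirmed, confirmed-but-no-uca, rejected).
pub fn tally(verdicts: &[Verdict]) -> (usize, usize, usize) {
    let count = |v: Verdict| verdicts.iter().filter(|&&x| x == v).count();
    (
        count(Verdict::Confirmed),
        count(Verdict::ConfirmedButNoUca),
        count(Verdict::Rejected),
    )
}
```

Under this model the first run would tally as three `Confirmed` and two `ConfirmedButNoUca`.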

How to run it

# Step 1 — rank
$ claude
> Read scripts/mythos/rank.md
# → produces JSON ranking (see rank.json for an example)

# Step 2 — discover in parallel, one agent per tier-≥4 file
# (the parallelism is what makes Mythos effective)
> discover.md with {{file}} = crates/spar-parser/src/grammar/properties.rs

# Step 3 — validate in fresh sessions
> validate.md with {{report}} = <finding>

# Step 4 — emit a draft tracking artifact
> emit.md with {{confirmed_report}} + {{uca_id}}
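For anyone scripting step 2 rather than typing it, the `{{file}}` substitution and the one-agent-per-file fan-out might look like the sketch below. `RankedFile` is a hypothetical shape for a ranking entry (the real rank.json schema may differ), and `fill_prompt` and `discovery_prompts` are illustrative helpers, not part of the pipeline.

```rust
/// Hypothetical shape of one ranking entry; the real rank.json may differ.
#[derive(Debug, Clone)]
pub struct RankedFile {
    pub path: String,
    pub tier: u8,
}

/// Replaces every `{{key}}` placeholder in `template` with its value.
pub fn fill_prompt(template: &str, vars: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        // `{{{{{}}}}}` renders as the literal text `{{key}}`.
        out = out.replace(&format!("{{{{{}}}}}", key), value);
    }
    out
}

/// Builds one discovery prompt per file at or above `min_tier`, so each
/// agent session receives exactly one target file.
pub fn discovery_prompts(template: &str, ranking: &[RankedFile], min_tier: u8) -> Vec<String> {
    ranking
        .iter()
        .filter(|f| f.tier >= min_tier)
        .map(|f| fill_prompt(template, &[("file", f.path.as_str())]))
        .collect()
}
```

Each returned prompt would then be handed to its own fresh agent session, matching the tier-≥4 cut-off in step 2.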

Gotchas called out in HOWTO

  • Failing PoC tests committed to the source tree break CI. Mark them #[ignore] and put the rerun command in the ignore reason.
  • The rubric is wrong the first time — expect to patch after pass 1.
  • Validators must be fresh sessions. Reusing discovery context lets the agent defend its own hypothesis.
  • One agent per file, not per codebase.
  • Keep the discovery prompt minimal. Elaborate CWE checklists underperform the plain "find a correctness bug" prompt because the agent has tools (oracle, debugger, runtime) and the environment filters truth.
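The first gotcha can be sketched as follows. `parse_signed_int` is a deliberately buggy, hypothetical stand-in (it is not spar code), and the test name is made up for the example; only the `#[ignore = "..."]` attribute form is standard Rust.

```rust
/// Hypothetical stand-in for a function under investigation. It is
/// deliberately buggy: it strips a leading sign instead of applying it.
pub fn parse_signed_int(src: &str) -> Option<i64> {
    let digits = src.trim_start_matches(|c: char| c == '-' || c == '+');
    digits.parse().ok()
}

#[cfg(test)]
mod poc {
    use super::*;

    // The PoC asserts the correct behavior, so it fails until the bug is
    // fixed. `#[ignore]` keeps CI green; the reason string carries the
    // command needed to rerun it.
    #[test]
    #[ignore = "PoC for an unfixed finding; rerun: cargo test unary_minus_is_preserved -- --ignored"]
    fn unary_minus_is_preserved() {
        assert_eq!(parse_signed_int("-5"), Some(-5));
    }
}
```

Once the fix lands, dropping the `#[ignore]` line turns the PoC into a regular regression test.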

Related

🤖 Generated with Claude Code

Ships the four-prompt bug-hunt pipeline modeled on Anthropic's Claude
Mythos (red.anthropic.com, April 2026) adapted for spar. The
architecture: let an agent reason freely about code but require a
machine-checkable oracle (Kani harness + PoC test) for every reported
bug so hallucinations don't ship.

Contents:
- HOWTO.md — pipeline overview, prerequisites, worked example
- rank.md — tier-1-through-5 rubric specific to spar's pipeline
  (parser/syntax/hir-def/annex = tier 5; analysis/solver = tier 4; …)
- discover.md — Mythos-verbatim discovery prompt + spar context +
  AADL-specific hypothesis priors (parser edge cases, HIR drift,
  proof-code drift from the 180 Kani harnesses)
- validate.md — fresh-session validator enforcing both oracle halves
- emit.md — converts a confirmed finding into a draft STPA artifact
  entry under safety/stpa/*.yaml
- rank.json — ranking of 144 spar source files produced by running
  rank.md (tier 5: 43 files; tier 4: 51; tier 3: 35; tier 2: 14;
  tier 1: 1). Kept as evidence the pipeline works end-to-end.
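A quick arithmetic check of the tier counts quoted above, written as a small sketch; the constant and function names are illustrative only.

```rust
/// (tier, file count) pairs copied from the commit message above.
pub const TIER_COUNTS: [(u8, usize); 5] = [(5, 43), (4, 51), (3, 35), (2, 14), (1, 1)];

/// Total number of ranked files across all tiers.
pub fn total_files() -> usize {
    TIER_COUNTS.iter().map(|&(_, n)| n).sum()
}

/// Number of files whose tier is at least `min_tier`.
pub fn files_at_or_above(min_tier: u8) -> usize {
    TIER_COUNTS
        .iter()
        .filter(|&&(tier, _)| tier >= min_tier)
        .map(|&(_, n)| n)
        .sum()
}
```

The counts sum to the stated 144 files, and 94 of them (43 + 51) sit at tier 4 or above.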

First parallel run on the top-5 tier-5 files surfaced three shipped
bugs (fixed in PR #132) and two confirmed-but-no-uca findings that
were filtered by validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Same patch as fix/mythos-batch-1: `thin-vec 0.2.14` has a use-after-free/double
free in `IntoIter::drop` reachable from safe Rust (RustSec 2026-0103);
0.2.16 ships the fix. It is an indirect dependency pulled in through salsa.

Unblocks `Cargo Deny` + `Security Audit (RustSec)` CI checks on this
branch too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avrabe avrabe force-pushed the docs/mythos-pipeline branch from 4a65867 to 29b157b Compare April 21, 2026 19:35
@avrabe avrabe merged commit 7b0e373 into main Apr 22, 2026
11 checks passed
@avrabe avrabe deleted the docs/mythos-pipeline branch April 22, 2026 04:04