Skip to content

Feature/run pipeline#20

Open
seanrivera wants to merge 33 commits into
mainfrom
feature/run-pipeline
Open

Feature/run pipeline#20
seanrivera wants to merge 33 commits into
mainfrom
feature/run-pipeline

Conversation

@seanrivera

Copy link
Copy Markdown
Member

This depends on the Scorer branch and the other open PRs. Meant to be a front end system. Currently we don't have a clean way to automatically generate mazes, but we can link to Helen's generated mazes, and the run manifest will just take a list of files. I recommend we land the scorer PR before reviewing this as it makes it easier to see the changes.

The core files for this change are the reports.py, the new metrics.py and the run_pipeline.py.

In the future we might split the run_pipeline into multiple files, but I thought it would be easier to have a single point of reference for now.

helenlu66 and others added 30 commits May 29, 2026 01:27
…into feature/run-pipeline

# Conflicts:
#	ogbench
# Conflicts:
#	interface/agents.py
#	pyproject.toml
PR #18 added last_usage telemetry to the old single-file interface/agents.py,
which no longer exists (now an interface/agents/ package). Port the pattern:
ClaudeAnthropicAgent captures usage from the Anthropic response via
normalize_token_usage; Qwen35VLAgent records prompt/output token counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the canonical pipeline stages over the interface/ runner (Stack A) and the
scorer/ package into a single inspectable orchestrator:

- pipeline/run_stage3.py: run one live-model episode -> episode.json
- pipeline/episode_metrics.py: derive path_choice (test2),
  mechanism_interaction_order + failure_point (test3), token totals, and the
  Appendix A.3 episode_runs.jsonl row; enrich runs for the scorer
- pipeline/reports.py: scoring_calibration_summary / complexity_distance_summary
  / mechanism_ordering_pairs aggregators
- scripts/run_pipeline.py: Stage 1->5 CLI (multinet-run-pipeline)
- scripts/validate_fixtures.py: validate fixtures + derive test2 route cells
- gridworld/fixtures/: manifest + test2 shortcut maze + test3 ordering pairs
  (test1 reuses the existing validation_10 set)
- tests for episode metrics, reports, and an end-to-end pipeline run

Baselines (BFS/greedy) stay Stage-2 difficulty/canonical-path generators via the
scorer; Stage-3 episodes are live-model-only. No DAG runner (kept sequential).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a run-config layer that maps each model to its own task selection and
provider/params, keeping the manifest as a separate metadata catalog:

- scripts/run_pipeline.py: load_run_config + resolve_task_rows (entries may be
  task-file paths, catalog task_ids, or experiment keywords; catalog metadata is
  attached by path so test2/test3 signals survive); run_from_config drives
  multiple models, scoring the union suite once and aggregating one
  episode_runs.jsonl + report set. _build_agent_from_spec constructs claude/qwen
  agents from the model entry (provider/model/temperature/max_tokens).
- CLI: --run-config is the primary path; --agent/--experiment remain a
  single-model fallback.
- gridworld/fixtures/run_config.example.json: sample config.
- tests for task resolution and a config-driven multi-model run (stub factory).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cached artifacts are now reused only when their inputs hash still matches,
instead of skipping purely on file existence:

- Stage 2: reuse scored_static.json/canonical_paths.json only when the stored
  inputs_hash equals the hash recomputed from the current task spec + scorer
  config; otherwise regenerate the bundle. _expected_static_hash mirrors the
  scorer recipe (guarded by a parity test).
- Stage 3 (model calls, the expensive stage): stamp each episode with a sidecar
  run_inputs.json carrying an inputs_hash over {task spec, model_id, seed, prompt
  config, backend, pipeline_version}; reuse the cached episode only on a hash
  match. Scorer-config changes intentionally do NOT invalidate the episode.
- Stage 4 (cheap, deterministic): always re-score from the cached/fresh episode,
  so scorer-config / static / canonical changes propagate to run_score.json.
- canonical_paths.json now carries its own inputs_hash (scorer/artifacts.py +
  solvers.py), closing the last unhashed scorer artifact.

Tests: hash parity with the scorer, episode cache hit on unchanged re-run, task
edit invalidating both static and episode, and scorer-config change re-scoring
without re-running the model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prompts are not versioned yet while we iterate, so the Stage-3 run-inputs hash
no longer includes the ExperimentConfig; the prompt variant still separates runs
via the <condition> directory. Left a TODO to fold backend/adapter code versions
into the run hash at v1 release.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Apply /code-review max findings on the run pipeline:
- reports: fix test-3 expected_order_match_rate to compare only the
  expected mechanisms' relative order, not the full interaction order
  (which also carries downstream doors/gates, so it was always 0).
- episode_metrics: lift final_state.reward into the scorer-facing dict so
  run_score.json matches the jsonl row; guard an explicit final_state=null;
  align optimality_ratio with the scorer's step_ratio for optimal_steps==0.
- run_pipeline: require canonical_paths.json for the Stage-2 cache hit;
  raise a clear error for an unknown --conditions name; derive episode
  metrics once per run instead of twice.

Separate the prompt-variant axis from the manifest condition (F2): catalog
rows always define `condition`, so the old setdefault collapsed prompt
variants and collided composite keys. Thread a distinct `prompt_variant`
field through build_run_row, the composites key, and reports (_run_key,
success_rate_by_prompt_variant, test-2 grouping). Add regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Moving ExperimentConfig and run_episode imports inside their respective
functions allows Stage 1-2 (manifest + solvers + static score) to be
imported without pulling in the heavy interface stack.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A schema-valid task that Stage 2 marks is_beatable=false is ineligible:
skip its Stage 3/4 work (model/API calls + scoring) in _run_one_model
instead of spending runs on it, and surface the ineligible set via
scoring_calibration_summary's new ineligible_tasks field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The installed multinet-run-pipeline CLI defaults to
gridworld/fixtures/manifest.json, but package-data only shipped
gridworld/tasks, so the default manifest and its test2/test3 fixture
task files were omitted from the wheel. Add fixtures/**/*.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@seanrivera seanrivera requested a review from pranavguru June 9, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants