Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions skills/autobrowse/evals/.env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# Required for real (non-mock) runs — inner agent + outer agent both use it.
ANTHROPIC_API_KEY=sk-ant-...

# Required for tasks with env=remote (Tier B/C bot-protected sites).
BROWSERBASE_API_KEY=bb_...

# Path to the autobrowse skill (the directory containing scripts/evaluate.mjs).
# Defaults to the parent directory (evals/ ships inside the skill); set only
# to point the harness at a different autobrowse checkout.
# AUTOBROWSE_DIR=/path/to/skills/skills/autobrowse
4 changes: 4 additions & 0 deletions skills/autobrowse/evals/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
node_modules/
runs/
.env
vendor/
96 changes: 96 additions & 0 deletions skills/autobrowse/evals/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
# autobrowse-evals

Eval harness for the [autobrowse](https://github.com/browserbase/skills/tree/main/skills/autobrowse) self-improving browser-automation loop. Measures the four things that matter — **convergence speed**, **accuracy**, **runtime speed**, and **token cost** — and makes them comparable across inner/outer models, prompts, and architectures.

## The four artifacts being evaluated

| Artifact | What it is | Metrics |
|---|---|---|
| Single run | One `evaluate.mjs` attempt, empty strategy | accuracy baseline, speed, tokens |
| Learning loop | evaluate → verify → improve, repeated | convergence speed, cumulative cost |
| Graduated strategy | frozen best strategy.md run by a fresh agent | **holdout** accuracy/speed/tokens |
| Codegen script | deterministic playwright/stagehand output | (future: wire `codegen.mjs --verify` in) |

The core design decision: **training and evaluation are separated.** Convergence is measured during the loop; the *result* is measured by freezing the best strategy and running it N fresh times (holdout). And **pass/fail is never self-reported** — every task has a programmatic verifier; the agent's own `success: true` is only used to compute the false-success (reward-hacking) rate.

## Layout

```
eval/
run-matrix.mjs orchestrator: condition × task × trial → train + holdout
outer-agent.mjs scripted outer loop (one structured-output call per iteration;
outer tokens metered — the interactive loop never records them)
report.mjs aggregates runs/results.jsonl into scorecards
conditions/*.json sweepable variables: inner_model, outer_model, outer_prompt, iters
prompts/outer-*.md outer-prompt variants (default = SKILL.md methodology, lean = ablation)
tasks/<task>/ task.md (autobrowse format) + verify.mjs + meta.json + mock-output.json
fixtures/ self-hosted deterministic sites (Tier A ground truth)
runs/ workspaces, traces, results.jsonl (gitignored)
```

## Benchmark suite (9 tasks, 3 tiers)

Tasks marked ◆ are drawn from the browse.sh prompt library (`prompts/<domain>/<task>.md`).

| Tier | Task | Env | Verification |
|---|---|---|---|
| **A — deterministic** | `fixture-checkout` | local | exact confirmation code (shared hash function) + total |
| | `fixture-flightdeck` | local | exact cheapest-nonstop answer; traps: cheaper 1-stop, cheaper wrong route |
| | `books-toscrape` | local | exact count/prices/titles (static demo site) |
| **B — live, stable** | `uspto-patent-lookup` ◆ | remote | patent facts are immutable (US 11,000,000) |
| | `google-flights` ◆ | local | invariants: nonstop, airline set, price band, internal consistency |
| | `opentable-availability` ◆ | local | invariants: date/party echoed, slot format, availability consistency |
| | `youtube-transcript` ◆ | local | immutable content ("Me at the zoo" transcript phrases) |
| **C — bot-protected** | `stockx-price` ◆ | remote | product identity + price band (PerimeterX) |
| | `yelp-reviews` ◆ | remote | rating/review-count bands + per-review structure (DataDome) |

Tier A gives model comparisons statistical teeth; Tier B measures real-site competence with invariant checks; Tier C measures infrastructure robustness (report it separately — variance is the site's, not the model's).

**Verifier protocol** (mirrors autobrowse's codegen runner protocol): `node eval/tasks/<task>/verify.mjs --run-dir <traceDir>` → one JSON line `{passed, checks: [{name, ok, detail}], reason}`. Each task's `mock-output.json` is its documented known-good output; `npm run test:verifiers` asserts every verifier passes it and rejects a garbage `{"success": true}` — i.e., verifiers are tested against reward-hacking.

## Setup

```bash
npm install
cp .env.example .env # ANTHROPIC_API_KEY (+ BROWSERBASE_API_KEY for remote tasks)
npm install -g browse # the browse CLI used by the inner agent
# AUTOBROWSE_DIR defaults to the parent dir (this folder ships inside the skill)
```

## Usage

```bash
npm run test:verifiers # verifier soundness (no keys needed)
node eval/run-matrix.mjs --conditions baseline --tasks fixture-checkout --mock # free pipeline check

# Real runs
node eval/run-matrix.mjs --conditions pilot --tasks fixture-checkout # cheap pilot
node eval/run-matrix.mjs --conditions baseline --tasks all --trials 3 # full baseline
node eval/run-matrix.mjs --conditions baseline,inner-haiku,inner-opus,outer-sonnet,outer-prompt-lean \
--tasks fixture-checkout,fixture-flightdeck,books-toscrape --trials 3 # model/prompt screen on Tier A

npm run report # markdown scorecards
node eval/report.mjs --json # raw aggregates
```

The fixture server (`npm run fixtures`, port 4173) auto-starts when a selected task needs it.

## Metrics (see report footer for definitions)

- **Convergence:** converged-rate, iters-to-first-verified-pass, regressions, cumulative train cost (inner + outer)
- **Accuracy:** holdout pass rate (frozen strategy, fresh runs), **false-success rate** (claimed success, verifier failed)
- **Speed:** holdout wall clock split into browser ms (sum of browse-CLI `duration_ms` in trace.json) vs model ms
- **Tokens/cost:** per-run tokens, recomputed centrally in `eval/lib/pricing.mjs` (don't trust evaluate.mjs's stale table), and **skill value** = how much the learned strategy cheapens a run vs the blind iteration-1 attempt (tests the README's "80%+ reduction" claim)

## Experiment design notes

- **Screen, don't grid.** Vary one axis at a time against `baseline` (5 conditions ship: baseline, inner-haiku, inner-opus, outer-sonnet, outer-prompt-lean). Deep-dive only the interesting 2–3 combos.
- **Pair comparisons on the same tasks**; live-site variance makes unpaired suite means meaningless. Tier C reports separately.
- **Trials:** ≥3 per cell for anything you'll make a decision on. `results.jsonl` is append-only — rerun cells freely, the report aggregates.
- **Cost calibration:** run `pilot` on one Tier A task first and read `inner_cost_usd`/`outer_cost_usd` from `runs/results.jsonl` before launching a sweep.

## Fidelity caveats / roadmap

- The scripted outer agent sees a curated evidence pack (summary, verifier verdict, failed commands), not the full tool-using trace exploration Claude Code does. A Claude-Agent-SDK outer agent with Read/Grep tools is the natural next architecture variant — and would also let `--browser-trace` evidence (unified-events.jsonl) become a sweepable axis.
- `codegen.mjs --verify` (deterministic script artifact) isn't wired into the matrix yet; its runner protocol is identical to the verifier protocol here, so it slots in as a fourth phase.
- The local checkout's `judge.mjs` (A/B strategy judge) and `--supervise` watcher are complementary: the judge compares strategy *versions* by run evidence; this harness compares *conditions* by verified outcomes. `supervised` already lands in evaluate.mjs's meta.json and could become another condition axis.
34 changes: 34 additions & 0 deletions skills/autobrowse/evals/RESULTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Eval results — 2026-06-09 (Fable 5 vs Opus 4.8)

First findings from this harness, comparing `claude-fable-5` and `claude-opus-4-8` in both autobrowse roles. ~200 verified runs, ~$220 API spend. Small n (2–3 trials/cell) — directional, not definitive.

## Headline

**Best configuration tested: Sonnet 4.6 as the inner (browsing) agent + Fable 5 as the outer (strategy-writing) agent.** On the OpenTable task it produced the most reliable *and* cheapest converged runs of any cell — beating even Opus-as-browser — because the expensive model's intelligence lands in `strategy.md` once instead of in every run.

## OpenTable 2×2 (Tier B, Akamai-walled, verified+proxied Browserbase sessions)

| Inner ↓ / Outer → | Opus 4.8 writes | Fable 5 writes |
|---|---|---|
| **Sonnet 4.6 browses** | 5/6 holdout, $1.40/run, 90s | **6/6 holdout, $0.96/run, 64s** |
| **Opus 4.8 browses** | 6/6, $1.20/run, 63s | — |
| **Fable 5 browses** | 5/6, $1.66/run, 93s | — |

- **Inner axis:** Opus beat Fable as the browser — same convergence (iter 2–3), half the training cost (~$5.5 vs ~$11/trial), perfect holdout. Fable reasons more per turn; at 2× token pricing that compounds (blind iteration-1 attempts: ~$7 vs ~$3).
- **Outer axis (same Sonnet inner in both):** Fable-authored strategies were more reliable (6/6 vs 5/6) and made the same agent ~30% faster and cheaper. Qualitatively, Fable's skills encode *mechanisms* — React hydration timing ("`wait load` returns before the widget renders; snapshot shows ~2 refs"), Akamai cookie behavior ("`browse stop` wipes cookies → never stop the session"), broken-command landmines ("`wait selector text=...` ETIMEDOUTs") — where Opus's skills describe symptoms. Same pattern appeared on the Tier A fixtures: Fable was the only outer model to identify a deliberately planted 900ms delayed-render trap and prescribe the exact fix.
- Fable's outer calls cost $0.13 vs Opus's $0.05 per improvement — negligible in absolute terms.

## Tier A fixtures (deterministic local sites)

- All models 100% on the easy task; differentiation is pure cost (Sonnet $0.16/run, Opus $0.60, Fable $0.97). On tasks the cheap model already does, frontier inner agents are pure overhead.
- On the trap-laden checkout fixture, inner reliability ranked Fable (6/6) > Opus (5/6) > Sonnet (4/6) — monotonic with price. This did **not** generalize to OpenTable, where Opus matched/beat Fable as inner.

## Other observations

- **Zero false-successes in ~200 runs** — no model claimed `success:true` against a failing verifier. Failures were honest (turn-budget exhaustion, no final JSON).
- **Live-site drift is real:** Akamai blocked every iteration-1 attempt in a morning round and none in an evening round. Only within-round (concurrent, paired) comparisons are valid on live sites.
- One Fable-cell strategy explicitly reasoned about the grader ("the verifier requires success:true — persist"). Benign here (persistence, not fabrication), but a preview of strategies evolving against the verifier's letter on harder tasks.

## Recommended default

`inner_model: claude-sonnet-4-6`, `outer_model: claude-fable-5`, escalating the inner to Opus only when a task fails to converge because the inner agent can't execute good instructions. Training cost per new skill: ~$1–2; converged verified runs: ~$1.
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/baseline.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "baseline",
"notes": "Default autobrowse setup: Sonnet inner agent (evaluate.mjs default), Opus outer agent, full SKILL.md-style outer prompt.",
"inner_model": "claude-sonnet-4-6",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-default",
"max_iters": 5,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/inner-fable-5.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "inner-fable-5",
"notes": "Fable 5 as the INNER browsing agent (vs inner-opus / baseline). Measures raw browsing competence: iteration-1 pass rate, turns, holdout reliability. 2x Opus pricing — watch cost_to_converge.",
"inner_model": "claude-fable-5",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-default",
"max_iters": 5,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/inner-haiku.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "inner-haiku",
"notes": "Cheap inner agent hypothesis: a smart outer agent distills intelligence into strategy.md, so a Haiku inner agent should converge to the same place at a fraction of the cost.",
"inner_model": "claude-haiku-4-5",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-default",
"max_iters": 7,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/inner-opus.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "inner-opus",
"notes": "Expensive inner agent: does a frontier inner model converge in fewer iterations, and does that offset its per-run cost?",
"inner_model": "claude-opus-4-8",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-default",
"max_iters": 5,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/outer-fable-5.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "outer-fable-5",
"notes": "Fable 5 as the OUTER strategy-improver (vs baseline's Opus 4.8). Measures hypothesis-formation quality: convergence speed, regressions, holdout pass rate of the strategies it writes. Outer calls are small, so the 2x pricing barely matters here.",
"inner_model": "claude-sonnet-4-6",
"outer_model": "claude-fable-5",
"outer_prompt": "outer-default",
"max_iters": 5,
"holdout_runs": 3
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "outer-prompt-lean",
"notes": "Prompt ablation: strip the SKILL.md-style guidance (one-hypothesis rule, build-on-wins, evidence grounding) from the outer prompt. Measures how much of convergence quality comes from the methodology vs the model.",
"inner_model": "claude-sonnet-4-6",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-lean",
"max_iters": 5,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/outer-sonnet.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "outer-sonnet",
"notes": "Cheaper outer agent: is Opus-level hypothesis formation actually load-bearing, or can Sonnet read traces and improve strategies just as well?",
"inner_model": "claude-sonnet-4-6",
"outer_model": "claude-sonnet-4-6",
"outer_prompt": "outer-default",
"max_iters": 5,
"holdout_runs": 3
}
9 changes: 9 additions & 0 deletions skills/autobrowse/evals/eval/conditions/pilot.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
{
"id": "pilot",
"notes": "Cheap smoke-test condition for validating the real (non-mock) pipeline: 2 training iterations max, 1 holdout run.",
"inner_model": "claude-sonnet-4-6",
"outer_model": "claude-opus-4-8",
"outer_prompt": "outer-default",
"max_iters": 2,
"holdout_runs": 1
}
61 changes: 61 additions & 0 deletions skills/autobrowse/evals/eval/config.mjs
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
import * as fs from "node:fs";
import * as path from "node:path";
import { fileURLToPath } from "node:url";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

export const ROOT = path.resolve(__dirname, "..");
export const EVAL_DIR = path.join(ROOT, "eval");
export const TASKS_DIR = path.join(EVAL_DIR, "tasks");
export const CONDITIONS_DIR = path.join(EVAL_DIR, "conditions");
export const PROMPTS_DIR = path.join(EVAL_DIR, "prompts");
export const FIXTURES_DIR = path.join(ROOT, "fixtures");
export const RUNS_DIR = path.join(ROOT, "runs");
export const RESULTS_FILE = path.join(RUNS_DIR, "results.jsonl");

const AUTOBROWSE_CANDIDATES = [
process.env.AUTOBROWSE_DIR,
path.resolve(ROOT, ".."), // evals/ ships inside the autobrowse skill
path.join(ROOT, "vendor", "skills", "skills", "autobrowse"),
].filter(Boolean);

export function resolveAutobrowseDir() {
for (const dir of AUTOBROWSE_CANDIDATES) {
if (fs.existsSync(path.join(dir, "scripts", "evaluate.mjs"))) return dir;
}
throw new Error(
"autobrowse skill not found. Set AUTOBROWSE_DIR to the directory containing scripts/evaluate.mjs " +
"(e.g. a checkout of github.com/browserbase/skills at skills/autobrowse)."
);
}

export function loadCondition(idOrPath) {
const p = idOrPath.endsWith(".json")
? path.resolve(idOrPath)
: path.join(CONDITIONS_DIR, `${idOrPath}.json`);
const cond = JSON.parse(fs.readFileSync(p, "utf-8"));
// Defaults
return {
max_iters: 5,
holdout_runs: 3,
converge_window: 3,
converge_passes: 2,
outer_prompt: "outer-default",
browser_trace: false,
...cond,
};
}

export function loadTaskMeta(task) {
const p = path.join(TASKS_DIR, task, "meta.json");
const meta = JSON.parse(fs.readFileSync(p, "utf-8"));
return { env: "local", max_turns: 30, timeout_min: 20, ...meta, task };
}

export function listTasks() {
return fs
.readdirSync(TASKS_DIR, { withFileTypes: true })
.filter((d) => d.isDirectory() && !d.name.startsWith("_"))
.map((d) => d.name)
.sort();
}
47 changes: 47 additions & 0 deletions skills/autobrowse/evals/eval/lib/extract-output.mjs
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
import * as fs from "node:fs";
import * as path from "node:path";

// Pull the last fenced ```json block (or last bare {...}) from text.
// Mirrors extractFinalJson in the newer evaluate.mjs.
export function extractJsonFromText(text) {
if (!text) return null;
const fences = [...text.matchAll(/```(?:json)?\s*([\s\S]*?)```/gi)];
let candidate = fences.length ? fences[fences.length - 1][1].trim() : null;
if (!candidate) {
const first = text.indexOf("{");
const last = text.lastIndexOf("}");
if (first !== -1 && last > first) candidate = text.slice(first, last + 1);
}
if (!candidate) return null;
try {
return JSON.parse(candidate);
} catch {
return null;
}
}

// Load the inner agent's final structured output from a run's trace dir.
// Newer evaluate.mjs writes result.json ({parsed, raw, parse_error});
// fall back to parsing summary.md's "Agent Final Output" section for the
// upstream version that doesn't.
export function loadRunOutput(runDir) {
const resultPath = path.join(runDir, "result.json");
if (fs.existsSync(resultPath)) {
try {
const r = JSON.parse(fs.readFileSync(resultPath, "utf-8"));
if (r && "parsed" in r) return r.parsed;
return r;
} catch {
/* fall through */
}
}
const summaryPath = path.join(runDir, "summary.md");
if (fs.existsSync(summaryPath)) {
const summary = fs.readFileSync(summaryPath, "utf-8");
const idx = summary.indexOf("## Agent Final Output");
const tail = idx === -1 ? summary : summary.slice(idx);
const parsed = extractJsonFromText(tail);
if (parsed) return parsed;
}
return null;
}
19 changes: 19 additions & 0 deletions skills/autobrowse/evals/eval/lib/pricing.mjs
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
// Central pricing table — single source of truth for the whole harness.
// USD per 1M tokens [input, output]. Do not trust per-script tables elsewhere
// (evaluate.mjs has its own stale copy; we recompute from raw token counts).
const PRICING = [
["claude-fable-5", [10, 50]],
["claude-opus-4-8", [5, 25]],
["claude-opus-4-7", [5, 25]],
["claude-opus-4-6", [5, 25]],
["claude-opus-4-5", [5, 25]],
["claude-sonnet-4-6", [3, 15]],
["claude-sonnet-4-5", [3, 15]],
["claude-haiku-4-5", [1, 5]],
];

export function costUsd(model, tokensIn, tokensOut) {
const entry = PRICING.find(([prefix]) => model?.startsWith(prefix));
const [inRate, outRate] = entry ? entry[1] : [3, 15];
return (tokensIn * inRate + tokensOut * outRate) / 1_000_000;
}
29 changes: 29 additions & 0 deletions skills/autobrowse/evals/eval/lib/results.mjs
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
import * as fs from "node:fs";
import * as path from "node:path";
import { RESULTS_FILE } from "../config.mjs";

// One results.jsonl record per inner-agent run (training iteration, holdout
// run) — append-only, everything downstream is a query over this file.
//
// Schema (all rows):
// ts, condition_id, task, tier, trial, phase: "train"|"holdout", iter,
// run_id, env, inner_model, outer_model,
// verified_pass, claimed_success, false_success, verifier_reason,
// status, stop_reason, turns, duration_sec, browser_ms, model_ms,
// tool_calls, tool_errors, tokens_in, tokens_out, inner_cost_usd,
// outer_tokens_in, outer_tokens_out, outer_cost_usd, hypothesis,
// converged_at (train rows on the converging iteration), mock

export function appendResult(record, file = RESULTS_FILE) {
fs.mkdirSync(path.dirname(file), { recursive: true });
fs.appendFileSync(file, JSON.stringify({ ts: new Date().toISOString(), ...record }) + "\n");
}

export function readResults(file = RESULTS_FILE) {
if (!fs.existsSync(file)) return [];
return fs
.readFileSync(file, "utf-8")
.split("\n")
.filter(Boolean)
.map((line) => JSON.parse(line));
}
Loading