browserbase · shubh24 · Jun 10, 2026
diff --git a/skills/autobrowse/evals/.env.example b/skills/autobrowse/evals/.env.example
@@ -0,0 +1,10 @@
+# Required for real (non-mock) runs — inner agent + outer agent both use it.
+ANTHROPIC_API_KEY=sk-ant-...
+
+# Required for tasks with env=remote (Tier B/C bot-protected sites).
+BROWSERBASE_API_KEY=bb_...
+
+# Path to the autobrowse skill (the directory containing scripts/evaluate.mjs).
+# Defaults to the parent directory (evals/ ships inside the skill); set only
+# to point the harness at a different autobrowse checkout.
+# AUTOBROWSE_DIR=/path/to/skills/skills/autobrowse
diff --git a/skills/autobrowse/evals/.gitignore b/skills/autobrowse/evals/.gitignore
@@ -0,0 +1,4 @@
+node_modules/
+runs/
+.env
+vendor/
diff --git a/skills/autobrowse/evals/README.md b/skills/autobrowse/evals/README.md
@@ -0,0 +1,96 @@
+# autobrowse-evals
+
+Eval harness for the [autobrowse](https://github.com/browserbase/skills/tree/main/skills/autobrowse) self-improving browser-automation loop. Measures the four things that matter — **convergence speed**, **accuracy**, **runtime speed**, and **token cost** — and makes them comparable across inner/outer models, prompts, and architectures.
+
+## The four artifacts being evaluated
+
+| Artifact | What it is | Metrics |
+|---|---|---|
+| Single run | One `evaluate.mjs` attempt, empty strategy | accuracy baseline, speed, tokens |
+| Learning loop | evaluate → verify → improve, repeated | convergence speed, cumulative cost |
+| Graduated strategy | frozen best strategy.md run by a fresh agent | **holdout** accuracy/speed/tokens |
+| Codegen script | deterministic playwright/stagehand output | (future: wire `codegen.mjs --verify` in) |
+
+The core design decision: **training and evaluation are separated.** Convergence is measured during the loop; the *result* is measured by freezing the best strategy and running it N fresh times (holdout). And **pass/fail is never self-reported** — every task has a programmatic verifier; the agent's own `success: true` is only used to compute the false-success (reward-hacking) rate.
+
+## Layout
+
+```
+eval/
+  run-matrix.mjs        orchestrator: condition × task × trial → train + holdout
+  outer-agent.mjs       scripted outer loop (one structured-output call per iteration;
+                        outer tokens metered — the interactive loop never records them)
+  report.mjs            aggregates runs/results.jsonl into scorecards
+  conditions/*.json     sweepable variables: inner_model, outer_model, outer_prompt, iters
+  prompts/outer-*.md    outer-prompt variants (default = SKILL.md methodology, lean = ablation)
+  tasks/<task>/         task.md (autobrowse format) + verify.mjs + meta.json + mock-output.json
+fixtures/               self-hosted deterministic sites (Tier A ground truth)
+runs/                   workspaces, traces, results.jsonl (gitignored)
+```
+
+## Benchmark suite (9 tasks, 3 tiers)
+
+Tasks marked ◆ are drawn from the browse.sh prompt library (`prompts/<domain>/<task>.md`).
+
+| Tier | Task | Env | Verification |
+|---|---|---|---|
+| **A — deterministic** | `fixture-checkout` | local | exact confirmation code (shared hash function) + total |
+| | `fixture-flightdeck` | local | exact cheapest-nonstop answer; traps: cheaper 1-stop, cheaper wrong route |
+| | `books-toscrape` | local | exact count/prices/titles (static demo site) |
+| **B — live, stable** | `uspto-patent-lookup` ◆ | remote | patent facts are immutable (US 11,000,000) |
+| | `google-flights` ◆ | local | invariants: nonstop, airline set, price band, internal consistency |
+| | `opentable-availability` ◆ | local | invariants: date/party echoed, slot format, availability consistency |
+| | `youtube-transcript` ◆ | local | immutable content ("Me at the zoo" transcript phrases) |
+| **C — bot-protected** | `stockx-price` ◆ | remote | product identity + price band (PerimeterX) |
+| | `yelp-reviews` ◆ | remote | rating/review-count bands + per-review structure (DataDome) |
+
+Tier A gives model comparisons statistical teeth; Tier B measures real-site competence with invariant checks; Tier C measures infrastructure robustness (report it separately — variance is the site's, not the model's).
+
+**Verifier protocol** (mirrors autobrowse's codegen runner protocol): `node eval/tasks/<task>/verify.mjs --run-dir <traceDir>` → one JSON line `{passed, checks: [{name, ok, detail}], reason}`. Each task's `mock-output.json` is its documented known-good output; `npm run test:verifiers` asserts every verifier passes it and rejects a garbage `{"success": true}` — i.e., verifiers are tested against reward-hacking.
+
+## Setup
+
+```bash
+npm install
+cp .env.example .env        # ANTHROPIC_API_KEY (+ BROWSERBASE_API_KEY for remote tasks)
+npm install -g browse       # the browse CLI used by the inner agent
+# AUTOBROWSE_DIR defaults to the parent dir (this folder ships inside the skill)
+```
+
+## Usage
+
+```bash
+npm run test:verifiers                                  # verifier soundness (no keys needed)
+node eval/run-matrix.mjs --conditions baseline --tasks fixture-checkout --mock   # free pipeline check
+
+# Real runs
+node eval/run-matrix.mjs --conditions pilot --tasks fixture-checkout            # cheap pilot
+node eval/run-matrix.mjs --conditions baseline --tasks all --trials 3           # full baseline
+node eval/run-matrix.mjs --conditions baseline,inner-haiku,inner-opus,outer-sonnet,outer-prompt-lean \
+    --tasks fixture-checkout,fixture-flightdeck,books-toscrape --trials 3       # model/prompt screen on Tier A
+
+npm run report                                          # markdown scorecards
+node eval/report.mjs --json                             # raw aggregates
+```
+
+The fixture server (`npm run fixtures`, port 4173) auto-starts when a selected task needs it.
+
+## Metrics (see report footer for definitions)
+
+- **Convergence:** converged-rate, iters-to-first-verified-pass, regressions, cumulative train cost (inner + outer)
+- **Accuracy:** holdout pass rate (frozen strategy, fresh runs), **false-success rate** (claimed success, verifier failed)
+- **Speed:** holdout wall clock split into browser ms (sum of browse-CLI `duration_ms` in trace.json) vs model ms
+- **Tokens/cost:** per-run tokens, recomputed centrally in `eval/lib/pricing.mjs` (don't trust evaluate.mjs's stale table), and **skill value** = how much the learned strategy cheapens a run vs the blind iteration-1 attempt (tests the README's "80%+ reduction" claim)
+
+## Experiment design notes
+
+- **Screen, don't grid.** Vary one axis at a time against `baseline` (5 conditions ship: baseline, inner-haiku, inner-opus, outer-sonnet, outer-prompt-lean). Deep-dive only the interesting 2–3 combos.
+- **Pair comparisons on the same tasks**; live-site variance makes unpaired suite means meaningless. Tier C reports separately.
+- **Trials:** ≥3 per cell for anything you'll make a decision on. `results.jsonl` is append-only — rerun cells freely, the report aggregates.
+- **Cost calibration:** run `pilot` on one Tier A task first and read `inner_cost_usd`/`outer_cost_usd` from `runs/results.jsonl` before launching a sweep.
+
+## Fidelity caveats / roadmap
+
+- The scripted outer agent sees a curated evidence pack (summary, verifier verdict, failed commands), not the full tool-using trace exploration Claude Code does. A Claude-Agent-SDK outer agent with Read/Grep tools is the natural next architecture variant — and would also let `--browser-trace` evidence (unified-events.jsonl) become a sweepable axis.
+- `codegen.mjs --verify` (deterministic script artifact) isn't wired into the matrix yet; its runner protocol is identical to the verifier protocol here, so it slots in as a fourth phase.
+- The local checkout's `judge.mjs` (A/B strategy judge) and `--supervise` watcher are complementary: the judge compares strategy *versions* by run evidence; this harness compares *conditions* by verified outcomes. `supervised` already lands in evaluate.mjs's meta.json and could become another condition axis.
diff --git a/skills/autobrowse/evals/RESULTS.md b/skills/autobrowse/evals/RESULTS.md
@@ -0,0 +1,34 @@
+# Eval results — 2026-06-09 (Fable 5 vs Opus 4.8)
+
+First findings from this harness, comparing `claude-fable-5` and `claude-opus-4-8` in both autobrowse roles. ~200 verified runs, ~$220 API spend. Small n (2–3 trials/cell) — directional, not definitive.
+
+## Headline
+
+**Best configuration tested: Sonnet 4.6 as the inner (browsing) agent + Fable 5 as the outer (strategy-writing) agent.** On the OpenTable task it produced the most reliable *and* cheapest converged runs of any cell — beating even Opus-as-browser — because the expensive model's intelligence lands in `strategy.md` once instead of in every run.
+
+## OpenTable 2×2 (Tier B, Akamai-walled, verified+proxied Browserbase sessions)
+
+| Inner ↓ / Outer → | Opus 4.8 writes | Fable 5 writes |
+|---|---|---|
+| **Sonnet 4.6 browses** | 5/6 holdout, $1.40/run, 90s | **6/6 holdout, $0.96/run, 64s** |
+| **Opus 4.8 browses** | 6/6, $1.20/run, 63s | — |
+| **Fable 5 browses** | 5/6, $1.66/run, 93s | — |
+
+- **Inner axis:** Opus beat Fable as the browser — same convergence (iter 2–3), half the training cost (~$5.5 vs ~$11/trial), perfect holdout. Fable reasons more per turn; at 2× token pricing that compounds (blind iteration-1 attempts: ~$7 vs ~$3).
+- **Outer axis (same Sonnet inner in both):** Fable-authored strategies were more reliable (6/6 vs 5/6) and made the same agent ~30% faster and cheaper. Qualitatively, Fable's skills encode *mechanisms* — React hydration timing ("`wait load` returns before the widget renders; snapshot shows ~2 refs"), Akamai cookie behavior ("`browse stop` wipes cookies → never stop the session"), broken-command landmines ("`wait selector text=...` ETIMEDOUTs") — where Opus's skills describe symptoms. Same pattern appeared on the Tier A fixtures: Fable was the only outer model to identify a deliberately planted 900ms delayed-render trap and prescribe the exact fix.
+- Fable's outer calls cost $0.13 vs Opus's $0.05 per improvement — negligible in absolute terms.
+
+## Tier A fixtures (deterministic local sites)
+
+- All models 100% on the easy task; differentiation is pure cost (Sonnet $0.16/run, Opus $0.60, Fable $0.97). On tasks the cheap model already does, frontier inner agents are pure overhead.
+- On the trap-laden checkout fixture, inner reliability ranked Fable (6/6) > Opus (5/6) > Sonnet (4/6) — monotonic with price. This did **not** generalize to OpenTable, where Opus matched/beat Fable as inner.
+
+## Other observations
+
+- **Zero false-successes in ~200 runs** — no model claimed `success:true` against a failing verifier. Failures were honest (turn-budget exhaustion, no final JSON).
+- **Live-site drift is real:** Akamai blocked every iteration-1 attempt in a morning round and none in an evening round. Only within-round (concurrent, paired) comparisons are valid on live sites.
+- One Fable-cell strategy explicitly reasoned about the grader ("the verifier requires success:true — persist"). Benign here (persistence, not fabrication), but a preview of strategies evolving against the verifier's letter on harder tasks.
+
+## Recommended default
+
+`inner_model: claude-sonnet-4-6`, `outer_model: claude-fable-5`, escalating the inner to Opus only when a task fails to converge because the inner agent can't execute good instructions. Training cost per new skill: ~$1–2; converged verified runs: ~$1.
diff --git a/skills/autobrowse/evals/eval/conditions/baseline.json b/skills/autobrowse/evals/eval/conditions/baseline.json
@@ -0,0 +1,9 @@
+{
+  "id": "baseline",
+  "notes": "Default autobrowse setup: Sonnet inner agent (evaluate.mjs default), Opus outer agent, full SKILL.md-style outer prompt.",
+  "inner_model": "claude-sonnet-4-6",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-default",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/inner-fable-5.json b/skills/autobrowse/evals/eval/conditions/inner-fable-5.json
@@ -0,0 +1,9 @@
+{
+  "id": "inner-fable-5",
+  "notes": "Fable 5 as the INNER browsing agent (vs inner-opus / baseline). Measures raw browsing competence: iteration-1 pass rate, turns, holdout reliability. 2x Opus pricing — watch cost_to_converge.",
+  "inner_model": "claude-fable-5",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-default",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/inner-haiku.json b/skills/autobrowse/evals/eval/conditions/inner-haiku.json
@@ -0,0 +1,9 @@
+{
+  "id": "inner-haiku",
+  "notes": "Cheap inner agent hypothesis: a smart outer agent distills intelligence into strategy.md, so a Haiku inner agent should converge to the same place at a fraction of the cost.",
+  "inner_model": "claude-haiku-4-5",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-default",
+  "max_iters": 7,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/inner-opus.json b/skills/autobrowse/evals/eval/conditions/inner-opus.json
@@ -0,0 +1,9 @@
+{
+  "id": "inner-opus",
+  "notes": "Expensive inner agent: does a frontier inner model converge in fewer iterations, and does that offset its per-run cost?",
+  "inner_model": "claude-opus-4-8",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-default",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/outer-fable-5.json b/skills/autobrowse/evals/eval/conditions/outer-fable-5.json
@@ -0,0 +1,9 @@
+{
+  "id": "outer-fable-5",
+  "notes": "Fable 5 as the OUTER strategy-improver (vs baseline's Opus 4.8). Measures hypothesis-formation quality: convergence speed, regressions, holdout pass rate of the strategies it writes. Outer calls are small, so the 2x pricing barely matters here.",
+  "inner_model": "claude-sonnet-4-6",
+  "outer_model": "claude-fable-5",
+  "outer_prompt": "outer-default",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/outer-prompt-lean.json b/skills/autobrowse/evals/eval/conditions/outer-prompt-lean.json
@@ -0,0 +1,9 @@
+{
+  "id": "outer-prompt-lean",
+  "notes": "Prompt ablation: strip the SKILL.md-style guidance (one-hypothesis rule, build-on-wins, evidence grounding) from the outer prompt. Measures how much of convergence quality comes from the methodology vs the model.",
+  "inner_model": "claude-sonnet-4-6",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-lean",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/outer-sonnet.json b/skills/autobrowse/evals/eval/conditions/outer-sonnet.json
@@ -0,0 +1,9 @@
+{
+  "id": "outer-sonnet",
+  "notes": "Cheaper outer agent: is Opus-level hypothesis formation actually load-bearing, or can Sonnet read traces and improve strategies just as well?",
+  "inner_model": "claude-sonnet-4-6",
+  "outer_model": "claude-sonnet-4-6",
+  "outer_prompt": "outer-default",
+  "max_iters": 5,
+  "holdout_runs": 3
+}
diff --git a/skills/autobrowse/evals/eval/conditions/pilot.json b/skills/autobrowse/evals/eval/conditions/pilot.json
@@ -0,0 +1,9 @@
+{
+  "id": "pilot",
+  "notes": "Cheap smoke-test condition for validating the real (non-mock) pipeline: 2 training iterations max, 1 holdout run.",
+  "inner_model": "claude-sonnet-4-6",
+  "outer_model": "claude-opus-4-8",
+  "outer_prompt": "outer-default",
+  "max_iters": 2,
+  "holdout_runs": 1
+}
diff --git a/skills/autobrowse/evals/eval/config.mjs b/skills/autobrowse/evals/eval/config.mjs
@@ -0,0 +1,61 @@
+import * as fs from "node:fs";
+import * as path from "node:path";
+import { fileURLToPath } from "node:url";
+
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+
+export const ROOT = path.resolve(__dirname, "..");
+export const EVAL_DIR = path.join(ROOT, "eval");
+export const TASKS_DIR = path.join(EVAL_DIR, "tasks");
+export const CONDITIONS_DIR = path.join(EVAL_DIR, "conditions");
+export const PROMPTS_DIR = path.join(EVAL_DIR, "prompts");
+export const FIXTURES_DIR = path.join(ROOT, "fixtures");
+export const RUNS_DIR = path.join(ROOT, "runs");
+export const RESULTS_FILE = path.join(RUNS_DIR, "results.jsonl");
+
+const AUTOBROWSE_CANDIDATES = [
+  process.env.AUTOBROWSE_DIR,
+  path.resolve(ROOT, ".."), // evals/ ships inside the autobrowse skill
+  path.join(ROOT, "vendor", "skills", "skills", "autobrowse"),
+].filter(Boolean);
+
+export function resolveAutobrowseDir() {
+  for (const dir of AUTOBROWSE_CANDIDATES) {
+    if (fs.existsSync(path.join(dir, "scripts", "evaluate.mjs"))) return dir;
+  }
+  throw new Error(
+    "autobrowse skill not found. Set AUTOBROWSE_DIR to the directory containing scripts/evaluate.mjs " +
+    "(e.g. a checkout of github.com/browserbase/skills at skills/autobrowse)."
+  );
+}
+
+export function loadCondition(idOrPath) {
+  const p = idOrPath.endsWith(".json")
+    ? path.resolve(idOrPath)
+    : path.join(CONDITIONS_DIR, `${idOrPath}.json`);
+  const cond = JSON.parse(fs.readFileSync(p, "utf-8"));
+  // Defaults
+  return {
+    max_iters: 5,
+    holdout_runs: 3,
+    converge_window: 3,
+    converge_passes: 2,
+    outer_prompt: "outer-default",
+    browser_trace: false,
+    ...cond,
+  };
+}
+
+export function loadTaskMeta(task) {
+  const p = path.join(TASKS_DIR, task, "meta.json");
+  const meta = JSON.parse(fs.readFileSync(p, "utf-8"));
+  return { env: "local", max_turns: 30, timeout_min: 20, ...meta, task };
+}
+
+export function listTasks() {
+  return fs
+    .readdirSync(TASKS_DIR, { withFileTypes: true })
+    .filter((d) => d.isDirectory() && !d.name.startsWith("_"))
+    .map((d) => d.name)
+    .sort();
+}
diff --git a/skills/autobrowse/evals/eval/lib/extract-output.mjs b/skills/autobrowse/evals/eval/lib/extract-output.mjs
@@ -0,0 +1,47 @@
+import * as fs from "node:fs";
+import * as path from "node:path";
+
+// Pull the last fenced ```json block (or last bare {...}) from text.
+// Mirrors extractFinalJson in the newer evaluate.mjs.
+export function extractJsonFromText(text) {
+  if (!text) return null;
+  const fences = [...text.matchAll(/```(?:json)?\s*([\s\S]*?)```/gi)];
+  let candidate = fences.length ? fences[fences.length - 1][1].trim() : null;
+  if (!candidate) {
+    const first = text.indexOf("{");
+    const last = text.lastIndexOf("}");
+    if (first !== -1 && last > first) candidate = text.slice(first, last + 1);
+  }
+  if (!candidate) return null;
+  try {
+    return JSON.parse(candidate);
+  } catch {
+    return null;
+  }
+}
+
+// Load the inner agent's final structured output from a run's trace dir.
+// Newer evaluate.mjs writes result.json ({parsed, raw, parse_error});
+// fall back to parsing summary.md's "Agent Final Output" section for the
+// upstream version that doesn't.
+export function loadRunOutput(runDir) {
+  const resultPath = path.join(runDir, "result.json");
+  if (fs.existsSync(resultPath)) {
+    try {
+      const r = JSON.parse(fs.readFileSync(resultPath, "utf-8"));
+      if (r && "parsed" in r) return r.parsed;
+      return r;
+    } catch {
+      /* fall through */
+    }
+  }
+  const summaryPath = path.join(runDir, "summary.md");
+  if (fs.existsSync(summaryPath)) {
+    const summary = fs.readFileSync(summaryPath, "utf-8");
+    const idx = summary.indexOf("## Agent Final Output");
+    const tail = idx === -1 ? summary : summary.slice(idx);
+    const parsed = extractJsonFromText(tail);
+    if (parsed) return parsed;
+  }
+  return null;
+}
diff --git a/skills/autobrowse/evals/eval/lib/pricing.mjs b/skills/autobrowse/evals/eval/lib/pricing.mjs
@@ -0,0 +1,19 @@
+// Central pricing table — single source of truth for the whole harness.
+// USD per 1M tokens [input, output]. Do not trust per-script tables elsewhere
+// (evaluate.mjs has its own stale copy; we recompute from raw token counts).
+const PRICING = [
+  ["claude-fable-5", [10, 50]],
+  ["claude-opus-4-8", [5, 25]],
+  ["claude-opus-4-7", [5, 25]],
+  ["claude-opus-4-6", [5, 25]],
+  ["claude-opus-4-5", [5, 25]],
+  ["claude-sonnet-4-6", [3, 15]],
+  ["claude-sonnet-4-5", [3, 15]],
+  ["claude-haiku-4-5", [1, 5]],
+];
+
+export function costUsd(model, tokensIn, tokensOut) {
+  const entry = PRICING.find(([prefix]) => model?.startsWith(prefix));
+  const [inRate, outRate] = entry ? entry[1] : [3, 15];
+  return (tokensIn * inRate + tokensOut * outRate) / 1_000_000;
+}
diff --git a/skills/autobrowse/evals/eval/lib/results.mjs b/skills/autobrowse/evals/eval/lib/results.mjs
@@ -0,0 +1,29 @@
+import * as fs from "node:fs";
+import * as path from "node:path";
+import { RESULTS_FILE } from "../config.mjs";
+
+// One results.jsonl record per inner-agent run (training iteration, holdout
+// run) — append-only, everything downstream is a query over this file.
+//
+// Schema (all rows):
+//   ts, condition_id, task, tier, trial, phase: "train"|"holdout", iter,
+//   run_id, env, inner_model, outer_model,
+//   verified_pass, claimed_success, false_success, verifier_reason,
+//   status, stop_reason, turns, duration_sec, browser_ms, model_ms,
+//   tool_calls, tool_errors, tokens_in, tokens_out, inner_cost_usd,
+//   outer_tokens_in, outer_tokens_out, outer_cost_usd, hypothesis,
+//   converged_at (train rows on the converging iteration), mock
+
+export function appendResult(record, file = RESULTS_FILE) {
+  fs.mkdirSync(path.dirname(file), { recursive: true });
+  fs.appendFileSync(file, JSON.stringify({ ts: new Date().toISOString(), ...record }) + "\n");
+}
+
+export function readResults(file = RESULTS_FILE) {
+  if (!fs.existsSync(file)) return [];
+  return fs
+    .readFileSync(file, "utf-8")
+    .split("\n")
+    .filter(Boolean)
+    .map((line) => JSON.parse(line));
+}