This directory contains the frozen v1 Day.js task pack plus the AgentV harness
for the public swe-evals companion project.
evals/dayjs-v1.eval.yaml runs three Multi-SWE-bench Day.js tasks. For each
test case, the setup hook:
- clones
https://github.com/iamkun/dayjsintorepo/ - checks out the task
previous_commit - applies the reviewed Multi-SWE-bench
test_patchfrompatches/ - runs
npm install --no-audit --no-fund - commits the prepared benchmark state so AgentV captures only agent changes
The code grader runs the focused Jest command for the task after the agent finishes. The v1 score is deterministic: focused command green is pass, non-zero is fail.
The eval defines three target aliases:
baselinecompound-engineeringsuperpowers
All three use the same selected task commits and delegate to AGENT_TARGET,
so Codex/Pi switching does not require editing eval YAML. The variants are
defined in the eval file's execution.targets; run the eval normally and set
AGENT_TARGET to choose the underlying provider:
cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml
AGENT_TARGET=pi GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yamlRun all variants:
cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yamlValidate harness wiring without a live provider:
cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure \
bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml \
--test-id dayjs-year-format-leading-zeroes \
--dry-run \
--threshold 0--threshold 0 is intentional for dry-run validation: the mocked provider does
not edit Day.js, while the deterministic code grader still runs the focused
Jest command against the prepared red state.
The repository does not contain provider secrets, result-sync credentials, or
Bitwarden output. The setup and grading scripts run external Day.js
install/test commands with a minimal child-process environment containing only
CI, HOME, PATH, and npm cache/audit/fund settings.
The v1 pack uses Multi-SWE-bench Day.js tasks rather than ad hoc examples. Day.js was selected because it is public, small enough for local demo checkouts, JavaScript-based, and has multiple benchmark rows with clear previous commits, public issue statements, test patches, and focused fail-to-pass test files.
The frozen task metadata is in tasks/dayjs-v1.yaml. Do not change selected
task metadata unless validation proves it wrong.