swe-evals Day.js Harness

This directory contains the frozen v1 Day.js task pack plus the AgentV harness for the public swe-evals companion project.

What Runs

evals/dayjs-v1.eval.yaml runs three Multi-SWE-bench Day.js tasks. For each test case, the setup hook:

clones https://github.com/iamkun/dayjs into repo/
checks out the task previous_commit
applies the reviewed Multi-SWE-bench test_patch from patches/
runs npm install --no-audit --no-fund
commits the prepared benchmark state so AgentV captures only agent changes
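
The setup steps above can be sketched as a single shell function. This is a minimal illustration, not the repository's actual hook: the function name prepare_task, its argument order, the commit message, the harness commit identity, and the INSTALL_CMD override (added only so the sketch can be exercised offline) are all assumptions.

```shell
# Hypothetical sketch of the setup hook; names and arguments are assumptions.
prepare_task() {
  local repo_url="$1"         # e.g. https://github.com/iamkun/dayjs
  local previous_commit="$2"  # the task's previous_commit
  local test_patch="$3"       # absolute path to the reviewed test patch
  local dest="${4:-repo}"     # checkout directory
  # Hypothetical override; defaults to the install command named in the README.
  local install_cmd="${INSTALL_CMD:-npm install --no-audit --no-fund}"

  # Each step must succeed before the next one runs.
  git clone --quiet "$repo_url" "$dest" &&
    git -C "$dest" checkout --quiet "$previous_commit" &&
    git -C "$dest" apply "$test_patch" &&
    (cd "$dest" && sh -c "$install_cmd") &&
    git -C "$dest" add -A &&
    git -C "$dest" -c user.name=harness -c user.email=harness@example.invalid \
      commit --quiet -m "Prepare benchmark state"
}
```

Committing the prepared state last means a later diff against HEAD shows only what the agent changed, which is the property the final setup step is after. The patch path is passed as an absolute path because git -C changes directory before the patch is read.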

The code grader runs the focused Jest command for the task after the agent finishes. The v1 score is deterministic: the task passes if the focused command exits with status 0 and fails on any non-zero exit status.
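
The pass/fail rule can be illustrated with a small shell function. This is a sketch under stated assumptions: grade_task is a hypothetical name, and the real grader may report scores in a different shape than the plain pass/fail strings used here.

```shell
# Hypothetical grader sketch: run the focused command and map its exit
# status to a verdict. "$@" is the focused test command for the task.
grade_task() {
  if "$@" >/dev/null 2>&1; then
    echo "pass"   # exit status 0
  else
    echo "fail"   # any non-zero exit status
  fi
}
```

Because the verdict depends only on the exit status, two runs over the same repository state give the same result.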

Runtime Variants

The eval defines three target aliases:

baseline
compound-engineering
superpowers

All three use the same selected task commits and delegate to AGENT_TARGET, so switching between Codex and Pi does not require editing the eval YAML. The variants are defined in the eval file's execution.targets; run the eval normally and set AGENT_TARGET to choose the underlying provider:

cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml
AGENT_TARGET=pi GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml

Run all variants:

cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml

Validate harness wiring without a live provider:

cd swe-evals
AGENT_TARGET=codex GRADER_TARGET=azure \
  bun ../agentv/apps/cli/src/cli.ts eval evals/dayjs-v1.eval.yaml \
    --test-id dayjs-year-format-leading-zeroes \
    --dry-run \
    --threshold 0

--threshold 0 is intentional for dry-run validation: the mocked provider does not edit Day.js, but the deterministic code grader still runs the focused Jest command against the prepared red state, so the focused tests are expected to fail.

Secrets Boundary

The repository does not contain provider secrets, result-sync credentials, or Bitwarden output. The setup and grading scripts run external Day.js install/test commands with a minimal child-process environment containing only CI, HOME, PATH, and npm cache/audit/fund settings.
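
One way to build such a minimal environment is env -i, which starts the child process with an empty environment plus only the variables listed. The sketch below is an assumption about how this could look, not the scripts' actual code: run_minimal and NPM_CACHE_DIR are hypothetical names, and the exact npm variables the real scripts set may differ. npm reads configuration from npm_config_-prefixed environment variables.

```shell
# Hypothetical sketch: run a command with an allowlisted environment only.
run_minimal() {
  env -i \
    CI=true \
    HOME="$HOME" \
    PATH="$PATH" \
    npm_config_cache="${NPM_CACHE_DIR:-$HOME/.npm}" \
    npm_config_audit=false \
    npm_config_fund=false \
    "$@"
}
```

With this wrapper, a call such as run_minimal npm install does not inherit provider keys or other variables exported in the parent shell.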

Source Selection

The v1 pack uses Multi-SWE-bench Day.js tasks rather than ad hoc examples. Day.js was selected because it is public, small enough for local demo checkouts, JavaScript-based, and has multiple benchmark rows with clear previous commits, public issue statements, test patches, and focused fail-to-pass test files.

The frozen task metadata is in tasks/dayjs-v1.yaml. Do not change selected task metadata unless validation proves it wrong.
