Add confidence intervals to main code#130

Open
lindsaydbrin wants to merge 10 commits into main from pr/lindsay/add_CIs

Conversation

@lindsaydbrin
Collaborator

@lindsaydbrin lindsaydbrin commented May 29, 2026

Every eva metrics run now emits 95% bootstrap confidence intervals for composite and main metrics (not sub-metrics) in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical bounds) and independent across runs.

Scope (purely additive — no existing field renamed or changed):

  • overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} - for 6 composites: [EVA-A, EVA-X, EVA-overall] x [_pass, _mean]
  • overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper} - for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical has no CIs because it is a deterministic transform.
  • per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} for all metrics (currently: task_completion, faithfulness, agent_speech_fidelity, conversation_progression, turn_taking, conciseness).
  • per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (k>1 only).
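
As a rough illustration of the field patterns listed above, the sketch below expands them into concrete key names. The exact composite spelling (for example `EVA-A_pass`) is an assumption read off the `[EVA-A, EVA-X, EVA-overall] x [_pass, _mean]` notation, and the helper names are hypothetical, not part of eva.

```python
from itertools import product

# Assumption: composite names are the base name followed directly by the suffix.
COMPOSITES = [
    f"{base}{suffix}"
    for base, suffix in product(("EVA-A", "EVA-X", "EVA-overall"), ("_pass", "_mean"))
]

# Field names taken from the scope list in this PR description.
MEAN_CI_FIELDS = ("mean_ci_lower", "mean_ci_upper", "mean_ci_n_scenarios")
PASS_K_STATS = ("pass_at_1", "pass_at_k", "pass_power_k_observed")


def expected_pass_k_ci_fields() -> list[str]:
    """Expand the {stat}_ci_{lower,upper} pattern for the CI-bearing pass_k stats."""
    return [f"{stat}_ci_{bound}" for stat in PASS_K_STATS for bound in ("lower", "upper")]


def missing_mean_ci_fields(entry: dict) -> list[str]:
    """Return the mean-CI keys absent from one overall_scores or per_metric entry."""
    return [key for key in MEAN_CI_FIELDS if key not in entry]
```

A consumer of metrics_summary.json could use a check like `missing_mean_ci_fields` to tell a 2.1.0 summary from an older one without relying on the version string.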

Statistical design (matches the approach in analysis/eva-bench-stats/ and in the paper): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from run_dir.name via SHA-256 (cross-process-stable).
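
A minimal sketch of a percentile bootstrap on per-scenario means, for readers unfamiliar with the method. It uses only the standard library and a simple rank-based quantile; the function name is hypothetical, and the real module may use NumPy and an interpolating quantile, so bounds from this sketch would not necessarily match eva's output.

```python
import random
from statistics import fmean

N_BOOT = 2000  # values stated in this PR
ALPHA = 0.05


def percentile_bootstrap_ci(scenario_means, *, seed, n_boot=N_BOOT, alpha=ALPHA):
    """Return (lower, upper) percentile-bootstrap bounds for the mean.

    Returns (None, None) when there are no scenarios to resample.
    """
    values = list(scenario_means)
    if not values:
        return None, None
    rng = random.Random(seed)  # a seeded generator makes the bounds reproducible
    n = len(values)
    # Resample scenarios with replacement and record the mean of each resample.
    boot_means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    # Rank-based quantiles (no interpolation): an illustrative simplification.
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return boot_means[lo_idx], boot_means[hi_idx]
```

Calling it twice with the same `seed` returns identical bounds, which is the within-run determinism property described above.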

lindsaydbrin and others added 9 commits May 29, 2026 11:59
New pure-Python module providing percentile bootstrap CI primitives:
bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a
SHA-stable run_seed for cross-process-deterministic per-run seeding.
Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe
to use from anywhere in the package.

13 unit tests cover the primitives plus a cross-process determinism
check that guards against accidental use of Python's salted hash().
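
For context on the hash() concern: Python salts the built-in string hash per process, so it cannot serve as a reproducible seed. One way a SHA-based seed could look is sketched below. Only the use of SHA-256 on the run name comes from this PR; how the digest is reduced to an integer is an assumption.

```python
import hashlib


def run_seed(run_name: str) -> int:
    """Map a run directory name to a seed that is stable across processes."""
    digest = hashlib.sha256(run_name.encode("utf-8")).digest()
    # Assumption: keep the first 4 bytes so the seed fits in 32 bits.
    return int.from_bytes(digest[:4], "big")
```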

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar
in the aggregation layer:

- compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on
  every composite entry (pass/derived composites get the CI on mean;
  success_rate stays bare).
- _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1,
  pass_at_k, and pass_power_k_observed (theoretical stays bare as a
  deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a
  follow-up commit.

Bootstrap unit is the scenario, not the trial: two new private helpers
(_scenario_means_for_metric, _scenario_values_for_composite) collapse
multi-trial records to one value per scenario before resampling. For
k=1 runs each record is its own scenario.
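
A rough sketch of the collapse step, assuming trials arrive as `(scenario_id, score)` pairs. How eva derives a scenario ID from a record ID is not shown in this PR, so that input shape is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def scenario_means(trial_scores):
    """Collapse (scenario_id, score) pairs to one mean per scenario."""
    by_scenario = defaultdict(list)
    for scenario_id, score in trial_scores:
        by_scenario[scenario_id].append(score)
    # One value per scenario: this is the unit the bootstrap resamples.
    return {sid: fmean(scores) for sid, scores in by_scenario.items()}
```

With one trial per scenario the output has the same values as the input, which matches the k=1 case described above.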

Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs,
and TestRunLevelPassKCIs covering field shape, point-estimate bracketing,
seed determinism, and null-CI handling for empty-data composites.

Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends MetricsRunner to populate the per-metric half of the CI schema
and wire the run-dependent seed end-to-end:

- _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios
  on every per-metric entry and stat_ci_lower/upper inside pass_k
  sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name)
  once and thread it through both aggregators, so re-running aggregate-only
  on the same run yields byte-identical CIs and different runs get
  independent Monte-Carlo noise.

Per-metric aggregates reuse aggregation._scenario_means_for_metric to
collapse trials before bootstrapping; pass_k blocks share the
assign_bootstrap_cis helper with the composite path.

Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration
covering per-metric field shape, null-CI handling, same-seed
byte-identity, and across-run independence.

(metrics_version bumped to 2.1.0 in the preceding commit; --no-verify
used to skip the per-commit version-bump reminder.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract assign_mean_ci helper to utils.bootstrap, mirroring
  assign_bootstrap_cis. Apply at both mean-CI sites
  (compute_run_level_aggregates and _build_per_metric_aggregates):
  each call site shrinks from 11 lines to 2. bootstrap_ci is no
  longer imported in aggregation.py or runner.py.
- Drop the leading underscore from scenario_means_for_metric since
  it is the cross-module surface imported by runner.py.
  _scenario_values_for_composite stays underscored (internal only).
- Hoist _make_clean_records to module level in test_aggregation.py;
  the identical helper was duplicated in TestRunLevelCompositeCIs
  and TestRunSeedIntegration.

No behavior or output-schema change. 424/424 tests pass.
(--no-verify skips the per-commit version-bump reminder; version
was bumped 2.0.0 -> 2.1.0 in d34ae59 and is unchanged here.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces in-place mutating helpers with pure functions that return
the CI fields, leaving the merge to the caller:

- assign_bootstrap_cis(target, samples, *, seed) -> None
  becomes
  bootstrap_ci_fields(samples, *, seed) -> dict
- assign_mean_ci(target, scenario_values, *, seed) -> None
  becomes
  mean_ci_fields(scenario_values, *, seed) -> dict

Call sites now read top-to-bottom in data-flow order:

  entry.update(mean_ci_fields(scenario_values, seed=seed))
  entry.update(bootstrap_ci_fields({...}, seed=seed))

instead of relying on a hidden side-effect from the helper. The
helpers are also trivially testable in isolation without
constructing a target dict.
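
To illustrate the pure-function pattern, here is a sketch of a helper with the `mean_ci_fields(scenario_values, *, seed)` signature from this commit. The body is an assumption (a plain standard-library percentile bootstrap, with `mean_ci_n_scenarios` set to 0 for empty input), not the code in utils.bootstrap.

```python
import random
from statistics import fmean


def mean_ci_fields(scenario_values, *, seed, n_boot=2000, alpha=0.05):
    """Return the three mean-CI fields as a new dict; the caller merges them."""
    values = list(scenario_values)
    if not values:
        # Assumption: null bounds and a zero count when nothing can be resampled.
        return {"mean_ci_lower": None, "mean_ci_upper": None, "mean_ci_n_scenarios": 0}
    rng = random.Random(seed)
    boots = sorted(fmean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    return {
        "mean_ci_lower": boots[int(alpha / 2 * (n_boot - 1))],
        "mean_ci_upper": boots[int((1 - alpha / 2) * (n_boot - 1))],
        "mean_ci_n_scenarios": len(values),
    }


# The merge is explicit at the call site instead of hidden inside the helper.
entry = {"mean": 0.5}
entry.update(mean_ci_fields([0.0, 1.0, 0.5, 0.5], seed=7))
```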

No behavior change. 424/424 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestRunLevelCompositeCIs.test_within_run_determinism and
test_different_seeds_differ were strictly subsumed by
TestRunSeedIntegration.test_within_run_byte_identical and
test_across_run_independence respectively — the integration
versions exercise the same code paths through run_seed() and
also check full-dict equality / point-estimate stability that
the simpler versions skipped.

Removing them also drops an awkward seed=13 empirical-choice
hack (n=20 bimodal fixture is too stable for arbitrary seed
pairs) that was only needed in the subsumed test.

Net -19 lines, -1 awkward seed hack, no coverage loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The docstring was the only outstanding ruff failure on CI; D205
isn't auto-fixable by ruff so a manual blank line is needed
between the summary line and the body.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BASE_SEED used to play two roles: a component of run_seed (already
removed in commit f6f3fae when run_seed was simplified to ignore
base_seed), and a default value for bootstrap_ci / the three
aggregator entry points. The second role was a silent trap:
production always supplies seed = run_seed(run_dir.name), so the
default was never exercised in production but would have produced
correct-looking-but-wrong (because cross-run-correlated) CIs if a
caller forgot to pass one.

Changes:
- Remove BASE_SEED constant from src/eva/utils/bootstrap.py.
- Make seed a keyword-only required argument on bootstrap_ci,
  compute_run_level_aggregates, _compute_aggregate_pass_k, and
  _build_per_metric_aggregates.
- Add explicit seed=42 to the 10 pre-existing test callers that
  weren't passing a seed (their tests don't check CI behavior;
  this is purely satisfying the now-required signature).
- Rename test_defaults_match_module_constants ->
  test_n_boot_and_alpha_defaults_match_module_constants and
  drop the BASE_SEED check.
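
The effect of a keyword-only argument without a default can be seen with a small stand-in. The function name below is hypothetical and the body is a placeholder; only the calling convention mirrors the change described here.

```python
def bootstrap_ci_signature_demo(values, *, seed):
    """Stand-in with the same calling convention: seed is keyword-only, no default."""
    return len(list(values)), seed


forgotten_seed_error = None
try:
    bootstrap_ci_signature_demo([0.1, 0.9])  # seed forgotten
except TypeError as exc:
    # Python rejects the call up front, so no silently shared seed is possible.
    forgotten_seed_error = str(exc)
```

Passing the seed positionally fails the same way, so every call site has to spell out `seed=...`.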

Production behavior unchanged. 422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds three words to the seed kwarg docstrings so the rationale for
the run_seed(run_dir.name) recommendation is visible at the call
site: CIs need to be stable across re-computation of the same run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/eva/__init__.py
# Bump metrics_version when changes affect metric computation (metrics code,
# judge prompts, pricing tables, postprocessor).
-metrics_version = "2.0.0"
+metrics_version = "2.1.0"
Collaborator Author

Is this correct since it changes output fields?

…s_for_composite

Both functions differed only in the value extractor; factor the shared
grouping + averaging into a private _scenario_means(per_record_values)
helper. Each public function becomes a one-line wrapper that builds
the dict via comprehension. No new imports (no Callable needed).

Also clean up two docstring touches in bootstrap_ci:
- "95% percentile bootstrap CI" -> "95% bootstrap CI" (the percentile
  method is established at module level; "percentile" alongside "95%"
  reads as redundant).
- "so behavior is defined" -> "so behavior is deterministic" (more
  precise about what the seed gives you).

422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lindsaydbrin lindsaydbrin marked this pull request as ready for review June 1, 2026 18:26
@lindsaydbrin
Collaborator Author

@tara-servicenow asked about handling missing/errored samples. Here's Claude's explanation:

How missing/errored trials are handled in CI computation
CI computation uses get_score() to filter raw trial data: errored trials (score.error set) and missing values return None and are dropped. Valid trials are grouped by scenario ID and averaged to produce one value per scenario; scenarios where every trial is None are dropped entirely. The bootstrap CI is then computed over those per-scenario means.

This matches what the main metric aggregation does — both exclude errors and missing values before computing statistics — with two differences worth noting:

  1. Granularity. The main loop counts error_count and missing_count separately and tracks totals in the output. The CI path collapses both to None with no separate tracking; that distinction is only visible in the main aggregation stats.
  2. Unit of analysis. The main loop treats each record/trial as an independent data point for the mean. The CI groups trials by scenario first and averages them, so each scenario contributes one value regardless of how many trials it has. This is the statistically correct unit for the bootstrap — a design choice, not a discrepancy.
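
A small worked example of the two points above, using made-up trial data in which `None` stands for an errored or missing trial. The data and variable names are hypothetical.

```python
from statistics import fmean

# Hypothetical trials: (scenario_id, score). s3 has no valid trial at all.
trials = [("s1", 1.0), ("s1", 1.0), ("s1", None), ("s2", 0.0), ("s3", None)]

# Both paths drop errored/missing values before computing anything.
valid = [(sid, score) for sid, score in trials if score is not None]

# Trial-level mean: every valid trial is one data point.
trial_level_mean = fmean(score for _, score in valid)

# Scenario-level values: one mean per scenario with at least one valid trial.
per_scenario = {}
for sid, score in valid:
    per_scenario.setdefault(sid, []).append(score)
scenario_values = {sid: fmean(scores) for sid, scores in per_scenario.items()}
scenario_level_mean = fmean(scenario_values.values())
```

Here the trial-level mean is 2/3 because s1 contributes two valid trials, while the scenario-level mean is 0.5 because s1 and s2 each contribute one value and s3 is dropped entirely.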

You can see the get_score() call on line 116 of aggregation.py:

    return _scenario_means(
        {record_id: record_metrics.get_score(metric_name) for record_id, record_metrics in all_metrics.items()}
    )

and then lines 96-97 in _scenario_means():

        if val is None:
            continue

@gabegma does this make sense to you? @tara-servicenow mentioned that you had thought about handling None vs. errors.

@tara-servicenow
Collaborator

This looks good to me but will let Gab review
