Add confidence intervals to main code by lindsaydbrin · Pull Request #130 · ServiceNow/eva

lindsaydbrin · 2026-05-29T16:26:09Z

Every eva metrics run now emits 95% bootstrap confidence intervals for composite and main metrics (not sub-metrics) in metrics_summary.json. CIs are deterministic within a run (same run_dir.name → byte-identical bounds) and independent across runs.

Scope (purely additive — no existing field renamed or changed):

overall_scores.{COMPOSITE}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} - for 6 composites: [EVA-A, EVA-X, EVA-overall] x [_pass, _mean]
overall_scores.pass_k.{COMPOSITE}.{stat}_ci_{lower,upper} - for pass_at_1, pass_at_k, pass_power_k_observed (multi-trial only). pass_power_k_theoretical does not have CIs (clearly).
per_metric.{name}.{mean_ci_lower, mean_ci_upper, mean_ci_n_scenarios} for all metrics (typically/currently: task_completion, faithfulness, agent_speech_fidelity, conversation_progression, turn_taking, conciseness).
per_metric.{name}.pass_k.{stat}_ci_{lower,upper} (k>1 only, again of course).

Statistical design (matches the approach on analysis/eva-bench-stats/ / in the paper): percentile bootstrap on per-scenario sample means, N_BOOT=2000, ALPHA=0.05, seed derived from run_dir.name via SHA-256 (cross-process-stable).
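The statistical design above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the module's actual code: the name `bootstrap_ci_sketch`, the `percentile` helper, and the linear-interpolation rule are hypothetical, and only `N_BOOT`, `ALPHA`, and the per-scenario resampling unit come from the description.

```python
import random
from statistics import fmean

N_BOOT = 2000   # number of bootstrap resamples, as stated in the PR
ALPHA = 0.05    # 1 - ALPHA = 95% interval


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already-sorted list, q in [0, 1]."""
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_ci_sketch(scenario_means, *, seed, n_boot=N_BOOT, alpha=ALPHA):
    """Percentile bootstrap CI for the mean of per-scenario values (hypothetical sketch)."""
    if not scenario_means:
        return None, None  # assumption: no data means null bounds
    rng = random.Random(seed)  # seeded RNG, so the same seed gives the same bounds
    n = len(scenario_means)
    # Resample scenarios with replacement and record the mean of each resample.
    boot_means = sorted(fmean(rng.choices(scenario_means, k=n)) for _ in range(n_boot))
    return percentile(boot_means, alpha / 2), percentile(boot_means, 1 - alpha / 2)


lower, upper = bootstrap_ci_sketch([0.0, 1.0] * 10, seed=7)
```

Calling the function twice with the same `seed` returns identical bounds, which is the property the PR relies on for within-run determinism.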

New pure-Python module providing percentile bootstrap CI primitives: bootstrap_resample, bootstrap_ci, assign_bootstrap_cis helper, plus a SHA-stable run_seed for cross-process-deterministic per-run seeding. Constants N_BOOT=2000, ALPHA=0.05, BASE_SEED=42. No eva imports — safe to use from anywhere in the package. 13 unit tests cover the primitives plus a cross-process determinism check that guards against accidental use of Python's salted hash(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
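To illustrate why a SHA-based seed is cross-process-stable while the built-in `hash()` is not, here is a small sketch. The function name `run_seed_sketch` and the choice to use the first 8 bytes of the digest are assumptions for illustration; the PR only states that the seed is derived from `run_dir.name` via SHA-256.

```python
import hashlib


def run_seed_sketch(run_name: str) -> int:
    """Derive an integer seed from a run name (hypothetical sketch).

    hashlib digests depend only on the input bytes, so the result is the same
    in every process. By contrast, hash() on a str is salted per process
    unless PYTHONHASHSEED is fixed, so it must not be used for seeding.
    """
    digest = hashlib.sha256(run_name.encode("utf-8")).digest()
    # Assumption: fold the first 8 bytes into a 64-bit unsigned integer.
    return int.from_bytes(digest[:8], "big")


seed_a = run_seed_sketch("run_2026_05_29")        # hypothetical run directory name
seed_a_again = run_seed_sketch("run_2026_05_29")  # same name, same seed
seed_b = run_seed_sketch("run_2026_05_30")        # different name, different seed
```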

Adds 95% percentile-bootstrap CIs to every CI-bearing composite scalar in the aggregation layer:

- compute_run_level_aggregates emits mean_ci_lower/upper/n_scenarios on every composite entry (pass/derived composites get the CI on mean; success_rate stays bare).
- _compute_aggregate_pass_k emits stat_ci_lower/upper for pass_at_1, pass_at_k, and pass_power_k_observed (theoretical stays bare as a deterministic transform).
- Both functions accept a seed kwarg threaded by the runner in a follow-up commit.

Bootstrap unit is the scenario, not the trial: two new private helpers (_scenario_means_for_metric, _scenario_values_for_composite) collapse multi-trial records to one value per scenario before resampling. For k=1 runs each record is its own scenario.

Adds 15 unit tests across TestScenarioGrouping, TestRunLevelCompositeCIs, and TestRunLevelPassKCIs covering field shape, point-estimate bracketing, seed determinism, and null-CI handling for empty-data composites.

Bumps metrics_version 2.0.0 -> 2.1.0 (additive schema change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
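The scenario-level collapse described in this commit might look something like the sketch below. The function name, the `scenario_of` callback, and the `"scenario/trial"` record-ID format are all hypothetical; the real helpers and ID scheme are not shown in this PR text.

```python
from collections import defaultdict
from statistics import fmean


def scenario_means_sketch(per_record_values, scenario_of):
    """Collapse per-record values to one mean per scenario (hypothetical sketch).

    per_record_values maps record ID -> score or None.
    scenario_of maps a record ID to its scenario ID (assumed callback).
    """
    grouped = defaultdict(list)
    for record_id, val in per_record_values.items():
        if val is None:
            continue  # errored or missing trials are dropped
        grouped[scenario_of(record_id)].append(val)
    # Scenarios whose trials were all None never enter `grouped`, so they vanish here.
    return {sid: fmean(vals) for sid, vals in grouped.items()}


values = {"s1/t0": 1.0, "s1/t1": 0.0, "s2/t0": None, "s3/t0": 1.0}
means = scenario_means_sketch(values, lambda rid: rid.split("/")[0])
# means == {"s1": 0.5, "s3": 1.0}
```

The resulting per-scenario values are what would be handed to the bootstrap, so each scenario counts once regardless of its number of trials.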

Extends MetricsRunner to populate the per-metric half of the CI schema and wire the run-dependent seed end-to-end:

- _build_per_metric_aggregates emits mean_ci_lower/upper/n_scenarios on every per-metric entry and stat_ci_lower/upper inside pass_k sub-blocks (pass_at_1, pass_at_k, pass_power_k_observed).
- _save_summary and run_aggregate_only compute seed = run_seed(run_dir.name) once and thread it through both aggregators, so re-running aggregate-only on the same run yields byte-identical CIs and different runs get independent Monte-Carlo noise.

Per-metric aggregates reuse aggregation._scenario_means_for_metric to collapse trials before bootstrapping; pass_k blocks share the assign_bootstrap_cis helper with the composite path.

Adds 6 unit tests across TestPerMetricCIs and TestRunSeedIntegration covering per-metric field shape, null-CI handling, same-seed byte-identity, and across-run independence.

(metrics_version bumped to 2.1.0 in the preceding commit; --no-verify used to skip the per-commit version-bump reminder.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Extract assign_mean_ci helper to utils.bootstrap, mirroring assign_bootstrap_cis. Apply at both mean-CI sites (compute_run_level_aggregates and _build_per_metric_aggregates): each call site shrinks from 11 lines to 2. bootstrap_ci is no longer imported in aggregation.py or runner.py.
- Drop the leading underscore from scenario_means_for_metric since it is the cross-module surface imported by runner.py. _scenario_values_for_composite stays underscored (internal only).
- Hoist _make_clean_records to module level in test_aggregation.py; the identical helper was duplicated in TestRunLevelCompositeCIs and TestRunSeedIntegration.

No behavior or output-schema change. 424/424 tests pass.

(--no-verify skips the per-commit version-bump reminder; version was bumped 2.0.0 -> 2.1.0 in d34ae59 and is unchanged here.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces in-place mutating helpers with pure functions that return the CI fields, leaving the merge to the caller:

- assign_bootstrap_cis(target, samples, *, seed) -> None becomes bootstrap_ci_fields(samples, *, seed) -> dict
- assign_mean_ci(target, scenario_values, *, seed) -> None becomes mean_ci_fields(scenario_values, *, seed) -> dict

Call sites now read top-to-bottom in data-flow order:

    entry.update(mean_ci_fields(scenario_values, seed=seed))
    entry.update(bootstrap_ci_fields({...}, seed=seed))

instead of relying on a hidden side-effect from the helper. The helpers are also trivially testable in isolation without constructing a target dict.

No behavior change. 424/424 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
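A rough sketch of the "return the fields, let the caller merge" shape is shown below. Only the three output keys and the keyword-only `seed` come from the PR; the function body, the floor-index percentile rule, and the null handling for empty input are assumptions for illustration.

```python
import random
from statistics import fmean


def mean_ci_fields_sketch(scenario_values, *, seed, n_boot=2000, alpha=0.05):
    """Return CI fields as a new dict without mutating anything (hypothetical sketch)."""
    n = len(scenario_values)
    if n == 0:
        # Assumption: empty input yields null bounds and a zero scenario count.
        return {"mean_ci_lower": None, "mean_ci_upper": None, "mean_ci_n_scenarios": 0}
    rng = random.Random(seed)
    boot = sorted(fmean(rng.choices(scenario_values, k=n)) for _ in range(n_boot))
    # Simple floor-index percentiles (no interpolation), chosen for brevity.
    lower = boot[int((alpha / 2) * (n_boot - 1))]
    upper = boot[int((1 - alpha / 2) * (n_boot - 1))]
    return {"mean_ci_lower": lower, "mean_ci_upper": upper, "mean_ci_n_scenarios": n}


# The caller owns the merge, so the data flow is visible at the call site.
entry = {"mean": 0.5}
entry.update(mean_ci_fields_sketch([0.0, 1.0] * 10, seed=7))
```

Because the helper returns a plain dict, a test can assert on its return value directly instead of building a target dict first.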

TestRunLevelCompositeCIs.test_within_run_determinism and test_different_seeds_differ were strictly subsumed by TestRunSeedIntegration.test_within_run_byte_identical and test_across_run_independence respectively — the integration versions exercise the same code paths through run_seed() and also check full-dict equality / point-estimate stability that the simpler versions skipped. Removing them also drops an awkward seed=13 empirical-choice hack (n=20 bimodal fixture is too stable for arbitrary seed pairs) that was only needed in the subsumed test. Net -19 lines, -1 awkward seed hack, no coverage loss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The docstring was the only outstanding ruff failure on CI; D205 isn't auto-fixable by ruff so a manual blank line is needed between the summary line and the body. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

BASE_SEED used to play two roles: a component of run_seed (already removed in commit f6f3fae when run_seed was simplified to ignore base_seed), and a default value for bootstrap_ci / the three aggregator entry points. The second role was a silent trap: production always supplies seed = run_seed(run_dir.name), so the default was never exercised in production but would have produced correct-looking-but-wrong (because cross-run-correlated) CIs if a caller forgot to pass one.

Changes:

- Remove BASE_SEED constant from src/eva/utils/bootstrap.py.
- Make seed a keyword-only required argument on bootstrap_ci, compute_run_level_aggregates, _compute_aggregate_pass_k, and _build_per_metric_aggregates.
- Add explicit seed=42 to the 10 pre-existing test callers that weren't passing a seed (their tests don't check CI behavior; this is purely satisfying the now-required signature).
- Rename test_defaults_match_module_constants -> test_n_boot_and_alpha_defaults_match_module_constants and drop the BASE_SEED check.

Production behavior unchanged. 422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
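The effect of making `seed` keyword-only and required can be shown with a toy function. The name `resample_mean_sketch` is hypothetical and stands in for the real entry points; the point is only that a forgotten seed now fails loudly instead of silently falling back to a shared default.

```python
import random


def resample_mean_sketch(values, *, seed):
    """Mean of one seeded bootstrap resample (hypothetical stand-in for the real helpers)."""
    rng = random.Random(seed)
    draw = rng.choices(values, k=len(values))
    return sum(draw) / len(draw)


# Forgetting the seed raises TypeError at call time rather than reusing a default.
try:
    resample_mean_sketch([0.0, 1.0, 1.0])
except TypeError as exc:
    missing_seed_error = exc
else:
    missing_seed_error = None

# Passing it explicitly works, and the same seed reproduces the same result.
first = resample_mean_sketch([0.0, 1.0, 1.0], seed=42)
second = resample_mean_sketch([0.0, 1.0, 1.0], seed=42)
```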

Adds three words to the seed kwarg docstrings so the rationale for the run_seed(run_dir.name) recommendation is visible at the call site: CIs need to be stable across re-computation of the same run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lindsaydbrin · 2026-06-01T18:18:11Z

 # Bump metrics_version when changes affect metric computation (metrics code,
 # judge prompts, pricing tables, postprocessor).
-metrics_version = "2.0.0"
+metrics_version = "2.1.0"


Is this correct since it changes output fields?

…s_for_composite

Both functions differed only in the value extractor; factor the shared grouping + averaging into a private _scenario_means(per_record_values) helper. Each public function becomes a one-line wrapper that builds the dict via comprehension. No new imports (no Callable needed).

Also clean up two docstring touches in bootstrap_ci:

- "95% percentile bootstrap CI" -> "95% bootstrap CI" (the percentile method is established at module level; "percentile" alongside "95%" reads as redundant).
- "so behavior is defined" -> "so behavior is deterministic" (more precise about what the seed gives you).

422/422 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lindsaydbrin · 2026-06-03T20:31:36Z

@tara-servicenow asked about handling missing/errored samples. Here's Claude's explanation:

How missing/errored trials are handled in CI computation
CI computation uses get_score() to filter raw trial data: errored trials (score.error set) and missing values return None and are dropped. Valid trials are grouped by scenario ID and averaged to produce one value per scenario; scenarios where every trial is None are dropped entirely. The bootstrap CI is then computed over those per-scenario means.

This matches what the main metric aggregation does — both exclude errors and missing values before computing statistics — with two differences worth noting:

Granularity. The main loop counts error_count and missing_count separately and tracks totals in the output. The CI path collapses both to None with no separate tracking; that distinction is only visible in the main aggregation stats.

Unit of analysis. The main loop treats each record/trial as an independent data point for the mean. The CI groups trials by scenario first and averages them, so each scenario contributes one value regardless of how many trials it has. This is the statistically correct unit for the bootstrap — a design choice, not a discrepancy.
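A small worked example may make the unit-of-analysis difference concrete. The scenario names and scores below are made up purely for illustration: one scenario has three valid trials, the other has one valid trial and two that returned None.

```python
from statistics import fmean

# Hypothetical trial scores per scenario; None marks an errored or missing trial.
trials = {
    "scenario_a": [1.0, 1.0, 1.0],
    "scenario_b": [0.0, None, None],
}

# Drop None values first, as both the main aggregation and the CI path do.
valid = {sid: [v for v in vals if v is not None] for sid, vals in trials.items()}

# Main-loop style: every valid trial is its own data point.
per_trial_mean = fmean([v for vals in valid.values() for v in vals])

# CI style: average within each scenario, then across scenarios.
per_scenario_mean = fmean([fmean(vals) for vals in valid.values() if vals])

print(per_trial_mean)     # 0.75 (three 1.0s and one 0.0)
print(per_scenario_mean)  # 0.5  (scenario means 1.0 and 0.0)
```

With unequal numbers of valid trials the two means diverge, which is why a CI computed over per-scenario means is not necessarily centered on the per-trial mean.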

You can see the get_score() call on line 116 of aggregation.py:

    return _scenario_means(
        {record_id: record_metrics.get_score(metric_name) for record_id, record_metrics in all_metrics.items()}
    )

and then lines 96-97 in _scenario_means():

        if val is None:
            continue

@gabegma does this make sense to you? @tara-servicenow mentioned that you had thought about handling None vs. errors.

tara-servicenow · 2026-06-05T18:44:59Z

This looks good to me but will let Gab review

lindsaydbrin and others added 9 commits May 29, 2026 11:59

lindsaydbrin marked this pull request as ready for review June 1, 2026 18:26

lindsaydbrin requested review from fanny-riols, gabegma and katstankiewicz June 1, 2026 18:26

lindsaydbrin wants to merge 10 commits into main from pr/lindsay/add_CIs
