[fix] Resolve broken filtering for schemaless `evaluators` by junaway · Pull Request #4650 · Agenta-AI/agenta

junaway · 2026-06-11T15:01:33Z

Context

Run an evaluation from the SDK with a function evaluator that declares no output schema, open it in the UI, and that evaluator's columns are missing from the Scenarios "Filters" dropdown. The other evaluators (which declare a schema) show up fine. The run mappings were correct, but the UI sourced each column's filter type only from the evaluator's declared schema. With no schema, the type defaulted to "string", and the filter bar drops string-typed evaluator columns, so the columns disappeared.

Changes

The schema is inferred from traces during metrics refresh (this already happened, via genson). We now persist that inferred schema and let the UI use it.

Backend: EvaluationRunDataStep gains schemas: Optional[JsonSchemas]. During refresh_metrics, the schema inferred per annotation step is written onto run.data.steps[].schemas.outputs. It lives on the run, not the evaluator revision, so the immutable revision is never rewritten.
Frontend: when an evaluator revision carries no metric definitions, the column builder derives them from the run step's schemas.outputs (via the existing extractMetrics). Schema-declared evaluators are untouched.

Before: my_random_evaluator columns (myscore, success) resolved to type string and were filtered out.
After: they resolve to number and boolean and appear in the dropdown with the right operators.

Tests / notes

Verified end-to-end: after refresh, my_random_evaluator · myscore and · success appear in the Filters dropdown alongside the schema-declared evaluators.
Known limitation, not addressed here: edit_run raises EvaluationClosedConflict on closed runs and the router swallows it, so a run closed before its first metrics refresh won't persist the inferred schema (or the pre-existing mapping rewrite). Filtering the metadata write past the closed-run guard would fix that; flagging as a follow-up.

What to QA

Run an SDK eval with at least one evaluator that has no declared output schema and one that does. Open the run, go to Scenarios, open Filters. The schema-less evaluator's numeric/boolean columns now appear; string-only outputs stay hidden as before.
Regression: a UI-created eval with schema-declared evaluators still shows the same filter columns it did before.

vercel · 2026-06-11T15:01:41Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jun 11, 2026 7:02pm

coderabbitai · 2026-06-11T15:01:58Z

📝 Walkthrough

Walkthrough

The pull request adds persistent trace-inferred evaluator output schemas to evaluation run steps. The backend collects schemas during metrics refresh and stores them in run data when evaluator schemas are absent. The frontend extracts these persisted schemas and uses them to resolve metric types when evaluator-declared metrics are unavailable, with fallback resolution logic.

Changes

Trace-inferred step schemas

Layer / File(s)	Summary
Data model: step schemas DTO `api/oss/src/core/evaluations/types.py`	`EvaluationRunDataStep` gains an optional `schemas` field of type `JsonSchemas`, documented as inferred from traces when the evaluator declares no schemas, scoped to the run's observed outputs.
Backend: collect and persist inferred schemas `api/oss/src/core/evaluations/service.py`	`JsonSchemas` is imported; metrics refresh accumulates inferred output schemas per step into `inferred_schemas_by_step`; `_update_run_mappings_from_inferred_metrics` accepts the collected schemas, patches missing step schema outputs, and persists both updated steps and mappings in the run data payload. Persistence of inferred schemas is skipped for closed runs via an `EvaluationClosedConflict` handler while other metric updates still apply.
Frontend: metric resolution with schema fallback `web/oss/src/components/EvalRunDetails/atoms/table/columns.ts`	`extractMetrics` is imported; table column building derives a `stepSchemaMetricsByStepKey` map from `runData.camelRun.data.steps`; metric-definition resolution now prefers evaluator-declared metrics and falls back to schema-derived metrics when the declared metric is not found, using both canonical metric keys and bare value keys for matching.
Tests: inferred schema persistence behavior `api/oss/tests/pytest/unit/evaluations/test_inferred_schema_persistence.py`	Adds unit tests verifying inferred output schemas are persisted onto schemaless steps, do not overwrite existing declared schemas, and that the full-run edit preserves unrelated run fields like `tags`, `meta`, and `data.concurrency`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Agenta-AI/agenta#4477: Directly depends on the inferred per-step schemas added here to filter and resolve metric types in the scenario table and filters.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 60.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the context, changes, and testing approach for persisting inferred schemas to fix filtering issues.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The PR title references fixing broken filtering for schemaless evaluators, which directly relates to the core change: persisting inferred schemas on run steps to enable proper column type resolution in the UI filters dropdown.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/broken-filtering-for-schemaless-evaluators

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR fixes missing filter columns for schema-less (no declared output schema) evaluators by persisting the trace-inferred outputs schema on the evaluation run step and teaching the UI to derive metric types from that persisted schema when evaluator metric definitions are absent.

Changes:

Backend: add schemas: Optional[JsonSchemas] to EvaluationRunDataStep and persist trace-inferred schemas.outputs onto run steps during metrics refresh.
Frontend: derive metric definitions/types for annotation columns from run.data.steps[].schemas.outputs when evaluator revision metrics don’t provide a match.
UI typing: prefer evaluator-declared metrics, with a safe fallback to run-step inferred schema for schemaless evaluators.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
web/oss/src/components/EvalRunDetails/atoms/table/columns.ts	Builds per-step metric definitions from run-step `schemas.outputs` and uses them as a fallback to type annotation columns.
api/oss/src/core/evaluations/types.py	Extends `EvaluationRunDataStep` with optional `schemas` for run-scoped inferred output schema persistence.
api/oss/src/core/evaluations/service.py	Captures trace-inferred output schemas per step during `_refresh_metrics` and persists them onto run steps in `_update_run_mappings_from_inferred_metrics`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

api/oss/src/core/evaluations/service.py (2)

1790-1805: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the full run payload when writing inferred schemas.

This rebuilds EvaluationRunData with only steps, repeats, and mappings, then calls edit_run without tags or meta. A successful metrics refresh can therefore erase unrelated run settings and metadata, including data.concurrency.

Suggested fix

         if updated_mappings != existing_mappings or updated_steps != existing_steps:
             run_data = EvaluationRunData(
                 steps=updated_steps,
                 repeats=run.data.repeats,
+                concurrency=run.data.concurrency,
                 mappings=updated_mappings,
             )
             await self.edit_run(
                 project_id=project_id,
                 user_id=user_id,
                 run=EvaluationRunEdit(
                     id=run.id,
                     name=run.name,
                     description=run.description,
+                    tags=run.tags,
+                    meta=run.meta,
                     status=run.status,
                     flags=run.flags,
                     data=run_data,
                 ),
             )

1512-1519: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let schema persistence abort metric refresh on closed runs.

This await sits on the hot path before analytics() and set_metrics(), but _update_run_mappings_from_inferred_metrics() mutates the run via edit_run(). On a closed run that can raise EvaluationClosedConflict, which means refresh exits before any metrics are recomputed or stored.

Suggested fix

         if any_inferred and metrics_keys_by_step and run and run.data:
-            await self._update_run_mappings_from_inferred_metrics(
-                project_id=project_id,
-                user_id=user_id,
-                run=run,
-                inferred_metrics_keys_by_step=metrics_keys_by_step,
-                inferred_schemas_by_step=inferred_schemas_by_step,
-            )
+            try:
+                await self._update_run_mappings_from_inferred_metrics(
+                    project_id=project_id,
+                    user_id=user_id,
+                    run=run,
+                    inferred_metrics_keys_by_step=metrics_keys_by_step,
+                    inferred_schemas_by_step=inferred_schemas_by_step,
+                )
+            except EvaluationClosedConflict:
+                log.info(
+                    "[METRICS] Skipping inferred schema persistence for closed run",
+                    run_id=run.id,
+                )

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 952c4364-8877-4f4e-96ef-c18abaf1b229

📥 Commits

Reviewing files that changed from the base of the PR and between bbc9d12 and 1110bce.

📒 Files selected for processing (3)

api/oss/src/core/evaluations/service.py
api/oss/src/core/evaluations/types.py
web/oss/src/components/EvalRunDetails/atoms/table/columns.ts

github-actions · 2026-06-11T15:14:03Z

Railway Preview Environment


Preview URL	https://gateway-production-7225.up.railway.app/w
Project	`agenta-oss-pr-4650`
Image tag	`pr-4650-cd67118`
Status	Deployed
Railway logs	Open logs
Workflow logs	View workflow run
Updated at 2026-06-11T19:06:51.489Z

[fix] Resolve broken filtering for schemaless evaluators

1110bce

Copilot AI review requested due to automatic review settings June 11, 2026 15:01

Copilot started reviewing on behalf of junaway June 11, 2026 15:01 View session

dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. Backend bug Something isn't working Frontend labels Jun 11, 2026

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread api/oss/src/core/evaluations/service.py

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

junaway requested a review from ardaerzin June 11, 2026 15:25

fix CR

9d8b0aa

dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jun 11, 2026

vercel Bot deployed to Preview June 11, 2026 19:02 View deployment

junaway changed the title ~~[fix] Resolve broken filtering for schemaless evaluators~~ [fix] Resolve broken filtering for schemaless evaluators Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Resolve broken filtering for schemaless `evaluators`#4650

[fix] Resolve broken filtering for schemaless `evaluators`#4650
junaway wants to merge 2 commits into
mainfrom
fix/broken-filtering-for-schemaless-evaluators

junaway commented Jun 11, 2026

Uh oh!

vercel Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

junaway commented Jun 11, 2026

Context

Changes

Tests / notes

What to QA

Uh oh!

vercel Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Railway Preview Environment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vercel Bot commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 11, 2026 •

edited

Loading