Skip to content

[fix] Resolve broken filtering for schemaless evaluators#4650

Open
junaway wants to merge 2 commits into
mainfrom
fix/broken-filtering-for-schemaless-evaluators
Open

[fix] Resolve broken filtering for schemaless evaluators#4650
junaway wants to merge 2 commits into
mainfrom
fix/broken-filtering-for-schemaless-evaluators

Conversation

@junaway

@junaway junaway commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Context

Run an evaluation from the SDK with a function evaluator that declares no output schema, open it in the UI, and that evaluator's columns are missing from the Scenarios "Filters" dropdown. The other evaluators (which declare a schema) show up fine. The run mappings were correct, but the UI sourced each column's filter type only from the evaluator's declared schema. With no schema, the type defaulted to "string", and the filter bar drops string-typed evaluator columns, so the columns disappeared.

Changes

The schema is inferred from traces during metrics refresh (this already happened, via genson). We now persist that inferred schema and let the UI use it.

  • Backend: EvaluationRunDataStep gains schemas: Optional[JsonSchemas]. During refresh_metrics, the schema inferred per annotation step is written onto run.data.steps[].schemas.outputs. It lives on the run, not the evaluator revision, so the immutable revision is never rewritten.
  • Frontend: when an evaluator revision carries no metric definitions, the column builder derives them from the run step's schemas.outputs (via the existing extractMetrics). Schema-declared evaluators are untouched.

Before: my_random_evaluator columns (myscore, success) resolved to type string and were filtered out.
After: they resolve to number and boolean and appear in the dropdown with the right operators.

Tests / notes

  • Verified end-to-end: after refresh, my_random_evaluator · myscore and · success appear in the Filters dropdown alongside the schema-declared evaluators.
  • Known limitation, not addressed here: edit_run raises EvaluationClosedConflict on closed runs and the router swallows it, so a run closed before its first metrics refresh won't persist the inferred schema (or the pre-existing mapping rewrite). Filtering the metadata write past the closed-run guard would fix that; flagging as a follow-up.

What to QA

  • Run an SDK eval with at least one evaluator that has no declared output schema and one that does. Open the run, go to Scenarios, open Filters. The schema-less evaluator's numeric/boolean columns now appear; string-only outputs stay hidden as before.
  • Regression: a UI-created eval with schema-declared evaluators still shows the same filter columns it did before.

Copilot AI review requested due to automatic review settings June 11, 2026 15:01
@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Jun 11, 2026 7:02pm

Request Review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The pull request adds persistent trace-inferred evaluator output schemas to evaluation run steps. The backend collects schemas during metrics refresh and stores them in run data when evaluator schemas are absent. The frontend extracts these persisted schemas and uses them to resolve metric types when evaluator-declared metrics are unavailable, with fallback resolution logic.

Changes

Trace-inferred step schemas

Layer / File(s) Summary
Data model: step schemas DTO
api/oss/src/core/evaluations/types.py
EvaluationRunDataStep gains an optional schemas field of type JsonSchemas, documented as inferred from traces when the evaluator declares no schemas, scoped to the run's observed outputs.
Backend: collect and persist inferred schemas
api/oss/src/core/evaluations/service.py
JsonSchemas is imported; metrics refresh accumulates inferred output schemas per step into inferred_schemas_by_step; _update_run_mappings_from_inferred_metrics accepts the collected schemas, patches missing step schema outputs, and persists both updated steps and mappings in the run data payload. Persistence of inferred schemas is skipped for closed runs via an EvaluationClosedConflict handler while other metric updates still apply.
Frontend: metric resolution with schema fallback
web/oss/src/components/EvalRunDetails/atoms/table/columns.ts
extractMetrics is imported; table column building derives a stepSchemaMetricsByStepKey map from runData.camelRun.data.steps; metric-definition resolution now prefers evaluator-declared metrics and falls back to schema-derived metrics when the declared metric is not found, using both canonical metric keys and bare value keys for matching.
Tests: inferred schema persistence behavior
api/oss/tests/pytest/unit/evaluations/test_inferred_schema_persistence.py
Adds unit tests verifying inferred output schemas are persisted onto schemaless steps, do not overwrite existing declared schemas, and that the full-run edit preserves unrelated run fields like tags, meta, and data.concurrency.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Agenta-AI/agenta#4477: Directly depends on the inferred per-step schemas added here to filter and resolve metric types in the scenario table and filters.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 60.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the context, changes, and testing approach for persisting inferred schemas to fix filtering issues.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The PR title references fixing broken filtering for schemaless evaluators, which directly relates to the core change: persisting inferred schemas on run steps to enable proper column type resolution in the UI filters dropdown.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/broken-filtering-for-schemaless-evaluators

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. Backend bug Something isn't working Frontend labels Jun 11, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes missing filter columns for schema-less (no declared output schema) evaluators by persisting the trace-inferred outputs schema on the evaluation run step and teaching the UI to derive metric types from that persisted schema when evaluator metric definitions are absent.

Changes:

  • Backend: add schemas: Optional[JsonSchemas] to EvaluationRunDataStep and persist trace-inferred schemas.outputs onto run steps during metrics refresh.
  • Frontend: derive metric definitions/types for annotation columns from run.data.steps[].schemas.outputs when evaluator revision metrics don’t provide a match.
  • UI typing: prefer evaluator-declared metrics, with a safe fallback to run-step inferred schema for schemaless evaluators.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
web/oss/src/components/EvalRunDetails/atoms/table/columns.ts Builds per-step metric definitions from run-step schemas.outputs and uses them as a fallback to type annotation columns.
api/oss/src/core/evaluations/types.py Extends EvaluationRunDataStep with optional schemas for run-scoped inferred output schema persistence.
api/oss/src/core/evaluations/service.py Captures trace-inferred output schemas per step during _refresh_metrics and persists them onto run steps in _update_run_mappings_from_inferred_metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/oss/src/core/evaluations/service.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
api/oss/src/core/evaluations/service.py (2)

1790-1805: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the full run payload when writing inferred schemas.

This rebuilds EvaluationRunData with only steps, repeats, and mappings, then calls edit_run without tags or meta. A successful metrics refresh can therefore erase unrelated run settings and metadata, including data.concurrency.

Suggested fix
         if updated_mappings != existing_mappings or updated_steps != existing_steps:
             run_data = EvaluationRunData(
                 steps=updated_steps,
                 repeats=run.data.repeats,
+                concurrency=run.data.concurrency,
                 mappings=updated_mappings,
             )
             await self.edit_run(
                 project_id=project_id,
                 user_id=user_id,
                 run=EvaluationRunEdit(
                     id=run.id,
                     name=run.name,
                     description=run.description,
+                    tags=run.tags,
+                    meta=run.meta,
                     status=run.status,
                     flags=run.flags,
                     data=run_data,
                 ),
             )

1512-1519: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let schema persistence abort metric refresh on closed runs.

This await sits on the hot path before analytics() and set_metrics(), but _update_run_mappings_from_inferred_metrics() mutates the run via edit_run(). On a closed run that can raise EvaluationClosedConflict, which means refresh exits before any metrics are recomputed or stored.

Suggested fix
         if any_inferred and metrics_keys_by_step and run and run.data:
-            await self._update_run_mappings_from_inferred_metrics(
-                project_id=project_id,
-                user_id=user_id,
-                run=run,
-                inferred_metrics_keys_by_step=metrics_keys_by_step,
-                inferred_schemas_by_step=inferred_schemas_by_step,
-            )
+            try:
+                await self._update_run_mappings_from_inferred_metrics(
+                    project_id=project_id,
+                    user_id=user_id,
+                    run=run,
+                    inferred_metrics_keys_by_step=metrics_keys_by_step,
+                    inferred_schemas_by_step=inferred_schemas_by_step,
+                )
+            except EvaluationClosedConflict:
+                log.info(
+                    "[METRICS] Skipping inferred schema persistence for closed run",
+                    run_id=run.id,
+                )

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 952c4364-8877-4f4e-96ef-c18abaf1b229

📥 Commits

Reviewing files that changed from the base of the PR and between bbc9d12 and 1110bce.

📒 Files selected for processing (3)
  • api/oss/src/core/evaluations/service.py
  • api/oss/src/core/evaluations/types.py
  • web/oss/src/components/EvalRunDetails/atoms/table/columns.ts

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Railway Preview Environment

Preview URL https://gateway-production-7225.up.railway.app/w
Project agenta-oss-pr-4650
Image tag pr-4650-cd67118
Status Deployed
Railway logs Open logs
Workflow logs View workflow run
Updated at 2026-06-11T19:06:51.489Z

@junaway junaway requested a review from ardaerzin June 11, 2026 15:25
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jun 11, 2026
@junaway junaway changed the title [fix] Resolve broken filtering for schemaless evaluators [fix] Resolve broken filtering for schemaless evaluators Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backend bug Something isn't working Frontend size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants