[fix] Resolve broken filtering for schemaless evaluators #4650
Walkthrough
The pull request adds persistent trace-inferred evaluator output schemas to evaluation run steps. The backend collects schemas during metrics refresh and stores them in run data when evaluator schemas are absent. The frontend extracts these persisted schemas and uses them to resolve metric types when evaluator-declared metrics are unavailable, with fallback resolution logic.
Changes: Trace-inferred step schemas
Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Pull request overview
This PR fixes missing filter columns for schema-less (no declared output schema) evaluators by persisting the trace-inferred outputs schema on the evaluation run step and teaching the UI to derive metric types from that persisted schema when evaluator metric definitions are absent.
Changes:
- Backend: add `schemas: Optional[JsonSchemas]` to `EvaluationRunDataStep` and persist trace-inferred `schemas.outputs` onto run steps during metrics refresh.
- Frontend: derive metric definitions/types for annotation columns from `run.data.steps[].schemas.outputs` when evaluator revision metrics don't provide a match.
- UI typing: prefer evaluator-declared metrics, with a safe fallback to run-step inferred schema for schemaless evaluators.
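The fallback order described above can be sketched roughly as follows. This is an illustration in Python, not the actual `columns.ts` code; `metric_types_from_schema` and `resolve_metric_type` are hypothetical names, and the sketch assumes the persisted schema is a flat JSON Schema object with a `properties` map.

```python
from typing import Any, Optional

# Hypothetical helpers: the real logic lives in the TypeScript file
# columns.ts and may be structured differently.


def metric_types_from_schema(outputs_schema: Optional[dict[str, Any]]) -> dict[str, str]:
    """Map each top-level property of a JSON Schema object to its declared type."""
    if not outputs_schema:
        return {}
    properties = outputs_schema.get("properties") or {}
    return {
        key: spec["type"]
        for key, spec in properties.items()
        if isinstance(spec, dict) and isinstance(spec.get("type"), str)
    }


def resolve_metric_type(
    metric_key: str,
    declared_types: dict[str, str],
    step_outputs_schema: Optional[dict[str, Any]],
) -> str:
    """Prefer evaluator-declared types, then the run-step schema, then "string"."""
    if metric_key in declared_types:
        return declared_types[metric_key]
    inferred = metric_types_from_schema(step_outputs_schema)
    return inferred.get(metric_key, "string")
```

Keeping the declared types first means schema-declared evaluators resolve exactly as before; the run-step schema only matters when the declared lookup misses.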
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| web/oss/src/components/EvalRunDetails/atoms/table/columns.ts | Builds per-step metric definitions from run-step schemas.outputs and uses them as a fallback to type annotation columns. |
| api/oss/src/core/evaluations/types.py | Extends EvaluationRunDataStep with optional schemas for run-scoped inferred output schema persistence. |
| api/oss/src/core/evaluations/service.py | Captures trace-inferred output schemas per step during _refresh_metrics and persists them onto run steps in _update_run_mappings_from_inferred_metrics. |
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
api/oss/src/core/evaluations/service.py (2)
1790-1805: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Preserve the full run payload when writing inferred schemas.
This rebuilds `EvaluationRunData` with only `steps`, `repeats`, and `mappings`, then calls `edit_run` without `tags` or `meta`. A successful metrics refresh can therefore erase unrelated run settings and metadata, including `data.concurrency`.
Suggested fix
```diff
 if updated_mappings != existing_mappings or updated_steps != existing_steps:
     run_data = EvaluationRunData(
         steps=updated_steps,
         repeats=run.data.repeats,
+        concurrency=run.data.concurrency,
         mappings=updated_mappings,
     )
     await self.edit_run(
         project_id=project_id,
         user_id=user_id,
         run=EvaluationRunEdit(
             id=run.id,
             name=run.name,
             description=run.description,
+            tags=run.tags,
+            meta=run.meta,
             status=run.status,
             flags=run.flags,
             data=run_data,
         ),
     )
```
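One way to avoid this class of bug is to copy the existing payload and override only the changed fields, so fields added later are carried over without being re-listed. The sketch below uses a frozen dataclass as a hypothetical stand-in for `EvaluationRunData`; the real model and its field types are not shown in this PR excerpt and may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunData:
    """Hypothetical stand-in for EvaluationRunData (field types are assumed)."""

    steps: tuple = ()
    repeats: Optional[int] = None
    concurrency: Optional[int] = None
    mappings: tuple = ()


def with_inferred(run_data: RunData, steps: tuple, mappings: tuple) -> RunData:
    """Return a copy with only steps and mappings replaced.

    Every other field (repeats, concurrency, anything added later) is
    copied from the existing payload instead of being listed by hand.
    """
    return replace(run_data, steps=steps, mappings=mappings)
```

For example, `with_inferred(RunData(repeats=2, concurrency=8), steps=("s1",), mappings=("m1",))` keeps `concurrency=8` without naming it at the call site.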
1512-1519: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don't let schema persistence abort metric refresh on closed runs.
This `await` sits on the hot path before `analytics()` and `set_metrics()`, but `_update_run_mappings_from_inferred_metrics()` mutates the run via `edit_run()`. On a closed run that can raise `EvaluationClosedConflict`, which means refresh exits before any metrics are recomputed or stored.
Suggested fix
```diff
 if any_inferred and metrics_keys_by_step and run and run.data:
-    await self._update_run_mappings_from_inferred_metrics(
-        project_id=project_id,
-        user_id=user_id,
-        run=run,
-        inferred_metrics_keys_by_step=metrics_keys_by_step,
-        inferred_schemas_by_step=inferred_schemas_by_step,
-    )
+    try:
+        await self._update_run_mappings_from_inferred_metrics(
+            project_id=project_id,
+            user_id=user_id,
+            run=run,
+            inferred_metrics_keys_by_step=metrics_keys_by_step,
+            inferred_schemas_by_step=inferred_schemas_by_step,
+        )
+    except EvaluationClosedConflict:
+        log.info(
+            "[METRICS] Skipping inferred schema persistence for closed run",
+            run_id=run.id,
+        )
```
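The control flow behind this suggestion can be shown in isolation. The following is a minimal, self-contained sketch: the exception class, the dict-based run, and both functions are stand-ins invented for illustration, not the service's real signatures.

```python
import asyncio


class EvaluationClosedConflict(Exception):
    """Stand-in for the exception named in the review comment."""


async def persist_inferred_schemas(run: dict) -> None:
    """Hypothetical persistence step: refuses to edit a closed run."""
    if run["closed"]:
        raise EvaluationClosedConflict(run["id"])
    run["schemas_persisted"] = True


async def refresh_metrics(run: dict) -> dict:
    """Treat schema persistence as best effort, then always compute metrics."""
    try:
        await persist_inferred_schemas(run)
    except EvaluationClosedConflict:
        # Closed run: skip the run edit but keep refreshing metrics.
        pass
    run["metrics"] = {"refreshed": True}
    return run
```

Calling `asyncio.run(refresh_metrics({"id": "r1", "closed": True}))` still stores metrics even though the persistence step raised; only the schema write is skipped.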
Context
Run an evaluation from the SDK with a function evaluator that declares no output schema, open it in the UI, and that evaluator's columns are missing from the Scenarios "Filters" dropdown. The other evaluators (which declare a schema) show up fine. The run mappings were correct, but the UI sourced each column's filter type only from the evaluator's declared schema. With no schema, the type defaulted to `"string"`, and the filter bar drops string-typed evaluator columns, so the columns disappeared.
Changes
The schema is inferred from traces during metrics refresh (this already happened, via genson). We now persist that inferred schema and let the UI use it.
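As a rough illustration of the inference-then-persist idea, the sketch below derives a flat output schema from one sample of evaluator outputs and attaches it to a step. It is a simplified stand-in written with the standard library only, not genson's API and not the service code; `json_type`, `infer_outputs_schema`, and `persist_on_step` are hypothetical names, and steps are modelled as plain dicts.

```python
from typing import Any


def json_type(value: Any) -> str:
    """Return the JSON Schema type name for a Python value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_outputs_schema(outputs: dict[str, Any]) -> dict[str, Any]:
    """Build a flat object schema from one sample of evaluator outputs."""
    return {
        "type": "object",
        "properties": {key: {"type": json_type(val)} for key, val in outputs.items()},
    }


def persist_on_step(step: dict[str, Any], inferred: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the step with the schema stored under schemas.outputs."""
    schemas = dict(step.get("schemas") or {})
    schemas["outputs"] = inferred
    return {**step, "schemas": schemas}
```

With sample outputs such as `{"myscore": 0.42, "success": True}`, the inferred properties are typed `number` and `boolean`, which is the information the UI previously lacked for schemaless evaluators.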
- `EvaluationRunDataStep` gains `schemas: Optional[JsonSchemas]`. During `refresh_metrics`, the schema inferred per annotation step is written onto `run.data.steps[].schemas.outputs`. It lives on the run, not the evaluator revision, so the immutable revision is never rewritten.
- When an evaluator declares no metrics, the UI derives column types from the run step's `schemas.outputs` (via the existing `extractMetrics`). Schema-declared evaluators are untouched.

Before: `my_random_evaluator` columns (`myscore`, `success`) resolved to type `string` and were filtered out.
After: they resolve to `number` and `boolean` and appear in the dropdown with the right operators.
Tests / notes
- `my_random_evaluator · myscore` and `· success` appear in the Filters dropdown alongside the schema-declared evaluators.
- `edit_run` raises `EvaluationClosedConflict` on closed runs and the router swallows it, so a run closed before its first metrics refresh won't persist the inferred schema (or the pre-existing mapping rewrite). Filtering the metadata write past the closed-run guard would fix that; flagging as a follow-up.
What to QA