feat: simplify evaluation schema to flat score/reasoning shape#1286
jsonbailey wants to merge 3 commits into `feat/ai-sdk-next-release` from
Conversation
Delete EvaluationSchemaBuilder.ts and define EVALUATION_SCHEMA as a module-level const in Judge.ts. Remove per-field warnings from _parseEvaluationResponse (keep it pure) and emit a single warning in evaluate() that includes the judge key and raw response data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d81b202.
```diff
-      this._logger?.warn(
-        'Judge evaluation did not return the expected evaluation',
-        tracker.getTrackData(),
+      this._logger?.warn(
+        `Could not parse evaluation response for judge "${this._aiConfig.key}": ${JSON.stringify(response.data)}`,
```
Parse-failure warning drops tracker context data
Low Severity
The new warn call at the parse-failure point no longer passes tracker.getTrackData() as a second argument, unlike the other two warn calls in the same method (for missing metric key and missing messages), which still include it. The track data contains runId, variationKey, version, modelName, and providerName — operational context useful for correlating warnings in production. Since tracker is available in scope, this appears to be an accidental omission during the refactor.
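A minimal sketch of the suggested fix, assuming the `logger.warn` and track-data shapes implied by the review comment (the helper name `warnParseFailure` is hypothetical, not the SDK's actual code):

```typescript
// Hypothetical reconstruction: keep the tracker context on the
// parse-failure warning, matching the other warn calls in the method.
type TrackData = {
  runId: string;
  variationKey: string;
  version: number;
  modelName: string;
  providerName: string;
};

interface Logger {
  warn(...args: unknown[]): void;
}

function warnParseFailure(
  logger: Logger | undefined,
  judgeKey: string,
  responseData: unknown,
  trackData: TrackData,
): void {
  // Pass the track data as a second argument so operators can correlate
  // the warning with a specific evaluation run in production logs.
  logger?.warn(
    `Could not parse evaluation response for judge "${judgeKey}": ${JSON.stringify(responseData)}`,
    trackData,
  );
}
```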


Summary
- `EvaluationSchemaBuilder.build()` no longer takes an `evaluationMetricKey` parameter. Since there is only ever a single evaluation metric key per judge config, it does not need to be embedded in the schema sent to the LLM.
- The schema now describes a flat `{score, reasoning}` shape. The old nested structure (`{evaluations: {metricKey: {score, reasoning}}}`) is replaced with a simple `{score: number, reasoning: string}` object. This is easier for LLMs to produce correctly and matches the Python SDK (fix: Remove evaluation metric key from schema which failed on some LLMs, python-server-sdk-ai#105).
- In `Judge.ts`, `_parseEvaluationResponse` now reads `score` and `reasoning` directly from the top-level response data. The metric key is still sourced from the judge config's `evaluationMetricKey` and used to key the result; it just no longer appears in the schema or LLM response.

Test plan

- All unit tests pass (`yarn workspace @launchdarkly/server-sdk-ai test`)
- Lint passes (`yarn workspace @launchdarkly/server-sdk-ai lint`)
- `_parseEvaluationResponse` unit tests updated for the simplified signature and data shape

🤖 Generated with Claude Code
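The flat shape described above can be sketched as follows. This is a hypothetical illustration, not the SDK's source: identifiers like `EVALUATION_SCHEMA` and `parseEvaluationResponse` mirror the PR description but their exact definitions are assumptions.

```typescript
// Sketch of the flat, fixed schema sent to the LLM (no metric key embedded).
const EVALUATION_SCHEMA = {
  type: 'object',
  properties: {
    score: { type: 'number' },
    reasoning: { type: 'string' },
  },
  required: ['score', 'reasoning'],
} as const;

interface ParsedEvaluation {
  [metricKey: string]: { score: number; reasoning: string };
}

// Pure parser: returns undefined on malformed input instead of logging,
// leaving the single warning to the caller, as the refactor describes.
function parseEvaluationResponse(
  data: unknown,
  evaluationMetricKey: string,
): ParsedEvaluation | undefined {
  if (typeof data !== 'object' || data === null) return undefined;
  const { score, reasoning } = data as { score?: unknown; reasoning?: unknown };
  if (typeof score !== 'number' || typeof reasoning !== 'string') return undefined;
  // The metric key comes from the judge config, not from the LLM response.
  return { [evaluationMetricKey]: { score, reasoning } };
}
```

Keeping the parser pure makes the unit tests in the test plan straightforward: each case is an input object and an expected result, with no logger to mock.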
Note

Medium Risk

Changes the wire format expected from the AI provider for judge evaluations, so any callers/providers still producing the old nested `evaluations` shape will now fail parsing and return unsuccessful results.

Overview

Judge structured-output evaluation is simplified to a fixed, flat schema: the LLM is now asked to return top-level `score` and `reasoning` instead of `{evaluations: {<metricKey>: ...}}`, and response parsing is updated accordingly. This removes the dynamic `EvaluationSchemaBuilder` and tightens failure handling/logging when the structured response cannot be parsed; tests are updated to reflect the new response shape and malformed/empty-response behavior.
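To make the Medium Risk note concrete, here is a small illustration of the wire-format change. The payloads and the `matchesFlatShape` helper are made up for this example; only the two shapes come from the PR description.

```typescript
// Old nested shape: a provider still emitting this will now fail parsing.
const oldShape = {
  evaluations: { relevance: { score: 0.9, reasoning: 'on topic' } },
};

// New flat shape: what the judge now asks the LLM to return.
const newShape = { score: 0.9, reasoning: 'on topic' };

// Hypothetical check mirroring what the simplified parser accepts.
function matchesFlatShape(data: unknown): boolean {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as { score?: unknown; reasoning?: unknown };
  return typeof d.score === 'number' && typeof d.reasoning === 'string';
}
```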