
feat: simplify evaluation schema to flat score/reasoning shape#1286

Open
jsonbailey wants to merge 3 commits into feat/ai-sdk-next-release from jb/aic-2253/simplify-eval-schema

Conversation

@jsonbailey
Contributor

@jsonbailey jsonbailey commented Apr 16, 2026

Summary

  • Removed the metric key from the structured output schema. EvaluationSchemaBuilder.build() no longer takes an evaluationMetricKey parameter. Since there is only ever a single evaluation metric key per judge config, it does not need to be embedded in the schema sent to the LLM.
  • Flattened the schema to a top-level {score, reasoning} shape. The old nested structure ({evaluations: {metricKey: {score, reasoning}}}) is replaced with a simple {score: number, reasoning: string} object. This is easier for LLMs to produce correctly and matches the Python SDK (fix: Remove evaluation metric key from schema which failed on some LLMs python-server-sdk-ai#105).
  • Updated parsing in Judge.ts. _parseEvaluationResponse now reads score and reasoning directly from the top-level response data. The metric key is still sourced from the judge config's evaluationMetricKey and used to key the result — it just no longer appears in the schema or LLM response.
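The schema change above can be sketched as follows (illustrative TypeScript; the interface and function names are assumptions for this sketch, not the SDK's actual exports):

```typescript
// Old nested shape: the metric key was embedded in the structured-output schema.
interface NestedEvaluationResponse {
  evaluations: {
    [metricKey: string]: { score: number; reasoning: string };
  };
}

// New flat shape: the LLM returns score and reasoning at the top level.
interface FlatEvaluationResponse {
  score: number;
  reasoning: string;
}

// Parsing reads score/reasoning directly from the top level and keys the
// result by the judge config's evaluationMetricKey, which no longer appears
// in the schema or in the LLM response.
function parseEvaluation(
  data: FlatEvaluationResponse,
  evaluationMetricKey: string,
): { [key: string]: { score: number; reasoning: string } } {
  return {
    [evaluationMetricKey]: { score: data.score, reasoning: data.reasoning },
  };
}
```

With this shape, `parseEvaluation({ score: 0.9, reasoning: 'Relevant' }, 'relevance')` yields `{ relevance: { score: 0.9, reasoning: 'Relevant' } }`.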

Test plan

  • All 144 existing tests pass (yarn workspace @launchdarkly/server-sdk-ai test)
  • Lint passes (yarn workspace @launchdarkly/server-sdk-ai lint)
  • Test mocks updated to use new flat response shape
  • _parseEvaluationResponse unit tests updated for simplified signature and data shape

🤖 Generated with Claude Code


Note

Medium Risk
Changes the wire format expected from the AI provider for judge evaluations, so any callers/providers still producing the old nested evaluations shape will now fail parsing and return unsuccessful results.

Overview
Judge structured-output evaluation is simplified to a fixed, flat schema: the LLM is now asked to return top-level score and reasoning instead of {evaluations: {<metricKey>: ...}}, and response parsing is updated accordingly.

This removes the dynamic EvaluationSchemaBuilder and tightens failure handling/logging when the structured response cannot be parsed; tests are updated to reflect the new response shape and malformed/empty-response behavior.
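A minimal sketch of what a fixed, module-level schema and a strict parse guard could look like (the JSON-Schema-style object below is an assumption; the SDK's actual schema representation may differ):

```typescript
// Fixed schema replacing the dynamic EvaluationSchemaBuilder: every judge
// now requests the same flat { score, reasoning } object from the LLM.
const EVALUATION_SCHEMA = {
  type: 'object',
  properties: {
    score: { type: 'number' },
    reasoning: { type: 'string' },
  },
  required: ['score', 'reasoning'],
  additionalProperties: false,
} as const;

// Guard used at parse time: anything that is not a flat { score, reasoning }
// object (including the old nested `evaluations` shape) is rejected, which
// is where the tightened failure handling and warning logging kick in.
function isFlatEvaluation(
  data: unknown,
): data is { score: number; reasoning: string } {
  if (typeof data !== 'object' || data === null) return false;
  const record = data as Record<string, unknown>;
  return typeof record.score === 'number' && typeof record.reasoning === 'string';
}
```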

Reviewed by Cursor Bugbot for commit d81b202. Bugbot is set up for automated code reviews on this repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

@launchdarkly/js-sdk-common size report
This is the brotli compressed size of the ESM build.
Compressed size: 25623 bytes
Compressed size limit: 29000
Uncompressed size: 125843 bytes

@github-actions
Contributor

@launchdarkly/js-client-sdk size report
This is the brotli compressed size of the ESM build.
Compressed size: 31655 bytes
Compressed size limit: 34000
Uncompressed size: 112792 bytes

@github-actions
Contributor

@launchdarkly/browser size report
This is the brotli compressed size of the ESM build.
Compressed size: 179375 bytes
Compressed size limit: 200000
Uncompressed size: 829982 bytes

@github-actions
Contributor

@launchdarkly/js-client-sdk-common size report
This is the brotli compressed size of the ESM build.
Compressed size: 37169 bytes
Compressed size limit: 38000
Uncompressed size: 204305 bytes

jsonbailey and others added 2 commits April 16, 2026 16:07
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete EvaluationSchemaBuilder.ts and define EVALUATION_SCHEMA as a
module-level const in Judge.ts. Remove per-field warnings from
_parseEvaluationResponse (keep it pure) and emit a single warning in
evaluate() that includes the judge key and raw response data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jsonbailey jsonbailey marked this pull request as ready for review April 16, 2026 21:55
@jsonbailey jsonbailey requested a review from a team as a code owner April 16, 2026 21:55

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



    this._logger?.warn(
    -  'Judge evaluation did not return the expected evaluation',
    -  tracker.getTrackData(),
    +  `Could not parse evaluation response for judge "${this._aiConfig.key}": ${JSON.stringify(response.data)}`,
    );


Parse-failure warning drops tracker context data

Low Severity

The new warn call at the parse-failure point no longer passes tracker.getTrackData() as a second argument, unlike the other two warn calls in the same method (for missing metric key and missing messages), which still include it. The track data contains runId, variationKey, version, modelName, and providerName — operational context useful for correlating warnings in production. Since tracker is available in scope, this appears to be an accidental omission during the refactor.
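The suggested fix can be sketched as restoring the track-data argument to the parse-failure warning (the names, types, and signature below are assumptions based on the report, not the SDK's actual code):

```typescript
// Minimal stand-ins for the SDK types involved (assumed shapes).
type TrackData = { runId: string; variationKey: string };

interface Logger {
  warn(...args: unknown[]): void;
}

// Parse-failure warning that, like the other warn calls in the same method,
// passes the tracker's track data as operational context for correlation.
function warnParseFailure(
  logger: Logger,
  judgeKey: string,
  trackData: TrackData,
  responseData: unknown,
): void {
  logger.warn(
    `Could not parse evaluation response for judge "${judgeKey}": ${JSON.stringify(responseData)}`,
    trackData, // restored context: runId, variationKey, etc.
  );
}
```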



