feat: Add NL2Query quality test runner#3116
🎉 Build Summary
✅ Build Status: All checks completed successfully!
🎭 E2E Tests (Playwright + VS Code), commit 2202ab3
🔨 Build, Lint & Test
✅ Build Status: Build and local tests passed. See sibling comments below for E2E and NoSQL integration results.
Pull request overview
Adds a manual NL2Query quality test runner to help evaluate generateQuery behavior across a user-provided test spec and schema, producing a Markdown report with timing/token/grade statistics. This fits as a dev-only workflow tool for validating NL2Query prompt/pipeline quality during development.
Changes:
- Adds a dev-only VS Code command (`cosmosDB.dev.runNl2QueryQualityTest`) to run NL2Query quality tests and generate a Markdown report.
- Adds documentation and gitignore scaffolding for the quality test suite under `test/quality/nl2query/`.
- Wires command visibility via a dev-mode context key (`cosmosDB.devMode`) and package contributions.
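The gating described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `CommandHost` is a hypothetical stand-in for `vscode.commands` so the snippet runs without the VS Code runtime, and `registerDevCommands` and `loadRunner` are assumed names. The sketch also defers loading the runner until the command is invoked, which may differ from when the PR performs its dynamic import.

```typescript
// Hypothetical stand-in for the parts of the VS Code API this sketch touches.
// In a real extension these would be vscode.commands.executeCommand and
// vscode.commands.registerCommand.
interface CommandHost {
    executeCommand(command: string, ...args: unknown[]): void;
    registerCommand(id: string, handler: () => Promise<void>): void;
}

// Hypothetical helper: sets the context key and registers the command
// only when the extension runs in Development mode.
function registerDevCommands(
    host: CommandHost,
    isDevelopmentMode: boolean,
    loadRunner: () => Promise<() => Promise<void>>,
): boolean {
    // The context key drives the `when` clause that hides the command otherwise.
    host.executeCommand('setContext', 'cosmosDB.devMode', isDevelopmentMode);
    if (!isDevelopmentMode) {
        return false;
    }
    host.registerCommand('cosmosDB.dev.runNl2QueryQualityTest', async () => {
        // Load the runner lazily so the module is only pulled in on demand.
        const run = await loadRunner();
        await run();
    });
    return true;
}
```

In an extension, `isDevelopmentMode` would typically come from comparing `context.extensionMode` with `vscode.ExtensionMode.Development`, and `loadRunner` would wrap a dynamic `import()` of the command module.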
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test/quality/nl2query/README.md | Adds instructions and structure for running the NL2Query quality suite. |
| test/quality/nl2query/.gitignore | Ignores generated reports under results/. |
| src/extension.ts | Registers a dev-only context + dynamically imports the quality test command in Development mode. |
| src/commands/nl2queryQualityTest.ts | Implements the interactive runner, batch grading, and Markdown report generation. |
| package.json | Contributes the new dev command and gates it in the Command Palette with cosmosDB.devMode. |
| package-lock.json | Lockfile update (platform metadata normalization). |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Introduce a test runner that takes a test spec and a schema as input and produces a quality report as a Markdown file.
The test runs the same function calls used in NL2Query for each test spec.
The report includes statistics on time measurement, token count (input and output) and grading scores.
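As an illustration of how such statistics might be aggregated into a Markdown table, here is a small sketch. The `TestRunResult` shape, the field names, and the table layout are assumptions made for the example; the actual report format is defined in `src/commands/nl2queryQualityTest.ts`.

```typescript
// Hypothetical per-test measurements; field names are illustrative only.
interface TestRunResult {
    name: string;
    durationMs: number;
    inputTokens: number;
    outputTokens: number;
    grade: number;
}

interface Summary {
    count: number;
    mean: number;
    min: number;
    max: number;
}

// Computes count, mean, min and max; an empty input yields all zeros.
function summarize(values: number[]): Summary {
    if (values.length === 0) {
        return { count: 0, mean: 0, min: 0, max: 0 };
    }
    const sum = values.reduce((acc, v) => acc + v, 0);
    return {
        count: values.length,
        mean: sum / values.length,
        min: Math.min(...values),
        max: Math.max(...values),
    };
}

// Renders one Markdown table row per metric.
function renderSummaryTable(results: TestRunResult[]): string {
    const rows: Array<[string, Summary]> = [
        ['Duration (ms)', summarize(results.map((r) => r.durationMs))],
        ['Input tokens', summarize(results.map((r) => r.inputTokens))],
        ['Output tokens', summarize(results.map((r) => r.outputTokens))],
        ['Grade', summarize(results.map((r) => r.grade))],
    ];
    const lines = ['| Metric | Mean | Min | Max |', '|---|---|---|---|'];
    for (const [label, s] of rows) {
        lines.push(`| ${label} | ${s.mean.toFixed(1)} | ${s.min} | ${s.max} |`);
    }
    return lines.join('\n');
}
```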
The test spec specifies the purpose of the test, the prompt, and the expected query; AI then grades the actual query against the expected one.
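A test case and its grading step could look roughly like the sketch below. The field names, the prompt wording, and the 0 to 10 grade scale are all assumptions made for illustration; the PR does not state the spec schema or the grading scale here.

```typescript
// Hypothetical shape of one entry in the user-provided test spec.
interface Nl2QueryTestCase {
    purpose: string;
    prompt: string;
    expectedQuery: string;
}

// Builds the text sent to the grading model. The wording and the
// 0..10 scale are assumptions, not the PR's actual prompt.
function buildGradingPrompt(testCase: Nl2QueryTestCase, actualQuery: string): string {
    return [
        'You are grading a generated database query.',
        `Purpose of the test: ${testCase.purpose}`,
        `User prompt: ${testCase.prompt}`,
        `Expected query: ${testCase.expectedQuery}`,
        `Actual query: ${actualQuery}`,
        'Reply with a single integer grade from 0 (wrong) to 10 (equivalent).',
    ].join('\n');
}

// Extracts the first standalone integer in 0..10 from the model's reply,
// or undefined when the reply contains none.
function parseGrade(reply: string): number | undefined {
    const match = reply.match(/\b(10|[0-9])\b/);
    return match ? Number(match[1]) : undefined;
}
```

Parsing the grade defensively matters because model replies are free text; returning `undefined` lets the caller record an ungraded test rather than a misleading score.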
The runner is designed to be lightweight and generic (user provides test spec).
The runner must be executed manually at runtime by a user who is signed in to GitHub, in order to use GitHub Copilot.
Run it with the `runNl2QueryQualityTest` command. The command is only available in debug mode.