Add agent-tool-routing eval (12 samples, Match template)#1655
Open
MukundaKatta wants to merge 1 commit into
Open
Add agent-tool-routing eval (12 samples, Match template)#1655MukundaKatta wants to merge 1 commit into
MukundaKatta wants to merge 1 commit into
Conversation
Tests an LLM's ability to select the single correct tool from a small list when given a natural-language user request. Each prompt presents 4-6 thematically related tools and asks for the canonical tool name as the only output. Failure modes targeted: - emitting explanation text instead of a bare tool name - selecting a similar-but-incorrect tool that is also in the list - inventing tools that are not in the list Real-world use case: LLM tool routing in chat assistants and IDE copilots, which is the call site that downstream libraries like AgentVet (@mukundakatta/agentvet on npm) wrap with arg-shape validation. This eval scopes to the correct-tool-name choice as a foundational capability that precedes argument validation. Domain mix across the 12 samples: contact lookup, file system, analytics SQL, GitHub, web fetching, billing, ops/Kubernetes, travel booking, git workflow, weather, user management, audio transcription. Uses evals.elsuite.basic.match:Match (no custom code) as required by the docs/build-eval.md current contribution guidelines.
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Eval details
Eval name
`agent-tool-routing` (id: `agent-tool-routing.dev.v0`)
Eval description
Tests an LLM's ability to select the single correct tool from a small list when given a natural-language user request. Each prompt presents 4 to 6 thematically related tools and asks the model to output the canonical tool name as the only thing it emits — no quotes, no JSON wrapper, no explanation, no surrounding whitespace.
What makes this challenging
Three concrete failure modes the eval is designed to surface:
What's in the dataset
12 hand-curated samples spanning a thematic mix of common production agent tool surfaces:
Each row has a 4-6-tool list and a single correct `ideal`. All 12 `ideal` values are distinct (no overlapping tool names across the prompt set).
Why this is a real-world use case
LLM tool routing is the first hop in every tool-using agent runtime. It precedes argument validation (which downstream libraries like `@mukundakatta/agentvet` and OpenAI function-calling schemas wrap) and structured-output enforcement. Agents that pick the wrong tool can never recover via better arg validation, so accuracy on the routing step is foundational. The samples here mirror tool-naming conventions that show up in production MCP servers and OpenAI function-calling schemas.
Eval class
`evals.elsuite.basic.match:Match` — no custom code, per the current `docs/build-eval.md` contribution guidelines.
Files
Self-review checklist
Disclaimer
12 hand-curated samples is intentionally a foundational starter set rather than a large-scale benchmark. The same Match template scales to additional samples by appending JSONL rows; the YAML `disclaimer` field documents that explicitly.