Add agent-tool-routing eval (12 samples, Match template) by MukundaKatta · Pull Request #1655 · openai/evals

MukundaKatta · 2026-05-08T05:58:26Z

Eval details

Eval name

`agent-tool-routing` (id: `agent-tool-routing.dev.v0`)

Eval description

Tests an LLM's ability to select the single correct tool from a small list when given a natural-language user request. Each prompt presents 4 to 6 thematically related tools and asks the model to output the canonical tool name as the only thing it emits — no quotes, no JSON wrapper, no explanation, no surrounding whitespace.

What makes this challenging

Three concrete failure modes the eval is designed to surface:

Format leakage — models often emit `I would use the \`fetch_url\` tool to ...` instead of `fetch_url`. The strict Match scoring catches this immediately.
Similar-but-wrong selection — every prompt deliberately includes 1-2 tools that are semantically adjacent to the correct answer (e.g. `get_weather` vs `get_forecast`, `comment_on_issue` vs `create_issue`, `commit_changes` vs `push_branch`).
Tool hallucination — none of the user-side prompts hint at tool names, so a model that hallucinates plausible-sounding names not in the provided list (`reset_user_password`, `make_invoice_payment`, `view_logs_recent`) fails.

What's in the dataset

12 hand-curated samples spanning a thematic mix of common production agent tool surfaces:

contact lookup
file system operations
analytics SQL
GitHub workflow
web fetching
billing
ops / Kubernetes
travel booking
git workflow
weather
user management
audio transcription

Each row has a 4-6-tool list and a single correct `ideal`. All 12 `ideal` values are distinct (no overlapping tool names across the prompt set).

Why this is a real-world use case

LLM tool routing is the first hop in every tool-using agent runtime. It precedes argument validation (which downstream libraries like `@mukundakatta/agentvet` and OpenAI function-calling schemas wrap) and structured-output enforcement. Agents that pick the wrong tool can never recover via better arg validation, so accuracy on the routing step is foundational. The samples here mirror tool-naming conventions that show up in production MCP servers and OpenAI function-calling schemas.

Eval class

`evals.elsuite.basic.match:Match` — no custom code, per the current `docs/build-eval.md` contribution guidelines.

Files

`evals/registry/data/agent_tool_routing/samples.jsonl` — 12 samples, chat-format
`evals/registry/evals/agent-tool-routing.yaml` — registry entry

Self-review checklist

Thematically consistent. All 12 prompts revolve around the same use case (LLM tool routing) with deliberate variation in domain to avoid overfitting to one tool family.
Challenging. Each prompt includes a similar-but-wrong distractor tool. Spot checks against gpt-3.5-turbo and gpt-4o show non-trivial error rate driven primarily by format leakage and adjacent-tool confusion.
Directionally clear. Each prompt has exactly one correct answer; the user-side request is unambiguous about the intended action.
Carefully crafted. Pre-commit (black, isort, autoflake) ran clean. JSONL parses; YAML lints; all rows have correct `input` + `ideal` keys. `ideal` is always a string. 12 distinct ideals.
No custom code. Uses the canonical `Match` template only.

Disclaimer

12 hand-curated samples is intentionally a foundational starter set rather than a large-scale benchmark. The same Match template scales to additional samples by appending JSONL rows; the YAML `disclaimer` field documents that explicitly.

Tests an LLM's ability to select the single correct tool from a small list when given a natural-language user request. Each prompt presents 4-6 thematically related tools and asks for the canonical tool name as the only output. Failure modes targeted: - emitting explanation text instead of a bare tool name - selecting a similar-but-incorrect tool that is also in the list - inventing tools that are not in the list Real-world use case: LLM tool routing in chat assistants and IDE copilots, which is the call site that downstream libraries like AgentVet (@mukundakatta/agentvet on npm) wrap with arg-shape validation. This eval scopes to the correct-tool-name choice as a foundational capability that precedes argument validation. Domain mix across the 12 samples: contact lookup, file system, analytics SQL, GitHub, web fetching, billing, ops/Kubernetes, travel booking, git workflow, weather, user management, audio transcription. Uses evals.elsuite.basic.match:Match (no custom code) as required by the docs/build-eval.md current contribution guidelines.

MukundaKatta requested review from andrew-openai, etr2460 and katyhshi as code owners May 8, 2026 05:58

MukundaKatta mentioned this pull request May 8, 2026

Add agent-tool-abstention eval (13 samples, Match template) #1656

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add agent-tool-routing eval (12 samples, Match template)#1655

Add agent-tool-routing eval (12 samples, Match template)#1655
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-routing-eval

MukundaKatta commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MukundaKatta commented May 8, 2026

Eval details

Eval name

Eval description

What makes this challenging

What's in the dataset

Why this is a real-world use case

Eval class

Files

Self-review checklist

Disclaimer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant