Skip to content

Add agent-tool-routing eval (12 samples, Match template)#1655

Open
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-routing-eval
Open

Add agent-tool-routing eval (12 samples, Match template)#1655
MukundaKatta wants to merge 1 commit into
openai:mainfrom
MukundaKatta:add-agent-tool-routing-eval

Conversation

@MukundaKatta
Copy link
Copy Markdown

Eval details

Eval name

`agent-tool-routing` (id: `agent-tool-routing.dev.v0`)

Eval description

Tests an LLM's ability to select the single correct tool from a small list when given a natural-language user request. Each prompt presents 4 to 6 thematically related tools and asks the model to output the canonical tool name as the only thing it emits — no quotes, no JSON wrapper, no explanation, no surrounding whitespace.

What makes this challenging

Three concrete failure modes the eval is designed to surface:

  1. Format leakage — models often emit `I would use the \`fetch_url\` tool to ...` instead of `fetch_url`. The strict Match scoring catches this immediately.
  2. Similar-but-wrong selection — every prompt deliberately includes 1-2 tools that are semantically adjacent to the correct answer (e.g. `get_weather` vs `get_forecast`, `comment_on_issue` vs `create_issue`, `commit_changes` vs `push_branch`).
  3. Tool hallucination — none of the user-side prompts hint at tool names, so a model that hallucinates plausible-sounding names not in the provided list (`reset_user_password`, `make_invoice_payment`, `view_logs_recent`) fails.

What's in the dataset

12 hand-curated samples spanning a thematic mix of common production agent tool surfaces:

  • contact lookup
  • file system operations
  • analytics SQL
  • GitHub workflow
  • web fetching
  • billing
  • ops / Kubernetes
  • travel booking
  • git workflow
  • weather
  • user management
  • audio transcription

Each row has a 4-6-tool list and a single correct `ideal`. All 12 `ideal` values are distinct (no overlapping tool names across the prompt set).

Why this is a real-world use case

LLM tool routing is the first hop in every tool-using agent runtime. It precedes argument validation (which downstream libraries like `@mukundakatta/agentvet` and OpenAI function-calling schemas wrap) and structured-output enforcement. Agents that pick the wrong tool can never recover via better arg validation, so accuracy on the routing step is foundational. The samples here mirror tool-naming conventions that show up in production MCP servers and OpenAI function-calling schemas.

Eval class

`evals.elsuite.basic.match:Match` — no custom code, per the current `docs/build-eval.md` contribution guidelines.

Files

  • `evals/registry/data/agent_tool_routing/samples.jsonl` — 12 samples, chat-format
  • `evals/registry/evals/agent-tool-routing.yaml` — registry entry

Self-review checklist

  • Thematically consistent. All 12 prompts revolve around the same use case (LLM tool routing) with deliberate variation in domain to avoid overfitting to one tool family.
  • Challenging. Each prompt includes a similar-but-wrong distractor tool. Spot checks against gpt-3.5-turbo and gpt-4o show non-trivial error rate driven primarily by format leakage and adjacent-tool confusion.
  • Directionally clear. Each prompt has exactly one correct answer; the user-side request is unambiguous about the intended action.
  • Carefully crafted. Pre-commit (black, isort, autoflake) ran clean. JSONL parses; YAML lints; all rows have correct `input` + `ideal` keys. `ideal` is always a string. 12 distinct ideals.
  • No custom code. Uses the canonical `Match` template only.

Disclaimer

12 hand-curated samples is intentionally a foundational starter set rather than a large-scale benchmark. The same Match template scales to additional samples by appending JSONL rows; the YAML `disclaimer` field documents that explicitly.

Tests an LLM's ability to select the single correct tool from a small
list when given a natural-language user request. Each prompt presents
4-6 thematically related tools and asks for the canonical tool name
as the only output.

Failure modes targeted:
  - emitting explanation text instead of a bare tool name
  - selecting a similar-but-incorrect tool that is also in the list
  - inventing tools that are not in the list

Real-world use case: LLM tool routing in chat assistants and IDE copilots,
which is the call site that downstream libraries like AgentVet
(@mukundakatta/agentvet on npm) wrap with arg-shape validation. This
eval scopes to the correct-tool-name choice as a foundational capability
that precedes argument validation.

Domain mix across the 12 samples: contact lookup, file system,
analytics SQL, GitHub, web fetching, billing, ops/Kubernetes, travel
booking, git workflow, weather, user management, audio transcription.

Uses evals.elsuite.basic.match:Match (no custom code) as required by
the docs/build-eval.md current contribution guidelines.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant