NiceEval

A progressive, agent-native, lightweight eval tool for AI agents, with excellent DX

NiceEval is an agent-native eval tool inspired by eve. Its DX is designed so that anyone can install and configure it in about 10 minutes. It is also versatile: it can evaluate plugins, hooks, and skills written for the Claude Code and Codex coding agents, and it can directly evaluate your own AI agent application or framework (AI SDK, LangGraph, Pi, or any custom agent loop).

After an eval completes, NiceEval generates a readable report and lets you inspect the details of the agent's behavior, which is convenient for debugging and optimization.

Why NiceEval when DeepEval, LangFuse, and BrainTrust already exist

NiceEval is an AI-native eval tool. Tools built around a dataset of golden Input vs. Expected Output pairs do not fit real agent evaluation well. NiceEval is built to evaluate agents at a finer grain: multi-turn conversations, multi-agent setups, tool calls, skill loading, and more.

It also coexists with LangFuse and BrainTrust: use them for tracing, or upload eval results to both (in progress).

Architecture

NiceEval supports two integration modes, depending on whether the agent under test needs an isolated sandbox filesystem.

Mode 1: Sandbox (Docker, E2B) — run coding agents like Codex and Claude Code that need a sandbox

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │        Docker Sandbox         │
   │   ┌────────────────────────┐  │
   │   │ Codex / Claude Code /  │  │
   │   │ apps needing isolation │  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘

Mode 2: Direct — connect straight to your own AI Agent

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official, or your own implementation)
        ▼
   ┌──────────────────────────────┐
   │       your own AI Agent       │
   │   (AI SDK·LangGraph·Pi and    │
   │    other agent frameworks —   │
   │         no Docker needed)     │
   └──────────────────────────────┘
  • NiceEval core owns discovery, scheduling, scoring, reporting, and artifacts.
  • Agent adapters are the open boundary: you decide how to call the system under test.
  • Coding agents that need filesystem isolation run inside the Docker Sandbox; your own AI agent can connect directly, without Docker.
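
To make the adapter boundary concrete, here is a minimal sketch of what a Direct-mode adapter could look like. Every name in it (`ToolCall`, `AgentTurn`, `AgentAdapter`, `toyWeatherAgent`, `directAdapter`) is hypothetical and chosen for illustration; NiceEval's actual adapter interface may differ, so treat this as a picture of the idea rather than the real API.

```typescript
// Hypothetical shapes for illustration only; NiceEval's real adapter types may differ.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

interface AgentTurn {
  message: string; // final text reply
  toolCalls: ToolCall[]; // tools invoked while producing the reply
}

interface AgentAdapter {
  send(prompt: string): Promise<AgentTurn>;
}

// Stand-in for "your own AI agent": a plain function living in the same process.
function toyWeatherAgent(prompt: string): AgentTurn {
  const match = /weather in (\w+)/i.exec(prompt);
  if (match === null) {
    return { message: "I can only answer weather questions.", toolCalls: [] };
  }
  const city = match[1];
  return {
    message: `It is sunny in ${city}.`,
    toolCalls: [{ name: "get_weather", input: { city } }],
  };
}

// Direct mode: the adapter calls the agent in-process, with no sandbox involved.
const directAdapter: AgentAdapter = {
  send: async (prompt) => toyWeatherAgent(prompt),
};
```

Under this assumption the eval core only ever sees `send`, so swapping the toy function for an HTTP call or a framework invocation would not change the eval files.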

Example

Running an eval takes two files: the eval itself (what to check) and an experiment (which agent to run it against). The CLI won't run a bare eval id: in niceeval exp <experiment> <eval prefix>, the experiment is what picks the system under test. Below is a real eval against a directly connected web agent (full project in examples/zh/ai-sdk/). It checks that the agent calls a tool for live weather questions and answers from the tool result instead of making it up:

// evals/eval-tool-call.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "Verify the agent calls the weather tool and answers from its result",

  async test(t) {
    const turn = await t.send("What's the weather in Beijing today?");
    t.succeeded();

    await t.group("calls get_weather with the right city", () => {
      t.calledTool("get_weather", { input: { city: "Beijing" } });
      t.messageIncludes(/°C|sunny|cloudy|rain/);
    });

    const second = await t.send("What about Shanghai tomorrow?");
    second.messageIncludes("Shanghai");

    t.judge.autoevals
      .closedQA("Does the reply use the tool's weather data instead of making up a temperature?")
      .atLeast(0.7);
  },
});

// experiments/local.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "./adapter"; // your agent adapter, pointed at the system under test

export default defineExperiment({
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
});

npx niceeval exp local eval-tool-call  # run only eval-tool-call under the local experiment
npx niceeval view
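
As a rough mental model of what the tool-call and message checks in the eval above verify, the sketch below runs the same kind of checks against a hand-written recorded turn. It is not NiceEval's implementation: `RecordedTurn`, `hasToolCall`, and `replyIncludes` are hypothetical names, and the rule that extra keys in the tool input are ignored is an assumption.

```typescript
// Hypothetical recorded-turn shape; NiceEval's internal representation may differ.
interface RecordedToolCall {
  name: string;
  input: Record<string, unknown>;
}

interface RecordedTurn {
  message: string;
  toolCalls: RecordedToolCall[];
}

// True when some call to `name` has every expected key with a strictly equal value.
// Extra keys in the actual input are ignored (an assumed "partial match" rule).
function hasToolCall(
  turn: RecordedTurn,
  name: string,
  expectedInput: Record<string, unknown> = {},
): boolean {
  return turn.toolCalls.some(
    (call) =>
      call.name === name &&
      Object.entries(expectedInput).every(([key, value]) => call.input[key] === value),
  );
}

// True when the reply contains the substring or matches the pattern.
function replyIncludes(turn: RecordedTurn, needle: string | RegExp): boolean {
  return typeof needle === "string"
    ? turn.message.includes(needle)
    : needle.test(turn.message);
}

const exampleTurn: RecordedTurn = {
  message: "Beijing is sunny today, around 26°C.",
  toolCalls: [{ name: "get_weather", input: { city: "Beijing", unit: "celsius" } }],
};

console.log(hasToolCall(exampleTurn, "get_weather", { city: "Beijing" })); // true
console.log(hasToolCall(exampleTurn, "get_weather", { city: "Shanghai" })); // false
console.log(replyIncludes(exampleTurn, /°C|sunny|cloudy|rain/)); // true
```

Checks like these are deterministic and cheap; the judge step in the eval covers the part they cannot express, namely whether the reply actually relies on the tool's data.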

For coding agents that need an isolated workspace (Codex, Claude Code plugins/skills), see examples/zh/coding-agent-skill/: evals there use t.sandbox.uploadDirectory() to seed the workspace, t.fileChanged() / t.file() to check what changed, and t.sandbox.runCommand() to run tests.

Quick Start

READ https://raw.githubusercontent.com/CorrectRoadH/niceeval/refs/heads/main/INIT.md and install niceeval for this repo.

Start from the scenario that matches what you need to evaluate:

Roadmap

Official Adapters

  • Agent Software

    • Claude Code
    • Codex
    • Bub
    • OpenClaw
    • Hermes Agent
    • Alma
    • ...
  • Agent Frameworks

    • AI SDK
    • LangGraph
    • Claude SDK
    • Codex SDK
    • vm0
    • Cursor Agent SDK

Documentation

Acknowledgements

This project was inspired by the projects below, or had AI learn from their code: eve agent eval ponytail

Thanks to the following communities
