NiceEval

A progressive, agent-native, lightweight eval tool for AI agents, with excellent DX

NiceEval is an agent-native eval tool inspired by eve. It is designed for good DX: you can install and configure it in about 10 minutes. It is also versatile: it can eval plugins, Hooks, and Skills written for coding agents such as Claude Code and Codex, and it can directly eval your own AI agent application or framework (AI SDK, LangGraph, Pi, or any custom agent loop).

After an eval completes, NiceEval generates readable reports and lets you inspect the agent's behavior in detail, which helps with debugging and optimization.

Why NiceEval when DeepEval, LangFuse, and BrainTrust already exist?

NiceEval is an AI-native eval tool. Tools built around datasets of golden Input vs. Expected Output pairs don't fit real agent evaluation well, because that shape captures only the final answer, not what the agent did along the way. NiceEval is built for evaluating agents at a finer grain: multi-turn conversations, multi-agent setups, tool calls, skill loading, and more.

It also coexists with LangFuse and BrainTrust: use them for tracing, or upload eval results to both (in progress).
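To make the contrast with golden-pair tools concrete, here is a minimal sketch in plain TypeScript. It does not use NiceEval's API: the `Turn` and `ToolCall` types and the `goldenMatch` and `calledTool` helpers are hypothetical names invented for this illustration, and the conversation data is made up.

```typescript
// Hypothetical types for illustration only; these are not NiceEval's API.
type ToolCall = { name: string; input: Record<string, unknown> };
type Turn = { user: string; reply: string; toolCalls: ToolCall[] };

// Dataset/golden style: one input, one expected output, compared as strings.
function goldenMatch(reply: string, expected: string): boolean {
  return reply.trim() === expected.trim();
}

// Trace style: check what the agent did during a turn, not only what it said.
function calledTool(turn: Turn, name: string, input: Record<string, unknown>): boolean {
  return turn.toolCalls.some(
    (call) =>
      call.name === name &&
      Object.entries(input).every(([key, value]) => call.input[key] === value),
  );
}

// Made-up two-turn conversation: the first reply is grounded in a tool call,
// the second one answers without calling any tool.
const firstTurn: Turn = {
  user: "What's the weather in Beijing today?",
  reply: "It is 21°C and sunny in Beijing.",
  toolCalls: [{ name: "get_weather", input: { city: "Beijing" } }],
};
const secondTurn: Turn = {
  user: "What about Shanghai tomorrow?",
  reply: "Shanghai will be around 24°C with light rain.",
  toolCalls: [],
};

// A correct but differently worded reply fails the golden comparison.
const goldenPasses = goldenMatch(firstTurn.reply, "Sunny, 21°C.");

// The trace checks separate the grounded turn from the ungrounded one.
const firstTurnGrounded = calledTool(firstTurn, "get_weather", { city: "Beijing" });
const secondTurnGrounded = calledTool(secondTurn, "get_weather", { city: "Shanghai" });
```

In this sketch the string comparison rejects a reasonable first reply, while the per-turn tool check flags the second reply as ungrounded even though its text looks plausible. That is the kind of finer-grained signal the paragraph above describes.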

Architecture

NiceEval supports two integration modes, depending on whether the agent under test needs an isolated sandbox filesystem.

Mode 1: Sandbox (Docker, E2B) — run coding agents like Codex and Claude Code that need a sandbox

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │        Docker Sandbox         │
   │   ┌────────────────────────┐  │
   │   │ Codex / Claude Code /  │  │
   │   │ apps needing isolation │  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘

Mode 2: Direct — connect straight to your own AI Agent

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official, or your own implementation)
        ▼
   ┌──────────────────────────────┐
   │       your own AI Agent       │
   │   (AI SDK·LangGraph·Pi and    │
   │    other agent frameworks —   │
   │         no Docker needed)     │
   └──────────────────────────────┘

NiceEval core owns discovery, scheduling, scoring, reporting, and artifacts.
Agent adapters are the open boundary: you decide how to call the system under test.
Coding agents that need filesystem isolation run inside the Docker Sandbox; your own AI agent can connect directly, without Docker.
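The adapter boundary described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `AgentAdapter` interface, the `AgentReply` type, and the `directAdapter`, `myAgentLoop`, and `runCheck` functions are hypothetical and are not taken from NiceEval's source, so the real adapter interface may look different.

```typescript
// Hypothetical adapter shape for illustration; NiceEval's real adapter
// interface may differ, so check the project docs before relying on this.
type ToolCall = { name: string; input: Record<string, unknown> };
type AgentReply = { text: string; toolCalls: ToolCall[] };

interface AgentAdapter {
  send(message: string): Promise<AgentReply>;
}

// Stand-in for "your own AI agent": any async function that takes a message.
// It fakes a weather tool call instead of talking to a model.
async function myAgentLoop(message: string): Promise<AgentReply> {
  const city = message.includes("Beijing") ? "Beijing" : "unknown";
  return {
    text: `Weather for ${city}: 21°C`,
    toolCalls: [{ name: "get_weather", input: { city } }],
  };
}

// Direct mode: the adapter calls the agent in-process, with no sandbox.
function directAdapter(agent: (message: string) => Promise<AgentReply>): AgentAdapter {
  return { send: (message) => agent(message) };
}

// The core side only depends on the AgentAdapter interface, so the same
// check could run against a sandboxed adapter without changes.
async function runCheck(
  adapter: AgentAdapter,
  prompt: string,
  expectedTool: string,
): Promise<boolean> {
  const reply = await adapter.send(prompt);
  return reply.toolCalls.some((call) => call.name === expectedTool);
}

const adapter = directAdapter(myAgentLoop);
const weatherCheck: Promise<boolean> = runCheck(adapter, "Weather in Beijing?", "get_weather");
const searchCheck: Promise<boolean> = runCheck(adapter, "Weather in Beijing?", "web_search");
```

The design point is that the check never needs to know how the agent is reached. Swapping the in-process adapter for one that drives an agent inside a container would, under these assumptions, leave `runCheck` untouched.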

Example

Running an eval takes two files: the eval itself (what to check) and an experiment (which agent to run it against). The CLI won't run a bare eval id — the experiment in niceeval exp <experiment> <eval prefix> is what picks the system under test. Here's a real eval against a directly-connected web agent (full project in examples/zh/ai-sdk/), checking that the agent calls a tool for live weather questions and answers from the tool result instead of making it up:

// evals/eval-tool-call.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "Verify the agent calls the weather tool and answers from its result",

  async test(t) {
    const turn = await t.send("What's the weather in Beijing today?");
    t.succeeded();

    await t.group("calls get_weather with the right city", () => {
      t.calledTool("get_weather", { input: { city: "Beijing" } });
      t.messageIncludes(/°C|sunny|cloudy|rain/);
    });

    const second = await t.send("What about Shanghai tomorrow?");
    second.messageIncludes("Shanghai");

    t.judge.autoevals
      .closedQA("Does the reply use the tool's weather data instead of making up a temperature?")
      .atLeast(0.7);
  },
});

// experiments/local.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "./adapter"; // your agent adapter, pointed at the system under test

export default defineExperiment({
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
});

npx niceeval exp local eval-tool-call  # run only eval-tool-call under the local experiment
npx niceeval view

For coding agents that need an isolated workspace (Codex, Claude Code plugins/skills), see examples/zh/coding-agent-skill/: evals there use t.sandbox.uploadDirectory() to seed the workspace, t.fileChanged() / t.file() to check what changed, and t.sandbox.runCommand() to run tests.

Quick Start

READ https://raw.githubusercontent.com/CorrectRoadH/niceeval/refs/heads/main/INIT.md and install niceeval for this repo.

Start from the scenario that matches what you need to evaluate:

Roadmap

Official Adapters

Documentation

Quickstart

Acknowledgements

This project was inspired by the projects below, or had AI learn from their code: eve agent eval ponytail

Thanks to the following communities
