A lightweight rollout, evaluation, and diagnostics framework for code-agent RL.
Code-agent RL needs reproducible rollout, execution feedback, reward logging, and error diagnostics. CodeRL-Lite is an early-stage scaffold for that loop:
dataset loading -> model sampling -> code execution -> reward/eval result -> pass@k -> trajectory logging -> report.
This is not a full training framework. PPO, GRPO, DPO, Docker isolation, and UI work are out of scope for v0.1.
pip install -e ".[dev]"
pytest
python -m coderl_lite.cli run-toy --out runs/toy_rollouts.jsonl
python -m coderl_lite.cli report --input runs/toy_rollouts.jsonl --out benchmarks/results/report.md
The toy command uses DummyBackend, so no API keys are needed.
run-toy writes JSONL trajectories like:
{"task_id": "toy/add", "prompt": "Write a Python function add(a, b) that returns a + b.", "completion": "def add(a, b):\n return a + b", "passed": true, "error_type": "passed", "reward": 1.0}
The report includes:
Number of tasks: 1
Number of samples: 2
pass@1: 1.0000
pass@2: 1.0000
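For orientation, here is a minimal sketch of how numbers like these could be derived from JSONL records. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), averaged over tasks; the README does not state which estimator CodeRL-Lite actually uses, and the helper names `pass_at_k` and `summarize` are hypothetical. Only the `task_id` and `passed` fields from the trajectory format are read.

```python
import json
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples of which c passed (assumed estimator)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(jsonl_lines, ks=(1, 2)):
    """Group records by task_id, then average pass@k over tasks."""
    per_task = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            record = json.loads(line)
            per_task[record["task_id"]].append(bool(record["passed"]))
    if not per_task:
        raise ValueError("no records to summarize")
    summary = {
        "tasks": len(per_task),
        "samples": sum(len(v) for v in per_task.values()),
    }
    for k in ks:
        scores = [pass_at_k(len(v), sum(v), k) for v in per_task.values()]
        summary[f"pass@{k}"] = sum(scores) / len(scores)
    return summary


# Hypothetical input: one task with two passing samples.
lines = [
    '{"task_id": "toy/add", "passed": true}',
    '{"task_id": "toy/add", "passed": true}',
]
print(summarize(lines))
```

With one failing and one passing sample for the same task, this estimator would give pass@1 = 0.5 and pass@2 = 1.0.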
OpenAIBackend is optional and uses the official openai Python package.
pip install -e ".[openai]"
set OPENAI_API_KEY=...
set OPENAI_BASE_URL=...
set OPENAI_MODEL=...
OPENAI_BASE_URL and OPENAI_MODEL are optional and useful for OpenAI-compatible endpoints.
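As an illustration of the "one required, two optional" rule, the sketch below resolves the three variables into a settings dictionary. How OpenAIBackend actually reads them is not shown in this README, so the function name `openai_settings` and the returned layout are assumptions.

```python
import os


def openai_settings(env=None) -> dict:
    """Resolve backend settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY must be set to use the OpenAI backend")
    return {
        "api_key": api_key,
        # Optional: None means "use the client's default endpoint".
        "base_url": env.get("OPENAI_BASE_URL") or None,
        # Optional: None means "let the backend pick its default model".
        "model": env.get("OPENAI_MODEL") or None,
    }


# Example with an explicit mapping instead of the real environment.
print(openai_settings({"OPENAI_API_KEY": "sk-example"}))
```

The `api_key` and `base_url` values match keyword arguments accepted by the `openai.OpenAI` client constructor; the model name is normally passed per request.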
The local judge executes generated Python code on your machine. This is unsafe for untrusted model output. The current judge is meant for trusted toy tasks only. Real usage should run generated code in Docker or another sandbox.
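To make the risk concrete, here is a rough sketch of what a local judge of this kind might look like: it runs the completion plus a test snippet in a child Python process with a timeout. This is not CodeRL-Lite's actual judge, the name `run_with_timeout` is hypothetical, and the `timeout` and `runtime_error` labels are invented for illustration (only `passed` appears in the trajectory example). A child process with a timeout is not a sandbox: the code still has full access to your files and network.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_timeout(code: str, test: str, timeout_s: float = 5.0) -> dict:
    """Run code plus a test snippet in a child interpreter. NOT a sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores user site-packages and PYTHON* env vars.
                [sys.executable, "-I", str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Hypothetical label for runs that exceed the time limit.
            return {"passed": False, "error_type": "timeout", "reward": 0.0}
    if proc.returncode == 0:
        return {"passed": True, "error_type": "passed", "reward": 1.0}
    # Hypothetical label for any non-zero exit (exceptions, failed asserts).
    return {"passed": False, "error_type": "runtime_error", "reward": 0.0}


result = run_with_timeout("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")
print(result)
```

The timeout only bounds wall-clock time; it does nothing to limit what the code can read, write, or send while it runs.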
- v0.1: rollout/eval/diagnostics
- v0.2: vLLM backend + MBPP adapter + parallel execution
- v0.3: RL data filtering + rejection sampling baseline
This project is currently early-stage.