This repository contains code to reproduce the experiments from the paper Training Language Models to Self-Correct via Reinforcement Learning. It is built on the TRL framework with DeepSpeed support. The reward functions and prompt builders are inspired by the implementation accompanying the paper Self-Taught Self-Correction for Small Language Models.
A Docker container for running this repository is available at vityavitalich/trl:score.
- Clone this repository:
git clone <repository_url>
cd SCoRE
- Install the required packages:
pip install -r requirements.txt
No need to install packages if using the Docker container. Just pull and run:
docker pull vityavitalich/trl:score
docker run -v $(pwd):/app -it vityavitalich/trl:score

The configuration files are essential for defining the training setup, model parameters, and evaluation settings. The most important configuration file is configs/score_config.yaml, which defines all parameters needed for training.
To start the training process, use the following commands:
export ACCELERATE_CONFIG='configs/score_deepspeed.yaml'
export SCORE_CONFIG='configs/score_config.yaml'
export WANDB_API_KEY='<YOUR_WANDB_API_KEY>'
accelerate launch --config_file $ACCELERATE_CONFIG score.py --config_path $SCORE_CONFIG

| Parameter | Description | Default Value |
|---|---|---|
| `model_path` | Path to pretrained model checkpoint. | `Qwen/Qwen2.5-Math-1.5B-Instruct` |
| `cache_dir` | Directory to cache model weights. | `/home/data/v.moskvoretskii/cache/` |
| `random_seed` | Seed for reproducibility. | `42` |
| `wandb_project_name` | Project name for logging. | `SCoRE` |

| Parameter | Description | Default Value |
|---|---|---|
| `task_type` | Task type, e.g., `math` or `qa`. | `math` |
| `data_path` | Path to the dataset directory. | `data/math500` |
| `id_col` | Unique identifier column. | `unique_id` |
| `question_col` | Column containing questions. | `problem` |
| `gold_col` | Column with reference answers. | `answer` |
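As a rough illustration of how the column parameters map onto a dataset record, the sketch below reads the identifier, question, and reference answer from one row. Only the column names come from the defaults above; the example row and the `select_fields` helper are hypothetical and are not taken from the repository.

```python
# Hypothetical sketch: how id_col, question_col, and gold_col could select
# fields from one dataset row. The row below is made up for illustration.
data_config = {
    "id_col": "unique_id",
    "question_col": "problem",
    "gold_col": "answer",
}


def select_fields(row: dict, config: dict) -> dict:
    """Return the identifier, question, and reference answer of a row."""
    return {
        "id": row[config["id_col"]],
        "question": row[config["question_col"]],
        "gold": row[config["gold_col"]],
    }


example_row = {"unique_id": "example/1", "problem": "What is 2 + 3?", "answer": "5"}
fields = select_fields(example_row, data_config)
print(fields)
```

A dataset with different column names would only need different values for `id_col`, `question_col`, and `gold_col`.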
The reward function scores generated answers against the reference answers. For math tasks, `evaluator_mode` is always set to `final` and cannot be changed. For qa tasks, both `default` and `final` modes are supported.

| Parameter | Description | Default Value |
|---|---|---|
| `evaluator_mode` | Specifies how the generated answer is evaluated. `default` evaluates the entire generation, while `final` evaluates only the portion after the keyword defined by `evaluator_answer_marker`. Note: for math tasks, this is always set to `final`. | `final` |
| `evaluator_function` | The metric used for evaluation. Options include `math_acc` (for math tasks) and `in_acc`, `f1`, `em` (for QA tasks). | `math_acc` |
| `evaluator_answer_marker` | A keyword or phrase indicating where the final answer starts in the generated text. Text preceding this marker is ignored during evaluation. Example: `Final Answer: The final answer is` for mathematical tasks. | `Final Answer: The final answer is` |
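The difference between the two evaluator modes can be sketched as follows. This is an illustration under stated assumptions, not the repository's reward code: `extract_scored_text` and `exact_match` are hypothetical helpers, splitting on the last occurrence of the marker is a choice made here, and a generation without the marker is assumed to yield an empty answer.

```python
def extract_scored_text(generation: str, mode: str, marker: str) -> str:
    """Return the part of a generation that would be scored.

    Hypothetical helper: "default" keeps the whole generation, "final" keeps
    only the text after the last occurrence of the marker.
    """
    if mode == "default":
        return generation.strip()
    if mode == "final":
        _, found, tail = generation.rpartition(marker)
        # Assumption: a generation without the marker has no final answer.
        return tail.strip() if found else ""
    raise ValueError(f"unknown evaluator_mode: {mode!r}")


def exact_match(prediction: str, gold: str) -> float:
    """Toy exact-match score after trimming whitespace and a trailing period."""
    def normalize(text: str) -> str:
        return text.strip().rstrip(".").strip().lower()

    return float(normalize(prediction) == normalize(gold))


marker = "Final Answer: The final answer is"
generation = "2 + 3 equals 5. Final Answer: The final answer is 5."

final_text = extract_scored_text(generation, "final", marker)
full_text = extract_scored_text(generation, "default", marker)
print(exact_match(final_text, "5"))
print(exact_match(full_text, "5"))
```

In this toy case the `final` mode scores only the text after the marker, while the `default` mode compares the whole generation and therefore fails a strict exact match.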

| Parameter | Description | Default Value |
|---|---|---|
| `few_shot_dir` | Directory for few-shot learning examples. | `few_shots` |
| `number_output_initial_generations` | Number of answers generated per prompt. | `1` |
| `temperature` | Sampling temperature for randomness. | `0.9` |
| `max_tokens` | Maximum tokens generated per prompt. | `1024` |
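To make the effect of `temperature` concrete, the sketch below applies the usual temperature-scaled softmax to a few made-up logits. It illustrates the general technique only; the logits are invented and `softmax_with_temperature` is a hypothetical helper, not a function from this repository or from TRL.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
default = softmax_with_temperature(logits, 0.9)
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0], default[0], flat[0])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it more evenly and so produce more varied samples.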

| Parameter | Description | Default Value |
|---|---|---|
| `per_device_train_batch_size` | Number of samples per batch for each GPU. | `1` |
| `gradient_accumulation_steps` | Steps to accumulate gradients. | `4` |
| `local_rollout_forward_batch_size` | Batch size for multiple rollouts. | `4` |
| `total_episodes` | Total training samples processed. | `100` |
| `learning_rate` | Learning rate for training. | `5.0e-5` |
| `num_warmup_steps` | Warmup steps for learning rate scheduler. | `100` |
| `save_steps` | Checkpoint saving interval. | `1` |
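The batch-related defaults combine multiplicatively. The arithmetic below assumes the common convention that one accumulated batch holds per-device batch size × gradient accumulation steps × number of GPUs samples. `num_gpus` is a hypothetical value, and the trainer may apply further factors that are not listed in the table, so treat the result as an estimate.

```python
import math

# Defaults from the table above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
total_episodes = 100

num_gpus = 2  # hypothetical: set this to the number of processes you launch

# Assumption: samples consumed per accumulated batch across all GPUs.
samples_per_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Rough number of accumulated batches needed to cover total_episodes.
approx_batches = math.ceil(total_episodes / samples_per_batch)

print(samples_per_batch, approx_batches)
```

Changing `num_gpus` or `gradient_accumulation_steps` changes how many batches are needed to process the same `total_episodes`.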

| Parameter | Description | Default Value |
|---|---|---|
| `use_lora` | Whether to use LoRA. | `True` |
| `lora_rank` | Rank of LoRA adaptation. | `32` |
| `lora_alpha` | Scaling factor for LoRA. | `8` |
| `lora_dropout` | Dropout rate for LoRA layers. | `0.1` |
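In the standard LoRA formulation, the low-rank update is scaled by `lora_alpha / lora_rank`, so the defaults above give a scaling factor of 8 / 32 = 0.25. The snippet below works through that arithmetic and counts the parameters LoRA adds for a single weight matrix. The matrix dimensions are hypothetical and are not taken from the model, and `lora_param_count` is an illustrative helper.

```python
# Defaults from the table above.
lora_rank = 32
lora_alpha = 8

# Standard LoRA scaling applied to the low-rank update B @ A.
scaling = lora_alpha / lora_rank


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the A (rank x in) and B (out x rank) matrices."""
    return rank * in_features + out_features * rank


in_features, out_features = 1536, 1536  # hypothetical layer shape
added = lora_param_count(in_features, out_features, lora_rank)
full = in_features * out_features
print(scaling, added, added / full)
```

With a fixed rank, raising `lora_alpha` increases the weight of the adapter update relative to the frozen base weights.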
- Training Language Models to Self-Correct via Reinforcement Learning
- Self-Taught Self-Correction for Small Language Models
If you use this repository, please cite it as follows:
@misc{SCoRE2025,
author = {Viktor Moskvoretskii},
title = {SCoRE: Open-Source Implementation of Training Language Models to Self-Correct via Reinforcement Learning},
year = {2025},
url = {https://github.com/VityaVitalich/SCoRe},
note = {Accessed: YYYY-MM-DD}
}