Add SAPO policy loss objective by taivu1998 · Pull Request #1864 · THUDM/slime

taivu1998 · 2026-04-26T19:23:24Z

Summary

Implements SAPO as a new policy-loss objective for slime.

- Add --policy-loss-type sapo alongside the existing clipped PPO objective.
- Add --sapo-tau-pos and --sapo-tau-neg to control the positive- and negative-advantage SAPO gates.
- Route SAPO through the shared policy-loss helper and Megatron loss path.
- Log the SAPO soft-ratio auxiliary metric only when SAPO is active.
- Document the new RL hyperparameters in the English and Chinese usage docs.
- Add focused unit tests for clipped-PPO parity, SAPO math, gradients, numerical stability, and invalid loss-type handling.

Issue

Closes #954.

Implementation Details

The policy-loss helper now keeps the existing clipped PPO behavior as the default path and adds a SAPO path selected by policy_loss_type. SAPO computes a clamped probability ratio from the policy KL term, applies sign-aware tau selection based on the advantage, and uses the soft-ratio gate from the issue design while keeping the result finite for extreme log-ratio inputs.
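
As a rough illustration of the SAPO path described above, here is a minimal scalar sketch in plain Python. It is not the code in this PR: the function name `sapo_token_loss`, the `max_log_ratio` clamp bound, the default tau values, and the gate form `4 / tau * sigmoid(tau * (ratio - 1))` are all assumptions made for illustration, and the actual helper in slime/utils/ppo_utils.py operates on tensors and may differ in detail.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function for a Python float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sapo_token_loss(log_ratio, advantage, tau_pos=1.0, tau_neg=1.05, max_log_ratio=20.0):
    """Hypothetical per-token SAPO-style loss; returns (loss, soft_ratio).

    log_ratio is assumed to be new_log_prob - old_log_prob for one token.
    """
    # Clamp the log-ratio so exp() stays finite for extreme inputs.
    clamped = max(-max_log_ratio, min(max_log_ratio, log_ratio))
    ratio = math.exp(clamped)

    # Sign-aware temperature: one tau for positive advantages, another otherwise.
    tau = tau_pos if advantage > 0 else tau_neg

    # Assumed soft gate: a scaled sigmoid centred at ratio == 1.
    soft_ratio = 4.0 / tau * _sigmoid(tau * (ratio - 1.0))

    # Policy-gradient losses are minimised, hence the negative sign.
    return -soft_ratio * advantage, soft_ratio


loss, gate = sapo_token_loss(log_ratio=0.0, advantage=1.0)
print(loss, gate)
```

Under the assumed gate, the slope of the soft ratio with respect to the ratio is 1 at ratio == 1 for any tau, so on-policy tokens receive the same gradient as the unclipped objective, while the sigmoid saturates smoothly for ratios far from 1 instead of cutting off at a hard clip boundary.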

The Megatron loss wrapper passes the configured loss type and SAPO tau values into the helper, then reduces and logs the SAPO auxiliary metric when applicable. Argument validation rejects non-positive SAPO tau values and limits SAPO to the GRPO/GSPO advantage estimators, matching the supported rollout objective surface.
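
The validation rules described above could look roughly like the following sketch. The function name `validate_sapo_args` and the estimator strings `"grpo"` and `"gspo"` are assumptions for illustration; the real checks live in slime/utils/arguments.py and may be structured differently.

```python
def validate_sapo_args(policy_loss_type, advantage_estimator, sapo_tau_pos, sapo_tau_neg):
    """Hypothetical check mirroring the rules described in this PR."""
    if policy_loss_type != "sapo":
        # The SAPO-specific rules only apply when SAPO is selected.
        return

    # Both gate temperatures must be strictly positive.
    for name, value in (("sapo_tau_pos", sapo_tau_pos), ("sapo_tau_neg", sapo_tau_neg)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")

    # SAPO is limited to the GRPO/GSPO advantage estimators (assumed string names).
    if advantage_estimator not in ("grpo", "gspo"):
        raise ValueError(
            f"SAPO supports only the grpo/gspo advantage estimators, got {advantage_estimator!r}"
        )


validate_sapo_args("sapo", "grpo", sapo_tau_pos=1.0, sapo_tau_neg=1.05)  # passes silently
```

Failing fast at argument-parsing time keeps an unsupported combination from surfacing later as a confusing error inside the training loop.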

Validation

uv run --isolated --python 3.10 --with pytest --with torch --with numpy --no-project python -m pytest tests/test_policy_loss.py -q
uv run --isolated --python 3.10 --with ruff --no-project ruff check slime/utils/ppo_utils.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py tests/test_policy_loss.py
uv run --isolated --python 3.10 --no-project python -m py_compile slime/utils/ppo_utils.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py tests/test_policy_loss.py
git diff --check

The GPU integration smoke test was not run in this local environment.

Add SAPO policy loss objective

c31154c

taivu1998 marked this pull request as ready for review April 26, 2026 23:05

taivu1998 wants to merge 1 commit into THUDM:main from taivu1998:tdv/issue-954-sapo
