Add SAPO policy loss objective #1864

Open
taivu1998 wants to merge 1 commit into THUDM:main from taivu1998:tdv/issue-954-sapo
@taivu1998

Summary

Implements SAPO as a new policy-loss objective for slime.

  • Add --policy-loss-type sapo alongside the existing clipped PPO objective.
  • Add --sapo-tau-pos and --sapo-tau-neg to control the positive- and negative-advantage SAPO gates.
  • Route SAPO through the shared policy-loss helper and Megatron loss path.
  • Log the SAPO soft-ratio auxiliary metric only when SAPO is active.
  • Document the new RL hyperparameters in the English and Chinese usage docs.
  • Add focused unit tests for clipped-PPO parity, SAPO math, gradients, numerical stability, and invalid loss-type handling.

Issue

Closes #954.

Implementation Details

The policy-loss helper now keeps the existing clipped PPO behavior as the default path and adds a SAPO path selected by policy_loss_type. SAPO computes a clamped probability ratio from the policy KL term, applies sign-aware tau selection based on the advantage, and uses the soft-ratio gate from the issue design while keeping the result finite for extreme log-ratio inputs.
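As a rough illustration of the SAPO path described above, the following scalar sketch works through one token. It assumes the gate takes the form (4/τ)·sigmoid(τ·(ratio − 1)) as in the SAPO paper, that the policy KL term is `old_log_prob - new_log_prob` (so the ratio is `exp(-ppo_kl)`), and that clamping happens on the log-ratio. The function name `sapo_token_loss`, the clamp bound of 20.0, and the default tau values are hypothetical and are not taken from this PR, whose helper operates on tensors and may differ in detail.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sapo_token_loss(ppo_kl, advantage, tau_pos=1.0, tau_neg=1.05, max_log_ratio=20.0):
    """Illustrative per-token SAPO loss (hypothetical helper, not the PR's code).

    Assumes ppo_kl = old_log_prob - new_log_prob, so ratio = exp(-ppo_kl).
    Returns (loss, soft_ratio).
    """
    # Clamp the log-ratio before exponentiating so the ratio stays finite
    # even for extreme inputs. The bound is an assumed value.
    log_ratio = max(-max_log_ratio, min(max_log_ratio, -ppo_kl))
    ratio = math.exp(log_ratio)

    # Sign-aware temperature: one tau for positive advantages, another otherwise.
    tau = tau_pos if advantage > 0 else tau_neg

    # Soft gate replacing the hard clip; its slope with respect to ratio is 1 at ratio == 1.
    soft_ratio = (4.0 / tau) * _sigmoid(tau * (ratio - 1.0))

    # Policy-gradient style loss: minimize the negative gated advantage.
    return -soft_ratio * advantage, soft_ratio
```

Under these assumptions the gate evaluates to 2/τ when the policy is exactly on-policy (ratio of 1), and it saturates smoothly rather than cutting off abruptly as the ratio moves away from 1, which is the behavior the soft-ratio metric would track.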

The Megatron loss wrapper passes the configured loss type and SAPO tau values into the helper, then reduces and logs the SAPO auxiliary metric when applicable. Argument validation rejects non-positive SAPO tau values and limits SAPO to the GRPO/GSPO advantage estimators, matching the supported rollout objective surface.
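The validation rules in the paragraph above could look roughly like the sketch below. The flag names `--policy-loss-type`, `--sapo-tau-pos`, and `--sapo-tau-neg` come from the summary. Everything else is an assumption made for illustration: the `ppo_clip` choice name, the default values, the `--advantage-estimator` flag and its `grpo`/`gspo` spellings, the `validate_sapo_args` helper, and the decision to check tau values only when SAPO is selected.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser with the SAPO-related flags (choices and defaults are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--policy-loss-type", choices=["ppo_clip", "sapo"], default="ppo_clip")
    parser.add_argument("--sapo-tau-pos", type=float, default=1.0)
    parser.add_argument("--sapo-tau-neg", type=float, default=1.05)
    parser.add_argument("--advantage-estimator", default="grpo")
    return parser


def validate_sapo_args(args: argparse.Namespace) -> None:
    """Hypothetical validation mirroring the rules described in the PR text."""
    if args.policy_loss_type != "sapo":
        return  # Assumed: SAPO-specific checks only apply when SAPO is active.
    if args.sapo_tau_pos <= 0 or args.sapo_tau_neg <= 0:
        raise ValueError("--sapo-tau-pos and --sapo-tau-neg must be positive")
    if args.advantage_estimator not in ("grpo", "gspo"):
        raise ValueError("--policy-loss-type sapo requires a GRPO or GSPO advantage estimator")


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    validate_sapo_args(cli_args)
```

Failing at argument-parsing time keeps a misconfigured run from reaching the training loop with a tau value that would make the gate undefined or inverted.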

Validation

  • uv run --isolated --python 3.10 --with pytest --with torch --with numpy --no-project python -m pytest tests/test_policy_loss.py -q
  • uv run --isolated --python 3.10 --with ruff --no-project ruff check slime/utils/ppo_utils.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py tests/test_policy_loss.py
  • uv run --isolated --python 3.10 --no-project python -m py_compile slime/utils/ppo_utils.py slime/backends/megatron_utils/loss.py slime/utils/arguments.py tests/test_policy_loss.py
  • git diff --check

A GPU integration smoke test was not run in this local environment.

@taivu1998 taivu1998 marked this pull request as ready for review April 26, 2026 23:05

Development

Successfully merging this pull request may close these issues.

[Feature] Implement SAPO algorithm as an alternative to hard clipping in GRPO/GSPO