Multi step cube v2 by Masseeh · Pull Request #141 · ServiceNow/PipelineRL

Masseeh · 2026-05-06T21:17:15Z

Summary

This PR adds the cube-specific PipelineRL actor path under pipelinerl/cube_rl and wires it into the Ray actor entrypoint flow.

Key changes:

Adds a cube Ray actor loop with synchronous CubeBenchmarkActor rollout workers.
Adds a shared VLLMRouterActor plus RayVLLMRouter adapter for per-generation vLLM routing and admission control.
Uses a global cube benchmark actor pool instead of pinning actors to individual vLLM URLs.
Treats actor.llm_max_rollouts as per-vLLM generation capacity.
Configures cube-harness agent LLM settings once during cube actor setup, with train/test LLM parameter profiles inherited from PipelineRL config.
Simplifies cube actor rollout calls to rollout(task_id), leaving vLLM selection to the router.
Adds cube configs, launch entrypoint support, Ray worker log collection, and resource checks.
Adds supporting rollout/tool-call plumbing, TIR domain changes, and a lightweight results viewer.
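
The scheduling idea behind the global actor pool can be illustrated with a minimal sketch. This is not the code in this PR: `run_rollouts`, `fake_rollout`, and `max_in_flight` are hypothetical names, and a thread pool stands in for the Ray benchmark actors. It only shows a scheduler that keeps a bounded number of rollouts in flight, submits a new task when a slot frees up, and calls each worker with nothing but a task id.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_rollout(task_id: int) -> dict:
    """Hypothetical stand-in for a benchmark actor's rollout(task_id)."""
    return {"task_id": task_id, "reward": float(task_id % 2)}


def run_rollouts(task_ids, rollout_fn, max_in_flight: int) -> list[dict]:
    """Run rollout_fn over task_ids with at most max_in_flight pending calls."""
    results = []
    pending = set()
    tasks = iter(task_ids)
    exhausted = False
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while not exhausted or pending:
            # Fill free slots only; this is the backpressure point.
            while not exhausted and len(pending) < max_in_flight:
                try:
                    task_id = next(tasks)
                except StopIteration:
                    exhausted = True
                    break
                pending.add(pool.submit(rollout_fn, task_id))
            if pending:
                # Block until at least one rollout finishes, then collect it.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(future.result() for future in done)
    return results


finished = run_rollouts(range(6), fake_rollout, max_in_flight=2)
```

Results arrive in completion order, not submission order, so a caller that needs grouping would key on `task_id`.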

Design Notes

The cube actor loop now handles rollout-level scheduling and backpressure over benchmark actors, while VLLMRouterActor owns vLLM-level routing. Each cube-harness LLM generation acquires a route lease from the router and releases it after completion or error.

This keeps CubeActorLoop focused on task scheduling, retries, metrics, and eval boundaries, while keeping admission control close to the actual generation call.
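
The acquire/release pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the actual VLLMRouterActor: `ToyRouter` and `route_lease` are hypothetical names, the router runs in-process with a threading condition instead of as a Ray actor, the least-loaded selection policy is an assumption, and the URLs are placeholders. `max_per_url` plays the role of per-vLLM generation capacity, so two servers at 32 each would admit at most 64 concurrent generations.

```python
import threading
from contextlib import contextmanager


class ToyRouter:
    """Hypothetical in-process stand-in for a router with per-URL capacity."""

    def __init__(self, urls, max_per_url):
        self._max = max_per_url
        self._active = {url: 0 for url in urls}
        self._cond = threading.Condition()

    def acquire(self) -> str:
        """Block until a URL has spare capacity, then lease the least-loaded one."""
        with self._cond:
            while True:
                url = min(self._active, key=self._active.get)
                if self._active[url] < self._max:
                    self._active[url] += 1
                    return url
                self._cond.wait()

    def release(self, url: str) -> None:
        """Return a lease and wake one waiting caller."""
        with self._cond:
            self._active[url] -= 1
            self._cond.notify()

    def active(self) -> dict:
        """Snapshot of in-flight generations per URL."""
        with self._cond:
            return dict(self._active)


@contextmanager
def route_lease(router: ToyRouter):
    """Hold a lease for one generation; release it on success or error."""
    url = router.acquire()
    try:
        yield url
    finally:
        router.release(url)


router = ToyRouter(["http://vllm-0", "http://vllm-1"], max_per_url=32)
with route_lease(router) as url:
    in_flight = router.active()[url]  # 1 while the generation is running
after = router.active()
```

Putting the release in a `finally` block is what keeps a failed generation from permanently consuming capacity.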

Testing

I trained Qwen3-4B-Instruct-2507 on TIR on a single node, using both the default PipelineRL implementation and the Cube-based version of TIR.

Shared hyperparameters

actor.buffer_tokens=200 \
force_restart=true \
fp32_lm_head=true \
finetune.learning_rate=1e-6 \
finetune.attempts=8 \
finetune.rl.policy_loss=gspo \
finetune.rl.epsilon_low=3e-3 \
finetune.rl.epsilon_high=4e-3 \
+finetune.rl.filter_zero_advantage_groups=true \
finetune.max_train_steps=100 \
finetune.seq_length=32000 \
finetune.seq_parallel=4 \
finetune.gradient_accumulation_passes=128 \
vllm_config.vllm_kwargs.max_model_len=32000 \
llm.parameters.max_tokens=16000 \
llm.parameters.temperature=0.7 \
llm.parameters.max_completion_tokens=16000 \
+llm.parameters.max_model_len=32000 \
test_llm.parameters.max_tokens=16000 \
test_llm.parameters.temperature=0.7 \
test_llm.parameters.max_completion_tokens=16000 \
+test_llm.parameters.max_model_len=32000 \
world.actor_fraction=4 \
world.preprocessor_fraction=0 \
world.finetune_fraction=4 \
streams=files \
eval_every_n_versions=0 \
vllm_config.vllm_kwargs.served_model_name=Qwen3-4B-Instruct-2507
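
The token budget implied by these overrides can be checked with a short sketch. `parse_overrides` is a hypothetical helper, not part of PipelineRL; it reads Hydra-style `key=value` overrides and drops the leading `+` that Hydra uses to add a key missing from the base config. The budget arithmetic assumes the server requires prompt plus completion to fit within `max_model_len`.

```python
def parse_overrides(lines):
    """Parse `key=value` overrides, ignoring trailing shell continuations."""
    parsed = {}
    for line in lines:
        token = line.rstrip("\\").strip()
        if not token:
            continue
        key, value = token.lstrip("+").split("=", 1)
        parsed[key] = value
    return parsed


overrides = parse_overrides([
    "finetune.seq_length=32000 \\",
    "vllm_config.vllm_kwargs.max_model_len=32000 \\",
    "llm.parameters.max_tokens=16000 \\",
    "+llm.parameters.max_model_len=32000 \\",
])

max_model_len = int(overrides["vllm_config.vllm_kwargs.max_model_len"])
max_tokens = int(overrides["llm.parameters.max_tokens"])
# Tokens left for the prompt and earlier turns if a generation uses its full budget.
context_budget = max_model_len - max_tokens
```

With the values above, a generation that uses all 16000 completion tokens leaves 16000 tokens for context, and the trainer's `finetune.seq_length` matches the 32000-token model window.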

Cube-RL-specific hyperparameters

actor.cube_actor_num_cpus=0.5

Results

[Figure: results with actor.llm_max_rollouts = 32]

Effect of scaling up llm_max_rollouts

[Figure: results after scaling actor.llm_max_rollouts from 32 to 64]

Setup and Cube-compatible repositories

This setup has been tested with the uv package manager and Python 3.12.13.

Required repositories:

Cube Harness: https://github.com/Masseeh/cube-harness/tree/pipelinerl-cube-v2
Cube Standard: https://github.com/Masseeh/cube-standard/tree/piplinerl-cube

Clone cube-harness and cube-standard alongside the PipelineRL repository before running the project. For example:

parent-directory/
├── PipelineRL/
├── cube-harness/
└── cube-standard/
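
A quick way to confirm this layout before launching is sketched below. `missing_repos` is a hypothetical helper, not something shipped in this PR; the directory names come from the tree above.

```python
from pathlib import Path

# Directory names taken from the layout above.
EXPECTED_REPOS = ("PipelineRL", "cube-harness", "cube-standard")


def missing_repos(parent: Path) -> list[str]:
    """Return the expected repository directories that are absent under parent."""
    return [name for name in EXPECTED_REPOS if not (parent / name).is_dir()]
```

From inside the PipelineRL checkout, `missing_repos(Path.cwd().parent)` should return an empty list when all three repositories are siblings.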

@rafapi @ehsk could you take a look at this PR?

rafapi · 2026-05-07T11:34:37Z

What is the result viewer for?

Masseeh · 2026-05-07T12:42:06Z

It's a web interface for viewing the contents of rollouts produced by actor streams. For example, the image below shows the first 10 rollouts. Clicking a rollout lets you inspect the contents of each group as well as the individual steps within an episode. Launch it with the following command:

uv run python -m results_viewer --results [experiment_folder] --host 127.0.0.1 --port 8765

rafapi · 2026-05-07T12:58:04Z

This is such a weird prompt:

Masseeh · 2026-05-07T14:55:30Z

Yes! It's part of the open_reasoner_zero_extended_72k dataset.

{
   "dataset":"open_reasoner_zero_extended_72k",
   "task":"9. Let $A=\\{1,2,3, \\cdots, 2 n, 2 n+1\\}, B \\subseteq A$, and for any three distinct elements $x, y, z$ in $B$, we have $x+y \\neq z$. Find the maximum value of $|B|$.\n\nTranslate the above text into English, please retain the original text's line breaks and format, and output the translation result directly.",
   "answer":"\\boxed{n+1}"
}

Masseeh · 2026-05-12T19:52:33Z

Closing this in favor of PR #143

Masseeh added 10 commits May 1, 2026 21:05

Clean final changes (101c31a)
add length penalty (c80ab3a)
add difficulty-aware penalty (a577268)
new ray (63eb59b)
fix timeout (55368e1)
match tir and cube function calls (f4b50fc)
move cube-rl inside pipelinerl package (0d04b10)
refactor based on new llm_router (263ab86)
fix agent config (80ccec8)
remove top_k (5378cd4)

Masseeh marked this pull request as ready for review May 6, 2026 21:18

Masseeh marked this pull request as draft May 6, 2026 21:18

Masseeh closed this May 12, 2026

Masseeh wants to merge 10 commits into ServiceNow:multi-step from Masseeh:multi-step-cube-v2
