Multi-step RL with CUBE by Masseeh · Pull Request #143 · ServiceNow/PipelineRL

Masseeh · 2026-05-12T19:51:17Z

Summary

This PR adds and extends the cube-specific PipelineRL actor path under pipelinerl/cube_rl, wiring cube-harness rollouts into the Ray actor entrypoint flow with per-generation vLLM routing, elastic multi-cube scheduling, richer rollout metadata, and results-viewer support.

Key changes:

Adds a cube Ray actor loop with synchronous cube-harness rollout workers.
Adds a shared VLLMRouterActor plus RayVLLMRouter adapter for per-generation vLLM routing and admission control.
Uses a global cube worker pool instead of pinning workers to individual vLLM URLs.
Treats actor.llm_max_rollouts as per-vLLM generation capacity.
Refactors cube execution around cube_params.cubes, allowing multiple train/test cubes in one run.
Adds cube-qualified task refs via CubeTaskRef(cube_id, task_id).
Moves task discovery into a cube registry owned by the PipelineRL actor process.
Makes Ray cube workers generic and lazy: workers load shared state at startup and prepare a cube only when assigned a rollout for that cube.
Adds simple warm-worker affinity to reduce cube setup churn.
Configures cube-harness agent LLM settings per cube with train/test LLM parameter profiles inherited from PipelineRL config.
Keeps vLLM selection inside the routed LLM call path rather than rollout scheduling.
Adds Ray worker log collection, resource checks, cube configs, and launch entrypoint support.
Adds route/LLM metadata plumbing from cube-harness LLM calls into PipelineRL TrainingText.metadata.
Adds rollout attribution metadata: cube/task/trajectory IDs, step indices, LLM call IDs, termination reason, finish reason, reward context, and token counts.
Adds basic router health suppression and router snapshot metrics.
Fixes reward adjustment ordering so length-penalty adjustments are applied before writing actor-stream samples.
Extends the lightweight results viewer to show route/server IDs, LLM latency, vLLM lease wait time, routed call counts, and router stats.
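As a rough illustration of the cube-qualified task refs listed above, the sketch below models CubeTaskRef(cube_id, task_id) as a frozen dataclass so it can serve as a dictionary key. Only the two field names come from this PR description; the field types, the frozen/hashable choice, and the cube IDs "math-train" and "math-test" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeTaskRef:
    """Assumed shape: a task ID qualified by the cube it belongs to."""

    cube_id: str
    task_id: str


# The same task_id can appear in two cubes without colliding,
# because equality and hashing use both fields.
train_ref = CubeTaskRef(cube_id="math-train", task_id="0001")  # hypothetical IDs
test_ref = CubeTaskRef(cube_id="math-test", task_id="0001")
pending = {train_ref: "queued", test_ref: "queued"}
```

Qualifying the task by cube is what lets one run mix several train/test cubes whose task IDs overlap.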

Design Notes

The cube actor loop handles rollout-level scheduling, retries, metrics, eval boundaries, and bounded pending work over a global Ray worker pool. The vLLM router owns generation-level admission control.

Each cube-harness LLM generation acquires a short-lived route lease from VLLMRouterActor and releases it after completion or error. This keeps routing tied to actual model calls rather than whole rollouts, which matters for multi-step agent trajectories.
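The lease lifecycle described above can be sketched with a small in-process router. This is not the PR's VLLMRouterActor (which is a Ray actor); ToyRouter, its method names, the least-loaded selection rule, and the server URLs are all assumptions chosen to show per-generation admission control with release on both success and error.

```python
import threading
from contextlib import contextmanager


class ToyRouter:
    """In-process stand-in for a router actor (hypothetical, for illustration)."""

    def __init__(self, server_urls, capacity_per_server):
        self._capacity = capacity_per_server
        self._in_flight = {url: 0 for url in server_urls}
        self._cond = threading.Condition()

    def acquire(self):
        """Block until some server has a free generation slot, then claim it."""
        with self._cond:
            while True:
                # Pick the least-loaded server (an assumed policy).
                url = min(self._in_flight, key=self._in_flight.get)
                if self._in_flight[url] < self._capacity:
                    self._in_flight[url] += 1
                    return url
                self._cond.wait()

    def release(self, url):
        """Free the slot and wake one waiting caller."""
        with self._cond:
            self._in_flight[url] -= 1
            self._cond.notify()

    @contextmanager
    def lease(self):
        """Hold a slot for exactly one generation."""
        url = self.acquire()
        try:
            yield url
        finally:
            self.release(url)  # runs on completion and on error

    def in_flight(self):
        with self._cond:
            return dict(self._in_flight)


router = ToyRouter(["http://vllm-0", "http://vllm-1"], capacity_per_server=2)

with router.lease() as url:
    during = router.in_flight()[url]  # slot is held while the call runs

try:
    with router.lease():
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

after = router.in_flight()  # all slots are free again
```

Because the lease wraps a single generation rather than a whole rollout, a multi-step trajectory holds a slot only while a model call is actually in flight.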

The newer multi-cube path treats cubes more like datasets: each configured cube owns its benchmark and agent config, while Ray workers are generic execution slots. Workers lazily install and set up the cube they are assigned, and the scheduler prefers workers that are already warm for the requested cube.
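A minimal sketch of the warm-worker preference, assuming the scheduler tracks which cube each worker last prepared. The function name pick_worker, the worker IDs, the cube IDs, and the "first idle worker" fallback are illustrative assumptions, not the PR's scheduler.

```python
def pick_worker(idle_workers, warm_cube_by_worker, cube_id):
    """Return an idle worker ID, preferring one already prepared for cube_id.

    Returns None when no worker is idle.
    """
    if not idle_workers:
        return None
    for worker_id in idle_workers:
        if warm_cube_by_worker.get(worker_id) == cube_id:
            return worker_id  # warm hit: no cube setup needed
    # No warm worker is idle, so fall back to any idle worker,
    # which will prepare the cube lazily on first use.
    return idle_workers[0]


idle = ["w0", "w1", "w2"]
warm = {"w0": "math-test", "w2": "math-train"}  # hypothetical state

chosen_warm = pick_worker(idle, warm, "math-train")
chosen_cold = pick_worker(idle, warm, "other-cube")
chosen_none = pick_worker([], warm, "math-train")
```

The fallback keeps the pool fully usable: affinity only reduces setup churn and never blocks a rollout waiting for a specific worker.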

Rollout construction remains inside Ray workers for now. Workers run the cube-harness Episode, inspect the resulting trajectory, assign rewards, and emit TrainingText samples. The new metadata path makes those samples auditable without moving trajectory construction into the central actor loop.
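To make the auditing idea concrete, here is a hedged sketch of a worker turning one finished trajectory into per-call samples that carry attribution metadata. Sample is a stand-in for TrainingText, the metadata key names are guesses based on the fields listed in the Summary, and giving every step the same trajectory-level reward is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Stand-in for a training sample; the real TrainingText has other fields."""

    text: str
    reward: float
    metadata: dict = field(default_factory=dict)


def build_samples(cube_id, task_id, trajectory_id, llm_calls, reward):
    """Emit one sample per LLM call, each tagged with where it came from."""
    samples = []
    for step_index, call in enumerate(llm_calls):
        samples.append(
            Sample(
                text=call["text"],
                reward=reward,  # assumption: one reward shared by all steps
                metadata={
                    "cube_id": cube_id,
                    "task_id": task_id,
                    "trajectory_id": trajectory_id,
                    "step_index": step_index,
                    "llm_call_id": call["id"],
                },
            )
        )
    return samples


calls = [{"id": "call-a", "text": "step one"}, {"id": "call-b", "text": "step two"}]
samples = build_samples("math-train", "0001", "traj-7", calls, reward=1.0)
```

With this kind of tagging, any sample in the actor stream can be traced back to its cube, task, trajectory, and step without the central actor loop rebuilding the trajectory.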

Multi-node / Ray Cluster Support

Supports connecting the cube actor path to an existing Ray cluster through actor.ray_address, with ray.init(address="auto") fallback before starting a local Ray runtime.
Computes and logs required Ray CPU capacity from actor.cube_workers * actor.cube_workers_num_cpus.
Uses generic Ray cube workers that can be placed across cluster nodes and assigned to any configured cube.
Defers heavy cube setup to first use inside each worker, avoiding eager benchmark/container setup for every cube on every worker.
Adds warm-worker affinity so scheduling prefers idle workers already prepared for the requested cube.
Keeps vLLM routing centralized in VLLMRouterActor, so distributed Ray workers share one per-generation admission-control state.
Decouples Ray worker placement from vLLM routing: workers come from a global pool, while each LLM call gets its own route lease.
Adds Ray worker log collection and router stats to make distributed rollout behavior easier to debug.
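The CPU capacity check mentioned in the list above is a simple product. The sketch below uses the worker settings reported in the Testing section (actor.cube_workers=384, actor.cube_workers_num_cpus=0.25); the function names and the available-CPU figure are hypothetical, and the PR's actual resource check may differ.

```python
def required_ray_cpus(cube_workers: int, cube_workers_num_cpus: float) -> float:
    """CPU capacity the worker pool asks Ray for."""
    return cube_workers * cube_workers_num_cpus


def has_capacity(available_cpus: float, cube_workers: int, cube_workers_num_cpus: float) -> bool:
    """True when the cluster can host every cube worker at once."""
    return available_cpus >= required_ray_cpus(cube_workers, cube_workers_num_cpus)


needed = required_ray_cpus(384, 0.25)   # 384 workers at a quarter CPU each
fits = has_capacity(128.0, 384, 0.25)   # 128.0 is a made-up cluster size
too_small = has_capacity(64.0, 384, 0.25)
```

Fractional CPU requests are what allow several hundred mostly idle rollout workers to share a modest number of cores.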

Optional rollout artifact writing remains opt-in and file-based. In multi-node clusters it should only be enabled with a shared filesystem path or a cluster-aware storage backend; normal actor-stream training data does not depend on those artifacts.

Testing

I trained Qwen3-4B-Instruct-2507 on TIR using both the default PipelineRL implementation and the Cube-based version of TIR, on a single node.

Shared hyperparameters

actor.buffer_tokens=2000 \
force_restart=true \
fp32_lm_head=true \
finetune.learning_rate=1e-6 \
finetune.attempts=8 \
finetune.rl.policy_loss=gspo \
finetune.rl.epsilon_low=3e-3 \
finetune.rl.epsilon_high=4e-3 \
+finetune.rl.filter_zero_advantage_groups=true \
finetune.max_train_steps=100 \
finetune.seq_length=32000 \
finetune.seq_parallel=4 \
finetune.gradient_accumulation_passes=128 \
vllm_config.vllm_kwargs.max_model_len=32000 \
llm.parameters.max_tokens=16000 \
llm.parameters.temperature=0.7 \
llm.parameters.max_completion_tokens=16000 \
+llm.parameters.max_model_len=32000 \
test_llm.parameters.max_tokens=16000 \
test_llm.parameters.temperature=0.7 \
test_llm.parameters.max_completion_tokens=16000 \
+test_llm.parameters.max_model_len=32000 \
world.actor_fraction=4 \
world.preprocessor_fraction=0 \
world.finetune_fraction=4 \
streams=files \
eval_every_n_versions=0 \
vllm_config.vllm_kwargs.served_model_name=Qwen3-4B-Instruct-2507

Cube-RL-specific hyperparameters

actor.cube_workers_num_cpus=0.25 \
actor.cube_eval_workers_fraction=0.35 \
actor.cube_workers=384

Cube-RL Ray cluster configs

actor.launch_ray_cluster=true \
actor.ray_head_port=${ray_head_port} \
actor.ray_node_manager_port=${ray_node_manager_port} \
actor.ray_object_manager_port=${ray_object_manager_port} \
actor.ray_worker_port_start=${ray_worker_port_start} \
actor.ray_worker_port_count=${ray_worker_port_count} \
actor.ray_extra_cpus_per_node=${ray_extra_cpus_per_node}

Results

With actor.llm_max_rollouts=32

Effect of scaling up llm_max_rollouts from 32 to 64

Eval (Comparison between running with and without evaluation for both TIR and CUBE)

Multi-node run (2 nodes)

Setup and Cube-compatible repositories

This setup has been tested with the uv package manager and Python 3.12.13.

Required repositories:

Cube Harness: https://github.com/The-AI-Alliance/cube-harness/tree/rl/pipelinerl-cube
Cube Standard: https://github.com/Masseeh/cube-standard/tree/piplinerl-cube

Clone cube-harness and cube-standard alongside the PipelineRL repository before running the project. For example:

parent-directory/
├── PipelineRL/
├── cube-harness/
└── cube-standard/

Masseeh · 2026-05-12T20:04:48Z

cube-harness changes

The rl/pipelinerl-cube branch of cube-harness adds the lightweight LLM metadata plumbing needed by the PipelineRL cube rollout path.

Key changes:

Adds metadata to LLMResponse.
Adds metadata to LLMRouteLease.
Adds metadata to LLMCall.
Propagates routed LLM metadata from RoutedLLM into the returned LLMResponse.
Copies LLMResponse.metadata into agent-created LLMCalls for:
- TirAgent
- ReactAgent
- Genny
- GenericAgent

Design Notes

The PipelineRL cube actor path needs to trace each generated training sample back to the vLLM route and LLM call that produced it.

This PR keeps that plumbing minimal:

RayVLLMRouter / RoutedLLM
  -> LLMResponse.metadata
  -> LLMCall.metadata
  -> PipelineRL TrainingText.metadata

The router still owns infrastructure metadata such as route ID, vLLM server ID, lease wait time, and LLM latency. cube-harness only preserves that metadata on the response/call objects so downstream rollout builders can attach it to training samples.

No agent behavior changes are intended. The metadata field is optional and defaults to an empty dict, so existing non-routed cube-harness usage continues to work.
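The optional, empty-by-default metadata field and its hand-off from response to call can be sketched as follows. ToyLLMResponse, ToyLLMCall, and to_llm_call are simplified stand-ins rather than the cube-harness classes, and the metadata keys shown are assumed names for the route and server IDs mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLLMResponse:
    """Simplified stand-in for an LLM response object."""

    text: str
    metadata: dict = field(default_factory=dict)  # empty unless a router fills it


@dataclass
class ToyLLMCall:
    """Simplified stand-in for the call record an agent creates."""

    response_text: str
    metadata: dict = field(default_factory=dict)


def to_llm_call(response: ToyLLMResponse) -> ToyLLMCall:
    """Copy response metadata onto the call so rollout builders can read it."""
    return ToyLLMCall(response_text=response.text, metadata=dict(response.metadata))


routed = ToyLLMResponse("42", metadata={"route_id": "r-1", "server_id": "vllm-0"})
unrouted = ToyLLMResponse("42")  # non-routed usage: no metadata supplied

routed_call = to_llm_call(routed)
unrouted_call = to_llm_call(unrouted)
```

Using a default factory gives each object its own dict, so non-routed callers need no changes and cannot accidentally share metadata between responses.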

TIR Agent and math-tool-use Cube

TIR Agent

The cube-harness side includes TirAgent support for PipelineRL rollouts. TirAgent builds chat prompts from the task observation/history, calls the configured routed LLM, parses tool calls from the model response, and returns a standard AgentOutput containing both actions and the associated LLMCall.

math-tool-use Cube

The math-tool-use cube provides the Cube-native TIR environment used for math tool-use RL. It wraps math reasoning tasks as cube tasks, exposes a Python execution tool through the task action set, and evaluates final math answers through the cube benchmark/task interface.

In the PipelineRL config, math-tool-use is used through cube_params.cubes, with train and test datasets configured as separate cube entries. Each entry specifies:

the MathToolUseBenchmark,
dataset filtering through dataset_name,
the TirAgentConfig,
the routed LLM config used by PipelineRL/vLLM,
the task/tool setup for sandboxed Python execution.

Masseeh added 24 commits May 1, 2026 21:05

Clean final changes

101c31a

add length penalty

c80ab3a

add difficulty-aware penalty

a577268

new ray

63eb59b

fix timeout

55368e1

match tir and cube function calls

f4b50fc

move cube-rl inside pipelinerl package

0d04b10

refactor based on new llm_router

263ab86

fix agent config

80ccec8

remove top_k

5378cd4

flushing wandb logs

c0e6c45

add chapt template for results viewer

73dd4b3

Makefile

7918110

disentangle number of ray workers from llm_max_rollouts

4692834

multi-replica support

18a3654

update example

7ac8bea

Refactor cube rollouts around lazy workers

2459ed7

Configure cube math as multi-cube workload

0c8194a

fix lag during eval

4aaa84c

implemented max_lag for CUBE in

1ab614f

add ray cluster for multi-node

1d5537d

faster startup process for cube

699d793

improve ray teardown logic

647f03b

Surface cube rollout routing metadata

e502aeb

Masseeh mentioned this pull request May 12, 2026

Multi step cube v2 #141

Closed

Masseeh requested a review from rafapi May 12, 2026 19:52

Masseeh mentioned this pull request May 12, 2026

Multi-Step RL with CUBE #139

Closed

Masseeh added 2 commits May 13, 2026 12:55

Stop tracking debugging tools

ef7a01f

Stop tracking Makefile

03cf3f8

add rollout-level vLLM affinity routing

b0a73d3

Masseeh wants to merge 27 commits into multi-step from multi-step-cube-v2
