Multi-step RL with CUBE #143

Draft
Masseeh wants to merge 27 commits into multi-step from multi-step-cube-v2

Masseeh commented May 12, 2026

Summary

This PR adds and extends the cube-specific PipelineRL actor path under pipelinerl/cube_rl, wiring cube-harness rollouts into the Ray actor entrypoint flow with per-generation vLLM routing, elastic multi-cube scheduling, richer rollout metadata, and results-viewer support.

Key changes:

  • Adds a cube Ray actor loop with synchronous cube-harness rollout workers.
  • Adds a shared VLLMRouterActor plus RayVLLMRouter adapter for per-generation vLLM routing and admission control.
  • Uses a global cube worker pool instead of pinning workers to individual vLLM URLs.
  • Treats actor.llm_max_rollouts as per-vLLM generation capacity.
  • Refactors cube execution around cube_params.cubes, allowing multiple train/test cubes in one run.
  • Adds cube-qualified task refs via CubeTaskRef(cube_id, task_id).
  • Moves task discovery into a cube registry owned by the PipelineRL actor process.
  • Makes Ray cube workers generic and lazy: workers load shared state at startup and prepare a cube only when assigned a rollout for that cube.
  • Adds simple warm-worker affinity to reduce cube setup churn.
  • Configures cube-harness agent LLM settings per cube with train/test LLM parameter profiles inherited from PipelineRL config.
  • Keeps vLLM selection inside the routed LLM call path rather than rollout scheduling.
  • Adds Ray worker log collection, resource checks, cube configs, and launch entrypoint support.
  • Adds route/LLM metadata plumbing from cube-harness LLM calls into PipelineRL TrainingText.metadata.
  • Adds rollout attribution metadata: cube/task/trajectory IDs, step indices, LLM call IDs, termination reason, finish reason, reward context, and token counts.
  • Adds basic router health suppression and router snapshot metrics.
  • Fixes reward adjustment ordering so length-penalty adjustments are applied before writing actor-stream samples.
  • Extends the lightweight results viewer to show route/server IDs, LLM latency, vLLM lease wait time, routed call counts, and router stats.
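
The cube-qualified task refs in the list above can be pictured as a small immutable value type. The sketch below is illustrative only: the field names follow the CubeTaskRef(cube_id, task_id) signature given in this description, but the frozen dataclass, the `key` helper, and the example cube IDs are assumptions rather than the code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeTaskRef:
    """Hypothetical sketch: a task ID qualified by the cube that owns it."""

    cube_id: str
    task_id: str

    def key(self) -> str:
        # Assumed helper: a flat string key, e.g. for logs or metric names.
        return f"{self.cube_id}/{self.task_id}"


# The same task_id in two cubes stays distinct once it is cube-qualified.
train_ref = CubeTaskRef("math-train", "task-0")  # example IDs are made up
test_ref = CubeTaskRef("math-test", "task-0")
pending = {train_ref: "queued", test_ref: "queued"}
```

Making the ref hashable lets a scheduler use it directly as a dictionary key when several train/test cubes share task IDs.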

Design Notes

The cube actor loop handles rollout-level scheduling, retries, metrics, eval boundaries, and bounded pending work over a global Ray worker pool. The vLLM router owns generation-level admission control.

Each cube-harness LLM generation acquires a short-lived route lease from VLLMRouterActor and releases it after completion or error. This keeps routing tied to actual model calls rather than whole rollouts, which matters for multi-step agent trajectories.
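
A minimal in-process sketch of the lease pattern described above, assuming a simple least-loaded policy and a per-server in-flight cap. `ToyRouter`, `lease`, and `generate` are hypothetical names used for illustration; the actual VLLMRouterActor is a Ray actor and its interface is not shown here.

```python
import threading
from contextlib import contextmanager


class ToyRouter:
    """Hypothetical in-process stand-in for a per-generation router."""

    def __init__(self, server_urls, max_in_flight):
        self._capacity = max_in_flight  # per-server generation capacity
        self._in_flight = {url: 0 for url in server_urls}
        self._cond = threading.Condition()

    def _least_loaded(self):
        # Return the least-loaded server, or None if every server is full.
        url = min(self._in_flight, key=self._in_flight.get)
        return url if self._in_flight[url] < self._capacity else None

    @contextmanager
    def lease(self, timeout=None):
        # Acquire: wait until some server has a free generation slot.
        with self._cond:
            has_slot = self._cond.wait_for(
                lambda: self._least_loaded() is not None, timeout
            )
            if not has_slot:
                raise TimeoutError("no vLLM capacity available")
            url = self._least_loaded()
            self._in_flight[url] += 1
        try:
            yield url
        finally:
            # Release on completion or error so a slot is never leaked.
            with self._cond:
                self._in_flight[url] -= 1
                self._cond.notify()

    def in_flight(self):
        with self._cond:
            return dict(self._in_flight)


def generate(router, prompt):
    # One lease per model call, not per rollout.
    with router.lease(timeout=1.0) as url:
        return f"{url} <- {prompt}"  # placeholder for the real LLM request
```

Because the lease wraps a single generation, a multi-step trajectory holds no server capacity while the agent is executing tools between model calls.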

The newer multi-cube path treats cubes more like datasets: each configured cube owns its benchmark and agent config, while Ray workers are generic execution slots. Workers lazily install/setup the cube they are assigned and the scheduler prefers workers already warm for the requested cube.
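
The warm-worker preference can be sketched as a two-pass selection over idle workers. This is an assumption-laden illustration, not the scheduler in this PR: `WorkerSlot`, `pick_worker`, `assign`, and the fallback rule (the idle worker with the fewest prepared cubes) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerSlot:
    """Hypothetical scheduler-side view of one generic Ray worker."""

    worker_id: int
    busy: bool = False
    warm_cubes: set[str] = field(default_factory=set)


def pick_worker(workers, cube_id):
    """Return an idle worker, preferring one already prepared for cube_id."""
    idle = [w for w in workers if not w.busy]
    if not idle:
        return None  # caller keeps the rollout pending
    for w in idle:
        if cube_id in w.warm_cubes:
            return w  # warm hit: no cube setup needed
    # Assumed fallback: the idle worker with the fewest prepared cubes.
    return min(idle, key=lambda w: len(w.warm_cubes))


def assign(workers, cube_id):
    """Mark the chosen worker busy and record that it is now warm for cube_id."""
    w = pick_worker(workers, cube_id)
    if w is not None:
        w.busy = True
        w.warm_cubes.add(cube_id)  # lazy setup happens on first assignment
    return w
```

For example, with one idle worker already warm for a cube and one cold idle worker, `assign` picks the warm one first and only falls back to the cold worker once the warm one is busy.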

Rollout construction remains inside Ray workers for now. Workers run the cube-harness Episode, inspect the resulting trajectory, assign rewards, and emit TrainingText samples. The new metadata path makes those samples auditable without moving trajectory construction into the central actor loop.

Multi-node / Ray Cluster Support

  • Supports connecting the cube actor path to an existing Ray cluster through actor.ray_address, falling back to ray.init(address="auto") before starting a local Ray runtime.
  • Computes and logs required Ray CPU capacity from actor.cube_workers * actor.cube_workers_num_cpus.
  • Uses generic Ray cube workers that can be placed across cluster nodes and assigned to any configured cube.
  • Defers heavy cube setup to first use inside each worker, avoiding eager benchmark/container setup for every cube on every worker.
  • Adds warm-worker affinity so scheduling prefers idle workers already prepared for the requested cube.
  • Keeps vLLM routing centralized in VLLMRouterActor, so distributed Ray workers share one per-generation admission-control state.
  • Decouples Ray worker placement from vLLM routing: workers come from a global pool, while each LLM call gets its own route lease.
  • Adds Ray worker log collection and router stats to make distributed rollout behavior easier to debug.
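
As a worked example of the CPU capacity rule in the list above, the Cube-RL settings used in the Testing section (actor.cube_workers=384, actor.cube_workers_num_cpus=0.25) require 384 * 0.25 = 96 Ray CPUs. The helper names below are hypothetical, and the available CPU count is passed in as a plain number; in a live cluster it might come from something like `ray.cluster_resources()["CPU"]`.

```python
import math


def required_ray_cpus(cube_workers: int, cpus_per_worker: float) -> float:
    """CPU demand of the worker pool: worker count times CPUs per worker."""
    return cube_workers * cpus_per_worker


def check_capacity(available_cpus: float, cube_workers: int, cpus_per_worker: float) -> int:
    """Return how many workers would fit; raise if the requested pool does not."""
    needed = required_ray_cpus(cube_workers, cpus_per_worker)
    if needed > available_cpus:
        raise RuntimeError(
            f"need {needed} Ray CPUs for {cube_workers} workers, "
            f"but only {available_cpus} are available"
        )
    return math.floor(available_cpus / cpus_per_worker)


# Values from the Testing section of this description.
needed = required_ray_cpus(384, 0.25)
```

Fractional CPU requests let many mostly idle rollout workers share a core, which suits workers that spend most of their time waiting on LLM calls.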

Optional rollout artifact writing remains opt-in and file-based. In multi-node clusters it should only be enabled with a shared filesystem path or a cluster-aware storage backend; normal actor-stream training data does not depend on those artifacts.

Testing

I trained Qwen3-4B-Instruct-2507 on TIR using both the default PipelineRL implementation and the Cube-based version of TIR, on a single node.

Shared hyperparameters

actor.buffer_tokens=2000 \
force_restart=true \
fp32_lm_head=true \
finetune.learning_rate=1e-6 \
finetune.attempts=8 \
finetune.rl.policy_loss=gspo \
finetune.rl.epsilon_low=3e-3 \
finetune.rl.epsilon_high=4e-3 \
+finetune.rl.filter_zero_advantage_groups=true \
finetune.max_train_steps=100 \
finetune.seq_length=32000 \
finetune.seq_parallel=4 \
finetune.gradient_accumulation_passes=128 \
vllm_config.vllm_kwargs.max_model_len=32000 \
llm.parameters.max_tokens=16000 \
llm.parameters.temperature=0.7 \
llm.parameters.max_completion_tokens=16000 \
+llm.parameters.max_model_len=32000 \
test_llm.parameters.max_tokens=16000 \
test_llm.parameters.temperature=0.7 \
test_llm.parameters.max_completion_tokens=16000 \
+test_llm.parameters.max_model_len=32000 \
world.actor_fraction=4 \
world.preprocessor_fraction=0 \
world.finetune_fraction=4 \
streams=files \
eval_every_n_versions=0 \
vllm_config.vllm_kwargs.served_model_name=Qwen3-4B-Instruct-2507

Cube-RL-specific hyperparameters

actor.cube_workers_num_cpus=0.25 \
actor.cube_eval_workers_fraction=0.35 \
actor.cube_workers=384 

Cube-RL Ray cluster configs

actor.launch_ray_cluster=true \
actor.ray_head_port=${ray_head_port} \
actor.ray_node_manager_port=${ray_node_manager_port} \
actor.ray_object_manager_port=${ray_object_manager_port} \
actor.ray_worker_port_start=${ray_worker_port_start} \
actor.ray_worker_port_count=${ray_worker_port_count} \
actor.ray_extra_cpus_per_node=${ray_extra_cpus_per_node}

Results

With actor.llm_max_rollouts = 32:

[4 images]

Effect of scaling up llm_max_rollouts from 32 to 64

[image]

Eval (Comparison between running with and without evaluation for both TIR and CUBE)

[image]

Multi-node run (2 nodes)

[image]

Setup and Cube-compatible repositories

This setup has been tested with the uv package manager and Python 3.12.13.

Required repositories:

Clone cube-harness and cube-standard alongside the PipelineRL repository before running the project. For example:

parent-directory/
├── PipelineRL/
├── cube-harness/
└── cube-standard/

Masseeh requested a review from rafapi on May 12, 2026 at 19:52

Masseeh commented May 12, 2026

cube-harness changes

rl/pipelinerl-cube adds lightweight LLM metadata plumbing needed by the PipelineRL cube rollout path.

Key changes:

  • Adds metadata to LLMResponse.
  • Adds metadata to LLMRouteLease.
  • Adds metadata to LLMCall.
  • Propagates routed LLM metadata from RoutedLLM into the returned LLMResponse.
  • Copies LLMResponse.metadata into agent-created LLMCalls for:
    • TirAgent
    • ReactAgent
    • Genny
    • GenericAgent

Design Notes

The PipelineRL cube actor path needs to trace each generated training sample back to the vLLM route and LLM call that produced it.

This PR keeps that plumbing minimal:

RayVLLMRouter / RoutedLLM
  -> LLMResponse.metadata
  -> LLMCall.metadata
  -> PipelineRL TrainingText.metadata

The router still owns infrastructure metadata such as route ID, vLLM server ID, lease wait time, and LLM latency. cube-harness only preserves that metadata on the response/call objects so downstream rollout builders can attach it to training samples.

No agent behavior changes are intended. The metadata field is optional and defaults to an empty dict, so existing non-routed cube-harness usage continues to work.

TIR Agent and math-tool-use Cube

TIR Agent

The cube-harness side includes TirAgent support for PipelineRL rollouts. TirAgent builds chat prompts from the task observation/history, calls the configured routed LLM, parses tool calls from the model response, and returns a standard AgentOutput containing both actions and the associated LLMCall.

math-tool-use Cube

The math-tool-use cube provides the Cube-native TIR environment used for math tool-use RL. It wraps math reasoning tasks as cube tasks, exposes a Python execution tool through the task action set, and evaluates final math answers through the cube benchmark/task interface.

In the PipelineRL config, math-tool-use is used through cube_params.cubes, with train and test datasets configured as separate cube entries. Each entry specifies:

  • the MathToolUseBenchmark,
  • dataset filtering through dataset_name,
  • the TirAgentConfig,
  • the routed LLM config used by PipelineRL/vLLM,
  • the task/tool setup for sandboxed Python execution.
