Multi step cube v2 #141

Closed
Masseeh wants to merge 10 commits into ServiceNow:multi-step from Masseeh:multi-step-cube-v2

Masseeh commented May 6, 2026

Collaborator

Summary

This PR adds the cube-specific PipelineRL actor path under pipelinerl/cube_rl and wires it into the Ray actor entrypoint flow.

Key changes:

  • Adds a cube Ray actor loop with synchronous CubeBenchmarkActor rollout workers.
  • Adds a shared VLLMRouterActor plus RayVLLMRouter adapter for per-generation vLLM routing and admission control.
  • Uses a global cube benchmark actor pool instead of pinning actors to individual vLLM URLs.
  • Treats actor.llm_max_rollouts as per-vLLM generation capacity.
  • Configures cube-harness agent LLM settings once during cube actor setup, with train/test LLM parameter profiles inherited from PipelineRL config.
  • Simplifies cube actor rollout calls to rollout(task_id), leaving vLLM selection to the router.
  • Adds cube configs, launch entrypoint support, Ray worker log collection, and resource checks.
  • Adds supporting rollout/tool-call plumbing, TIR domain changes, and a lightweight results viewer.
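
As a rough illustration of the global-pool idea in the list above, the sketch below dispatches `rollout(task_id)` calls to a shared worker pool and caps the number of rollouts in flight. It is not the code in this PR: it uses a plain thread pool instead of Ray actors, and `rollout`, `run_rollouts`, `pool_size`, and `max_in_flight` are hypothetical names chosen for the example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def rollout(task_id):
    # Hypothetical placeholder for a benchmark actor's rollout(task_id).
    return {"task_id": task_id, "reward": float(task_id % 2)}


def run_rollouts(task_ids, pool_size, max_in_flight):
    """Run every task on a shared pool, never exceeding max_in_flight at once."""
    results = []
    pending = set()
    tasks = iter(task_ids)
    exhausted = False
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while pending or not exhausted:
            # Top up work until the in-flight limit is reached (backpressure).
            while not exhausted and len(pending) < max_in_flight:
                task_id = next(tasks, None)
                if task_id is None:
                    exhausted = True
                else:
                    pending.add(pool.submit(rollout, task_id))
            if pending:
                # Block until at least one rollout finishes, then refill.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(future.result() for future in done)
    return results


results = run_rollouts(range(10), pool_size=4, max_in_flight=4)
```

Because no worker is tied to a particular vLLM URL, any idle worker can take the next task; which server serves each generation is decided separately.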

Design Notes

The cube actor loop now handles rollout-level scheduling and backpressure over benchmark actors, while VLLMRouterActor owns vLLM-level routing. Each cube-harness LLM generation acquires a route lease from the router and releases it after completion or error.

This keeps CubeActorLoop focused on task scheduling, retries, metrics, and eval boundaries, while keeping admission control close to the actual generation call.
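
The lease pattern described above can be sketched roughly as follows. This is an assumption-laden illustration, not the `VLLMRouterActor` from this PR: it runs in a single process with threads rather than Ray, picks the least-loaded server as one possible routing policy, and `LeaseRouter`, `lease`, and the URLs are hypothetical.

```python
import threading
from contextlib import contextmanager


class LeaseRouter:
    """Hypothetical in-process stand-in for a per-vLLM admission controller."""

    def __init__(self, urls, max_per_server):
        self._cond = threading.Condition()
        self._max = max_per_server
        self._in_flight = {url: 0 for url in urls}

    def acquire(self):
        # Block until some server is below capacity, then take the least loaded.
        with self._cond:
            while True:
                url = min(self._in_flight, key=self._in_flight.get)
                if self._in_flight[url] < self._max:
                    self._in_flight[url] += 1
                    return url
                self._cond.wait()

    def release(self, url):
        with self._cond:
            self._in_flight[url] -= 1
            self._cond.notify()

    @contextmanager
    def lease(self):
        url = self.acquire()
        try:
            yield url
        finally:
            # Runs on completion and on error, so a slot is never leaked.
            self.release(url)

    def in_flight(self):
        with self._cond:
            return dict(self._in_flight)


router = LeaseRouter(["http://vllm-0", "http://vllm-1"], max_per_server=2)

with router.lease() as url:
    held = router.in_flight()  # one slot is taken on the chosen server

try:
    with router.lease() as url:
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

after = router.in_flight()  # every slot is free again, even after the error
```

Wrapping each generation in a lease, rather than each rollout, means a multi-step episode only occupies a server slot while it is actually generating.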

Testing

I trained Qwen3-4B-Instruct-2507 on TIR on a single node, using both the default PipelineRL implementation and the Cube-based version of TIR.

Shared hyperparameters

actor.buffer_tokens=200 \
force_restart=true \
fp32_lm_head=true \
finetune.learning_rate=1e-6 \
finetune.attempts=8 \
finetune.rl.policy_loss=gspo \
finetune.rl.epsilon_low=3e-3 \
finetune.rl.epsilon_high=4e-3 \
+finetune.rl.filter_zero_advantage_groups=true \
finetune.max_train_steps=100 \
finetune.seq_length=32000 \
finetune.seq_parallel=4 \
finetune.gradient_accumulation_passes=128 \
vllm_config.vllm_kwargs.max_model_len=32000 \
llm.parameters.max_tokens=16000 \
llm.parameters.temperature=0.7 \
llm.parameters.max_completion_tokens=16000 \
+llm.parameters.max_model_len=32000 \
test_llm.parameters.max_tokens=16000 \
test_llm.parameters.temperature=0.7 \
test_llm.parameters.max_completion_tokens=16000 \
+test_llm.parameters.max_model_len=32000 \
world.actor_fraction=4 \
world.preprocessor_fraction=0 \
world.finetune_fraction=4 \
streams=files \
eval_every_n_versions=0 \
vllm_config.vllm_kwargs.served_model_name=Qwen3-4B-Instruct-2507

Cube-RL-specific hyperparameters

actor.cube_actor_num_cpus=0.5
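
The token-budget overrides in the shared hyperparameters have to agree with each other, so a quick consistency check can be useful before launching. The sketch below parses a subset of those overrides and checks three relationships. The relationships themselves are assumptions made for this example (completion budget within the client context length, client context length within the server's `max_model_len`, and training `seq_length` at least the server's `max_model_len`), not rules stated by PipelineRL, and `parse_overrides` is a hypothetical helper.

```python
# Subset of the overrides listed above, copied as strings.
overrides = [
    "finetune.seq_length=32000",
    "vllm_config.vllm_kwargs.max_model_len=32000",
    "llm.parameters.max_tokens=16000",
    "+llm.parameters.max_model_len=32000",
    "test_llm.parameters.max_tokens=16000",
    "+test_llm.parameters.max_model_len=32000",
]


def parse_overrides(items):
    """Turn 'key=value' strings (optionally prefixed with '+') into a dict of ints."""
    parsed = {}
    for item in items:
        key, value = item.lstrip("+").split("=", 1)
        parsed[key] = int(value)
    return parsed


cfg = parse_overrides(overrides)
server_limit = cfg["vllm_config.vllm_kwargs.max_model_len"]

problems = []
for prefix in ("llm", "test_llm"):
    max_tokens = cfg[f"{prefix}.parameters.max_tokens"]
    client_limit = cfg[f"{prefix}.parameters.max_model_len"]
    # Assumed rule: the completion budget must fit in the context window.
    if max_tokens > client_limit:
        problems.append(f"{prefix}: max_tokens exceeds max_model_len")
    # Assumed rule: the client must not expect more context than the server allows.
    if client_limit > server_limit:
        problems.append(f"{prefix}: max_model_len exceeds the vLLM limit")

# Assumed rule: the trainer must be able to hold the longest possible sequence.
if cfg["finetune.seq_length"] < server_limit:
    problems.append("finetune.seq_length is shorter than the vLLM limit")

print(problems or "token budgets are consistent")
```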

Results

With actor.llm_max_rollouts = 32:

[4 images]

Effect of scaling up llm_max_rollouts from 32 to 64

[image]

Setup and Cube-compatible repositories

This setup has been tested with the uv package manager and Python 3.12.13.

Required repositories:

Clone cube-harness and cube-standard alongside the PipelineRL repository before running the project. For example:

parent-directory/
├── PipelineRL/
├── cube-harness/
└── cube-standard/
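
To confirm the layout above before running anything, a small check like the following may help. It is only a sketch: `missing_repos` is a hypothetical helper, and the demo builds a throwaway directory instead of touching a real checkout.

```python
import tempfile
from pathlib import Path

# Directory names taken from the layout above.
REQUIRED = ("PipelineRL", "cube-harness", "cube-standard")


def missing_repos(parent):
    """Return the required repository directories that are absent under parent."""
    parent = Path(parent)
    return [name for name in REQUIRED if not (parent / name).is_dir()]


# Demo on a temporary directory in which cube-standard has not been cloned yet.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PipelineRL").mkdir()
    (Path(tmp) / "cube-harness").mkdir()
    demo_missing = missing_repos(tmp)

print(demo_missing)  # ['cube-standard']
```

In practice one would call `missing_repos("..")` from inside the PipelineRL checkout and clone whatever it reports.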

@rafapi @ehsk could you take a look at this PR?

@Masseeh Masseeh marked this pull request as ready for review May 6, 2026 21:18
@Masseeh Masseeh marked this pull request as draft May 6, 2026 21:18

rafapi commented May 7, 2026

Collaborator

What is the result viewer for?

Masseeh commented May 7, 2026

Collaborator Author

What is the result viewer for?

It’s a web interface for viewing the contents of rollouts produced by actor streams. For example, the image below shows the first 10 rollouts. By clicking on a rollout, you can inspect the contents of each group as well as the individual steps within an episode. Launch it with the following command:

uv run python -m results_viewer --results [experiment_folder] --host 127.0.0.1 --port 8765
[3 images]

rafapi commented May 7, 2026

Collaborator

This is such a weird prompt:
[image]

Masseeh commented May 7, 2026

Collaborator Author

This is such a weird prompt:

Yes! It's part of the open_reasoner_zero_extended_72k dataset.

{
   "dataset":"open_reasoner_zero_extended_72k",
   "task":"9. Let $A=\\{1,2,3, \\cdots, 2 n, 2 n+1\\}, B \\subseteq A$, and for any three distinct elements $x, y, z$ in $B$, we have $x+y \\neq z$. Find the maximum value of $|B|$.\n\nTranslate the above text into English, please retain the original text's line breaks and format, and output the translation result directly.",
   "answer":"\\boxed{n+1}"
}
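
For anyone who wants to flag records like this one, a minimal sketch is shown below. It assumes the stray text always starts with the exact phrase seen in this record, which may not hold across the dataset; `split_stray_instruction` is a hypothetical helper, and the `task` string is an abbreviated excerpt of the record above rather than the full prompt.

```python
MARKER = "Translate the above text into English"


def split_stray_instruction(task):
    """Split a task into (problem, stray); stray is None when no marker is found."""
    index = task.find(MARKER)
    if index == -1:
        return task, None
    return task[:index].rstrip(), task[index:]


# Abbreviated excerpt of the record above (the set definition is omitted).
task = (
    "Find the maximum value of $|B|$.\n\n"
    "Translate the above text into English, please retain the original text's "
    "line breaks and format, and output the translation result directly."
)

problem, stray = split_stray_instruction(task)
print(problem)            # Find the maximum value of $|B|$.
print(stray is not None)  # True
```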

Masseeh commented May 12, 2026

Collaborator Author

Closing this in favor of PR #143

@Masseeh Masseeh closed this May 12, 2026