fix(fast-llm): stop producers and helpers on training completion#144

Merged: jlamypoirier merged 2 commits into ServiceNow:fast-llm from jlamypoirier:fix/actor-finished-uses-training-done-for-fast-llm on Jun 5, 2026

jlamypoirier (Collaborator) commented May 29, 2026

gpt-5 Codex note:

This is a fixup for PR #142. It extends the Fast-LLM early-stop fix beyond the actor path.

Changes:

  • The preprocessor now waits for the explicit Fast-LLM training_finished event before stopping, instead of relying on the legacy sample-counting heuristic.
  • The launcher now stops remaining supervised helper processes after the finetune process has exited cleanly and training completion has been observed. This covers redis/actor/preprocessor-style helpers, not only inference servers.
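The launcher change above can be sketched roughly as follows. This is a minimal illustration, not the code in pipelinerl/launch.py: the `SupervisedProcess` record, the `stop_helpers_after_training` function, and the `timeout` parameter are all hypothetical names introduced here.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SupervisedProcess:
    """Hypothetical record for one helper the launcher supervises."""
    name: str
    proc: subprocess.Popen


def stop_helpers_after_training(finetune_returncode, training_finished, helpers, timeout=5.0):
    """Stop helpers that are still running, but only after a clean, confirmed completion.

    Returns the names of the helpers that were stopped.
    """
    # Assumption: "clean exit" means return code 0, and completion was observed separately.
    if finetune_returncode != 0 or not training_finished:
        return []

    stopped = []
    for helper in helpers:
        if helper.proc.poll() is not None:
            continue  # already exited on its own
        helper.proc.terminate()  # polite request first
        try:
            helper.proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            helper.proc.kill()  # escalate if the helper ignores the request
            helper.proc.wait()
        stopped.append(helper.name)
    return stopped
```

Requiring both conditions keeps a crashed trainer from looking like a normal completion: in that case the helpers are left alone for whatever failure handling the launcher already has.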

Why:

  • In Fast-LLM mode, samples_processed counts Redis entries read rather than indicating that the optimizer has finished naturally, so the preprocessor could still stop converting actor rollouts into fast_llm_streaming documents too early.
  • After a clean Fast-LLM completion, non-inference helper processes can otherwise keep the launcher alive even though the trainer has finished.
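To make the first point concrete, here is a small sketch contrasting the two stop conditions. The function name, its parameters, and the use of `threading.Event` to stand in for the training_finished signal are assumptions for illustration; the actual preprocessor may observe the event differently.

```python
import threading


def preprocessor_should_stop(use_fast_llm, samples_processed, samples_target, training_finished):
    """Decide whether the preprocessor may stop (hypothetical helper).

    training_finished is a threading.Event standing in for the explicit
    Fast-LLM completion signal.
    """
    if use_fast_llm:
        # The sample count only reflects entries read, so it is ignored here.
        return training_finished.is_set()
    # Legacy heuristic: stop once enough samples have been counted.
    return samples_processed >= samples_target
```

Under Fast-LLM the count can reach the target while the trainer still needs more documents, which is why only the explicit signal is consulted in that branch.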

Verification from the local branch:

  • /Users/joel.lamy-poirier/Projects/Fast-LLM/venv/bin/python -m py_compile pipelinerl/launch.py pipelinerl/preprocess.py tests/test_launch_process_monitoring.py
  • git diff --check
  • Targeted pytest collection is blocked in this local environment because tests/conftest.py imports pipelinerl.vllm1, which requires uvloop.
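For reference, the byte-compile check in the first item can also be run from Python through the standard-library `py_compile` module. This is only a sketch of an equivalent check; `find_syntax_errors` is a hypothetical helper name, not part of the repository.

```python
import py_compile


def find_syntax_errors(paths):
    """Byte-compile each file and return (path, message) for every failure."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns compile problems into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((path, exc.msg))
    return failures
```

Like `python -m py_compile`, this only catches syntax-level problems. It does not import the modules, so a missing dependency such as uvloop would not show up here.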

jlamypoirier changed the base branch from fast-llm to fix/actor-finished-uses-training-done-for-fast-llm May 29, 2026 16:56
jlamypoirier changed the title from "fix(preprocess): use training_done under Fast-LLM" to "fix(fast-llm): stop producers and helpers on training completion" May 29, 2026
jlamypoirier deleted the branch ServiceNow:fast-llm June 5, 2026 19:07
jlamypoirier reopened this Jun 5, 2026
jlamypoirier changed the base branch from fix/actor-finished-uses-training-done-for-fast-llm to fast-llm June 5, 2026 19:09
jlamypoirier merged commit add17c7 into ServiceNow:fast-llm Jun 5, 2026
jlamypoirier deleted the fix/actor-finished-uses-training-done-for-fast-llm branch June 5, 2026 19:09