fix: stop active pipeline workers before unload to free VRAM#966

Open
leszko wants to merge 1 commit into main from rafal/fix-pipeline-unload

Conversation

Collaborator

@leszko leszko commented Apr 20, 2026

When /pipeline/load triggered a swap while a PipelineProcessor worker was still producing frames, the unload path dropped pipeline_manager's reference and called gc.collect()/empty_cache(). The worker thread, however, kept the pipeline object alive through its closure and continued allocating CUDA memory, so the next load (e.g. longlive after ltx2) OOMed with ~30 GiB still in use despite the "CUDA cache cleared" log line.

Add a pre-unload hook registry on PipelineManager. graph_executor registers each processor's stop() under its node_id at creation time, and FrameProcessor.stop() unregisters on normal teardown. The hook fires synchronously inside _unload_pipeline_by_id_unsafe BEFORE the pipeline reference is dropped, so the worker exits and releases its tensors first — then gc/empty_cache can actually reclaim VRAM.

Verified: loading ltx2, running a session, then POSTing /pipeline/load {longlive, passthrough} without stopping the session first now succeeds. The log sequence is Unloading → PipelineProcessor stopped → CUDA cache cleared, and the next session starts cleanly.
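
The leak described above can be reproduced in miniature (hypothetical stand-in code, not the project's actual classes): dropping the manager's reference does nothing while a worker thread still holds the object through its closure or arguments.

```python
import gc
import threading
import weakref

class Pipeline:
    """Stand-in for a loaded pipeline holding CUDA tensors."""

done = threading.Event()

def worker(p: Pipeline) -> None:
    # The worker keeps `p` alive for as long as it runs.
    done.wait()

pipeline = Pipeline()
tracker = weakref.ref(pipeline)

t = threading.Thread(target=worker, args=(pipeline,))
t.start()

pipeline = None      # the old unload path: drop the reference...
gc.collect()         # ...and collect — but the thread still holds it
assert tracker() is not None

done.set()           # the fix: stop the worker first
t.join()             # thread exits, CPython clears its target/args
gc.collect()
assert tracker() is None
```

This is why the unload log said "CUDA cache cleared" while VRAM stayed allocated: empty_cache() can only return memory whose tensors have actually been freed.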

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Rafal Leszko <rafal@livepeer.org>
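
The hook registry described above can be sketched as follows. PipelineManager, node_id, and _unload_pipeline_by_id_unsafe are named in the PR text; the method signatures and internals here are assumptions for illustration, not the actual implementation.

```python
import threading
from typing import Callable, Dict

class PipelineManager:
    def __init__(self) -> None:
        self._pipelines: Dict[str, object] = {}
        self._pre_unload_hooks: Dict[str, Callable[[], None]] = {}
        self._lock = threading.Lock()

    def register_pre_unload_hook(self, node_id: str, hook: Callable[[], None]) -> None:
        # graph_executor would call this when it creates a processor,
        # registering the processor's stop() under its node_id.
        with self._lock:
            self._pre_unload_hooks[node_id] = hook

    def unregister_pre_unload_hook(self, node_id: str) -> None:
        # FrameProcessor.stop() would call this on normal teardown.
        with self._lock:
            self._pre_unload_hooks.pop(node_id, None)

    def _unload_pipeline_by_id_unsafe(self, pipeline_id: str) -> None:
        # Fire hooks synchronously BEFORE dropping the reference, so each
        # worker exits and releases its tensors first.
        for hook in list(self._pre_unload_hooks.values()):
            hook()
        self._pipelines.pop(pipeline_id, None)
        # Only now would gc.collect() / torch.cuda.empty_cache()
        # actually reclaim the VRAM.
```

Firing the hooks synchronously matters: the swap must not proceed to the next load until every worker has stopped and released its references.
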
@coderabbitai

coderabbitai Bot commented Apr 20, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d9188ebf-b936-4a84-a880-89ebdfecc792

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@leszko leszko requested a review from gioelecerati April 20, 2026 09:43
@leszko leszko marked this pull request as ready for review April 20, 2026 09:43
@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-966--preview
WebSocket wss://fal.run/daydream/scope-pr-966--preview/ws
Commit 15fb53f

Livepeer Runner

App ID daydream/scope-livepeer-pr-966--preview
WebSocket wss://fal.run/daydream/scope-livepeer-pr-966--preview/ws
Auth private

Testing Livepeer Mode

SCOPE_CLOUD_MODE=livepeer SCOPE_CLOUD_APP_ID="daydream/scope-livepeer-pr-966--preview/ws" uv run daydream-scope
