
fix(cloud): persist user LoRAs to /data volume across worker resets #926

Open
livepeer-tessa wants to merge 2 commits into main from fix/923-lora-persistence-across-worker-resets

Conversation

@livepeer-tessa (Contributor)

Problem

User-installed LoRAs were written to /tmp/.daydream-scope/assets/lora/ which is ephemeral on fal.ai workers. When a worker was reset between jobs, /tmp/ was wiped, causing pipeline load failures for longlive and krea-realtime-video:

scope.server.pipeline_manager - ERROR - Failed to load pipeline longlive: 
LongLivePipeline.__init__: LoRA loading failed. File not found: 
/tmp/.daydream-scope/assets/lora/SUPERSUISH_LoRA_V1_000000750.safetensors.
Ensure the file exists in the models/lora/ directory.

6+ occurrences observed 2026-04-12 15:17–15:22 UTC across sessions aa6d9669 and others. The LoRA loaded fine on the first job, then failed on every subsequent worker reset.

Closes #923.

Fix

Introduce USER_LORA_DIR = /data/models/user-loras (persistent /data volume) and point DAYDREAM_SCOPE_LORA_DIR at it instead of /tmp. This is kept separate from DAYDREAM_SCOPE_LORA_SHARED_DIR (/data/models/lora) which holds pre-bundled sample LoRAs.

Session isolation is preserved: cleanup_session_data() and the /internal/cleanup-session endpoint both wipe USER_LORA_DIR at session end, preventing LoRA files from one user leaking to the next user on the same worker.
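A minimal sketch of this approach, assuming illustrative function names (the PR's actual helpers in `fal_app.py` may be structured differently):

```python
import os
import shutil
from pathlib import Path

def setup_lora_env(data_root: str = "/data") -> Path:
    """Point DAYDREAM_SCOPE_LORA_DIR at a directory on the persistent volume."""
    user_lora_dir = Path(data_root) / "models" / "user-loras"
    user_lora_dir.mkdir(parents=True, exist_ok=True)
    os.environ["DAYDREAM_SCOPE_LORA_DIR"] = str(user_lora_dir)
    # Pre-bundled sample LoRAs live in a separate shared dir that is never wiped.
    os.environ["DAYDREAM_SCOPE_LORA_SHARED_DIR"] = str(Path(data_root) / "models" / "lora")
    return user_lora_dir

def cleanup_session_data(user_lora_dir: Path) -> None:
    """Delete user-installed LoRAs at session end so they cannot leak to the
    next user scheduled on the same worker."""
    if user_lora_dir.exists():
        shutil.rmtree(user_lora_dir)
    user_lora_dir.mkdir(parents=True, exist_ok=True)  # leave an empty dir behind
```

Because `/data` survives worker resets while `/tmp` does not, a LoRA installed during one job stays loadable after the worker is recycled, and the explicit session-end wipe keeps per-user isolation.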

Changes

| File | What changed |
| --- | --- |
| `fal_app.py` | Add `USER_LORA_DIR`, update `DAYDREAM_SCOPE_LORA_DIR` env, update `cleanup_session_data()` to also wipe user LoRAs |
| `livepeer_fal_app.py` | Add `USER_LORA_DIR`, update runner env |
| `livepeer_app.py` | Read `USER_LORA_DIR` from env, refactor cleanup helpers, clean user LoRAs on session end |
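On the runner side, `livepeer_app.py` is described as reading the directory from the environment; a hypothetical reader helper (name and fallback path are illustrative) might look like:

```python
import os
from pathlib import Path

def get_user_lora_dir(default: str = "/data/models/user-loras") -> Path:
    """Read USER_LORA_DIR from the environment, falling back to the
    persistent-volume default."""
    return Path(os.environ.get("USER_LORA_DIR", default))
```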

Testing

  • All 46 existing test_workflow_resolve.py tests pass ✅
  • No changes to LoRA install/download logic — only where the directory lives

Tessa (livepeer-tessa) added 2 commits April 12, 2026 18:25
Add dimension validation in parse_lora_weights() so a LoRA trained for a
different model size (e.g. Wan2.1-5B, in_features=5120) is rejected with a
user-friendly ValueError when loaded into the 1.3B model (in_features=1536),
rather than loading silently and crashing 150+ times at inference.

Before: mat1/mat2 shape mismatch RuntimeError deep in peft/torch at inference
After:  ValueError at load time naming the layer, expected vs actual dims, and
        a plain-language hint about model architecture mismatch

Also adds test_lora_dimension_validation.py covering:
- compatible LoRA loads without error
- 5B LoRA on 1.3B model raises ValueError
- error message is user-friendly (names layer + dimensions)
- out_features mismatch is also caught
- 5B LoRA on 5B model is fine

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
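The dimension check this commit describes could be sketched as a standalone helper (hypothetical; the real check is integrated into `parse_lora_weights()`). It takes peft-style weight names and plain shape tuples:

```python
def validate_lora_dims(lora_shapes: dict, model_dims: dict) -> None:
    """Raise a user-friendly ValueError when LoRA weight shapes do not match
    the target model's layer dimensions.

    lora_shapes: weight name -> shape, e.g. "blocks.0.attn.lora_A.weight" -> (16, 1536)
    model_dims:  layer name  -> (in_features, out_features) of the target model
    """
    for name, shape in lora_shapes.items():
        if name.endswith(".lora_A.weight"):  # lora_A is (rank, in_features)
            layer = name[: -len(".lora_A.weight")]
            expected = model_dims.get(layer)
            if expected and shape[1] != expected[0]:
                raise ValueError(
                    f"LoRA weight for layer '{layer}' has in_features={shape[1]}, "
                    f"but this model expects in_features={expected[0]}. The LoRA "
                    "was likely trained for a different model architecture."
                )
        elif name.endswith(".lora_B.weight"):  # lora_B is (out_features, rank)
            layer = name[: -len(".lora_B.weight")]
            expected = model_dims.get(layer)
            if expected and shape[0] != expected[1]:
                raise ValueError(
                    f"LoRA weight for layer '{layer}' has out_features={shape[0]}, "
                    f"but this model expects out_features={expected[1]}. The LoRA "
                    "was likely trained for a different model architecture."
                )
```

With a check like this, a 5B-trained LoRA (in_features=5120) loaded against 1.3B dimensions (in_features=1536) fails fast at load time with the offending layer named, instead of surfacing as a shape-mismatch RuntimeError deep in peft/torch at inference.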
…923)

User-installed LoRAs were written to /tmp/.daydream-scope/assets/lora/
which is ephemeral on fal.ai workers.  When a worker was reset between
jobs the /tmp directory was wiped, causing pipeline load failures for
longlive and krea-realtime-video with errors like:

  LoRA loading failed. File not found:
  /tmp/.daydream-scope/assets/lora/SUPERSUISH_LoRA_V1_000000750.safetensors

Fix: introduce USER_LORA_DIR = /data/models/user-loras (persistent volume)
and point DAYDREAM_SCOPE_LORA_DIR at it instead of /tmp.  This is kept
separate from DAYDREAM_SCOPE_LORA_SHARED_DIR (/data/models/lora) which
holds pre-bundled sample LoRAs that must not be cleaned up between sessions.

Session cleanup (cleanup_session_data / /internal/cleanup-session) is
updated to wipe USER_LORA_DIR at session end so that one user's LoRAs
cannot leak to the next user on the same worker.

Affected files:
- src/scope/cloud/fal_app.py        — add USER_LORA_DIR, update env setup,
                                      update cleanup_session_data()
- src/scope/cloud/livepeer_fal_app.py — add USER_LORA_DIR, update runner env
- src/scope/cloud/livepeer_app.py   — read USER_LORA_DIR from env, refactor
                                      cleanup helpers, clean loras on session end

Closes #923

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>

coderabbitai Bot commented Apr 12, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2239c0bd-4f86-4038-9b5a-ca53b28a2b66


@github-actions (Contributor)

🚀 fal.ai Preview Deployment

| | |
| --- | --- |
| App ID | `daydream/scope-pr-926--preview` |
| WebSocket | `wss://fal.run/daydream/scope-pr-926--preview/ws` |
| Commit | `fa1a655` |

Livepeer Runner

| | |
| --- | --- |
| App ID | `daydream/scope-livepeer-pr-926--preview` |
| WebSocket | `wss://fal.run/daydream/scope-livepeer-pr-926--preview/ws` |
| Auth | private |

Testing Livepeer Mode

```shell
SCOPE_CLOUD_MODE=livepeer SCOPE_CLOUD_APP_ID="daydream/scope-livepeer-pr-926--preview/ws" uv run daydream-scope
```



Development

Successfully merging this pull request may close these issues.

[fal.ai] longlive/krea-realtime-video: LoRA lost after worker reset — /tmp/.daydream-scope/assets/lora/ cleared between jobs (SUPERSUISH_LoRA_V1)
