fix(cloud): persist user LoRAs to /data volume across worker resets #926
Open
livepeer-tessa wants to merge 2 commits into main from
Conversation
added 2 commits
April 12, 2026 18:25
Add dimension validation in parse_lora_weights() so a LoRA trained for a
different model size (e.g. Wan2.1-5B, in_features=5120) is rejected with a
user-friendly ValueError when loaded into the 1.3B model (in_features=1536),
rather than loading silently and crashing 150+ times at inference.
Before: mat1/mat2 shape mismatch RuntimeError deep in peft/torch at inference
After: ValueError at load time naming the layer, expected vs actual dims, and
a plain-language hint about model architecture mismatch
Also adds test_lora_dimension_validation.py covering:
- compatible LoRA loads without error
- 5B LoRA on 1.3B model raises ValueError
- error message is user-friendly (names layer + dimensions)
- out_features mismatch is also caught
- 5B LoRA on 5B model is fine
Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
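The check described in this commit could look roughly like the following sketch. `validate_lora_dimensions` and the `{layer: (out_features, in_features)}` shape-dict format are illustrative assumptions, not the actual `parse_lora_weights()` code:

```python
# Hypothetical sketch of the load-time dimension check; validate_lora_dimensions
# and the shape-dict format are assumptions, not the real parse_lora_weights().

def validate_lora_dimensions(lora_shapes, base_shapes):
    """Raise a user-friendly ValueError when a LoRA's dimensions do not
    match the base model, instead of failing deep in peft/torch later."""
    for layer, (out_f, in_f) in lora_shapes.items():
        expected = base_shapes.get(layer)
        if expected is None:
            continue  # layer absent from the base model; nothing to compare
        exp_out, exp_in = expected
        if (out_f, in_f) != (exp_out, exp_in):
            raise ValueError(
                f"LoRA layer '{layer}' has dimensions (out={out_f}, in={in_f}) "
                f"but the loaded model expects (out={exp_out}, in={exp_in}). "
                "This LoRA was likely trained for a different model "
                "architecture/size and cannot be used here."
            )

# A 5B-model LoRA (in_features=5120) rejected by the 1.3B model (in_features=1536):
base_1_3b = {"blocks.0.attn.q": (1536, 1536)}
lora_5b = {"blocks.0.attn.q": (5120, 5120)}
try:
    validate_lora_dimensions(lora_5b, base_1_3b)
except ValueError as exc:
    print(exc)  # names the layer and both dimension pairs
```

The key design point mirrors the commit message: fail fast at load time with the layer name and both dimension pairs, rather than letting a shape mismatch surface as a RuntimeError at inference.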
…923)

User-installed LoRAs were written to /tmp/.daydream-scope/assets/lora/ which is ephemeral on fal.ai workers. When a worker was reset between jobs the /tmp directory was wiped, causing pipeline load failures for longlive and krea-realtime-video with errors like:

LoRA loading failed. File not found: /tmp/.daydream-scope/assets/lora/SUPERSUISH_LoRA_V1_000000750.safetensors

Fix: introduce USER_LORA_DIR = /data/models/user-loras (persistent volume) and point DAYDREAM_SCOPE_LORA_DIR at it instead of /tmp. This is kept separate from DAYDREAM_SCOPE_LORA_SHARED_DIR (/data/models/lora), which holds pre-bundled sample LoRAs that must not be cleaned up between sessions.

Session cleanup (cleanup_session_data / /internal/cleanup-session) is updated to wipe USER_LORA_DIR at session end so that one user's LoRAs cannot leak to the next user on the same worker.

Affected files:
- src/scope/cloud/fal_app.py — add USER_LORA_DIR, update env setup, update cleanup_session_data()
- src/scope/cloud/livepeer_fal_app.py — add USER_LORA_DIR, update runner env
- src/scope/cloud/livepeer_app.py — read USER_LORA_DIR from env, refactor cleanup helpers, clean loras on session end

Closes #923

Signed-off-by: Tessa (livepeer-tessa) <tessa@livepeer.org>
Contributor
🚀 fal.ai Preview Deployment
Livepeer Runner
Testing Livepeer Mode
Problem
User-installed LoRAs were written to /tmp/.daydream-scope/assets/lora/, which is ephemeral on fal.ai workers. When a worker was reset between jobs, /tmp/ was wiped, causing pipeline load failures for longlive and krea-realtime-video: 6+ occurrences observed 2026-04-12 15:17–15:22 UTC across sessions aa6d9669 and others. The LoRA loaded fine on the first job, then failed on every subsequent worker reset.

Closes #923.
Fix
Introduce
USER_LORA_DIR = /data/models/user-loras(persistent/datavolume) and pointDAYDREAM_SCOPE_LORA_DIRat it instead of/tmp. This is kept separate fromDAYDREAM_SCOPE_LORA_SHARED_DIR(/data/models/lora) which holds pre-bundled sample LoRAs.Session isolation is preserved:
cleanup_session_data()and the/internal/cleanup-sessionendpoint both wipeUSER_LORA_DIRat session end, preventing LoRA files from one user leaking to the next user on the same worker.Changes
- fal_app.py — add USER_LORA_DIR, update DAYDREAM_SCOPE_LORA_DIR env, update cleanup_session_data() to also wipe user LoRAs
- livepeer_fal_app.py — add USER_LORA_DIR, update runner env
- livepeer_app.py — read USER_LORA_DIR from env, refactor cleanup helpers, clean user LoRAs on session end

Testing
test_workflow_resolve.py tests pass ✅
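The directory layout and session cleanup described in the Fix section can be sketched as follows. The /data paths match the PR description, but `setup_lora_env` and `cleanup_user_loras` are hypothetical helper names, not the actual fal_app.py code:

```python
# Minimal sketch of the persistence + cleanup scheme from this PR.
# setup_lora_env and cleanup_user_loras are hypothetical helper names.
import shutil
import tempfile
from pathlib import Path

def setup_lora_env(env, user_lora_dir, shared_lora_dir):
    """Point the runner at the persistent user-LoRA directory while keeping
    the pre-bundled shared directory separate (it is never wiped)."""
    user_lora_dir.mkdir(parents=True, exist_ok=True)
    env["DAYDREAM_SCOPE_LORA_DIR"] = str(user_lora_dir)           # e.g. /data/models/user-loras
    env["DAYDREAM_SCOPE_LORA_SHARED_DIR"] = str(shared_lora_dir)  # e.g. /data/models/lora

def cleanup_user_loras(user_lora_dir):
    """Wipe user LoRAs at session end (as cleanup_session_data and
    /internal/cleanup-session do) so files never leak across sessions."""
    if user_lora_dir.exists():
        shutil.rmtree(user_lora_dir, ignore_errors=True)
    user_lora_dir.mkdir(parents=True, exist_ok=True)  # leave an empty dir ready

# Demo against a throwaway directory (a real worker would use /data):
base = Path(tempfile.mkdtemp())
env = {}
setup_lora_env(env, base / "user-loras", base / "lora")
(base / "user-loras" / "session.safetensors").write_text("fake weights")
cleanup_user_loras(base / "user-loras")
```

Keeping the shared sample-LoRA directory out of the cleanup path is the important detail: only the per-user directory is wiped between sessions, so pre-bundled assets survive while user uploads cannot cross session boundaries.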