feat: VRAM swap, VisionOffload strategies, Q2K/Q3K quantization, and low-VRAM GPU support #65
Open
soymh wants to merge 10 commits into
- Move SAM and CLIP vision models to CPU VarBuilder when device=CUDA
- Run vision forward pass on CPU, move output embeddings to CUDA
- Move token_embedding to CPU after loading (~800MB freed)
- Fix weight_f32 leak in DSQ quantization path
- Gate redundant F32 VarBuilder loads behind !device.is_cuda()

Tested on RTX 3050 4GB (3.68GB VRAM): model now loads and runs successfully with deepseek-ocr-q4k --device cuda.
Throughput: generation=21.99 tok/s (vs 16.5 tok/s in Python HF NF4).
Prefill still runs on CPU (92s) - to be improved by swap loading.
…ing, add swap module
- Refactor VisionContext to hold separate sam and clip references instead of a single VisionModules, allowing different device placement per model
- Add process_single_image to process patch crops individually through SAM, avoiding batched forward that multiplied activation memory 6x and caused OOM
- Create VramSwapManager in new swap.rs for future CUDA vision optimization (currently not active: SAM global attention score matrix ~1GB doesn't fit alongside LM + SAM weights on 3.68GB VRAM)

All vision models remain on CPU. Stable working state with ~92s prefill / ~23s gen.
Hybrid approach: global view on CPU (identical token count), patch crops on CUDA via SequentialVramSwap (SAM then CLIP, one model at a time).
- global CPU: vision.sam/vision.clip forward on CPU (match CPU path)
- patches CUDA: load SAM -> forward all -> unload, then CLIP same
- peak VRAM: ~2.26GB (LM + SAM alone), ~1.86GB (LM + CLIP alone)
- LM stays on CUDA permanently (QMatMul can't move devices)
- should_use_vram_swap: nvidia-smi / DEEPSEEK_OCR_VRAM_MB env var
- .cargo/config.toml: jobs=2 for FA2 build RAM cap
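The load -> forward -> unload ordering in this commit can be illustrated with a toy VRAM ledger. This is only a sketch, not the PR's SequentialVramSwap: `VramLedger`, `Resident`, and `sequential_peak_mb` are hypothetical names, nothing here touches CUDA, and the sizes used in the usage note are the approximate figures quoted in this PR plus an assumed ~1000MB for the LM (inferred from the quoted peaks).

```rust
use std::cell::Cell;

/// Toy ledger that tracks current and peak "VRAM" usage in MB.
/// Hypothetical stand-in for a real CUDA allocator.
#[derive(Default)]
struct VramLedger {
    used_mb: Cell<u64>,
    peak_mb: Cell<u64>,
}

/// RAII guard: registers its size on load and releases it on drop.
struct Resident<'a> {
    ledger: &'a VramLedger,
    size_mb: u64,
}

impl VramLedger {
    fn load(&self, size_mb: u64) -> Resident<'_> {
        let used = self.used_mb.get() + size_mb;
        self.used_mb.set(used);
        self.peak_mb.set(self.peak_mb.get().max(used));
        Resident { ledger: self, size_mb }
    }
}

impl Drop for Resident<'_> {
    fn drop(&mut self) {
        self.ledger.used_mb.set(self.ledger.used_mb.get() - self.size_mb);
    }
}

/// Returns peak usage when SAM and CLIP are loaded one at a time
/// while the LM stays resident.
fn sequential_peak_mb(lm_mb: u64, sam_mb: u64, clip_mb: u64) -> u64 {
    let ledger = VramLedger::default();
    let _lm = ledger.load(lm_mb); // LM stays resident throughout
    {
        let _sam = ledger.load(sam_mb);
        // ... SAM forward over all patches would run here ...
    } // SAM is released before CLIP is loaded
    {
        let _clip = ledger.load(clip_mb);
        // ... CLIP forward would run here ...
    }
    ledger.peak_mb.get()
}
```

With the assumed sizes, `sequential_peak_mb(1000, 1260, 860)` evaluates to 2260, which lines up with the ~2.26GB peak quoted above; holding both vision models at once would instead add their sizes.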
…ween chunks
Process 6 patch crops in chunk_size=2 mini-batches to keep activation VRAM under 2.4GB peak. 3 SAM forwards + 3 CLIP forwards vs the previous single batched forward (OOM'd at 6 crops simultaneously). Load/unload cycles: 2 vs 12 (original per-crop loop).
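The mini-batching described here amounts to slicing the crop list and running one forward per slice. A minimal sketch, assuming a generic `forward` closure in place of the real SAM/CLIP calls (`forward_in_chunks` is a hypothetical helper, not a function from this PR):

```rust
/// Splits `crops` into mini-batches of at most `chunk_size` items and
/// calls `forward` once per mini-batch, concatenating outputs in order.
fn forward_in_chunks<T, U>(
    crops: &[T],
    chunk_size: usize,
    mut forward: impl FnMut(&[T]) -> Vec<U>,
) -> Vec<U> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut out = Vec::with_capacity(crops.len());
    for batch in crops.chunks(chunk_size) {
        // Only `batch.len()` crops are in flight at once, which is what
        // bounds activation memory in the real pipeline.
        out.extend(forward(batch));
    }
    out
}
```

With 6 crops and `chunk_size = 2`, `forward` is called 3 times, matching the "3 SAM forwards + 3 CLIP forwards" count in the commit message.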
Run global SAM/CLIP on CPU and patch SAM/CLIP + projector on CUDA simultaneously using std::thread::scope. Both use separate hardware (CPU vs GPU) with no shared mutable state. Wall time: ~50s -> ~40s prefill on RTX 3050 4GB.
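The scoped-thread pattern this commit refers to can be sketched as follows. `run_hybrid` and its closure parameters are hypothetical; the closures stand in for the global-view and patch pipelines, and no CUDA work is actually performed in this example.

```rust
use std::thread;

/// Runs `cpu_branch` on a scoped thread while the calling thread runs
/// `gpu_branch`, then returns both results. Scoped threads may borrow
/// from the caller's stack, so the inputs need no `Arc` or cloning.
fn run_hybrid<A, B>(
    global_view: &[f32],
    patches: &[f32],
    cpu_branch: impl FnOnce(&[f32]) -> A + Send,
    gpu_branch: impl FnOnce(&[f32]) -> B,
) -> (A, B)
where
    A: Send,
{
    thread::scope(|s| {
        // Spawned branch: moves the closure in, borrows the slice.
        let cpu = s.spawn(move || cpu_branch(global_view));
        // The calling thread handles the other branch itself.
        let gpu_out = gpu_branch(patches);
        let cpu_out = cpu.join().expect("CPU branch panicked");
        (cpu_out, gpu_out)
    })
}
```

Because `thread::scope` joins every spawned thread before it returns, the borrowed slices are guaranteed to outlive both branches, which is why no shared mutable state or reference counting is needed.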
…for true FullGpu mode
FullGpu mode: evicts LM (~950MB) to CPU before vision, moves all vision processing to CUDA, restores LM after. UnsafeCell<DeepseekLanguageModel> enables safe device swap behind &self. Global SAM/CLIP stays on CPU to avoid fractured CUDA pool OOM on 4GB devices; patches bypass via SequentialVramSwap. VisionOffload added to VisionSettings, config, CLI.
…ed all-zero outputs
Remove LM eviction (CUDA->CPU->CUDA via qmatmul_to_device), which corrupted Q4_K quantized weights and made the model generate only padding token 0. Peak VRAM with LM (~950MB) + SAM/CLIP (~400MB temporary via swap) + activations = ~1.6GB, well within 4GB.
Add Q2_K and Q3_K dtype support throughout:
- DSQ runtime enum and quantization dispatch
- CLI and server argument parsing
- Built-in model registry entries and assets registry
- DSQ writer quantize/add tensor methods
- CLI fallback chain (Q2K/Q3K -> Q8_0 -> float)

Expected LM VRAM: Q3K ~720MB, Q2K ~480MB (vs Q4K ~950MB). SAM/CLIP remain F16 on CPU, unchanged.
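The fallback chain listed above (Q2K/Q3K -> Q8_0 -> float) can be modeled as a small state walk. This is a sketch under assumptions: the `Dtype` enum, `fallback`, and `resolve` are hypothetical names, only the dtypes named in the chain are modeled, and the `supported` predicate stands in for whatever check the CLI actually performs.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dtype {
    Q2K,
    Q3K,
    Q8_0,
    Float,
}

impl Dtype {
    /// Next dtype to try when `self` cannot be used.
    fn fallback(self) -> Option<Dtype> {
        match self {
            Dtype::Q2K | Dtype::Q3K => Some(Dtype::Q8_0),
            Dtype::Q8_0 => Some(Dtype::Float),
            Dtype::Float => None, // end of the chain
        }
    }
}

/// Walks the chain from `preferred` until `supported` accepts a dtype.
/// Returns None if nothing in the chain is accepted.
fn resolve(preferred: Dtype, supported: impl Fn(Dtype) -> bool) -> Option<Dtype> {
    let mut current = Some(preferred);
    while let Some(d) = current {
        if supported(d) {
            return Some(d);
        }
        current = d.fallback();
    }
    None
}
```

For example, `resolve(Dtype::Q3K, |d| d != Dtype::Q3K)` yields `Some(Dtype::Q8_0)`, the first step of the chain.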
…ek OCR adapter
When primary dtype is Q2_K, promote gate_proj and up_proj to Q4_K to preserve the gating signal inside each MoE layer. Attention projections (q/k/v/o) remain at Q2_K. Even with this fix, Q2K produces degraded output for the 12-layer MoE transformer; Q3K is the recommended minimum.
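The promotion rule in this commit is a per-tensor dtype override. A minimal sketch, assuming tensors are identified by name and that a substring match is enough (the `Quant` enum, `dtype_for_tensor`, and the example tensor names are hypothetical; the adapter in this PR may select tensors differently):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Quant {
    Q2K,
    Q3K,
    Q4K,
}

/// Picks the dtype for one tensor given the model-wide primary dtype.
/// Only the rule stated in the commit is modeled: under Q2K, tensors
/// whose name contains `gate_proj` or `up_proj` are promoted to Q4K.
fn dtype_for_tensor(primary: Quant, tensor_name: &str) -> Quant {
    let is_gate_or_up =
        tensor_name.contains("gate_proj") || tensor_name.contains("up_proj");
    match (primary, is_gate_or_up) {
        (Quant::Q2K, true) => Quant::Q4K,
        _ => primary, // attention projections and other dtypes are unchanged
    }
}
```

Under this sketch a name such as `layers.3.mlp.gate_proj.weight` maps to Q4K when the primary dtype is Q2K, while `layers.3.self_attn.q_proj.weight` stays at Q2K.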
Summary
This PR brings DeepSeek‑OCR to low-end NVIDIA GPUs (RTX 3050 4GB, GTX 1060, etc.) via a VRAM swap system that manages SAM/CLIP vision model memory on CUDA, plus expanded DSQ quantization support and a configurable VisionOffload strategy enum.

10 commits | 23 files changed | +2140 / −603
Key Changes
1. VisionOffload Strategy Enum (crates/core/src/runtime.rs)
Adds Auto (default), Sequential, FullGpu, Cpu modes. --vision-offload CLI/config flag replaces the old boolean vision_swap.

2. Sequential VRAM Swap (crates/infer-deepseek/src/model/swap.rs)
New SequentialVramSwap struct loads SAM (~1.26GB) then CLIP (~0.86GB) one at a time on CUDA, never holding both simultaneously. Auto-detects VRAM <6GB via nvidia-smi / DEEPSEEK_OCR_VRAM_MB env var.

3. Hybrid Parallel Mode (compute_image_embeddings_with_swap)
Global view runs on CPU in parallel with patch crops on CUDA via scoped threads, giving the best speed/VRAM trade-off on tight GPUs (~2.3GB peak for Q4K).
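The mode enum and the VRAM auto-detection from items 1 and 2 might look roughly like the sketch below. Several things are assumptions: the string spellings are taken from the `--vision-offload` values listed in this PR, "<6GB" is read as 6 * 1024 MB, the probe is tried before the env var because the PR calls the env var a fallback, and treating unknown VRAM as "no swap" is an arbitrary choice. `parse_vram_mb`, `query_nvidia_smi_mb`, and the signature of `should_use_vram_swap` are hypothetical.

```rust
use std::process::Command;
use std::str::FromStr;

/// The four modes named in this PR; the parsing is a guess.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum VisionOffload {
    #[default]
    Auto,
    Sequential,
    FullGpu,
    Cpu,
}

impl FromStr for VisionOffload {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "auto" => Ok(Self::Auto),
            "sequential" => Ok(Self::Sequential),
            "full-gpu" => Ok(Self::FullGpu),
            "cpu" => Ok(Self::Cpu),
            other => Err(format!("unknown vision offload mode: {other}")),
        }
    }
}

/// Assumed threshold: "<6GB" interpreted as 6 * 1024 MB.
const SWAP_THRESHOLD_MB: u64 = 6 * 1024;

/// Parses a size in MB from text such as "4096" or "4096\n".
/// Only the first line is used (nvidia-smi prints one line per GPU).
fn parse_vram_mb(text: &str) -> Option<u64> {
    text.lines().next()?.trim().parse().ok()
}

/// Asks nvidia-smi for the total memory of the first GPU, in MB.
/// Returns None if the tool is missing or its output is unexpected.
fn query_nvidia_smi_mb() -> Option<u64> {
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.total", "--format=csv,noheader,nounits"])
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    parse_vram_mb(&String::from_utf8_lossy(&out.stdout))
}

/// Probe first, env var value as fallback; unknown VRAM yields false.
fn should_use_vram_swap(
    probe: impl FnOnce() -> Option<u64>,
    env_mb: Option<&str>,
) -> bool {
    probe()
        .or_else(|| env_mb.and_then(parse_vram_mb))
        .map_or(false, |mb| mb < SWAP_THRESHOLD_MB)
}
```

A caller could wire this up as `should_use_vram_swap(query_nvidia_smi_mb, std::env::var("DEEPSEEK_OCR_VRAM_MB").ok().as_deref())`. Passing the probe as a closure keeps the decision logic testable without spawning a subprocess.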
4. Full GPU Mode (compute_image_embeddings_full_gpu)
All vision on CUDA: loads SAM → processes → drops → loads CLIP → processes. ~3.2GB peak. Faster vision prefill when VRAM allows.
5. LM Device Transfer (move_language_to_device, to_device)
UnsafeCell-based LM ownership transfer lets the language model move between CPU/CUDA. Weight structs gain to_device(&self, device) with QMatMul::QTensor raw byte reconstruction.

6. Q2K / Q3K DSQ Quantization (deepseek-ocr-q2k, deepseek-ocr-q3k)
New quantized model entries use Q2_K and Q3_K DSQ snapshots, further reducing memory.
7. Bug Fixes
- gate_proj/up_proj by clearing weight_f32 in quantized LinearWeights
- trace! logging for first 20 generated tokens

8. Config & CLI
- --vision-offload <auto|sequential|full-gpu|cpu> CLI flag
- --patches-per-batch <N> tune VRAM vs throughput (default 2)
- cpu_patches config option for hybrid mode tuning
- config.toml includes vision_swap = true, patches_per_batch = 2, vision_offload = "auto"

Pros
- --patches-per-batch lets users dial in VRAM vs throughput per card
- nvidia-smi + env var fallback means zero-config for most users

Cons
- UnsafeCell in model: language_model_mut() is unsafe; callers must guarantee no concurrent access; could be a footgun for future multi-threaded decode
- QMatMul device transfer hack: qmatmul_to_device reconstructs QTensor from raw bytes, which is brittle if upstream candle changes internal representation
- nvidia-smi dependency: VRAM auto-detection spawns a subprocess; DEEPSEEK_OCR_VRAM_MB env var fallback is manual
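To make the UnsafeCell concern concrete, here is a minimal sketch of the pattern. The method names `language_model_mut` and `move_language_to_device` come from this PR, but the bodies, the `Lm` and `Model` types, and the string device labels are hypothetical stand-ins rather than the PR's implementation.

```rust
use std::cell::UnsafeCell;

/// Hypothetical stand-in for the language model; it only records
/// which device it currently lives on.
struct Lm {
    device: &'static str,
}

struct Model {
    language: UnsafeCell<Lm>,
}

impl Model {
    fn new() -> Self {
        Model { language: UnsafeCell::new(Lm { device: "cuda" }) }
    }

    /// # Safety
    /// The caller must ensure no other reference to the LM is alive
    /// for as long as the returned borrow is used.
    #[allow(clippy::mut_from_ref)]
    unsafe fn language_model_mut(&self) -> &mut Lm {
        unsafe { &mut *self.language.get() }
    }

    /// Moves the LM to `device` through a shared reference.
    fn move_language_to_device(&self, device: &'static str) {
        // SAFETY: the mutable borrow is confined to this statement, and
        // `Model` is !Sync because of UnsafeCell, so `&Model` cannot be
        // shared across threads without an explicit `unsafe impl Sync`.
        unsafe { self.language_model_mut().device = device };
    }

    fn device(&self) -> &'static str {
        // SAFETY: short-lived read; no `&mut Lm` is alive at this point.
        unsafe { (*self.language.get()).device }
    }
}
```

The compiler cannot check the safety contract here, so any future caller that holds a reference into the LM across a device move would invoke undefined behavior. A `RefCell` or `Mutex` would turn that mistake into a runtime panic or a blocked thread instead, at the cost of a check on each access.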