feat: VRAM swap, VisionOffload strategies, Q2K/Q3K quantization, and low-VRAM GPU support by soymh · Pull Request #65 · TimmyOVO/deepseek-ocr.rs

soymh · 2026-05-20T08:11:36Z

Summary

This PR brings DeepSeek‑OCR to low-end NVIDIA GPUs (RTX 3050 4GB, GTX 1060, etc.) via a VRAM swap system that manages SAM/CLIP vision model memory on CUDA, plus expanded DSQ quantization support and a configurable VisionOffload strategy enum.

10 commits | 23 files changed | +2140 / −603

Key Changes

1. `VisionOffload` Strategy Enum (`crates/core/src/runtime.rs`)

Adds `Auto` (default), `Sequential`, `FullGpu`, and `Cpu` modes. The `--vision-offload` CLI/config flag replaces the old boolean `vision_swap`.
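
As a rough illustration of the four modes and the flag values listed under Config & CLI, here is a minimal sketch of such an enum with string parsing. The derive list, doc comments, and error type are assumptions, not the code in `crates/core/src/runtime.rs`.

```rust
use std::str::FromStr;

/// Vision placement strategy; variant names follow the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VisionOffload {
    /// Pick a strategy from the detected VRAM size.
    #[default]
    Auto,
    /// Load SAM and CLIP one at a time on CUDA.
    Sequential,
    /// Run all vision work on CUDA.
    FullGpu,
    /// Run all vision work on CPU.
    Cpu,
}

impl FromStr for VisionOffload {
    type Err = String;

    // Accepts the flag spellings `auto|sequential|full-gpu|cpu`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "auto" => Ok(Self::Auto),
            "sequential" => Ok(Self::Sequential),
            "full-gpu" => Ok(Self::FullGpu),
            "cpu" => Ok(Self::Cpu),
            other => Err(format!("unknown vision offload mode: {other}")),
        }
    }
}
```

With this shape, `"full-gpu".parse::<VisionOffload>()` yields `Ok(VisionOffload::FullGpu)` and an unrecognized value yields an error string.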

2. Sequential VRAM Swap (`crates/infer-deepseek/src/model/swap.rs`)

New SequentialVramSwap struct loads SAM (~1.26GB) then CLIP (~0.86GB) one at a time on CUDA, never holding both simultaneously. The swap is enabled automatically when detected VRAM is below 6GB, using nvidia-smi or the DEEPSEEK_OCR_VRAM_MB env var.
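
The detection step could look roughly like the sketch below. It assumes nvidia-smi is queried first and the env var is the fallback, and that an unknown VRAM size disables the swap; the function names and both of those policies are hypothetical, not taken from the PR's `should_use_vram_swap`.

```rust
use std::env;
use std::process::Command;

/// Threshold from the PR description: swap is enabled below 6 GB.
const SWAP_THRESHOLD_MB: u64 = 6 * 1024;

/// Parse the first line of `nvidia-smi --query-gpu=memory.total
/// --format=csv,noheader,nounits`, which prints one number per GPU.
fn parse_vram_mb(output: &str) -> Option<u64> {
    output.lines().next()?.trim().parse::<u64>().ok()
}

/// Ask nvidia-smi for total VRAM; fall back to DEEPSEEK_OCR_VRAM_MB.
fn detect_vram_mb() -> Option<u64> {
    let from_smi = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.total", "--format=csv,noheader,nounits"])
        .output()
        .ok()
        .filter(|out| out.status.success())
        .and_then(|out| parse_vram_mb(&String::from_utf8_lossy(&out.stdout)));
    // Assumed precedence: the env var is only consulted when nvidia-smi fails.
    from_smi.or_else(|| {
        env::var("DEEPSEEK_OCR_VRAM_MB")
            .ok()
            .and_then(|v| v.trim().parse::<u64>().ok())
    })
}

/// Hypothetical policy: swap only when VRAM is known and below the threshold.
fn should_swap(vram_mb: Option<u64>) -> bool {
    vram_mb.map_or(false, |mb| mb < SWAP_THRESHOLD_MB)
}
```

A caller would combine the two as `should_swap(detect_vram_mb())`.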

3. Hybrid Parallel Mode (`compute_image_embeddings_with_swap`)

Global view runs on CPU in parallel with patch crops on CUDA via scoped threads — best speed/VRAM trade-off on tight GPUs (~2.3GB peak for Q4K).
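
The scoped-thread pattern can be sketched as follows. The two `embed_*` functions are hypothetical stand-ins that do trivial arithmetic; in the PR the corresponding work is the SAM/CLIP forward passes.

```rust
use std::thread;

/// Stand-in for the global-view forward pass on CPU (hypothetical).
fn embed_global_cpu(image: &[f32]) -> Vec<f32> {
    image.iter().map(|x| x * 0.5).collect()
}

/// Stand-in for the patch-crop forward pass on CUDA (hypothetical).
fn embed_patches_gpu(patches: &[Vec<f32>]) -> Vec<Vec<f32>> {
    patches
        .iter()
        .map(|p| p.iter().map(|x| x + 1.0).collect())
        .collect()
}

/// Run both branches concurrently; scoped threads may borrow the inputs.
fn embed_hybrid(image: &[f32], patches: &[Vec<f32>]) -> (Vec<f32>, Vec<Vec<f32>>) {
    thread::scope(|s| {
        let global = s.spawn(|| embed_global_cpu(image));
        // The calling thread drives the patch branch while the spawned one runs.
        let local = embed_patches_gpu(patches);
        (global.join().expect("global-view thread panicked"), local)
    })
}
```

Because `std::thread::scope` joins every spawned thread before returning, the borrows of `image` and `patches` need no `Arc` or cloning.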

4. Full GPU Mode (`compute_image_embeddings_full_gpu`)

All vision on CUDA: loads SAM→processes→drops→loads CLIP→processes. ~3.2GB peak. Faster vision prefill when VRAM allows.
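
The load, process, drop ordering can be illustrated with a small RAII sketch. `Vram` and `Resident` are hypothetical types that only count megabytes of weights; they do not model the LM or activation memory, so the peak here is not the ~3.2GB figure above.

```rust
use std::cell::Cell;

/// Simulated VRAM accounting in MB (a stand-in for real allocations).
struct Vram {
    used_mb: Cell<u32>,
    peak_mb: Cell<u32>,
}

/// A model resident on the GPU; it releases its share when dropped.
struct Resident<'a> {
    vram: &'a Vram,
    size_mb: u32,
}

impl Vram {
    fn new() -> Self {
        Vram { used_mb: Cell::new(0), peak_mb: Cell::new(0) }
    }

    /// Record a load and update the running peak.
    fn load(&self, size_mb: u32) -> Resident<'_> {
        let used = self.used_mb.get() + size_mb;
        self.used_mb.set(used);
        self.peak_mb.set(self.peak_mb.get().max(used));
        Resident { vram: self, size_mb }
    }
}

impl Drop for Resident<'_> {
    fn drop(&mut self) {
        self.vram.used_mb.set(self.vram.used_mb.get() - self.size_mb);
    }
}

/// Load SAM, run it, drop it, then load CLIP.
fn run_vision(vram: &Vram, sam_mb: u32, clip_mb: u32) {
    {
        let _sam = vram.load(sam_mb);
        // SAM forward pass would run here.
    } // `_sam` is dropped here, before CLIP is loaded.
    let _clip = vram.load(clip_mb);
    // CLIP forward pass would run here.
}
```

With the weight sizes quoted earlier (about 1260MB and 860MB), the simulated peak is the larger of the two rather than their sum.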

5. LM Device Transfer (`move_language_to_device`, `to_device`)

UnsafeCell-based LM ownership transfer lets the language model move between CPU/CUDA. Weight structs gain to_device(&self, device) with QMatMul::QTensor raw byte reconstruction.
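
A minimal sketch of the `UnsafeCell` pattern, assuming a single-threaded caller. `Device`, `LanguageModel`, and `Model` are simplified hypothetical types; the real weight structs and `to_device` logic are not reproduced here.

```rust
use std::cell::UnsafeCell;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Cuda,
}

/// Simplified stand-in for the language model: it only records its device.
struct LanguageModel {
    device: Device,
}

struct Model {
    lm: UnsafeCell<LanguageModel>,
}

impl Model {
    /// # Safety
    /// The caller must guarantee that no other reference to the LM is live.
    unsafe fn language_model_mut(&self) -> &mut LanguageModel {
        unsafe { &mut *self.lm.get() }
    }

    /// Move the LM to `device` through a shared reference to the model.
    fn move_language_to_device(&self, device: Device) {
        // SAFETY: assumed single-threaded use with no outstanding LM borrows.
        let lm = unsafe { self.language_model_mut() };
        lm.device = device;
    }

    fn language_device(&self) -> Device {
        // SAFETY: same assumption as above; this is a short-lived read.
        unsafe { (*self.lm.get()).device }
    }
}
```

`UnsafeCell` is not `Sync`, so a `Model` like this cannot be shared across threads without further work; a `Mutex` or `RefCell` would replace the caller-side guarantee with a runtime check.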

6. Q2K / Q3K DSQ Quantization (`deepseek-ocr-q2k`, `deepseek-ocr-q3k`)

New quantized model entries use Q2_K and Q3_K DSQ snapshots, further reducing memory.

7. Bug Fixes

- Q4_K round-trip corruption: keep the LM on CUDA during FullGpu vision; evict only for sequential mode
- Q2_K precision loss: protect gate_proj/up_proj by clearing weight_f32 in quantized LinearWeights
- First-token debug logging: added trace! logging for the first 20 generated tokens

8. Config & CLI

- `--vision-offload <auto|sequential|full-gpu|cpu>` CLI flag
- `--patches-per-batch <N>` to tune VRAM vs throughput (default 2)
- `cpu_patches` config option for hybrid mode tuning
- Generated config.toml includes `vision_swap = true`, `patches_per_batch = 2`, `vision_offload = "auto"`

Pros

- Enables 4GB GPUs: DeepSeek‑OCR Q4K now runs on RTX 3050, GTX 1060, etc. (~2.3GB peak VRAM)
- Flexible strategy enum: auto-detect, sequential hybrid, full-GPU, or CPU-only; one flag covers all scenarios
- Hybrid parallel speed: global view on CPU and patches on CUDA in parallel gives the best performance for sub-6GB cards without throttling
- Full GPU mode: when VRAM allows, keeps everything on CUDA for fastest vision prefill (~3.2GB)
- Configurable batch size: --patches-per-batch lets users dial in VRAM vs throughput per card
- Q2K/Q3K support: further quantization options for aggressive memory reduction
- Correct Q4_K round-trip: LM stays on CUDA during FullGpu, preventing all-zero decoder corruption
- Auto-detection: nvidia-smi + env var fallback means zero-config for most users

Cons

- UnsafeCell in model: language_model_mut() is unsafe; callers must guarantee no concurrent access, which could be a footgun for future multi-threaded decode
- Sequential swap is slower: loading SAM/CLIP one at a time adds ~1-2s per image vs keeping both on GPU
- Hybrid mode CPU cost: global view on CPU can become a bottleneck on systems with weak CPUs
- QMatMul device transfer hack: qmatmul_to_device reconstructs QTensor from raw bytes, which is brittle if upstream candle changes its internal representation
- nvidia-smi dependency: VRAM auto-detection spawns a subprocess; the DEEPSEEK_OCR_VRAM_MB env var fallback is manual
- Vision swap only for DeepSeek‑OCR: PaddleOCR‑VL and DotsOCR don't benefit; the added complexity is unused for other backends
- Limited testing: tested primarily on Q4K CUDA; FP16 vision, FullGpu, and non‑CUDA paths have less coverage

- Move SAM and CLIP vision models to CPU VarBuilder when device=CUDA
- Run vision forward pass on CPU, move output embeddings to CUDA
- Move token_embedding to CPU after loading (~800MB freed)
- Fix weight_f32 leak in DSQ quantization path
- Gate redundant F32 VarBuilder loads behind !device.is_cuda()

Tested on RTX 3050 4GB (3.68GB VRAM): model now loads and runs successfully with deepseek-ocr-q4k --device cuda. Throughput: generation=21.99 tok/s (vs 16.5 tok/s in Python HF NF4). Prefill still runs on CPU (92s); to be improved by swap loading.

…ing, add swap module

- Refactor VisionContext to hold separate sam and clip references instead of a single VisionModules, allowing different device placement per model
- Add process_single_image to process patch crops individually through SAM, avoiding the batched forward that multiplied activation memory 6x and caused OOM
- Create VramSwapManager in new swap.rs for future CUDA vision optimization (currently not active: the SAM global attention score matrix, ~1GB, doesn't fit alongside LM + SAM weights on 3.68GB VRAM)

All vision models remain on CPU. Stable working state with ~92s prefill / ~23s gen.

Hybrid approach: global view on CPU (identical token count), patch crops on CUDA via SequentialVramSwap (SAM then CLIP, one model at a time).

- global CPU: vision.sam/vision.clip forward on CPU (match CPU path)
- patches CUDA: load SAM -> forward all -> unload, then CLIP same
- peak VRAM: ~2.26GB (LM + SAM alone), ~1.86GB (LM + CLIP alone)
- LM stays on CUDA permanently (QMatMul can't move devices)
- should_use_vram_swap: nvidia-smi / DEEPSEEK_OCR_VRAM_MB env var
- .cargo/config.toml: jobs=2 for FA2 build RAM cap

…ween chunks

Process 6 patch crops in chunk_size=2 mini-batches to keep activation VRAM under 2.4GB peak. 3 SAM forwards + 3 CLIP forwards vs the previous single batched forward (OOM'd at 6 crops simultaneously). Load/unload cycles: 2 vs 12 (original per-crop loop).
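
The mini-batching arithmetic (6 crops at 2 per batch gives 3 forwards per model) can be sketched with slice chunking. `forward_in_chunks` and its callback are hypothetical; the callback stands in for one SAM or CLIP forward pass.

```rust
/// Run `forward` over `crops` in mini-batches and return the number of passes.
fn forward_in_chunks<T>(
    crops: &[T],
    patches_per_batch: usize,
    mut forward: impl FnMut(&[T]),
) -> usize {
    // `chunks` panics on a size of 0, so clamp to at least 1.
    let size = patches_per_batch.max(1);
    let mut passes = 0;
    for batch in crops.chunks(size) {
        forward(batch);
        passes += 1;
    }
    passes
}
```

A larger batch size means fewer forward passes but more activation memory per pass, which is the trade-off the `--patches-per-batch` flag exposes.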

Run global SAM/CLIP on CPU and patch SAM/CLIP + projector on CUDA simultaneously using std::thread::scope. Both use separate hardware (CPU vs GPU) with no shared mutable state. Wall time: ~50s -> ~40s prefill on RTX 3050 4GB.

…for true FullGpu mode

FullGpu mode: evicts LM (~950MB) to CPU before vision, moves all vision processing to CUDA, restores LM after. UnsafeCell<DeepseekLanguageModel> enables safe device swap behind &self. Global SAM/CLIP stays on CPU to avoid fractured CUDA pool OOM on 4GB devices; patches bypass via SequentialVramSwap. VisionOffload added to VisionSettings, config, CLI.

…ed all-zero outputs

Remove LM eviction (CUDA->CPU->CUDA via qmatmul_to_device), which corrupted Q4_K quantized weights and made the model generate only padding token 0. Peak VRAM with LM (~950MB) + SAM/CLIP (~400MB temporary via swap) + activations = ~1.6GB, well within 4GB.

Add Q2_K and Q3_K dtype support throughout:

- DSQ runtime enum and quantization dispatch
- CLI and server argument parsing
- Built-in model registry entries and assets registry
- DSQ writer quantize/add tensor methods
- CLI fallback chain (Q2K/Q3K -> Q8_0 -> float)

Expected LM VRAM: Q3K ~720MB, Q2K ~480MB (vs Q4K ~950MB). SAM/CLIP remain F16 on CPU, unchanged.
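
One way to read the fallback chain is as an ordered list of candidates. The sketch below assumes a caller-supplied predicate that says whether a dtype is usable; the commit message does not say what triggers the fallback, so both `WeightDtype` and `pick_dtype` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WeightDtype {
    Q2K,
    Q3K,
    Q8_0,
    Float,
}

/// Try the preferred dtype, then Q8_0, and finally fall back to float.
fn pick_dtype(preferred: WeightDtype, usable: impl Fn(WeightDtype) -> bool) -> WeightDtype {
    for candidate in [preferred, WeightDtype::Q8_0] {
        if usable(candidate) {
            return candidate;
        }
    }
    // Float is treated as always available in this sketch.
    WeightDtype::Float
}
```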

…ek OCR adapter

When primary dtype is Q2_K, promote gate_proj and up_proj to Q4_K to preserve the gating signal inside each MoE layer. Attention projections (q/k/v/o) remain at Q2_K. Even with this fix, Q2K produces degraded output for the 12-layer MoE transformer; Q3K is the recommended minimum.
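
The promotion rule can be sketched as a per-tensor dtype choice. The name-matching by substring and the `Quant` enum are assumptions for illustration, and the tensor names in the usage note are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Quant {
    Q2K,
    Q3K,
    Q4K,
}

/// Choose the dtype for one tensor: under Q2K, gate/up projections go to Q4K.
fn dtype_for_tensor(name: &str, primary: Quant) -> Quant {
    let is_gate_or_up = name.contains("gate_proj") || name.contains("up_proj");
    if primary == Quant::Q2K && is_gate_or_up {
        Quant::Q4K
    } else {
        primary
    }
}
```

Under this rule a tensor named like `layers.0.mlp.gate_proj.weight` would be stored as Q4K when the primary dtype is Q2K, while `layers.0.self_attn.q_proj.weight` would stay at Q2K.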

soymh added 10 commits May 18, 2026 01:22

perf: parallelize global CPU and patch CUDA vision passes

276eb49


feat: configurable vision_swap and patches_per_batch via CLI/config

3a7d11c

soymh wants to merge 10 commits into TimmyOVO:master from soymh:feat/cuda-vram-swap
