Skip to content

feat: VRAM swap, VisionOffload strategies, Q2K/Q3K quantization, and low-VRAM GPU support#65

Open
soymh wants to merge 10 commits into
TimmyOVO:masterfrom
soymh:feat/cuda-vram-swap
Open

feat: VRAM swap, VisionOffload strategies, Q2K/Q3K quantization, and low-VRAM GPU support#65
soymh wants to merge 10 commits into
TimmyOVO:masterfrom
soymh:feat/cuda-vram-swap

Conversation

@soymh

@soymh soymh commented May 20, 2026

Copy link
Copy Markdown

Summary

This PR brings DeepSeek‑OCR to low-end NVIDIA GPUs (RTX 3050 4GB, GTX 1060, etc.) via a VRAM swap system that manages SAM/CLIP vision model memory on CUDA, plus expanded DSQ quantization support and a configurable VisionOffload strategy enum.

10 commits | 23 files changed | +2140 / −603


Key Changes

1. VisionOffload Strategy Enum (crates/core/src/runtime.rs)

Adds Auto (default), Sequential, FullGpu, Cpu modes. --vision-offload CLI/config flag replaces the old boolean vision_swap.

2. Sequential VRAM Swap (crates/infer-deepseek/src/model/swap.rs)

New SequentialVramSwap struct loads SAM (~1.26GB) then CLIP (~0.86GB) one at a time on CUDA, never holding both simultaneously. Auto-detects VRAM <6GB via nvidia-smi / DEEPSEEK_OCR_VRAM_MB env var.

3. Hybrid Parallel Mode (compute_image_embeddings_with_swap)

Global view runs on CPU in parallel with patch crops on CUDA via scoped threads — best speed/VRAM trade-off on tight GPUs (~2.3GB peak for Q4K).

4. Full GPU Mode (compute_image_embeddings_full_gpu)

All vision on CUDA: loads SAM→processes→drops→loads CLIP→processes. ~3.2GB peak. Faster vision prefill when VRAM allows.

5. LM Device Transfer (move_language_to_device, to_device)

UnsafeCell-based LM ownership transfer lets the language model move between CPU/CUDA. Weight structs gain to_device(&self, device) with QMatMul::QTensor raw byte reconstruction.

6. Q2K / Q3K DSQ Quantization (deepseek-ocr-q2k, deepseek-ocr-q3k)

New quantized model entries use Q2_K and Q3_K DSQ snapshots, further reducing memory.

7. Bug Fixes

  • Q4_K round-trip corruption: Keep LM on CUDA during FullGpu vision, evict only for sequential mode
  • Q2_K precision loss: Protect gate_proj/up_proj by clearing weight_f32 in quantized LinearWeights
  • Generate first tokens debug logging: Added trace! logging for first 20 generated tokens

8. Config & CLI

  • --vision-offload <auto|sequential|full-gpu|cpu> CLI flag
  • --patches-per-batch <N> tune VRAM vs throughput (default 2)
  • cpu_patches config option for hybrid mode tuning
  • Generated config.toml includes vision_swap = true, patches_per_batch = 2, vision_offload = "auto"

Pros

  1. Enables 4GB GPUs: DeepSeek‑OCR Q4K now runs on RTX 3050, GTX 1060, etc (~2.3GB peak VRAM)
  2. Flexible strategy enum: Auto-detect, sequential hybrid, full-GPU, or CPU-only — one flag fits all scenarios
  3. Hybrid parallel speed: Global CPU + patch CUDA in parallel gives the best perf for sub-6GB cards without throttling
  4. Full GPU mode: When VRAM allows, keeps everything on CUDA for fastest vision prefill (~3.2GB)
  5. Configurable batch size: --patches-per-batch lets users dial in VRAM vs throughput per card
  6. Q2K/Q3K support: Further quantization options for aggressive memory reduction
  7. Correct Q4_K round-trip: LM stays on CUDA during FullGpu, preventing all-zero decoder corruption
  8. Auto-detection: nvidia-smi + env var fallback means zero-config for most users

Cons

  1. UnsafeCell in model: language_model_mut() is unsafe — callers must guarantee no concurrent access; could be a footgun for future multi-threaded decode
  2. Sequential swap is slower: Loading SAM/CLIP one at a time adds ~1-2s per image vs keeping both on GPU
  3. Hybrid mode CPU cost: Global view on CPU can become a bottleneck on systems with weak CPUs
  4. QMatMul device transfer hack: qmatmul_to_device reconstructs QTensor from raw bytes — brittle if upstream candle changes internal representation
  5. nvidia-smi dependency: VRAM auto-detection spawns a subprocess; DEEPSEEK_OCR_VRAM_MB env var fallback is manual
  6. Vision swap only for DeepSeek‑OCR: PaddleOCR‑VL and DotsOCR don't benefit; the added complexity is unused for other backends
  7. Limited testing: Tested primarily on Q4K CUDA; FP16 vision, FullGpu, and non‑CUDA paths have less coverage

soymh added 10 commits May 18, 2026 01:22
- Move SAM and CLIP vision models to CPU VarBuilder when device=CUDA
- Run vision forward pass on CPU, move output embeddings to CUDA
- Move token_embedding to CPU after loading (~800MB freed)
- Fix weight_f32 leak in DSQ quantization path
- Gate redundant F32 VarBuilder loads behind !device.is_cuda()

Tested on RTX 3050 4GB (3.68GB VRAM): model now loads and runs
successfully with deepseek-ocr-q4k --device cuda.
Throughput: generation=21.99 tok/s (vs 16.5 tok/s in Python HF NF4)
Prefill still runs on CPU (92s) - to be improved by swap loading.
…ing, add swap module

- Refactor VisionContext to hold separate sam and clip references instead
  of a single VisionModules, allowing different device placement per model
- Add process_single_image to process patch crops individually through SAM,
  avoiding batched forward that multiplied activation memory 6x and caused OOM
- Create VramSwapManager in new swap.rs for future CUDA vision optimization
  (currently not active: SAM global attention score matrix ~1GB doesn't fit
  alongside LM + SAM weights on 3.68GB VRAM)

All vision models remain on CPU. Stable working state with ~92s prefill / ~23s gen.
Hybrid approach: global view on CPU (identical token count), patch crops
on CUDA via SequentialVramSwap (SAM then CLIP, one model at a time).

- global CPU: vision.sam/vision.clip forward on CPU (match CPU path)
- patches CUDA: load SAM -> forward all -> unload, then CLIP same
- peak VRAM: ~2.26GB (LM + SAM alone), ~1.86GB (LM + CLIP alone)
- LM stays on CUDA permanently (QMatMul can't move devices)
- should_use_vram_swap: nvidia-smi / DEEPSEEK_OCR_VRAM_MB env var
- .cargo/config.toml: jobs=2 for FA2 build RAM cap
…ween chunks

Process 6 patch crops in chunk_size=2 mini-batches to keep activation
VRAM under 2.4GB peak. 3 SAM forwards + 3 CLIP forwards vs the
previous single batched forward (OOM'd at 6 crops simultaneously).

Load/unload cycles: 2 vs 12 (original per-crop loop).
Run global SAM/CLIP on CPU and patch SAM/CLIP + projector on CUDA
simultaneously using std::thread::scope. Both use separate hardware
(CPU vs GPU) with no shared mutable state.

Wall time: ~50s -> ~40s prefill on RTX 3050 4GB.
…for true FullGpu mode

FullGpu mode: evicts LM (~950MB) to CPU before vision, moves all vision
processing to CUDA, restores LM after. UnsafeCell<DeepseekLanguageModel>
enables safe device swap behind &self. Global SAM/CLIP stays on CPU to
avoid fractured CUDA pool OOM on 4GB devices; patches bypass via
SequentialVramSwap. VisionOffload added to VisionSettings, config, CLI.
…ed all-zero outputs

Removing LM eviction (CUDA->CPU->CUDA via qmatmul_to_device) which
corrupted Q4_K quantized weights, making the model generate only
padding token 0. Peak VRAM with LM (~950MB) + SAM/CLIP (~400MB
temporary via swap) + activations = ~1.6GB, well within 4GB.
Add Q2_K and Q3_K dtype support throughout:
- DSQ runtime enum and quantization dispatch
- CLI and server argument parsing
- Built-in model registry entries and assets registry
- DSQ writer quantize/add tensor methods
- CLI fallback chain (Q2K/Q3K -> Q8_0 -> float)

Expected LM VRAM: Q3K ~720MB, Q2K ~480MB (vs Q4K ~950MB).
SAM/CLIP remain F16 on CPU, unchanged.
…ek OCR adapter

When primary dtype is Q2_K, promote gate_proj and up_proj to
Q4_K to preserve the gating signal inside each MoE layer.
Attention projections (q/k/v/o) remain at Q2_K.

Even with this fix, Q2K produces degraded output for the
12-layer MoE transformer — Q3K is the recommended minimum.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant