Dsv4 sparse prefill optimization by skipping odd tail block by Njuapp · Pull Request #187 · deepseek-ai/FlashMLA

Njuapp · 2026-06-09T03:15:04Z

Optimize DS v4 prefill HCA kernel by skipping odd tail blocks.

At ISL=8k (input sequence length), this cuts HCA kernel cost by 20%.

build: replace torch/extension.h with libtorch headers
build: remove torch/python.h include from kerutils tensor helpers
build: avoid utils.h include collision with sgl-kernel
fix: keep FMHACutlassSM100FwdRun ABI and cast internally
compat: convert sparse prefill max_logits/lse back to log2

Co-authored-by: Bruce Wu <mogicianwu@fb.com>
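For readers unfamiliar with the `compat:` item above, here is a minimal sketch of what converting `max_logits`/`lse` to log2 units amounts to numerically. It assumes the values start out in natural-log units, which the commit message does not state. The helper names (`ln_to_log2`, `logsumexp_ln`, `logsumexp_log2`) are hypothetical and are not FlashMLA APIs. The only fact relied on is the change-of-base identity log2(x) = ln(x) · log2(e).

```python
import math

# Change-of-base factor: log2(x) = ln(x) * log2(e) = ln(x) / ln(2).
LOG2_E = math.log2(math.e)


def ln_to_log2(value_ln: float) -> float:
    """Convert a log-domain value (max logit or LSE) from natural-log to log2 units.

    Hypothetical helper for illustration; assumes the input is in natural-log units.
    """
    return value_ln * LOG2_E


def logsumexp_ln(scores):
    """Numerically stable log-sum-exp in natural-log units."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def logsumexp_log2(scores_log2):
    """Numerically stable log-sum-exp in base 2; inputs are already scaled by log2(e)."""
    m = max(scores_log2)
    return m + math.log2(sum(2.0 ** (s - m) for s in scores_log2))


scores = [0.5, -1.25, 2.0]

# Path A: compute in natural-log units, then convert the result.
lse_log2 = ln_to_log2(logsumexp_ln(scores))
max_log2 = ln_to_log2(max(scores))

# Path B: scale the scores first and work in base 2 throughout.
direct = logsumexp_log2([s * LOG2_E for s in scores])

# Both paths agree up to floating-point rounding.
assert math.isclose(lse_log2, direct)
```

Because the scale factor is positive, the same multiplication applies to the running maximum and to the LSE, so a caller expecting log2 values can be served by one multiply on each output.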

FlamingoPg and others added 5 commits February 17, 2026 00:28

docs: add and refine SGLang maintenance workflow

f8b4903

decode-meta: add low-smem fallback for large batch scheduling

9804b12

Fix flashmla_kv nsa backend accuracy issue on B200 (deepseek-ai#5)

abb5477

Co-authored-by: Bruce Wu <mogicianwu@fb.com>

Optimize DSV4 sparse prefill C128 path

4d2a742

Njuapp changed the title ~~Dsv4 sparse prefill stable~~ Dsv4 sparse prefill optimization by skipping odd tail block Jun 9, 2026

Njuapp wants to merge 5 commits into deepseek-ai:main from Njuapp:dsv4-sparse-prefill-stable


3 participants
