ggml : add ggml_quantize_chunk_mt for parallel row quantization#22743
Closed
shikaku2 wants to merge 1 commit into
Conversation
iq4_nl uses a non-linear lookup-table search per block that makes its single-threaded quantization throughput ~95x slower than other 4-bit types (12 MB/s vs 1000+ MB/s on a 16-core CPU). Since blocks are fully independent, the work parallelises perfectly across rows.

ggml_quantize_chunk_mt splits [start, start + nrows * n_per_row) evenly across n_threads workers, each calling ggml_quantize_chunk on its row range. Because ggml_quantize_chunk writes to dst at start_row * row_size, threads target non-overlapping regions of the output buffer, so no locking is required.

Benchmark (77 MiB f16 KV state, 32 segments, Ryzen 9 7950X 16c/32t):

type     threads   MiB/s
iq4_nl   1         12.3
iq4_nl   16        131.6 (~10.7x)
q8_0     1         463
q4_0     1         1172

All other types are already fast enough that threading overhead is noise; the fallback to single-threaded ggml_quantize_chunk fires automatically when n_threads <= 1 or nrows <= 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
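The row-splitting scheme described above can be sketched roughly as follows. This is a minimal, self-contained C sketch using pthreads, not the actual ggml implementation: quantize_rows is a stand-in for ggml_quantize_chunk (here it just scales f32 values to int8), and the even-split arithmetic is an assumption based on the description. The key property it illustrates is that each worker writes a disjoint region of dst, so no synchronization beyond the final join is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ggml_quantize_chunk: "quantize" f32 rows to int8.
 * Writes dst starting at first_row * n_per_row, matching the PR's
 * observation that output regions are addressed by row offset. */
static void quantize_rows(const float *src, int8_t *dst,
                          int64_t first_row, int64_t nrows, int64_t n_per_row) {
    for (int64_t r = first_row; r < first_row + nrows; ++r)
        for (int64_t c = 0; c < n_per_row; ++c)
            dst[r * n_per_row + c] = (int8_t)(src[r * n_per_row + c] * 127.0f);
}

typedef struct {
    const float *src;
    int8_t     *dst;
    int64_t     first_row, nrows, n_per_row;
} chunk_job;

static void *worker(void *arg) {
    chunk_job *j = (chunk_job *)arg;
    /* Disjoint row ranges -> disjoint dst regions -> no locking needed. */
    quantize_rows(j->src, j->dst, j->first_row, j->nrows, j->n_per_row);
    return NULL;
}

/* Parallel wrapper: split nrows evenly across n_threads, with the
 * single-threaded fallback the PR describes for n_threads <= 1 or
 * nrows <= 1. (Hypothetical name; the real API is ggml's.) */
static void quantize_rows_mt(const float *src, int8_t *dst,
                             int64_t nrows, int64_t n_per_row, int n_threads) {
    if (n_threads <= 1 || nrows <= 1) {
        quantize_rows(src, dst, 0, nrows, n_per_row);
        return;
    }
    enum { MAX_THREADS = 64 };
    if (n_threads > MAX_THREADS) n_threads = MAX_THREADS;

    pthread_t tids[MAX_THREADS];
    chunk_job jobs[MAX_THREADS];
    int64_t per = nrows / n_threads;
    int64_t rem = nrows % n_threads;
    int64_t row = 0;
    for (int t = 0; t < n_threads; ++t) {
        int64_t take = per + (t < rem ? 1 : 0); /* spread the remainder */
        jobs[t] = (chunk_job){ src, dst, row, take, n_per_row };
        row += take;
        pthread_create(&tids[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < n_threads; ++t)
        pthread_join(tids[t], NULL);
}
```

Because the multithreaded path partitions the same row range the single-threaded path covers, both must produce byte-identical output; that invariant is what makes the optimization safe to drop in behind the existing API.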
kv-cache: use -t threads for IQ4 packing

Summary
This PR adds ggml_quantize_chunk_mt, a parallel wrapper around ggml_quantize_chunk that splits rows across worker threads, which makes IQ4 KV-cache quantization use the configured llama.cpp thread count.
Previously, iq4_nl KV-cache packing was effectively single-threaded and much slower than the other supported KV-cache formats. This made IQ4 impractical for large prompt-cache / large-context workloads, even though its packed size and RMSE characteristics are useful.

Motivation
IQ4 is attractive for KV-cache storage because it gives the same packed size as q4_0 in this test while preserving better V-cache RMSE. However, before this change, iq4_nl packing was dramatically slower than the other formats; with the change it is roughly 9.76x faster on this test file.

Benchmark
Benchmark command:
Test input:
Results:
Ryzen 9 5950X (16 physical cores / 32 SMT threads). Scaling is near-linear through the physical core count, and SMT continues to improve throughput to ~12x at 32 threads:
Notes
The largest improvement is for iq4_nl, which was the main target of this change. Other formats either already had relatively low compute cost or are dominated by simpler conversion paths, so their scaling is smaller. f32, f16, and bf16 are effectively unchanged, as expected.

The important practical result is that IQ4 KV-caching is no longer prohibitively slow for larger context sizes. This makes IQ4 much more usable for large prompt-cache and large-context workloads where packed KV size matters.
This PR is intentionally limited to the IQ4 multithreading backend change, making it reviewable independently from higher-level KV-cache policy changes.