
ggml : add ggml_quantize_chunk_mt for parallel row quantization #22743

Closed

shikaku2 wants to merge 1 commit into ggml-org:master from shikaku2:iq4-quantize-mt

Conversation

@shikaku2 shikaku2 commented May 6, 2026

kv-cache: use -t threads for IQ4 packing

Summary

This PR adds ggml_quantize_chunk_mt, a parallel wrapper around ggml_quantize_chunk that splits rows across worker threads, so that IQ4 KV-cache quantization uses the configured llama.cpp thread count.
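
For readers, here is a minimal sketch of how such a row-splitting wrapper can be structured. This is not the PR's actual code: quantize_chunk_mt_sketch is a placeholder name, the imatrix argument is passed as nullptr for brevity, and only the ggml_quantize_chunk signature from ggml.h is assumed.

```cpp
// Sketch only: divide nrows across n_threads workers; each worker calls the
// existing single-threaded ggml_quantize_chunk on its own contiguous row range.
#include <thread>
#include <vector>

#include "ggml.h"

size_t quantize_chunk_mt_sketch(enum ggml_type type, const float * src, void * dst,
                                int64_t nrows, int64_t n_per_row, int n_threads) {
    if (n_threads <= 1 || nrows <= 1) {
        // threading would not help here; fall back to the single-threaded path
        return ggml_quantize_chunk(type, src, dst, 0, nrows, n_per_row, nullptr);
    }

    std::vector<size_t>      sizes(n_threads, 0);
    std::vector<std::thread> workers;

    for (int t = 0; t < n_threads; ++t) {
        const int64_t row0 = nrows *  t      / n_threads;
        const int64_t row1 = nrows * (t + 1) / n_threads;
        if (row0 == row1) {
            continue;
        }
        workers.emplace_back([=, &sizes]() {
            // ggml_quantize_chunk offsets src by `start` elements and dst by
            // (start / n_per_row) * row_size internally, so every worker writes
            // a disjoint region of dst and no locking is required.
            sizes[t] = ggml_quantize_chunk(type, src, dst,
                                           /*start =*/ row0 * n_per_row,
                                           row1 - row0, n_per_row, nullptr);
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    size_t total = 0;
    for (size_t s : sizes) {
        total += s;
    }
    return total; // total bytes written across all row ranges
}
```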

Previously, iq4_nl KV-cache packing was effectively single-threaded and was much slower than the other supported KV cache formats. This made IQ4 impractical for large prompt-cache / large-context workloads, even though the packed size and RMSE characteristics are useful.

Motivation

IQ4 is attractive for KV-cache storage because it gives the same packed size as q4_0 in this test while achieving lower K- and V-cache RMSE:

| Type | Packed MiB | RMSE K | RMSE V |
|---|---|---|---|
| q4_0 | 203.572 | 0.293315 | 0.0301958 |
| iq4_nl | 203.572 | 0.237420 | 0.0257450 |

However, before this change, iq4_nl packing was dramatically slower than the other formats:

| Type | Threads | Seconds | Input MiB/s |
|---|---|---|---|
| iq4_nl | 1 | 59.5679 | 12.151 |
| iq4_nl | 16 | 6.1020 | 118.619 |

That is roughly a 9.76x speedup for iq4_nl on this test file.

Benchmark

Benchmark command:

python bench-kv-cache-quants-speed.py --rebuild --threads 1 ./kv-zstd/dicttest/kv-frankenstein100000.bin --rmse
python bench-kv-cache-quants-speed.py --rebuild --threads 16 ./kv-zstd/dicttest/kv-frankenstein100000.bin --rmse

Test input:

kv-frankenstein100000.bin
KV cache generated from the first 100000 characters of Frankenstein
32 KV segments
723.812 MiB input

Results:

| Type | Threads | Input MiB | Packed MiB | Seconds | Input MiB/s | RMSE K | RMSE V |
|---|---|---|---|---|---|---|---|
| f32 | 1 | 723.812 | 1447.620 | 1.11001 | 652.075 | 0 | 0 |
| f32 | 16 | 723.812 | 1447.620 | 1.10453 | 655.313 | 0 | 0 |
| f16 | 1 | 723.812 | 723.812 | 1.74912 | 413.815 | 0 | 0 |
| f16 | 16 | 723.812 | 723.812 | 1.74621 | 414.504 | 0 | 0 |
| bf16 | 1 | 723.812 | 723.812 | 1.04079 | 695.446 | 0.00427049 | 0.000530655 |
| bf16 | 16 | 723.812 | 723.812 | 1.05612 | 685.349 | 0.00427049 | 0.000530655 |
| q8_0 | 1 | 723.812 | 384.525 | 2.07237 | 349.268 | 0.0184124 | 0.00188816 |
| q8_0 | 16 | 723.812 | 384.525 | 1.09429 | 661.445 | 0.0184124 | 0.00188816 |
| q4_0 | 1 | 723.812 | 203.572 | 1.15734 | 625.410 | 0.293315 | 0.0301958 |
| q4_0 | 16 | 723.812 | 203.572 | 1.00654 | 719.109 | 0.293315 | 0.0301958 |
| q4_1 | 1 | 723.812 | 226.191 | 1.10527 | 654.874 | 0.235459 | 0.0261354 |
| q4_1 | 16 | 723.812 | 226.191 | 0.998586 | 724.838 | 0.235459 | 0.0261354 |
| iq4_nl | 1 | 723.812 | 203.572 | 59.5679 | 12.151 | 0.237420 | 0.0257450 |
| iq4_nl | 16 | 723.812 | 203.572 | 6.10200 | 118.619 | 0.237420 | 0.0257450 |
| q5_0 | 1 | 723.812 | 248.811 | 1.57535 | 459.462 | 0.146387 | 0.0150381 |
| q5_0 | 16 | 723.812 | 248.811 | 1.18220 | 612.259 | 0.146387 | 0.0150381 |
| q5_1 | 1 | 723.812 | 271.430 | 1.50332 | 481.476 | 0.113917 | 0.0126468 |
| q5_1 | 16 | 723.812 | 271.430 | 1.18375 | 611.458 | 0.113917 | 0.0126468 |

Thread scaling on a Ryzen 9 5950X (16 physical cores / 32 SMT threads) is near-linear up to the physical core count, and SMT continues to improve throughput to roughly 12x at 32 threads:

| Threads | Seconds | MiB/s | Speedup |
|---|---|---|---|
| 1 | 59.42 | 12.18 | 1.0x |
| 2 | 30.88 | 23.44 | 1.92x |
| 3 | 21.56 | 33.57 | 2.76x |
| 4 | 16.71 | 43.32 | 3.56x |
| 5 | 13.70 | 52.84 | 4.34x |
| 6 | 11.72 | 61.74 | 5.07x |
| 7 | 10.32 | 70.15 | 5.76x |
| 8 | 9.29 | 77.88 | 6.39x |
| 9 | 8.44 | 85.71 | 7.03x |
| 10 | 7.79 | 92.97 | 7.63x |
| 11 | 7.29 | 99.33 | 8.16x |
| 12 | 6.82 | 106.15 | 8.71x |
| 13 | 6.48 | 111.70 | 9.17x |
| 14 | 6.18 | 117.19 | 9.62x |
| 15 | 5.96 | 121.49 | 9.97x |
| 16 | 6.06 | 119.35 | 9.80x |
| 17 | 6.39 | 113.32 | 9.30x |
| 18 | 6.23 | 116.18 | 9.54x |
| 19 | 6.04 | 119.88 | 9.84x |
| 20 | 5.91 | 122.56 | 10.06x |
| 21 | 5.74 | 126.03 | 10.35x |
| 22 | 5.59 | 129.48 | 10.63x |
| 23 | 5.46 | 132.68 | 10.89x |
| 24 | 5.38 | 134.58 | 11.05x |
| 25 | 5.27 | 137.41 | 11.28x |
| 26 | 5.15 | 140.48 | 11.53x |
| 27 | 5.07 | 142.78 | 11.72x |
| 28 | 4.94 | 146.44 | 12.02x |
| 29 | 4.87 | 148.71 | 12.21x |
| 30 | 4.86 | 148.88 | 12.22x |
| 31 | 5.05 | 143.27 | 11.76x |
| 32 | 4.92 | 147.17 | 12.08x |

Notes

The largest improvement is for iq4_nl, which was the main target of this change.

Other formats either already had relatively low compute cost or are dominated by simpler conversion paths, so their scaling is smaller. f32, f16, and bf16 are effectively unchanged, as expected.

The important practical result is that IQ4 KV-caching is no longer prohibitively slow for larger context sizes. This makes IQ4 much more usable for large prompt-cache and large-context workloads where packed KV size matters.

This PR is intentionally limited to the IQ4 multithreading backend change, making it reviewable independently from higher-level KV-cache policy changes.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes — for the benchmark/stability scripts (bench-kv-cache-quants-speed.py, a quantization-only speed test driven by a saved Frankenstein KV-cache file) and for thread-safety work

iq4_nl uses a non-linear lookup table search per block that makes its
single-threaded quantization throughput ~95x slower than other 4-bit
types (12 MB/s vs 1000+ MB/s on a 16-core CPU).  Since blocks are
fully independent, the work parallelises perfectly across rows.

ggml_quantize_chunk_mt splits [start, start + nrows * n_per_row) evenly
across n_threads workers, each calling ggml_quantize_chunk on its row
range.  Because ggml_quantize_chunk writes to dst at start_row*row_size,
threads target non-overlapping regions of the output buffer with no
locking required.

Benchmark (77 MiB f16 KV state, 32 segments, Ryzen 9 7950X 16c/32t):

  type     threads   MiB/s
  iq4_nl        1     12.3
  iq4_nl       16    131.6   (~10.7x)
  q8_0          1    463
  q4_0          1   1172

All other types are already fast enough that threading overhead is noise;
the fallback to single-threaded ggml_quantize_chunk fires automatically
when n_threads <= 1 or nrows <= 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
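
As a rough caller-side illustration of the intended call pattern (hypothetical, not code from this PR): pack_segment_iq4 is an invented helper, and quantize_chunk_mt_sketch refers to the sketch earlier in this description, standing in for the PR's ggml_quantize_chunk_mt, whose exact signature is not reproduced here.

```cpp
// Hypothetical caller: quantize one f32 KV segment (nrows x n_per_row values)
// to iq4_nl using the configured thread count.
#include <cstdint>
#include <vector>

#include "ggml.h"

// forward declaration of the sketch wrapper shown earlier in this description
size_t quantize_chunk_mt_sketch(enum ggml_type type, const float * src, void * dst,
                                int64_t nrows, int64_t n_per_row, int n_threads);

size_t pack_segment_iq4(const std::vector<float> & seg, std::vector<uint8_t> & out,
                        int64_t nrows, int64_t n_per_row, int n_threads) {
    // ggml_row_size gives the packed bytes per row for the target type
    out.resize(nrows * ggml_row_size(GGML_TYPE_IQ4_NL, n_per_row));
    return quantize_chunk_mt_sketch(GGML_TYPE_IQ4_NL, seg.data(), out.data(),
                                    nrows, n_per_row, n_threads);
}
```
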
@shikaku2 shikaku2 requested a review from ggerganov as a code owner May 6, 2026 04:01
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 6, 2026
@shikaku2 shikaku2 closed this by deleting the head repository May 10, 2026