
ggml : add ggml_quantize_chunk_mt for parallel row quantization #22743

Closed

shikaku2 wants to merge 1 commit into ggml-org:master from shikaku2:iq4-quantize-mt

Conversation

@shikaku2 shikaku2 commented May 6, 2026

kv-cache: use -t threads for IQ4 packing

Summary

This PR adds ggml_quantize_chunk_mt, a parallel wrapper around ggml_quantize_chunk that splits rows across worker threads, so that IQ4 KV-cache quantization uses the configured llama.cpp thread count.
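
For readers, here is a minimal sketch of how such a row-splitting wrapper can be structured. This is not the PR's actual code: quantize_chunk_mt_sketch is a placeholder name, the imatrix argument is passed as nullptr for brevity, and only the ggml_quantize_chunk signature from ggml.h is assumed.

```cpp
// Sketch only: divide nrows across n_threads workers; each worker calls the
// existing single-threaded ggml_quantize_chunk on its own contiguous row range.
#include <thread>
#include <vector>

#include "ggml.h"

size_t quantize_chunk_mt_sketch(enum ggml_type type, const float * src, void * dst,
                                int64_t nrows, int64_t n_per_row, int n_threads) {
    if (n_threads <= 1 || nrows <= 1) {
        // threading would not help here; fall back to the single-threaded path
        return ggml_quantize_chunk(type, src, dst, 0, nrows, n_per_row, nullptr);
    }

    std::vector<size_t>      sizes(n_threads, 0);
    std::vector<std::thread> workers;

    for (int t = 0; t < n_threads; ++t) {
        const int64_t row0 = nrows *  t      / n_threads;
        const int64_t row1 = nrows * (t + 1) / n_threads;
        if (row0 == row1) {
            continue;
        }
        workers.emplace_back([=, &sizes]() {
            // ggml_quantize_chunk offsets src by `start` elements and dst by
            // (start / n_per_row) * row_size internally, so every worker writes
            // a disjoint region of dst and no locking is required.
            sizes[t] = ggml_quantize_chunk(type, src, dst,
                                           /*start =*/ row0 * n_per_row,
                                           row1 - row0, n_per_row, nullptr);
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    size_t total = 0;
    for (size_t s : sizes) {
        total += s;
    }
    return total; // total bytes written across all row ranges
}
```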

Previously, iq4_nl KV-cache packing was effectively single-threaded and was much slower than the other supported KV cache formats. This made IQ4 impractical for large prompt-cache / large-context workloads, even though the packed size and RMSE characteristics are useful.

Motivation

IQ4 is attractive for KV-cache storage because it gives the same packed size as q4_0 in this test while achieving lower K- and V-cache RMSE:

| Type | Packed MiB | RMSE K | RMSE V |
|---|---|---|---|
| q4_0 | 203.572 | 0.293315 | 0.0301958 |
| iq4_nl | 203.572 | 0.237420 | 0.0257450 |

However, before this change, iq4_nl packing was dramatically slower than the other formats:

| Type | Threads | Seconds | Input MiB/s |
|---|---|---|---|
| iq4_nl | 1 | 59.5679 | 12.151 |
| iq4_nl | 16 | 6.1020 | 118.619 |

That is roughly a 9.76x speedup for iq4_nl on this test file.

Benchmark

Benchmark command:

python bench-kv-cache-quants-speed.py --rebuild --threads 1 ./kv-zstd/dicttest/kv-frankenstein100000.bin --rmse
python bench-kv-cache-quants-speed.py --rebuild --threads 16 ./kv-zstd/dicttest/kv-frankenstein100000.bin --rmse

Test input:

kv-frankenstein100000.bin
KV cache generated from the first 100000 characters of Frankenstein
32 KV segments
723.812 MiB input

Results:

| Type | Threads | Input MiB | Packed MiB | Seconds | Input MiB/s | RMSE K | RMSE V |
|---|---|---|---|---|---|---|---|
| f32 | 1 | 723.812 | 1447.620 | 1.11001 | 652.075 | 0 | 0 |
| f32 | 16 | 723.812 | 1447.620 | 1.10453 | 655.313 | 0 | 0 |
| f16 | 1 | 723.812 | 723.812 | 1.74912 | 413.815 | 0 | 0 |
| f16 | 16 | 723.812 | 723.812 | 1.74621 | 414.504 | 0 | 0 |
| bf16 | 1 | 723.812 | 723.812 | 1.04079 | 695.446 | 0.00427049 | 0.000530655 |
| bf16 | 16 | 723.812 | 723.812 | 1.05612 | 685.349 | 0.00427049 | 0.000530655 |
| q8_0 | 1 | 723.812 | 384.525 | 2.07237 | 349.268 | 0.0184124 | 0.00188816 |
| q8_0 | 16 | 723.812 | 384.525 | 1.09429 | 661.445 | 0.0184124 | 0.00188816 |
| q4_0 | 1 | 723.812 | 203.572 | 1.15734 | 625.410 | 0.293315 | 0.0301958 |
| q4_0 | 16 | 723.812 | 203.572 | 1.00654 | 719.109 | 0.293315 | 0.0301958 |
| q4_1 | 1 | 723.812 | 226.191 | 1.10527 | 654.874 | 0.235459 | 0.0261354 |
| q4_1 | 16 | 723.812 | 226.191 | 0.998586 | 724.838 | 0.235459 | 0.0261354 |
| iq4_nl | 1 | 723.812 | 203.572 | 59.5679 | 12.151 | 0.237420 | 0.0257450 |
| iq4_nl | 16 | 723.812 | 203.572 | 6.10200 | 118.619 | 0.237420 | 0.0257450 |
| q5_0 | 1 | 723.812 | 248.811 | 1.57535 | 459.462 | 0.146387 | 0.0150381 |
| q5_0 | 16 | 723.812 | 248.811 | 1.18220 | 612.259 | 0.146387 | 0.0150381 |
| q5_1 | 1 | 723.812 | 271.430 | 1.50332 | 481.476 | 0.113917 | 0.0126468 |
| q5_1 | 16 | 723.812 | 271.430 | 1.18375 | 611.458 | 0.113917 | 0.0126468 |

Thread scaling on a Ryzen 9 5950X (16 physical cores / 32 SMT threads) is near-linear up to the physical core count, and SMT continues to improve throughput to roughly 12x at 32 threads:

| Threads | Seconds | MiB/s | Speedup |
|---|---|---|---|
| 1 | 59.42 | 12.18 | 1.0x |
| 2 | 30.88 | 23.44 | 1.92x |
| 3 | 21.56 | 33.57 | 2.76x |
| 4 | 16.71 | 43.32 | 3.56x |
| 5 | 13.70 | 52.84 | 4.34x |
| 6 | 11.72 | 61.74 | 5.07x |
| 7 | 10.32 | 70.15 | 5.76x |
| 8 | 9.29 | 77.88 | 6.39x |
| 9 | 8.44 | 85.71 | 7.03x |
| 10 | 7.79 | 92.97 | 7.63x |
| 11 | 7.29 | 99.33 | 8.16x |
| 12 | 6.82 | 106.15 | 8.71x |
| 13 | 6.48 | 111.70 | 9.17x |
| 14 | 6.18 | 117.19 | 9.62x |
| 15 | 5.96 | 121.49 | 9.97x |
| 16 | 6.06 | 119.35 | 9.80x |
| 17 | 6.39 | 113.32 | 9.30x |
| 18 | 6.23 | 116.18 | 9.54x |
| 19 | 6.04 | 119.88 | 9.84x |
| 20 | 5.91 | 122.56 | 10.06x |
| 21 | 5.74 | 126.03 | 10.35x |
| 22 | 5.59 | 129.48 | 10.63x |
| 23 | 5.46 | 132.68 | 10.89x |
| 24 | 5.38 | 134.58 | 11.05x |
| 25 | 5.27 | 137.41 | 11.28x |
| 26 | 5.15 | 140.48 | 11.53x |
| 27 | 5.07 | 142.78 | 11.72x |
| 28 | 4.94 | 146.44 | 12.02x |
| 29 | 4.87 | 148.71 | 12.21x |
| 30 | 4.86 | 148.88 | 12.22x |
| 31 | 5.05 | 143.27 | 11.76x |
| 32 | 4.92 | 147.17 | 12.08x |

Notes

The largest improvement is for iq4_nl, which was the main target of this change.

Other formats either already had relatively low compute cost or are dominated by simpler conversion paths, so their scaling is smaller. f32, f16, and bf16 are effectively unchanged, as expected.

The important practical result is that IQ4 KV-caching is no longer prohibitively slow for larger context sizes. This makes IQ4 much more usable for large prompt-cache and large-context workloads where packed KV size matters.

This PR is intentionally limited to the IQ4 multithreading backend change, making it reviewable independently from higher-level KV-cache policy changes.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes — for the benchmark/stability scripts (bench-kv-cache-quants-speed.py, a quantization-only speed test driven by a saved Frankenstein KV-cache file) and for thread-safety work

iq4_nl uses a non-linear lookup table search per block that makes its
single-threaded quantization throughput ~95x slower than other 4-bit
types (12 MB/s vs 1000+ MB/s on a 16-core CPU).  Since blocks are
fully independent, the work parallelises perfectly across rows.

ggml_quantize_chunk_mt splits [start, start + nrows * n_per_row) evenly
across n_threads workers, each calling ggml_quantize_chunk on its row
range.  Because ggml_quantize_chunk writes to dst at start_row*row_size,
threads target non-overlapping regions of the output buffer with no
locking required.

Benchmark (77 MiB f16 KV state, 32 segments, Ryzen 9 7950X 16c/32t):

  type     threads   MiB/s
  iq4_nl        1     12.3
  iq4_nl       16    131.6   (~10.7x)
  q8_0          1    463
  q4_0          1   1172

All other types are already fast enough that threading overhead is noise;
the fallback to single-threaded ggml_quantize_chunk fires automatically
when n_threads <= 1 or nrows <= 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
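
As a rough caller-side illustration of the intended call pattern (hypothetical, not code from this PR): pack_segment_iq4 is an invented helper, and quantize_chunk_mt_sketch refers to the sketch earlier in this description, standing in for the PR's ggml_quantize_chunk_mt, whose exact signature is not reproduced here.

```cpp
// Hypothetical caller: quantize one f32 KV segment (nrows x n_per_row values)
// to iq4_nl using the configured thread count.
#include <cstdint>
#include <vector>

#include "ggml.h"

// forward declaration of the sketch wrapper shown earlier in this description
size_t quantize_chunk_mt_sketch(enum ggml_type type, const float * src, void * dst,
                                int64_t nrows, int64_t n_per_row, int n_threads);

size_t pack_segment_iq4(const std::vector<float> & seg, std::vector<uint8_t> & out,
                        int64_t nrows, int64_t n_per_row, int n_threads) {
    // ggml_row_size gives the packed bytes per row for the target type
    out.resize(nrows * ggml_row_size(GGML_TYPE_IQ4_NL, n_per_row));
    return quantize_chunk_mt_sketch(GGML_TYPE_IQ4_NL, seg.data(), out.data(),
                                    nrows, n_per_row, n_threads);
}
```
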
@shikaku2 shikaku2 requested a review from ggerganov as a code owner May 6, 2026 04:01
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 6, 2026
@shikaku2 shikaku2 closed this by deleting the head repository May 10, 2026