ggml-cpu: Optimized risc-v cpu q1_0 dot #22768

Merged

xctan merged 1 commit into ggml-org:master from pl752:perf/q1_0_rvv_dot_fixvl on May 7, 2026

Conversation

@pl752 pl752 (Contributor) commented May 6, 2026

Hello, I have prepared an optimized implementation of the RISC-V V extension q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of #21636 for the RISC-V platform and a squash of #31 together with its related discussion.

The implementation uses two kernels with fixed vl, one for VLEN 128 and one for VLEN 256+, dispatched at runtime via vlenb. VLEN 64 is omitted because the V extension requires VLEN >= 128, and a VLA variant is not worthwhile for a case this simple. Each kernel negates qy, does a masked merge using qx as the mask, reduces with vredsum, and accumulates with the scales in scalar code; a rough sketch of this pattern is shown below.
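For illustration only, here is a minimal sketch of that negate / masked-merge / vredsum pattern with standard RVV intrinsics. The block layout (a 32-element block with a packed sign bitmask `qs` and a float scale `d`), the sign encoding, and the fixed `vl` of 32 are simplifying assumptions for the example, not the exact layout or tiling of the merged kernel.

```c
#include <riscv_vector.h>
#include <stdint.h>

// Hypothetical, simplified q1_0-style block for illustration (not the real ggml layout).
typedef struct {
    uint8_t qs[4];  // 32 sign bits; assumed encoding: bit set -> +1, clear -> -1
    float   d;      // block scale (real code would use ggml's fp16 scale)
} block_q1_sketch;

// Runtime dispatch sketch: VLMAX for e8/m1 equals vlenb (VLEN/8).
static int use_vl256_kernel(void) {
    return __riscv_vsetvlmax_e8m1() >= 32;  // true when VLEN >= 256
}

// One block of the dot product: sum_i (+/-1) * qy[i], then scale.
static float q1_block_dot_sketch(const block_q1_sketch *bx, const int8_t *qy) {
    const size_t vl   = 32;                                   // fixed vl, assuming VLEN >= 256, LMUL=1
    vint8m1_t    vqy  = __riscv_vle8_v_i8m1(qy, vl);          // load 32 int8 activations
    vint8m1_t    vneg = __riscv_vneg_v_i8m1(vqy, vl);         // -qy
    vbool8_t     vsgn = __riscv_vlm_v_b8(bx->qs, vl);         // load the sign bits as a mask
    // where the mask bit is set take +qy, otherwise take -qy
    vint8m1_t    vsel = __riscv_vmerge_vvm_i8m1(vneg, vqy, vsgn, vl);
    // widening reduction into int16 so 32 int8 terms cannot overflow
    vint16m1_t   zero = __riscv_vmv_v_x_i16m1(0, 1);
    vint16m1_t   vsum = __riscv_vwredsum_vs_i8m1_i16m1(vsel, zero, vl);
    return bx->d * (float) __riscv_vmv_x_s_i16m1_i16(vsum);   // scalar accumulate with the scale
}
```

The VL128 path would follow the same structure with LMUL=2 types so that one vector group still covers a full 32-element block.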

Benchmarks for Bonsai-1.7B
| Flow   | pp 64 t/s | tg 16 t/s | Speedup (pp / tg) |
|--------|-----------|-----------|-------------------|
| Scalar | 1.19      | 0.94      | 1.0x / 1.0x       |
| VL128* | 10.14     | 7.50      | 8.5x / 8.0x       |
| VL256  | 13.36     | 9.71      | 11.2x / 10.3x     |

  • * forced VLEN 128 kernel with LMUL=2; for VLEN >= 256, LMUL=1
Perplexity for Bonsai 1.7B (run with both vl256 and vl128), over 5x512 chunks of wiki.test.raw; the baseline is a CPU run of the unpacked fp16 model:
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : RISCV_V = 1 | RVV_VLEN = 32 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 5 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 175.57 seconds per pass - ETA 3.65 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9693 ±    3.1849       0.00088 ±    0.00209       0.00016 ±    0.00001     0.301 ±  0.029 %    100.000 ±  0.000 %
   2      20.1979 ±    3.4374       0.01429 ±    0.01150       0.00020 ±    0.00001     0.341 ±  0.029 %    99.804 ±  0.196 %
   3      20.8755 ±    2.7923       0.01027 ±    0.00769       0.00022 ±    0.00001     0.347 ±  0.023 %    99.216 ±  0.319 %
   4      21.2193 ±    2.3911       0.00730 ±    0.00578       0.00022 ±    0.00001     0.349 ±  0.019 %    99.314 ±  0.259 %
   5      21.1014 ±    2.1049       0.00633 ±    0.00465       0.00022 ±    0.00001     0.347 ±  0.017 %    99.373 ±  0.221 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.101447 ±   2.104940
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.006326 ±   0.004654
Mean PPL(Q)/PPL(base)         :   1.006346 ±   0.004683
Mean PPL(Q)-PPL(base)         :   0.133060 ±   0.101021

====== KL divergence statistics ======
Mean    KLD:   0.000221 ±   0.000009
Maximum KLD:   0.004212
99.9%   KLD:   0.004057
99.0%   KLD:   0.001409
95.0%   KLD:   0.000691
90.0%   KLD:   0.000495
Median  KLD:   0.000137
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000010
 0.1%   KLD:  -0.000033
Minimum KLD:  -0.000069

====== Token probability statistics ======
Mean    Δp: -0.007 ± 0.010 %
Maximum Δp:  2.484%
99.9%   Δp:  2.201%
99.0%   Δp:  1.143%
95.0%   Δp:  0.463%
90.0%   Δp:  0.256%
75.0%   Δp:  0.043%
Median  Δp: -0.000%
25.0%   Δp: -0.056%
10.0%   Δp: -0.333%
 5.0%   Δp: -0.533%
 1.0%   Δp: -1.088%
 0.1%   Δp: -1.614%
Minimum Δp: -1.647%
RMS Δp    :  0.347 ± 0.017 %
Same top p: 99.373 ± 0.221 %

Benchmarks were performed with:

  • Orange Pi RV2 SBC (Ky X1 / SpacemiT K1), 8 GB RAM
  • Armbian (Debian trixie, rolling release) with the 6.18.26-current-spacemit kernel
  • Built with the official SpacemiT toolchain, but IME wasn't used
  • Command: llama-bench -m Bonsai-1.7B.gguf -p 64 -n 16 -t 8 -r 3 -fa 1 -mmp 0

Other related people

Requesting review from people who usually review this kind of change: @am17an, @CISC, @taimur-10x.


  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for porting AVX2 code from my previous PR and for testing automation; the code was then manually reviewed and adjusted for better performance. AI was not used for writing this text.

@pl752 pl752 requested a review from ggerganov as a code owner May 6, 2026 15:49
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 6, 2026
@CISC CISC requested review from a team May 6, 2026 16:17
@CISC CISC (Member) left a comment

So, is this the favoured implementation?

@xctan xctan (Collaborator) commented May 7, 2026

K1 benchmarks show a speed regression with LMUL=2 vs LMUL=1. Compared to #22500, I find this implementation to be more refined, so I'm leaning towards this one.

@xctan xctan merged commit 68380ae into ggml-org:master May 7, 2026
49 checks passed
@pl752 pl752 (Contributor, Author) commented May 7, 2026

@xctan It's odd that there is a regression. Do you mean that the vl128 branch is slower than the code from #22500? If so, the only major difference is how v_q8_neg/neg_qy is calculated: mine uses vneg(qy), the other uses vsub(0, qy). I should have tried that too.
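(For reference, a sketch of the two forms being compared, with placeholder names; neither snippet is taken from either PR.)

```c
#include <riscv_vector.h>

// Sketch only: two near-equivalent ways to compute -qy (vqy/vl are placeholders).
static inline vint8m1_t neg_qy_vneg(vint8m1_t vqy, size_t vl) {
    return __riscv_vneg_v_i8m1(vqy, vl);       // vneg.v, an alias for vrsub.vx vd, vs, x0
}

static inline vint8m1_t neg_qy_vrsub(vint8m1_t vqy, size_t vl) {
    return __riscv_vrsub_vx_i8m1(vqy, 0, vl);  // explicit 0 - qy via reverse subtract
}
```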

@xctan xctan (Collaborator) commented May 7, 2026

@pl752

> that there is regression

Updated my comment. Sorry for the confusion.


Labels

ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

4 participants