ggml-cpu: Optimized risc-v cpu q1_0 dot #22768

Merged

xctan merged 1 commit into ggml-org:master from pl752:perf/q1_0_rvv_dot_fixvl on May 7, 2026

Conversation

@pl752 pl752 (Contributor) commented May 6, 2026

Hello, I have prepared an optimized implementation of the RISC-V V extension q1_0 dot product (mainly for Bonsai LLM models). This is a continuation of #21636 for the RISC-V platform and a squash of #31 together with its related discussion.

The implementation uses two kernels with fixed vl, one for VLEN 128 and one for VLEN 256+, dispatched at runtime via vlenb. VLEN 64 is omitted because the V extension requires VLEN >= 128, and a VLA variant is not worthwhile for a case this simple. Each kernel negates qy, does a masked merge using qx as the mask, reduces with vredsum, and accumulates with the scales in scalar code; a rough sketch of this pattern is shown below.
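For illustration only, here is a minimal sketch of that negate / masked-merge / vredsum pattern with standard RVV intrinsics. The block layout (a 32-element block with a packed sign bitmask `qs` and a float scale `d`), the sign encoding, and the fixed `vl` of 32 are simplifying assumptions for the example, not the exact layout or tiling of the merged kernel.

```c
#include <riscv_vector.h>
#include <stdint.h>

// Hypothetical, simplified q1_0-style block for illustration (not the real ggml layout).
typedef struct {
    uint8_t qs[4];  // 32 sign bits; assumed encoding: bit set -> +1, clear -> -1
    float   d;      // block scale (real code would use ggml's fp16 scale)
} block_q1_sketch;

// Runtime dispatch sketch: VLMAX for e8/m1 equals vlenb (VLEN/8).
static int use_vl256_kernel(void) {
    return __riscv_vsetvlmax_e8m1() >= 32;  // true when VLEN >= 256
}

// One block of the dot product: sum_i (+/-1) * qy[i], then scale.
static float q1_block_dot_sketch(const block_q1_sketch *bx, const int8_t *qy) {
    const size_t vl   = 32;                                   // fixed vl, assuming VLEN >= 256, LMUL=1
    vint8m1_t    vqy  = __riscv_vle8_v_i8m1(qy, vl);          // load 32 int8 activations
    vint8m1_t    vneg = __riscv_vneg_v_i8m1(vqy, vl);         // -qy
    vbool8_t     vsgn = __riscv_vlm_v_b8(bx->qs, vl);         // load the sign bits as a mask
    // where the mask bit is set take +qy, otherwise take -qy
    vint8m1_t    vsel = __riscv_vmerge_vvm_i8m1(vneg, vqy, vsgn, vl);
    // widening reduction into int16 so 32 int8 terms cannot overflow
    vint16m1_t   zero = __riscv_vmv_v_x_i16m1(0, 1);
    vint16m1_t   vsum = __riscv_vwredsum_vs_i8m1_i16m1(vsel, zero, vl);
    return bx->d * (float) __riscv_vmv_x_s_i16m1_i16(vsum);   // scalar accumulate with the scale
}
```

The VL128 path would follow the same structure with LMUL=2 types so that one vector group still covers a full 32-element block.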

Benchmarks for Bonsai-1.7B
| Flow   | pp 64 t/s | tg 16 t/s | Speedup (pp / tg) |
|--------|-----------|-----------|-------------------|
| Scalar | 1.19      | 0.94      | 1.0x / 1.0x       |
| VL128* | 10.14     | 7.50      | 8.5x / 8.0x       |
| VL256  | 13.36     | 9.71      | 11.2x / 10.3x     |

  • * forced VLEN 128 kernel with LMUL=2; for VLEN >= 256, LMUL=1
Perplexity for Bonsai 1.7B (run with both vl256 and vl128), over 5x512 chunks of wiki.test.raw; the baseline is a CPU run of the unpacked fp16 model:
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : RISCV_V = 1 | RVV_VLEN = 32 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 5 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 175.57 seconds per pass - ETA 3.65 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9693 ±    3.1849       0.00088 ±    0.00209       0.00016 ±    0.00001     0.301 ±  0.029 %    100.000 ±  0.000 %
   2      20.1979 ±    3.4374       0.01429 ±    0.01150       0.00020 ±    0.00001     0.341 ±  0.029 %    99.804 ±  0.196 %
   3      20.8755 ±    2.7923       0.01027 ±    0.00769       0.00022 ±    0.00001     0.347 ±  0.023 %    99.216 ±  0.319 %
   4      21.2193 ±    2.3911       0.00730 ±    0.00578       0.00022 ±    0.00001     0.349 ±  0.019 %    99.314 ±  0.259 %
   5      21.1014 ±    2.1049       0.00633 ±    0.00465       0.00022 ±    0.00001     0.347 ±  0.017 %    99.373 ±  0.221 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.101447 ±   2.104940
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.006326 ±   0.004654
Mean PPL(Q)/PPL(base)         :   1.006346 ±   0.004683
Mean PPL(Q)-PPL(base)         :   0.133060 ±   0.101021

====== KL divergence statistics ======
Mean    KLD:   0.000221 ±   0.000009
Maximum KLD:   0.004212
99.9%   KLD:   0.004057
99.0%   KLD:   0.001409
95.0%   KLD:   0.000691
90.0%   KLD:   0.000495
Median  KLD:   0.000137
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000010
 0.1%   KLD:  -0.000033
Minimum KLD:  -0.000069

====== Token probability statistics ======
Mean    Δp: -0.007 ± 0.010 %
Maximum Δp:  2.484%
99.9%   Δp:  2.201%
99.0%   Δp:  1.143%
95.0%   Δp:  0.463%
90.0%   Δp:  0.256%
75.0%   Δp:  0.043%
Median  Δp: -0.000%
25.0%   Δp: -0.056%
10.0%   Δp: -0.333%
 5.0%   Δp: -0.533%
 1.0%   Δp: -1.088%
 0.1%   Δp: -1.614%
Minimum Δp: -1.647%
RMS Δp    :  0.347 ± 0.017 %
Same top p: 99.373 ± 0.221 %

Benchmarks were performed with:

  • Orange Pi RV2 SBC (Ky X1 / SpacemiT K1), 8 GB RAM
  • Armbian (Debian trixie, rolling release) with the 6.18.26-current-spacemit kernel
  • Built with the official SpacemiT toolchain, but IME wasn't used
  • Command: llama-bench -m Bonsai-1.7B.gguf -p 64 -n 16 -t 8 -r 3 -fa 1 -mmp 0

Other related people

Requesting review from people who usually review this kind of change: @am17an, @CISC, @taimur-10x.


  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for porting AVX2 code from my previous PR and for testing automation; the code was then manually reviewed and adjusted for better performance. AI was not used for writing this text.

@pl752 pl752 requested a review from ggerganov as a code owner May 6, 2026 15:49
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 6, 2026
@CISC CISC requested review from a team May 6, 2026 16:17
@CISC CISC (Member) left a comment

So, is this the favoured implementation?

@xctan xctan (Collaborator) commented May 7, 2026

K1 benchmarks show a speed regression with LMUL=2 vs LMUL=1. Compared to #22500, I find this implementation to be more refined, so I'm leaning towards this one.

@xctan xctan merged commit 68380ae into ggml-org:master May 7, 2026
49 checks passed
@pl752 pl752 (Contributor, Author) commented May 7, 2026

@xctan It's odd that there is a regression. Do you mean that the vl128 branch is slower than the code from #22500? If so, the only major difference is how v_q8_neg/neg_qy is calculated: mine uses vneg(qy), the other uses vsub(0, qy). I should have tried that too.
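(For reference, a sketch of the two forms being compared, with placeholder names; neither snippet is taken from either PR.)

```c
#include <riscv_vector.h>

// Sketch only: two near-equivalent ways to compute -qy (vqy/vl are placeholders).
static inline vint8m1_t neg_qy_vneg(vint8m1_t vqy, size_t vl) {
    return __riscv_vneg_v_i8m1(vqy, vl);       // vneg.v, an alias for vrsub.vx vd, vs, x0
}

static inline vint8m1_t neg_qy_vrsub(vint8m1_t vqy, size_t vl) {
    return __riscv_vrsub_vx_i8m1(vqy, 0, vl);  // explicit 0 - qy via reverse subtract
}
```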

@xctan xctan (Collaborator) commented May 7, 2026

@pl752

> that there is regression

Updated my comment. Sorry for the confusion.


Labels

ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

4 participants