
ARM64: Use fmov instead of umov for 64-bit integer ToScalar #126803

Open
Copilot wants to merge 2 commits into main from copilot/change-codegen-arm64-for-scalar

Conversation

Contributor

Copilot AI commented Apr 12, 2026

  • Understand current ToScalar codegen on ARM64
  • Modify hwintrinsiccodegenarm64.cpp to emit fmov instead of umov for 64-bit integer types (int64/uint64)
  • Build and verify the change compiles (ARM64 JIT cross-build: 0 warnings, 0 errors)
  • Run validation (code review + CodeQL passed)
  • Self code-review completed

For .AsUInt64().ToScalar() and .AsInt64().ToScalar(), emit fmov Xd, Dn
instead of umov Xd, Vn.D[0]. Both are architecturally equivalent but
fmov is a simpler instruction.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/b814b59f-cdd9-4b8e-a7f1-becf2a043382

Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
@EgorBo EgorBo marked this pull request as ready for review April 12, 2026 01:34
Copilot AI review requested due to automatic review settings April 12, 2026 01:34
Contributor

Copilot AI left a comment


Pull request overview

Updates ARM64 JIT codegen for Vector64/Vector128.ToScalar when the SIMD base type is 64-bit integer (long/ulong) to emit fmov (vector↔general) rather than umov (vector element extract), keeping existing behavior for other element types.

Changes:

  • Special-case 64-bit integer ToScalar to emit INS_fmov via emitIns_Mov.
  • Preserve existing umov-based emission for non-64-bit-integer base types.

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 12, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Apr 12, 2026

@copilot use the benchmark skill to write a benchmark that benchmarks ToScalar for long and ulong: a loop loads data into two vectors at a time and calls ToScalar on both (in parallel). Run it against -arm -aws_arm -azure_arm.


@EgorBo
Member

EgorBo commented Apr 12, 2026

@copilot, benchmark results are ready:

Please analyze the results and act accordingly.
NOTE: some benchmarks may be flaky or bi-modal, so use your judgment when interpreting small differences.

Contributor Author

Copilot AI commented Apr 12, 2026

> @copilot, benchmark results are ready:

Benchmark results show the change is performance-neutral across all three ARM64 targets — no regressions detected:

| Target | ToScalar_Long | ToScalar_ULong |
|---|---|---|
| Apple M4 | 28.07 vs 27.98 ns (1.00x) | 28.06 vs 28.34 ns (1.00x) |
| AWS Graviton4 | 56.06 vs 55.87 ns (1.00x) | 55.94 vs 55.95 ns (1.00x) |
| Azure Cobalt100 | 56.51 vs 57.57 ns (~2% faster) | 56.97 vs 56.90 ns (1.00x) |

All differences are within noise. Both fmov and umov have similar throughput on modern ARM64 cores — the benefit of this change is using the architecturally preferred instruction for 64-bit vector-to-GPR transfers.

@EgorBo
Member

EgorBo commented Apr 12, 2026

@EgorBot -arm -aws_arm -azure_arm --envvars DOTNET_JitDisasm:ToScalar_Long

using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private byte[] _data = new byte[128];
    private byte _needle = 0;

    [Benchmark]
    public long ToScalar_Long()
    {
        long sum = 0;
        var data = _data;
        Vector128<byte> needleVec = Vector128.Create(_needle);
        for (int i = 0; i < data.Length - 15; i += 16)
        {
            var v1 = Vector128.LoadUnsafe(ref data[i]);
            if (v1 == needleVec)
                sum += v1.AsInt64().ToScalar();
        }
        return sum;
    }
}

@EgorBo
Member

EgorBo commented Apr 12, 2026

@tannergooding opinions? I was not able to come up with a benchmark or find latency/throughput numbers for modern cloud hardware like Cobalt 100 and Graviton4. But the Apple Optimization Guide explicitly recommends fmov over umov/smov when no conversion is needed, though I suspect it was also effectively fixed in their recent hardware, such as the Apple M4 I ran on. Or my benchmark is just wrong and I didn't introduce some dependency chain it could avoid...

Given that the Apple manual recommends this and native compilers prefer fmov regardless of target, we probably should do the same?

Member

@tannergooding tannergooding left a comment


LGTM

The Apple M4 Optimization Manual v4.0 specifically covers this:

> Recommendation: "Use FMOV instruction when moving data from vector registers to GPRs when no conversion is needed."
>
> [Magnitude: Low | Applicability: Medium] The FMOV instruction does not perform any conversion when moving data from the vector registers to the general purpose registers, whereas UMOV and SMOV do. Use the lower bandwidth and lower latency FMOV wherever possible.

What isn't clear is if this is meant to only apply to 8/16-bit values or if it is also meant to apply to 32/64-bit values (where zero-conversion uses the preferred disassembly of mov instead of umov). However, given that there is no callout on that and that fmov is explicitly documented to be lower cost this seems like a positive change.

We likely want to also handle 32-bit -> 32-bit in the same way (barring the case where it's immediately sign extended and so should just be a 32->64-bit smov).

@tannergooding
Member

tannergooding commented Apr 12, 2026

The fix for the blittable structure failures was merged, so you might get a clean CI run if you update the branch and rerun. I don't think it's strictly needed, though.
