
ARM64: Use fmov instead of umov for 64-bit integer ToScalar #126803

Open
Copilot wants to merge 2 commits into main from copilot/change-codegen-arm64-for-scalar

Conversation

Contributor

Copilot AI commented Apr 12, 2026

  • Understand current ToScalar codegen on ARM64
  • Modify hwintrinsiccodegenarm64.cpp to emit fmov instead of umov for 64-bit integer types (int64/uint64)
  • Build and verify the change compiles (ARM64 JIT cross-build: 0 warnings, 0 errors)
  • Run validation (code review + CodeQL passed)
  • Self code-review completed

For .AsUInt64().ToScalar() and .AsInt64().ToScalar(), emit fmov Xd, Dn
instead of umov Xd, Vn.D[0]. Both are architecturally equivalent but
fmov is a simpler instruction.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/b814b59f-cdd9-4b8e-a7f1-becf2a043382

Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
@EgorBo EgorBo marked this pull request as ready for review April 12, 2026 01:34
Copilot AI review requested due to automatic review settings April 12, 2026 01:34
Contributor

Copilot AI left a comment


Pull request overview

Updates ARM64 JIT codegen for Vector64/Vector128.ToScalar when the SIMD base type is 64-bit integer (long/ulong) to emit fmov (vector↔general) rather than umov (vector element extract), keeping existing behavior for other element types.

Changes:

  • Special-case 64-bit integer ToScalar to emit INS_fmov via emitIns_Mov.
  • Preserve existing umov-based emission for non-64-bit-integer base types.

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 12, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Apr 12, 2026

@copilot use the benchmark skill to write a benchmark that benchmarks ToScalar for long and ulong: a loop loads data into two vectors at a time and calls ToScalar on both (in parallel). Run it against -arm -aws_arm -azure_arm.


@EgorBo
Member

EgorBo commented Apr 12, 2026

@copilot, benchmark results are ready:

Please analyze the results and act accordingly.
NOTE: some benchmarks may be flaky or bi-modal, so use your judgment when interpreting small differences.

Contributor Author

Copilot AI commented Apr 12, 2026

> @copilot, benchmark results are ready:

Benchmark results show the change is performance-neutral across all three ARM64 targets — no regressions detected:

| Target | ToScalar_Long | ToScalar_ULong |
|---|---|---|
| Apple M4 | 28.07 vs 27.98 ns (1.00x) | 28.06 vs 28.34 ns (1.00x) |
| AWS Graviton4 | 56.06 vs 55.87 ns (1.00x) | 55.94 vs 55.95 ns (1.00x) |
| Azure Cobalt100 | 56.51 vs 57.57 ns (~2% faster) | 56.97 vs 56.90 ns (1.00x) |

All differences are within noise. Both fmov and umov have similar throughput on modern ARM64 cores — the benefit of this change is using the architecturally preferred instruction for 64-bit vector-to-GPR transfers.

@EgorBo
Member

EgorBo commented Apr 12, 2026

@EgorBot -arm -aws_arm -azure_arm --envvars DOTNET_JitDisasm:ToScalar_Long

using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private byte[] _data = new byte[128];
    private byte _needle = 0;

    [Benchmark]
    public long ToScalar_Long()
    {
        long sum = 0;
        var data = _data;
        Vector128<byte> needleVec = Vector128.Create(_needle);
        for (int i = 0; i < data.Length - 15; i += 16)
        {
            var v1 = Vector128.LoadUnsafe(ref data[i]);
            if (v1 == needleVec)
                sum += v1.AsInt64().ToScalar();
        }
        return sum;
    }
}

@EgorBo
Member

EgorBo commented Apr 12, 2026

@tannergooding opinions? I was not able to come up with a benchmark or find latency/throughput numbers for modern cloud hardware like Cobalt 100 and Graviton4. But the Apple Optimization Guide explicitly recommends fmov over umov/smov when no conversion is needed, though I suspect it was also effectively fixed in their recent hardware, such as the Apple M4 I ran on. Or my benchmark is just wrong and I didn't introduce some dependency chain it could avoid...

Given that the Apple manual recommends this and native compilers prefer fmov regardless of target, we probably should do the same?

Member

@tannergooding tannergooding left a comment


LGTM

The Apple M4 Optimization Manual v4.0 specifically covers this:

> Recommendation: "Use FMOV instruction when moving data from vector registers to GPRs when no conversion is needed."
>
> [Magnitude: Low | Applicability: Medium] The FMOV instruction does not perform any conversion when moving data from the vector registers to the general purpose registers, whereas UMOV and SMOV do. Use the lower bandwidth and lower latency FMOV wherever possible.

What isn't clear is if this is meant to only apply to 8/16-bit values or if it is also meant to apply to 32/64-bit values (where zero-conversion uses the preferred disassembly of mov instead of umov). However, given that there is no callout on that and that fmov is explicitly documented to be lower cost this seems like a positive change.

We likely want to also handle 32-bit -> 32-bit in the same way (barring the case where it's immediately sign extended and so should just be a 32->64-bit smov).

@tannergooding
Member

tannergooding commented Apr 12, 2026

The fix for the blittable structure failures was merged, so you might get a clean CI run if you update the branch and rerun. I don't think it's strictly needed, though.
