ARM64: Use fmov instead of umov for 64-bit integer ToScalar #126803
Conversation
For .AsUInt64().ToScalar() and .AsInt64().ToScalar(), emit fmov Xd, Dn instead of umov Xd, Vn.D[0]. Both are architecturally equivalent, but fmov is a simpler instruction.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/b814b59f-cdd9-4b8e-a7f1-becf2a043382
Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
Pull request overview
Updates ARM64 JIT codegen for Vector64/Vector128.ToScalar when the SIMD base type is 64-bit integer (long/ulong) to emit fmov (vector↔general) rather than umov (vector element extract), keeping existing behavior for other element types.
Changes:
- Special-case 64-bit integer ToScalar to emit INS_fmov via emitIns_Mov.
- Preserve existing umov-based emission for non-64-bit-integer base types.
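As an illustrative sketch (not part of the PR diff; the class name and values are hypothetical), the managed pattern this change affects looks like:

```csharp
using System;
using System.Runtime.Intrinsics;

class Demo
{
    static void Main()
    {
        Vector128<long> v = Vector128.Create(42L, 7L);

        // On ARM64, extracting the first 64-bit integer lane previously
        // emitted umov Xd, Vn.D[0]; with this change it emits fmov Xd, Dn.
        long first = v.ToScalar();
        ulong ufirst = v.AsUInt64().ToScalar();

        Console.WriteLine($"{first} {ufirst}");
    }
}
```

Element types other than long/ulong keep the existing umov-based path.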
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@copilot use the benchmark skill to write a benchmark that benchmarks ToScalar for long and ulong: a loop loads data into two vectors at a time and calls ToScalar on both (in parallel). Run it against -arm -aws_arm -azure_arm.
@copilot, benchmark results are ready. Please analyze the results and act accordingly.
Benchmark results show the change is performance-neutral across all three ARM64 targets; no regressions were detected. All differences are within noise.
@EgorBot -arm -aws_arm -azure_arm --envvars DOTNET_JitDisasm:ToScalar_Long

using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
    private byte[] _data = new byte[128];
    private byte _needle = 0;

    [Benchmark]
    public long ToScalar_Long()
    {
        long sum = 0;
        var data = _data;
        Vector128<byte> needleVec = Vector128.Create(_needle);
        for (int i = 0; i < data.Length - 15; i += 16)
        {
            var v1 = Vector128.LoadUnsafe(ref data[i]);
            if (v1 == needleVec)
                sum += v1.AsInt64().ToScalar();
        }
        return sum;
    }
}
@tannergooding opinions? I was not able to come up with a benchmark, or to find latency/throughput numbers for modern cloud hardware like Cobalt 100 and Graviton4. The Apple optimization guide explicitly recommends fmov over umov/smov when no conversion is needed, but I suspect the cost was also largely fixed in recent hardware such as the Apple M4 I ran on. Or my benchmark is just wrong and didn't introduce a dependency chain the CPU could otherwise avoid. Given that the Apple manual recommends this and native compilers prefer fmov regardless of target, we probably should do the same?
tannergooding
left a comment
LGTM
The Apple M4 Optimization Manual v4.0 specifically covers this recommendation:

"Use FMOV instruction when moving data from vector registers to GPRs when no conversion is needed. [Magnitude: Low | Applicability: Medium] The FMOV instruction does not perform any conversion when moving data from the vector registers to the general purpose registers, whereas UMOV and SMOV do. Use the lower bandwidth and lower latency FMOV wherever possible."
What isn't clear is if this is meant to only apply to 8/16-bit values or if it is also meant to apply to 32/64-bit values (where zero-conversion uses the preferred disassembly of mov instead of umov). However, given that there is no callout on that and that fmov is explicitly documented to be lower cost this seems like a positive change.
We likely want to also handle the 32-bit -> 32-bit case in the same way (barring the case where the value is immediately sign-extended and so should just be a 32->64-bit smov).
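As a hypothetical illustration (not from the PR), the two 32-bit patterns being distinguished here look like this in managed code:

```csharp
using System;
using System.Runtime.Intrinsics;

class Demo32
{
    static void Main()
    {
        Vector128<int> v = Vector128.Create(-1, 2, 3, 4);

        // 32-bit -> 32-bit extract: a candidate for fmov Wd, Sn
        // instead of umov Wd, Vn.S[0].
        int a = v.ToScalar();

        // Extract immediately widened to 64-bit with sign extension:
        // better served by a single smov Xd, Vn.S[0].
        long b = v.ToScalar();

        Console.WriteLine($"{a} {b}");
    }
}
```

Whether the JIT can reliably recognize the sign-extending consumer at codegen time is the open question in the comment above.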
The fix for the blittable structure failures was merged, so you might get a clean CI run if you update the branch and rerun. I don't think it's strictly needed though.
Updated hwintrinsiccodegenarm64.cpp to emit fmov instead of umov for 64-bit integer types (int64/uint64).