Refactor OSR method prologs to duplicate tier0 prolog at the beginning by jakobbotsch · Pull Request #126791 · dotnet/runtime

jakobbotsch · 2026-04-11T16:04:35Z

This changes how OSR methods have their prologs generated. OSR methods now begin with a prolog that emulates the stack frame changes made by the tier0 prolog, followed by the existing OSR prolog that assumes the tier0 stack frame changes have already executed. The idea is to handle two separate problems:

On win-arm64 there is a requirement that unwind codes map 1:1 to prolog instructions. OSR functions did not satisfy this requirement before. Hence unwinding out of partially executed OSR prologs (and potentially also some instructions after the prolog) was broken. This probably did not affect anything but diagnostics. It was likely a problem on all platforms, not just win-arm64.
Resuming runtime async methods was expensive because it required going through the path resumption stub -> tier0 -> OSR method. The last step required a very expensive unwinder-based transition. This is tracked in #120865 (Optimize runtime async OSR resumption performance) and was causing severe regressions in ASP.NET microbenchmarks when runtime async is enabled.

With this change:

To transition from a tier0 method to an OSR method, we enter the OSR method at the offset where the OSR part of the prolog begins. This offset is communicated via a new PatchpointInfo instance that the OSR method now records. This should fix unwinding out of a partially executed prolog: the unwinder now sees the initial part of the prolog, which corresponds to the tier0 prolog, as already executed, and hence will unwind that part as normal even if we have just transitioned.
To resume runtime async methods we can now skip the tier0 method entirely and instead execute the OSR method directly, starting from the beginning.

Currently implemented only for x64. For arm64 some changes are needed to first restore callee saves from the tier0 part of the frame.
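
The two entry paths described above can be sketched roughly as follows. This is an illustration only, not the runtime's code: `OsrMethod`, `EntryKind`, and `osrEntryAddress` are hypothetical names. The only ideas taken from the description are that a tier0 transition enters at the recorded transition offset, while async resumption enters at the start of the method.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an OSR method body; not a runtime type.
struct OsrMethod
{
    uintptr_t codeStart;              // address of the first prolog instruction
    int32_t   transitionPrologOffset; // offset of the "OSR transition point"
};

enum class EntryKind
{
    Tier0Transition, // a tier0 frame already exists on the stack
    AsyncResumption, // no tier0 frame yet; run the duplicated tier0 prolog
};

// Pick where execution should begin inside the OSR method.
inline uintptr_t osrEntryAddress(const OsrMethod& m, EntryKind kind)
{
    if (kind == EntryKind::Tier0Transition)
    {
        // Skip the duplicated tier0 prolog: its stack effects already happened.
        assert(m.transitionPrologOffset >= 0);
        return m.codeStart + static_cast<uintptr_t>(m.transitionPrologOffset);
    }
    // Resumption starts from the very beginning of the method.
    return m.codeStart;
}
```

Because the unwind data describes the whole prolog, entering in the middle looks to an unwinder like a prolog that has simply executed up to that point.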

Example:

public class Program
{
    static int s_sum;
    public static void Main()
    {
        string foo = "abc";
        string bar = "def";
        string baz = "fawd";
        string beef = "a123";
        string bazbeef = "a11";
        for (int i = 0; i < 20000; i++)
        {
            if (i >= 10000 && i <= 10100)
                Thread.Sleep(10);

            s_sum += i * i;
        }

        Console.WriteLine(s_sum);
        Console.WriteLine(foo);
        Console.WriteLine(bar);
        Console.WriteLine(baz);
        Console.WriteLine(beef);
        Console.WriteLine(bazbeef);
    }
}

First the tier0 prolog:

; Assembly listing for method Program:Main() (Instrumented Tier0)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Instrumented Tier0 code
; rbp based frame
; fully interruptible
; compiling with minopt

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 144
       lea      rbp, [rsp+0x90]
       xor      eax, eax
       mov      dword ptr [rbp-0x64], eax
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rbp-0x60], ymm4
       mov      qword ptr [rbp-0x40], rax

The OSR method then starts with a prolog that is mostly similar (though not always identical):

G_M000_IG01:                ;; offset=0x0000
       push     rbp                                  ; Tier0 save rbp
       sub      rsp, 144                             ; Tier0 allocate locals
       lea      rbp, [rsp+0x90]                      ; Tier0 setup rbp
       vxorps   xmm4, xmm4, xmm4                     ; Tier0
       vmovdqu  ymmword ptr [rbp-0x60], ymm4         ; Tier0
       xor      eax, eax                             ; Tier0
       mov      qword ptr [rbp-0x40], rax            ; Tier0 null out GC refs on tier0 frame that we are reporting in the OSR method
                                                     ; OSR transition point
       push     rax                                  ; OSR prolog: set up expected misalignment (TODO: remove)
       sub      rsp, 88                              ; OSR allocate locals
       mov      qword ptr [rsp+0xE8], r15            ; OSR save callee regs
       mov      qword ptr [rsp+0xE0], r14            ; OSR
       mov      qword ptr [rsp+0xD8], r13            ; OSR
       mov      qword ptr [rsp+0xD0], rdi            ; OSR
       mov      qword ptr [rsp+0xC8], rsi            ; OSR
       mov      qword ptr [rsp+0xC0], rbx            ; OSR
       mov      rsi, gword ptr [rsp+0xB0]            ; OSR
       mov      rdi, gword ptr [rsp+0xA8]            ; OSR
       mov      rbp, gword ptr [rsp+0xA0]            ; OSR
       mov      r14, gword ptr [rsp+0x98]            ; OSR 
       mov      r15, gword ptr [rsp+0x90]            ; OSR
       mov      ebx, dword ptr [rsp+0x8C]            ; OSR enregister tier0 locals
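
The rsp-relative displacements in the OSR part of this listing can be checked against the tier0 frame with a little arithmetic. The sketch below uses only the sizes visible in the listing; the helper names are hypothetical, and the claim that `[rbp-0x64]` holds the loop counter is an assumption based on the tier0 prolog zeroing that slot.

```cpp
#include <cassert>
#include <cstdint>

// Sizes read off the listing above (x64, bytes).
constexpr int64_t kTier0RbpDisp = 0x90; // lea rbp, [rsp+0x90] after sub rsp, 144
constexpr int64_t kOsrPush      = 8;    // push rax (misalignment slot)
constexpr int64_t kOsrAlloc     = 88;   // sub rsp, 88

// Distance from the final OSR rsp up to the tier0 frame pointer.
constexpr int64_t tier0RbpFromOsrRsp()
{
    return kTier0RbpDisp + kOsrPush + kOsrAlloc;
}

// Translate a tier0 [rbp+disp] slot into the equivalent OSR [rsp+disp] slot.
constexpr int64_t osrRspOffset(int64_t tier0RbpDisp)
{
    return tier0RbpFromOsrRsp() + tier0RbpDisp;
}
```

With these numbers the tier0 slot `[rbp-0x64]` maps to `[rsp+0x8C]`, which is the slot the OSR prolog loads into `ebx`, and `[rsp+0xE8]` (where `r15` is saved) is the slot directly below the saved `rbp`.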

Another example:

static int s_value;
static async Task Foo(int n, NullAwaiter na)
{
    for (int i = 0; i < n; i++)
    {
        s_value += i;
    }

    Stopwatch timer = Stopwatch.StartNew();
    for (int i = 0; i < 10_000_000; i++)
    {
        await na;
    }
    Console.WriteLine("Took {0:F1} ms", timer.Elapsed.TotalMilliseconds);
}

Tier 0:

; Assembly listing for method OSRPerf.Program:Foo(int,OSRPerf.Program+NullAwaiter) (Instrumented Tier0)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Instrumented Tier0 code
; async
; rbp based frame
; fully interruptible
; compiling with minopt

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 176
       lea      rbp, [rsp+0xB0]
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rbp-0x90], ymm4
       vmovdqu  ymmword ptr [rbp-0x70], ymm4
       vmovdqa  xmmword ptr [rbp-0x50], xmm4
       xor      eax, eax
       mov      qword ptr [rbp-0x40], rax
       mov      gword ptr [rbp+0x10], rcx
       mov      dword ptr [rbp+0x18], edx
       mov      gword ptr [rbp+0x20], r8

; Assembly listing for method OSRPerf.Program:Foo(int,OSRPerf.Program+NullAwaiter) (Tier1-OSR)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Tier1-OSR code
; OSR variant for entry point 0x14
; async
; optimized code
; optimized using Synthesized PGO
; rbp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 1
; 4 inlinees with PGO data; 14 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
                                                ; Runtime async transition point
       push     rbp                             ; Tier0 Save rbp
       sub      rsp, 176                        ; Tier0 Alloc locals
       lea      rbp, [rsp+0xB0]                 ; Tier0 Establish frame pointer
       xor      eax, eax                        ; Tier0 
       mov      qword ptr [rbp-0x50], rax       ; Tier0 
       mov      qword ptr [rbp-0x60], rax       ; Tier0 
       mov      qword ptr [rbp-0x40], rax       ; Tier0 
       mov      qword ptr [rbp-0x48], rax       ; Tier0 null out GC refs on tier0 frame that we are reporting in the OSR method
       mov      qword ptr [rbp+0x10], rcx       ; Tier0 save async continuation parameter
                                                ; OSR transition point
       push     rax                             ; OSR set up expected misalignment (TODO: remove)
       mov      rax, qword ptr [rbp]            ; OSR 
       push     rax                             ; OSR Maintain chained frame pointers
       sub      rsp, 112                        ; OSR Alloc locals
       mov      qword ptr [rsp+0x128], r14      ; OSR 
       mov      qword ptr [rsp+0x120], rdi      ; OSR 
       mov      qword ptr [rsp+0x118], rsi      ; OSR 
       mov      qword ptr [rsp+0x110], rbx      ; OSR Save callee saves
       vzeroupper                               ; OSR 
       vmovaps  xmmword ptr [rsp+0x40], xmm6    ; OSR Save float callee regs
       lea      rbp, [rsp+0x70]                 ; OSR Establish frame pointer
       xor      eax, eax                        ; OSR
       mov      qword ptr [rbp-0x38], rax       ; OSR
       mov      qword ptr [rbp-0x40], rax       ; OSR Zero out locals
       mov      rax, gword ptr [rbp+0xD0]       ; OSR
       mov      ecx, dword ptr [rbp+0xD8]       ; OSR
       mov      edx, dword ptr [rbp+0x6C]       ; OSR Enregister tier0 locals
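
The "expected misalignment" push can be understood by tracking rsp modulo 16 through the prolog. The sketch below is a hypothetical helper, not JIT code. It assumes the usual x64 state on method entry (rsp % 16 == 8 right after a `call`) and reads the extra push as recreating that state for the OSR part of the prolog.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Returns rsp modulo 16 after applying a sequence of stack decrements,
// starting from the state right after a `call` (rsp % 16 == 8 on x64).
inline int64_t rspMod16AfterProlog(std::initializer_list<int64_t> decrements)
{
    int64_t rsp = 8; // only the value modulo 16 matters
    for (int64_t d : decrements)
    {
        rsp -= d;
    }
    // Normalize to [0, 16) because % can yield negative results in C++.
    return ((rsp % 16) + 16) % 16;
}
```

For the listing above, the tier0 part (`push rbp`, `sub rsp, 176`) leaves rsp 16-byte aligned, the misalignment push brings it back to 8 modulo 16, and the remaining OSR adjustments (`push rax`, `sub rsp, 112`) restore alignment.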

Perf of this last benchmark:
Before: Took 6156.5 ms
After: Took 484.6 ms

FYI @AndyAyersMS

dotnet-policy-service · 2026-04-11T16:05:57Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR refactors OSR method prolog generation so OSR methods begin with a duplicated Tier0 prolog (currently implemented for x64), enabling correct/robust unwinding from OSR prologs and allowing runtime async resumption to enter OSR code directly without routing through Tier0.

Changes:

Emit a duplicated Tier0 prolog at the start of x64 OSR methods and record an OSR “transition prolog offset” in per-OSR-method PatchpointInfo.
Update VM patchpoint transition logic to jump into the OSR method at the recorded transition offset (skipping the duplicated Tier0 prolog).
Refactor frame zero-init helpers to accept an explicit base register, and adjust async transformation/resumption logic to account for the new x64 OSR entry behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/coreclr/vm/jitinterface.cpp | Adjusts async resumption stub behavior around OSR tiers (x64 vs non-x64). |
| src/coreclr/vm/jithelpers.cpp | Computes OSR transition address using per-method `PatchpointInfo` and updates patchpoint transition context handling. |
| src/coreclr/jit/compiler.h | Adds/normalizes default initialization for several `Compiler::info` fields. |
| src/coreclr/jit/compiler.cpp | Removes now-redundant explicit resets/initializations relying on new defaults. |
| src/coreclr/jit/codegenxarch.cpp | Implements x64 OSR duplicated Tier0 prolog emission and records transition offset in `PatchpointInfo`. |
| src/coreclr/jit/codegencommon.cpp | Wires x64 OSR prolog duplication into common prolog generation and updates zero-init call signature. |
| src/coreclr/jit/codegen.h | Declares `genDuplicateTier0Prolog` and updates `genZeroInitFrameUsingBlockInit` signature. |
| src/coreclr/jit/codegenarm64.cpp | Updates `genZeroInitFrameUsingBlockInit` to accept a base register. |
| src/coreclr/jit/codegenarm.cpp | Updates `genZeroInitFrameUsingBlockInit` to accept a base register. |
| src/coreclr/jit/codegenloongarch64.cpp | Updates `genZeroInitFrameUsingBlockInit` to accept a base register. |
| src/coreclr/jit/codegenriscv64.cpp | Updates `genZeroInitFrameUsingBlockInit` to accept a base register. |
| src/coreclr/jit/async.cpp | Adjusts async continuation reuse and OSR IL-offset handling (skipped on x64 due to direct OSR entry). |
| src/coreclr/inc/patchpointinfo.h | Adds `m_transitionPrologOffset` to `PatchpointInfo` and accessors. |

Copilot · 2026-04-11T16:12:52Z

src/coreclr/vm/jitinterface.cpp

    PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
    NativeCodeVersion ncv = config->GetCodeVersion();
+#ifndef TARGET_AMD64


On TARGET_AMD64 the #ifndef TARGET_AMD64 excludes the if (ncv.GetOptimizationTier() == ...Tier1OSR) block, but config/ncv are still declared and become unused. CoreCLR builds commonly treat unused locals as warnings-as-errors, so this will likely break amd64 builds. Consider moving the PrepareCodeConfig* config and NativeCodeVersion ncv declarations inside the #ifndef TARGET_AMD64 block (or explicitly marking them unused on amd64).

Suggested change

-    PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
-    NativeCodeVersion ncv = config->GetCodeVersion();
-#ifndef TARGET_AMD64
+#ifndef TARGET_AMD64
+    PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
+    NativeCodeVersion ncv = config->GetCodeVersion();

Copilot · 2026-04-11T16:12:52Z

src/coreclr/jit/codegenxarch.cpp

+    PatchpointInfo* ppi              = m_compiler->info.compPatchpointInfo;
+    regMaskTP       tier0CalleeSaves = (regMaskTP)ppi->CalleeSaveRegisters();
+    tier0CalleeSaves &= ~RBM_FPBASE;
+    // On x64 only integer registers are part of this set.
+    // TODO: Isn't this a bug? How does the OSR method restore float registers saved by tier0?
+    assert((tier0CalleeSaves & RBM_ALLINT) == tier0CalleeSaves);


PatchpointInfo::CalleeSaveRegisters() can include XMM6–XMM15 on Windows amd64 (see RBM_FLT_CALLEE_SAVED in targetamd64.h), since it is set from rsGetModifiedCalleeSavedRegsMask(). The new assert that the tier0 callee-saves mask is RBM_ALLINT can therefore fire, and even if it doesn’t, the duplicated prolog currently only emits integer pushes, so it can’t faithfully reproduce tier0 prolog behavior when tier0 saved float callee-saves. Either restrict what gets recorded in PatchpointInfo to integer callee-saves (and assert that invariant when recording), or extend the duplication logic to also account for float callee-saves.
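
The concern about the assert can be illustrated with a toy register mask. Everything below is hypothetical: the bit layout and the constant names are invented for the example and do not match the JIT's real `regMaskTP` encoding. The sketch only shows how a check of the form `(mask & ALLINT) == mask` behaves once a float register bit is present.

```cpp
#include <cassert>
#include <cstdint>

using RegMask = uint64_t;

// Hypothetical bit layout for illustration only; the JIT's real masks differ.
constexpr RegMask kAllInt = 0x000000000000FFFFull; // 16 integer registers
constexpr RegMask kRbx    = 1ull << 3;
constexpr RegMask kRsi    = 1ull << 6;
constexpr RegMask kXmm6   = 1ull << (16 + 6);      // first float bit range

// True when the mask contains integer registers only.
constexpr bool isIntegerOnly(RegMask m)
{
    return (m & kAllInt) == m;
}

// The portion of the mask that an integer-only prolog duplication would miss.
constexpr RegMask floatPart(RegMask m)
{
    return m & ~kAllInt;
}
```

If a recorded mask ever included a float callee-save, a check like `isIntegerOnly` would fail, and `floatPart` would identify the registers the duplicated prolog does not handle.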

Copilot · 2026-04-11T16:12:52Z

src/coreclr/jit/codegencommon.cpp

    }
-#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
+#elif defined(TARGET_AMD64)
+    if (m_compiler->opts.IsOSR())
+    {
+        genDuplicateTier0Prolog();
+    }


The comment just above the OSR prolog handling still says “x64 handles this differently; … emitted in genOSRRecordTier0CalleeSavedRegistersAndFrame”, but that helper was removed and the new behavior is genDuplicateTier0Prolog(). Please update the comment so it matches the new implementation (otherwise it’s misleading for future maintenance).

am11 · 2026-04-11T16:50:00Z

@jakobbotsch, just curious, are you planning to eliminate the usage of VirtualUnwindToFirstManagedCallFrame/RtlVirtualUnwind usage from this path as well? (it was my main motivation of touching this code 🙂)

jakobbotsch · 2026-04-11T16:57:41Z

> @jakobbotsch, just curious, are you planning to eliminate the usage of VirtualUnwindToFirstManagedCallFrame/RtlVirtualUnwind usage from this path as well? (it was my main motivation of touching this code 🙂)

My current plan is to fix the runtime async perf issue first and foremost. Once done it should not be that much work to do that refactoring (and most of the work, with GT_PATCHPOINT, should already be in your PR). If you'd like to take it from there at that point I definitely would not be opposed, otherwise I will probably submit it as a follow-up when I get to it.

jakobbotsch · 2026-04-11T17:06:55Z

> Once done it should not be that much work

I suppose there will be a lot of work in updating loongarch64/RISCV64 to the same scheme as x64/arm64. I might take a look, and the answer probably depends on how much work that is.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Copilot · 2026-04-11T17:08:53Z

src/coreclr/inc/patchpointinfo.h

    void Initialize(uint32_t localCount, int32_t totalFrameSize)
    {
        m_calleeSaveRegisters     = 0;
        m_tier0Version            = 0;
        m_totalFrameSize          = totalFrameSize;
        m_numberOfLocals          = localCount;
+        m_transitionPrologOffset  = -1;


PatchpointInfo::Initialize sets several members to sentinel values, but it does not initialize m_asyncExecutionContextOffset / m_asyncSynchronizationContextOffset (the corresponding HasAsync*Offset() checks assume -1 means "absent"). If these fields are not explicitly set later, they can contain garbage and incorrectly appear present. Initialize them to -1 here (along with the other offsets).
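
The sentinel convention this comment relies on can be sketched with a simplified stand-in. `OffsetRecord` and its field names are hypothetical and much smaller than the real `PatchpointInfo`. The point is only that a `Has...()` check comparing against -1 is meaningful only when every offset field is set to -1 during initialization.

```cpp
#include <cassert>
#include <cstdint>

// Simplified, hypothetical stand-in for a PatchpointInfo-like record.
struct OffsetRecord
{
    int32_t transitionPrologOffset;
    int32_t asyncContextOffset;

    // Set every optional offset to the "absent" sentinel up front, so that
    // no field is left holding whatever bytes happened to be in memory.
    void Initialize()
    {
        transitionPrologOffset = -1;
        asyncContextOffset     = -1;
    }

    bool HasTransitionPrologOffset() const { return transitionPrologOffset != -1; }
    bool HasAsyncContextOffset() const { return asyncContextOffset != -1; }
};
```

Leaving a field out of `Initialize` would make the corresponding `Has...()` query depend on uninitialized memory.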

Copilot · 2026-04-11T17:08:53Z

src/coreclr/inc/patchpointinfo.h

    void Copy(const PatchpointInfo* original)
    {
        m_calleeSaveRegisters = original->m_calleeSaveRegisters;
        m_tier0Version = original->m_tier0Version;
+        m_transitionPrologOffset = original->m_transitionPrologOffset;
        m_genericContextArgOffset = original->m_genericContextArgOffset;


PatchpointInfo::Copy only copies a subset of fields (it currently omits at least m_totalFrameSize, m_numberOfLocals, and the async execution/synchronization context offsets). This can produce an internally inconsistent copy (e.g., NumberOfLocals()/TotalFrameSize() not matching the copied offset data). Copy all scalar members, including the async offsets, before copying the variable-length m_offsetAndExposureData.

Copilot · 2026-04-11T17:08:53Z

src/coreclr/jit/codegenxarch.cpp

-    int const tier0FrameSize = patchpointInfo->TotalFrameSize() + REGSIZE_BYTES;
-    int const tier0NetSize   = tier0FrameSize - tier0IntCalleeSaveUsedSize;
-    m_compiler->unwindAllocStack(tier0NetSize);
+    unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();


genDuplicateTier0Prolog derives localFrameSize directly from compPatchpointInfo->TotalFrameSize(). On amd64, Compiler::generatePatchpointInfo currently records TotalFrameSize as genTotalFrameSize() + TARGET_POINTER_SIZE to account for the "pseudo return address" slot used by the OSR transition. With this PR the transition no longer adjusts SP in the VM helper, and the misalignment slot is now created via a push in the OSR prolog instead. Unless TotalFrameSize semantics were updated elsewhere, this will cause the duplicated tier0 prolog to allocate an extra pointer-sized slot and miscompute the tier0 frame layout. Consider subtracting the pseudo-call slot here (or updating TotalFrameSize generation) so the duplicated prolog matches the actual tier0 frame size at the transition point.

Suggested change

-    unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();
+    unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();
+#ifdef TARGET_AMD64
+    // PatchpointInfo records an extra pointer-sized pseudo return-address slot for the OSR transition.
+    // The duplicated tier0 prolog must allocate only the actual tier0 frame size here.
+    assert(totalFrameSize >= REGSIZE_BYTES);
+    totalFrameSize -= REGSIZE_BYTES;
+#endif

Copilot · 2026-04-11T17:08:54Z

src/coreclr/vm/jithelpers.cpp

+    if (patchpointInfo->TransitionPrologOffset() != -1)
+    {
+        offset = patchpointInfo->TransitionPrologOffset();


GetOSRTransitionAddress treats a missing TransitionPrologOffset() (default -1) as offset 0, which would transition into the OSR method entrypoint. On amd64 OSR methods now begin with a duplicated tier0 prolog; entering at offset 0 from a tier0->OSR transition would re-run those stack adjustments on top of an existing tier0 frame and can corrupt the stack/unwind state. It seems safer to treat TransitionPrologOffset() == -1 as an error (e.g., return NULL and invalidate the patchpoint / log) rather than falling back to 0.

Suggested change

-    if (patchpointInfo->TransitionPrologOffset() != -1)
-    {
-        offset = patchpointInfo->TransitionPrologOffset();
+    offset = patchpointInfo->TransitionPrologOffset();
+    if (offset == -1)
+    {
+        return NULL;

AndyAyersMS · 2026-04-12T14:50:11Z

I haven't seen any cases where the arm64 phantom unwind causes actual problems.

Would it be simpler to check the "continue in OSR" bit at the end of the Tier0 prolog and then jump to the OSR entry point (with suitable pains to adjust SP when needed) instead of calling the patchpoint helper?

jakobbotsch added 8 commits April 10, 2026 17:04

JIT: Duplicate tier0 prologs in OSR functions

111b803

Hack

535835b

Fix build

6cb5c58

Store transition offset

6ecf9f7

Simplify

74c2b99

Break everything

16fdf90

Finally fix

23b13e1

Revert unnecessary changes

6ba8464

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2026

dotnet-policy-service bot assigned jakobbotsch Apr 11, 2026

Fix build

dec5ac7

Still need to skip tier0 continuations in OSR method

1d62189

jakobbotsch added 2 commits April 11, 2026 18:59

Fix build

7dd7bb8

Run jit-format

1c8bba1

Fix build 2

0890a42
