
Refactor OSR method prologs to duplicate tier0 prolog at the beginning #126791

Draft
jakobbotsch wants to merge 13 commits into dotnet:main from jakobbotsch:osr-duplicated-prologs

Member

@jakobbotsch jakobbotsch commented Apr 11, 2026

This changes how OSR methods have their prologs generated. OSR methods now begin with a prolog that emulates the stack frame changes made by the tier0 prolog, followed by the existing OSR prolog that assumes the tier0 stack frame changes have already executed. The idea is to handle two separate problems:

  • On win-arm64 there is a requirement that unwind codes map 1:1 to prolog instructions. OSR functions did not satisfy this requirement before. Hence unwinding out of partially executed OSR prologs (and potentially also some instructions after the prolog) was broken. This probably did not affect anything but diagnostics. It was likely a problem on all platforms, not just win-arm64.
  • Resuming runtime async methods was expensive because it went through the path resumption stub -> tier0 -> OSR method, and the last step required a very expensive unwinder-based transition. This is tracked as "Optimize runtime async OSR resumption performance" (#120865) and was causing some severe ASP.NET microbenchmark regressions when runtime async is enabled.

With this change:

  • To transition from a tier0 method to an OSR method, we jump into the OSR method at the offset where the OSR prolog proper begins. This offset is communicated via a new PatchpointInfo instance that the OSR method now records. This should solve unwinding out of a partially executed prolog: the unwinder now treats the initial part of the prolog, the part that corresponds to the tier0 prolog, as already executed and unwinds it as normal, even if we have only just transitioned.
  • To resume runtime async methods we can now skip the tier0 method entirely and instead execute the OSR method directly, starting from the beginning.
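
The two entry points described above can be sketched roughly as follows. This is only an illustration: `OsrMethodSketch`, `EntryKind` and `SelectOsrEntry` are hypothetical names, not types or functions from the runtime, and the real transition logic lives in the VM patchpoint helper and the async resumption stub.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-OSR-method data; not the real PatchpointInfo.
struct OsrMethodSketch
{
    uintptr_t codeStart;              // first instruction, i.e. the duplicated tier0 prolog
    int32_t   transitionPrologOffset; // offset at which the OSR prolog proper begins
};

enum class EntryKind
{
    Tier0Patchpoint, // a tier0 frame already exists on the stack
    AsyncResumption, // no tier0 frame exists yet
};

// Picks the address at which execution of the OSR method should start.
inline uintptr_t SelectOsrEntry(const OsrMethodSketch& method, EntryKind kind)
{
    assert(method.transitionPrologOffset >= 0);

    if (kind == EntryKind::Tier0Patchpoint)
    {
        // The tier0 prolog has already run, so skip its duplicate.
        return method.codeStart + static_cast<uintptr_t>(method.transitionPrologOffset);
    }

    // Async resumption: run the duplicated tier0 prolog first, then the OSR prolog.
    return method.codeStart;
}
```

In both cases the unwinder sees the same prolog shape; only the amount of it that has "already executed" at entry differs.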

Currently implemented only for x64. For arm64 some changes are needed to first restore callee saves from the tier0 part of the frame.

Example:

public class Program
{
    static int s_sum;
    public static void Main()
    {
        string foo = "abc";
        string bar = "def";
        string baz = "fawd";
        string beef = "a123";
        string bazbeef = "a11";
        for (int i = 0; i < 20000; i++)
        {
            if (i >= 10000 && i <= 10100)
                Thread.Sleep(10);

            s_sum += i * i;
        }

        Console.WriteLine(s_sum);
        Console.WriteLine(foo);
        Console.WriteLine(bar);
        Console.WriteLine(baz);
        Console.WriteLine(beef);
        Console.WriteLine(bazbeef);
    }
}

First the tier0 prolog:

; Assembly listing for method Program:Main() (Instrumented Tier0)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Instrumented Tier0 code
; rbp based frame
; fully interruptible
; compiling with minopt

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 144
       lea      rbp, [rsp+0x90]
       xor      eax, eax
       mov      dword ptr [rbp-0x64], eax
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rbp-0x60], ymm4
       mov      qword ptr [rbp-0x40], rax

The OSR method then starts with a prolog that mostly mirrors the tier0 prolog (though not always exactly):

G_M000_IG01:                ;; offset=0x0000
       push     rbp                                  ; Tier0 save rbp
       sub      rsp, 144                             ; Tier0 allocate locals
       lea      rbp, [rsp+0x90]                      ; Tier0 setup rbp
       vxorps   xmm4, xmm4, xmm4                     ; Tier0
       vmovdqu  ymmword ptr [rbp-0x60], ymm4         ; Tier0
       xor      eax, eax                             ; Tier0
       mov      qword ptr [rbp-0x40], rax            ; Tier0 null out GC refs on tier0 frame that we are reporting in the OSR method
                                                     ; OSR transition point
       push     rax                                  ; OSR prolog: set up expected misalignment (TODO: remove)
       sub      rsp, 88                              ; OSR allocate locals
       mov      qword ptr [rsp+0xE8], r15            ; OSR save callee regs
       mov      qword ptr [rsp+0xE0], r14            ; OSR
       mov      qword ptr [rsp+0xD8], r13            ; OSR
       mov      qword ptr [rsp+0xD0], rdi            ; OSR
       mov      qword ptr [rsp+0xC8], rsi            ; OSR
       mov      qword ptr [rsp+0xC0], rbx            ; OSR
       mov      rsi, gword ptr [rsp+0xB0]            ; OSR
       mov      rdi, gword ptr [rsp+0xA8]            ; OSR
       mov      rbp, gword ptr [rsp+0xA0]            ; OSR
       mov      r14, gword ptr [rsp+0x98]            ; OSR 
       mov      r15, gword ptr [rsp+0x90]            ; OSR
       mov      ebx, dword ptr [rsp+0x8C]            ; OSR enregister tier0 locals
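
As a sanity check on the offsets in this listing, the arithmetic relating tier0 rbp-relative slots to the rsp-relative loads in the OSR prolog can be written out. The constants are read off the two listings above; the function names are made up for illustration and do not exist in the JIT.

```cpp
#include <cassert>
#include <cstdint>

// Values read off the listings above.
constexpr int32_t kTier0RbpFromRsp  = 0x90; // lea rbp, [rsp+0x90]
constexpr int32_t kOsrAlignmentPush = 8;    // push rax
constexpr int32_t kOsrAllocSize     = 88;   // sub rsp, 88

// Distance from the final OSR rsp up to the address the tier0 rbp held.
constexpr int32_t Tier0RbpFromOsrRsp()
{
    return kOsrAllocSize + kOsrAlignmentPush + kTier0RbpFromRsp;
}

// Converts a tier0 [rbp+k] slot into the equivalent [rsp+N] offset after the OSR prolog.
constexpr int32_t Tier0SlotToOsrRspOffset(int32_t tier0RbpRelative)
{
    return Tier0RbpFromOsrRsp() + tier0RbpRelative;
}

static_assert(Tier0RbpFromOsrRsp() == 0xF0, "88 + 8 + 144 = 240");
static_assert(Tier0SlotToOsrRspOffset(-0x64) == 0x8C, "mov ebx, dword ptr [rsp+0x8C]");
static_assert(Tier0SlotToOsrRspOffset(-0x60) == 0x90, "mov r15, gword ptr [rsp+0x90]");
static_assert(Tier0SlotToOsrRspOffset(-0x40) == 0xB0, "mov rsi, gword ptr [rsp+0xB0]");
```

So the dword that tier0 zeroes at [rbp-0x64] is the one the OSR prolog loads into ebx, and the slots from [rbp-0x60] through [rbp-0x40] line up with the gword loads at [rsp+0x90] through [rsp+0xB0].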

Another example:

static int s_value;
static async Task Foo(int n, NullAwaiter na)
{
    for (int i = 0; i < n; i++)
    {
        s_value += i;
    }

    Stopwatch timer = Stopwatch.StartNew();
    for (int i = 0; i < 10_000_000; i++)
    {
        await na;
    }
    Console.WriteLine("Took {0:F1} ms", timer.Elapsed.TotalMilliseconds);
}

Tier 0:

; Assembly listing for method OSRPerf.Program:Foo(int,OSRPerf.Program+NullAwaiter) (Instrumented Tier0)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Instrumented Tier0 code
; async
; rbp based frame
; fully interruptible
; compiling with minopt

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 176
       lea      rbp, [rsp+0xB0]
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rbp-0x90], ymm4
       vmovdqu  ymmword ptr [rbp-0x70], ymm4
       vmovdqa  xmmword ptr [rbp-0x50], xmm4
       xor      eax, eax
       mov      qword ptr [rbp-0x40], rax
       mov      gword ptr [rbp+0x10], rcx
       mov      dword ptr [rbp+0x18], edx
       mov      gword ptr [rbp+0x20], r8

OSR:

; Assembly listing for method OSRPerf.Program:Foo(int,OSRPerf.Program+NullAwaiter) (Tier1-OSR)
; Emitting BLENDED_CODE for x64 + VEX on Windows
; Tier1-OSR code
; OSR variant for entry point 0x14
; async
; optimized code
; optimized using Synthesized PGO
; rbp based frame
; fully interruptible
; with Synthesized PGO: fgCalledCount is 1
; 4 inlinees with PGO data; 14 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
                                                ; Runtime async transition point
       push     rbp                             ; Tier0 Save rbp
       sub      rsp, 176                        ; Tier0 Alloc locals
       lea      rbp, [rsp+0xB0]                 ; Tier0 Establish frame pointer
       xor      eax, eax                        ; Tier0 
       mov      qword ptr [rbp-0x50], rax       ; Tier0 
       mov      qword ptr [rbp-0x60], rax       ; Tier0 
       mov      qword ptr [rbp-0x40], rax       ; Tier0 
       mov      qword ptr [rbp-0x48], rax       ; Tier0 null out GC refs on tier0 frame that we are reporting in the OSR method
       mov      qword ptr [rbp+0x10], rcx       ; Tier0 save async continuation parameter
                                                ; OSR transition point
       push     rax                             ; OSR set up expected misalignment (TODO: remove)
       mov      rax, qword ptr [rbp]            ; OSR 
       push     rax                             ; OSR Maintain chained frame pointers
       sub      rsp, 112                        ; OSR Alloc locals
       mov      qword ptr [rsp+0x128], r14      ; OSR 
       mov      qword ptr [rsp+0x120], rdi      ; OSR 
       mov      qword ptr [rsp+0x118], rsi      ; OSR 
       mov      qword ptr [rsp+0x110], rbx      ; OSR Save callee saves
       vzeroupper                               ; OSR 
       vmovaps  xmmword ptr [rsp+0x40], xmm6    ; OSR Save float callee regs
       lea      rbp, [rsp+0x70]                 ; OSR Establish frame pointer
       xor      eax, eax                        ; OSR
       mov      qword ptr [rbp-0x38], rax       ; OSR
       mov      qword ptr [rbp-0x40], rax       ; OSR Zero out locals
       mov      rax, gword ptr [rbp+0xD0]       ; OSR
       mov      ecx, dword ptr [rbp+0xD8]       ; OSR
       mov      edx, dword ptr [rbp+0x6C]       ; OSR Enregister tier0 locals

Perf of this last benchmark:
Before: Took 6156.5 ms
After: Took 484.6 ms
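
For scale, those totals can be turned into a per-iteration cost for the 10,000,000-iteration timed loop. A small sketch of the arithmetic (the helper names are illustrative only):

```cpp
#include <cassert>
#include <cmath>

constexpr double kIterations = 10000000.0; // iterations of the timed await loop
constexpr double kBeforeMs   = 6156.5;     // "Before" total from above
constexpr double kAfterMs    = 484.6;      // "After" total from above

// Average cost of one loop iteration in nanoseconds (1 ms = 1e6 ns).
constexpr double NanosPerIteration(double totalMs)
{
    return totalMs * 1.0e6 / kIterations;
}

// Ratio of the two totals.
constexpr double Speedup()
{
    return kBeforeMs / kAfterMs;
}
```

That works out to roughly 616 ns per iteration before and roughly 48 ns after, a speedup of about 12.7x on this micro benchmark.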

FYI @AndyAyersMS

Copilot AI review requested due to automatic review settings April 11, 2026 16:04
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors OSR method prolog generation so OSR methods begin with a duplicated Tier0 prolog (currently implemented for x64), enabling correct/robust unwinding from OSR prologs and allowing runtime async resumption to enter OSR code directly without routing through Tier0.

Changes:

  • Emit a duplicated Tier0 prolog at the start of x64 OSR methods and record an OSR “transition prolog offset” in per-OSR-method PatchpointInfo.
  • Update VM patchpoint transition logic to jump into the OSR method at the recorded transition offset (skipping the duplicated Tier0 prolog).
  • Refactor frame zero-init helpers to accept an explicit base register, and adjust async transformation/resumption logic to account for the new x64 OSR entry behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File | Description
src/coreclr/vm/jitinterface.cpp | Adjusts async resumption stub behavior around OSR tiers (x64 vs non-x64).
src/coreclr/vm/jithelpers.cpp | Computes OSR transition address using per-method PatchpointInfo and updates patchpoint transition context handling.
src/coreclr/jit/compiler.h | Adds/normalizes default initialization for several Compiler::info fields.
src/coreclr/jit/compiler.cpp | Removes now-redundant explicit resets/initializations relying on new defaults.
src/coreclr/jit/codegenxarch.cpp | Implements x64 OSR duplicated Tier0 prolog emission and records transition offset in PatchpointInfo.
src/coreclr/jit/codegencommon.cpp | Wires x64 OSR prolog duplication into common prolog generation and updates zero-init call signature.
src/coreclr/jit/codegen.h | Declares genDuplicateTier0Prolog and updates genZeroInitFrameUsingBlockInit signature.
src/coreclr/jit/codegenarm64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register.
src/coreclr/jit/codegenarm.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register.
src/coreclr/jit/codegenloongarch64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register.
src/coreclr/jit/codegenriscv64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register.
src/coreclr/jit/async.cpp | Adjusts async continuation reuse and OSR IL-offset handling (skipped on x64 due to direct OSR entry).
src/coreclr/inc/patchpointinfo.h | Adds m_transitionPrologOffset to PatchpointInfo and accessors.

Comment on lines 14946 to +14948
PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
NativeCodeVersion ncv = config->GetCodeVersion();
#ifndef TARGET_AMD64

Copilot AI Apr 11, 2026

On TARGET_AMD64 the #ifndef TARGET_AMD64 excludes the if (ncv.GetOptimizationTier() == ...Tier1OSR) block, but config/ncv are still declared and become unused. CoreCLR builds commonly treat unused locals as warnings-as-errors, so this will likely break amd64 builds. Consider moving the PrepareCodeConfig* config and NativeCodeVersion ncv declarations inside the #ifndef TARGET_AMD64 block (or explicitly marking them unused on amd64).

Suggested change
- PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
- NativeCodeVersion ncv = config->GetCodeVersion();
- #ifndef TARGET_AMD64
+ #ifndef TARGET_AMD64
+ PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
+ NativeCodeVersion ncv = config->GetCodeVersion();

Comment on lines +9918 to +9923
PatchpointInfo* ppi = m_compiler->info.compPatchpointInfo;
regMaskTP tier0CalleeSaves = (regMaskTP)ppi->CalleeSaveRegisters();
tier0CalleeSaves &= ~RBM_FPBASE;
// On x64 only integer registers are part of this set.
// TODO: Isn't this a bug? How does the OSR method restore float registers saved by tier0?
assert((tier0CalleeSaves & RBM_ALLINT) == tier0CalleeSaves);

Copilot AI Apr 11, 2026

PatchpointInfo::CalleeSaveRegisters() can include XMM6–XMM15 on Windows amd64 (see RBM_FLT_CALLEE_SAVED in targetamd64.h), since it is set from rsGetModifiedCalleeSavedRegsMask(). The new assert that the tier0 callee-saves mask is RBM_ALLINT can therefore fire, and even if it doesn’t, the duplicated prolog currently only emits integer pushes, so it can’t faithfully reproduce tier0 prolog behavior when tier0 saved float callee-saves. Either restrict what gets recorded in PatchpointInfo to integer callee-saves (and assert that invariant when recording), or extend the duplication logic to also account for float callee-saves.
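
To illustrate the invariant in question, here is a minimal sketch of an "integer registers only" mask check. The bit layout (integer registers in bits 0..15, float registers in bits 16..31) and every name below are assumptions made for the example; they are not the JIT's actual regMaskTP encoding or RBM_* constants.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration only: bits 0..15 integer, bits 16..31 float.
using RegMaskSketch = uint64_t;

constexpr RegMaskSketch kAllIntRegs   = 0x000000000000FFFFull;
constexpr RegMaskSketch kAllFloatRegs = 0x00000000FFFF0000ull;

constexpr RegMaskSketch IntReg(unsigned index)
{
    return RegMaskSketch{1} << index;
}

constexpr RegMaskSketch FloatReg(unsigned index)
{
    return RegMaskSketch{1} << (16 + index);
}

// Same shape as the assert under review: true only if no non-integer bit is set.
constexpr bool IsIntegerOnly(RegMaskSketch mask)
{
    return (mask & kAllIntRegs) == mask;
}

// The float callee saves that a duplicated prolog would additionally have to handle.
constexpr RegMaskSketch FloatPart(RegMaskSketch mask)
{
    return mask & kAllFloatRegs;
}
```

Under this layout, a mask that includes a float callee save fails the check, which is the situation the comment above warns about.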

Comment on lines 5058 to +5063
}
#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#elif defined(TARGET_AMD64)
if (m_compiler->opts.IsOSR())
{
genDuplicateTier0Prolog();
}

Copilot AI Apr 11, 2026

The comment just above the OSR prolog handling still says “x64 handles this differently; … emitted in genOSRRecordTier0CalleeSavedRegistersAndFrame”, but that helper was removed and the new behavior is genDuplicateTier0Prolog(). Please update the comment so it matches the new implementation (otherwise it’s misleading for future maintenance).

@am11
Member

am11 commented Apr 11, 2026

@jakobbotsch, just curious, are you planning to eliminate the use of VirtualUnwindToFirstManagedCallFrame/RtlVirtualUnwind from this path as well? (It was my main motivation for touching this code 🙂)

@jakobbotsch
Member Author

@jakobbotsch, just curious, are you planning to eliminate the use of VirtualUnwindToFirstManagedCallFrame/RtlVirtualUnwind from this path as well? (It was my main motivation for touching this code 🙂)

My current plan is to fix the runtime async perf issue first and foremost. Once done it should not be that much work to do that refactoring (and most of the work, with GT_PATCHPOINT, should already be in your PR). If you'd like to take it from there at that point I definitely would not be opposed, otherwise I will probably submit it as a follow-up when I get to it.

Copilot AI review requested due to automatic review settings April 11, 2026 17:01
@jakobbotsch
Member Author

Once done it should not be that much work

I suppose it will be a lot of work to update loongarch64/RISCV64 to the same scheme as x64/arm64. I might take a look, and the answer probably depends on how much work that is.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Comment on lines 40 to +46
void Initialize(uint32_t localCount, int32_t totalFrameSize)
{
m_calleeSaveRegisters = 0;
m_tier0Version = 0;
m_totalFrameSize = totalFrameSize;
m_numberOfLocals = localCount;
m_transitionPrologOffset = -1;

Copilot AI Apr 11, 2026

PatchpointInfo::Initialize sets several members to sentinel values, but it does not initialize m_asyncExecutionContextOffset / m_asyncSynchronizationContextOffset (the corresponding HasAsync*Offset() checks assume -1 means "absent"). If these fields are not explicitly set later, they can contain garbage and incorrectly appear present. Initialize them to -1 here (along with the other offsets).
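
A minimal sketch of the sentinel pattern being asked for, assuming -1 means "absent" for every optional offset. `PatchpointInfoSketch` is a hypothetical, trimmed-down stand-in; the member names follow this comment, but the real class has more fields and a different layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for PatchpointInfo, reduced to the fields discussed here.
class PatchpointInfoSketch
{
public:
    void Initialize(uint32_t localCount, int32_t totalFrameSize)
    {
        m_totalFrameSize = totalFrameSize;
        m_numberOfLocals = localCount;

        // Every optional offset starts out as "absent" so no Has*() check can read garbage.
        m_transitionPrologOffset            = kAbsent;
        m_asyncExecutionContextOffset       = kAbsent;
        m_asyncSynchronizationContextOffset = kAbsent;
    }

    bool HasTransitionPrologOffset() const { return m_transitionPrologOffset != kAbsent; }
    bool HasAsyncExecutionContextOffset() const { return m_asyncExecutionContextOffset != kAbsent; }
    bool HasAsyncSynchronizationContextOffset() const { return m_asyncSynchronizationContextOffset != kAbsent; }

    void SetAsyncExecutionContextOffset(int32_t offset)
    {
        assert(offset != kAbsent); // the sentinel itself is not a valid offset
        m_asyncExecutionContextOffset = offset;
    }

    int32_t AsyncExecutionContextOffset() const
    {
        assert(HasAsyncExecutionContextOffset());
        return m_asyncExecutionContextOffset;
    }

private:
    static constexpr int32_t kAbsent = -1;

    int32_t  m_totalFrameSize;
    uint32_t m_numberOfLocals;
    int32_t  m_transitionPrologOffset;
    int32_t  m_asyncExecutionContextOffset;
    int32_t  m_asyncSynchronizationContextOffset;
};
```

After Initialize, every Has*() query reports false until the corresponding setter runs, regardless of what the underlying memory contained before.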

Comment on lines 54 to 59
void Copy(const PatchpointInfo* original)
{
m_calleeSaveRegisters = original->m_calleeSaveRegisters;
m_tier0Version = original->m_tier0Version;
m_transitionPrologOffset = original->m_transitionPrologOffset;
m_genericContextArgOffset = original->m_genericContextArgOffset;

Copilot AI Apr 11, 2026

PatchpointInfo::Copy only copies a subset of fields (it currently omits at least m_totalFrameSize, m_numberOfLocals, and the async execution/synchronization context offsets). This can produce an internally inconsistent copy (e.g., NumberOfLocals()/TotalFrameSize() not matching the copied offset data). Copy all scalar members, including the async offsets, before copying the variable-length m_offsetAndExposureData.

int const tier0FrameSize = patchpointInfo->TotalFrameSize() + REGSIZE_BYTES;
int const tier0NetSize = tier0FrameSize - tier0IntCalleeSaveUsedSize;
m_compiler->unwindAllocStack(tier0NetSize);
unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();

Copilot AI Apr 11, 2026

genDuplicateTier0Prolog derives localFrameSize directly from compPatchpointInfo->TotalFrameSize(). On amd64, Compiler::generatePatchpointInfo currently records TotalFrameSize as genTotalFrameSize() + TARGET_POINTER_SIZE to account for the "pseudo return address" slot used by the OSR transition. With this PR the transition no longer adjusts SP in the VM helper, and the misalignment slot is now created via a push in the OSR prolog instead. Unless TotalFrameSize semantics were updated elsewhere, this will cause the duplicated tier0 prolog to allocate an extra pointer-sized slot and miscompute the tier0 frame layout. Consider subtracting the pseudo-call slot here (or updating TotalFrameSize generation) so the duplicated prolog matches the actual tier0 frame size at the transition point.

Suggested change
  unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();
+ #ifdef TARGET_AMD64
+ // PatchpointInfo records an extra pointer-sized pseudo return-address slot for the OSR transition.
+ // The duplicated tier0 prolog must allocate only the actual tier0 frame size here.
+ assert(totalFrameSize >= REGSIZE_BYTES);
+ totalFrameSize -= REGSIZE_BYTES;
+ #endif

Comment on lines +1286 to +1288
if (patchpointInfo->TransitionPrologOffset() != -1)
{
offset = patchpointInfo->TransitionPrologOffset();

Copilot AI Apr 11, 2026

GetOSRTransitionAddress treats a missing TransitionPrologOffset() (default -1) as offset 0, which would transition into the OSR method entrypoint. On amd64 OSR methods now begin with a duplicated tier0 prolog; entering at offset 0 from a tier0->OSR transition would re-run those stack adjustments on top of an existing tier0 frame and can corrupt the stack/unwind state. It seems safer to treat TransitionPrologOffset() == -1 as an error (e.g., return NULL and invalidate the patchpoint / log) rather than falling back to 0.

Suggested change
- if (patchpointInfo->TransitionPrologOffset() != -1)
- {
- offset = patchpointInfo->TransitionPrologOffset();
+ offset = patchpointInfo->TransitionPrologOffset();
+ if (offset == -1)
+ {
+ return NULL;

@AndyAyersMS
Member

I haven't seen any cases where the arm64 phantom unwind causes actual problems.

Would it be simpler to check the "continue in OSR" bit at the end of the Tier0 prolog and then jump to the OSR entry point (with suitable pains to adjust SP when needed) instead of calling the patchpoint helper?
