Refactor OSR method prologs to duplicate tier0 prolog at the beginning #126791
jakobbotsch wants to merge 13 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR refactors OSR method prolog generation so OSR methods begin with a duplicated Tier0 prolog (currently implemented for x64), enabling correct/robust unwinding from OSR prologs and allowing runtime async resumption to enter OSR code directly without routing through Tier0.
Changes:
- Emit a duplicated Tier0 prolog at the start of x64 OSR methods and record an OSR "transition prolog offset" in the per-OSR-method PatchpointInfo.
- Update VM patchpoint transition logic to jump into the OSR method at the recorded transition offset (skipping the duplicated Tier0 prolog).
- Refactor frame zero-init helpers to accept an explicit base register, and adjust async transformation/resumption logic to account for the new x64 OSR entry behavior.
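The zero-init refactoring in the list above can be sketched as follows. This is an illustrative model, not the real JIT helper: the actual `genZeroInitFrameUsingBlockInit` emits machine code, and the parameter names here are hypothetical. The point is the shape of the change — the helper no longer assumes a fixed base (the frame pointer) but takes it explicitly, so the x64 OSR path can zero tier0 locals relative to a different register.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy stand-in for a stack frame.
struct Frame
{
    uint8_t bytes[64];
};

// Hypothetical model of the refactored helper: 'base' is now an explicit
// parameter instead of an implied frame pointer; blockLo/blockHi are offsets
// relative to 'base' (names are illustrative, not the real JIT signature).
void zeroInitBlock(uint8_t* base, int blockLo, int blockHi)
{
    std::memset(base + blockLo, 0, blockHi - blockLo);
}
```

A caller that previously always passed the frame pointer can now pass whatever register addresses the tier0 portion of the combined OSR frame.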
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/coreclr/vm/jitinterface.cpp | Adjusts async resumption stub behavior around OSR tiers (x64 vs non-x64). |
| src/coreclr/vm/jithelpers.cpp | Computes OSR transition address using per-method PatchpointInfo and updates patchpoint transition context handling. |
| src/coreclr/jit/compiler.h | Adds/normalizes default initialization for several Compiler::info fields. |
| src/coreclr/jit/compiler.cpp | Removes now-redundant explicit resets/initializations relying on new defaults. |
| src/coreclr/jit/codegenxarch.cpp | Implements x64 OSR duplicated Tier0 prolog emission and records transition offset in PatchpointInfo. |
| src/coreclr/jit/codegencommon.cpp | Wires x64 OSR prolog duplication into common prolog generation and updates zero-init call signature. |
| src/coreclr/jit/codegen.h | Declares genDuplicateTier0Prolog and updates genZeroInitFrameUsingBlockInit signature. |
| src/coreclr/jit/codegenarm64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register. |
| src/coreclr/jit/codegenarm.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register. |
| src/coreclr/jit/codegenloongarch64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register. |
| src/coreclr/jit/codegenriscv64.cpp | Updates genZeroInitFrameUsingBlockInit to accept a base register. |
| src/coreclr/jit/async.cpp | Adjusts async continuation reuse and OSR IL-offset handling (skipped on x64 due to direct OSR entry). |
| src/coreclr/inc/patchpointinfo.h | Adds m_transitionPrologOffset to PatchpointInfo and accessors. |
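The new `m_transitionPrologOffset` field can be pictured with a simplified sketch (the real `PatchpointInfo` in patchpointinfo.h has many more members and a different layout; this only models the sentinel convention the PR describes, where -1 means no transition prolog was recorded):

```cpp
#include <cassert>
#include <cstdint>

// Simplified sketch of the new field and its accessors; -1 is the
// "not recorded" sentinel, and a non-negative value is the code offset
// at which the duplicated tier0 prolog ends and the OSR transition begins.
struct PatchpointInfoSketch
{
    int32_t m_transitionPrologOffset = -1;

    int32_t TransitionPrologOffset() const
    {
        return m_transitionPrologOffset;
    }

    void SetTransitionPrologOffset(int32_t offset)
    {
        m_transitionPrologOffset = offset;
    }
};
```

The VM side then adds this offset to the OSR method entrypoint to skip the duplicated tier0 prolog when transitioning from a live tier0 frame.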
```cpp
PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
NativeCodeVersion ncv = config->GetCodeVersion();
#ifndef TARGET_AMD64
```
On TARGET_AMD64 the #ifndef TARGET_AMD64 excludes the if (ncv.GetOptimizationTier() == ...Tier1OSR) block, but config/ncv are still declared and become unused. CoreCLR builds commonly treat unused locals as warnings-as-errors, so this will likely break amd64 builds. Consider moving the PrepareCodeConfig* config and NativeCodeVersion ncv declarations inside the #ifndef TARGET_AMD64 block (or explicitly marking them unused on amd64).
Suggested change:

```cpp
#ifndef TARGET_AMD64
PrepareCodeConfig* config = GetThread()->GetCurrentPrepareCodeConfig();
NativeCodeVersion ncv = config->GetCodeVersion();
```
```cpp
PatchpointInfo* ppi            = m_compiler->info.compPatchpointInfo;
regMaskTP       tier0CalleeSaves = (regMaskTP)ppi->CalleeSaveRegisters();
tier0CalleeSaves &= ~RBM_FPBASE;
// On x64 only integer registers are part of this set.
// TODO: Isn't this a bug? How does the OSR method restore float registers saved by tier0?
assert((tier0CalleeSaves & RBM_ALLINT) == tier0CalleeSaves);
```
PatchpointInfo::CalleeSaveRegisters() can include XMM6–XMM15 on Windows amd64 (see RBM_FLT_CALLEE_SAVED in targetamd64.h), since it is set from rsGetModifiedCalleeSavedRegsMask(). The new assert that the tier0 callee-saves mask is RBM_ALLINT can therefore fire, and even if it doesn’t, the duplicated prolog currently only emits integer pushes, so it can’t faithfully reproduce tier0 prolog behavior when tier0 saved float callee-saves. Either restrict what gets recorded in PatchpointInfo to integer callee-saves (and assert that invariant when recording), or extend the duplication logic to also account for float callee-saves.
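The first option the review suggests — restricting the recorded mask to integer callee-saves at recording time — can be sketched like this. The mask constants below are illustrative placeholders, not the real `RBM_*` values from targetamd64.h:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative register masks (NOT the real RBM_* encodings): low bits stand
// in for integer callee-saves, high bits for XMM6-XMM15 on Windows amd64.
constexpr uint64_t RBM_ALLINT_SKETCH          = 0x0000FFFFull;
constexpr uint64_t RBM_FLT_CALLEE_SAVED_SKETCH = 0xFFFF0000ull;

// Sketch of the suggested fix: filter out float callee-saves before recording
// the mask in PatchpointInfo, so the integer-only pushes emitted by the
// duplicated tier0 prolog match what the recorded mask promises.
uint64_t recordIntegerCalleeSaves(uint64_t modifiedCalleeSaves)
{
    uint64_t recorded = modifiedCalleeSaves & RBM_ALLINT_SKETCH;
    // Invariant asserted at record time instead of at prolog-duplication time.
    assert((recorded & RBM_FLT_CALLEE_SAVED_SKETCH) == 0);
    return recorded;
}
```

The alternative fix (extending the duplication logic to also save/restore float callee-saves) would instead keep the full mask and emit the corresponding XMM spills in the duplicated prolog.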
```cpp
}
#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#elif defined(TARGET_AMD64)
    if (m_compiler->opts.IsOSR())
    {
        genDuplicateTier0Prolog();
    }
```
The comment just above the OSR prolog handling still says “x64 handles this differently; … emitted in genOSRRecordTier0CalleeSavedRegistersAndFrame”, but that helper was removed and the new behavior is genDuplicateTier0Prolog(). Please update the comment so it matches the new implementation (otherwise it’s misleading for future maintenance).
@jakobbotsch, just curious, are you planning to eliminate the usage of …
My current plan is to fix the runtime async perf issue first and foremost. Once done it should not be that much work to do that refactoring (and most of the work, with …
I suppose it will be a lot of work to update loongarch64/riscv64 to the same scheme as x64/arm64. I might take a look, and the answer probably depends on how much work that is.
```cpp
void Initialize(uint32_t localCount, int32_t totalFrameSize)
{
    m_calleeSaveRegisters   = 0;
    m_tier0Version          = 0;
    m_totalFrameSize        = totalFrameSize;
    m_numberOfLocals        = localCount;
    m_transitionPrologOffset = -1;
```
PatchpointInfo::Initialize sets several members to sentinel values, but it does not initialize m_asyncExecutionContextOffset / m_asyncSynchronizationContextOffset (the corresponding HasAsync*Offset() checks assume -1 means "absent"). If these fields are not explicitly set later, they can contain garbage and incorrectly appear present. Initialize them to -1 here (along with the other offsets).
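The suggested fix can be sketched as below. Field and accessor names follow the review comment; the real `PatchpointInfo` has additional members and this is only a model of the sentinel convention:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the reviewer's suggested Initialize(): every optional offset gets
// the -1 "absent" sentinel, including the async context offsets that the
// original left uninitialized.
struct PatchpointInfoInit
{
    uint32_t m_numberOfLocals;
    int32_t  m_totalFrameSize;
    uint64_t m_calleeSaveRegisters;
    uint64_t m_tier0Version;
    int32_t  m_transitionPrologOffset;
    int32_t  m_asyncExecutionContextOffset;
    int32_t  m_asyncSynchronizationContextOffset;

    void Initialize(uint32_t localCount, int32_t totalFrameSize)
    {
        m_calleeSaveRegisters               = 0;
        m_tier0Version                      = 0;
        m_totalFrameSize                    = totalFrameSize;
        m_numberOfLocals                    = localCount;
        m_transitionPrologOffset            = -1;
        m_asyncExecutionContextOffset       = -1; // was left uninitialized
        m_asyncSynchronizationContextOffset = -1; // was left uninitialized
    }

    bool HasAsyncExecutionContextOffset() const
    {
        return m_asyncExecutionContextOffset != -1;
    }
};
```

With the sentinels set, `HasAsync*Offset()` reliably reports "absent" until the JIT explicitly records an offset.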
```cpp
void Copy(const PatchpointInfo* original)
{
    m_calleeSaveRegisters    = original->m_calleeSaveRegisters;
    m_tier0Version           = original->m_tier0Version;
    m_transitionPrologOffset = original->m_transitionPrologOffset;
    m_genericContextArgOffset = original->m_genericContextArgOffset;
```
PatchpointInfo::Copy only copies a subset of fields (it currently omits at least m_totalFrameSize, m_numberOfLocals, and the async execution/synchronization context offsets). This can produce an internally inconsistent copy (e.g., NumberOfLocals()/TotalFrameSize() not matching the copied offset data). Copy all scalar members, including the async offsets, before copying the variable-length m_offsetAndExposureData.
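A sketch of the corrected `Copy` per the review — copy every scalar member before the variable-length tail. Member names follow the review comment; the real type differs and the variable-length copy is elided:

```cpp
#include <cassert>
#include <cstdint>

// Model of the suggested Copy() fix: all scalar members are copied, including
// the frame size, local count, and async offsets that the original omitted.
struct PpiCopySketch
{
    uint64_t m_calleeSaveRegisters;
    uint64_t m_tier0Version;
    int32_t  m_transitionPrologOffset;
    int32_t  m_genericContextArgOffset;
    int32_t  m_totalFrameSize;
    uint32_t m_numberOfLocals;
    int32_t  m_asyncExecutionContextOffset;
    int32_t  m_asyncSynchronizationContextOffset;

    void Copy(const PpiCopySketch* original)
    {
        m_calleeSaveRegisters    = original->m_calleeSaveRegisters;
        m_tier0Version           = original->m_tier0Version;
        m_transitionPrologOffset = original->m_transitionPrologOffset;
        m_genericContextArgOffset = original->m_genericContextArgOffset;
        m_totalFrameSize         = original->m_totalFrameSize; // previously omitted
        m_numberOfLocals         = original->m_numberOfLocals; // previously omitted
        m_asyncExecutionContextOffset       = original->m_asyncExecutionContextOffset;
        m_asyncSynchronizationContextOffset = original->m_asyncSynchronizationContextOffset;
        // ...then copy the variable-length m_offsetAndExposureData (elided here).
    }
};
```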
```cpp
int const tier0FrameSize = patchpointInfo->TotalFrameSize() + REGSIZE_BYTES;
int const tier0NetSize   = tier0FrameSize - tier0IntCalleeSaveUsedSize;
m_compiler->unwindAllocStack(tier0NetSize);

unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();
```
genDuplicateTier0Prolog derives localFrameSize directly from compPatchpointInfo->TotalFrameSize(). On amd64, Compiler::generatePatchpointInfo currently records TotalFrameSize as genTotalFrameSize() + TARGET_POINTER_SIZE to account for the "pseudo return address" slot used by the OSR transition. With this PR the transition no longer adjusts SP in the VM helper, and the misalignment slot is now created via a push in the OSR prolog instead. Unless TotalFrameSize semantics were updated elsewhere, this will cause the duplicated tier0 prolog to allocate an extra pointer-sized slot and miscompute the tier0 frame layout. Consider subtracting the pseudo-call slot here (or updating TotalFrameSize generation) so the duplicated prolog matches the actual tier0 frame size at the transition point.
Suggested change:

```cpp
unsigned totalFrameSize = (unsigned)m_compiler->info.compPatchpointInfo->TotalFrameSize();
#ifdef TARGET_AMD64
// PatchpointInfo records an extra pointer-sized pseudo return-address slot for the OSR transition.
// The duplicated tier0 prolog must allocate only the actual tier0 frame size here.
assert(totalFrameSize >= REGSIZE_BYTES);
totalFrameSize -= REGSIZE_BYTES;
#endif
```
src/coreclr/vm/jithelpers.cpp
```cpp
if (patchpointInfo->TransitionPrologOffset() != -1)
{
    offset = patchpointInfo->TransitionPrologOffset();
```
GetOSRTransitionAddress treats a missing TransitionPrologOffset() (default -1) as offset 0, which would transition into the OSR method entrypoint. On amd64 OSR methods now begin with a duplicated tier0 prolog; entering at offset 0 from a tier0->OSR transition would re-run those stack adjustments on top of an existing tier0 frame and can corrupt the stack/unwind state. It seems safer to treat TransitionPrologOffset() == -1 as an error (e.g., return NULL and invalidate the patchpoint / log) rather than falling back to 0.
Suggested change:

```cpp
offset = patchpointInfo->TransitionPrologOffset();
if (offset == -1)
{
    return NULL;
```
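The hardened lookup the review asks for can be modeled as a small free function. The name and signature are illustrative, not the real VM helper; the point is that -1 becomes a hard failure instead of a silent fallback to offset 0 (which would re-run the duplicated tier0 prolog on top of a live tier0 frame):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the suggested behavior: a missing transition prolog offset is an
// error the caller must handle (e.g. invalidate the patchpoint and log),
// never an implicit transition to the OSR entrypoint at offset 0.
const uint8_t* getOSRTransitionAddress(const uint8_t* osrEntry, int32_t transitionPrologOffset)
{
    if (transitionPrologOffset == -1)
    {
        return nullptr; // caller invalidates the patchpoint / logs the failure
    }
    return osrEntry + transitionPrologOffset;
}
```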
I haven't seen any cases where the arm64 phantom unwind causes actual problems. Would it be simpler to check the "continue in OSR" bit at the end of the Tier0 prolog and then jump to the OSR entry point (with suitable pains to adjust SP when needed) instead of calling the patchpoint helper?
This changes how OSR methods have their prologs generated. OSR methods now begin with a prolog that emulates the stack frame changes made by the tier0 prolog, followed by the existing OSR prolog that assumes the tier0 stack frame changes have already executed. The idea is to handle two separate problems:
With this change:
The transition offset is recorded in the `PatchpointInfo` instance that the OSR method now produces. This should solve unwinding of a partially executed prolog: the unwinder now believes that the initial part of the prolog corresponding to the tier0 prolog has executed, and hence will partially unwind it as normal, even if we've just transitioned.

Currently this is implemented only for x64. For arm64, some changes are needed to first restore callee saves from the tier0 part of the frame.
Example:
First the tier0 prolog:
The OSR method then starts with a prolog that is mostly similar (though not always identical):
Another example:
Tier 0:
Perf of this last benchmark:
Before: Took 6156.5 ms
After: Took 484.6 ms
FYI @AndyAyersMS