[QDP] Close AMD-vs-CUDA encoder parity: add iqp, iqp-z, phase #1292
Open
ryankert01 wants to merge 2 commits into apache:main from
Conversation
force-pushed from dbf0f0e to f56b31d
Contributor
Pull request overview
Adds missing encoding-method parity on the Triton AMD backend so the public QdpEngine(backend="amd") supports the same method strings as the CUDA backend.
Changes:
- Implement iqp, iqp-z, and phase in TritonAmdEngine and dispatch them through encode(...).
- Add ROCm-marked tests for parity vs qumat_qdp.torch_ref.iqp_encode and a local phase reference, plus validation/normalization checks.
- Update README + Triton AMD backend docs to reflect expanded method support.
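The method-string dispatch described above can be pictured with a minimal sketch. All names below are illustrative, not the actual `TritonAmdEngine` API; the point is only the string-to-handler routing and the error-message enumeration the tests cover:

```python
# Hypothetical sketch of encode(method=...) dispatch; not the real engine code.
class EngineSketch:
    def __init__(self):
        # one handler per supported method string
        self._methods = {
            "amplitude": self._encode_amplitude,
            "angle": self._encode_angle,
            "basis": self._encode_basis,
            "iqp": self._encode_iqp,
            "iqp-z": self._encode_iqp_z,
            "phase": self._encode_phase,
        }

    def encode(self, data, method="amplitude"):
        try:
            handler = self._methods[method]
        except KeyError:
            # enumerate the valid methods in the error, as the tests check
            raise ValueError(
                f"unknown method {method!r}; expected one of {sorted(self._methods)}"
            ) from None
        return handler(data)

    # stub handlers so the sketch runs
    def _encode_amplitude(self, data): return ("amplitude", data)
    def _encode_angle(self, data): return ("angle", data)
    def _encode_basis(self, data): return ("basis", data)
    def _encode_iqp(self, data): return ("iqp", data)
    def _encode_iqp_z(self, data): return ("iqp-z", data)
    def _encode_phase(self, data): return ("phase", data)
```

With this shape, adding the three new methods is one dictionary entry plus one implementation each, and every backend exposes the same `encode(method=...)` surface.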
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| qdp/qdp-python/qumat_qdp/triton_amd.py | Adds encode_iqp/encode_phase implementations and method dispatch. |
| qdp/qdp-python/tests/test_triton_amd_backend.py | Adds new AMD backend tests for iqp/iqp-z/phase and routing/error-message coverage. |
| qdp/qdp-python/TRITON_AMD_BACKEND.md | Updates supported-methods list and test description. |
| qdp/qdp-python/README.md | Updates encoding-method table and backend support boundary. |
ryankert01 added a commit to ryankert01/mahout that referenced this pull request on Apr 25, 2026
Addresses Copilot review on PR apache#1292 and pushes general kernel-level optimization across all six AMD encoders.

PR review responses:

- Drop the unreachable `test_triton_amd_iqp_cuda_reference_optional` (decorator required `torch.version.cuda` while body required `is_triton_amd_available()` → mutually exclusive). Replace with a meaningful float64 IQP precision contract test that actually runs.
- Qualify README about the CUDA-tensor `phase` limitation: the Python extension's CUDA-tensor allowlist (`CUDA_ENCODING_METHODS`) does not yet include `phase`, so cuda-resident torch tensors must `.cpu()` first. Tracked as a follow-up.
- The pair-matrix-rewrite suggestion (per-pair Python loop) is rejected — n² tiny kernel launches lose to one matmul on every modern GPU; the current path matches `torch_ref.iqp_encode` and the CUDA FWT phase kernel. Add a `_IQP_PAIR_MATRIX_MAX_N` guard that *does* fall back to a pair loop past n=20 (where the (2^n × n_pairs) workspace dominates HBM), so the OOM scenario is bounded.

Encoder optimizations (verified on MI300X vs `qumat_qdp.torch_ref`, batch=64, fp32 input):

|           | q=8   | q=12  | q=16  |
|-----------|-------|-------|-------|
| amplitude | 0.95× | 0.95× | 1.00× |
| angle     | 1.57× | 1.37× | 1.04× |
| basis     | 2.18× | 2.10× | 2.14× |
| iqp (ZZ)  | 1.96× | 1.81× | 1.14× |
| iqp-z     | 1.35× | 1.32× | 0.91× |
| **phase** | **5.29×** | **5.39×** | **5.30×** |

What changed:

- **Real `@triton.jit` phase kernel** (fp32 / n ≤ 32). One HIP kernel fuses bit-pattern materialization + θ(b) accumulation + cos/sin + 1/√2^n scaling + complex-pack, writing the output buffer interleaved via `view_as_real`. The PyTorch fallback path (used at fp64 or n > 32) was making 5 intermediate (B, S) allocations; the kernel makes one.
- **Per-engine bits-table cache** (`_bits_cache`): the `((idx >> arange(n)) & 1).to(real)` table was being rebuilt on every call by `angle`/`iqp`/`phase`. Now cached per (n, dtype). At n=16 that's a ~4 MiB int64 + ~4 MiB real allocation saved per call.
- **Pair-index cache** (`_pair_cache`): `torch.combinations(arange(n))` cached per n.
- **`encode_amplitude`**: replaced `torch.complex(amp, zeros_like(amp)).to(complex_dtype)` with a single `amp.to(complex_dtype)` (writes (real, 0) interleaved in one kernel vs. zeros_like + complex_pack + cast = three).
- **`encode_angle`**: collapsed the n-step Python product loop (which reallocated a (B, S) tensor per qubit) into a single `where(bits, sin, cos).prod(dim=2)` reduction.
- **`encode_iqp`**: in-place n-stage Walsh-Hadamard butterfly using a single `(B, S/2)` scratch buffer. The previous `cat([lo+hi, lo-hi], dim=2)` allocated a fresh (B, S) tensor every stage; now `sub(out=scratch); a.add_(b); b.copy_(scratch)` reuses one workspace across all n stages. Also packs `f` via `torch.complex(cos, sin)` in one shot rather than writing to strided `.real`/`.imag`.
- **`_to_2d` fast path**: skip `as_tensor` + `.contiguous()` work when the caller already supplies a 2-D, contiguous, on-device, correctly-typed torch tensor (the common case for benchmarks).

Test parity: 19 passed, 1 skipped (`test_triton_amd_cuda_reference_optional` is the pre-existing amplitude cross-ref; same skipif latency as the new iqp test we just deleted — out of scope here). Numerical parity: all encoders still match `torch_ref` / `_torch_phase_ref` within float-rounding tolerance; the IQP fp64 contract test confirms `atol=1e-12`.
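The bits-table, the angle-encoding collapse, and the in-place Walsh-Hadamard butterfly described above can be sketched in NumPy as a stand-in for the torch code. The shapes and the single `(B, S/2)` scratch buffer follow the commit text; function names and the exact angle convention are illustrative assumptions:

```python
import numpy as np

def bits_table(n):
    # ((idx >> arange(n)) & 1): bit k of each basis index, shape (2**n, n).
    # In the engine this table is cached per (n, dtype) in _bits_cache.
    return (np.arange(2**n)[:, None] >> np.arange(n)) & 1

def angle_product(thetas):
    # where(bits, sin, cos).prod collapse of the per-qubit Python loop:
    # amp[b, s] = prod_k (sin(theta_k) if bit_k(s) else cos(theta_k)).
    # The angle convention (no half-angle here) is an assumption.
    bits = bits_table(thetas.shape[1])[None, :, :]   # (1, S, n)
    s = np.sin(thetas)[:, None, :]                   # (B, 1, n)
    c = np.cos(thetas)[:, None, :]
    return np.where(bits == 1, s, c).prod(axis=2)    # (B, S)

def fwt_inplace(f):
    # n-stage Walsh-Hadamard butterfly over the last axis (S = 2**n),
    # reusing one (B, S/2) scratch buffer instead of allocating per stage.
    B, S = f.shape
    scratch = np.empty((B, S // 2), dtype=f.dtype)
    h = 1
    while h < S:
        v = f.reshape(B, S // (2 * h), 2, h)
        lo, hi = v[:, :, 0, :], v[:, :, 1, :]
        sc = scratch.reshape(B, S // (2 * h), h)
        np.subtract(lo, hi, out=sc)  # scratch = lo - hi
        lo += hi                     # lo <- lo + hi, in place
        hi[...] = sc                 # hi <- old lo - old hi
        h *= 2
    return f
```

The butterfly result equals a matmul with the Hadamard matrix `H[i, j] = (-1)**popcount(i & j)` but touches only one extra `(B, S/2)` workspace, which is the allocation win the commit describes.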
ryankert01 added a commit to ryankert01/mahout that referenced this pull request on Apr 27, 2026
force-pushed from 2e5cb60 to 0ff8115
CUDA QdpEngine accepts amplitude, angle, basis, iqp, iqp-z, and phase.
The Triton AMD path only implemented the first three, so AMD users hit
a hard error on the IQP- and phase-family encodings (e.g. SVHN-IQP).
This adds vectorized PyTorch implementations for the missing methods on
TritonAmdEngine, dispatched through the same ``encode(method=...)``
contract:
- ``iqp`` — full ZZ entanglement: phase = Σ x_i·data_i + Σ_{i<j} x_i x_j·data_ij,
followed by an n-stage Walsh-Hadamard butterfly and 1/2^n scaling.
- ``iqp-z`` — Z-only diagonal: same FWT path with no ZZ pairs.
- ``phase`` — per-qubit product state (1/√2^n)·exp(i·Σ_k phases_k·b_k).
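The ``phase`` product state above has a direct closed form, which a pure-NumPy stand-in for the local pure-torch reference makes concrete (the function name is illustrative):

```python
import numpy as np

def phase_encode_ref(phases):
    # state[b] = (1 / sqrt(2**n)) * exp(i * sum_k phases[k] * b_k),
    # where b_k is bit k of basis index b.
    phases = np.asarray(phases, dtype=np.float64)
    n = phases.shape[-1]
    bits = (np.arange(2**n)[:, None] >> np.arange(n)) & 1   # (2**n, n)
    theta = bits @ phases                                    # phase per basis state
    return np.exp(1j * theta) / np.sqrt(2.0**n)
```

Every amplitude has magnitude 2^(-n/2), so the state is unit-norm by construction; that is what the unit-norm structural checks assert.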
Parity tests added against ``qumat_qdp.torch_ref.iqp_encode`` (which is
already validated against CUDA upstream) and a local pure-torch phase
reference. Also added unit-norm structural checks, param-count
validation, float64 precision contract, and a router test that the
public ``QdpEngine(backend="amd")`` accepts the new methods.
Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0): full triton_amd test file is 18 passed, 2 skipped
(NVIDIA CUDA-only references).
force-pushed from 0ff8115 to e70cb58
Follow-up to #1158 / #1289. Closes the AMD-vs-CUDA encoder gap on
`TritonAmdEngine` and takes a kernel-level optimization pass over all six
encoders.

What changed

Parity — added `iqp` (full ZZ), `iqp-z` (Z-only), and `phase` to
`TritonAmdEngine.encode()`. Algorithms mirror `qumat_qdp.torch_ref` and the
CUDA kernels in `qdp-kernels/src/{iqp,phase}.cu`.

Performance — verified on MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0), batch=64, fp32, vs `qumat_qdp.torch_ref`. Key wins:

- `@triton.jit` phase kernel (fp32, n ≤ 32): one HIP launch fuses bit-unpack +
  θ(b) + cos/sin + 1/√2^n + complex-pack via `view_as_real`. Replaces 5
  PyTorch intermediate allocations with one.
- `(B, S/2)` scratch buffer instead of allocating a fresh `(B, S)` per stage
  via `cat`.
- `((idx >> arange(n)) & 1)` no longer rebuilt on every call.
- `amp.to(complex_dtype)` for amplitude/angle (one kernel) instead of
  `complex(amp, zeros_like(amp))` (three).
- `_to_2d` fast path for already-correct torch tensors.

Tests

11 new tests in `qdp/qdp-python/tests/test_triton_amd_backend.py`:
torch_ref parity for iqp/iqp-z/phase, param-count validation, unit-norm
checks, fp64 precision contracts, error-message enumeration, and a
unified-router smoke test.
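The fp64 precision contract can be illustrated with a self-contained sketch: compute an `iqp-z`-style state two independent ways (n-stage butterfly vs. an explicit Hadamard matrix) at float64 and require agreement at `atol=1e-12`. The function names and the 1/2^n scaling convention are assumptions mirroring the description above, not the test suite's actual helpers:

```python
import numpy as np

def _bits(n):
    # bit k of each basis index, shape (2**n, n)
    return (np.arange(2**n)[:, None] >> np.arange(n)) & 1

def iqp_z_butterfly(data):
    # diagonal phase exp(i*theta), then n butterfly stages and 1/2**n scaling
    data = np.asarray(data, dtype=np.float64)
    n = data.shape[-1]
    S = 2**n
    f = np.exp(1j * (_bits(n) @ data))[None, :].copy()
    h = 1
    while h < S:
        v = f.reshape(1, S // (2 * h), 2, h)
        lo = v[:, :, 0, :].copy()
        hi = v[:, :, 1, :]
        v[:, :, 0, :] = lo + hi
        v[:, :, 1, :] = lo - hi
        h *= 2
    return f[0] / S

def iqp_z_matrix(data):
    # same state via an explicit H[i, j] = (-1)**popcount(i & j) matmul
    data = np.asarray(data, dtype=np.float64)
    n = data.shape[-1]
    bits = _bits(n)
    H = (-1.0) ** (bits @ bits.T)
    return (H @ np.exp(1j * (bits @ data))) / 2**n
```

Because both paths are exact up to float64 rounding, the `atol=1e-12` contract holds comfortably at small n; the same pattern, with `torch_ref` as one side, is what the fp64 contract test checks.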
`ruff check` + `ruff format --check` + `ty check`: clean.

Review responses (Copilot)

- Replaced the unreachable IQP precision test (the original required both
  `torch.version.cuda` and `torch.version.hip` — mutually exclusive).
- `pair_matrix` per-pair loop suggestion — rejected; n² tiny launches would
  lose to one matmul, and `torch_ref` uses the same approach. Added a
  `_IQP_PAIR_MATRIX_MAX_N = 20` size guard with a loop fallback for the OOM
  case.
- `phase` doc claim — qualified in README. Extending `CUDA_ENCODING_METHODS`
  to allowlist `phase` is a separable Rust-binding follow-up.
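The `_IQP_PAIR_MATRIX_MAX_N` guard can be sketched as follows. The constant's value comes from the text; the pair ordering mimics `torch.combinations(arange(n))`, and everything else (names, NumPy in place of torch) is illustrative:

```python
import numpy as np
from itertools import combinations

_IQP_PAIR_MATRIX_MAX_N = 20  # guard value stated in the review response

def zz_phase_matmul(bits, coeffs, pairs):
    # fast path: build the (2**n, n_pairs) pair matrix once, one matmul
    pm = np.stack([bits[:, i] * bits[:, j] for i, j in pairs], axis=1)
    return pm @ coeffs

def zz_phase_loop(bits, coeffs, pairs):
    # fallback: bounded memory, one accumulation per pair
    theta = np.zeros(bits.shape[0], dtype=np.asarray(coeffs).dtype)
    for k, (i, j) in enumerate(pairs):
        theta += coeffs[k] * bits[:, i] * bits[:, j]
    return theta

def zz_phase(coeffs, n):
    bits = ((np.arange(2**n)[:, None] >> np.arange(n)) & 1).astype(np.float64)
    pairs = list(combinations(range(n), 2))  # torch.combinations ordering
    if n <= _IQP_PAIR_MATRIX_MAX_N:
        return zz_phase_matmul(bits, coeffs, pairs)
    return zz_phase_loop(bits, coeffs, pairs)
```

Both paths produce identical phases; the guard only trades the one-matmul speed of the pair matrix for the bounded memory of the loop once the (2^n × n_pairs) workspace would dominate HBM.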
Docs

`README.md` and `TRITON_AMD_BACKEND.md` updated for full method parity.

Follow-ups (out of scope)

- Add `phase` to `CUDA_ENCODING_METHODS` for cuda-resident tensor inputs.
- `@triton.jit` IQP kernel (would close the q=16 gap).
- `phase_encode` in `qumat_qdp.torch_ref` to match the engine surface.

Checklist