[QDP] Close AMD-vs-CUDA encoder parity: add iqp, iqp-z, phase #1292
Open
ryankert01 wants to merge 2 commits into apache:main from
Conversation
force-pushed from dbf0f0e to f56b31d
Contributor
Pull request overview
Adds missing encoding-method parity on the Triton AMD backend so the public QdpEngine(backend="amd") supports the same method strings as the CUDA backend.
Changes:
- Implement iqp, iqp-z, and phase in TritonAmdEngine and dispatch them through encode(...).
- Add ROCm-marked tests for parity vs qumat_qdp.torch_ref.iqp_encode and a local phase reference, plus validation/normalization checks.
- Update README + Triton AMD backend docs to reflect expanded method support.
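The method-string dispatch described above can be pictured with a minimal sketch. All names below are illustrative, not the actual `TritonAmdEngine` API; the point is only the string-to-handler routing and the error-message enumeration the tests cover:

```python
# Hypothetical sketch of encode(method=...) dispatch; not the real engine code.
class EngineSketch:
    def __init__(self):
        # one handler per supported method string
        self._methods = {
            "amplitude": self._encode_amplitude,
            "angle": self._encode_angle,
            "basis": self._encode_basis,
            "iqp": self._encode_iqp,
            "iqp-z": self._encode_iqp_z,
            "phase": self._encode_phase,
        }

    def encode(self, data, method="amplitude"):
        try:
            handler = self._methods[method]
        except KeyError:
            # enumerate the valid methods in the error, as the tests check
            raise ValueError(
                f"unknown method {method!r}; expected one of {sorted(self._methods)}"
            ) from None
        return handler(data)

    # stub handlers so the sketch runs
    def _encode_amplitude(self, data): return ("amplitude", data)
    def _encode_angle(self, data): return ("angle", data)
    def _encode_basis(self, data): return ("basis", data)
    def _encode_iqp(self, data): return ("iqp", data)
    def _encode_iqp_z(self, data): return ("iqp-z", data)
    def _encode_phase(self, data): return ("phase", data)
```

With this shape, adding the three new methods is one dictionary entry plus one implementation each, and every backend exposes the same `encode(method=...)` surface.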
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| qdp/qdp-python/qumat_qdp/triton_amd.py | Adds encode_iqp/encode_phase implementations and method dispatch. |
| qdp/qdp-python/tests/test_triton_amd_backend.py | Adds new AMD backend tests for iqp/iqp-z/phase and routing/error-message coverage. |
| qdp/qdp-python/TRITON_AMD_BACKEND.md | Updates supported-methods list and test description. |
| qdp/qdp-python/README.md | Updates encoding-method table and backend support boundary. |
ryankert01 added a commit to ryankert01/mahout that referenced this pull request on Apr 25, 2026
Addresses Copilot review on PR apache#1292 and pushes general kernel-level optimization across all six AMD encoders.

PR review responses:

- Drop the unreachable `test_triton_amd_iqp_cuda_reference_optional` (decorator required `torch.version.cuda` while body required `is_triton_amd_available()` → mutually exclusive). Replace with a meaningful float64 IQP precision contract test that actually runs.
- Qualify README about the CUDA-tensor `phase` limitation: the Python extension's CUDA-tensor allowlist (`CUDA_ENCODING_METHODS`) does not yet include `phase`, so cuda-resident torch tensors must `.cpu()` first. Tracked as a follow-up.
- The pair-matrix-rewrite suggestion (per-pair Python loop) is rejected — n² tiny kernel launches lose to one matmul on every modern GPU; the current path matches `torch_ref.iqp_encode` and the CUDA FWT phase kernel. Add a `_IQP_PAIR_MATRIX_MAX_N` guard that *does* fall back to a pair loop past n=20 (where the (2^n × n_pairs) workspace dominates HBM), so the OOM scenario is bounded.

Encoder optimizations (verified on MI300X vs `qumat_qdp.torch_ref`, batch=64, fp32 input):

|           | q=8   | q=12  | q=16  |
|-----------|-------|-------|-------|
| amplitude | 0.95× | 0.95× | 1.00× |
| angle     | 1.57× | 1.37× | 1.04× |
| basis     | 2.18× | 2.10× | 2.14× |
| iqp (ZZ)  | 1.96× | 1.81× | 1.14× |
| iqp-z     | 1.35× | 1.32× | 0.91× |
| **phase** | **5.29×** | **5.39×** | **5.30×** |

What changed:

- **Real `@triton.jit` phase kernel** (fp32 / n ≤ 32). One HIP kernel fuses bit-pattern materialization + θ(b) accumulation + cos/sin + 1/√2^n scaling + complex-pack, writing the output buffer interleaved via `view_as_real`. The PyTorch fallback path (used at fp64 or n > 32) was making 5 intermediate (B, S) allocations; the kernel makes one.
- **Per-engine bits-table cache** (`_bits_cache`): the `((idx >> arange(n)) & 1).to(real)` table was being rebuilt on every call by `angle`/`iqp`/`phase`. Now cached per (n, dtype). At n=16 that's a ~4 MiB int64 + ~4 MiB real allocation saved per call.
- **Pair-index cache** (`_pair_cache`): `torch.combinations(arange(n))` cached per n.
- **`encode_amplitude`**: replaced `torch.complex(amp, zeros_like(amp)).to(complex_dtype)` with a single `amp.to(complex_dtype)` (writes (real, 0) interleaved in one kernel vs. zeros_like + complex_pack + cast = three).
- **`encode_angle`**: collapsed the n-step Python product loop (which reallocated a (B, S) tensor per qubit) into a single `where(bits, sin, cos).prod(dim=2)` reduction.
- **`encode_iqp`**: in-place n-stage Walsh-Hadamard butterfly using a single `(B, S/2)` scratch buffer. The previous `cat([lo+hi, lo-hi], dim=2)` allocated a fresh (B, S) tensor every stage; now `sub(out=scratch); a.add_(b); b.copy_(scratch)` reuses one workspace across all n stages. Also packs `f` via `torch.complex(cos, sin)` in one shot rather than writing to strided `.real`/`.imag`.
- **`_to_2d` fast path**: skip `as_tensor` + `.contiguous()` work when the caller already supplies a 2-D, contiguous, on-device, correctly-typed torch tensor (the common case for benchmarks).

Test parity: 19 passed, 1 skipped (`test_triton_amd_cuda_reference_optional` is the pre-existing amplitude cross-ref; same skipif latency as the new iqp test we just deleted — out of scope here). Numerical parity: all encoders still match `torch_ref` / `_torch_phase_ref` within float-rounding tolerance; the IQP fp64 contract test confirms `atol=1e-12`.
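The bits-table, the angle-encoding collapse, and the in-place Walsh-Hadamard butterfly described above can be sketched in NumPy as a stand-in for the torch code. The shapes and the single `(B, S/2)` scratch buffer follow the commit text; function names and the exact angle convention are illustrative assumptions:

```python
import numpy as np

def bits_table(n):
    # ((idx >> arange(n)) & 1): bit k of each basis index, shape (2**n, n).
    # In the engine this table is cached per (n, dtype) in _bits_cache.
    return (np.arange(2**n)[:, None] >> np.arange(n)) & 1

def angle_product(thetas):
    # where(bits, sin, cos).prod collapse of the per-qubit Python loop:
    # amp[b, s] = prod_k (sin(theta_k) if bit_k(s) else cos(theta_k)).
    # The angle convention (no half-angle here) is an assumption.
    bits = bits_table(thetas.shape[1])[None, :, :]   # (1, S, n)
    s = np.sin(thetas)[:, None, :]                   # (B, 1, n)
    c = np.cos(thetas)[:, None, :]
    return np.where(bits == 1, s, c).prod(axis=2)    # (B, S)

def fwt_inplace(f):
    # n-stage Walsh-Hadamard butterfly over the last axis (S = 2**n),
    # reusing one (B, S/2) scratch buffer instead of allocating per stage.
    B, S = f.shape
    scratch = np.empty((B, S // 2), dtype=f.dtype)
    h = 1
    while h < S:
        v = f.reshape(B, S // (2 * h), 2, h)
        lo, hi = v[:, :, 0, :], v[:, :, 1, :]
        sc = scratch.reshape(B, S // (2 * h), h)
        np.subtract(lo, hi, out=sc)  # scratch = lo - hi
        lo += hi                     # lo <- lo + hi, in place
        hi[...] = sc                 # hi <- old lo - old hi
        h *= 2
    return f
```

The butterfly result equals a matmul with the Hadamard matrix `H[i, j] = (-1)**popcount(i & j)` but touches only one extra `(B, S/2)` workspace, which is the allocation win the commit describes.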
ryankert01 added a commit to ryankert01/mahout that referenced this pull request on Apr 27, 2026
force-pushed from 2e5cb60 to 0ff8115
CUDA QdpEngine accepts amplitude, angle, basis, iqp, iqp-z, and phase.
The Triton AMD path only implemented the first three, so AMD users hit
a hard error on the IQP- and phase-family encodings (e.g. SVHN-IQP).
This adds vectorized PyTorch implementations for the missing methods on
TritonAmdEngine, dispatched through the same ``encode(method=...)``
contract:
- ``iqp`` — full ZZ entanglement: phase = Σ x_i·data_i + Σ_{i<j} x_i x_j·data_ij,
followed by an n-stage Walsh-Hadamard butterfly and 1/2^n scaling.
- ``iqp-z`` — Z-only diagonal: same FWT path with no ZZ pairs.
- ``phase`` — per-qubit product state (1/√2^n)·exp(i·Σ_k phases_k·b_k).
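The ``phase`` product state above has a direct closed form, which a pure-NumPy stand-in for the local pure-torch reference makes concrete (the function name is illustrative):

```python
import numpy as np

def phase_encode_ref(phases):
    # state[b] = (1 / sqrt(2**n)) * exp(i * sum_k phases[k] * b_k),
    # where b_k is bit k of basis index b.
    phases = np.asarray(phases, dtype=np.float64)
    n = phases.shape[-1]
    bits = (np.arange(2**n)[:, None] >> np.arange(n)) & 1   # (2**n, n)
    theta = bits @ phases                                    # phase per basis state
    return np.exp(1j * theta) / np.sqrt(2.0**n)
```

Every amplitude has magnitude 2^(-n/2), so the state is unit-norm by construction; that is what the unit-norm structural checks assert.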
Parity tests added against ``qumat_qdp.torch_ref.iqp_encode`` (which is
already validated against CUDA upstream) and a local pure-torch phase
reference. Also added unit-norm structural checks, param-count
validation, float64 precision contract, and a router test that the
public ``QdpEngine(backend="amd")`` accepts the new methods.
Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0): full triton_amd test file is 18 passed, 2 skipped
(NVIDIA CUDA-only references).
force-pushed from 0ff8115 to e70cb58
Follow-up to #1158 / #1289. Closes the AMD-vs-CUDA encoder gap on
`TritonAmdEngine` and takes a kernel-level optimization pass over all six
encoders.

What changed

Parity — added `iqp` (full ZZ), `iqp-z` (Z-only), and `phase` to
`TritonAmdEngine.encode()`. Algorithms mirror `qumat_qdp.torch_ref` and the
CUDA kernels in `qdp-kernels/src/{iqp,phase}.cu`.

Performance — verified on MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0), batch=64, fp32, vs `qumat_qdp.torch_ref`. Key wins:

- `@triton.jit` phase kernel (fp32, n ≤ 32): one HIP launch fuses bit-unpack +
  θ(b) + cos/sin + 1/√2^n + complex-pack via `view_as_real`. Replaces 5
  PyTorch intermediate allocations with one.
- `(B, S/2)` scratch buffer instead of allocating a fresh `(B, S)` per stage
  via `cat`.
- `((idx >> arange(n)) & 1)` no longer rebuilt on every call.
- `amp.to(complex_dtype)` for amplitude/angle (one kernel) instead of
  `complex(amp, zeros_like(amp))` (three).
- `_to_2d` fast path for already-correct torch tensors.

Tests

11 new tests in `qdp/qdp-python/tests/test_triton_amd_backend.py`:
torch_ref parity for iqp/iqp-z/phase, param-count validation, unit-norm
checks, fp64 precision contracts, error-message enumeration, and a
unified-router smoke test.
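The fp64 precision contract can be illustrated with a self-contained sketch: compute an `iqp-z`-style state two independent ways (n-stage butterfly vs. an explicit Hadamard matrix) at float64 and require agreement at `atol=1e-12`. The function names and the 1/2^n scaling convention are assumptions mirroring the description above, not the test suite's actual helpers:

```python
import numpy as np

def _bits(n):
    # bit k of each basis index, shape (2**n, n)
    return (np.arange(2**n)[:, None] >> np.arange(n)) & 1

def iqp_z_butterfly(data):
    # diagonal phase exp(i*theta), then n butterfly stages and 1/2**n scaling
    data = np.asarray(data, dtype=np.float64)
    n = data.shape[-1]
    S = 2**n
    f = np.exp(1j * (_bits(n) @ data))[None, :].copy()
    h = 1
    while h < S:
        v = f.reshape(1, S // (2 * h), 2, h)
        lo = v[:, :, 0, :].copy()
        hi = v[:, :, 1, :]
        v[:, :, 0, :] = lo + hi
        v[:, :, 1, :] = lo - hi
        h *= 2
    return f[0] / S

def iqp_z_matrix(data):
    # same state via an explicit H[i, j] = (-1)**popcount(i & j) matmul
    data = np.asarray(data, dtype=np.float64)
    n = data.shape[-1]
    bits = _bits(n)
    H = (-1.0) ** (bits @ bits.T)
    return (H @ np.exp(1j * (bits @ data))) / 2**n
```

Because both paths are exact up to float64 rounding, the `atol=1e-12` contract holds comfortably at small n; the same pattern, with `torch_ref` as one side, is what the fp64 contract test checks.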
`ruff check` + `ruff format --check` + `ty check`: clean.

Review responses (Copilot)

- Replaced the unreachable IQP precision test (the original required both
  `torch.version.cuda` and `torch.version.hip` — mutually exclusive).
- `pair_matrix` per-pair loop suggestion — rejected; n² tiny launches would
  lose to one matmul, and `torch_ref` uses the same approach. Added a
  `_IQP_PAIR_MATRIX_MAX_N = 20` size guard with a loop fallback for the OOM
  case.
- `phase` doc claim — qualified in README. Extending `CUDA_ENCODING_METHODS`
  to allowlist `phase` is a separable Rust-binding follow-up.
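The `_IQP_PAIR_MATRIX_MAX_N` guard can be sketched as follows. The constant's value comes from the text; the pair ordering mimics `torch.combinations(arange(n))`, and everything else (names, NumPy in place of torch) is illustrative:

```python
import numpy as np
from itertools import combinations

_IQP_PAIR_MATRIX_MAX_N = 20  # guard value stated in the review response

def zz_phase_matmul(bits, coeffs, pairs):
    # fast path: build the (2**n, n_pairs) pair matrix once, one matmul
    pm = np.stack([bits[:, i] * bits[:, j] for i, j in pairs], axis=1)
    return pm @ coeffs

def zz_phase_loop(bits, coeffs, pairs):
    # fallback: bounded memory, one accumulation per pair
    theta = np.zeros(bits.shape[0], dtype=np.asarray(coeffs).dtype)
    for k, (i, j) in enumerate(pairs):
        theta += coeffs[k] * bits[:, i] * bits[:, j]
    return theta

def zz_phase(coeffs, n):
    bits = ((np.arange(2**n)[:, None] >> np.arange(n)) & 1).astype(np.float64)
    pairs = list(combinations(range(n), 2))  # torch.combinations ordering
    if n <= _IQP_PAIR_MATRIX_MAX_N:
        return zz_phase_matmul(bits, coeffs, pairs)
    return zz_phase_loop(bits, coeffs, pairs)
```

Both paths produce identical phases; the guard only trades the one-matmul speed of the pair matrix for the bounded memory of the loop once the (2^n × n_pairs) workspace would dominate HBM.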
Docs

`README.md` and `TRITON_AMD_BACKEND.md` updated for full method parity.

Follow-ups (out of scope)

- Add `phase` to `CUDA_ENCODING_METHODS` for cuda-resident tensor inputs.
- `@triton.jit` IQP kernel (would close the q=16 gap).
- `phase_encode` in `qumat_qdp.torch_ref` to match the engine surface.

Checklist