[QDP] Close AMD-vs-CUDA encoder parity: add iqp, iqp-z, phase#1292

Open
ryankert01 wants to merge 2 commits into apache:main from ryankert01:amd-iqp-phase-parity

Conversation


@ryankert01 ryankert01 commented Apr 25, 2026

Follow-up to #1158 / #1289. Closes the AMD-vs-CUDA encoder gap on
TritonAmdEngine and takes a kernel-level optimization pass over all
six encoders.

What changed

Parity — added iqp (full ZZ), iqp-z (Z-only), and phase to
TritonAmdEngine.encode(). Algorithms mirror qumat_qdp.torch_ref and
the CUDA kernels in qdp-kernels/src/{iqp,phase}.cu.

Performance — verified on MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0), batch=64, fp32, vs qumat_qdp.torch_ref:

|           | q=8   | q=12  | q=16  |
|-----------|-------|-------|-------|
| amplitude | 0.95× | 0.95× | 1.00× |
| angle     | 1.57× | 1.37× | 1.04× |
| basis     | 2.18× | 2.10× | 2.14× |
| iqp(ZZ)   | 1.96× | 1.81× | 1.14× |
| iqp-z     | 1.35× | 1.32× | 0.91× |
| phase     | 5.29× | 5.39× | 5.30× |

Key wins:

  • Real @triton.jit phase kernel (fp32, n ≤ 32): one HIP launch
    fuses bit-unpack + θ(b) + cos/sin + 1/√2^n + complex-pack via
    view_as_real. Replaces 5 PyTorch intermediate allocations with one.
  • In-place FWT for IQP using a single (B, S/2) scratch buffer
    instead of allocating a fresh (B, S) per stage via cat.
  • Bits-table cache: the ((idx >> arange(n)) & 1) table is no longer
    rebuilt on every call.
  • amp.to(complex_dtype) for amplitude/angle (one kernel) instead
    of complex(amp, zeros_like(amp)) (three).
  • _to_2d fast path for already-correct torch tensors.
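The scratch-buffer FWT pattern behind the second bullet can be sketched in plain PyTorch. This is a minimal illustration with a hypothetical `fwht_inplace` helper, not the engine's actual code (which runs the same butterfly over complex phase factors):

```python
import torch

def fwht_inplace(x: torch.Tensor) -> torch.Tensor:
    """In-place Walsh-Hadamard transform over the last dim (size S = 2^n),
    reusing one (B, S/2) scratch buffer across all n stages instead of
    allocating a fresh (B, S) tensor per stage via cat."""
    B, S = x.shape
    scratch = torch.empty(B, S // 2, dtype=x.dtype, device=x.device)
    h = 1
    while h < S:
        v = x.view(B, S // (2 * h), 2, h)
        a, b = v[:, :, 0, :], v[:, :, 1, :]      # butterfly halves (views)
        s = scratch.view(B, S // (2 * h), h)
        torch.sub(a, b, out=s)                    # scratch = a - b
        a.add_(b)                                 # a := a + b
        b.copy_(s)                                # b := a_old - b_old
        h *= 2
    return x
```

The three in-place ops per stage replace the `cat([lo+hi, lo-hi], dim=2)` allocation described above.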

Tests

11 new tests in qdp/qdp-python/tests/test_triton_amd_backend.py:
torch_ref parity for iqp/iqp-z/phase, param-count validation,
unit-norm checks, fp64 precision contracts, error-message enumeration,
and a unified-router smoke test.

pytest qdp/qdp-python/tests/test_triton_amd_backend.py -q
========================== 19 passed, 1 skipped ==========================

ruff check + ruff format --check + ty check: clean.

Review responses (Copilot)

  • Broken cross-ref skipif — fixed. Replaced with a meaningful fp64
    IQP precision test (the original required both torch.version.cuda
    and torch.version.hip — mutually exclusive).
  • pair_matrix per-pair loop suggestion — rejected; n² tiny
    launches would lose to one matmul, and torch_ref uses the same
    approach. Added _IQP_PAIR_MATRIX_MAX_N = 20 size guard with a loop
    fallback for the OOM case.
  • CUDA-tensor phase doc claim — qualified in README. Extending
    CUDA_ENCODING_METHODS to allowlist phase is a separable
    Rust-binding follow-up.
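For context on the rejected suggestion, the matmul-based pair path can be sketched as follows. Helper name and weight layout are hypothetical (singles first, then `combinations` pairs); the real layout lives in `triton_amd.py`:

```python
import torch

def iqp_phase(data: torch.Tensor, n: int) -> torch.Tensor:
    """Diagonal ZZ-IQP phase for all 2^n basis states via two matmuls:
    phase(b) = sum_i x_i * b_i + sum_{i<j} x_ij * b_i * b_j.
    data: (B, n + n*(n-1)/2) — single-qubit weights, then pair weights."""
    S = 1 << n
    idx = torch.arange(S)
    bits = ((idx.unsqueeze(1) >> torch.arange(n)) & 1).to(data.dtype)  # (S, n)
    pairs = torch.combinations(torch.arange(n))                        # (P, 2)
    pair_mat = bits[:, pairs[:, 0]] * bits[:, pairs[:, 1]]             # (S, P)
    single, pairw = data[:, :n], data[:, n:]
    return single @ bits.T + pairw @ pair_mat.T                        # (B, S)
```

Two dense matmuls replace what the per-pair suggestion would turn into O(n²) tiny kernel launches; the (2^n × n_pairs) `pair_mat` workspace is what the `_IQP_PAIR_MATRIX_MAX_N` guard bounds.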

Docs

README.md and TRITON_AMD_BACKEND.md updated for full method parity.

Follow-ups (out of scope)

  • Add phase to CUDA_ENCODING_METHODS for cuda-resident tensor inputs.
  • Fused-FWT @triton.jit IQP kernel (would close the q=16 gap).
  • phase_encode in qumat_qdp.torch_ref to match the engine surface.

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes


Copilot AI left a comment


Pull request overview

Adds missing encoding-method parity on the Triton AMD backend so the public QdpEngine(backend="amd") supports the same method strings as the CUDA backend.

Changes:

  • Implement iqp, iqp-z, and phase in TritonAmdEngine and dispatch them through encode(...).
  • Add ROCm-marked tests for parity vs qumat_qdp.torch_ref.iqp_encode and a local phase reference, plus validation/normalization checks.
  • Update README + Triton AMD backend docs to reflect expanded method support.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
qdp/qdp-python/qumat_qdp/triton_amd.py Adds encode_iqp/encode_phase implementations and method dispatch.
qdp/qdp-python/tests/test_triton_amd_backend.py Adds new AMD backend tests for iqp/iqp-z/phase and routing/error-message coverage.
qdp/qdp-python/TRITON_AMD_BACKEND.md Updates supported-methods list and test description.
qdp/qdp-python/README.md Updates encoding-method table and backend support boundary.


ryankert01 added a commit to ryankert01/mahout that referenced this pull request Apr 25, 2026
Addresses Copilot review on PR apache#1292 and pushes general kernel-level
optimization across all six AMD encoders.

PR review responses:

- Drop the unreachable `test_triton_amd_iqp_cuda_reference_optional`
  (decorator required `torch.version.cuda` while body required
  `is_triton_amd_available()` → mutually exclusive). Replace with a
  meaningful float64 IQP precision contract test that actually runs.
- Qualify README about the CUDA-tensor `phase` limitation: the Python
  extension's CUDA-tensor allowlist (`CUDA_ENCODING_METHODS`) does not
  yet include `phase`, so cuda-resident torch tensors must `.cpu()`
  first. Tracked as a follow-up.
- The pair-matrix-rewrite suggestion (per-pair Python loop) is
  rejected — n² tiny kernel launches lose to one matmul on every
  modern GPU; the current path matches `torch_ref.iqp_encode` and the
  CUDA FWT phase kernel. Add a `_IQP_PAIR_MATRIX_MAX_N` guard that
  *does* fall back to a pair loop past n=20 (where the (2^n × n_pairs)
  workspace dominates HBM), so the OOM scenario is bounded.

Encoder optimizations (verified on MI300X vs `qumat_qdp.torch_ref`,
batch=64, fp32 input):

|        | q=8   | q=12  | q=16  |
|--------|-------|-------|-------|
| amplitude | 0.95× | 0.95× | 1.00× |
| angle     | 1.57× | 1.37× | 1.04× |
| basis     | 2.18× | 2.10× | 2.14× |
| iqp(ZZ)   | 1.96× | 1.81× | 1.14× |
| iqp-z     | 1.35× | 1.32× | 0.91× |
| **phase** | **5.29×** | **5.39×** | **5.30×** |

What changed:

- **Real `@triton.jit` phase kernel** (fp32 / n ≤ 32). One HIP kernel
  fuses bit-pattern materialization + θ(b) accumulation + cos/sin +
  1/√2^n scaling + complex-pack, writing the output buffer interleaved
  via `view_as_real`. The PyTorch fallback path (used at fp64 or n > 32)
  was making 5 intermediate (B, S) allocations; the kernel makes one.
- **Per-engine bits-table cache** (`_bits_cache`): the
  `((idx >> arange(n)) & 1).to(real)` table was being rebuilt on every
  call by `angle`/`iqp`/`phase`. Now cached per (n, dtype). At n=16
  that's a ~4 MiB int64 + ~4 MiB real allocation saved per call.
- **Pair-index cache** (`_pair_cache`): `torch.combinations(arange(n))`
  cached per n.
- **`encode_amplitude`**: replaced
  `torch.complex(amp, zeros_like(amp)).to(complex_dtype)` with a single
  `amp.to(complex_dtype)` (writes (real, 0) interleaved in one kernel
  vs. zeros_like + complex_pack + cast = three).
- **`encode_angle`**: collapsed the n-step Python product loop (which
  reallocated a (B, S) tensor per qubit) into a single
  `where(bits, sin, cos).prod(dim=2)` reduction.
- **`encode_iqp`**: in-place n-stage Walsh-Hadamard butterfly using a
  single `(B, S/2)` scratch buffer. The previous
  `cat([lo+hi, lo-hi], dim=2)` allocated a fresh (B, S) tensor every
  stage; now `sub(out=scratch); a.add_(b); b.copy_(scratch)` reuses one
  workspace across all n stages. Also packs `f` via `torch.complex(cos,
  sin)` in one shot rather than writing to strided `.real`/`.imag`.
- **`_to_2d` fast path**: skip `as_tensor` + `.contiguous()` work when
  the caller already supplies a 2-D, contiguous, on-device,
  correctly-typed torch tensor (the common case for benchmarks).
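The `where(bits, sin, cos).prod(dim=2)` reduction described for `encode_angle` can be sketched as below. This is an illustration only: the helper name is hypothetical, and the half-angle convention is an assumption (the real convention is whatever `torch_ref` defines):

```python
import torch

def angle_amplitudes(theta: torch.Tensor) -> torch.Tensor:
    """Product-state amplitudes in one broadcast reduction.
    theta: (B, n) rotation angles -> (B, 2^n) real amplitudes."""
    B, n = theta.shape
    idx = torch.arange(1 << n)
    bits = ((idx.unsqueeze(1) >> torch.arange(n)) & 1).bool()  # (S, n)
    c = torch.cos(theta / 2).unsqueeze(1)                      # (B, 1, n)
    s = torch.sin(theta / 2).unsqueeze(1)
    # broadcast to (B, S, n), then reduce over qubits — no per-qubit loop
    return torch.where(bits, s, c).prod(dim=2)                 # (B, S)
```

One `where` + one `prod` replaces the n-step Python loop that reallocated a (B, S) tensor per qubit.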

Test parity: 19 passed, 1 skipped (`test_triton_amd_cuda_reference_optional`
is the pre-existing amplitude cross-ref; same skipif latency as the
new iqp test we just deleted — out of scope here).

Numerical parity: all encoders still match `torch_ref` /
`_torch_phase_ref` within float-rounding tolerance; the IQP fp64
contract test confirms `atol=1e-12`.
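The `encode_amplitude` swap above is a one-line equivalence that can be checked in plain PyTorch (standalone snippet, not the engine code):

```python
import torch

amp = torch.rand(4, 8)                                   # real amplitudes
state = amp.to(torch.complex64)                          # one cast kernel; imag parts are 0
ref = torch.complex(amp, torch.zeros_like(amp))          # zeros_like + pack (+ cast) path
assert torch.equal(state, ref)
```

Same result, one kernel instead of three.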
@ryankert01 ryankert01 requested a review from rich7420 April 26, 2026 04:46
ryankert01 added a commit to ryankert01/mahout that referenced this pull request Apr 27, 2026
@ryankert01 ryankert01 force-pushed the amd-iqp-phase-parity branch from 2e5cb60 to 0ff8115 Compare April 27, 2026 11:56
ryankert01 added 2 commits May 7, 2026 15:01
CUDA QdpEngine accepts amplitude, angle, basis, iqp, iqp-z, and phase.
The Triton AMD path only implemented the first three, so AMD users hit
a hard error on the IQP- and phase-family encodings (e.g. SVHN-IQP).

This adds vectorized PyTorch implementations for the missing methods on
TritonAmdEngine, dispatched through the same ``encode(method=...)``
contract:

- ``iqp`` — full ZZ entanglement: phase = Σ x_i·data_i + Σ_{i<j} x_i x_j·data_ij,
  followed by an n-stage Walsh-Hadamard butterfly and 1/2^n scaling.
- ``iqp-z`` — Z-only diagonal: same FWT path with no ZZ pairs.
- ``phase`` — per-qubit product state (1/√2^n)·exp(i·Σ_k phases_k·b_k).
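A pure-torch sketch of the ``phase`` formula above (hypothetical `phase_ref` helper, analogous in spirit to the local reference the tests compare against):

```python
import torch

def phase_ref(phases: torch.Tensor) -> torch.Tensor:
    """Phase encoding: state_b = (1/sqrt(2^n)) * exp(i * sum_k phases_k * b_k).
    phases: (B, n) -> (B, 2^n) complex statevector."""
    B, n = phases.shape
    idx = torch.arange(1 << n)
    bits = ((idx.unsqueeze(1) >> torch.arange(n)) & 1).to(phases.dtype)  # (S, n)
    theta = phases @ bits.T                                              # (B, S)
    # uniform magnitude 1/sqrt(2^n), per-state phase theta
    return torch.polar(torch.full_like(theta, 2.0 ** (-n / 2)), theta)
```

Every state has magnitude 1/√2^n, so the output is unit-norm by construction.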

Parity tests added against ``qumat_qdp.torch_ref.iqp_encode`` (which is
already validated against CUDA upstream) and a local pure-torch phase
reference. Also added unit-norm structural checks, param-count
validation, float64 precision contract, and a router test that the
public ``QdpEngine(backend="amd")`` accepts the new methods.

Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
triton 3.5.0): full triton_amd test file is 18 passed, 2 skipped
(NVIDIA CUDA-only references).
@ryankert01 ryankert01 force-pushed the amd-iqp-phase-parity branch from 0ff8115 to e70cb58 Compare May 7, 2026 07:01