
[Feature][QDP] F32 support for angle/basis encoders, fidelity metrics and pipeline improvements#1275

Open
rich7420 wants to merge 5 commits into apache:main from rich7420:pipeline-improvement

Conversation

@rich7420 (Contributor)

Related Issues

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

Angle and basis encoding always dispatched to F64 CUDA kernels even though F32
is sufficient for typical ML workloads. The hard-coded prefetch depth of 16
was also unsafe at high qubit counts (20-qubit amplitude encoding = 512 MB/batch
× 16 = 8 GB of buffered data). The PipelineIterator shutdown sequence also
had no ordered teardown, risking thread leaks on drop.

How

F32 batch kernels for angle and basis

  • Added angle_encode_batch_kernel_f32 (grid-stride) and
    launch_angle_encode_batch_f32 in angle.cu
  • Added basis_encode_kernel_f32, basis_encode_batch_kernel_f32, and their
    launchers in basis.cu
  • Extended encoding_supports_f32 to return true for
    "amplitude" | "angle" | "basis"
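The extended F32 gate can be sketched in Python (a hypothetical mirror of the Rust `encoding_supports_f32` predicate, not the actual implementation):

```python
def encoding_supports_f32(method: str) -> bool:
    """Return True when the encoding has a dedicated F32 kernel path.

    Hypothetical Python mirror of the Rust gate this PR extends;
    after this PR the set grows from {"amplitude"} to three encodings.
    """
    return method.lower() in ("amplitude", "angle", "basis")
```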

Benchmark results — 16 qubits, batch-size 64, 200 batches:

| Encoding | main (F64) | this PR (F32) | Speedup |
| --- | --- | --- | --- |
| amplitude | 1,170 vec/sec | 1,184 vec/sec | ~1× |
| angle | 7,466 vec/sec | 29,790 vec/sec | ~4× |
| basis | 59,883 vec/sec | 314,031 vec/sec | ~5.2× |

Auto prefetch depth

  • compute_optimal_prefetch_depth() targets a 256 MB CPU buffer, clamped to
    [1, 32]; default prefetch_depth is now 0 (resolved in normalize())
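The sizing rule above (a 256 MB buffer budget, per-encoding sample length, clamped depth) can be sketched in Python; names follow the Rust helper but this is an illustrative model, not the shipped code:

```python
def compute_optimal_prefetch_depth(encoding: str, num_qubits: int,
                                   batch_size: int, float32: bool) -> int:
    """Pick a prefetch depth that keeps ~256 MB of batches buffered on the CPU."""
    TARGET_BYTES = 256 * 1024 * 1024
    MIN_DEPTH, MAX_DEPTH = 1, 32
    method = encoding.lower()
    if method == "angle":
        sample_len = num_qubits          # one rotation angle per qubit
    elif method == "basis":
        sample_len = 1                   # a single basis index per sample
    else:
        sample_len = 1 << num_qubits     # amplitude / iqp: full state vector
    bytes_per_batch = batch_size * sample_len * (4 if float32 else 8)
    if bytes_per_batch == 0:
        return MAX_DEPTH
    return max(MIN_DEPTH, min(MAX_DEPTH, TARGET_BYTES // bytes_per_batch))
```

With this rule, a 20-qubit F64 amplitude workload at batch size 64 collapses to depth 1, while a 16-qubit F32 angle workload saturates at the cap of 32.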

PipelineIterator cleanup

  • rx wrapped in Mutex<Receiver> (required for PyO3 #[pyclass] Sync bound)
  • recycle_tx and producer_handle changed to Option<_> for owned teardown
  • Drop follows correct shutdown order: drop sender → drain channel → join thread
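The shutdown ordering (signal the producer, drain so a blocked send unblocks, then join) can be modelled with Python threads; the class and names here are a toy stand-in for the Rust iterator, not its API:

```python
import queue
import threading

class ToyPipelineIterator:
    """Toy model of the Drop ordering: stop sender, drain channel, join thread."""

    def __init__(self, produce):
        self.rx = queue.Queue(maxsize=4)     # bounded channel, like the Rust side
        self.stop = threading.Event()
        self.producer = threading.Thread(target=self._run, args=(produce,))
        self.producer.start()

    def _run(self, produce):
        for item in produce():
            if self.stop.is_set():           # stand-in for a dropped sender
                break
            self.rx.put(item)                # blocks when the consumer is slow

    def close(self):
        # 1. signal shutdown (analogue of dropping the sender)
        self.stop.set()
        # 2. drain remaining items so a blocked put() can proceed and observe stop
        while True:
            try:
                self.rx.get_nowait()
            except queue.Empty:
                break
        # 3. join the producer thread so nothing leaks on drop
        self.producer.join(timeout=5)
```

Skipping step 2 risks joining a producer that is still blocked on a full channel, which is exactly the thread-leak scenario the PR fixes.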

Fidelity metrics (gpu/metrics.rs)

  • CPU-side |⟨ψ|φ⟩|² fidelity and trace distance, including cross-precision
    (F32 vs F64) helpers
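The fidelity computation on interleaved (re, im) state vectors reduces to one pass accumulating ⟨ψ|φ⟩; a minimal Python sketch of the same math (the shipped helpers live in Rust):

```python
def fidelity(state_a, state_b):
    """|<a|b>|^2 for equal-length interleaved (re0, im0, re1, im1, ...) states."""
    if len(state_a) != len(state_b) or len(state_a) % 2:
        raise ValueError("states must be equal-length interleaved re/im pairs")
    re_acc = im_acc = 0.0
    for i in range(0, len(state_a), 2):
        a_re, a_im = state_a[i], state_a[i + 1]
        b_re, b_im = state_b[i], state_b[i + 1]
        # conj(a) * b = (a_re*b_re + a_im*b_im) + i*(a_re*b_im - a_im*b_re)
        re_acc += a_re * b_re + a_im * b_im
        im_acc += a_re * b_im - a_im * b_re
    return re_acc * re_acc + im_acc * im_acc
```

Note that a global phase drops out: i·|0⟩ compared against |0⟩ still gives fidelity 1.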

Python / benchmark

  • backend("auto") falls back to PyTorch with a RuntimeWarning
  • as_torch_dataset() wraps the loader as a torch.utils.data.IterableDataset
  • --warmup N flag added to benchmark scripts (default 5)
  • Renamed the misleading profile label GPU::H2D_Indices_f32 to GPU::H2D_BasisIndices
  • Restored _validate_loader_args() call on the synthetic path
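The "auto" backend selection can be sketched as a small dispatcher; `select_backend` and its `rust_available` flag are hypothetical names for illustration, only the behavior (warned fallback, strict `'rust'`, explicit `'pytorch'`) comes from this PR:

```python
import warnings

def select_backend(backend: str, rust_available: bool) -> str:
    """Resolve a loader backend name; 'auto' falls back with a RuntimeWarning."""
    if backend == "rust":
        if not rust_available:
            raise RuntimeError("Rust extension not available")
        return "rust"
    if backend == "pytorch":
        return "pytorch"
    if backend == "auto":
        if rust_available:
            return "rust"
        warnings.warn(
            "Rust extension unavailable; falling back to the PyTorch backend",
            RuntimeWarning,
        )
        return "pytorch"
    raise ValueError(f"unknown backend: {backend!r}")
```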

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes


Copilot AI left a comment


Pull request overview

This PR improves QDP GPU encoding and pipeline ergonomics by adding float32 (F32) support to angle/basis encoders, introducing CPU-side fidelity/trace-distance metrics for validation, and making the prefetch/pipeline shutdown behavior safer and more configurable across Rust and Python entrypoints.

Changes:

  • Add CUDA F32 batch kernels + Rust encoder plumbing for angle and basis encodings, and expand F32 support gating.
  • Introduce auto-computed prefetch depth (targeting ~256MB CPU buffer) and rework PipelineIterator teardown logic.
  • Add GPU metrics helpers (fidelity/trace distance + GPU readback) and Python usability improvements (backend “auto” fallback, torch dataset wrapper, benchmark warmup flag).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| testing/qdp_python/test_fallback.py | Updates invalid-backend test expectations for the loader backend selection. |
| qdp/qdp-python/qumat_qdp/loader.py | Adds 'auto' backend option with warning-based fallback, earlier file/extension validation, and as_torch_dataset(). |
| qdp/qdp-python/benchmark/benchmark_throughput.py | Adds --warmup flag and forwards warmup batches to the benchmark runner. |
| qdp/qdp-python/benchmark/benchmark_latency.py | Adds --warmup flag and forwards warmup batches to the benchmark runner. |
| qdp/qdp-kernels/src/lib.rs | Extends kernel FFI surface with new basis/angle F32 entrypoints and batch angle launcher. |
| qdp/qdp-kernels/src/basis.cu | Implements basis encode single/batch F32 CUDA kernels and launchers. |
| qdp/qdp-kernels/src/angle.cu | Implements angle batch F32 CUDA kernel and launcher. |
| qdp/qdp-core/tests/gpu_fidelity.rs | Adds unit + GPU cross-precision fidelity validation tests. |
| qdp/qdp-core/src/pipeline_runner.rs | Adds auto prefetch computation, expands F32 encoding support, and reworks iterator teardown/storage. |
| qdp/qdp-core/src/gpu/mod.rs | Exposes new metrics module and re-exports fidelity/trace distance helpers. |
| qdp/qdp-core/src/gpu/metrics.rs | Adds fidelity/trace-distance implementations plus Linux-only GPU download helpers. |
| qdp/qdp-core/src/gpu/encodings/basis.rs | Adds encode_batch_f32 for basis encoding and wires it to the new F32 kernel. |
| qdp/qdp-core/src/gpu/encodings/angle.rs | Adds encode_batch_f32 for angle encoding and wires it to the new F32 kernel. |


```rust
//! They are intended for **testing and validation**, not the hot path.

#[cfg(target_os = "linux")]
use cudarc::driver::CudaDevice;
```
Comment thread qdp/qdp-core/tests/gpu_fidelity.rs (Outdated)
Comment on lines +33 to +68
```rust
/// Compute the state fidelity |⟨ψ|φ⟩|² between two complex state vectors
/// given as interleaved (re, im) f64 slices of equal length.
///
/// Both slices must have length `2 * state_dim` (re0, im0, re1, im1, ...).
/// Returns a value in [0, 1]. Fidelity == 1 means identical states (up to
/// global phase).
pub fn fidelity_f64(state_a: &[f64], state_b: &[f64]) -> Result<f64> {
    if state_a.len() != state_b.len() {
        return Err(MahoutError::InvalidInput(format!(
            "fidelity: length mismatch ({} vs {})",
            state_a.len(),
            state_b.len()
        )));
    }
    if !state_a.len().is_multiple_of(2) {
        return Err(MahoutError::InvalidInput(
            "fidelity: length must be even (interleaved re/im pairs)".to_string(),
        ));
    }

    // ⟨ψ|φ⟩ = Σ_i conj(a_i) * b_i
    let mut re_acc = 0.0_f64;
    let mut im_acc = 0.0_f64;
    for i in (0..state_a.len()).step_by(2) {
        let a_re = state_a[i];
        let a_im = state_a[i + 1];
        let b_re = state_b[i];
        let b_im = state_b[i + 1];
        // conj(a) * b = (a_re - i*a_im)(b_re + i*b_im)
        //             = (a_re*b_re + a_im*b_im) + i*(a_re*b_im - a_im*b_re)
        re_acc += a_re * b_re + a_im * b_im;
        im_acc += a_re * b_im - a_im * b_re;
    }

    Ok(re_acc * re_acc + im_acc * im_acc)
}
```
Comment thread qdp/qdp-python/qumat_qdp/loader.py Outdated
Comment on lines +247 to +250
``'auto'`` (default-like): tries the Rust backend first and silently
falls back to the PyTorch reference backend if the Rust extension is
unavailable. ``'rust'`` raises if the extension is missing.
``'pytorch'`` always uses the pure-PyTorch path.
Comment thread qdp/qdp-core/src/pipeline_runner.rs Outdated
Comment on lines +142 to +153
```rust
let sample_len = match encoding_method.to_lowercase().as_str() {
    "angle" => num_qubits,
    "basis" => 1,
    _ => 1usize << num_qubits, // amplitude / iqp
};
let bytes_per_element = if float32 { 4usize } else { 8usize };
let bytes_per_batch = batch_size * sample_len * bytes_per_element;

if bytes_per_batch == 0 {
    return MAX_DEPTH;
}
(TARGET_BYTES / bytes_per_batch).clamp(MIN_DEPTH, MAX_DEPTH)
```
Comment thread qdp/qdp-core/src/pipeline_runner.rs Outdated
Comment on lines +433 to +434
```rust
// Drain any remaining items so the producer's send() unblocks.
while self.rx.lock().unwrap().try_recv().is_ok() {}
```
@ryankert01 (Member) left a comment


I read this PR at a high level; looks good so far.

@400Ping (Member) left a comment


Overall LGTM

@viiccwen (Contributor)

F32 support for angle encoding duplicates the work in #1268

@rich7420 (Contributor, Author)

@viiccwen yes! I think #1268 should be merged first; then I'll rebase this one.

Copilot AI (Contributor) left a comment


Pull request overview

This PR extends QDP’s GPU encoding pipeline to support float32 (F32) for angle and basis encoders (including batched CUDA tensor paths), adds CPU-side fidelity/trace-distance metrics for validation, and improves pipeline robustness via auto prefetch sizing and safer iterator teardown.

Changes:

  • Added/extended F32 CUDA support for angle/basis encodings (including zero-copy batched paths) and updated Python/Rust dispatch + validation.
  • Introduced GPU validation helpers (finite checks, basis-index validate+cast) and CPU-side fidelity/trace-distance utilities with new GPU precision comparison tests.
  • Improved pipeline ergonomics: auto-computed prefetch depth, safer PipelineIterator drop order, Python loader “auto” backend fallback + iterable dataset wrapper, and benchmark warmup flag.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| testing/qdp_python/test_fallback.py | Updates backend validation test for new backend options. |
| testing/qdp_python/test_dlpack_validation.py | Adjusts CUDA F32 angle validation to accept the 2D batch path. |
| testing/qdp/test_bindings.py | Updates bindings tests to assert F32 CUDA angle batch support. |
| qdp/qdp-python/src/pytorch.rs | Updates CUDA tensor validation rules for angle/basis F32 acceptance. |
| qdp/qdp-python/src/engine.rs | Centralizes CUDA tensor dispatch to route F32 angle/basis to correct zero-copy paths. |
| qdp/qdp-python/src/dlpack.rs | Marks DLPack helper as currently unused but preserved for planned refactor. |
| qdp/qdp-python/qumat_qdp/loader.py | Adds backend('auto') fallback with warnings, earlier file/streaming validation, and as_torch_dataset(). |
| qdp/qdp-python/benchmark/benchmark_throughput.py | Adds --warmup support and passes it to the benchmark runner. |
| qdp/qdp-python/benchmark/benchmark_latency.py | Adds --warmup support and passes it to the benchmark runner. |
| qdp/qdp-kernels/src/validation.cu | Adds f64 finite-check and basis-index validation/cast kernels + launchers. |
| qdp/qdp-kernels/src/lib.rs | Extends FFI surface for new kernels and adds no-CUDA stubs. |
| qdp/qdp-kernels/src/basis.cu | Adds F32 basis kernels + launchers (single + batch). |
| qdp/qdp-core/tests/gpu_ptr_encoding.rs | Adds tests for basis F32 GPU-pointer encode APIs (incl. rejection cases). |
| qdp/qdp-core/tests/gpu_fidelity.rs | Adds fidelity/trace-distance unit tests + GPU F32 vs F64 comparisons. |
| qdp/qdp-core/tests/gpu_angle_encoding.rs | Adds large-batch async pipeline test for F32 angle batch encoding. |
| qdp/qdp-core/src/reader.rs | Notes structural limitation: file readers still materialize f64 before f32 casting. |
| qdp/qdp-core/src/pipeline_runner.rs | Implements auto prefetch depth, extends the F32 encoding support set, and hardens iterator teardown. |
| qdp/qdp-core/src/lib.rs | Documents that F32 GPU-pointer APIs don't dispatch by encoding method; adds basis F32 GPU-pointer APIs. |
| qdp/qdp-core/src/gpu/validation.rs | Adds reusable GPU-side validation helpers for finite checks and basis indices. |
| qdp/qdp-core/src/gpu/pipeline.rs | Generalizes async pipeline to typed buffers and updates copy API semantics. |
| qdp/qdp-core/src/gpu/mod.rs | Exposes new metrics and validation modules/exports. |
| qdp/qdp-core/src/gpu/metrics.rs | Adds fidelity/trace-distance implementations and GPU download helpers. |
| qdp/qdp-core/src/gpu/memory.rs | Makes pinned host buffers generic over element type (e.g., f32/f64). |
| qdp/qdp-core/src/gpu/encodings/mod.rs | Adds notes about dispatcher allocations and future refactor direction. |
| qdp/qdp-core/src/gpu/encodings/basis.rs | Adds basis index bounds checks and implements the F32 basis GPU-pointer path via the validate+cast kernel. |
| qdp/qdp-core/src/gpu/encodings/angle.rs | Adds stricter input validation, finite checks, and an F32 async pipeline path for large batches. |
| qdp/qdp-core/src/gpu/buffer_pool.rs | Generalizes pinned buffer pool/handle over element type (f32/f64). |


Comment on lines 98 to 116
```diff
 /// Async H2D copy on the copy stream.
 ///
 /// # Safety
-/// `src` must be valid for `len_elements` `f64` values and properly aligned.
-/// `dst` must point to device memory for `len_elements` `f64` values on the same device.
+/// `src` must be valid for `len_bytes` bytes and properly aligned.
+/// `dst` must point to device memory for `len_bytes` bytes on the same device.
 /// Both pointers must remain valid until the copy completes on `stream_copy`.
 pub unsafe fn async_copy_to_device(
     &self,
     src: *const c_void,
     dst: *mut c_void,
-    len_elements: usize,
+    len_bytes: usize,
 ) -> Result<()> {
     crate::profile_scope!("GPU::H2D_Copy");
     unsafe {
         let ret = cudaMemcpyAsync(
             dst,
             src,
-            len_elements * std::mem::size_of::<f64>(),
+            len_bytes,
             CUDA_MEMCPY_HOST_TO_DEVICE,
```
Comment on lines 615 to 637


5 participants