[Feature][QDP] Add MPI-ready distributed amplitude execution scaffolding #1296

Open

viiccwen wants to merge 8 commits into apache:main from viiccwen:feature/qdp-multigpu-plan

Conversation

Contributor

viiccwen commented May 4, 2026

Related Issues

Closes #1295

Changes

  • add distributed amplitude planning, layout, runtime, and state scaffolding for QDP multi-GPU execution
  • add DistributedExecutionContext so distributed execution is driven by a bundled mesh-plus-collectives object rather than ad hoc device and collective parameters
  • add a CollectiveCommunicator seam and an in-process LocalCollectiveCommunicator implementation for the current single-process path (see the communicator sketch after this list)
  • ensure distributed execution resolves CUDA handles from planned device_id values so shard metadata, device handles, and active device context stay aligned
  • add a distributed q34 probe plus runtime / planner / topology / communicator coverage for reordered execution paths
  • update CUDA arch targeting so the probe builds on the GPUs available on this host
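
For concreteness, a minimal sketch of what the communicator seam could look like. The CollectiveCommunicator and LocalCollectiveCommunicator names come from this PR; the all_reduce_sum signature and the slice-of-partials shape are illustrative assumptions, not the actual QDP API:

```rust
/// The seam a future MPI-backed collective would slot into
/// (hypothetical signature; the real trait may differ).
pub trait CollectiveCommunicator {
    /// Combine one partial value per shard (e.g. local squared-norm
    /// contributions) into the global sum every shard observes.
    fn all_reduce_sum(&self, shard_partials: &[f64]) -> f64;
}

/// Single-process implementation: every shard lives in this process,
/// so the "collective" degenerates to a plain in-memory sum.
pub struct LocalCollectiveCommunicator;

impl CollectiveCommunicator for LocalCollectiveCommunicator {
    fn all_reduce_sum(&self, shard_partials: &[f64]) -> f64 {
        shard_partials.iter().sum()
    }
}

fn main() {
    // Squared-norm contributions from two device shards.
    let comm = LocalCollectiveCommunicator;
    let global = comm.all_reduce_sum(&[0.49, 0.51]);
    assert!((global - 1.0).abs() < 1e-12);
    println!("global squared norm: {global}");
}
```

The point of the seam is that an MPI-backed implementation can later replace the local one without touching the runtime that calls it.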

Why

  • establish a QDP-native distributed amplitude foundation that can scale beyond a single GPU without depending on Lightning-specific assumptions
  • make the current single-process path extensible toward future MPI-backed collectives
  • verify that the implementation can successfully materialize and encode a 34-qubit float32 distributed state on this host

How

  • separate distributed planning, execution context, runtime, and state responsibilities
  • drive shard execution from planned placement metadata instead of raw mesh iteration order (see the placement sketch after this list)
  • keep the current collectives local and in-process, while shaping the execution boundary for future MPI-backed implementations
  • validate the path with distributed tests and the distributed_multigpu_q34_probe example
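
As mentioned in the placement bullet above, here is an illustrative sketch of plan-driven sharding. PlacementPlanner and device_id appear in this PR; the ShardPlacement struct, its field names, and the even-split policy are assumptions invented for this example:

```rust
/// Hypothetical shape of a placement plan entry: the planner assigns each
/// shard an amplitude range and a concrete device, and execution follows
/// this metadata rather than whatever order the mesh enumerates devices in.
#[derive(Debug, Clone)]
struct ShardPlacement {
    shard_index: usize,
    device_id: usize,                // planned CUDA device for this shard
    amp_range: std::ops::Range<u64>, // half-open range of global amplitude indices
}

/// Evenly split the 2^n amplitude index space across the planned devices
/// (assumes the device count divides the amplitude space, as in a
/// two-GPU q34 run).
fn plan_equal_shards(num_qubits: u32, device_ids: &[usize]) -> Vec<ShardPlacement> {
    let total: u64 = 1 << num_qubits;
    let per_shard = total / device_ids.len() as u64;
    device_ids
        .iter()
        .enumerate()
        .map(|(i, &device_id)| ShardPlacement {
            shard_index: i,
            device_id,
            amp_range: (i as u64 * per_shard)..((i as u64 + 1) * per_shard),
        })
        .collect()
}

fn main() {
    // Deliberately reordered device list: execution must still follow the
    // planned device_id, not positional mesh order, so shard metadata,
    // device handles, and active device context stay aligned.
    let plan = plan_equal_shards(4, &[1, 0]);
    for shard in &plan {
        // The runtime would resolve the CUDA handle for shard.device_id here.
        println!(
            "shard {} -> device {} owns amplitudes {:?}",
            shard.shard_index, shard.device_id, shard.amp_range
        );
    }
}
```
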
```mermaid
sequenceDiagram
    participant Caller as QdpEngine caller
    participant Engine as QdpEngine
    participant Mesh as DeviceMesh
    participant Planner as PlacementPlanner
    participant Ctx as DistributedExecutionContext
    participant Runtime as distributed runtime
    participant Comm as LocalCollectiveCommunicator

    Caller->>Engine: encode distributed amplitude request
    Engine->>Engine: validate input and resolve request
    Engine->>Mesh: build distributed mesh
    Engine->>Ctx: construct execution context
    Engine->>Planner: build placement plan
    Planner-->>Engine: placement + shard ranges
    Engine->>Runtime: execute distributed encode
    Runtime->>Runtime: bind planned device handles
    Runtime->>Comm: reduce local norm contributions
    Comm-->>Runtime: global norm
    Runtime-->>Engine: DistributedStateVector
    Engine-->>Caller: sharded distributed state
```
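
To tie the diagram's reduction steps to code: a schematic of what "reduce local norm contributions → global norm" could mean, with Vec<f32> standing in for GPU shard buffers and a closure standing in for the communicator. None of this is the runtime's actual code:

```rust
/// Schematic of the normalization step in the diagram: each shard
/// contributes a partial squared norm, the collective produces the
/// global norm, and every shard is scaled by the same factor.
fn normalize_shards(shards: &mut [Vec<f32>], all_reduce_sum: impl Fn(&[f64]) -> f64) {
    let partials: Vec<f64> = shards
        .iter()
        .map(|s| s.iter().map(|&a| (a as f64).powi(2)).sum())
        .collect();
    // In-process today; an MPI all-reduce would slot in here unchanged.
    let global_norm = all_reduce_sum(&partials).sqrt();
    for shard in shards.iter_mut() {
        for amp in shard.iter_mut() {
            *amp = (*amp as f64 / global_norm) as f32;
        }
    }
}

fn main() {
    // Two shards of an un-normalized 4-amplitude state split across devices.
    let mut shards = vec![vec![1.0_f32, 1.0], vec![1.0, 1.0]];
    normalize_shards(&mut shards, |xs| xs.iter().sum());
    assert!((shards[0][0] - 0.5).abs() < 1e-6); // 1 / sqrt(4)
    println!("{shards:?}");
}
```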

Member

400Ping commented May 4, 2026

Nice one, will probably take a look on Thursday.

Member

ryankert01 left a comment

Hi, nice initiative. Curious whether this can be plugged into our current lightning.gpu (PennyLane) workflow?

mahout.qdp -> lightning.gpu (zero copy)

viiccwen force-pushed the feature/qdp-multigpu-plan branch from 4145efa to baa6a90 on May 5, 2026, 12:25
viiccwen changed the title from "Feat(GPU): add single-node distributed amplitude scaffolding" to "[Feature][QDP] Add MPI-ready distributed amplitude execution scaffolding" on May 5, 2026
Contributor Author

viiccwen commented May 6, 2026

@ryankert01, after some research, I think the main gap is at the lightning.gpu integration boundary.

From what I can tell, the missing piece is not “can Mahout produce a GPU-resident state?” but “can lightning.gpu ingest an external GPU state buffer through a stable public interface without copying?”.

So my current estimate would be:

  1. expose or formalize an external-state ingest path on the PennyLane / lightning.gpu side
  2. add a Mahout-side bridge/adapter for mahout.qdp -> lightning.gpu
  3. add end-to-end tests and examples for the workflow

So yes: for a proper zero-copy integration, I would expect roughly 3 PRs for a narrow MVP.
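
Purely to illustrate step 2, a hypothetical Mahout-side export view a bridge could build on. ExternalStateView and all of its fields are invented for this sketch; nothing here is an existing Mahout or lightning.gpu interface, and the ingest path on the lightning.gpu side (step 1) is exactly the piece that doesn't exist yet:

```rust
/// Hypothetical export view for a zero-copy bridge (illustration only).
/// The idea: Mahout keeps ownership of the device buffer and exposes just
/// enough metadata for a consumer to wrap it without copying, e.g. through
/// a DLPack-style capsule at the Python binding layer.
#[derive(Debug, Clone, Copy)]
pub enum Dtype {
    Complex64,  // interleaved float32 re/im, as in the q34 probe state
    Complex128, // interleaved float64 re/im
}

#[derive(Debug)]
pub struct ExternalStateView {
    pub device_ptr: usize,   // raw CUDA device pointer, borrowed not owned
    pub num_amplitudes: u64, // shard length in amplitudes, not bytes
    pub dtype: Dtype,
    pub device_id: usize,    // CUDA device the shard buffer lives on
}

fn main() {
    // Dummy values; a real bridge would read these from a live
    // DistributedStateVector shard on the target device.
    let view = ExternalStateView {
        device_ptr: 0,
        num_amplitudes: 1u64 << 33, // one of two shards of a 34-qubit state
        dtype: Dtype::Complex64,
        device_id: 0,
    };
    println!("{view:?}");
}
```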

Member

ryankert01 commented May 7, 2026

@viiccwen Since it's 3 PRs away and fairly large, can we postpone it to the next release? It will be more mature by then, and we can work out the details.

Contributor Author

viiccwen commented May 7, 2026

> @viiccwen Since it's 3 PRs away and fairly large, can we postpone it to the next release? It will be more mature by then, and we can work out the details.

Sure, that's fine.

