Support for the Ascend backend: add npu runtime environment & JIT codegen#160

Open: PPPoint-t wants to merge 13 commits into InfiniTensor:master from PPPoint-t:dev-ascend-add-npu-codegen

PPPoint-t commented Jun 8, 2026

Prepare the runtime environment needed for Ascend/NPU execution and add the JIT codegen path for Ascend/NPU kernels.

  • Resolve the ninetoothed cache directory from NINETOOTHED_CACHE_DIR
  • Fall back to writable cache locations when the home cache is unavailable
  • Make default config calculation tolerate missing CUDA-style device properties
  • Add NPU device discovery support in test utilities
  • Generate both CUDA and NPU variants from CodeGenerator
  • Guard generated source with runtime NPU availability checks
  • Select _npu kernel and launch symbols in JIT.__call__
  • Add initial Ascendifier rewrites for CANN libdevice, load fallback, clamp lowering, dtype compatibility, and autotune key filtering
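
The cache-directory items above can be illustrated with a minimal sketch. Only the `NINETOOTHED_CACHE_DIR` variable name comes from this PR; the function names, the `~/.cache/ninetoothed` default, and the temp-directory fallback are assumptions made for illustration and may not match the actual implementation.

```python
import os
import tempfile
from pathlib import Path


def _is_writable_dir(path):
    """Return True if `path` exists (or can be created) and is writable."""
    try:
        path.mkdir(parents=True, exist_ok=True)
    except OSError:
        return False
    return os.access(path, os.W_OK | os.X_OK)


def resolve_cache_dir(env=None):
    """Pick the first usable cache directory (hypothetical helper).

    Order: explicit NINETOOTHED_CACHE_DIR, then an assumed home cache
    location, then an assumed fallback under the system temp directory.
    """
    env = os.environ if env is None else env
    candidates = []

    override = env.get("NINETOOTHED_CACHE_DIR")
    if override:
        candidates.append(Path(override))

    try:
        # Assumed default location; Path.home() can fail without a home dir.
        candidates.append(Path.home() / ".cache" / "ninetoothed")
    except RuntimeError:
        pass

    # Assumed last-resort fallback for read-only or missing home directories.
    candidates.append(Path(tempfile.gettempdir()) / "ninetoothed")

    for candidate in candidates:
        if _is_writable_dir(candidate):
            return candidate
    raise RuntimeError("no writable cache directory found")
```

Trying candidates in order keeps the explicit override authoritative while still letting sandboxed or containerized runs proceed when the home cache cannot be written.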

To make reviewing manageable, the entire Ascend NPU backend feature is split into 4 stacked phases. Reviewers can use the links below to view the clean diff of each phase directly within this repository:

Note: These links are for architectural preview only. I will submit official PRs to this repository one at a time, as each preceding phase is merged.

Rewrite problematic clamp on Ascend, add cache/config fallbacks
Replace the old name-based broadcast and loop/dot/where handling with a narrow SDPA key-boundary tail mask rewrite, including the stable_qk mask after exp2.

Keep Ascend codegen compatibility fixes localized in Ascendifier: autotune key filtering, square block config rewrites, Ascend-safe config pruning, load fallback normalization, clamp lowering, and CANN libdevice routing.

Clean up helper naming and rewrite state flow so the AST passes are easier to follow without changing the generated SDPA behavior.
Drop the injected Ascend autotune prune helper and its debug-only meta selection plumbing, while keeping autotune key filtering and square block config rewrites so axis-limit handling still runs.
Format all modified Ascend backend files to maintain consistent code style.
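
The clamp lowering mentioned in the commits above is described as an AST pass. The sketch below shows what such a pass could look like using Python's standard `ast` module. The class name `ClampLowering`, the helper `lower_clamp`, and the choice of `minimum(maximum(x, lo), hi)` as the lowered form are assumptions for illustration; the actual Ascendifier rewrite may differ.

```python
import ast


class ClampLowering(ast.NodeTransformer):
    """Rewrite `<mod>.clamp(x, lo, hi)` as `<mod>.minimum(<mod>.maximum(x, lo), hi)`.

    Hypothetical pass: only positional three-argument attribute calls are
    rewritten; every other call is left untouched.
    """

    def visit_Call(self, node):
        # Rewrite nested calls first so inner clamps are lowered too.
        self.generic_visit(node)

        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "clamp"):
            return node
        if len(node.args) != 3 or node.keywords:
            return node

        x, lo, hi = node.args

        def call(name, *args):
            # Build `<mod>.<name>(*args)` on the same module object as clamp.
            attr = ast.Attribute(value=func.value, attr=name, ctx=ast.Load())
            return ast.Call(func=attr, args=list(args), keywords=[])

        lowered = call("minimum", call("maximum", x, lo), hi)
        return ast.copy_location(lowered, node)


def lower_clamp(source):
    """Parse `source`, apply the rewrite, and return the new source text."""
    tree = ClampLowering().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```

For example, `lower_clamp("y = tl.clamp(x, 0.0, 1.0)")` rewrites the call into nested `tl.maximum` and `tl.minimum` calls. Note that a min/max composition can treat NaN inputs differently from a native clamp, so a real pass would need to confirm the semantics it wants.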
