Enable SME2 Streaming SVE in ARM #9126

Open
stevesuzuki-arm wants to merge 3 commits into halide:main from stevesuzuki-arm:pr-sme2

@stevesuzuki-arm
Contributor

Enable SME2 Streaming SVE in ARM

This PR adds initial ARM SME2 streaming-mode support to Halide,
which lets us compute with longer SVE vectors on targets with SME2.

A new sme_streaming(enable, var) scheduling directive lets users
control which loop is computed in streaming mode.

The change introduces a new Target::SME2 feature and Target::streaming_vector_bits.
natural_vector_size() now depends on whether streaming mode is active,
because streaming_vector_bits may differ from vector_bits.
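
To make the dependency concrete, here is a minimal sketch of a mode-dependent natural vector size. `TargetModel` and the free function below are hypothetical stand-ins written for illustration; they are not Halide's `Target` class or the PR's actual implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for the two vector-length fields described above;
// this is not Halide's Target class.
struct TargetModel {
    int vector_bits;            // regular (non-streaming) SVE vector length, in bits
    int streaming_vector_bits;  // streaming SVE vector length, in bits
};

// Number of lanes of an element type that fit in one vector for the active mode.
int natural_vector_size(const TargetModel &t, int element_bits, bool streaming) {
    const int bits = streaming ? t.streaming_vector_bits : t.vector_bits;
    assert(element_bits > 0 && bits % element_bits == 0);
    return bits / element_bits;
}
```

With an assumed 128-bit regular vector and a 512-bit streaming vector, a 32-bit element type gets 4 lanes outside streaming mode and 16 lanes inside it.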

In Halide lowering, a new LowerSMEStreamingTasks pass is added,
which extracts each streaming-mode loop into an internal closure function
so that we can attach the LLVM function attributes that transition into and out of streaming mode:

  • aarch64_pstate_sm_body to emit the smstart/smstop transitions
  • NoInline to prevent the streaming closure from being inlined into a non-streaming function
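
The shape of this extraction can be sketched in plain C++. Everything named here (`StreamingClosure`, `streaming_task`, `run_pipeline`) is hypothetical and only models the idea of outlining a loop into a separate, non-inlined function; the actual pass works on Halide IR and attaches the attributes to the generated LLVM function. To my understanding, the source-level counterpart of `aarch64_pstate_sm_body` in the Arm C Language Extensions is `__arm_locally_streaming`, which is deliberately left out so that the sketch builds on any host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical closure: everything the outlined loop reads or writes.
struct StreamingClosure {
    const float *in;
    float *out;
    std::size_t extent;
};

// The outlined loop. In the PR, the corresponding generated function would
// carry "aarch64_pstate_sm_body" and NoInline; only noinline is modeled here.
#if defined(__GNUC__) || defined(__clang__)
__attribute__((noinline))
#endif
void streaming_task(const StreamingClosure &c) {
    assert(c.extent == 0 || (c.in != nullptr && c.out != nullptr));
    for (std::size_t i = 0; i < c.extent; ++i) {
        c.out[i] = c.in[i] * 2.0f;
    }
}

// Non-streaming caller: packs the closure and calls across the boundary
// where the mode transition would happen.
std::vector<float> run_pipeline(const std::vector<float> &in) {
    std::vector<float> out(in.size());
    StreamingClosure c{in.data(), out.data(), in.size()};
    streaming_task(c);
    return out;
}
```

Keeping the loop in its own function gives the mode switch a well-defined boundary, which is why inlining it back into the non-streaming caller has to be prevented.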

In CodeGen, target_vscale() depends on whether the function is in streaming mode,
so it can vary within a Module, although it is constant within a Function.
In streaming mode, vector type code generation and intrinsic selection are
based on streaming_vector_bits (the streaming vscale).
Coverage is almost the same as the existing SVE2 code generation;
SME2-specific instructions are not enabled yet.
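
As an illustration of a per-function vscale, the sketch below derives vscale from the vector length (SVE defines vscale as the vector length divided by 128 bits) and prints the scalable LLVM vector type that a fixed lane count maps to. `FunctionModel`, `vscale_for`, and `scalable_type` are hypothetical names, and the mapping from a fixed lane count to a scalable type is an assumption about how such code generation can work, not a description of the PR's code.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-function flag: true for an extracted streaming closure.
struct FunctionModel {
    bool streaming;
};

// vscale is constant within one function but may differ between functions.
int vscale_for(const FunctionModel &f, int vector_bits, int streaming_vector_bits) {
    const int bits = f.streaming ? streaming_vector_bits : vector_bits;
    assert(bits > 0 && bits % 128 == 0);
    return bits / 128;
}

// Spell a fixed lane count as an LLVM scalable vector type for a known vscale.
std::string scalable_type(int element_bits, int lanes, int vscale) {
    assert(vscale > 0 && lanes % vscale == 0);
    return "<vscale x " + std::to_string(lanes / vscale) +
           " x i" + std::to_string(element_bits) + ">";
}
```

With an assumed 128-bit regular and 512-bit streaming vector length, 4 lanes of i32 in a regular function and 16 lanes of i32 in a streaming function both map to `<vscale x 4 x i32>`; only the vscale assumed for the enclosing function differs.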

Additionally, the following changes are implemented:

  • Auto-detect SME2 and streaming_vector_bits on host CPU
  • Fall back from streaming SVE when vectorization factors are not feasible
  • Gather/scatter in streaming mode is scalarized, with a warning
  • Add runtime checks for mismatches between the runtime streaming vscale and the compile-time vscale
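
The runtime vscale check in the last item can be sketched as a simple comparison. The names below are hypothetical, and the runtime vector length is passed in as a plain argument rather than queried from hardware; how the PR obtains it and reports the error through halide_error_vscale_invalid() is not shown here.

```cpp
#include <cassert>

enum class VscaleCheck { Ok, Mismatch };

// compiled_streaming_vector_bits: the value the pipeline was compiled for.
// runtime_streaming_vector_bytes: hypothetical value queried at run time.
VscaleCheck check_streaming_vscale(int compiled_streaming_vector_bits,
                                   int runtime_streaming_vector_bytes) {
    assert(compiled_streaming_vector_bits % 128 == 0);
    const int compiled_vscale = compiled_streaming_vector_bits / 128;
    const int runtime_vscale = runtime_streaming_vector_bytes / 16;  // 128 bits = 16 bytes
    return compiled_vscale == runtime_vscale ? VscaleCheck::Ok : VscaleCheck::Mismatch;
}
```

A pipeline compiled for 512-bit streaming vectors passes on a machine that reports 64 bytes and fails on one that reports 32 bytes, where code generated for the wider vectors would otherwise compute with the wrong lane count.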

Breaking changes

  • The signature of halide_error_vscale_invalid() has changed. A separate API will be considered if necessary.

Checklist

  • Tests added or updated (not required for docs, CI config, or typo fixes)
  • Documentation updated (if public API changed)
  • Python bindings updated (if public API changed)

Added:
- Target::SME2 definition
- streaming_vector_bits in Target for SME2
- Auto-detect SME2 and streaming_vector_bits
- sme_streaming() scheduling directive in Func and Pipeline
- DeviceAPI::Host_SMEStreaming in IR "For"
- LowerSMEStreamingTasks pass to extract streaming closure
- Attribute in LoweredFunc for streaming closure
- LLVM Function attribute to control streaming mode
  - NoInline to prevent streaming closure from being inlined
  - "aarch64_pstate_sm_body" to emit smstart/smstop transition
- Disable gather/scatter in SME streaming mode

Tests:
- Add correctness/sme_streaming
- Run simd_op_check_sve2 in SME streaming mode
- Add test to assert runtime streaming vscale
@stevesuzuki-arm
Contributor Author

This PR is ready for review. I will touch on this in the dev meeting if I have a chance.

2 participants