
[Tile] Use unpacked vector field for Tile16x16/Tile32x32 register storage#722

Open
hughperkins wants to merge 5 commits into main from hp/tiles-use-unpacked-vector


@hughperkins

Collaborator

Summary

Replace the hand-rolled r0..rN-1: dtype field declarations and their matching cascades in Tile16x16 / Tile32x32 with a single

r: qd.types.vector(_TILE, dtype, unpacked=True)

field, accessed as self.r[k]. When the index is a Python int or is resolved by qd.static, the unpacked vector still maps to one independent register slot per use, so the generated PTX/LLVM IR is unchanged, while the source shrinks by a net 870 lines.

Also drops the now-redundant private helpers (_get_col, _set_col, _r) and the _REGS field-name table. These were all _-prefixed and only used internally to the two tile modules.
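The swap described in the summary can be pictured with a pure-Python analogy. This is only a sketch: it does not use the quadrants API, and `CascadeTile`, `VectorTile` and `N = 4` are hypothetical stand-ins for the real tile structs and `_TILE`.

```python
# Pure-Python analogy only; none of these names exist in quadrants.
N = 4  # stand-in for _TILE (16 or 32 in the PR)


class CascadeTile:
    """Old style: one named field per slot plus an if-cascade to pick one."""

    def __init__(self):
        self.r0 = 0.0
        self.r1 = 0.0
        self.r2 = 0.0
        self.r3 = 0.0

    def set(self, k, val):
        if k == 0:
            self.r0 = val
        elif k == 1:
            self.r1 = val
        elif k == 2:
            self.r2 = val
        elif k == 3:
            self.r3 = val

    def get(self, k):
        if k == 0:
            return self.r0
        if k == 1:
            return self.r1
        if k == 2:
            return self.r2
        return self.r3


class VectorTile:
    """New style: a single indexable field replaces r0..rN-1."""

    def __init__(self):
        self.r = [0.0] * N

    def set(self, k, val):
        self.r[k] = val

    def get(self, k):
        return self.r[k]
```

Both classes store and return the same values; the second simply needs no per-slot boilerplate, which is where the line-count reduction comes from.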

Test plan

  • pre-commit run -a (black, ruff, pylint): clean
  • pyright python/quadrants/lang/simt/_tile16.py python/quadrants/lang/simt/_tile32.py: 0 errors
  • python tests/run_tests.py -v -t1 test_tile on an RTX PRO 6000 cluster node: 732 passed, 182 skipped, 0 failed (~10 min); covers cuda+vulkan, f32+f64, ndarray+field for both tile sizes, including cholesky_, solve_triangular_, qd.outer(...) rank-1 updates, slice load/store, and the blocked-Cholesky demo
  • The 68 PURE.VIOLATION warnings on TILE / SIZE test globals are pre-existing and unrelated

Made with Cursor

…rage

Replace hand-rolled ``r0..rN-1: dtype`` field declarations and their
matching ``if k == 0: self.r0 = val; ...`` cascades with a single
``r: qd.types.vector(_TILE, dtype, unpacked=True)`` field accessed via
``self.r[k]``.  This shrinks the surface area significantly (net -870
lines) without changing the generated PTX/LLVM IR: with python-int /
qd.static-resolved indices the unpacked field still maps to one
register slot per use, matching what the explicit cascade produced.

Also removes the now-redundant private helpers ``_get_col``,
``_set_col``, ``_r`` and the ``_REGS`` field-name table.

…ed on N

The two factory bodies were structurally identical except for ``_TILE = 16``
vs ``_TILE = 32``.  Replace them with a single ``_make_tile_class(N, dtype)``
factory and a single ``_TileProxy(N)`` proxy class, then instantiate
``Tile16x16Proxy = _TileProxy(16)`` and ``Tile32x32Proxy = _TileProxy(32)``.

Net diff for this commit: -343 lines.  Same generated IR.

Updates the few internal consumers (``simt/__init__.py``, ``tile_slicing.py``,
``quadrants/__init__.py``, ``tests/python/test_tile.py``) and a couple of stale
``test_tile16`` references in the docs.
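The size-parameterized factory described in this commit message could look roughly like the plain-Python sketch below. The names mirror the commit message, but the bodies are assumptions: the real `_make_tile_class` and `_TileProxy` are not shown in this conversation, and the per-dtype caching here is purely illustrative.

```python
def make_tile_class(n, dtype):
    """Hypothetical stand-in for _make_tile_class(N, dtype)."""

    class Tile:
        SIZE = n
        DTYPE = dtype

        def __init__(self):
            # One slot per column, all initialised to zero of the given dtype.
            self.r = [dtype(0)] * n

    Tile.__name__ = f"Tile{n}x{n}"
    return Tile


class TileProxy:
    """Hypothetical stand-in for _TileProxy(N): builds one class per dtype."""

    def __init__(self, n):
        self.n = n
        self._cache = {}  # assumed caching, keyed by dtype

    def __call__(self, dtype):
        if dtype not in self._cache:
            self._cache[dtype] = make_tile_class(self.n, dtype)
        return self._cache[dtype]


# The two public proxies differ only in the size argument.
Tile16x16Proxy = TileProxy(16)
Tile32x32Proxy = TileProxy(32)
```

With this shape, adding another tile size would be a one-line instantiation rather than a third copy of the module.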

@hughperkins

Collaborator Author

Benchmarks on genesis:

[image: 20260609_tiles_unpacked]

dex_hand regression seems concerning 🤔

@hughperkins

Collaborator Author

tests pass at least:

[image: Screenshot 2026-06-08 at 12 45 19]

…-vector

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	python/quadrants/lang/simt/_tile16.py
#	python/quadrants/lang/simt/_tile32.py
#	tests/python/test_tile.py
@hughperkins

Collaborator Author

Genesis benchmarks:

[image: 20260629_tiles]


…acked-vector + fusion refactor

1. ``_trsm`` had been changed from a runtime ``for c in range(_TILE)`` loop to a fully-unrolled
   ``qd.static(range(N))`` so that ``self.r[j]`` (an unpacked vector field) could be accessed with a
   python-int index.  Forcing full unrolling spikes the live set: the resulting PTX for the blocked
   Cholesky kernel jumps from 653 to 894 .b32 registers (+37%) and the shuffle count from 174 to 304,
   producing a measurable ~9.4% slowdown on ``misc/demos/cholesky_blocked.py`` (N=92, 4096 envs) and
   (per genesis benchmarking) a 4-7% regression on contact-heavy Newton scenarios (box_pyramid,
   dex_hand, double_smplx).  Restore the runtime ``range(N)`` outer/inner loops and introduce
   ``_get_col`` / ``_set_col`` helpers that emit explicit static-unrolled cascades over the unpacked
   vector slots -- functionally equivalent to the hand-rolled ``r0..rN-1`` cascade ``_tile16.py`` /
   ``_tile32.py`` used to carry, but driven by the new ``self.r[kk]`` access.  Post-fix the PTX for
   the demo kernel is byte-identical to main (modulo the session nonce).

2. ``_resolve_vec2d`` / ``_resolve_vec3d`` had been seeded with ``v = dtype(0.0)``.  This trips the
   identity-keyed type-construction path in the AST transformer when the kernel's ``dtype`` is a
   ``copy.deepcopy`` of a primitive (which is what ``qd.init(default_fp=...)`` produces).  Swap to
   the identity-independent ``qd.cast(0.0, dtype)``, matching the pre-fusion fix (#738) that I lost
   during the merge.  Restores ``test_vec_proxy_non_identity_dtype`` to passing.
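Why a deep-copied dtype can slip past an identity-keyed lookup, as item 2 describes, can be shown in plain Python. This is a sketch of the general failure mode only; `DType`, `BY_IDENTITY` and `BY_VALUE` are hypothetical and do not reflect how the quadrants AST transformer is actually implemented.

```python
import copy


class DType:
    """Hypothetical primitive-type descriptor with value-based equality."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, DType) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


f32 = DType("f32")

# A registry keyed on object identity only recognises the original object.
BY_IDENTITY = {id(f32): "float32 constructor"}
# A registry keyed on value recognises any equal descriptor.
BY_VALUE = {f32: "float32 constructor"}

clone = copy.deepcopy(f32)  # equal to f32, but a distinct object

found_by_identity = id(clone) in BY_IDENTITY  # False: the clone is a new object
found_by_value = clone in BY_VALUE            # True: equality still holds
```

An identity-independent construction such as the `qd.cast(0.0, dtype)` call named in the commit message avoids depending on which object instance the dtype happens to be.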

Co-authored-by: Cursor <cursoragent@cursor.com>
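The cascade helpers from item 1 of the commit message above can be sketched in plain Python. This is an analogy under stated assumptions, not the PR's code: `get_col`, `set_col`, `scale_columns` and `N = 4` are hypothetical, and the inner `range(N)` loop stands in for what would be a `qd.static`-unrolled loop, so that the slots are only ever indexed by compile-time constants while the outer loop stays a runtime loop.

```python
N = 4  # stand-in for the tile size (16 or 32 in the PR)


def get_col(r, k):
    """Return r[k] for a runtime k while indexing r only with constants."""
    out = 0.0
    for kk in range(N):  # assumed to be statically unrolled in the real helper
        if k == kk:
            out = r[kk]
    return out


def set_col(r, k, val):
    """Write val into r[k] for a runtime k, again via constant indices only."""
    for kk in range(N):  # assumed to be statically unrolled in the real helper
        if k == kk:
            r[kk] = val


def scale_columns(r, factor):
    """Runtime outer loop (not unrolled) that goes through the cascade helpers."""
    for c in range(N):
        set_col(r, c, get_col(r, c) * factor)
    return r
```

The trade-off is the one the commit message reports: the compare-and-select cascade costs a few instructions per access, but keeping the outer loop rolled avoids the larger live set that full unrolling produced.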
