Closed
52 commits
81cfca7
[FMHA] gfx950 dualwave SWP forward kernel: split-K, varlen, arbitrary…
yanguahe Jun 16, 2026
aae5470
[Kernel] Add W4A6 (MXFP6 A x MXFP4 B) support to preshuffle GEMM (#684)
amd-satre Jun 16, 2026
5e97cfc
[gfx1250][gemm] Add PTPC FP8/A8W4, non-tile-aligned M, and strided A/…
aoli26 Jun 16, 2026
50c5c61
[Chore] Bump version to 0.2.1 (#690)
coderfeli Jun 16, 2026
4094b35
[Enh] Improve type closure for primitive func (#552)
sjfeng1999 Jun 16, 2026
baae05f
fmha: trim kernel comments; add small-seq_len benchmark shapes (#693)
yanguahe Jun 16, 2026
31df76d
[Bugfix] rmsnorm: annotate known_block_size on large-M small-N path (…
jhinpan Jun 16, 2026
b25dbe1
enh(hotspot_analyzer): add --kernel filter for CSV metadata matching …
Arist12 Jun 16, 2026
0fab09f
[Refactor] update get_c_pointers to c_abi_spec (#682)
sjfeng1999 Jun 16, 2026
b57dee5
[Chore] Bump version to 0.2.2 (#697)
coderfeli Jun 16, 2026
3fd1ae5
ci: install mori from pip instead of source (#692)
yanboshao Jun 17, 2026
3a80579
enh(test_common): add profiler-safe HIP-event timing path to run_perf…
Arist12 Jun 17, 2026
2649837
[Fix] Align tensor integer storage with pointer types (#700)
sjfeng1999 Jun 17, 2026
8d541ab
[FEAT] Update location tracing coverage (#702)
sjfeng1999 Jun 17, 2026
5d4f772
[gfx1250][gemm] Make mxscale B-scale preshuffle tile-independent (#679)
aoli26 Jun 18, 2026
1baf0d2
[FMHA] gfx950: batch-aware dense seq_len routing (DUALWAVE_SWP vs gen…
jhinpan Jun 18, 2026
7d521ad
[Fix] Call aiter pa_reduce_v1 by keyword to track arg-order change (#…
coderfeli Jun 20, 2026
523ca1c
[FMHA] add flash_attn_interface L2 wrapper and cross-length Q/KV (seq…
yanguahe Jun 21, 2026
aeb5afc
fp8_gemm_4wave: pin MFMA accumulator in AGPR (+5~13% across medium-la…
benenzhu Jun 22, 2026
0b24879
[gfx1250][gemm] Fix A-scale VGPR and optimize decode GEMM (#705)
aoli26 Jun 22, 2026
0d07b39
fix slice return type on coord_tensor (#707)
tingqli Jun 22, 2026
f65d6a0
[Enh] More readable DslError traceback (#703)
sjfeng1999 Jun 22, 2026
eb7d69c
Optimize rmsnorm/layernorm to get better performance than aiter/trito…
cschenjunlin Jun 22, 2026
8fe8a67
[Fix] Capture the width of ir.Value correctly in the IntTupleAttrBuil…
sjfeng1999 Jun 24, 2026
a35627a
[Fix] Let Layout permissive in the IntTupleLike ops (#728)
sjfeng1999 Jun 24, 2026
245748b
chore: ignore .humanize artifacts and add local claude settings
jhinpan Jun 24, 2026
8560e01
Add MXFP4 MoE tuning foundation: legality filter, harness, ledger, spec
jhinpan Jun 24, 2026
9d50b08
Round 1: fix legality LDS + executable harness + measured baseline
jhinpan Jun 24, 2026
9fd1181
Round 2: valid baseline (strict aiter e2e, median+p95, hardened valid…
jhinpan Jun 24, 2026
5fbe54b
Round 3: strict/AOT/model-correct aiter guardrail; a4w4 validated; a8…
jhinpan Jun 24, 2026
799313e
Round 4: truthful timed-loop median+p95 + auditable a8w4 evidence
jhinpan Jun 24, 2026
23a1286
Round 5: faithful L2-flush timed loop + reproducible a4w4 baseline
jhinpan Jun 24, 2026
cd036cf
Round 6: stabilize e2e (reps=3), characterize residual small-token noise
jhinpan Jun 24, 2026
e0eda12
Round 6 follow-up: reconcile ledger/attempts provenance with repeatab…
jhinpan Jun 24, 2026
1fc7485
Round 7: actually pin GPU clocks; exhaust in-protocol repeatability l…
jhinpan Jun 24, 2026
bbdb9bb
Round 8: harness-enforce + verify clock pinning in the measurement dr…
jhinpan Jun 24, 2026
05e0ee4
Round 9: correct repeatability regime (not small-token-only); _main f…
jhinpan Jun 24, 2026
55d1ca8
Continuation R0: regime-aware no-regression band (small-token 8us floor)
jhinpan Jun 24, 2026
dd9a83d
Continuation R0: first a4w4 tile-sweep result -- DS V3 small-token wi…
jhinpan Jun 24, 2026
cb62aee
R1: reproducible candidate CLI; DS V3 a4w4 tile_n=128 e2e-verified (P…
jhinpan Jun 24, 2026
f1fc96c
R1: DS V3 a4w4 tile_n=128 small-token win CONFIRMED (re-run stable, P…
jhinpan Jun 24, 2026
22d0495
R2: fail-closed candidate CLI; correct the DS V3 overclaim (honest ev…
jhinpan Jun 24, 2026
81961b6
R3: full provenance for rejected candidates; correct ledger no-regres…
jhinpan Jun 24, 2026
b920522
R4: complete rejected-candidate provenance contract (selection + csv/…
jhinpan Jun 24, 2026
8527041
R5: resolve AC-1.1 Kimi K2 token-128 repeatability via live pinned-cl…
jhinpan Jun 24, 2026
61c677b
R6: two-metric Kimi K2 a4w4 repeatability (AC-1.1 MET) via durable CLI
jhinpan Jun 24, 2026
83c2d3a
R7: replayable repeatability provenance + selected-candidate AOT/corr…
jhinpan Jun 24, 2026
e0ea86f
R8: integrate selected-candidate gate into compare_csvs via claimable…
jhinpan Jun 24, 2026
2f688d6
R9: DS V3 a4w4 32/64 decided -- stage1 tile-only cannot win (routed t…
jhinpan Jun 24, 2026
d617ce8
R10: complete legal k1=256 stage1 coverage for DS V3 32/64; narrow R9…
jhinpan Jun 24, 2026
8df53ee
R11: repair R10 evidence integrity (dedup rejections, fail-closed mis…
jhinpan Jun 24, 2026
2bd319e
R12: fix R11 supersede-link defect + style gate; add supersede-link scan
jhinpan Jun 24, 2026
5 changes: 5 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,5 @@
{
  "permissions": {
    "deny": ["AskUserQuestion"]
  }
}
83 changes: 64 additions & 19 deletions .claude/skills/kernel-trace-analysis/scripts/hotspot_analyzer.py
@@ -238,18 +238,33 @@ def print_source_detail(hotspot, source_cache, context=3):
print(f" stall={fmt_cycles(inst.stall_cycles):>7} type={inst.stall_type:<12} {inst.asm}")


-def read_kernel_metadata(dispatch_dir):
+def read_kernel_metadata(dispatch_dir, kernel_filter=""):
     """Read authoritative resource counts from ``out_kernel_trace.csv`` if present.

     The ATT ``code.json`` only contains the (possibly single-CU, possibly
     vgpr-form) disassembly, so it cannot reveal accum_vgpr / SGPR / LDS /
     workgroup size. The kernel-trace CSV carries the real launch metadata.
     Searches the dispatch dir and its parent (staging often copies the CSV
     next to the ui_output_agent_* dir). Returns {} if not found.
+
+    Row selection priority:
+    1. ``kernel_filter`` substring matched against Kernel_Name, optionally
+       narrowed by Dispatch_Id when the dir name encodes ``dispatch_<id>``
+       (rocprofv3 ``ui_output_agent_*_dispatch_<id>`` layout). Dispatch_Id
+       matching avoids false matches when a PyTorch reference kernel shares
+       the same name substring.
+    2. Bidirectional name heuristic against the directory basename (legacy
+       path for timestamped dirs like ``20240101_120000_pa_decode_kernel``).
     """
     candidates = []
     for base in (dispatch_dir, os.path.dirname(os.path.abspath(dispatch_dir))):
         candidates += glob.glob(os.path.join(base, "*kernel_trace*.csv"))
+
+    dir_name = os.path.basename(os.path.abspath(dispatch_dir))
+    # Extract the dispatch id from rocprofv3's ui_output_agent_<N>_dispatch_<id> layout.
+    _dispatch_id_m = re.search(r"dispatch_(\d+)$", dir_name)
+    dispatch_id = _dispatch_id_m.group(1) if _dispatch_id_m else None
+
     for path in candidates:
         try:
             with open(path) as f:
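The dispatch-id extraction added in this hunk can be tried in isolation. The following is a minimal sketch that reuses the same regular expression; the helper name `dispatch_id_from_dir` and the example paths are hypothetical and do not appear in the PR.

```python
import os
import re


def dispatch_id_from_dir(dispatch_dir):
    """Return the trailing dispatch id encoded in the directory name, or None.

    Hypothetical helper: it mirrors the regex used in the hunk above, but the
    function itself is not part of the PR.
    """
    dir_name = os.path.basename(os.path.abspath(dispatch_dir))
    match = re.search(r"dispatch_(\d+)$", dir_name)
    return match.group(1) if match else None


# rocprofv3-style directory: the id is the run of digits after "dispatch_".
rocprof_id = dispatch_id_from_dir("/tmp/run/ui_output_agent_7_dispatch_42")

# Legacy timestamped directory: no dispatch id is encoded, so the result is None.
legacy_id = dispatch_id_from_dir("/tmp/run/20240101_120000_pa_decode_kernel")
```

The id is returned as a string, which matches how the hunk compares it against the `Dispatch_Id` CSV column after `str(...).strip()`.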
@@ -258,24 +273,40 @@ def read_kernel_metadata(dispatch_dir):
             continue
         if not rows or "Accum_VGPR_Count" not in rows[0]:
             continue
-        # Pick the row whose kernel matches the dispatch dir name. The dir is
-        # usually staged as "<timestamp>_<short_kernel_name>" while the CSV
-        # Kernel_Name has a trailing index (e.g. dir ".._pa_decode_ps_kernel"
-        # vs kernel "pa_decode_ps_kernel_0"), so match bidirectionally on the
-        # timestamp-stripped short name.
-        dir_name = os.path.basename(os.path.abspath(dispatch_dir))
-        short = re.sub(r"^\d{8}_\d{6}_", "", dir_name)  # strip YYYYMMDD_HHMMSS_
-
-        def _matches(kn):
-            if not kn:
-                return False
-            return kn in dir_name or short in kn or kn.startswith(short) or short.startswith(kn)
+
+        has_dispatch_col = "Dispatch_Id" in rows[0]

         chosen = None
-        for r in rows:
-            if _matches(r.get("Kernel_Name", "")):
-                chosen = r
-                break
+        if kernel_filter:
+            # Explicit filter: kernel name substring, narrowed by Dispatch_Id when available.
+            can_disambiguate = bool(dispatch_id and has_dispatch_col)
+            matches = [r for r in rows if kernel_filter in r.get("Kernel_Name", "")]
+            if can_disambiguate:
+                matches = [r for r in matches if str(r.get("Dispatch_Id", "")).strip() == dispatch_id]
+            if matches:
+                chosen = matches[0]
+                if not can_disambiguate and len(matches) > 1:
+                    # First-substring-wins: no dispatch id available to pick between same-named rows.
+                    print(
+                        f" warning: --kernel '{kernel_filter}' matched {len(matches)} rows in "
+                        f"{os.path.basename(path)} with no dispatch id to disambiguate; using the "
+                        "first match (pass a more specific --kernel)"
+                    )
+        else:
+            # Legacy heuristic: bidirectional substring match against the dir basename.
+            # Works for timestamped dirs like ``20240101_120000_pa_decode_kernel``.
+            short = re.sub(r"^\d{8}_\d{6}_", "", dir_name)  # strip YYYYMMDD_HHMMSS_
+
+            def _matches(kn):
+                if not kn:
+                    return False
+                return kn in dir_name or short in kn or kn.startswith(short) or short.startswith(kn)
+
+            for r in rows:
+                if _matches(r.get("Kernel_Name", "")):
+                    chosen = r
+                    break
+
         if chosen is None:
             continue  # no matching row in this CSV — try the next candidate
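To see why the `Dispatch_Id` narrowing matters, here is a small self-contained sketch of the explicit-filter branch. The function name `select_row` and the sample rows are hypothetical and written for illustration only. It omits the warning print and the legacy fallback from the hunk above.

```python
def select_row(rows, kernel_filter, dispatch_id=None):
    """Pick a metadata row by Kernel_Name substring, narrowed by Dispatch_Id when possible.

    Hypothetical stand-alone version of the explicit-filter branch; rows are
    dicts as produced by csv.DictReader.
    """
    has_dispatch_col = bool(rows) and "Dispatch_Id" in rows[0]
    can_disambiguate = bool(dispatch_id and has_dispatch_col)
    matches = [r for r in rows if kernel_filter in r.get("Kernel_Name", "")]
    if can_disambiguate:
        matches = [r for r in matches if str(r.get("Dispatch_Id", "")).strip() == dispatch_id]
    # First match wins when several rows remain.
    return matches[0] if matches else None


# Made-up rows: a reference kernel shares the name substring with the kernel of interest.
rows = [
    {"Kernel_Name": "ref_pa_decode_kernel", "Dispatch_Id": "3", "Accum_VGPR_Count": "0"},
    {"Kernel_Name": "pa_decode_kernel_0", "Dispatch_Id": "7", "Accum_VGPR_Count": "128"},
]

with_id = select_row(rows, "pa_decode_kernel", dispatch_id="7")  # narrowed to dispatch 7
without_id = select_row(rows, "pa_decode_kernel")                # falls back to the first substring match
```

Without a dispatch id the reference kernel's row is returned, which is the false match the docstring describes. With the id, only the row for the traced dispatch survives the filter.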

@@ -457,7 +488,10 @@ def print_reg_pressure(reg_info):
     print_header("Register Pressure & Occupancy")
     print(f" Architecture: {reg_info['arch']}")
     if not reg_info["has_meta"]:
-        print(" (no kernel_trace CSV found — accum/LDS/SGPR estimated from ISA only)")
+        print(
+            " (kernel_trace CSV not matched — accum/LDS/SGPR estimated from ISA only; "
+            "pass --kernel <name_substr> to enable CSV metadata lookup)"
+        )
     if reg_info["is_vgpr_form"]:
         print(f" arch_vgpr: {reg_info['arch_vgpr']} (MFMA vgpr-form: accumulators in arch file, no AGPR)")
     else:
@@ -496,6 +530,17 @@ def main():
         "--detail", action="store_true", help="Show source snippet + instruction breakdown under each source hotspot"
     )
     parser.add_argument("--context", type=int, default=3, help="Source lines of context around hotspot (default: 3)")
+    parser.add_argument(
+        "--kernel",
+        default="",
+        metavar="SUBSTR",
+        help="Kernel name substring for CSV metadata lookup "
+        "(e.g. 'pa_mqa_logits_fp4_kernel_0'). "
+        "Required when the dispatch dir name does not encode the kernel name, "
+        "as with rocprofv3 ui_output_agent_*_dispatch_<id> directories. "
+        "Combined with the dispatch id from the dir name when a Dispatch_Id "
+        "column is present in the CSV.",
+    )
     args = parser.parse_args()

     if not os.path.isdir(args.dispatch_dir):
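A minimal sketch of how the new `--kernel` option parses, assuming `dispatch_dir` is a positional argument (the hunk only shows that `args.dispatch_dir` exists, not how it is declared). The `build_parser` helper, the shortened help text, and the directory name are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical reduced parser: only the two arguments relevant here."""
    parser = argparse.ArgumentParser(description="hotspot analyzer (sketch)")
    # Assumption: dispatch_dir is positional in the real script.
    parser.add_argument("dispatch_dir")
    parser.add_argument(
        "--kernel",
        default="",
        metavar="SUBSTR",
        help="Kernel name substring for CSV metadata lookup",
    )
    return parser


# Passing an explicit argv list keeps the example independent of sys.argv.
with_filter = build_parser().parse_args(
    ["ui_output_agent_1_dispatch_12", "--kernel", "pa_mqa_logits_fp4_kernel_0"]
)
no_filter = build_parser().parse_args(["20240101_120000_pa_decode_kernel"])
```

Because the default is an empty string, omitting `--kernel` leaves `kernel_filter` falsy, so `read_kernel_metadata` takes the legacy directory-name heuristic.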
Expand All @@ -515,7 +560,7 @@ def main():
print(f" Total cycles: {fmt_cycles(total_cycles)}")
print(f" Total stalls: {fmt_cycles(total_stall)} ({100*total_stall/total_cycles:.1f}% of total cycles)")

meta = read_kernel_metadata(args.dispatch_dir)
meta = read_kernel_metadata(args.dispatch_dir, kernel_filter=args.kernel)
reg_info = detect_arch_and_reg_pressure(instructions, meta)
print_reg_pressure(reg_info)

Expand Down
6 changes: 2 additions & 4 deletions .github/workflows/flydsl.yaml
@@ -453,10 +453,8 @@ jobs:
         timeout-minutes: 15
         run: |
           docker exec flydsl_test bash -c "
-            apt-get install -y libpci-dev libibverbs-dev &&
-            rm -rf /tmp/mori &&
-            git clone --depth 1 --recursive --shallow-submodules https://github.com/ROCm/mori.git /tmp/mori &&
-            cd /tmp/mori && python3 -m pip install . &&
+            apt-get install -y libpci-dev libibverbs-dev libgrpc++1.51 libgrpc29 &&
+            python3 -m pip install amd_mori &&
             MORI_PRECOMPILE=1 python3 -c 'import mori'
           "

2 changes: 2 additions & 0 deletions .gitignore
@@ -64,3 +64,5 @@ Thumbs.db
 # Sphinx documentation build
 docs/_build/
 python/flydsl/_mlir
+
+.humanize*
35 changes: 35 additions & 0 deletions docs/a8w4_evidence.md
@@ -0,0 +1,35 @@
# a8w4 Strict-Path Correctness Evidence (locked ref 523ca1c7)

All a8w4 (fp8 activation x fp4 weight) points run through the strict, model-correct
aiter path (`scripts/aiter_strict_point.py`: true per-model activation/gate,
`strict_accuracy=True`). Correctness is gated on `logits_diff <= 0.01`. a8w4 is
correctness-BLOCKED in this environment (see the `kernels/moe_tuning_spec.py` quarantine
note). Categories: `correctness` = strict accuracy assertion (logits ~0.98);
`runtime` = kernel/runtime rejection (e.g. Unsupported scales/output); `pass` = logits<=0.01.

| model | total | correctness-fail | runtime-fail | pass |
|---|---|---|---|---|
| deepseek_v3 | 16 | 4 | 12 | 0 |
| deepseek_v4 | 16 | 10 | 6 | 0 |
| gpt_oss | 8 | 4 | 3 | 1 |
| kimi_k2 | 16 | 9 | 7 | 0 |

## Representative per-row errors

| model | token | category | error |
|---|---|---|---|
| deepseek_v3 | 1 | runtime | RuntimeError: Unsupported scales/output dtype! |
| deepseek_v3 | 16 | correctness | AssertionError: accuracy check failed: checkAllclose err=0.9967564344406128, logits_diff=0 |
| deepseek_v4 | 1 | runtime | RuntimeError: Unsupported scales/output dtype! |
| deepseek_v4 | 16 | correctness | AssertionError: accuracy check failed: checkAllclose err=0.996712863445282, logits_diff=0. |
| kimi_k2 | 1 | runtime | RuntimeError: Unsupported scales/output dtype! |
| kimi_k2 | 16 | correctness | AssertionError: accuracy check failed: checkAllclose err=0.996957004070282, logits_diff=0. |
| gpt_oss | 256 | pass | |
| gpt_oss | 512 | correctness | AssertionError: accuracy check failed: checkAllclose err=0.9967130422592163, logits_diff=0 |
| gpt_oss | 4096 | runtime | TypeError: __init__(): incompatible function arguments. The following argument types are s |

Source: `docs/baseline_523ca1c7_a8w4_strict.csv` (per-row strict_error, error_category,
aot_status, flydsl_command, kernel-path metrics). aot_status=no_aot for all a8w4: no aiter
AOT cache entry exists for these a8w4 shapes, so the strict runner runs without the AOT
gate; the kernel still compiles+runs and then fails the strict correctness gate or a runtime
scale/output check -- a real correctness/runtime block, not merely a missing AOT artifact.
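
The per-model summary table above could be regenerated from the per-row CSV along these lines. This is a sketch under stated assumptions: only the `error_category` column is named in the source, while the `model` and `token` column names, the literal `pass` category value, and the four sample rows (taken from the representative rows above) are assumptions about the file layout.

```python
import csv
import io
from collections import Counter, defaultdict


def tally_categories(csv_text):
    """Count rows per (model, error_category).

    Assumes columns named `model` and `error_category`; an empty category is
    treated as a pass. Both are assumptions about the real CSV.
    """
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        category = row["error_category"] or "pass"
        counts[row["model"]][category] += 1
    return counts


# Illustrative rows only, not the full baseline CSV.
SAMPLE = """model,token,error_category
gpt_oss,256,pass
gpt_oss,512,correctness
gpt_oss,4096,runtime
deepseek_v3,1,runtime
"""

counts = tally_categories(SAMPLE)
for model, per_category in sorted(counts.items()):
    total = sum(per_category.values())
    print(model, total, per_category["correctness"], per_category["runtime"], per_category["pass"])
```

Reading the real file would replace `io.StringIO(csv_text)` with an open file handle on `docs/baseline_523ca1c7_a8w4_strict.csv`.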