perf: phased codec pipeline #3885

Open
d-v-b wants to merge 71 commits into zarr-developers:main from d-v-b:perf/prepared-write-v2

Conversation


@d-v-b d-v-b commented Apr 8, 2026

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

  1. Preparatory IO, async: fetch exactly what we need from storage, given the codecs we have. For example, if there's a sharding codec in the first array -> bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
  2. Pure compute, sync: apply filters and compressors. Safe to parallelize over chunks.
  3. (If writing) final IO, async: reconcile the in-memory compressed chunks against our model of the stored chunk, then write out the bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
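The phase separation above can be sketched roughly like this. All names here (`MemoryStore`, `phased_read`, `decode`) are illustrative stand-ins, not the PR's actual API:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory store standing in for real async storage IO.
class MemoryStore:
    def __init__(self, data):
        self._data = data

    async def get(self, key):
        return self._data[key]

def decode(raw: bytes) -> bytes:
    # Stand-in for the filter/compressor chain: pure, synchronous compute.
    return bytes(reversed(raw))

async def phased_read(store, keys):
    # Phase 1: preparatory IO (async) -- fetch exactly the bytes we need.
    raw = await asyncio.gather(*(store.get(k) for k in keys))
    # Phase 2: pure compute (sync) -- decode chunks, parallelized over a pool.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode, raw))

store = MemoryStore({"c/0": b"abc", "c/1": b"xyz"})
result = asyncio.run(phased_read(store, ["c/0", "c/1"]))
# result == [b"cba", b"zyx"]
```

The point of the split is that phase 2 touches no IO at all, so it can be handed to a thread pool without worrying about the event loop.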

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than the BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

d-v-b added 4 commits April 7, 2026 10:38
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking`
is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
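A rough sketch of what such a protocol might look like. The field names and the `pack` signature below are guesses for illustration only, not the PR's actual definitions:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class PreparedWrite:
    # Hypothetical fields: where the new bytes land within the stored chunk.
    offset: int
    data: bytes

@runtime_checkable
class SupportsChunkPacking(Protocol):
    # An array -> bytes codec that can merge prepared writes into an
    # existing chunk blob without a full re-encode.
    def pack(self, existing: bytearray, writes: list["PreparedWrite"]) -> bytes: ...

class RawCodec:
    def pack(self, existing: bytearray, writes: list[PreparedWrite]) -> bytes:
        for w in writes:
            existing[w.offset : w.offset + len(w.data)] = w.data
        return bytes(existing)

codec = RawCodec()
assert isinstance(codec, SupportsChunkPacking)
packed = codec.pack(bytearray(b"....."), [PreparedWrite(1, b"abc")])
# packed == b".abc."
```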
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 8, 2026

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 86.62207% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.10%. Comparing base (0ea15fd) to head (5d3064e).

Files with missing lines Patch % Lines
src/zarr/core/codec_pipeline.py 86.18% 63 Missing ⚠️
src/zarr/codecs/numcodecs/_codecs.py 66.66% 8 Missing ⚠️
src/zarr/codecs/sharding.py 92.30% 7 Missing ⚠️
src/zarr/codecs/_v2.py 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.07%   92.10%   -0.98%     
==========================================
  Files          85       87       +2     
  Lines       11228    11801     +573     
==========================================
+ Hits        10451    10869     +418     
- Misses        777      932     +155     
Files with missing lines Coverage Δ
src/zarr/abc/codec.py 98.86% <100.00%> (+0.09%) ⬆️
src/zarr/core/array.py 97.82% <100.00%> (+0.01%) ⬆️
src/zarr/core/config.py 100.00% <ø> (ø)
src/zarr/codecs/_v2.py 90.19% <80.00%> (-3.43%) ⬇️
src/zarr/codecs/sharding.py 75.42% <92.30%> (-13.99%) ⬇️
src/zarr/codecs/numcodecs/_codecs.py 93.18% <66.66%> (-3.21%) ⬇️
src/zarr/core/codec_pipeline.py 87.62% <86.18%> (-6.57%) ⬇️

... and 6 files with indirect coverage changes



d-v-b commented Apr 9, 2026

@TomAugspurger how would this design work with CUDA codecs?

d-v-b and others added 14 commits April 9, 2026 15:28
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thread

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace batch-and-wait with streaming: chunks flow through
fetch → decode → scatter/store independently via asyncio.gather
with per-chunk coroutines. No chunk waits for all others to
finish a stage before advancing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When inner codecs are fixed-size and the store supports byte-range
writes, write individual inner chunks directly via set_range instead
of read-modify-write of the full shard blob.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12 tasks covering: global thread pool, streaming read/write,
SimpleChunkLayout fast path, ByteRangeSetter protocol, partial
shard writes, and benchmark verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Skip BasicIndexer/ChunkGrid creation for non-sharded layouts by
directly calling inner_transform.decode_chunk on the raw buffer.
Adds test to verify the fast path produces correct output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the 3-phase batch approach (fetch ALL → compute ALL → store ALL)
with a streaming pipeline where each chunk flows through fetch → compute
→ store independently via asyncio.gather, improving memory usage and
latency by allowing IO and compute to overlap across chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
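The contrast with batch-and-wait can be sketched as follows, with trivial stand-in `fetch`/`decode`/`store_chunk` functions (illustrative only):

```python
import asyncio

async def fetch(key):
    # Stand-in for async storage IO.
    await asyncio.sleep(0)
    return key.encode()

def decode(raw):
    # Stand-in for sync compute.
    return raw.upper()

async def store_chunk(out, key, value):
    out[key] = value

async def process_chunk(out, key):
    # Each chunk flows through fetch -> decode -> store independently;
    # no chunk waits for the others to finish a stage.
    raw = await fetch(key)
    decoded = decode(raw)
    await store_chunk(out, key, decoded)

async def streaming_read(keys):
    out = {}
    await asyncio.gather(*(process_chunk(out, k) for k in keys))
    return out

out = asyncio.run(streaming_read(["a", "b"]))
# out == {"a": b"A", "b": b"B"}
```

With per-chunk coroutines, the IO for one chunk can overlap with the compute for another, which is exactly the memory and latency win the commit message describes.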
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 on April 15, 2026 09:51
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
…PhasedCodecPipeline

All pipeline contract tests now run against both implementations:
roundtrip, sharded roundtrip, partial writes, missing chunks,
strided reads, multidimensional, and compression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from b67a5a0 to 4746601 on April 15, 2026 09:57
d-v-b and others added 26 commits April 15, 2026 12:02
The pre-extracted decode function used the default layout's baked-in
ArraySpec shape, which fails for rectilinear chunks where each chunk
may have a different size. Pass chunk_shape explicitly when it differs
from the default.

Fixes doctest failure: "cannot reshape array of size 500 into shape (1,1)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a 2d-rectilinear array config (chunks=[[10,20,30],[50,50]]) to
the parametrized test matrix. This would have caught the decode_chunk
reshape bug fixed in the previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ration

ChunkLayout now owns both the IO strategy (fetch_sync/fetch) and the
compute strategy (decode/encode). The pipeline's read_sync loop becomes
uniform:

    raw = layout.fetch_sync(byte_getter, ...)   # IO
    decoded = layout.decode(raw, chunk_spec)     # compute
    out[out_selection] = decoded[chunk_selection] # scatter

SimpleChunkLayout.decode calls inner_transform.decode_chunk directly.
ShardedChunkLayout.decode chooses between vectorized numpy decode
(for dense fixed-size shards) and per-inner-chunk decode (general case)
internally — the pipeline doesn't need to know.

This eliminates the ad-hoc if/elif chain in read_sync that previously
handled 4+ different cases with duplicated logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests define the interface for declarative IO planning: given an array
region and codec configuration, produce a ReadPlan data structure that
fully describes what byte-range reads are needed.

Covers:
- Non-sharded: full reads, partial reads, single element, 2D
- Sharded fixed-size: full shard, single inner chunk, two inner chunks
- Sharded variable-size: compressed inner chunk (index read required)
- Nested sharding: skipped (future)

All tests currently fail with NotImplementedError — _plan_read is a
stub defining the target interface we want to build toward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the existing RangeByteRequest from zarr.abc.store instead of a
custom ByteRange dataclass. ShardIndex now maps coords to
RangeByteRequest | None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the old fetch_sync/fetch/decode/encode methods on ChunkLayout
with resolve_index, fetch_chunks, decode_chunks, merge_and_encode,
store_chunks_sync, and store_chunks_async.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nkLayout, read_sync, write_sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhasedCodecPipeline reads the shard index + individual chunks separately,
so partial reads issue more get calls than the old full-blob fetch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove old methods that are no longer called after the four-phase
refactoring: _transform_read, _decode_shard, _decode_shard_vectorized,
_encode_shard_vectorized, _transform_write, _transform_write_shard,
_encode_per_chunk, _decode_vectorized, _encode_vectorized, _fetch_chunks,
_fetch_chunks_sync, chunk_byte_offset, inner_chunk_byte_length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers: no sharding, single-level sharding (with/without outer BB
codecs), nested sharding, N-level nesting. For each: full read,
partial read, full write, partial write. Documents optimal IO and
compute sequence for each scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ShardIndex carries leaf_transform. resolve_index is the only
layout-specific method. fetch_chunks, decode_chunks, merge_and_encode
are generic. Handles nested sharding by flattening index resolution
and using the leaf codec chain for decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tasks: add leaf_transform to ShardIndex, extract generic functions,
refactor read/write paths, implement recursive resolve_index for
nested sharding, remove dead code, update tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add module-level fetch_chunks_sync, fetch_chunks_async, and
decode_chunks_from_index functions that use ShardIndex.leaf_transform
for IO and decode operations. Refactor PhasedCodecPipeline.read_sync
and async read._process_chunk to use these generic functions instead
of delegating to layout methods.

Add is_sharded field to ShardIndex to distinguish "None = read full
blob" (simple layouts) from "None = absent inner chunk" (sharded
layouts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…index + pack_and_store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test now uses decode_chunks_from_index with ShardIndex.leaf_transform
instead of the deleted pipeline._transform_read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pack_blob

- Add leaf_transform property to ChunkLayout base class (returns
  inner_transform) and override on ShardedChunkLayout (traverses
  nested ShardingCodecs to find innermost codec chain)
- Fix write path complete-overwrite to use layout.leaf_transform
  instead of layout.inner_transform (was using wrong transform for
  nested sharding)
- Fix decode_chunks_from_index to use index.is_sharded instead of
  fragile shape-based is_simple heuristic
- Add _pack_nested to ShardedChunkLayout: groups flat leaf chunks
  by inner shard, packs each group into an inner shard blob, then
  packs into outer shard — produces correct nested shard structure
- Remove dead unpack_blob from all layout classes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…se codec chain directly

Remove ~840 lines of ChunkLayout hierarchy (ShardIndex, SimpleChunkLayout,
ShardedChunkLayout, fetch_chunks_sync/async, decode_chunks_from_index,
merge_and_encode_from_index). The pipeline now uses ChunkTransform directly
for sync decode/encode and falls back to the async codec API otherwise.

Also fix ShardingCodec._encode_sync to respect write_empty_chunks config
by skipping inner chunks that are all fill_value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_read_plan.py and test_write_plan.py (tested removed
  layout abstraction)
- Fix test_evolve_from_array_spec to check _sync_transform instead
  of layout
- Replace test_simple_layout_decode_skips_indexer with
  test_sync_transform_encode_decode_roundtrip
- Add n_workers parameter to read_sync/write_sync for thread-pool
  parallelism

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
read_sync and write_sync now support n_workers parameter. When > 0,
the decode (read) or decode+merge+encode (write) compute steps are
parallelized across threads via ThreadPoolExecutor.map. IO remains
sequential.

This helps when codecs release the GIL (gzip, blosc, zstd) — e.g.
gzip decompression is 41% of read time and runs entirely in C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>