perf: phased codec pipeline #3885

Open
d-v-b wants to merge 71 commits into zarr-developers:main from d-v-b:perf/prepared-write-v2

Conversation


@d-v-b d-v-b commented Apr 8, 2026

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

  1. Preparatory IO, async: fetch exactly what we need from storage, given the codecs we have. For example, if there's a sharding codec in the first array -> bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
  2. Pure compute, sync: apply filters and compressors. Safe to parallelize over chunks.
  3. (If writing) final IO, async: reconcile the in-memory compressed chunks against our model of the stored chunk, then write out the bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
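The phase separation above can be sketched roughly like this. All names here (`MemoryStore`, `phased_read`, `decode`) are illustrative stand-ins, not the PR's actual API:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory store standing in for real async storage IO.
class MemoryStore:
    def __init__(self, data):
        self._data = data

    async def get(self, key):
        return self._data[key]

def decode(raw: bytes) -> bytes:
    # Stand-in for the filter/compressor chain: pure, synchronous compute.
    return bytes(reversed(raw))

async def phased_read(store, keys):
    # Phase 1: preparatory IO (async) -- fetch exactly the bytes we need.
    raw = await asyncio.gather(*(store.get(k) for k in keys))
    # Phase 2: pure compute (sync) -- decode chunks, parallelized over a pool.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode, raw))

store = MemoryStore({"c/0": b"abc", "c/1": b"xyz"})
result = asyncio.run(phased_read(store, ["c/0", "c/1"]))
# result == [b"cba", b"zyx"]
```

The point of the split is that phase 2 touches no IO at all, so it can be handed to a thread pool without worrying about the event loop.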

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than the BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

d-v-b added 4 commits April 7, 2026 10:38
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking`
is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
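A rough sketch of what such a protocol might look like. The field names and the `pack` signature below are guesses for illustration only, not the PR's actual definitions:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class PreparedWrite:
    # Hypothetical fields: where the new bytes land within the stored chunk.
    offset: int
    data: bytes

@runtime_checkable
class SupportsChunkPacking(Protocol):
    # An array -> bytes codec that can merge prepared writes into an
    # existing chunk blob without a full re-encode.
    def pack(self, existing: bytearray, writes: list["PreparedWrite"]) -> bytes: ...

class RawCodec:
    def pack(self, existing: bytearray, writes: list[PreparedWrite]) -> bytes:
        for w in writes:
            existing[w.offset : w.offset + len(w.data)] = w.data
        return bytes(existing)

codec = RawCodec()
assert isinstance(codec, SupportsChunkPacking)
packed = codec.pack(bytearray(b"....."), [PreparedWrite(1, b"abc")])
# packed == b".abc."
```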
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 8, 2026

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 86.62207% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.10%. Comparing base (0ea15fd) to head (5d3064e).

Files with missing lines Patch % Lines
src/zarr/core/codec_pipeline.py 86.18% 63 Missing ⚠️
src/zarr/codecs/numcodecs/_codecs.py 66.66% 8 Missing ⚠️
src/zarr/codecs/sharding.py 92.30% 7 Missing ⚠️
src/zarr/codecs/_v2.py 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.07%   92.10%   -0.98%     
==========================================
  Files          85       87       +2     
  Lines       11228    11801     +573     
==========================================
+ Hits        10451    10869     +418     
- Misses        777      932     +155     
Files with missing lines Coverage Δ
src/zarr/abc/codec.py 98.86% <100.00%> (+0.09%) ⬆️
src/zarr/core/array.py 97.82% <100.00%> (+0.01%) ⬆️
src/zarr/core/config.py 100.00% <ø> (ø)
src/zarr/codecs/_v2.py 90.19% <80.00%> (-3.43%) ⬇️
src/zarr/codecs/sharding.py 75.42% <92.30%> (-13.99%) ⬇️
src/zarr/codecs/numcodecs/_codecs.py 93.18% <66.66%> (-3.21%) ⬇️
src/zarr/core/codec_pipeline.py 87.62% <86.18%> (-6.57%) ⬇️

... and 6 files with indirect coverage changes



d-v-b commented Apr 9, 2026

@TomAugspurger how would this design work with CUDA codecs?

d-v-b and others added 14 commits April 9, 2026 15:28
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thread

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace batch-and-wait with streaming: chunks flow through
fetch → decode → scatter/store independently via asyncio.gather
with per-chunk coroutines. No chunk waits for all others to
finish a stage before advancing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When inner codecs are fixed-size and the store supports byte-range
writes, write individual inner chunks directly via set_range instead
of read-modify-write of the full shard blob.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12 tasks covering: global thread pool, streaming read/write,
SimpleChunkLayout fast path, ByteRangeSetter protocol, partial
shard writes, and benchmark verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Skip BasicIndexer/ChunkGrid creation for non-sharded layouts by
directly calling inner_transform.decode_chunk on the raw buffer.
Adds test to verify the fast path produces correct output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the 3-phase batch approach (fetch ALL → compute ALL → store ALL)
with a streaming pipeline where each chunk flows through fetch → compute
→ store independently via asyncio.gather, improving memory usage and
latency by allowing IO and compute to overlap across chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
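The contrast with batch-and-wait can be sketched as follows, with trivial stand-in `fetch`/`decode`/`store_chunk` functions (illustrative only):

```python
import asyncio

async def fetch(key):
    # Stand-in for async storage IO.
    await asyncio.sleep(0)
    return key.encode()

def decode(raw):
    # Stand-in for sync compute.
    return raw.upper()

async def store_chunk(out, key, value):
    out[key] = value

async def process_chunk(out, key):
    # Each chunk flows through fetch -> decode -> store independently;
    # no chunk waits for the others to finish a stage.
    raw = await fetch(key)
    decoded = decode(raw)
    await store_chunk(out, key, decoded)

async def streaming_read(keys):
    out = {}
    await asyncio.gather(*(process_chunk(out, k) for k in keys))
    return out

out = asyncio.run(streaming_read(["a", "b"]))
# out == {"a": b"A", "b": b"B"}
```

With per-chunk coroutines, the IO for one chunk can overlap with the compute for another, which is exactly the memory and latency win the commit message describes.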
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 on April 15, 2026 09:51
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
…PhasedCodecPipeline

All pipeline contract tests now run against both implementations:
roundtrip, sharded roundtrip, partial writes, missing chunks,
strided reads, multidimensional, and compression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from b67a5a0 to 4746601 on April 15, 2026 09:57
d-v-b and others added 26 commits April 15, 2026 12:02
The pre-extracted decode function used the default layout's baked-in
ArraySpec shape, which fails for rectilinear chunks where each chunk
may have a different size. Pass chunk_shape explicitly when it differs
from the default.

Fixes doctest failure: "cannot reshape array of size 500 into shape (1,1)"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a 2d-rectilinear array config (chunks=[[10,20,30],[50,50]]) to
the parametrized test matrix. This would have caught the decode_chunk
reshape bug fixed in the previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ration

ChunkLayout now owns both the IO strategy (fetch_sync/fetch) and the
compute strategy (decode/encode). The pipeline's read_sync loop becomes
uniform:

    raw = layout.fetch_sync(byte_getter, ...)   # IO
    decoded = layout.decode(raw, chunk_spec)     # compute
    out[out_selection] = decoded[chunk_selection] # scatter

SimpleChunkLayout.decode calls inner_transform.decode_chunk directly.
ShardedChunkLayout.decode chooses between vectorized numpy decode
(for dense fixed-size shards) and per-inner-chunk decode (general case)
internally — the pipeline doesn't need to know.

This eliminates the ad-hoc if/elif chain in read_sync that previously
handled 4+ different cases with duplicated logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests define the interface for declarative IO planning: given an array
region and codec configuration, produce a ReadPlan data structure that
fully describes what byte-range reads are needed.

Covers:
- Non-sharded: full reads, partial reads, single element, 2D
- Sharded fixed-size: full shard, single inner chunk, two inner chunks
- Sharded variable-size: compressed inner chunk (index read required)
- Nested sharding: skipped (future)

All tests currently fail with NotImplementedError — _plan_read is a
stub defining the target interface we want to build toward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the existing RangeByteRequest from zarr.abc.store instead of a
custom ByteRange dataclass. ShardIndex now maps coords to
RangeByteRequest | None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the old fetch_sync/fetch/decode/encode methods on ChunkLayout
with resolve_index, fetch_chunks, decode_chunks, merge_and_encode,
store_chunks_sync, and store_chunks_async.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nkLayout, read_sync, write_sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhasedCodecPipeline reads the shard index + individual chunks separately,
so partial reads issue more get calls than the old full-blob fetch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove old methods that are no longer called after the four-phase
refactoring: _transform_read, _decode_shard, _decode_shard_vectorized,
_encode_shard_vectorized, _transform_write, _transform_write_shard,
_encode_per_chunk, _decode_vectorized, _encode_vectorized, _fetch_chunks,
_fetch_chunks_sync, chunk_byte_offset, inner_chunk_byte_length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers: no sharding, single-level sharding (with/without outer BB
codecs), nested sharding, N-level nesting. For each: full read,
partial read, full write, partial write. Documents optimal IO and
compute sequence for each scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ShardIndex carries leaf_transform. resolve_index is the only
layout-specific method. fetch_chunks, decode_chunks, merge_and_encode
are generic. Handles nested sharding by flattening index resolution
and using the leaf codec chain for decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tasks: add leaf_transform to ShardIndex, extract generic functions,
refactor read/write paths, implement recursive resolve_index for
nested sharding, remove dead code, update tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add module-level fetch_chunks_sync, fetch_chunks_async, and
decode_chunks_from_index functions that use ShardIndex.leaf_transform
for IO and decode operations. Refactor PhasedCodecPipeline.read_sync
and async read._process_chunk to use these generic functions instead
of delegating to layout methods.

Add is_sharded field to ShardIndex to distinguish "None = read full
blob" (simple layouts) from "None = absent inner chunk" (sharded
layouts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…index + pack_and_store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test now uses decode_chunks_from_index with ShardIndex.leaf_transform
instead of the deleted pipeline._transform_read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pack_blob

- Add leaf_transform property to ChunkLayout base class (returns
  inner_transform) and override on ShardedChunkLayout (traverses
  nested ShardingCodecs to find innermost codec chain)
- Fix write path complete-overwrite to use layout.leaf_transform
  instead of layout.inner_transform (was using wrong transform for
  nested sharding)
- Fix decode_chunks_from_index to use index.is_sharded instead of
  fragile shape-based is_simple heuristic
- Add _pack_nested to ShardedChunkLayout: groups flat leaf chunks
  by inner shard, packs each group into an inner shard blob, then
  packs into outer shard — produces correct nested shard structure
- Remove dead unpack_blob from all layout classes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…se codec chain directly

Remove ~840 lines of ChunkLayout hierarchy (ShardIndex, SimpleChunkLayout,
ShardedChunkLayout, fetch_chunks_sync/async, decode_chunks_from_index,
merge_and_encode_from_index). The pipeline now uses ChunkTransform directly
for sync decode/encode and falls back to the async codec API otherwise.

Also fix ShardingCodec._encode_sync to respect write_empty_chunks config
by skipping inner chunks that are all fill_value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_read_plan.py and test_write_plan.py (tested removed
  layout abstraction)
- Fix test_evolve_from_array_spec to check _sync_transform instead
  of layout
- Replace test_simple_layout_decode_skips_indexer with
  test_sync_transform_encode_decode_roundtrip
- Add n_workers parameter to read_sync/write_sync for thread-pool
  parallelism

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
read_sync and write_sync now support n_workers parameter. When > 0,
the decode (read) or decode+merge+encode (write) compute steps are
parallelized across threads via ThreadPoolExecutor.map. IO remains
sequential.

This helps when codecs release the GIL (gzip, blosc, zstd) — e.g.
gzip decompression is 41% of read time and runs entirely in C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>