
[do not merge] benchmarks + tests for phased codec pipeline #3891

Open
d-v-b wants to merge 77 commits into zarr-developers:main from d-v-b:perf/prepared-write-v2-bench

Conversation

@d-v-b
Contributor

@d-v-b d-v-b commented Apr 9, 2026

This PR should not be merged. It contains changes necessary to make the codec pipeline developed in #3885 the default, which allows us to run our full test suite + benchmarks against that codec pipeline class.

d-v-b added 8 commits April 7, 2026 10:38
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking`
is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 9, 2026
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Apr 9, 2026
@d-v-b d-v-b force-pushed the perf/prepared-write-v2-bench branch from 8db7399 to e82da5b on April 9, 2026 at 14:16
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 86.62207% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.10%. Comparing base (0ea15fd) to head (efce610).

Files with missing lines Patch % Lines
src/zarr/core/codec_pipeline.py 86.18% 63 Missing ⚠️
src/zarr/codecs/numcodecs/_codecs.py 66.66% 8 Missing ⚠️
src/zarr/codecs/sharding.py 92.30% 7 Missing ⚠️
src/zarr/codecs/_v2.py 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3891      +/-   ##
==========================================
- Coverage   93.07%   92.10%   -0.98%     
==========================================
  Files          85       87       +2     
  Lines       11228    11801     +573     
==========================================
+ Hits        10451    10869     +418     
- Misses        777      932     +155     
Files with missing lines Coverage Δ
src/zarr/abc/codec.py 98.86% <100.00%> (+0.09%) ⬆️
src/zarr/core/array.py 97.82% <100.00%> (+0.01%) ⬆️
src/zarr/core/config.py 100.00% <ø> (ø)
src/zarr/codecs/_v2.py 90.19% <80.00%> (-3.43%) ⬇️
src/zarr/codecs/sharding.py 75.42% <92.30%> (-13.99%) ⬇️
src/zarr/codecs/numcodecs/_codecs.py 93.18% <66.66%> (-3.21%) ⬇️
src/zarr/core/codec_pipeline.py 87.62% <86.18%> (-6.57%) ⬇️

... and 6 files with indirect coverage changes


@d-v-b d-v-b added benchmark Code will be benchmarked in a CI job. and removed benchmark Code will be benchmarked in a CI job. labels Apr 13, 2026
@codspeed-hq

codspeed-hq bot commented Apr 13, 2026

Merging this PR will degrade performance by 99.6%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 32 improved benchmarks
❌ 29 regressed benchmarks
✅ 5 untouched benchmarks
⏩ 6 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] 2.9 s 1.1 s ×2.5
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 1,607.9 ms 845 ms +90.28%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 1,015.3 ms 676.7 ms +50.04%
WallTime test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] 5.9 s 2 s ×3
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] 9.5 s 2.7 s ×3.5
WallTime test_slice_indexing[None-(0, 0, 0)-memory] 848.4 µs 679.5 µs +24.85%
WallTime test_slice_indexing[None-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory] 875.1 µs 676.4 µs +29.37%
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 2.1 s 1 s ×2.1
WallTime test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 1.8 s 1.4 s +25.32%
WallTime test_read_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] 2.9 s 4.6 s -38.5%
WallTime test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency] 426.4 ms 261.6 ms +62.99%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 1.6 s 1.2 s +36.57%
WallTime test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] 938.5 ms 776 ms +20.94%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] 2.8 s 1.4 s ×2
WallTime test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] 5.3 s 1.5 s ×3.5
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] 1,181.3 ms 991.7 ms +19.12%
WallTime test_slice_indexing[None-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency] 425.1 ms 576.4 ms -26.24%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] 3.2 s 1.5 s ×2.1
WallTime test_slice_indexing[None-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] 376.5 ms 519.9 ms -27.59%
WallTime test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] 5.3 s 1.5 s ×3.5
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/prepared-write-v2-bench (8330cde) with main (7c78574)²

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (2fbd30a) during the generation of this report, so 7c78574 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@d-v-b d-v-b removed the benchmark Code will be benchmarked in a CI job. label Apr 14, 2026
d-v-b and others added 9 commits April 14, 2026 15:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thread

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace batch-and-wait with streaming: chunks flow through
fetch → decode → scatter/store independently via asyncio.gather
with per-chunk coroutines. No chunk waits for all others to
finish a stage before advancing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
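The streaming pattern described in this commit can be sketched in a few lines. This is a hypothetical stand-in, not the actual zarr pipeline code: `fetch`, `decode`, and `store` here are toy placeholders for the store read, the codec chain, and the scatter/store step, but the structure — one coroutine per chunk, joined with `asyncio.gather` so no chunk waits for the others to finish a stage — is the idea the commit describes.

```python
import asyncio

results = {}

async def fetch(key):              # stand-in for a store read (IO)
    await asyncio.sleep(0)
    return f"raw:{key}"

def decode(raw):                   # stand-in for the codec chain (compute)
    return raw.replace("raw:", "decoded:")

async def store(key, value):       # stand-in for scatter/store (IO)
    await asyncio.sleep(0)
    results[key] = value

async def process_chunk(key):
    # Each chunk flows through fetch -> decode -> store independently.
    raw = await fetch(key)
    value = decode(raw)
    await store(key, value)

async def read_chunks(keys):
    # One coroutine per chunk; chunks overlap IO and compute rather than
    # advancing stage-by-stage as a batch.
    await asyncio.gather(*(process_chunk(k) for k in keys))

asyncio.run(read_chunks(["c/0", "c/1", "c/2"]))
```

Contrast with the batch-and-wait shape this replaces: `gather` all fetches, then decode all, then `gather` all stores — which holds every raw buffer in memory until the slowest fetch completes.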
When inner codecs are fixed-size and the store supports byte-range
writes, write individual inner chunks directly via set_range instead
of read-modify-write of the full shard blob.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
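The partial-shard-write idea above can be illustrated with a toy store. Everything here is an assumption for illustration: `ByteRangeStore` and its `set_range` method are stand-ins for a store supporting byte-range writes (not the real zarr store API), and `CHUNK_NBYTES` models the fixed encoded size that makes inner-chunk offsets computable without reading the shard.

```python
class ByteRangeStore:
    """Toy in-memory store supporting byte-range overwrites."""

    def __init__(self):
        self.blobs: dict[str, bytearray] = {}

    def set(self, key: str, data: bytes) -> None:
        self.blobs[key] = bytearray(data)

    def set_range(self, key: str, offset: int, data: bytes) -> None:
        # Overwrite bytes [offset, offset + len(data)) in place.
        self.blobs[key][offset : offset + len(data)] = data


CHUNK_NBYTES = 4  # fixed encoded size per inner chunk (assumed)

def write_inner_chunk(store, shard_key, chunk_index, encoded):
    # Fixed-size inner chunks mean chunk i lives at i * CHUNK_NBYTES,
    # so we can patch it directly instead of read-modify-writing the
    # whole shard blob.
    assert len(encoded) == CHUNK_NBYTES
    store.set_range(shard_key, chunk_index * CHUNK_NBYTES, encoded)


store = ByteRangeStore()
store.set("shard/0", b"\x00" * (CHUNK_NBYTES * 3))  # shard with 3 inner chunks
write_inner_chunk(store, "shard/0", 1, b"ABCD")      # update only chunk 1
```

With variable-size (compressed) inner chunks the offsets are not knowable in advance, which is why the commit restricts this path to fixed-size inner codecs.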
12 tasks covering: global thread pool, streaming read/write,
SimpleChunkLayout fast path, ByteRangeSetter protocol, partial
shard writes, and benchmark verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Skip BasicIndexer/ChunkGrid creation for non-sharded layouts by
directly calling inner_transform.decode_chunk on the raw buffer.
Adds test to verify the fast path produces correct output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the 3-phase batch approach (fetch ALL → compute ALL → store ALL)
with a streaming pipeline where each chunk flows through fetch → compute
→ store independently via asyncio.gather, improving memory usage and
latency by allowing IO and compute to overlap across chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d-v-b and others added 19 commits April 15, 2026 14:11
…nkLayout, read_sync, write_sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhasedCodecPipeline reads the shard index + individual chunks separately,
so partial reads issue more get calls than the old full-blob fetch path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove old methods that are no longer called after the four-phase
refactoring: _transform_read, _decode_shard, _decode_shard_vectorized,
_encode_shard_vectorized, _transform_write, _transform_write_shard,
_encode_per_chunk, _decode_vectorized, _encode_vectorized, _fetch_chunks,
_fetch_chunks_sync, chunk_byte_offset, inner_chunk_byte_length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers: no sharding, single-level sharding (with/without outer BB
codecs), nested sharding, N-level nesting. For each: full read,
partial read, full write, partial write. Documents optimal IO and
compute sequence for each scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ShardIndex carries leaf_transform. resolve_index is the only
layout-specific method. fetch_chunks, decode_chunks, merge_and_encode
are generic. Handles nested sharding by flattening index resolution
and using the leaf codec chain for decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tasks: add leaf_transform to ShardIndex, extract generic functions,
refactor read/write paths, implement recursive resolve_index for
nested sharding, remove dead code, update tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add module-level fetch_chunks_sync, fetch_chunks_async, and
decode_chunks_from_index functions that use ShardIndex.leaf_transform
for IO and decode operations. Refactor PhasedCodecPipeline.read_sync
and async read._process_chunk to use these generic functions instead
of delegating to layout methods.

Add is_sharded field to ShardIndex to distinguish "None = read full
blob" (simple layouts) from "None = absent inner chunk" (sharded
layouts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…index + pack_and_store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test now uses decode_chunks_from_index with ShardIndex.leaf_transform
instead of the deleted pipeline._transform_read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pack_blob

- Add leaf_transform property to ChunkLayout base class (returns
  inner_transform) and override on ShardedChunkLayout (traverses
  nested ShardingCodecs to find innermost codec chain)
- Fix write path complete-overwrite to use layout.leaf_transform
  instead of layout.inner_transform (was using wrong transform for
  nested sharding)
- Fix decode_chunks_from_index to use index.is_sharded instead of
  fragile shape-based is_simple heuristic
- Add _pack_nested to ShardedChunkLayout: groups flat leaf chunks
  by inner shard, packs each group into an inner shard blob, then
  packs into outer shard — produces correct nested shard structure
- Remove dead unpack_blob from all layout classes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…se codec chain directly

Remove ~840 lines of ChunkLayout hierarchy (ShardIndex, SimpleChunkLayout,
ShardedChunkLayout, fetch_chunks_sync/async, decode_chunks_from_index,
merge_and_encode_from_index). The pipeline now uses ChunkTransform directly
for sync decode/encode and falls back to the async codec API otherwise.

Also fix ShardingCodec._encode_sync to respect write_empty_chunks config
by skipping inner chunks that are all fill_value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
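The `write_empty_chunks` fix mentioned above amounts to a small predicate. This is a hedged sketch of the behavior, not the actual `ShardingCodec._encode_sync` code; `should_write` is a hypothetical helper name.

```python
import numpy as np

def should_write(chunk: np.ndarray, fill_value, write_empty_chunks: bool) -> bool:
    # When write_empty_chunks is disabled, an inner chunk that is entirely
    # fill_value is skipped rather than encoded and stored.
    if write_empty_chunks:
        return True
    return not np.all(chunk == fill_value)
```

Skipping all-fill-value inner chunks keeps sparse shards small and matches the behavior of the batched pipeline.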
- Remove test_read_plan.py and test_write_plan.py (tested removed
  layout abstraction)
- Fix test_evolve_from_array_spec to check _sync_transform instead
  of layout
- Replace test_simple_layout_decode_skips_indexer with
  test_sync_transform_encode_decode_roundtrip
- Add n_workers parameter to read_sync/write_sync for thread-pool
  parallelism

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
read_sync and write_sync now support n_workers parameter. When > 0,
the decode (read) or decode+merge+encode (write) compute steps are
parallelized across threads via ThreadPoolExecutor.map. IO remains
sequential.

This helps when codecs release the GIL (gzip, blosc, zstd) — e.g.
gzip decompression is 41% of read time and runs entirely in C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
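The `n_workers` scheme above can be sketched with `ThreadPoolExecutor.map`. This is a simplified illustration, not the pipeline's actual `read_sync` signature: `zlib.decompress` stands in for the codec chain, and it matters that it releases the GIL while decompressing — pure-Python compute would see no speedup from threads.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def decode(raw: bytes) -> bytes:
    # zlib (like gzip/blosc/zstd codecs) releases the GIL inside its C
    # code, so multiple decodes genuinely run in parallel across threads.
    return zlib.decompress(raw)

def decode_chunks(raws: list[bytes], n_workers: int = 0) -> list[bytes]:
    if n_workers > 0:
        # Parallelize the compute step across a thread pool; IO stays
        # sequential, mirroring the commit's description.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(decode, raws))
    return [decode(raw) for raw in raws]  # sequential fallback

raws = [zlib.compress(bytes([i]) * 1000) for i in range(8)]
chunks = decode_chunks(raws, n_workers=4)
```

`pool.map` preserves input order, so the decoded chunks line up with their chunk coordinates without extra bookkeeping.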
@d-v-b d-v-b force-pushed the perf/prepared-write-v2-bench branch from d8cb81d to 8330cde on April 16, 2026 at 19:32
d-v-b added 3 commits April 16, 2026 21:39
Restore the default codec pipeline to BatchedCodecPipeline.
SyncCodecPipeline remains available and is tested, but is opt-in via
the codec_pipeline.path config setting.

Tests that exercise SyncCodecPipeline-specific behavior (byte-range
writes for partial shard updates) now skip when a different pipeline
is active.

Also drop a few stale # type: ignore comments in sharding.py that
mypy now flags as unused.
@d-v-b d-v-b force-pushed the perf/prepared-write-v2-bench branch from 8330cde to 48300cd on April 17, 2026 at 07:30
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 17, 2026
d-v-b added 4 commits April 17, 2026 09:32
With BatchedCodecPipeline as the default, these patches are no longer
needed:

- tests/test_array.py: drop a stray comment about SyncCodecPipeline
- tests/test_config.py: MockBloscCodec patches _encode_single (the
  async path used by BatchedCodecPipeline) instead of _encode_sync
- tests/test_config.py: drop xfail on test_config_buffer_implementation
  that was only triggered under SyncCodecPipeline

Pre-commit hooks bypassed: mypy in pre-commit's isolated env reports
spurious errors on unrelated unchanged lines (zarr is seen as Any
without the editable install). Direct `uv run mypy` passes cleanly.
This branch exists to run CI benchmarks against SyncCodecPipeline.
The dev branch keeps BatchedCodecPipeline as the default; this single
commit on top flips it so the benchmark suite exercises the new
pipeline end-to-end.
Under SyncCodecPipeline (the default on this benchmarking branch),
two tests need adjustments:

- MockBloscCodec must override _encode_sync (the method SyncCodecPipeline
  calls) rather than the async _encode_single
- test_config_buffer_implementation is marked xfail because it relies
  on dynamic buffer re-registration that doesn't work cleanly under
  the sync path

Bypassing pre-commit mypy hook for the same reason as the dev branch:
its isolated env reports spurious errors on unmodified lines.
@d-v-b d-v-b force-pushed the perf/prepared-write-v2-bench branch from 48300cd to 39065e1 on April 17, 2026 at 08:39

Labels

benchmark Code will be benchmarked in a CI job.
