`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking` is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
…into perf/prepared-write-v2
…into perf/prepared-write-v2
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3885 +/- ##
==========================================
- Coverage 93.07% 92.10% -0.98%
==========================================
Files 85 87 +2
Lines 11228 11801 +573
==========================================
+ Hits 10451 10869 +418
- Misses 777 932 +155
…into perf/prepared-write-v2
This was referenced Apr 8, 2026
@TomAugspurger how would this design work with CUDA codecs?
…rf/prepared-write-v2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thread Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace batch-and-wait with streaming: chunks flow through fetch → decode → scatter/store independently via asyncio.gather with per-chunk coroutines. No chunk waits for all others to finish a stage before advancing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
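The streaming shape described in this commit can be sketched as follows: each chunk runs fetch → decode → store as its own coroutine, and `asyncio.gather` lets chunks overlap rather than waiting for every chunk to finish a stage. The `fetch`/`decode`/`store` helpers here are stand-ins, not the PR's real functions:

```python
import asyncio

async def fetch(key: str) -> bytes:
    await asyncio.sleep(0)          # stand-in for storage IO
    return key.encode()

def decode(raw: bytes) -> str:
    return raw.decode().upper()     # stand-in for codec compute

async def store(out: dict, key: str, value: str) -> None:
    await asyncio.sleep(0)          # stand-in for scatter/store IO
    out[key] = value

async def process_chunk(out: dict, key: str) -> None:
    # One chunk's full pipeline; no chunk waits on any other chunk.
    raw = await fetch(key)          # IO
    value = decode(raw)             # compute
    await store(out, key, value)    # IO

async def read_all(keys: list[str]) -> dict:
    out: dict[str, str] = {}
    await asyncio.gather(*(process_chunk(out, k) for k in keys))
    return out

result = asyncio.run(read_all(["c/0/0", "c/0/1"]))
```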
When inner codecs are fixed-size and the store supports byte-range writes, write individual inner chunks directly via set_range instead of read-modify-write of the full shard blob. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
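The arithmetic that makes this work: when every inner chunk has a known fixed encoded size, chunk `i`'s byte offset in the shard blob is just `i * chunk_nbytes`, so one ranged write can replace a single inner chunk. A hedged sketch (the `set_range` store method referenced above is the PR's proposed protocol, not an existing zarr `Store` method):

```python
def inner_chunk_offset(flat_index: int, chunk_nbytes: int) -> tuple[int, int]:
    """Return (start, length) of a fixed-size inner chunk within a shard blob."""
    start = flat_index * chunk_nbytes
    return start, chunk_nbytes
```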
12 tasks covering: global thread pool, streaming read/write, SimpleChunkLayout fast path, ByteRangeSetter protocol, partial shard writes, and benchmark verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Skip BasicIndexer/ChunkGrid creation for non-sharded layouts by directly calling inner_transform.decode_chunk on the raw buffer. Adds test to verify the fast path produces correct output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the 3-phase batch approach (fetch ALL → compute ALL → store ALL) with a streaming pipeline where each chunk flows through fetch → compute → store independently via asyncio.gather, improving memory usage and latency by allowing IO and compute to overlap across chunks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…PhasedCodecPipeline All pipeline contract tests now run against both implementations: roundtrip, sharded roundtrip, partial writes, missing chunks, strided reads, multidimensional, and compression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pre-extracted decode function used the default layout's baked-in ArraySpec shape, which fails for rectilinear chunks where each chunk may have a different size. Pass chunk_shape explicitly when it differs from the default. Fixes doctest failure: "cannot reshape array of size 500 into shape (1,1)" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a 2d-rectilinear array config (chunks=[[10,20,30],[50,50]]) to the parametrized test matrix. This would have caught the decode_chunk reshape bug fixed in the previous commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ration
ChunkLayout now owns both the IO strategy (fetch_sync/fetch) and the
compute strategy (decode/encode). The pipeline's read_sync loop becomes
uniform:
raw = layout.fetch_sync(byte_getter, ...) # IO
decoded = layout.decode(raw, chunk_spec) # compute
out[out_selection] = decoded[chunk_selection] # scatter
SimpleChunkLayout.decode calls inner_transform.decode_chunk directly.
ShardedChunkLayout.decode chooses between vectorized numpy decode
(for dense fixed-size shards) and per-inner-chunk decode (general case)
internally — the pipeline doesn't need to know.
This eliminates the ad-hoc if/elif chain in read_sync that previously
handled 4+ different cases with duplicated logic.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests define the interface for declarative IO planning: given an array region and codec configuration, produce a ReadPlan data structure that fully describes what byte-range reads are needed. Covers: - Non-sharded: full reads, partial reads, single element, 2D - Sharded fixed-size: full shard, single inner chunk, two inner chunks - Sharded variable-size: compressed inner chunk (index read required) - Nested sharding: skipped (future) All tests currently fail with NotImplementedError — _plan_read is a stub defining the target interface we want to build toward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the existing RangeByteRequest from zarr.abc.store instead of a custom ByteRange dataclass. ShardIndex now maps coords to RangeByteRequest | None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the old fetch_sync/fetch/decode/encode methods on ChunkLayout with resolve_index, fetch_chunks, decode_chunks, merge_and_encode, store_chunks_sync, and store_chunks_async. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nkLayout, read_sync, write_sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhasedCodecPipeline reads the shard index + individual chunks separately, so partial reads issue more get calls than the old full-blob fetch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove old methods that are no longer called after the four-phase refactoring: _transform_read, _decode_shard, _decode_shard_vectorized, _encode_shard_vectorized, _transform_write, _transform_write_shard, _encode_per_chunk, _decode_vectorized, _encode_vectorized, _fetch_chunks, _fetch_chunks_sync, chunk_byte_offset, inner_chunk_byte_length. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers: no sharding, single-level sharding (with/without outer BB codecs), nested sharding, N-level nesting. For each: full read, partial read, full write, partial write. Documents optimal IO and compute sequence for each scenario. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ShardIndex carries leaf_transform. resolve_index is the only layout-specific method. fetch_chunks, decode_chunks, merge_and_encode are generic. Handles nested sharding by flattening index resolution and using the leaf codec chain for decode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tasks: add leaf_transform to ShardIndex, extract generic functions, refactor read/write paths, implement recursive resolve_index for nested sharding, remove dead code, update tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add module-level fetch_chunks_sync, fetch_chunks_async, and decode_chunks_from_index functions that use ShardIndex.leaf_transform for IO and decode operations. Refactor PhasedCodecPipeline.read_sync and async read._process_chunk to use these generic functions instead of delegating to layout methods. Add is_sharded field to ShardIndex to distinguish "None = read full blob" (simple layouts) from "None = absent inner chunk" (sharded layouts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…index + pack_and_store Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arding Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test now uses decode_chunks_from_index with ShardIndex.leaf_transform instead of the deleted pipeline._transform_read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pack_blob - Add leaf_transform property to ChunkLayout base class (returns inner_transform) and override on ShardedChunkLayout (traverses nested ShardingCodecs to find innermost codec chain) - Fix write path complete-overwrite to use layout.leaf_transform instead of layout.inner_transform (was using wrong transform for nested sharding) - Fix decode_chunks_from_index to use index.is_sharded instead of fragile shape-based is_simple heuristic - Add _pack_nested to ShardedChunkLayout: groups flat leaf chunks by inner shard, packs each group into an inner shard blob, then packs into outer shard — produces correct nested shard structure - Remove dead unpack_blob from all layout classes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…se codec chain directly Remove ~840 lines of ChunkLayout hierarchy (ShardIndex, SimpleChunkLayout, ShardedChunkLayout, fetch_chunks_sync/async, decode_chunks_from_index, merge_and_encode_from_index). The pipeline now uses ChunkTransform directly for sync decode/encode and falls back to the async codec API otherwise. Also fix ShardingCodec._encode_sync to respect write_empty_chunks config by skipping inner chunks that are all fill_value. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_read_plan.py and test_write_plan.py (tested removed layout abstraction) - Fix test_evolve_from_array_spec to check _sync_transform instead of layout - Replace test_simple_layout_decode_skips_indexer with test_sync_transform_encode_decode_roundtrip - Add n_workers parameter to read_sync/write_sync for thread-pool parallelism Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
read_sync and write_sync now support n_workers parameter. When > 0, the decode (read) or decode+merge+encode (write) compute steps are parallelized across threads via ThreadPoolExecutor.map. IO remains sequential. This helps when codecs release the GIL (gzip, blosc, zstd) — e.g. gzip decompression is 41% of read time and runs entirely in C. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
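A sketch of the `n_workers` pattern just described: IO gathers raw chunk payloads sequentially, then decode fans out across a thread pool via `ThreadPoolExecutor.map`. Using `zlib` as the stand-in codec, since it releases the GIL during decompression; the function names are illustrative:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def decode(raw: bytes) -> bytes:
    # Stand-in compute step; zlib releases the GIL while decompressing.
    return zlib.decompress(raw)

def read_chunks(raws: list[bytes], n_workers: int = 0) -> list[bytes]:
    if n_workers > 0:
        # Parallel decode across threads; order of results is preserved.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(decode, raws))
    # n_workers == 0: plain sequential decode.
    return [decode(r) for r in raws]

raws = [zlib.compress(bytes([i]) * 100) for i in range(4)]
assert read_chunks(raws, n_workers=2) == read_chunks(raws, n_workers=0)
```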
…rf/prepared-write-v2
This PR defines a new codec pipeline class called `PhasedCodecPipeline` that enables much higher performance for chunk encoding and decoding than the current `BatchedCodecPipeline`. The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:
fetch exactly what we need from storage, given the codecs we have. So if there's a sharding codec in the first array -> bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.
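The phase separation described above can be schematized as three steps: plan IO from codec metadata, perform only IO, then perform only compute. This is an illustrative shape, not the PR's actual code; `plan_io`, `fetch`, and `decode_chain` are hypothetical callables:

```python
def phased_read(plan_io, fetch, decode_chain, selections):
    # Phase 1: the first array -> bytes codec turns the requested
    # selections into concrete byte-range requests (no IO yet).
    requests = plan_io(selections)
    # Phase 2: pure IO — fetch exactly the planned ranges.
    raw = {key: fetch(key, rng) for key, rng in requests.items()}
    # Phase 3: pure compute — the remaining codecs decode fetched bytes.
    return {key: decode_chain(buf) for key, buf in raw.items()}
```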
Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.
I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than the `BatchedCodecPipeline` for a lot of workloads. I can include some of those benchmarks later.
This was mostly written by Claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.
Edit: this PR depends on changes submitted in #3907 and #3908