PERF: speed up HDF5 read/append/select_as_multiple #65607
Open
jbrockmendel wants to merge 2 commits into
pandas-devGH-26771, pandas-devGH-47726)

* read_hdf fixed format: drop the redundant .copy() after concat. Block memory layout is already preserved by np.asfortranarray in the per-block construction (pandas-devGH-60469).
* HDFStore.select_as_multiple: skip the coordinate-based read when there is no where clause, since every row is selected anyway and a sequential read is much faster than a coordinate-based one.
* DataCol/IndexCol set_attr: skip re-writing HDF5 attributes whose value already matches what's on disk. PyTables deletes and recreates each attribute per setattr, so on wide-table appends this previously dominated runtime (~20ms per attribute × 3 attributes × N data columns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
closes #25839
closes #26771
closes #47726
Three low-effort perf wins for HDF5 IO, each tied to a long-standing performance issue.
1. `read_hdf` fixed format (GH-47726)

`BlockManagerFixed.read` did `concat(dfs, axis=1).copy()`. The trailing `.copy()` was originally added to force a column-major layout after `concat` consolidated blocks under CoW. Since GH-60469 each per-block array passed to the `DataFrame` constructor is already F-ordered via `np.asfortranarray`, so the post-concat copy is now pure overhead.

Removing it preserves the block memory layout that `test_hdfstore_strides` exercises (which was the blocker the last time this was attempted) and saves a full-frame copy on every fixed-format read.

2. `HDFStore.select_as_multiple` (GH-26771)

`TableIterator.get_result(coordinates=True)` always materialized all coordinates from the selector table, even when `where=None`. PyTables then performs a coordinate-based read of each table, which is much slower than a sequential read. When there is no `where` clause every row is selected anyway, so we now skip the coordinate computation.

Benchmark on a 500k×8 frame stored as two halves:

[Benchmark table comparing `select` + `concat` with `select_as_multiple`; timings not recoverable]

3. `to_hdf(..., append=True, data_columns=True, index=False)` (GH-25839)

Every `set_attr` call on a column does three `setattr` calls on the HDF5 table's attribute set. PyTables implements attribute updates by deleting and recreating the attribute on disk, at roughly 20ms each, so wide-table appends spent essentially all their time re-writing identical column metadata. With 100 data columns this dominated runtime.

`validate_attr`/`validate_metadata` already enforce that the values we'd be writing match what's on disk during an append, so a new `_set_attr_if_changed` helper checks the current value first and skips the write when nothing has changed.

On the GH-25839 reproducer (300 × 100 mixed-dtype frame) this gives a ~200× speedup, which scales with the number of data columns.
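For item 1, the layout argument can be illustrated with NumPy alone. This is a minimal sketch, not the pandas code path: `block_a` and `block_b` are hypothetical stand-ins for per-block arrays, and the shapes are made up for illustration. It shows that an array produced by `np.asfortranarray` is already column-major, so a further copy allocates new memory without changing the layout.

```python
import numpy as np

# Hypothetical per-block arrays (2 columns x 3 rows each), made F-ordered
# up front, as the PR description says the per-block construction does.
block_a = np.asfortranarray(np.arange(6, dtype="float64").reshape(2, 3))
block_b = np.asfortranarray(np.arange(6, 12, dtype="float64").reshape(2, 3))

# Both blocks are already column-major (Fortran-contiguous).
layouts = [blk.flags.f_contiguous for blk in (block_a, block_b)]

# An extra copy keeps exactly the same strides but owns fresh memory,
# i.e. it costs an allocation and buys no layout change.
redundant = block_a.copy(order="F")
same_strides = redundant.strides == block_a.strides
shares_memory = np.shares_memory(redundant, block_a)

print(layouts, same_strides, shares_memory)
```

The real change concerns a `DataFrame`-level `.copy()` after `concat`, which this toy example does not reproduce; it only demonstrates the underlying NumPy layout property.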
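For item 2, the control flow can be sketched with a toy in-memory table. Everything here is an assumption for illustration: `ToyTable` and `select_rows` are hypothetical names, not pandas or PyTables objects, and the two read methods only mimic the distinction between a sequential read and a coordinate-based read.

```python
class ToyTable:
    """Hypothetical in-memory stand-in for an on-disk table."""

    def __init__(self, rows):
        self.rows = list(rows)

    def read(self):
        # Sequential read: return every row in storage order.
        return list(self.rows)

    def read_coordinates(self, coords):
        # Coordinate-based read: fetch rows one index at a time.
        return [self.rows[i] for i in coords]


def select_rows(table, where=None):
    """Skip the coordinate computation when no filter is given."""
    if where is None:
        # Every row is selected anyway, so read sequentially.
        return table.read()
    coords = [i for i, row in enumerate(table.rows) if where(row)]
    return table.read_coordinates(coords)


table = ToyTable(range(5))
all_rows = select_rows(table)
even_rows = select_rows(table, where=lambda row: row % 2 == 0)
```

With no filter, the result is identical to reading every coordinate explicitly, which is why the shortcut is safe in this sketch.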
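For item 3, the skip-if-unchanged idea can be sketched as follows. This is not the pandas helper itself: `CountingAttrs` and `set_attr_if_changed_sketch` are hypothetical names, the attribute set is a plain Python object that only counts writes, and the comparison assumes simple scalar or string values (array-valued attributes would need a different equality check).

```python
class CountingAttrs:
    """Hypothetical attribute set that counts how often it is written."""

    def __init__(self):
        object.__setattr__(self, "writes", 0)

    def __setattr__(self, name, value):
        # Each write stands in for an expensive delete-and-recreate on disk.
        object.__setattr__(self, "writes", self.writes + 1)
        object.__setattr__(self, name, value)


_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def set_attr_if_changed_sketch(attrs, name, value):
    """Write the attribute only if it is absent or differs; return True if written."""
    current = getattr(attrs, name, _MISSING)
    if current is not _MISSING and current == value:
        return False
    setattr(attrs, name, value)
    return True


attrs = CountingAttrs()
metadata = (("kind", "float64"), ("meta", None), ("dtype", "float64"))

# Simulate three appends that each try to write the same three attributes.
for _ in range(3):
    for name, value in metadata:
        set_attr_if_changed_sketch(attrs, name, value)

print(attrs.writes)
```

Only the first pass writes anything; the two later passes find matching values and skip the write, which mirrors why repeated appends stop paying the per-attribute cost.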
Test plan
* `pytest pandas/tests/io/pytables/`: 587 passed, 1 skipped
* `test_hdfstore_strides` still passes (memory layout preserved)

🤖 Generated with Claude Code