PERF: speed up HDF5 read/append/select_as_multiple by jbrockmendel · Pull Request #65607 · pandas-dev/pandas

jbrockmendel · 2026-05-11T21:03:14Z

closes #25839
closes #26771
closes #47726

Three low-effort perf wins for HDF5 IO, each tied to a long-standing performance issue.

1. `read_hdf` fixed format (GH-47726)

`BlockManagerFixed.read` called `concat(dfs, axis=1).copy()`. The trailing `.copy()` was originally added to force a column-major layout after `concat` consolidated blocks under CoW. Since GH-60469, each per-block array passed to the `DataFrame` constructor is already F-ordered via `np.asfortranarray`, so the post-concat copy is now pure overhead.

Removing it preserves the block memory layout that test_hdfstore_strides exercises (which was the blocker the last time this was attempted) and saves a full-frame copy on every fixed-format read.
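The layout claim can be checked in isolation with NumPy alone. The sketch below is not the pandas code path; the array shape and variable names are illustrative assumptions. It only shows that `np.asfortranarray` already yields a column-major array, so a further layout-preserving copy duplicates the buffer without changing strides or values.

```python
import numpy as np

# A 2D block in NumPy's default row-major (C) layout.
block = np.arange(12, dtype="float64").reshape(3, 4)

# np.asfortranarray returns a column-major (F-ordered) array with equal values.
f_block = np.asfortranarray(block)

# Copying an already F-ordered array while keeping its layout (order="K")
# allocates a new buffer but leaves strides and values unchanged.
redundant = f_block.copy(order="K")

same_layout = redundant.strides == f_block.strides
same_values = bool(np.array_equal(redundant, f_block))
new_buffer = not np.shares_memory(redundant, f_block)
```

Here `same_layout`, `same_values`, and `new_buffer` all hold: the extra copy costs an allocation without adding any property the array did not already have.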

2. `HDFStore.select_as_multiple` (GH-26771)

`TableIterator.get_result(coordinates=True)` always materialized all coordinates from the selector table, even when `where=None`. PyTables then performs a coordinate-based read of each table, which is much slower than a sequential read. When there is no `where` clause every row is selected anyway, so we now skip the coordinate computation.
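The control flow described above can be sketched with a toy in-memory table. Everything here (`ToyTable`, `select_as_multiple_sketch`, the read counters) is a hypothetical stand-in written for illustration; it is not the pandas or PyTables implementation and only models which read path is taken.

```python
class ToyTable:
    """Hypothetical stand-in for an on-disk table; counts which read path is used."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.sequential_reads = 0
        self.coordinate_reads = 0

    def read_all(self):
        # Models a fast sequential read of every row.
        self.sequential_reads += 1
        return list(self.rows)

    def read_coordinates(self, coords):
        # Models a slower row-by-row read at explicit positions.
        self.coordinate_reads += 1
        return [self.rows[i] for i in coords]


def select_as_multiple_sketch(tables, selector, where=None):
    """Read matching rows from every table, using `selector` to evaluate `where`."""
    if where is None:
        # No filter: every row is selected, so skip computing coordinates.
        return [t.read_all() for t in tables]
    coords = [i for i, row in enumerate(selector.rows) if where(row)]
    return [t.read_coordinates(coords) for t in tables]


left = ToyTable(range(10))
right = ToyTable(range(100, 110))

everything = select_as_multiple_sketch([left, right], selector=left)
filtered = select_as_multiple_sketch(
    [left, right], selector=left, where=lambda v: v % 2 == 0
)
```

With `where=None` both tables are read sequentially and no coordinates are built; the filtered call still goes through the coordinate path, so results for queries with a `where` clause are unaffected.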

Benchmark on a 500k×8 frame stored as two halves:

| | before | after |
|---|---|---|
| `select` + `concat` | 0.015s | 0.015s |
| `select_as_multiple` | 0.092s (~5× slower) | 0.023s (~1.5× slower) |

3. `to_hdf(..., append=True, data_columns=True, index=False)` (GH-25839)

Every `set_attr` call on a column does three `setattr` calls on the HDF5 table's attribute set. PyTables implements attribute updates by deleting and recreating the attribute on disk (roughly 20ms each), so wide-table appends spent essentially all their time re-writing identical column metadata. With 100 data columns this dominated runtime.

`validate_attr`/`validate_metadata` already enforce that the values we'd be writing match what's on disk during an append, so a new `_set_attr_if_changed` helper checks the current value first and skips the write when nothing has changed.
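A minimal sketch of the check-before-write idea follows. `FakeAttrs`, `set_attr_if_changed`, and the attribute keys are hypothetical names invented for this illustration; the body of the real `_set_attr_if_changed` helper is not shown in this description and may differ. The fake attribute set simply counts writes so the saving is visible.

```python
_MISSING = object()


class FakeAttrs:
    """Hypothetical attribute store where each write is assumed to be expensive."""

    def __init__(self):
        self._values = {}
        self.writes = 0

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self.writes += 1  # stands in for a delete-and-recreate on disk
        self._values[key] = value


def set_attr_if_changed(attrs, key, value):
    """Write `value` only if it differs from what is stored; return True if written."""
    current = attrs.get(key, _MISSING)
    # Plain equality is assumed to be enough for simple scalar metadata.
    if current is not _MISSING and current == value:
        return False
    attrs.set(key, value)
    return True


def write_column_metadata(attrs, n_columns):
    # Three attributes per column, mirroring the three setattr calls per column.
    for i in range(n_columns):
        set_attr_if_changed(attrs, f"col{i}_kind", "float")
        set_attr_if_changed(attrs, f"col{i}_meta", None)
        set_attr_if_changed(attrs, f"col{i}_dtype", "float64")


attrs = FakeAttrs()
write_column_metadata(attrs, 100)   # create: every attribute is new
writes_on_create = attrs.writes
write_column_metadata(attrs, 100)   # append: identical metadata
writes_on_append = attrs.writes - writes_on_create
```

In this toy model the create pass performs 300 writes and the append pass performs none, which is the behaviour the helper is meant to achieve for unchanged metadata.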

On the GH-25839 reproducer (300 × 100 mixed-dtype frame):

| run | before | after |
|---|---|---|
| first (create) | 0.06s | 0.06s |
| second (append) | 6.02s | 0.03s |

(~200× speedup; scales with number of data columns.)
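As a rough consistency check, the figures quoted above can be combined in a few lines. The constants are taken from this description (three attribute writes per column, roughly 20ms per write, 100 data columns); the variable names are illustrative only.

```python
ATTRS_PER_COLUMN = 3      # setattr calls per set_attr call
MS_PER_ATTR_WRITE = 20    # approximate cost of one attribute rewrite
DATA_COLUMNS = 100        # width of the reproducer frame

# Estimated time spent rewriting metadata on one append, in milliseconds.
estimated_ms = ATTRS_PER_COLUMN * MS_PER_ATTR_WRITE * DATA_COLUMNS

# Measured append times from the table above, in seconds.
before_s, after_s = 6.02, 0.03
speedup = before_s / after_s
```

The estimate comes to about 6 seconds, in line with the measured 6.02s append, and the ratio of the two measured times is close to the quoted ~200×.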

Test plan

* `pytest pandas/tests/io/pytables/`: 587 passed, 1 skipped
* Manual reproducers from all three issues verified
* `test_hdfstore_strides` still passes (memory layout preserved)

🤖 Generated with Claude Code

pandas-devGH-26771, pandas-devGH-47726)

* read_hdf fixed format: drop redundant .copy() after concat; block memory layout is already preserved by np.asfortranarray in the per-block construction (pandas-devGH-60469).
* HDFStore.select_as_multiple: skip the coordinate-based read when there is no where clause, since every row is selected anyway and a sequential read is much faster than a coordinate-based one.
* DataCol/IndexCol set_attr: skip re-writing HDF5 attributes whose value already matches what's on disk. PyTables deletes and recreates each attribute per setattr, so on wide-table appends this previously dominated runtime (~20ms per attribute × 3 attributes × N data columns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jbrockmendel and others added 2 commits May 11, 2026 14:02

Merge branch 'main' into perf-hdf

f8235ae

jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-hdf
