PERF: speed up HDF5 read/append/select_as_multiple#65607

Open
jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-hdf

@jbrockmendel

closes #25839
closes #26771
closes #47726

Three low-effort perf wins for HDF5 IO, each tied to a long-standing performance issue.

1. read_hdf fixed format (GH-47726)

BlockManagerFixed.read did concat(dfs, axis=1).copy(). The trailing .copy() was originally added to force a column-major layout after concat consolidated blocks under CoW. Since GH-60469 each per-block array passed to the DataFrame constructor is already F-ordered via np.asfortranarray, so the post-concat copy is now pure overhead.

Removing it preserves the block memory layout that test_hdfstore_strides exercises (which was the blocker the last time this was attempted) and saves a full-frame copy on every fixed-format read.
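
To illustrate why a copy after an F-ordering step adds nothing, here is a minimal NumPy-only sketch. It is not the pandas code path; `values` is a hypothetical stand-in for one block's data, and the point is only that an array that is already column-major does not need another pass to become so.

```python
import numpy as np

# Hypothetical stand-in for one block's values (C-ordered by default).
values = np.arange(12, dtype="float64").reshape(3, 4)

# Force column-major layout, as the per-block construction is described to do.
f_values = np.asfortranarray(values)
is_f_ordered = f_values.flags.f_contiguous and not f_values.flags.c_contiguous

# Asking for F order again is a no-op: the same object comes back.
again = np.asfortranarray(f_values)

# An explicit copy keeps the layout but allocates a second full buffer.
extra = f_values.copy(order="F")
copy_is_new_buffer = not np.shares_memory(extra, f_values)
```

The extra buffer holds identical data in an identical layout, which is the sense in which the trailing copy is overhead once the inputs are already F-ordered.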

2. HDFStore.select_as_multiple (GH-26771)

TableIterator.get_result(coordinates=True) always materialized all coordinates from the selector table — even when where=None. PyTables then performs a coordinate-based read of each table, which is much slower than a sequential read. When there is no where clause every row is selected anyway, so we now skip the coordinate computation.
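
The control flow can be sketched roughly as follows. This is a simplified model rather than the actual `TableIterator` code: `FakeTable`, `select_rows`, and the `calls` log are hypothetical names introduced only to show which read path is taken.

```python
class FakeTable:
    """Hypothetical stand-in for an on-disk table; records which read path ran."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.calls = []

    def read(self):
        # Sequential read of every row.
        self.calls.append("sequential")
        return list(self.rows)

    def read_coordinates(self, coords):
        # Row-by-row read at explicit positions.
        self.calls.append("coordinates")
        return [self.rows[i] for i in coords]


def select_rows(selector, tables, where=None):
    """Read matching rows from each table, skipping coordinates when where is None."""
    if where is None:
        # Every row is selected, so no coordinates need to be computed.
        return [t.read() for t in tables]
    coords = [i for i, row in enumerate(selector.rows) if where(row)]
    return [t.read_coordinates(coords) for t in tables]


left = FakeTable([1, 2, 3, 4])
right = FakeTable(["a", "b", "c", "d"])
everything = select_rows(left, [left, right])
evens = select_rows(left, [left, right], where=lambda v: v % 2 == 0)
```

In this model the first call only ever touches the sequential path, and the coordinate path is reserved for calls that actually filter.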

Benchmark on a 500k×8 frame stored as two halves:

| | before | after |
|---|---|---|
| select + concat | 0.015s | 0.015s |
| select_as_multiple | 0.092s (~5× slower) | 0.023s (~1.5× slower) |

3. to_hdf(..., append=True, data_columns=True, index=False) (GH-25839)

Every set_attr call on a column does three setattr calls on the HDF5 table's attribute set. PyTables implements attribute updates by deleting and recreating the attribute on disk, at roughly 20 ms each, so wide-table appends spent essentially all their time rewriting identical column metadata. With 100 data columns that is about 300 attribute writes, or roughly 6 s per append, which dominated runtime.

validate_attr/validate_metadata already enforce that the values we'd be writing match what's on disk during an append, so a new _set_attr_if_changed helper checks the current value first and skips the write when nothing has changed.
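
A check-before-write helper of this kind might look like the sketch below. It is an illustration of the idea, not the `_set_attr_if_changed` implementation from this PR: `CountingAttrs`, `set_attr_if_changed`, `write_column_metadata`, and the attribute names are all hypothetical, and the comparison assumes plain values for which `!=` returns a bool.

```python
_MISSING = object()  # sentinel meaning "attribute not present yet"


class CountingAttrs:
    """Hypothetical attribute set that counts every write it receives."""

    def __init__(self):
        object.__setattr__(self, "writes", 0)

    def __setattr__(self, name, value):
        # Each write stands in for one delete-and-recreate on disk.
        object.__setattr__(self, "writes", self.writes + 1)
        object.__setattr__(self, name, value)


def set_attr_if_changed(attrs, key, value):
    """Write the attribute only if it is missing or differs; return True if written."""
    current = getattr(attrs, key, _MISSING)
    if current is _MISSING or current != value:
        setattr(attrs, key, value)
        return True
    return False


def write_column_metadata(attrs, n_columns):
    # Three attributes per column, mirroring the three setattr calls per set_attr.
    for i in range(n_columns):
        set_attr_if_changed(attrs, f"col{i}_kind", "float")
        set_attr_if_changed(attrs, f"col{i}_meta", None)
        set_attr_if_changed(attrs, f"col{i}_dtype", "float64")


attrs = CountingAttrs()
write_column_metadata(attrs, 100)   # create: every attribute is new
writes_after_create = attrs.writes
write_column_metadata(attrs, 100)   # append: identical metadata
writes_after_append = attrs.writes
```

In this toy model the create pass performs 300 writes and the append pass performs none, which is the shape of the saving described above.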

On the GH-25839 reproducer (300 × 100 mixed-dtype frame):

| run | before | after |
|---|---|---|
| first (create) | 0.06s | 0.06s |
| second (append) | 6.02s | 0.03s |

(~200× speedup; scales with number of data columns.)

Test plan

  • pytest pandas/tests/io/pytables/ — 587 passed, 1 skipped
  • Manual reproducers from all three issues verified
  • test_hdfstore_strides still passes (memory layout preserved)

🤖 Generated with Claude Code

jbrockmendel and others added 2 commits May 11, 2026 14:02
GH-26771, GH-47726)

* read_hdf fixed format: drop redundant .copy() after concat — block
  memory layout is already preserved by np.asfortranarray in the per-block
  construction (GH-60469).
* HDFStore.select_as_multiple: skip the coordinate-based read when there
  is no where clause, since every row is selected anyway and a sequential
  read is much faster than a coordinate-based one.
* DataCol/IndexCol set_attr: skip re-writing HDF5 attributes whose value
  already matches what's on disk. PyTables deletes and recreates each
  attribute per setattr, so on wide-table appends this previously
  dominated runtime (~20ms per attribute × 3 attributes × N data
  columns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>