Fix xdist worker race condition on scanpy dataset cache #83
Merged
Under pytest -n auto, multiple xdist workers invoked sc.datasets.pbmc3k() concurrently, racing on the single shared cache file pbmc3k_raw.h5ad. One worker reading while another wrote/downloaded intermittently produced an HDF5 "filter returned failure during read" OSError during adata_pbmc3k setup. Give each worker its own scanpy datasetdir keyed by PYTEST_XDIST_WORKER via a pytest_configure hook, eliminating the shared-file contention. The fixture stays function-scoped, so downstream mutation behavior is unchanged.
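A minimal sketch of what such a hook might look like in `conftest.py`; this is not the PR's actual code. The base path `tests/data/scanpy_cache` comes from the PR description. The helper name `worker_cache_dir` and the constant `BASE_CACHE` are hypothetical, and `sc.settings.datasetdir` is assumed to be the setting being redirected.

```python
import os
from pathlib import Path

# Assumed base directory, taken from the path mentioned in the PR description.
BASE_CACHE = Path("tests/data/scanpy_cache")


def worker_cache_dir(base: Path, environ=os.environ) -> Path:
    """Return a per-worker directory under xdist, else the shared base.

    Hypothetical helper: pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0")
    in each worker process; the variable is absent in a plain pytest run.
    """
    worker_id = environ.get("PYTEST_XDIST_WORKER")
    return base / worker_id if worker_id else base


def pytest_configure(config):
    """Point scanpy's dataset cache at a directory private to this worker."""
    # Imported lazily so the helper above can be used without scanpy installed.
    import scanpy as sc

    cache_dir = worker_cache_dir(BASE_CACHE)
    # parents=True / exist_ok=True make this safe whether or not the
    # directory already exists, in single- and multi-worker runs alike.
    cache_dir.mkdir(parents=True, exist_ok=True)
    sc.settings.datasetdir = cache_dir
```

Under these assumptions, worker `gw1` would read and write `tests/data/scanpy_cache/gw1/pbmc3k_raw.h5ad`, so no two workers touch the same file. The cost is that each worker downloads its own copy of the dataset.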
…c-c6v97q # Conflicts: # CHANGELOG.md
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #83   +/-   ##
=======================================
  Coverage   86.37%   86.37%
=======================================
  Files          13       13
  Lines        1387     1387
=======================================
  Hits         1198     1198
  Misses        189      189
Summary
Fixed intermittent HDF5 read failures when running tests under `pytest -n auto` by isolating each xdist worker's scanpy dataset cache to its own directory. Previously, multiple workers would race on the shared `pbmc3k_raw.h5ad` cache file, causing one worker to read while another was still downloading/writing.

Behavior Or Invariants Changed
- `tests/data/scanpy_cache/{worker_id}/`
- `tests/data/scanpy_cache/` to `.gitignore`

Tests Run
Existing test suite passes. The fix is validated by running tests under parallel execution (`pytest -n auto`), which previously surfaced the race condition intermittently.

Reviewer Focus
- `pytest_configure()` hook in `conftest.py`: Ensures cache isolation is set up before any tests run
- `PYTEST_XDIST_WORKER`: Only applies isolation when running under xdist
- `parents=True, exist_ok=True`: Safe for both single-worker and multi-worker scenarios

Context
The `adata_pbmc3k` fixture uses `sc.datasets.pbmc3k()`, which downloads a large dataset to a shared cache directory. Under parallel test execution, multiple workers attempt to read/write this file simultaneously, causing HDF5 synchronization errors. This is a common issue with pytest-xdist when fixtures depend on shared external resources.

The solution leverages pytest's `pytest_configure` hook to detect xdist workers (via the `PYTEST_XDIST_WORKER` environment variable) and redirect each worker's scanpy cache before any tests execute.

Open Questions Or Follow-Ups
None. This is a straightforward isolation fix with no behavioral changes to the actual test logic.
https://claude.ai/code/session_018GKskG6NPe5KeUyhSWfLLn