TST: Add test for writing UUIDs to parquet with pyarrow #61602 #65647
GiTaDi-CrEaTe wants to merge 16 commits into
Conversation
pre-commit.ci autofix
for more information, see https://pre-commit.ci
def test_to_parquet_uuid_supported(tmp_path):
    # GH 61602
    pytest.importorskip("pyarrow", minversion="24.0.0")
Skips should be done at test collection, not test execution, wherever possible. Use
@td.skip_if_no("pyarrow", min_version="24.0")
instead.
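For reference, a minimal sketch of what the collection-time skip could look like, assuming the test lives in pandas/tests/io/test_parquet.py, that pandas.util._test_decorators is imported under its usual alias td, and that the frame has a single "uuid" column (the column name and contents are illustrative assumptions, not shown in the quoted diff):

```python
import uuid

import pandas.util._test_decorators as td

import pandas as pd


@td.skip_if_no("pyarrow", min_version="24.0")  # skip decided at collection time
def test_to_parquet_uuid_supported(tmp_path):
    # GH 61602
    # Column name and frame contents are illustrative assumptions.
    df = pd.DataFrame({"uuid": [uuid.uuid4(), uuid.uuid4()]})
    path = tmp_path / "uuid.parquet"
    df.to_parquet(path, engine="pyarrow")
```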
# Verify it can be read back
result = read_parquet(path, engine="pyarrow")
assert len(result) == 2
Can you test the full result? I think the following would work:
tm.assert_frame_equal(result, df)
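For context, a short sketch of the stronger round-trip assertion, assuming df and path are the frame and file created earlier in the test and that pandas._testing is imported as tm:

```python
import pandas._testing as tm
from pandas import read_parquet

# Read the file back and compare the full frame, not just its length.
result = read_parquet(path, engine="pyarrow")
tm.assert_frame_equal(result, df)
```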
Thanks for the review and the pointers, @rhshadrach! I have updated the test to use the @td.skip_if_no decorator for collection-time skipping and tm.assert_frame_equal to verify the full DataFrame. Pushed the changes.
Looks like there is still an issue; while the read is successful, some of the builds are getting bytes instead of UUIDs. This will need investigation to determine whether it needs fixing on the pandas or the PyArrow side.
@rhshadrach, looking at the logs, it seems the Parquet FIXED_LEN_BYTE_ARRAY isn't being cast back to Python UUID objects during deserialization, specifically on the py314 and PyArrow nightly builds, leaving them as raw bytes.
@rhshadrach I investigated the nightly failures. The Parquet FIXED_LEN_BYTE_ARRAY is preserving the data correctly, but the PyArrow nightly/py314 builds fail to cast those 16 bytes back into Python UUID objects during deserialization. I pushed a commit that checks whether the result comes back as raw bytes and maps it back to a UUID object for the assertion. This keeps the data-integrity check strict while working around the upstream nightly object-casting quirk. Let me know if this pragmatic fallback works for you!
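A hedged sketch of the kind of fallback described above, assuming a single "uuid" column and that df and path are defined earlier in the test; the exact commit may differ:

```python
import uuid

import pandas._testing as tm
from pandas import read_parquet

result = read_parquet(path, engine="pyarrow")
# Some nightly/py314 builds return the 16-byte payload instead of uuid.UUID
# objects, so map the raw bytes back before asserting equality.
if isinstance(result["uuid"].iloc[0], bytes):
    result["uuid"] = result["uuid"].map(lambda b: uuid.UUID(bytes=b))
tm.assert_frame_equal(result, df)
```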
@td.skip_if_no("pyarrow", min_version="24.0")
def test_to_parquet_uuid_supported(tmp_path):
Can you use temp_file instead?
@mroeschke Done! Swapped tmp_path for the temp_file fixture. Thanks for the review!!
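Putting the review feedback together, a sketch of what the final test might look like, assuming pandas' temp_file conftest fixture yields a temporary file path as the reviewer suggests; the "uuid" column name and two-row frame remain illustrative assumptions:

```python
import uuid

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm
from pandas import read_parquet


@td.skip_if_no("pyarrow", min_version="24.0")
def test_to_parquet_uuid_supported(temp_file):
    # GH 61602
    df = pd.DataFrame({"uuid": [uuid.uuid4(), uuid.uuid4()]})
    df.to_parquet(temp_file, engine="pyarrow")

    result = read_parquet(temp_file, engine="pyarrow")
    tm.assert_frame_equal(result, df)
```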
Resolves #61602.
Added a test to test_parquet.py to verify that to_parquet successfully writes uuid.UUID objects when using pyarrow >= 24.0.0. The test uses importorskip to skip gracefully on older PyArrow versions where the bug still exists.