
Change materialize to use the parquet encoders and df_loaders #2549

Draft
arienandalibi wants to merge 75 commits into db_v4 from db_v4_bulk_ingestion

Conversation

@arienandalibi
Collaborator

What changes were proposed in this pull request?

Instead of materialize having its own custom ingestion implementation, we change it to leverage the parquet encoder/serialization pathway, which generates RecordBatches, and then ingest those RecordBatches via the df_loaders pathway.
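As a rough, std-only sketch of the encode-then-ingest flow (the `Batch` struct stands in for arrow's `RecordBatch`, and `encode_edges` is a hypothetical helper, not raphtory's API):

```rust
// Hypothetical stand-in for arrow's RecordBatch: named columns of equal length.
struct Batch {
    src: Vec<u64>,
    dst: Vec<u64>,
    time: Vec<i64>,
}

// Encode edges into fixed-size batches instead of one giant buffer,
// so the loader can ingest each batch as it is produced.
fn encode_edges(edges: &[(u64, u64, i64)], batch_size: usize) -> Vec<Batch> {
    edges
        .chunks(batch_size)
        .map(|chunk| Batch {
            src: chunk.iter().map(|e| e.0).collect(),
            dst: chunk.iter().map(|e| e.1).collect(),
            time: chunk.iter().map(|e| e.2).collect(),
        })
        .collect()
}

fn main() {
    let edges = vec![(1, 2, 10), (2, 3, 20), (3, 1, 30)];
    let batches = encode_edges(&edges, 2);
    assert_eq!(batches.len(), 2);
    assert_eq!(batches[0].src, vec![1, 2]);
    assert_eq!(batches[1].time, vec![30]);
    println!("encoded {} batches", batches.len());
}
```

In the real pathway each batch would be an arrow `RecordBatch` handed to the df_loaders; the point of the shape is that encoding and ingestion operate batch by batch rather than on the whole graph at once.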

Why are the changes needed?

To reduce complexity by removing an otherwise-unused custom ingestion pathway, and to improve future maintainability.

Does this PR introduce any user-facing change? If yes is this documented?

It shouldn't.

How was this patch tested?

Tested using the SNB SF1 and SF3 datasets, comparing the new materialize implementation against the old one.

Are there any further changes required?

There shouldn't be.

…re-compute new IDs and turn them into RecordBatches
…ock the graph to get parallel iterators over edges. We filter to respect GraphView filtering behaviour.
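The parallel, filtered edge iteration described in this commit can be sketched std-only as follows (the edge representation, filter predicate, and function are hypothetical stand-ins for GraphView filtering, not raphtory's API):

```rust
use std::thread;

// Count edges that pass a view filter, splitting the edge list across
// worker threads: a std-only sketch of parallel filtered iteration.
fn parallel_filtered_count(
    edges: &[(u64, u64)],
    keep: fn(&(u64, u64)) -> bool,
    workers: usize,
) -> usize {
    // Chunk size rounds up so every edge lands in exactly one chunk.
    let chunk = ((edges.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = edges
            .chunks(chunk)
            .map(|part| {
                // Each worker applies the filter to its own slice only.
                s.spawn(move || part.iter().copied().filter(|e| keep(e)).count())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let edges = vec![(1, 2), (2, 3), (3, 4), (4, 5)];
    // Hypothetical view filter: keep edges whose source id is even.
    let n = parallel_filtered_count(&edges, |e| e.0 % 2 == 0, 2);
    assert_eq!(n, 2);
    println!("{} edges kept", n);
}
```

The filtering happens inside each worker, which mirrors the idea in the commit: the parallel iterator yields all edges, and the filter is applied per item to respect the view's semantics.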
…ill use ArrowWriter<File> for now, but we will add support for loading into a graph
# Conflicts:
#	raphtory/src/serialise/parquet/mod.rs
… function can now be passed to these functions to determine how the sinks will be created. This will allow us to pass a sink which is a crossbeam_channel to send RecordBatches elsewhere.
# Conflicts:
#	raphtory/src/serialise/parquet/mod.rs
…f encoding everything and then ingesting everything (which would keep everything in memory at once).
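The streaming pipeline these commits describe, encoding batches on one side and ingesting them on the other rather than holding everything in memory, can be sketched with a bounded channel. The PR mentions crossbeam_channel; this self-contained sketch uses std's `sync_channel`, which has the same bounded shape, and `Batch` is a placeholder for arrow's `RecordBatch`:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Placeholder for an arrow RecordBatch: a small vector of row values.
type Batch = Vec<u64>;

// Encode `n` two-row batches on one thread and ingest them on another,
// with at most `cap` batches buffered in the channel at any time.
fn stream_ingest(n: u64, cap: usize) -> usize {
    let (tx, rx) = sync_channel::<Batch>(cap);
    let encoder = thread::spawn(move || {
        for i in 0..n {
            // The encoder blocks here whenever the loader falls behind,
            // so memory stays bounded instead of holding every batch at once.
            tx.send(vec![i, i + 1]).unwrap();
        }
        // Dropping tx closes the channel and ends the loader's loop.
    });
    let mut rows = 0;
    for batch in rx {
        rows += batch.len(); // "ingest": here we just count rows
    }
    encoder.join().unwrap();
    rows
}

fn main() {
    assert_eq!(stream_ingest(10, 4), 20);
    println!("streamed 10 batches, 20 rows");
}
```

The channel capacity is what prevents the "encode everything, then ingest everything" memory blow-up: only `cap` encoded batches can exist between the two stages at any moment.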
…anning each segment for each row. Now using this path in the new materialize_using_recordbatches function.
@arienandalibi arienandalibi self-assigned this Apr 8, 2026
…gesT. This includes Node GIDs and node types. Propagated changes to materialize_using_recordbatches. Filtered test passes.
# Conflicts:
#	raphtory/src/arrow_loader/df_loaders/edges.rs
#	raphtory/src/arrow_loader/df_loaders/nodes.rs
#	raphtory/src/serialise/parquet/edges.rs
#	raphtory/src/serialise/parquet/model.rs
#	raphtory/src/serialise/parquet/nodes.rs
… compacted for each layer. This saves a lot of disk space when saved to a directory.
…en creating the arrow writer for parquet encoding.
…ing resolved, only looked up, which was causing failures related to metadata not being added for nodes that haven't already been resolved.
… materialization bug regarding persistent graphs
…ith ordering of parquet files when loading data. Instead, we now have atomically incrementing counters for file ids.
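An atomically incrementing file-id counter like the one this commit describes can be sketched as follows (the zero-padded naming scheme is a hypothetical illustration, not the PR's actual format):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Shared counter: each writer grabs a unique, monotonically increasing
// file id, so parquet file names no longer depend on thread timing.
static FILE_ID: AtomicU64 = AtomicU64::new(0);

fn next_parquet_path(prefix: &str) -> String {
    // fetch_add returns the previous value, so ids start at 0 and never repeat.
    let id = FILE_ID.fetch_add(1, Ordering::Relaxed);
    format!("{}-{:06}.parquet", prefix, id)
}

fn main() {
    let a = next_parquet_path("edges");
    let b = next_parquet_path("edges");
    // Zero-padding keeps lexicographic order equal to creation order.
    assert!(a < b);
    println!("{} {}", a, b);
}
```

Because the padded ids sort lexicographically in creation order, a loader that lists the directory sees files in a deterministic order regardless of which thread wrote each one.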
…o avoid a deadlock when the scope's num_threads is 1. Removed resizing of segments to the max eid to avoid empty segments when a graph is filtered. This was leading to empty graph errors.