
Change materialize to use the parquet encoders and df_loaders #2549

Draft
arienandalibi wants to merge 75 commits into db_v4 from db_v4_bulk_ingestion

Conversation

@arienandalibi
Collaborator

What changes were proposed in this pull request?

Instead of materialize having its own custom ingestion implementation, we change it to leverage the parquet encoder/serialization pathway, which generates RecordBatches, and then ingest those RecordBatches via the df_loaders pathway.
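As a rough, std-only sketch of the encode-then-ingest flow (the `Batch` struct stands in for arrow's `RecordBatch`, and `encode_edges` is a hypothetical helper, not raphtory's API):

```rust
// Hypothetical stand-in for arrow's RecordBatch: named columns of equal length.
struct Batch {
    src: Vec<u64>,
    dst: Vec<u64>,
    time: Vec<i64>,
}

// Encode edges into fixed-size batches instead of one giant buffer,
// so the loader can ingest each batch as it is produced.
fn encode_edges(edges: &[(u64, u64, i64)], batch_size: usize) -> Vec<Batch> {
    edges
        .chunks(batch_size)
        .map(|chunk| Batch {
            src: chunk.iter().map(|e| e.0).collect(),
            dst: chunk.iter().map(|e| e.1).collect(),
            time: chunk.iter().map(|e| e.2).collect(),
        })
        .collect()
}

fn main() {
    let edges = vec![(1, 2, 10), (2, 3, 20), (3, 1, 30)];
    let batches = encode_edges(&edges, 2);
    assert_eq!(batches.len(), 2);
    assert_eq!(batches[0].src, vec![1, 2]);
    assert_eq!(batches[1].time, vec![30]);
    println!("encoded {} batches", batches.len());
}
```

In the real pathway each batch would be an arrow `RecordBatch` handed to the df_loaders; the point of the shape is that encoding and ingestion operate batch by batch rather than on the whole graph at once.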

Why are the changes needed?

To reduce complexity by removing an otherwise-unused custom ingestion pathway, and to improve future maintainability.

Does this PR introduce any user-facing change? If yes is this documented?

It shouldn't.

How was this patch tested?

Tested using the SNB SF1 and SF3 datasets, comparing the new materialize implementation against the old one.

Are there any further changes required?

There shouldn't be.

…re-compute new IDs and turn them into RecordBatches
…ock the graph to get parallel iterators over edges. We filter to respect GraphView filtering behaviour.
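The parallel, filtered edge iteration described in this commit can be sketched std-only as follows (the edge representation, filter predicate, and function are hypothetical stand-ins for GraphView filtering, not raphtory's API):

```rust
use std::thread;

// Count edges that pass a view filter, splitting the edge list across
// worker threads: a std-only sketch of parallel filtered iteration.
fn parallel_filtered_count(
    edges: &[(u64, u64)],
    keep: fn(&(u64, u64)) -> bool,
    workers: usize,
) -> usize {
    // Chunk size rounds up so every edge lands in exactly one chunk.
    let chunk = ((edges.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = edges
            .chunks(chunk)
            .map(|part| {
                // Each worker applies the filter to its own slice only.
                s.spawn(move || part.iter().copied().filter(|e| keep(e)).count())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let edges = vec![(1, 2), (2, 3), (3, 4), (4, 5)];
    // Hypothetical view filter: keep edges whose source id is even.
    let n = parallel_filtered_count(&edges, |e| e.0 % 2 == 0, 2);
    assert_eq!(n, 2);
    println!("{} edges kept", n);
}
```

The filtering happens inside each worker, which mirrors the idea in the commit: the parallel iterator yields all edges, and the filter is applied per item to respect the view's semantics.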
…ill use ArrowWriter<File> for now, but we will add support for loading into a graph
# Conflicts:
#	raphtory/src/serialise/parquet/mod.rs
… function can now be passed to these functions to determine how the sinks will be created. This will allow us to pass a sink which is a crossbeam_channel to send RecordBatches elsewhere.
# Conflicts:
#	raphtory/src/serialise/parquet/mod.rs
…f encoding everything and then ingesting everything (which would keep everything in memory at once).
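The streaming pipeline these commits describe, encoding batches on one side and ingesting them on the other rather than holding everything in memory, can be sketched with a bounded channel. The PR mentions crossbeam_channel; this self-contained sketch uses std's `sync_channel`, which has the same bounded shape, and `Batch` is a placeholder for arrow's `RecordBatch`:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Placeholder for an arrow RecordBatch: a small vector of row values.
type Batch = Vec<u64>;

// Encode `n` two-row batches on one thread and ingest them on another,
// with at most `cap` batches buffered in the channel at any time.
fn stream_ingest(n: u64, cap: usize) -> usize {
    let (tx, rx) = sync_channel::<Batch>(cap);
    let encoder = thread::spawn(move || {
        for i in 0..n {
            // The encoder blocks here whenever the loader falls behind,
            // so memory stays bounded instead of holding every batch at once.
            tx.send(vec![i, i + 1]).unwrap();
        }
        // Dropping tx closes the channel and ends the loader's loop.
    });
    let mut rows = 0;
    for batch in rx {
        rows += batch.len(); // "ingest": here we just count rows
    }
    encoder.join().unwrap();
    rows
}

fn main() {
    assert_eq!(stream_ingest(10, 4), 20);
    println!("streamed 10 batches, 20 rows");
}
```

The channel capacity is what prevents the "encode everything, then ingest everything" memory blow-up: only `cap` encoded batches can exist between the two stages at any moment.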
…anning each segment for each row. Now using this path in the new materialize_using_recordbatches function.
@arienandalibi arienandalibi self-assigned this Apr 8, 2026
…gesT. This includes Node GIDs and node types. Propagated changes to materialize_using_recordbatches. Filtered test passes.
# Conflicts:
#	raphtory/src/arrow_loader/df_loaders/edges.rs
#	raphtory/src/arrow_loader/df_loaders/nodes.rs
#	raphtory/src/serialise/parquet/edges.rs
#	raphtory/src/serialise/parquet/model.rs
#	raphtory/src/serialise/parquet/nodes.rs
… compacted for each layer. This saves a lot of disk space when saved to a directory.
…en creating the arrow writer for parquet encoding.
…ing resolved, only looked up, which was causing failures related to metadata not being added for nodes that haven't already been resolved.
… materialization bug regarding persistent graphs
…ith ordering of parquet files when loading data. Instead, we now have atomically incrementing counters for file ids.
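An atomically incrementing file-id counter like the one this commit describes can be sketched as follows (the zero-padded naming scheme is a hypothetical illustration, not the PR's actual format):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Shared counter: each writer grabs a unique, monotonically increasing
// file id, so parquet file names no longer depend on thread timing.
static FILE_ID: AtomicU64 = AtomicU64::new(0);

fn next_parquet_path(prefix: &str) -> String {
    // fetch_add returns the previous value, so ids start at 0 and never repeat.
    let id = FILE_ID.fetch_add(1, Ordering::Relaxed);
    format!("{}-{:06}.parquet", prefix, id)
}

fn main() {
    let a = next_parquet_path("edges");
    let b = next_parquet_path("edges");
    // Zero-padding keeps lexicographic order equal to creation order.
    assert!(a < b);
    println!("{} {}", a, b);
}
```

Because the padded ids sort lexicographically in creation order, a loader that lists the directory sees files in a deterministic order regardless of which thread wrote each one.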
…o avoid a deadlock when the scope's num_threads is 1. Removed resizing of segments to the max eid to avoid empty segments when a graph is filtered. This was leading to empty graph errors.