Change materialize to use the parquet encoders and df_loaders #2549
Draft
arienandalibi wants to merge 75 commits into db_v4 from …
Conversation
…re-compute new IDs and turn them into RecordBatches
…ock the graph to get parallel iterators over edges. We filter to respect GraphView filtering behaviour.
…ill use ArrowWriter<File> for now, but we will add support for loading into a graph
# Conflicts:
#   raphtory/src/serialise/parquet/mod.rs
…ng explode_layers() on each EdgeView.
… function can now be passed to these functions to determine how the sinks are created. This allows us to pass a sink backed by a crossbeam_channel that sends RecordBatches elsewhere (see the sketch after this commit list).
# Conflicts:
#   raphtory/src/serialise/parquet/mod.rs
…w materialize function
…dge_meta and node_meta.
…k and reusing the old one.
…f encoding everything and then ingesting everything (which would keep everything in memory at once).
… when run on a big graph.
…another thread pool.
…rage so that it doesn't run out of memory
…anning each segment for each row. Now using this path in the new materialize_using_recordbatches function.
… as much as possible
…gesT. This includes Node GIDs and node types. Propagated changes to materialize_using_recordbatches. Filtered test passes.
# Conflicts:
#   raphtory/src/arrow_loader/df_loaders/edges.rs
#   raphtory/src/arrow_loader/df_loaders/nodes.rs
#   raphtory/src/serialise/parquet/edges.rs
#   raphtory/src/serialise/parquet/model.rs
#   raphtory/src/serialise/parquet/nodes.rs
…ty layers are created.
… compacted for each layer. This saves a lot of disk space when saved to a directory.
…and copied from the graph on disk.
…en creating the arrow writer for parquet encoding.
…ing resolved, only looked up, which caused failures where metadata was not added for nodes that had not already been resolved.
… materialization bug regarding persistent graphs
…n persistent graphs
…ith ordering of parquet files when loading data. Instead, we now have atomically incrementing counters for file ids.
…o avoid a deadlock when the scope's num_threads is 1. Removed resizing of segments to the max eid to avoid empty segments when a graph is filtered. This was leading to empty graph errors.
…m_df, and internalise otherwise
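Several of the commits above describe streaming RecordBatches through a channel-backed sink so that encoding and ingestion overlap, rather than encoding everything before ingesting anything. Below is a minimal, self-contained sketch of that pattern, not raphtory's actual sink API: the producer thread stands in for the parquet encoders, the consumer loop for the df_loaders, and the schema and data are placeholders.

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use crossbeam_channel::bounded;

fn main() {
    // A bounded channel applies backpressure: the encoder blocks on `send`
    // when the loader falls behind, instead of buffering batches in memory.
    let (tx, rx) = bounded::<RecordBatch>(8);

    // Producer: stands in for the parquet encoders emitting RecordBatches.
    let encoder = std::thread::spawn(move || {
        let schema = Arc::new(Schema::new(vec![Field::new("src", DataType::Int64, false)]));
        for chunk in 0..4i64 {
            let col: ArrayRef = Arc::new(Int64Array::from(vec![chunk; 1024]));
            let batch = RecordBatch::try_new(schema.clone(), vec![col]).unwrap();
            tx.send(batch).unwrap();
        }
        // Dropping `tx` closes the channel and ends the consumer loop below.
    });

    // Consumer: stands in for the df_loaders ingesting each batch as it arrives.
    let mut rows = 0;
    for batch in rx {
        rows += batch.num_rows();
    }
    encoder.join().unwrap();
    assert_eq!(rows, 4 * 1024);
}
```

With a bounded channel, peak memory is capped at the channel's capacity times the batch size, which matches the commits' goal of not keeping everything in memory at once on large graphs.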
What changes were proposed in this pull request?
Instead of materialize having its own custom ingestion implementation, we change it to leverage the parquet encoder/serialisation pathway, which generates RecordBatches, and then ingest those batches through the df_loaders pathway.
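As a rough, hypothetical illustration of that round trip (using the arrow and parquet crates directly, not raphtory's internal types; the column names and file path are placeholders), RecordBatches can be encoded with ArrowWriter&lt;File&gt;, which one commit notes is still used for now, and read back as RecordBatches, the form the df_loaders side ingests:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("src", DataType::Int64, false),
        Field::new("dst", DataType::Int64, false),
    ]));
    let src: ArrayRef = Arc::new(Int64Array::from(vec![0i64, 1, 2]));
    let dst: ArrayRef = Arc::new(Int64Array::from(vec![1i64, 2, 0]));
    let batch = RecordBatch::try_new(schema.clone(), vec![src, dst])?;

    // Encoder side: write RecordBatches to parquet with ArrowWriter<File>.
    let mut writer = ArrowWriter::try_new(File::create("edges.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Loader side: read the file back as RecordBatches for ingestion.
    let reader =
        ParquetRecordBatchReaderBuilder::try_new(File::open("edges.parquet")?)?.build()?;
    for batch in reader {
        println!("loaded {} rows", batch?.num_rows());
    }
    Ok(())
}
```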
Why are the changes needed?
Reduce complexity by removing an otherwise-unused ingestion pathway, and improve future maintainability.
Does this PR introduce any user-facing change? If yes is this documented?
It shouldn't; the change is internal to how materialize ingests data.
How was this patch tested?
Tested on the SNB SF1 and SF3 datasets, comparing the output of the new materialize implementation against the old one.
Are there any further changes required?
There shouldn't be.