
Streaming tar packing & unpacking for Merkle tree operations #562

Open
malcolmgreaves wants to merge 1 commit into main from mg/stream_tar

@malcolmgreaves
Collaborator

Adds new reusable streaming tar packing and unpacking implementations.
Ensures that the `FileBackend` and `LmdbBackend` Merkle tree stores use
these in their `MerklePacker` and `MerkleUnpacker` implementations.

Adds tests around the streaming implementations. Additional manual
testing confirms that all client <> server operations remain
unaffected. The only change is in peak memory use, which goes from
`O(entire archive)` to `O(max tar entry)`.

@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026


📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Refactor
    • Optimized internal archive processing operations for improved memory efficiency and performance
    • Enhanced error handling for archive validation failures

Walkthrough

This PR introduces shared streaming tar/gzip utilities (util::tar_stream) for memory-efficient pack/unpack operations, then refactors the file backend, LMDB pack, and LMDB unpack implementations to use these helpers. Path traversal and entry-type validation are now centralized in the streaming module, and merkle node pairs are processed immediately on arrival rather than buffered.

Changes

Merkle node streaming tar/gzip transport

| Layer / File(s) | Summary |
| --- | --- |
| Streaming tar/gzip utilities and error integration (crates/lib/src/util/tar_stream.rs, crates/lib/src/util.rs, crates/lib/src/core/db/merkle_node/merkle_node_db.rs) | New tar_stream module provides stream_pack (lazy tar entry iteration with gzip encoding) and stream_unpack (tar iteration with pre-callback path/type validation and state threading). Error conversions map TarStreamError into MerkleDbError and OxenError variants. Comprehensive test suite validates round-trip behavior, lazy consumption, traversal rejection, state threading, and compression equivalence. |
| File backend tar unpacking refactor (crates/lib/src/core/db/merkle_node/file_backend.rs) | extract_tar_under now delegates tar+gzip parsing and validation to stream_unpack. Per-entry callback maps tar paths to destination locations, applies overwrite logic, creates directories, writes file payloads, and extracts merkle hashes via normalized destination paths. |
| LMDB pack streaming refactoring (crates/lib/src/core/db/merkle_node/lmdb/pack.rs) | Refactors from eager tar building to lazy streaming of per-hash PackEntry objects through stream_pack. New entries_for_hash and build_entries_for_hash generate directory, node, and children entries with in-memory Cursor bodies. Test suite adds rebuild_tar_with_filter utility and validates streaming unpack handling of out-of-order pairs, missing children, and orphan entries. |
| LMDB unpack streaming refactoring (crates/lib/src/core/db/merkle_node/lmdb/unpack.rs) | Refactors from buffered tar-entry collection to streaming hash-paired state machine via stream_unpack. New UnpackState incrementally pairs node and children payloads by hash, commits rows immediately upon pair completion, defers embedded children without their own pair, and uses put_node/put_link helpers with overwrite-aware LMDB writes. |
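
The hash-paired state machine described for the LMDB unpack path can be pictured roughly as follows. This is a sketch under assumptions, not the PR's `UnpackState`: the names `PairingState`, `Half`, and `accept` are invented here, the `committed` vector stands in for LMDB writes, and the handling of embedded children without their own pair is not modeled.

```rust
use std::collections::HashMap;

/// Which half of a node/children pair an entry carries (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Half {
    Node,
    Children,
}

/// Holds at most one unmatched half per hash. A completed pair is moved out
/// immediately, so memory tracks unmatched halves, not the whole archive.
#[derive(Default)]
struct PairingState {
    pending: HashMap<String, (Half, Vec<u8>)>,
    /// Stand-in for database writes: (hash, node payload, children payload).
    committed: Vec<(String, Vec<u8>, Vec<u8>)>,
}

impl PairingState {
    /// Feeds one entry in arrival order. Returns Ok(true) when it completed a pair.
    /// An error is treated as fatal; the state is not meant to be reused after it.
    fn accept(&mut self, hash: &str, half: Half, payload: Vec<u8>) -> Result<bool, String> {
        match self.pending.remove(hash) {
            None => {
                // First half seen for this hash: park it until its partner arrives.
                self.pending.insert(hash.to_string(), (half, payload));
                Ok(false)
            }
            Some((seen, _)) if seen == half => {
                Err(format!("duplicate {half:?} entry for {hash}"))
            }
            Some((_, other)) => {
                // Order-insensitive: whichever half arrived second completes the pair.
                let (node, children) = match half {
                    Half::Node => (payload, other),
                    Half::Children => (other, payload),
                };
                self.committed.push((hash.to_string(), node, children));
                Ok(true)
            }
        }
    }

    /// Hashes still waiting for their other half once the stream ends.
    fn unpaired(&self) -> Vec<&str> {
        let mut hashes: Vec<&str> = self.pending.keys().map(String::as_str).collect();
        hashes.sort();
        hashes
    }
}
```

After the stream ends, whatever remains in `unpaired()` corresponds to the missing-children and orphan cases that the summary says the test suite covers; how the real code treats them is not shown here.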

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Oxen-AI/Oxen#515: Both PRs touch file_backend.rs's tar-gz unpacking and entry validation. This PR refactors it to use the shared util::tar_stream::stream_unpack, while #515 introduced the FileBackend merkle pack/unpack with its own tar extraction and validation.
  • Oxen-AI/Oxen#512: This PR refactors file_backend.rs's extract_tar_under to use the new util::tar_stream::stream_unpack; #512 introduced that file backend module.
  • Oxen-AI/Oxen#546: Both PRs refactor merkle node tar+gzip pack/unpack in the LMDB and file backends (notably lmdb/{pack,unpack}.rs) and introduce or refine the same util::tar_stream helpers and shared transport functionality.

Suggested reviewers

  • CleanCut
  • jcelliott

Poem

🐰 A stream flows through the archive so deep,
No buffered bytes for merkle to keep—
Pairs arrive and commit right away,
While paths are guarded from going astray.
Streaming and packing in harmony play! 📦✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately captures the main change: introducing streaming tar packing and unpacking functionality for Merkle tree operations across multiple backend implementations. |
| Description check | ✅ Passed | The description is clearly related to the changeset, explaining the new streaming tar implementations and their integration into FileBackend and LmdbBackend Merkle tree operations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@malcolmgreaves malcolmgreaves marked this pull request as ready for review May 18, 2026 20:55
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 346-352: When handling EntryKind::File in the unpack callback,
ensure parent directories are created before calling
util::fs::atomic_write_from_reader_sync so unpacking is order-insensitive; check
dst_path.parent() and create_dir_all on it (propagate errors as
MerkleDbError::FsTransport) prior to invoking atomic_write_from_reader_sync on
dst_path to match pre-refactor behavior and avoid failures when directory
entries are missing or later in the stream.

In `@crates/lib/src/core/db/merkle_node/lmdb/unpack.rs`:
- Around line 122-123: The code preallocates Vec::with_capacity(entry.size as
usize) using untrusted tar header sizes (entry.size), which can cause OOM/panic;
change the allocation to avoid trusting entry.size by either capping the allowed
size first or using a safe growth path: replace Vec::with_capacity(entry.size as
usize) with Vec::new() and then use try_reserve(checked_max) before reading, or
validate/cap entry.size against a configured MAX_ENTRY_SIZE and only then
reserve; update both occurrences that call Vec::with_capacity(entry.size as
usize) and ensure read_to_end(&mut buf)? still works with the validated/capped
size.

In `@crates/lib/src/util/tar_stream.rs`:
- Around line 143-150: The current check only rejects ParentDir components;
extend the validation in the tar path-check block to also detect and reject
std::path::Component::RootDir and std::path::Component::Prefix on the incoming
path variable so absolute ("/...") and Windows drive-qualified ("C:\...") paths
return TarStreamError::PathTraversal (use the existing
TarStreamError::PathTraversal { path: path.display().to_string() }.into()).
Update the same validation loop that calls path.components().any(...) to include
matches!(c, Component::ParentDir | Component::RootDir | Component::Prefix(_)) so
an absolute entry path cannot replace the destination root in Path::join(); add
tests for absolute and drive-qualified entries to cover these cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b022ccf0-0936-49a6-a448-aba2914e5089

📥 Commits

Reviewing files that changed from the base of the PR and between 208d640 and 224c15b.

📒 Files selected for processing (6)
  • crates/lib/src/core/db/merkle_node/file_backend.rs
  • crates/lib/src/core/db/merkle_node/lmdb/pack.rs
  • crates/lib/src/core/db/merkle_node/lmdb/unpack.rs
  • crates/lib/src/core/db/merkle_node/merkle_node_db.rs
  • crates/lib/src/util.rs
  • crates/lib/src/util/tar_stream.rs

Comment on lines +346 to +352
match entry.kind {
    EntryKind::Directory => {
        std::fs::create_dir_all(&dst_path)?;
    }
    EntryKind::File => {
        util::fs::atomic_write_from_reader_sync(&dst_path, entry.body)
            .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Create parent directories before writing file entries.

The old unpack path created dst_path.parent() for every file before writing, but this callback now only creates directories when the tar contains explicit Directory entries. That makes unpacking order-sensitive: a valid tar with node/children before its parent dirs, or with directory records omitted, will now fail at atomic_write_from_reader_sync even though the pre-refactor code accepted it.

Suggested fix
                 EntryKind::File => {
+                    if let Some(parent) = dst_path.parent() {
+                        std::fs::create_dir_all(parent)?;
+                    }
                     util::fs::atomic_write_from_reader_sync(&dst_path, entry.body)
                         .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?;
                 }
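
As a standalone illustration of why creating the parent first makes unpacking order-insensitive, here is a minimal sketch using only the standard library. `write_file_entry` is a hypothetical helper, and its temp-file-plus-rename step is an assumption standing in for `atomic_write_from_reader_sync`, whose internals are not shown in this PR.

```rust
use std::fs;
use std::io::{self, Read, Write};
use std::path::Path;

/// Writes `body` to `dst`, creating any missing parent directories first.
/// Returns the number of bytes written.
fn write_file_entry<R: Read>(dst: &Path, mut body: R) -> io::Result<u64> {
    if let Some(parent) = dst.parent() {
        // A no-op when the directory already exists, so it does not matter
        // whether a directory entry came earlier, later, or never.
        fs::create_dir_all(parent)?;
    }
    // Assumed atomicity scheme: write to a sibling temp file, then rename.
    let tmp = dst.with_extension("tmp-partial");
    let written = {
        let mut file = fs::File::create(&tmp)?;
        let n = io::copy(&mut body, &mut file)?;
        file.flush()?;
        n
    };
    fs::rename(&tmp, dst)?;
    Ok(written)
}
```

Calling this for a file several directories deep succeeds even when none of those directories exist yet, which is the case the review comment says the refactor regressed.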

Suggested change
 match entry.kind {
     EntryKind::Directory => {
         std::fs::create_dir_all(&dst_path)?;
     }
     EntryKind::File => {
+        if let Some(parent) = dst_path.parent() {
+            std::fs::create_dir_all(parent)?;
+        }
         util::fs::atomic_write_from_reader_sync(&dst_path, entry.body)
             .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?;
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 346 - 352,
When handling EntryKind::File in the unpack callback, ensure parent directories
are created before calling util::fs::atomic_write_from_reader_sync so unpacking
is order-insensitive; check dst_path.parent() and create_dir_all on it
(propagate errors as MerkleDbError::FsTransport) prior to invoking
atomic_write_from_reader_sync on dst_path to match pre-refactor behavior and
avoid failures when directory entries are missing or later in the stream.

Comment on lines +122 to +123
let mut buf = Vec::with_capacity(entry.size as usize);
entry.body.read_to_end(&mut buf)?;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't preallocate from untrusted tar header sizes.

entry.size comes straight from archive metadata. Casting it to usize and feeding it to Vec::with_capacity makes a crafted tarball an easy OOM/panic vector before a single payload byte is validated. Either cap the allowed entry size first or read into a plain Vec::new() / checked try_reserve path.

Suggested change
-            let mut buf = Vec::with_capacity(entry.size as usize);
+            let mut buf = Vec::new();
             entry.body.read_to_end(&mut buf)?;
...
-            let mut buf = Vec::with_capacity(entry.size as usize);
+            let mut buf = Vec::new();
             entry.body.read_to_end(&mut buf)?;

Also applies to: 131-132
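
For the "cap first, then reserve" alternative mentioned above, a minimal sketch might look like this. `read_entry_capped` and its `max_size` parameter are hypothetical; the PR does not define a size limit, so the caller would have to pick one.

```rust
use std::io::{self, Read};

/// Reads an entry body whose header claims `declared_size` bytes, refusing
/// anything above `max_size` before allocating.
fn read_entry_capped<R: Read>(
    body: R,
    declared_size: u64,
    max_size: usize,
) -> io::Result<Vec<u8>> {
    if declared_size > max_size as u64 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("entry claims {declared_size} bytes, limit is {max_size}"),
        ));
    }
    let mut buf = Vec::new();
    // The cast is lossless because declared_size <= max_size, which is a usize.
    // try_reserve reports allocation failure as an error instead of aborting.
    buf.try_reserve(declared_size as usize)
        .map_err(|e| io::Error::new(io::ErrorKind::OutOfMemory, e.to_string()))?;
    // Never read more than the header declared, even if more bytes are available.
    body.take(declared_size).read_to_end(&mut buf)?;
    Ok(buf)
}
```

The simpler `Vec::new()` fix in the suggested change gives the same safety without a limit, at the cost of a few reallocations while `read_to_end` grows the buffer.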

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/core/db/merkle_node/lmdb/unpack.rs` around lines 122 - 123,
The code preallocates Vec::with_capacity(entry.size as usize) using untrusted
tar header sizes (entry.size), which can cause OOM/panic; change the allocation
to avoid trusting entry.size by either capping the allowed size first or using a
safe growth path: replace Vec::with_capacity(entry.size as usize) with
Vec::new() and then use try_reserve(checked_max) before reading, or validate/cap
entry.size against a configured MAX_ENTRY_SIZE and only then reserve; update
both occurrences that call Vec::with_capacity(entry.size as usize) and ensure
read_to_end(&mut buf)? still works with the validated/capped size.

Comment on lines +143 to +150
if path
    .components()
    .any(|c| matches!(c, std::path::Component::ParentDir))
{
    return Err(TarStreamError::PathTraversal {
        path: path.display().to_string(),
    }
    .into());

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== stream_unpack call sites ==\n'
rg -n -C4 '\bstream_unpack\s*\(' crates/lib/src/core/db/merkle_node

printf '\n== file backend path materialization ==\n'
rg -n -C4 'join\(|push\(|create_dir|File::create|write_all' crates/lib/src/core/db/merkle_node/file_backend.rs

Repository: Oxen-AI/Oxen

Length of output: 88


🏁 Script executed:

# Find the tar_stream.rs file
find . -name "tar_stream.rs" -type f

# Find all stream_unpack call sites across the repo
rg -n "stream_unpack" --type rust

# Find extraction/file backend code
find . -name "*backend*" -type f | head -20

Repository: Oxen-AI/Oxen

Length of output: 2702


🏁 Script executed:

# Read the tar_stream.rs validation code
sed -n '140,160p' crates/lib/src/util/tar_stream.rs

# Read the file_backend.rs around the stream_unpack call and path handling
sed -n '310,360p' crates/lib/src/core/db/merkle_node/file_backend.rs

Repository: Oxen-AI/Oxen

Length of output: 3303


🏁 Script executed:

# Check existing tests in tar_stream.rs for path validation
sed -n '270,420p' crates/lib/src/util/tar_stream.rs

Repository: Oxen-AI/Oxen

Length of output: 5805


🏁 Script executed:

# Check if there are more tests after line 420
sed -n '420,550p' crates/lib/src/util/tar_stream.rs | head -100

Repository: Oxen-AI/Oxen

Length of output: 4049


🏁 Script executed:

# Check Rust's Path::join behavior with absolute paths by looking at usage/documentation
# Also verify if there are any other path validation mechanisms we might have missed
sed -n '100,145p' crates/lib/src/util/tar_stream.rs

Repository: Oxen-AI/Oxen

Length of output: 2017


🏁 Script executed:

# Let's verify Rust's Path::join behavior by checking if there are any other defenses
# Also verify the exact behavior with a simple test
cat > /tmp/test_path_join.rs << 'EOF'
use std::path::Path;

fn main() {
    // Test absolute path joining
    let base = Path::new("/oxen/hidden");
    let absolute_entry = Path::new("/etc/passwd");
    let result = base.join(absolute_entry);
    println!("Base: {:?}", base);
    println!("Entry (absolute): {:?}", absolute_entry);
    println!("Result: {:?}", result);
    println!("Does result start with base? {}", result.starts_with(base));
}
EOF
rustc /tmp/test_path_join.rs -o /tmp/test_path_join 2>/dev/null && /tmp/test_path_join

Repository: Oxen-AI/Oxen

Length of output: 40


🏁 Script executed:

# Let's verify Rust's Path::join behavior by checking Rust documentation patterns
# and confirm there are no other path safety checks we missed
rg -n "is_absolute|starts_with|canonicalize" crates/lib/src/core/db/merkle_node/file_backend.rs

# Also check if there's any sanitization in util::fs
find . -name "*.rs" -path "*/util/fs*" -type f

Repository: Oxen-AI/Oxen

Length of output: 162


🏁 Script executed:

# Check util/fs.rs for any path sanitization that might be relevant
cat crates/lib/src/util/fs.rs | head -200

Repository: Oxen-AI/Oxen

Length of output: 6337


🏁 Script executed:

# Check the atomic_write_from_reader_sync function to see if it validates paths
rg -n "atomic_write_from_reader_sync" crates/lib/src/util/fs.rs -A 20 | head -40

Repository: Oxen-AI/Oxen

Length of output: 1848


Reject absolute and drive-qualified tar paths at validation boundary.

Line 143 only blocks .., so absolute paths (e.g., /etc/passwd) and Windows drive-qualified paths (e.g., C:\foo) still reach the callback. When file_backend.rs calls st.oxen_hidden.join(entry.path), Rust's Path::join() with an absolute right operand replaces the base path entirely, allowing files to be extracted outside the intended destination root.

The proposed fix correctly adds checks for RootDir and Prefix components to prevent this escape. Existing tests only cover .. traversal; absolute paths are not validated.
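
The escape and one way to close it can be sketched with the standard library alone. This uses an allow-list (only `Normal` and `CurDir` components pass) rather than the deny-list proposed above; both reject the same inputs discussed here. `is_safe_relative` and `resolve_under` are hypothetical names, not functions from the PR. Note that `Component::Prefix` is a tuple variant, so any pattern naming it needs `Component::Prefix(_)`.

```rust
use std::path::{Component, Path, PathBuf};

/// True only when every component is a plain name or `.`.
/// This excludes `..` (ParentDir), a leading `/` (RootDir), and Windows
/// prefixes such as `C:` (Prefix(_)).
fn is_safe_relative(path: &Path) -> bool {
    path.components()
        .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}

/// Joins `entry` under `root`, or returns None when the entry could escape it.
fn resolve_under(root: &Path, entry: &Path) -> Option<PathBuf> {
    if !is_safe_relative(entry) {
        return None;
    }
    let joined = root.join(entry);
    // Lexical defense in depth; this does not resolve symlinks.
    joined.starts_with(root).then_some(joined)
}
```

On Unix, `Path::new("/oxen/hidden").join("/etc/passwd")` evaluates to `/etc/passwd`, which is why the unchecked join is dangerous, while `resolve_under` returns `None` for that entry and for `a/../../b`.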

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/lib/src/util/tar_stream.rs` around lines 143 - 150, The current check
only rejects ParentDir components; extend the validation in the tar path-check
block to also detect and reject std::path::Component::RootDir and
std::path::Component::Prefix on the incoming path variable so absolute ("/...")
and Windows drive-qualified ("C:\...") paths return
TarStreamError::PathTraversal (use the existing TarStreamError::PathTraversal {
path: path.display().to_string() }.into()). Update the same validation loop that
calls path.components().any(...) to include matches!(c, Component::ParentDir |
Component::RootDir | Component::Prefix(_)) so an absolute entry path cannot
replace the destination root in Path::join(); add tests for absolute and
drive-qualified entries to cover these cases.
