Streaming tar packing & unpacking for Merkle tree operations#562
Streaming tar packing & unpacking for Merkle tree operations#562malcolmgreaves wants to merge 1 commit into
Conversation
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughThis PR introduces shared streaming tar/gzip utilities ( ChangesMerkle node streaming tar/gzip transport
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Adds new reusable streaming tar unpacking and packing implementations. Ensures that the `FileBackend` and `LmdbBackend` Merkle tree store use these in their `MerklePacker` and `MerkleUnpacker` implementations. Adds additional tests around streaming implementations. Additional manual testing ensures that all client <> server operations remain unaffected. Only change is an `O(max tar entry)` vs. `O(entire archive)` change for peak memory use.
8a5fc96 to
224c15b
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/lib/src/core/db/merkle_node/file_backend.rs`:
- Around line 346-352: When handling EntryKind::File in the unpack callback,
ensure parent directories are created before calling
util::fs::atomic_write_from_reader_sync so unpacking is order-insensitive; check
dst_path.parent() and create_dir_all on it (propagate errors as
MerkleDbError::FsTransport) prior to invoking atomic_write_from_reader_sync on
dst_path to match pre-refactor behavior and avoid failures when directory
entries are missing or later in the stream.
In `@crates/lib/src/core/db/merkle_node/lmdb/unpack.rs`:
- Around line 122-123: The code preallocates Vec::with_capacity(entry.size as
usize) using untrusted tar header sizes (entry.size), which can cause OOM/panic;
change the allocation to avoid trusting entry.size by either capping the allowed
size first or using a safe growth path: replace Vec::with_capacity(entry.size as
usize) with Vec::new() and then use try_reserve(checked_max) before reading, or
validate/cap entry.size against a configured MAX_ENTRY_SIZE and only then
reserve; update both occurrences that call Vec::with_capacity(entry.size as
usize) and ensure read_to_end(&mut buf)? still works with the validated/capped
size.
In `@crates/lib/src/util/tar_stream.rs`:
- Around line 143-150: The current check only rejects ParentDir components;
extend the validation in the tar path-check block to also detect and reject
std::path::Component::RootDir and std::path::Component::Prefix on the incoming
path variable so absolute ("/...") and Windows drive-qualified ("C:\...") paths
return TarStreamError::PathTraversal (use the existing
TarStreamError::PathTraversal { path: path.display().to_string() }.into()).
Update the same validation loop that calls path.components().any(...) to include
matches!(c, Component::ParentDir | Component::RootDir | Component::Prefix) to
prevent Path::join() from being bypassed; add tests for absolute and
drive-qualified entries to cover these cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b022ccf0-0936-49a6-a448-aba2914e5089
📒 Files selected for processing (6)
crates/lib/src/core/db/merkle_node/file_backend.rscrates/lib/src/core/db/merkle_node/lmdb/pack.rscrates/lib/src/core/db/merkle_node/lmdb/unpack.rscrates/lib/src/core/db/merkle_node/merkle_node_db.rscrates/lib/src/util.rscrates/lib/src/util/tar_stream.rs
| match entry.kind { | ||
| EntryKind::Directory => { | ||
| std::fs::create_dir_all(&dst_path)?; | ||
| } | ||
| EntryKind::File => { | ||
| util::fs::atomic_write_from_reader_sync(&dst_path, entry.body) | ||
| .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?; |
There was a problem hiding this comment.
Create parent directories before writing file entries.
The old unpack path created dst_path.parent() for every file before writing, but this callback now only creates directories when the tar contains explicit Directory entries. That makes unpacking order-sensitive: a valid tar with node/children before its parent dirs, or with directory records omitted, will now fail at atomic_write_from_reader_sync even though the pre-refactor code accepted it.
Suggested fix
EntryKind::File => {
+ if let Some(parent) = dst_path.parent() {
+ std::fs::create_dir_all(parent)?;
+ }
util::fs::atomic_write_from_reader_sync(&dst_path, entry.body)
.map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| match entry.kind { | |
| EntryKind::Directory => { | |
| std::fs::create_dir_all(&dst_path)?; | |
| } | |
| EntryKind::File => { | |
| util::fs::atomic_write_from_reader_sync(&dst_path, entry.body) | |
| .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?; | |
| match entry.kind { | |
| EntryKind::Directory => { | |
| std::fs::create_dir_all(&dst_path)?; | |
| } | |
| EntryKind::File => { | |
| if let Some(parent) = dst_path.parent() { | |
| std::fs::create_dir_all(parent)?; | |
| } | |
| util::fs::atomic_write_from_reader_sync(&dst_path, entry.body) | |
| .map_err(|err| MerkleDbError::FsTransport(Box::new(err)))?; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/lib/src/core/db/merkle_node/file_backend.rs` around lines 346 - 352,
When handling EntryKind::File in the unpack callback, ensure parent directories
are created before calling util::fs::atomic_write_from_reader_sync so unpacking
is order-insensitive; check dst_path.parent() and create_dir_all on it
(propagate errors as MerkleDbError::FsTransport) prior to invoking
atomic_write_from_reader_sync on dst_path to match pre-refactor behavior and
avoid failures when directory entries are missing or later in the stream.
| let mut buf = Vec::with_capacity(entry.size as usize); | ||
| entry.body.read_to_end(&mut buf)?; |
There was a problem hiding this comment.
Don't preallocate from untrusted tar header sizes.
entry.size comes straight from archive metadata. Casting it to usize and feeding it to Vec::with_capacity makes a crafted tarball an easy OOM/panic vector before a single payload byte is validated. Either cap the allowed entry size first or read into a plain Vec::new() / checked try_reserve path.
Suggested change
- let mut buf = Vec::with_capacity(entry.size as usize);
+ let mut buf = Vec::new();
entry.body.read_to_end(&mut buf)?;
...
- let mut buf = Vec::with_capacity(entry.size as usize);
+ let mut buf = Vec::new();
entry.body.read_to_end(&mut buf)?;Also applies to: 131-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/lib/src/core/db/merkle_node/lmdb/unpack.rs` around lines 122 - 123,
The code preallocates Vec::with_capacity(entry.size as usize) using untrusted
tar header sizes (entry.size), which can cause OOM/panic; change the allocation
to avoid trusting entry.size by either capping the allowed size first or using a
safe growth path: replace Vec::with_capacity(entry.size as usize) with
Vec::new() and then use try_reserve(checked_max) before reading, or validate/cap
entry.size against a configured MAX_ENTRY_SIZE and only then reserve; update
both occurrences that call Vec::with_capacity(entry.size as usize) and ensure
read_to_end(&mut buf)? still works with the validated/capped size.
| if path | ||
| .components() | ||
| .any(|c| matches!(c, std::path::Component::ParentDir)) | ||
| { | ||
| return Err(TarStreamError::PathTraversal { | ||
| path: path.display().to_string(), | ||
| } | ||
| .into()); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== stream_unpack call sites ==\n'
rg -n -C4 '\bstream_unpack\s*\(' crates/lib/src/core/db/merkle_node
printf '\n== file backend path materialization ==\n'
rg -n -C4 'join\(|push\(|create_dir|File::create|write_all' crates/lib/src/core/db/merkle_node/file_backend.rsRepository: Oxen-AI/Oxen
Length of output: 88
🏁 Script executed:
# Find the tar_stream.rs file
find . -name "tar_stream.rs" -type f
# Find all stream_unpack call sites across the repo
rg -n "stream_unpack" --type rust
# Find extraction/file backend code
find . -name "*backend*" -type f | head -20Repository: Oxen-AI/Oxen
Length of output: 2702
🏁 Script executed:
# Read the tar_stream.rs validation code
sed -n '140,160p' crates/lib/src/util/tar_stream.rs
# Read the file_backend.rs around the stream_unpack call and path handling
sed -n '310,360p' crates/lib/src/core/db/merkle_node/file_backend.rsRepository: Oxen-AI/Oxen
Length of output: 3303
🏁 Script executed:
# Check existing tests in tar_stream.rs for path validation
sed -n '270,420p' crates/lib/src/util/tar_stream.rsRepository: Oxen-AI/Oxen
Length of output: 5805
🏁 Script executed:
# Check if there are more tests after line 420
sed -n '420,550p' crates/lib/src/util/tar_stream.rs | head -100Repository: Oxen-AI/Oxen
Length of output: 4049
🏁 Script executed:
# Check Rust's Path::join behavior with absolute paths by looking at usage/documentation
# Also verify if there are any other path validation mechanisms we might have missed
sed -n '100,145p' crates/lib/src/util/tar_stream.rsRepository: Oxen-AI/Oxen
Length of output: 2017
🏁 Script executed:
# Let's verify Rust's Path::join behavior by checking if there are any other defenses
# Also verify the exact behavior with a simple test
cat > /tmp/test_path_join.rs << 'EOF'
use std::path::Path;
fn main() {
// Test absolute path joining
let base = Path::new("/oxen/hidden");
let absolute_entry = Path::new("/etc/passwd");
let result = base.join(absolute_entry);
println!("Base: {:?}", base);
println!("Entry (absolute): {:?}", absolute_entry);
println!("Result: {:?}", result);
println!("Does result start with base? {}", result.starts_with(base));
}
EOF
rustc /tmp/test_path_join.rs -o /tmp/test_path_join 2>/dev/null && /tmp/test_path_joinRepository: Oxen-AI/Oxen
Length of output: 40
🏁 Script executed:
# Let's verify Rust's Path::join behavior by checking Rust documentation patterns
# and confirm there are no other path safety checks we missed
rg -n "is_absolute|starts_with|canonicalize" crates/lib/src/core/db/merkle_node/file_backend.rs
# Also check if there's any sanitization in util::fs
find . -name "*.rs" -path "*/util/fs*" -type fRepository: Oxen-AI/Oxen
Length of output: 162
🏁 Script executed:
# Check util/fs.rs for any path sanitization that might be relevant
cat crates/lib/src/util/fs.rs | head -200Repository: Oxen-AI/Oxen
Length of output: 6337
🏁 Script executed:
# Check the atomic_write_from_reader_sync function to see if it validates paths
rg -n "atomic_write_from_reader_sync" crates/lib/src/util/fs.rs -A 20 | head -40Repository: Oxen-AI/Oxen
Length of output: 1848
Reject absolute and drive-qualified tar paths at validation boundary.
Line 143 only blocks .., so absolute paths (e.g., /etc/passwd) and Windows drive-qualified paths (e.g., C:\foo) still reach the callback. When file_backend.rs calls st.oxen_hidden.join(entry.path), Rust's Path::join() with an absolute right operand replaces the base path entirely, allowing files to be extracted outside the intended destination root.
The proposed fix correctly adds checks for RootDir and Prefix components to prevent this escape. Existing tests only cover .. traversal; absolute paths are not validated.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/lib/src/util/tar_stream.rs` around lines 143 - 150, The current check
only rejects ParentDir components; extend the validation in the tar path-check
block to also detect and reject std::path::Component::RootDir and
std::path::Component::Prefix on the incoming path variable so absolute ("/...")
and Windows drive-qualified ("C:\...") paths return
TarStreamError::PathTraversal (use the existing TarStreamError::PathTraversal {
path: path.display().to_string() }.into()). Update the same validation loop that
calls path.components().any(...) to include matches!(c, Component::ParentDir |
Component::RootDir | Component::Prefix) to prevent Path::join() from being
bypassed; add tests for absolute and drive-qualified entries to cover these
cases.
Adds new reusable streaming tar unpacking and packing implementations.
Ensures that the
FileBackendandLmdbBackendMerkle tree store usethese in their
MerklePackerandMerkleUnpackerimplementations.Adds additional tests around streaming implementations. Additional
manual testing ensures that all client <> server operations remain
unaffected. Only change is an
O(max tar entry)vs.O(entire archive)change for peak memory use.