Skip to content

ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files#2601

Open
QianyongY wants to merge 6 commits intoapache:mainfrom
QianyongY:features/ORC-2149
Open

ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files#2601
QianyongY wants to merge 6 commits intoapache:mainfrom
QianyongY:features/ORC-2149

Conversation

@QianyongY
Copy link
Copy Markdown
Contributor

@QianyongY QianyongY commented Apr 15, 2026

What changes were proposed in this pull request?

Extends the Java merge tool so that, for inputs sharing the same schema, you can still merge to one ORC file by default, or use -m / --maxSize to write multiple ORC files under an output directory as part-xxxxx.orc, batching by input file size.

Why are the changes needed?

Users often need to merge many compatible ORC files without producing a single huge output file. This adds an optional mode that caps output size at whole-file boundaries while keeping the existing single-file behavior when --maxSize is not set.

How was this patch tested?

Add UT

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends the ORC Java merge tool to optionally split merged output into multiple ORC part files under an output directory when --maxSize is provided, while keeping the existing single-output behavior by default.

Changes:

  • Add --maxSize option to merge to batch inputs (by on-disk size) into multiple output part files.
  • Add a new unit test covering multi-part merge behavior.
  • Update CLI help text and documentation to describe the new multi-output mode.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
site/_docs/java-tools.md Updates merge tool docs to describe single vs multi-output mode and examples.
java/tools/src/java/org/apache/orc/tools/MergeFiles.java Implements --maxSize parsing and multi-part merge batching logic.
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java Adds a unit test to validate multi-part output behavior.
java/tools/src/java/org/apache/orc/tools/Driver.java Updates driver help text for the merge subcommand description.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java Outdated
Comment thread java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java
@dongjoon-hyun
Copy link
Copy Markdown
Member

Thank you for making a PR, @QianyongY .

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread site/_docs/java-tools.md
Comment thread java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
@dongjoon-hyun dongjoon-hyun changed the title ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files Apr 21, 2026
@QianyongY
Copy link
Copy Markdown
Contributor Author

QianyongY commented Apr 22, 2026

Thanks @cxzl25 and @dongjoon-hyun . Latest push:

  • Deterministic ordering, overflow-safe batching, --maxSize docs clarified.
  • Non-empty output dirs now abort by default; opt in with --overwrite to
    delete existing contents. Shared between --maxSize and --preserveStructure.
  • New --preserveStructure mode (partition-tree-aware merge, skips hidden entries).
  • Tests added for all of the above. PTAL.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Comment thread java/tools/src/java/org/apache/orc/tools/MergeFiles.java Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

System.setOut(origOut);
}
String output = myOut.toString(StandardCharsets.UTF_8);
System.out.println(output);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests unconditionally print the captured tool output to stdout, which can make CI logs noisy. Consider removing this println (or only printing on assertion failure) since failures can include the captured output in the assertion message.

Suggested change
System.out.println(output);

Copilot uses AI. Check for mistakes.
System.setOut(origOut);
}
String output = myOut.toString(StandardCharsets.UTF_8);
System.out.println(output);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests unconditionally print the captured tool output to stdout, which can make test runs noisy. Consider removing this println (or only printing on failure) and rely on assertion messages to surface the output when needed.

Suggested change
System.out.println(output);

Copilot uses AI. Check for mistakes.
Comment on lines +443 to +454
System.out.println(output);

assertTrue(output.contains("Leaves: 1"),
"Expected exactly 1 leaf (hidden dir ignored), got:\n" + output);
assertFalse(fs.exists(new Path(outputRoot, "_temporary")),
"Hidden input directory should not be mirrored to output");

Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0));
assertTrue(fs.exists(part0));
try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) {
assertEquals(30, reader.getNumberOfRows(),
"Hidden _hidden.orc rows must not be merged in");
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests unconditionally print the captured tool output to stdout, which can add noise to CI logs. Consider removing this println (or printing only when a test fails) and include output in assertion failure messages instead.

Suggested change
System.out.println(output);
assertTrue(output.contains("Leaves: 1"),
"Expected exactly 1 leaf (hidden dir ignored), got:\n" + output);
assertFalse(fs.exists(new Path(outputRoot, "_temporary")),
"Hidden input directory should not be mirrored to output");
Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0));
assertTrue(fs.exists(part0));
try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) {
assertEquals(30, reader.getNumberOfRows(),
"Hidden _hidden.orc rows must not be merged in");
assertTrue(output.contains("Leaves: 1"),
"Expected exactly 1 leaf (hidden dir ignored), got:\n" + output);
assertFalse(fs.exists(new Path(outputRoot, "_temporary")),
"Hidden input directory should not be mirrored to output.\nTool output:\n" + output);
Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0));
assertTrue(fs.exists(part0),
"Expected merged output file to exist: " + part0 + "\nTool output:\n" + output);
try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) {
assertEquals(30, reader.getNumberOfRows(),
"Hidden _hidden.orc rows must not be merged in.\nTool output:\n" + output);

Copilot uses AI. Check for mistakes.
Comment on lines +119 to +123
if (maxSizeBytes > 0) {
mergeIntoMultipleFiles(conf, writerOptions, inputStatuses, inputFiles,
new Path(outputFilename), maxSizeBytes, overwrite);
} else {
mergeIntoSingleFile(writerOptions, inputFiles, new Path(outputFilename), outputFilename);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With --maxSize, inputs are enumerated before the output directory is validated/cleaned (prepareOutputDir happens later). If the output directory is inside an input root and --overwrite is used, you can end up deleting files that were already included in the input list (including stale prior outputs), causing missing inputs or incorrect merges. Consider validating/cleaning the output directory before scanning inputs and/or explicitly excluding the output directory subtree from enumeration.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +169
private static void mergeIntoMultipleFiles(Configuration conf,
OrcFile.WriterOptions writerOptions,
List<LocatedFileStatus> inputStatuses,
List<Path> inputFiles,
Path outputDir,
long maxSizeBytes,
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mergeIntoMultipleFiles takes both inputStatuses and inputFiles, but inputFiles is just a projection of inputStatuses and is only used for inputFiles.size() in the summary. This duplication makes the API easier to misuse (mismatched lists). Consider dropping the inputFiles parameter and using inputStatuses.size() (or deriving paths locally) to keep the method contract consistent.

Copilot uses AI. Check for mistakes.
Comment on lines 58 to 62
String outputFilename = cli.getOptionValue("output");
if (outputFilename == null || outputFilename.isEmpty()) {
System.err.println("output filename is null");
formatter.printHelp("merge", opts);
return;
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message "output filename is null" is misleading here because the condition also covers the empty-string case (and more broadly this is an output path that can be a directory in multi-file modes). Consider changing it to something like "--output is required" / "output path is missing" to better guide CLI users.

Copilot uses AI. Check for mistakes.
System.setOut(origOut);
}
String output = myOut.toString(StandardCharsets.UTF_8);
System.out.println(output);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests unconditionally print the captured tool output to stdout, which can make CI logs noisy and harder to read. Consider removing this println (or only printing on assertion failure) since the assertions already include helpful context messages.

Suggested change
System.out.println(output);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants