ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files#2601
ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files#2601QianyongY wants to merge 6 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Extends the ORC Java merge tool to optionally split merged output into multiple ORC part files under an output directory when --maxSize is provided, while keeping the existing single-output behavior by default.
Changes:
- Add
--maxSizeoption tomergeto batch inputs (by on-disk size) into multiple output part files. - Add a new unit test covering multi-part merge behavior.
- Update CLI help text and documentation to describe the new multi-output mode.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
site/_docs/java-tools.md |
Updates merge tool docs to describe single vs multi-output mode and examples. |
java/tools/src/java/org/apache/orc/tools/MergeFiles.java |
Implements --maxSize parsing and multi-part merge batching logic. |
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java |
Adds a unit test to validate multi-part output behavior. |
java/tools/src/java/org/apache/orc/tools/Driver.java |
Updates driver help text for the merge subcommand description. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thank you for making a PR, @QianyongY . |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…guard to merge tool
|
Thanks @cxzl25 and @dongjoon-hyun . Latest push:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…e unmerged file handling
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| System.setOut(origOut); | ||
| } | ||
| String output = myOut.toString(StandardCharsets.UTF_8); | ||
| System.out.println(output); |
There was a problem hiding this comment.
These tests unconditionally print the captured tool output to stdout, which can make CI logs noisy. Consider removing this println (or only printing on assertion failure) since failures can include the captured output in the assertion message.
| System.out.println(output); |
| System.setOut(origOut); | ||
| } | ||
| String output = myOut.toString(StandardCharsets.UTF_8); | ||
| System.out.println(output); |
There was a problem hiding this comment.
These tests unconditionally print the captured tool output to stdout, which can make test runs noisy. Consider removing this println (or only printing on failure) and rely on assertion messages to surface the output when needed.
| System.out.println(output); |
| System.out.println(output); | ||
|
|
||
| assertTrue(output.contains("Leaves: 1"), | ||
| "Expected exactly 1 leaf (hidden dir ignored), got:\n" + output); | ||
| assertFalse(fs.exists(new Path(outputRoot, "_temporary")), | ||
| "Hidden input directory should not be mirrored to output"); | ||
|
|
||
| Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0)); | ||
| assertTrue(fs.exists(part0)); | ||
| try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) { | ||
| assertEquals(30, reader.getNumberOfRows(), | ||
| "Hidden _hidden.orc rows must not be merged in"); |
There was a problem hiding this comment.
These tests unconditionally print the captured tool output to stdout, which can add noise to CI logs. Consider removing this println (or printing only when a test fails) and include output in assertion failure messages instead.
| System.out.println(output); | |
| assertTrue(output.contains("Leaves: 1"), | |
| "Expected exactly 1 leaf (hidden dir ignored), got:\n" + output); | |
| assertFalse(fs.exists(new Path(outputRoot, "_temporary")), | |
| "Hidden input directory should not be mirrored to output"); | |
| Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0)); | |
| assertTrue(fs.exists(part0)); | |
| try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) { | |
| assertEquals(30, reader.getNumberOfRows(), | |
| "Hidden _hidden.orc rows must not be merged in"); | |
| assertTrue(output.contains("Leaves: 1"), | |
| "Expected exactly 1 leaf (hidden dir ignored), got:\n" + output); | |
| assertFalse(fs.exists(new Path(outputRoot, "_temporary")), | |
| "Hidden input directory should not be mirrored to output.\nTool output:\n" + output); | |
| Path part0 = new Path(outputRoot, "h=01/" + String.format(MergeFiles.PART_FILE_FORMAT, 0)); | |
| assertTrue(fs.exists(part0), | |
| "Expected merged output file to exist: " + part0 + "\nTool output:\n" + output); | |
| try (Reader reader = OrcFile.createReader(part0, OrcFile.readerOptions(conf))) { | |
| assertEquals(30, reader.getNumberOfRows(), | |
| "Hidden _hidden.orc rows must not be merged in.\nTool output:\n" + output); |
| if (maxSizeBytes > 0) { | ||
| mergeIntoMultipleFiles(conf, writerOptions, inputStatuses, inputFiles, | ||
| new Path(outputFilename), maxSizeBytes, overwrite); | ||
| } else { | ||
| mergeIntoSingleFile(writerOptions, inputFiles, new Path(outputFilename), outputFilename); |
There was a problem hiding this comment.
With --maxSize, inputs are enumerated before the output directory is validated/cleaned (prepareOutputDir happens later). If the output directory is inside an input root and --overwrite is used, you can end up deleting files that were already included in the input list (including stale prior outputs), causing missing inputs or incorrect merges. Consider validating/cleaning the output directory before scanning inputs and/or explicitly excluding the output directory subtree from enumeration.
| private static void mergeIntoMultipleFiles(Configuration conf, | ||
| OrcFile.WriterOptions writerOptions, | ||
| List<LocatedFileStatus> inputStatuses, | ||
| List<Path> inputFiles, | ||
| Path outputDir, | ||
| long maxSizeBytes, |
There was a problem hiding this comment.
mergeIntoMultipleFiles takes both inputStatuses and inputFiles, but inputFiles is just a projection of inputStatuses and is only used for inputFiles.size() in the summary. This duplication makes the API easier to misuse (mismatched lists). Consider dropping the inputFiles parameter and using inputStatuses.size() (or deriving paths locally) to keep the method contract consistent.
| String outputFilename = cli.getOptionValue("output"); | ||
| if (outputFilename == null || outputFilename.isEmpty()) { | ||
| System.err.println("output filename is null"); | ||
| formatter.printHelp("merge", opts); | ||
| return; |
There was a problem hiding this comment.
The message "output filename is null" is misleading here because the condition also covers the empty-string case (and more broadly this is an output path that can be a directory in multi-file modes). Consider changing it to something like "--output is required" / "output path is missing" to better guide CLI users.
| System.setOut(origOut); | ||
| } | ||
| String output = myOut.toString(StandardCharsets.UTF_8); | ||
| System.out.println(output); |
There was a problem hiding this comment.
These tests unconditionally print the captured tool output to stdout, which can make CI logs noisy and harder to read. Consider removing this println (or only printing on assertion failure) since the assertions already include helpful context messages.
| System.out.println(output); |
What changes were proposed in this pull request?
Extends the Java merge tool so that, for inputs sharing the same schema, you can still merge to one ORC file by default, or use -m / --maxSize to write multiple ORC files under an output directory as part-xxxxx.orc, batching by input file size.
Why are the changes needed?
Users often need to merge many compatible ORC files without producing a single huge output file. This adds an optional mode that caps output size at whole-file boundaries while keeping the existing single-file behavior when --maxSize is not set.
How was this patch tested?
Add UT