HDDS-14942. Implement manifest selection logic for rewrite based on snapshot delta#10145
sreejasahithi wants to merge 2 commits into apache:master
@sreejasahithi Thanks for the PR. Please find my comments inline. Also, can you add tests for this change?
    for (ManifestFile manifest : manifests) {
      if (deltaSnapshotIds == null) {
        manifestPaths.add(manifest.path());
      } else if (manifest.snapshotId() != null
Can legacy or old Iceberg manifests have snapshotId() == null?
For any manifest list file that was correctly written, snapshotId() will never be null.
The check added here is purely defensive.
    Table endStaticTable = RewriteTablePathOzoneUtils.newStaticTable(endVersionName, table.io());

    final Set<Long> deltaSnapshotIds;
    if (startMetadata != null) {
The entire startMetadata object is passed, but it is only null-checked (startMetadata != null). The actual snapshot data is already captured in deltaSnapshots.
When startMetadata is not provided, deltaSnapshots equals the full set of snapshots collected across all version files during the version-file rewrite phase. When startMetadata is provided, deltaSnapshots contains only the snapshots that are not tracked by the start version, i.e. the snapshots that appeared in intermediate version files between start and end, minus those already present in the start version's metadata.
Because deltaSnapshots is built by reading each intermediate version file's JSON as it was written at that point in time, it can include snapshots that have since been expired.
For that reason we do not iterate over deltaSnapshots: we would not be able to read the manifest lists associated with the expired snapshots. Instead, we iterate over the snapshots collected from the endVersion metadata file.
In manifestsToRewrite, deltaSnapshots is used only to avoid including manifest files that were already rewritten in a previous incremental run. The snapshot_id field on each manifest entry identifies the snapshot that originally created it. By keeping only the entries whose snapshot_id is in deltaSnapshotIds, we select the manifests that are new since the start version and exclude those inherited from before it.
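The selection described in this reply can be sketched roughly as follows. This is an illustration only, not the code from the PR: `ManifestSelectionSketch`, `ManifestInfo`, and the `deltaSnapshotIds` helper are hypothetical names, and `ManifestInfo` stands in for Iceberg's `ManifestFile` so the example runs without the Iceberg dependency. It assumes a null delta set means "no start version, select everything", as in the diff hunk above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the implementation in this PR. */
final class ManifestSelectionSketch {

  /** Hypothetical stand-in for Iceberg's ManifestFile: a path plus the creating snapshot id. */
  static final class ManifestInfo {
    final String path;
    final Long snapshotId; // boxed, so it may be null

    ManifestInfo(String path, Long snapshotId) {
      this.path = path;
      this.snapshotId = snapshotId;
    }
  }

  /**
   * Snapshot ids seen in the version files, minus those already tracked by the
   * start version. Returns null when no start version is given (assumption:
   * null means "no filtering").
   */
  static Set<Long> deltaSnapshotIds(Set<Long> seenIds, Set<Long> startIds) {
    if (startIds == null) {
      return null;
    }
    Set<Long> delta = new HashSet<>(seenIds);
    delta.removeAll(startIds);
    return delta;
  }

  /**
   * Selects the paths of manifests whose creating snapshot is in the delta.
   * The manifests are expected to come from the manifest lists of the
   * end-version snapshots, so the same manifest may appear more than once;
   * the set removes those duplicates.
   */
  static Set<String> manifestsToRewrite(List<ManifestInfo> manifests, Set<Long> deltaIds) {
    Set<String> paths = new LinkedHashSet<>();
    for (ManifestInfo m : manifests) {
      if (deltaIds == null) {
        paths.add(m.path);
      } else if (m.snapshotId != null && deltaIds.contains(m.snapshotId)) {
        // Defensive null check: a correctly written manifest list always has an id.
        paths.add(m.path);
      }
    }
    return paths;
  }

  public static void main(String[] args) {
    Set<Long> seen = new HashSet<>(Arrays.asList(1L, 2L, 3L));
    Set<Long> start = new HashSet<>(Arrays.asList(1L));
    List<ManifestInfo> manifests = Arrays.asList(
        new ManifestInfo("m1.avro", 1L),  // inherited from the start version
        new ManifestInfo("m2.avro", 2L),
        new ManifestInfo("m1.avro", 1L),  // same manifest seen via another snapshot
        new ManifestInfo("m3.avro", 3L));
    System.out.println(manifestsToRewrite(manifests, deltaSnapshotIds(seen, start)));
  }
}
```

A snapshot id that is in the delta but has since been expired is harmless in this sketch: no manifest reachable from the end version is attributed to it unless the manifest itself survived, which is why the delta set is used as a filter rather than as the thing to iterate over.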
      public RewriteTablePathOzoneAction(Table table) {
        this.table = table;
    -   this.parallelism = Runtime.getRuntime().availableProcessors();
    +   this.parallelism = DEFAULT_THREAD_COUNT;
The thread count could be passed via the command. This can be done in a subsequent PR.
Yes, we can pass the thread count via the command, in which case public RewriteTablePathOzoneAction(Table table, int parallelism) will be used. We can add this when the CLI command for the rewrite is added.
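A minimal sketch of the two-constructor arrangement discussed here, under stated assumptions: `RewriteActionSketch` is a hypothetical class, `Object` stands in for `org.apache.iceberg.Table`, the value of `DEFAULT_THREAD_COUNT` is invented because the diff does not show it, and the argument validation is an addition for illustration rather than something the PR is known to do.

```java
/** Illustrative sketch only; names other than DEFAULT_THREAD_COUNT and parallelism are assumed. */
final class RewriteActionSketch {

  // Assumed value: the diff does not show what DEFAULT_THREAD_COUNT is set to.
  static final int DEFAULT_THREAD_COUNT = 10;

  private final Object table; // stand-in for org.apache.iceberg.Table
  private final int parallelism;

  /** Used when no thread count is supplied: falls back to the fixed default. */
  RewriteActionSketch(Object table) {
    this(table, DEFAULT_THREAD_COUNT);
  }

  /** Used when a caller, such as a future CLI command, supplies the thread count. */
  RewriteActionSketch(Object table, int parallelism) {
    if (parallelism <= 0) {
      // Validation added for the sketch; not taken from the PR.
      throw new IllegalArgumentException("parallelism must be positive: " + parallelism);
    }
    this.table = table;
    this.parallelism = parallelism;
  }

  int parallelism() {
    return parallelism;
  }
}
```

Delegating from the one-argument constructor keeps the default in a single place, so a later CLI option only needs to call the two-argument form.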
We can add the tests once we implement the manifest-list rewrite. That step will use the manifestsToRewrite result to update the copy plan, so this logic will be easier to test at that point.
What changes were proposed in this pull request?
This PR provides logic to determine the specific subset of Iceberg manifest files that require path rewriting, avoiding redundant processing of manifests.
What is the link to the Apache JIRA
HDDS-14942
How was this patch tested?
https://github.com/sreejasahithi/ozone/actions/runs/24986534925