Skip to content

HDDS-14942. Implement manifest selection logic for rewrite based on snapshot delta#10145

Open
sreejasahithi wants to merge 2 commits intoapache:masterfrom
sreejasahithi:HDDS-14942
Open

HDDS-14942. Implement manifest selection logic for rewrite based on snapshot delta#10145
sreejasahithi wants to merge 2 commits intoapache:masterfrom
sreejasahithi:HDDS-14942

Conversation

@sreejasahithi
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR provides logic to determine the specific subset of Iceberg manifest files that require path rewriting, avoiding redundant processing of manifests.

  • Compute delta snapshots as the difference between the start and end table metadata versions. If no start version is provided, all snapshots are treated as delta.
  • Iterate over all snapshots in the end-version table and read each snapshot's manifest list in parallel.
  • If no start metadata is provided, include all manifests unconditionally.
  • If start metadata is provided, filter at the manifest level — only include manifests whose snapshotId belongs to the delta snapshot ID set.
  • Deduplicate manifest paths so that manifests shared across multiple snapshots are only collected once.

What is the link to the Apache JIRA

HDDS-14942

How was this patch tested?

https://github.com/sreejasahithi/ozone/actions/runs/24986534925

@sreejasahithi sreejasahithi marked this pull request as ready for review April 28, 2026 08:38
Copy link
Copy Markdown
Contributor

@ashishkumar50 ashishkumar50 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sreejasahithi Thanks for the PR, please find the comments inline. Also can you add test for this change.

for (ManifestFile manifest : manifests) {
if (deltaSnapshotIds == null) {
manifestPaths.add(manifest.path());
} else if (manifest.snapshotId() != null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy or old Iceberg manifests can have snapshotId() == null ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For any manifest list file that was correctly written, snapshotId() will never be null.
The check added here is a defensive check.

Table endStaticTable = RewriteTablePathOzoneUtils.newStaticTable(endVersionName, table.io());

final Set<Long> deltaSnapshotIds;
if (startMetadata != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire startMetadata object is passed but only .equals(null) is checked on it. The actual snapshot data is already captured in deltaSnapshots.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When startMetadata is not provided, deltaSnapshots will be equal to the full set of snapshots collected across all version files during the version file rewrite phase. When startMetadata is provided, deltaSnapshots will contain only those snapshots that are not tracked by the start version i.e. snapshots that appeared in intermediate version files between start and end, minus the snapshots already present in the start version's metadata.
Because deltaSnapshots is built by reading each intermediate version file's JSON as it was written at that point in time, it can include snapshots that were subsequently expired.

So we don't use deltaSnapshots for iterating and instead iterate through the snapshots collected from the endVersion metadata file because we won't be able to read the manifest list associated with the expired snapshots.
In manifestsToRewrite we use deltaSnapshots only to avoid including manifest files that were already rewritten in a previous incremental run. The snapshot_id field on each manifest entry identifies the snapshot that originally created it. By filtering to only those whose snapshot_id falls within deltaSnapshotIds, we select only manifests that are new since the start version and exclude those that were inherited from before it.

public RewriteTablePathOzoneAction(Table table) {
this.table = table;
this.parallelism = Runtime.getRuntime().availableProcessors();
this.parallelism = DEFAULT_THREAD_COUNT;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread count can be passed via command, can be done in subsequent PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , we can pass the thread count via command during which
public RewriteTablePathOzoneAction(Table table, int parallelism) will be used , we can add this when the CLI command is added for the rewrite.

@sreejasahithi sreejasahithi marked this pull request as draft April 29, 2026 07:26
@sreejasahithi
Copy link
Copy Markdown
Contributor Author

@sreejasahithi Thanks for the PR, please find the comments inline. Also can you add test for this change.

We can add the tests once we implement manifest-list rewrite because manifest-list rewrite will use manifestsToRewrite result to update the copy plan , so it will be easier to test at that time.

@sreejasahithi sreejasahithi marked this pull request as ready for review April 29, 2026 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants