diff --git a/docs/07-system-internals/07-features/03-bucket-snapshots.md b/docs/07-system-internals/07-features/03-bucket-snapshots.md index 19aca64ed1..e4a20603fa 100644 --- a/docs/07-system-internals/07-features/03-bucket-snapshots.md +++ b/docs/07-system-internals/07-features/03-bucket-snapshots.md @@ -4,5 +4,7 @@ sidebar_label: Bucket Snapshot # Bucket Snapshot System Internals +For snapshot defragmentation (YAML sidecars, on-disk layout, configuration, and workflow), see [Snapshot Defragmentation](./snapshot-defragmentation). + For detailed information on how snapshots are deleted internally, see the [Snapshot Deletion Lifecycle](./bucket-snapshot-deletion-lifecycle) page. diff --git a/docs/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md b/docs/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md deleted file mode 100644 index fdbed49e34..0000000000 --- a/docs/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md +++ /dev/null @@ -1,170 +0,0 @@ -# Improve Ozone Snapshot Scale with Snapshot Defragmentation - -## Improving Snapshot Scale - -[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003) - -## Problem Statement - -In Apache Ozone, snapshots currently take a checkpoint of the Active Object Store (AOS) RocksDB each time a snapshot is created and track the compaction of SST files over time. This model works efficiently when snapshots are short-lived, as they merely serve as hard links to the AOS RocksDB. However, over time, if an older snapshot persists while significant churn occurs in the AOS RocksDB (due to compactions and writes), the snapshot RocksDB may diverge significantly from both the AOS RocksDB and other snapshot RocksDB instances. This divergence increases storage requirements linearly with the number of snapshots. - -## Solution Proposal - -The primary inefficiency in the current snapshot mechanism stems from constant RocksDB compactions in AOS, which can cause a key, file, or directory entry to appear in multiple SST files. Ideally, each unique key, file, or directory entry should reside in only one SST file, eliminating redundant storage and mitigating the multiplier effect caused by snapshots. If implemented correctly, the total RocksDB size would be proportional to the total number of unique keys in the system rather than the number of snapshots. - -## Snapshot Defragmentation - -Currently, snapshot RocksDBs has automatic RocksDB compaction disabled intentionally to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be defragmented in the way that the next active snapshot in the chain is a checkpoint of its previous active snapshot plus a diff stored in separate SST files (one SST for each column family changed). The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory. - -Note: Snapshot Defragmentation was previously called Snapshot Compaction earlier during the design phase. It is not RocksDB compaction. Thus the rename to avoid such confusion. We are also not going to enable RocksDB auto compaction on snapshot RocksDBs. - ---- - -### 1. Introducing Last Defragmentation Time - -A new boolean flag (`needsDefrag`), timestamp (`lastDefragTime`), int `version` would be added to snapshot metadata. 
-`needsDefrag` tells the system whether a snapshot is pending defrag (`true`) or if it is already defragged and up to date (`false`). This helps manage and automate the defrag workflow, ensuring snapshots are efficiently stored and maintained. `needsDefrag` defaults to `false` during initialization and when absent. -A new list of `Map>` (`notDefraggedSstFileList`) also would be added to snapshot meta as part of snapshot create operation; this would be storing the original list of SST files in the not defragged copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. This should be done as part of the snapshot create operation. -Since this is not going to be consistent across all OMs this would have to be written to a local YAML file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go through Ratis. -An additional `Map>>` (`defraggedSstFileList`) also would be added to snapshotMeta. This will be maintaining a list of sstFiles of different versions of defragged snapshots. The key here would be the version number of snapshot DBs. - -### 2. Snapshot Cache Lock for Read Prevention - -A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during the last step of defragmentation. This ensures no active reads occur while we are replacing the underlying RocksDB instance. The swap should be instantaneous. - ---- - -### 3. Directory Structure Changes - -Snapshots currently reside under `db.snapshots/checkpointState/` directory. The proposal introduces a `db.snapshots/checkpointStateDefragged/` directory for defragged snapshots. The directory format should be as follows: - -```text -db.snapshots/checkpointState/ -``` - -### 4. Optimized Snapshot Diff Computation - -To compute a snapshot diff: - -- If both snapshots are defragged, their defragged versions will be used. The diff between two defragged snapshot should be present in one SST file. -- If the target snapshot is not defragged & the source snapshot is defragged (other way is not possible as we always defrag snapshots in order) and if the DAG has all the sst files corresponding to the not defragged snapshot version of the defragged snapshot which would be captured as part of the snapshot metadata, then an efficient diff can be performed with the information present in the DAG. Use `notDefraggedSstFileList` from each of the snapshot’s meta -- Otherwise, a full diff will be computed between the defragged source and the defragged target snapshot. Delta SST files would be computed corresponding to the latest version number of the target snapshot(version number of target snapshot would always be greater) -- Changes in the full diff logic is required to check inode ids of sst files and remove the common sst files b/w source and target snapshots. - ---- - -### 5. Snapshot Defragmentation Workflow - -A background snapshot defragmentation service should be added which would be done by iterating through the snapshot chain in the same order as the global snapshot chain. This is to ensure the snapshot created after is always defragged after all the snapshots previously created are defragged. Snapshot defragmentation should only occur once the snapshot has undergone SST filtering. The following steps outline the process: - -- **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). 
`version` of previous snapshot should be strictly greater than the current snapshot’s `version` otherwise skip compacting this snapshot in this iteration. - -- **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during defragmentation[This is to keep contents of deleted Table contents same while defragmentation consistent]. - 1. If there is no path previous snapshot then - 1. Take a checkpoint of the same RocksDB instance remove keys that don’t correspond to the bucket from tables `keyTable`, `fileTable`, `directoryTable,deletedTable,deletedDirectoryTable` by running RocksDB delete range api. This should be done if the snapshot has never been defragged before i.e. if `lastDefragTime` is zero or null. Otherwise just update the `needsDefrag` to False. - 2. We can trigger a forced manual compaction on the RocksDB instance(i & ii can be behind a flag where in we can just work with the checkpoint of the RocksDB if the flag is disabled). - - 2. If path previous snapshot exists: - 1. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. - 2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. - 3. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. - -- Check if the entire current snapshot has been flushed to disk otherwise wait for the flush to happen. - -- Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed RocksDB and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot RocksDB. - -- **Acquire the snapshot cache lock** to prevent snapshot access during directory updates.[While performing the snapshot RocksDB directory switch there should be no RocksDB handle with read happening on it]. - -- **Move the checkpoint directory** into `checkpointStateDefragged` with the format: - -```text -om.db-- -``` - -- **Update snapshot metadata**, setting `lastDefragTime` and marking `needsDefrag = false` and set the next snapshot in the chain is marked for defragmentation. If there is no path previous snapshot in the chain then increase `version` by 1 otherwise set `version` which is equal to the previous snapshot in the chain. Based on the sstFiles in the RocksDB compute `Map>` and add this Map to `defraggedSstFileList` corresponding to the `version` of the snapshot. - -- **Delete old not defragged/defragged snapshots**, ensuring unreferenced not defragged/defragged snapshots are purged during OM startup(This is to handle JVM crash after viii). - -- **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read - -#### Visualization - -```mermaid -flowchart TD -A[Start: Not defragged Snapshot Exists] --> B[Has SST Filtering Occurred?] 
-B -- No --> Z[Wait for SST Filtering] -B -- Yes --> C[Create RocksDB Checkpoint of Previous Snapshot] -C --> D["Defragged Copy Exists?"] -D -- Yes --> E[Update defragTime, set needsDefrag=false] -D -- No --> F[Create Checkpoint in Temp Directory] -E --> G[Acquire SNAPSHOT_GC_LOCK] -F --> G -G --> H[Compute Diff between Checkpoint & Current Snapshot] -H --> I[Flush Changed Objects into SST Files by table] -I --> J[Ingest SST Files into Checkpointed RocksDB] -J --> K[Truncate / Replace Deleted Tables] -K --> L[Acquire Snapshot Cache Lock] -L --> M[Move Checkpoint Dir to checkpointStateDefragged] -M --> N[Update Snapshot Metadata] -N --> O[Delete Old Snapshot DB Dir] -O --> P[Release Snapshot Cache Lock] -P --> Q[Defragged Snapshot Ready] -``` - -### Computing Changed Objects Between Snapshots - -The following steps outline how to compute changed objects between snapshots: - -1. **Determine delta SST files** - - Retrieve delta SST files from the DAG if the snapshot was not previously defragmented - and the previous snapshot has a non-defragmented copy. - - Otherwise, compute delta SST files by comparing SST files in both defragmented - RocksDB instances. - -2. **Initialize SST file writers** for the following tables `keyTable`, `directoryTable` and `fileTable` - -3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order.(Similar to the MinHeapIterator instead of iterating through multiple tables we would be iterating through multiple sst files concurrently). - -4. **Compare keys** between snapshots to determine changes and write updated objects if and only if they have changed into the SST file. - - If the object is present in the target snapshot then do an `sstFileWriter.put()`. - - If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling `sstFileWriter.delete()`. - -5. **Ingest the generated SST files** into the checkpointed RocksDB. - ---- - -#### Visualization - -```mermaid -flowchart TD -A[Start: Need Diff Between Snapshots] --> B[Determine delta SST files] -B -- DAG Info available --> C[Retrieve from DAG] -B -- Otherwise --> D[Compute delta via SST comparison] -C --> E[Initialize SST file writers: keyTable, directoryTable, fileTable] -D --> E -E --> F[Iterate SST files in parallel, merge keys] -F --> G[Compare keys between snapshots] -G --> H["Object in Target Snapshot?"] -H -- Yes --> I[sstFileWriter put] -H -- No --> J[sstFileWriter delete tombstone] -I --> K[Ingest SST files into checkpointed RocksDB] -J --> K -``` - -### Handling Snapshot Purge - -Upon snapshot deletion, the `needsDefrag` flag for the next snapshot in the chain -is set to `true`, ensuring defragmentation propagates incrementally across the snapshot chain. - -#### Visualization - -```mermaid -flowchart TD - A[Snapshot Deletion Requested] - --> B[Mark next snapshot needsDefrag true] - --> C[Next snapshots will be defragmented incrementally] -``` - -## Conclusion - -This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots + AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. 
diff --git a/docs/07-system-internals/07-features/06-snapshot-defragmentation.md b/docs/07-system-internals/07-features/06-snapshot-defragmentation.md
new file mode 100644
index 0000000000..bbd37f81a5
--- /dev/null
+++ b/docs/07-system-internals/07-features/06-snapshot-defragmentation.md
@@ -0,0 +1,467 @@
+---
+sidebar_label: Snapshot Defragmentation
+---
+
+
+# Snapshot Defragmentation
+
+Feature documentation from [apache/ozone#10131](https://github.com/apache/ozone/pull/10131) ([HDDS-15113](https://issues.apache.org/jira/browse/HDDS-15113)).
+
+## Overview
+
+An Ozone snapshot is created as a RocksDB checkpoint of the active OM DB. A
+new snapshot is cheap because its SST files are hard links to the active DB SST
+files. Over time, active DB compactions rewrite SST files. Older snapshot
+directories continue to pin their original SST files while newer snapshots pin
+newer versions of the same metadata. With many long-lived snapshots and high
+metadata churn, the disk usage under the snapshot checkpoint directory can grow
+roughly with the number of snapshots rather than with the number of live unique
+keys.
+
+Snapshot defragmentation rewrites each snapshot into a versioned checkpoint
+that contains only the data needed for that snapshot. It uses the previous
+snapshot in the same bucket path chain plus the changed SST/key ranges for the
+current snapshot, so the newest defragmented copy does not keep a full,
+independent copy of every historical SST file.
+
+Snapshot defragmentation was previously called snapshot compaction during the
+design phase. Snapshot defragmentation is not the same as RocksDB automatic
+compaction of snapshot DBs. Snapshot DB automatic compaction remains disabled
+because the snapshot diff path relies on stable SST metadata.
+
+## Current Implementation
+
+The implementation is centered on these classes:
+
+* `SnapshotDefragService`: background and on-demand service that rewrites
+  snapshot checkpoint directories.
+* `OmSnapshotLocalData` and `OmSnapshotLocalDataYaml`: local per-OM metadata
+  persisted in YAML sidecar files.
+* `OmSnapshotLocalDataManager`: loads YAML files, maintains the in-memory
+  dependency graph for `(snapshotId, version)` nodes, resolves previous
+  snapshot versions, and removes orphaned version metadata.
+* `CompositeDeltaDiffComputer`, `RDBDifferComputer`, and `FullDiffComputer`:
+  compute the SST files that may contain differences between two snapshots.
+* `SstFileSetReader` and `TableMergeIterator`: read candidate keys from delta
+  SST files as a sorted stream and compare the current and previous snapshot
+  tables without issuing an independent point lookup for every candidate key.
+* `OmSnapshotManager`: opens the current snapshot version and deletes old
+  checkpoint directories after a version switch.
+
+The defrag service is local to each OM. The rewritten checkpoint directories
+and YAML files are not Ratis-replicated state. In an HA deployment, each OM has
+its own local snapshot DB directories and must defragment its own copies. The
+admin command can target any OM node.
+
+## On-Disk Layout
+
+The active OM DB lives under the OM metadata directory selected by
+`ozone.om.db.dirs`. If that property is not set, OM falls back to
+`ozone.metadata.dirs`.
+
+For an OM metadata directory `<om-metadata-dir>`, snapshot checkpoint
+directories live under:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/
+```
+
+The current implementation does not place defragmented DBs under a separate
+`checkpointStateDefragged` directory.
+The original and defragmented versions
+are sibling directories in `checkpointState`:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>-<version>
+```
+
+Version `0` is the original, non-defragmented checkpoint and has no version
+suffix. Versions greater than `0` are produced by snapshot defragmentation.
+Normally only the current version's directory remains after a successful
+defrag cleanup. The following paths show how the directory name changes over a
+snapshot's lifetime; they are not expected to coexist in normal steady state:
+
+```text
+# Before first defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62
+
+# After first successful defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62-1
+
+# After the next successful defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62-2
+```
+
+Older directories can exist briefly during a version switch or after an
+interrupted cleanup, but the normal post-defrag path deletes older checkpoint
+directories for that snapshot DB.
+
+Each snapshot also has one local YAML sidecar next to the version `0`
+directory:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-3d0a...9f62.yaml
+```
+
+Temporary work is created under:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/tmp_defrag/
+<om-metadata-dir>/db.snapshots/checkpointState/tmp_defrag/differSstFiles/
+```
+
+`SnapshotDefragService` deletes and recreates `tmp_defrag` on service startup
+and deletes it on shutdown.
+
+When an OM DB checkpoint is served to another OM, the checkpoint code uses the
+current version from the local YAML metadata and includes that snapshot DB
+directory. The bootstrap transfer also includes the required
+`om.db-<snapshotId>.yaml` sidecars. The inode-based transfer path explicitly
+archives the YAML files for the snapshots present in the checkpoint and for any
+previous local-data nodes they depend on; the directory-walk transfer path
+includes sidecar files while selecting only the current snapshot DB directories.
+The bootstrap write lock waits for the OM double buffer to flush before
+collecting files, and the inode-based path also holds the snapshot cache lock
+and the local data manager lock while resolving snapshot directories and YAML
+paths. Snapshots in the intermediate `SNAPSHOT_DELETED` state can still be
+copied because they remain in `SnapshotInfo`; fully purged snapshots are no
+longer present there.
+
+## Local Snapshot Metadata
+
+Snapshot defrag metadata is stored in `OmSnapshotLocalData` YAML, not in
+`SnapshotInfo` and not in the Ratis log. Important fields are:
+
+| Field | Meaning |
+| :---- | :------ |
+| `snapshotId` | Snapshot UUID. Must match the checkpoint directory name. |
+| `checksum` | Checksum of the YAML representation, used to detect corrupted local metadata. |
+| `previousSnapshotId` | The previous snapshot in the same bucket path chain that this local data is resolved against. |
+| `version` | Current version to open. `0` means the original checkpoint; `> 0` means a defragmented version. |
+| `needsDefrag` | Explicit local flag that forces the service to defragment the snapshot. |
+| `isSSTFiltered` | YAML marker used by the older `SstFilteringService` path. When defrag is enabled, that service is not started. |
+| `versionSstFileInfos` | Map from snapshot version to `VersionMeta`. This replaces the earlier split between `notDefraggedSstFileList` and `defraggedSstFileList`. |
+| `VersionMeta.previousSnapshotVersion` | The version of `previousSnapshotId` that this version depends on. |
+| `VersionMeta.sstFiles` | SST file metadata for `keyTable`, `directoryTable`, and `fileTable`. Each nested `SstFileInfo` uses `fileName`, `startKey`, `endKey`, and `columnFamily`; `fileName` is stored without the `.sst` extension. |
+| `dbTxSequenceNumber` | Largest RocksDB sequence number observed in tracked SST files when the original snapshot YAML is created. Used by the checkpoint differ. |
+| `transactionInfo` | Purge transaction marker used to remove local metadata only after the purge has flushed to disk. |
+| `lastDefragTime` | Serialized by the YAML class, but current defrag decisions are based on `version`, `needsDefrag`, and `versionSstFileInfos`. |
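+
+As a rough illustration of these fields, a sidecar for a snapshot at version
+`2` could look like the sketch below. The exact serialization is defined by
+`OmSnapshotLocalDataYaml`, so the field order, nesting, and all values shown
+here are illustrative only:
+
+```yaml
+# om.db-3d0a...9f62.yaml -- illustrative layout, not the exact schema
+snapshotId: 3d0a...9f62
+checksum: 9c2f1e...
+previousSnapshotId: 7b41...c803
+version: 2
+needsDefrag: false
+isSSTFiltered: false
+dbTxSequenceNumber: 184467
+versionSstFileInfos:
+  0:                              # original checkpoint
+    previousSnapshotVersion: 0
+    sstFiles:
+      - {fileName: "000052", startKey: "/vol1/buck1/a", endKey: "/vol1/buck1/m", columnFamily: "keyTable"}
+  2:                              # current defragmented version
+    previousSnapshotVersion: 1
+    sstFiles:
+      - {fileName: "000071", startKey: "/vol1/buck1/a", endKey: "/vol1/buck1/z", columnFamily: "keyTable"}
+```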
+
+On every snapshot creation, OM creates the YAML sidecar and captures live SST
+file metadata for `keyTable`, `directoryTable`, and `fileTable` as version
+`0`. This metadata is read from the newly created snapshot checkpoint DB, not
+from the active OM DB, so an active DB compaction immediately after checkpoint
+creation cannot corrupt the snapshot's local SST tracking. This happens even
+when the periodic snapshot defrag service is disabled. New snapshots are
+committed with `needsDefrag = true`. During upgrade/finalization,
+`OmSnapshotLocalDataManager` also creates missing YAML files for snapshots
+already present in `SnapshotInfo`; active snapshots get their tracked SST
+metadata, and the synthesized YAML is marked `needsDefrag = true`. When a new
+defragmented version is added, the current version is incremented, the new
+version's SST list is captured from RocksDB, and `needsDefrag` is cleared.
+
+`OmSnapshotLocalDataManager` keeps an in-memory graph of local version
+dependencies. Each node is a `(snapshotId, version)` pair and points to the
+`(previousSnapshotId, previousSnapshotVersion)` it depends on. The graph is
+rebuilt from YAML at OM startup. It is used to:
+
+* reject deletion of a version that is still referenced by another snapshot
+  version;
+* resolve a snapshot's previous-version dependency when the path chain changes
+  after purge;
+* identify orphaned versions and YAML files that can be removed after purge.
+
+## Service Configuration
+
+Snapshot defragmentation is disabled by default.
+
+| Property | Default | Meaning |
+| :------- | :------ | :------ |
+| `ozone.snapshot.defrag.service.interval` | `-1` | Background interval. A value `<= 0` disables the service. |
+| `ozone.snapshot.defrag.limit.per.task` | `1` | Maximum number of snapshots defragmented in one service run. |
+| `ozone.snapshot.defrag.service.timeout` | `300s` | Timeout for one service run. |
+| `ozone.om.snapshot.local.data.manager.service.interval` | `5m` | Interval for the local YAML/version orphan cleanup thread. A value `<= 0` disables the cleanup thread. |
+
+The service is gated by the `SNAPSHOT_DEFRAG` OM layout feature. It also
+requires the Rocks tools native library; if the library is unavailable, an
+on-demand run returns without defragmenting snapshots.
+
+If defrag is enabled, `KeyManagerImpl` does not start `SstFilteringService`,
+even when the SST filtering interval is configured. Defrag already filters the
+tracked snapshot tables by bucket prefix while building the rewritten
+checkpoint. If defrag is disabled and SST filtering is enabled, the older SST
+filtering service still removes irrelevant SST files from version `0`
+snapshots and writes the `sstFiltered` marker file.
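+
+For illustration, the background service can be enabled by setting the
+properties from the table above. The sketch below does this programmatically
+through `OzoneConfiguration`; the chosen interval and limit are illustrative
+values, not recommendations:
+
+```java
+import org.apache.hadoop.hdds.conf.OzoneConfiguration;
+
+public final class EnableDefragSketch {
+  public static void main(String[] args) {
+    OzoneConfiguration conf = new OzoneConfiguration();
+    // A value <= 0 disables the service, so any positive interval enables it.
+    conf.set("ozone.snapshot.defrag.service.interval", "1h");
+    // Illustrative: rewrite at most 2 snapshots per service run.
+    conf.set("ozone.snapshot.defrag.limit.per.task", "2");
+    // Keep the default per-run timeout explicit.
+    conf.set("ozone.snapshot.defrag.service.timeout", "300s");
+    System.out.println(conf.get("ozone.snapshot.defrag.service.interval"));
+  }
+}
+```
+
+The same properties can equally be set in `ozone-site.xml` on each OM, since
+the service is local to every OM.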
+
+Manual defrag is exposed through:
+
+```bash
+ozone admin om snapshot defrag --service-id=<om-service-id> --node-id=<om-node-id>
+ozone admin om snapshot defrag --service-id=<om-service-id> --node-id=<om-node-id> --no-wait
+```
+
+The command requires the defrag service to be initialized on the target OM.
+Any OM in an HA service can run it because the rewritten snapshot DB state is
+local to that OM.
+
+## Defragmentation Workflow
+
+`SnapshotDefragService` iterates the global snapshot chain in forward order and
+processes active snapshots only. For each snapshot, it resolves the path
+previous snapshot in the same bucket. Incremental defrag is based on the path
+chain, not merely the global creation order.
+
+The service decides that a snapshot needs defrag when either:
+
+* the local `needsDefrag` flag is true; or
+* the snapshot's current version depends on an older version of its resolved
+  previous snapshot than the previous snapshot's current version.
+
+The second condition is what propagates defrag after a previous snapshot is
+rewritten or when snapshot purge changes the path chain.
+
+The main workflow is:
+
+1. Acquire the bootstrap read lock and load `SnapshotInfo` plus local YAML.
+2. Create a temporary checkpoint in `tmp_defrag`.
+   * If this is the first snapshot in the path chain, checkpoint the current
+     snapshot.
+   * Otherwise, checkpoint the current version of the path previous snapshot.
+3. Drop non-incremental column families from the temporary checkpoint. They are
+   reloaded from the current snapshot later.
+4. For the first snapshot in the path chain, do a full defrag of `keyTable`,
+   `directoryTable`, and `fileTable`:
+   * delete ranges outside the bucket prefix;
+   * compact each tracked table with forced bottommost-level compaction so the
+     range tombstones are removed from the rewritten checkpoint.
+5. For later snapshots, do incremental defrag of the same tracked tables:
+   * compute delta SST files between the path previous snapshot and the current
+     snapshot;
+   * group deltas by column family;
+   * read candidate keys from the delta SST files, merge them with the previous
+     and current snapshot tables, and write only changed keys or tombstones into
+     a temporary SST file;
+   * ingest the resulting SST file into the temporary checkpoint.
+6. Acquire a write `SNAPSHOT_DB_CONTENT_LOCK` for the current snapshot. This is
+   the lock that prevents concurrent snapshot content changes while the service
+   reloads non-incremental tables and switches versions. Snapshot reads and
+   deep-clean writes use `SNAPSHOT_DB_LOCK` or read `SNAPSHOT_DB_CONTENT_LOCK`
+   in the same lock hierarchy. The DAG-based lock ordering allows the content
+   lock to be acquired before snapshot DB and local-data locks; code paths avoid
+   acquiring the content lock while already holding local-data locks.
+7. Dump and ingest non-incremental tables from the current snapshot into the
+   checkpoint. The tracked tables (`keyTable`, `directoryTable`, `fileTable`)
+   are skipped because they were already rebuilt.
+8. Close the temporary checkpoint metadata manager and move the checkpoint
+   directory to the next version path:
+
+   ```text
+   <om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>-<version>
+   ```
+
+9. Open the new version, add its live SST metadata to
+   `versionSstFileInfos`, update `version`, and clear `needsDefrag`.
+10. After a successful version switch, delete older checkpoint directory
+    versions for that same snapshot after acquiring the snapshot DB cache write
+    lock.
+    For example, after switching from version `1` to version `2`,
+    `om.db-<snapshotId>-1` is removed locally once there are no open cached
+    handles for that snapshot DB. Version `0` is normally removed after the
+    first successful defrag that creates version `1`; if an older directory is
+    still present from an interrupted earlier cleanup, the same deletion path
+    can remove it. The YAML version metadata may remain longer than the
+    directories when another snapshot version still references it.
+    `OmSnapshotLocalDataManager` removes orphaned version metadata later.
+    This directory deletion intentionally remains under
+    `SNAPSHOT_DB_CONTENT_LOCK` so stale cached handles cannot write to an old
+    version while it is being removed.
+11. Release `SNAPSHOT_DB_CONTENT_LOCK`.
+
+```mermaid
+flowchart TD
+  A["Select next active snapshot"] --> B["Resolve local data and previous path snapshot"]
+  B --> C{"Needs defrag?"}
+  C -- "No" --> Z["Skip"]
+  C -- "Yes" --> D{"Has path previous snapshot?"}
+  D -- "No" --> E["Checkpoint current snapshot in tmp_defrag"]
+  D -- "Yes" --> F["Checkpoint previous snapshot current version in tmp_defrag"]
+  E --> G["Full defrag tracked tables by bucket prefix"]
+  F --> H["Compute and ingest incremental tracked-table deltas"]
+  G --> I["Acquire SNAPSHOT_DB_CONTENT_LOCK"]
+  H --> I
+  I --> J["Ingest non-incremental tables from current snapshot"]
+  J --> K["Move checkpoint to om.db-<snapshotId>-<version>"]
+  K --> L["Update YAML version metadata and clear needsDefrag"]
+  L --> M["Delete old checkpoint directories"]
+  M --> N["Release lock"]
+```
+
+## Delta Computation
+
+Defrag uses only the column families tracked by the checkpoint differ:
+`keyTable`, `directoryTable`, and `fileTable`.
+
+`CompositeDeltaDiffComputer` first tries `RDBDifferComputer`. The differ uses
+the local `versionSstFileInfos` metadata and the active DB compaction DAG. When
+the current snapshot version is `0`, the DAG path can be used to find the SST
+files that changed since the previous snapshot. For versions greater than `0`,
+the differ falls back to comparing SST file metadata by version because
+defragmented versions are already rewritten snapshot DBs rather than raw active
+DB checkpoints.
+
+If the DAG-based differ cannot produce a complete answer, the code falls back
+to `FullDiffComputer`. The full differ compares relevant SST files by inode
+when inode metadata is available, and falls back to comparing full file lists
+when inode comparison fails. It considers files unique to either endpoint and
+skips common files only when the file identity proves that they are the same
+SST.
+
+The delta computers materialize candidate SSTs as hard links under
+`tmp_defrag/differSstFiles` before returning them to the defrag service. This
+keeps the source SST content stable while the service reads it, even if the
+original source path later becomes eligible for cleanup.
+
+The delta files identify candidate SSTs, not final row-level changes. The
+defrag service still reads keys from those files, compares current and previous
+snapshot table values, writes only changed records or tombstones to a new SST
+file, and ingests that file into the checkpoint. If there is exactly one delta
+file for a table and the current snapshot version is already greater than `0`,
+the service can ingest that delta file directly.
+
+`SstFileSetReader` returns candidate keys as a sorted merged stream and can
+read tombstones through the raw SST reader. The defrag path uses key-only
+iteration, `CodecBuffer`, and direct buffers where possible. Because candidate
+keys are sorted, `TableMergeIterator` can walk the current and previous RocksDB
+tables with forward iterators and seeks instead of issuing independent point
+gets for every candidate key.
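+
+The sketch below illustrates that sorted-merge idea in simplified form. It is
+not the actual `TableMergeIterator` API: `NavigableMap` stands in for a RocksDB
+table iterator, all names and values are illustrative, and the `put`/`delete`
+comments correspond to the SST file writer calls described in the workflow
+above:
+
+```java
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.NavigableMap;
+import java.util.TreeMap;
+
+final class SortedMergeSketch {
+
+  /** Advance a sorted cursor to the first entry with key >= target, if any. */
+  private static Map.Entry<String, String> seekForward(
+      Iterator<Map.Entry<String, String>> cursor,
+      Map.Entry<String, String> current, String target) {
+    while (current != null && current.getKey().compareTo(target) < 0) {
+      current = cursor.hasNext() ? cursor.next() : null;
+    }
+    return current;
+  }
+
+  static void emitChanges(Iterable<String> sortedCandidateKeys,
+                          NavigableMap<String, String> previousTable,
+                          NavigableMap<String, String> currentTable) {
+    Iterator<Map.Entry<String, String>> prevCursor = previousTable.entrySet().iterator();
+    Iterator<Map.Entry<String, String>> currCursor = currentTable.entrySet().iterator();
+    Map.Entry<String, String> prev = prevCursor.hasNext() ? prevCursor.next() : null;
+    Map.Entry<String, String> curr = currCursor.hasNext() ? currCursor.next() : null;
+
+    // Candidate keys arrive sorted, so both cursors only ever move forward;
+    // no independent point lookup is issued per candidate key.
+    for (String key : sortedCandidateKeys) {
+      prev = seekForward(prevCursor, prev, key);
+      curr = seekForward(currCursor, curr, key);
+      String oldVal = (prev != null && prev.getKey().equals(key)) ? prev.getValue() : null;
+      String newVal = (curr != null && curr.getKey().equals(key)) ? curr.getValue() : null;
+      if (newVal == null && oldVal != null) {
+        System.out.println("tombstone: " + key);  // would become sstFileWriter.delete(key)
+      } else if (newVal != null && !newVal.equals(oldVal)) {
+        System.out.println("put:       " + key);  // would become sstFileWriter.put(key, newVal)
+      } // keys with equal values in both snapshots produce no output
+    }
+  }
+
+  public static void main(String[] args) {
+    NavigableMap<String, String> prev = new TreeMap<>();
+    prev.put("a", "1"); prev.put("b", "2"); prev.put("c", "3");
+    NavigableMap<String, String> curr = new TreeMap<>();
+    curr.put("a", "1"); curr.put("b", "9"); curr.put("d", "4");
+    emitChanges(Arrays.asList("a", "b", "c", "d"), prev, curr);
+    // prints: put: b, tombstone: c, put: d
+  }
+}
+```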
+
+## Snapshot Diff Before and After Defrag
+
+The snapshot diff API and report-generation flow do not change after snapshot
+defragmentation. `SnapshotDiffManager` still submits a diff job, opens the
+current snapshot DB versions through `OmSnapshotManager`, asks
+`CompositeDeltaDiffComputer` for candidate SST files, reads candidate keys with
+`SstFileSetReader`, compares the from/to snapshot tables with
+`TableMergeIterator`, and builds the object-ID maps used to produce the final
+diff report.
+
+The internal SST-candidate path changes based on the current local version of
+the to-snapshot:
+
+* Before defrag, the to-snapshot is version `0`, which is the original OM DB
+  checkpoint. `RDBDifferComputer` can ask `RocksDBCheckpointDiffer` to walk the
+  active DB compaction DAG and use the YAML `dbTxSequenceNumber` plus version
+  `0` SST metadata to identify changed SSTs. If the DAG cannot provide a
+  complete answer, `CompositeDeltaDiffComputer` falls back to `FullDiffComputer`.
+* After defrag, the to-snapshot's current version is greater than `0`, and that
+  version is a rewritten snapshot DB rather than an active DB checkpoint
+  produced by normal RocksDB compactions. The differ resolves the from-snapshot
+  dependency through `OmSnapshotLocalDataManager`, passes the YAML
+  `versionSstFileInfos` version map into `RocksDBCheckpointDiffer`, and compares
+  SST metadata for the relevant snapshot versions instead of using the
+  compaction-DAG walk. The full-diff fallback is still available, and
+  `--forceFullDiff` continues to bypass the DAG path.
+
+## Snapshot Reads
+
+Snapshot reads go through `OmSnapshotManager` and `SnapshotCache`. The cache
+loader reads the snapshot's current version from `OmSnapshotLocalDataManager`
+and opens:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>[-<version>]
+```
+
+The read path does not scan for the highest directory suffix on disk. The YAML
+current version is the source of truth: moving a new checkpoint directory is
+not visible to readers until the YAML current version is committed.
+
+Before opening a snapshot cache entry, the loader waits for the snapshot create
+transaction recorded in `SnapshotInfo.createTransactionInfo` to flush to the OM
+DB. This prevents a follower or a fast reader from opening a snapshot whose
+checkpoint directory or YAML sidecar exists in memory or on disk before the
+corresponding create transaction is durable.
+
+## Snapshot Purge and Orphan Cleanup
+
+Snapshot delete first marks `SnapshotInfo` as `SNAPSHOT_DELETED`. Later,
+`SnapshotDeletingService` submits an internal purge request. Purge updates the
+next snapshots' path/global previous IDs in `SnapshotInfo`, removes the purged
+snapshot from the chain, records purge `transactionInfo` in the purged
+snapshot's local YAML, invalidates the snapshot cache entry, and deletes the
+purged snapshot's checkpoint directories.
+
+The purge path does not directly write `needsDefrag = true` into the next
+snapshot's YAML. Instead, the next time local data for that snapshot is opened
+for defrag, `OmSnapshotLocalDataManager` resolves the updated
+`pathPreviousSnapshotId`. If that changes the dependency or if the referenced
+previous snapshot version is stale, the provider marks or reports the snapshot
+as needing defrag.
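+
+Putting the two defrag-trigger conditions together, the decision can be read
+as a single predicate. The sketch below is only an illustration of that logic;
+the parameter names are hypothetical, and the real checks live in the defrag
+service and `OmSnapshotLocalDataManager`, not in a standalone helper:
+
+```java
+/**
+ * Illustrative sketch of the needs-defrag decision described above;
+ * not the actual OmSnapshotLocalData API.
+ */
+static boolean needsDefrag(boolean needsDefragFlag,
+    int previousVersionThisSnapshotWasBuiltAgainst,
+    int currentVersionOfPreviousSnapshot) {
+  // Condition 1: explicit local flag, set at snapshot creation or upgrade.
+  if (needsDefragFlag) {
+    return true;
+  }
+  // Condition 2: the current version was built against an older version of
+  // the (re-resolved) previous snapshot, e.g. after that snapshot was
+  // rewritten or after a purge changed the path chain.
+  return previousVersionThisSnapshotWasBuiltAgainst
+      < currentVersionOfPreviousSnapshot;
+}
+```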
+
+Old checkpoint directories for a snapshot are deleted immediately after that
+snapshot is successfully defragmented to a newer version. Old version metadata
+and YAML files are cleaned separately from checkpoint directories by
+`OmSnapshotLocalDataManagerService`, a single-threaded scheduler owned by
+`OmSnapshotLocalDataManager`.
+
+On startup, the local data manager loads all `om.db-<snapshotId>.yaml`
+files, rebuilds the in-memory version dependency graph, and queues every
+loaded snapshot ID for an orphan check. Later commits can queue additional
+snapshot IDs:
+
+* when a snapshot gains or removes local versions;
+* when a snapshot's resolved `previousSnapshotId` changes after purge updates
+  the path chain;
+* when purge records `transactionInfo` in a snapshot's YAML.
+
+Each cleanup pass checks the queued snapshot IDs. A version entry can be
+removed from YAML when no other local version node depends on it and either:
+
+* the version is not `0` and is not the snapshot's current version; or
+* the snapshot itself has been purged.
+
+Version `0` is kept for active snapshots even when it has no dependents,
+because a newly created or unresolved snapshot can still depend on the original
+version. If a snapshot has purge `transactionInfo` but the purge transaction
+has not flushed to the OM DB yet, the cleanup thread keeps the YAML and
+re-queues the snapshot for a later pass. When the purge has flushed and no
+versions remain, the YAML file is deleted.
+
+## Metrics and Logging
+
+`OmSnapshotInternalMetrics` records defrag progress since the last OM restart:
+total defrag operations, total failures, skipped snapshots, full defrag
+operations and failures, incrementally defragged snapshots and failures, full
+defrag tables compacted, and incremental delta files processed.
+
+`OMPerformanceMetrics` records the latency of the last full defrag operation
+and the last incremental defrag operation in milliseconds. With trace logging
+enabled, `SnapshotDefragService` also logs before/after directory statistics
+for each defragmented snapshot, including total files, SST file count, and
+directory byte usage.
+
+## Expected Effect
+
+After a full pass, each active snapshot's current version is a compact,
+bucket-scoped checkpoint that reuses the previous path snapshot plus the
+snapshot's own changes. This reduces duplicate SST retention for long snapshot
+chains while keeping snapshot reads and snapshot-diff computation based on
+ordinary RocksDB checkpoints and SST metadata.
+
+For background on the original scale problem and early design discussion, see
+[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003).
diff --git a/versioned_docs/version-2.1.0/07-system-internals/07-features/03-bucket-snapshots.md b/versioned_docs/version-2.1.0/07-system-internals/07-features/03-bucket-snapshots.md
index 40ed8b8def..53b225c171 100644
--- a/versioned_docs/version-2.1.0/07-system-internals/07-features/03-bucket-snapshots.md
+++ b/versioned_docs/version-2.1.0/07-system-internals/07-features/03-bucket-snapshots.md
@@ -4,4 +4,8 @@ sidebar_label: Bucket Snapshot
 # Bucket Snapshot System Internals
 
+For snapshot defragmentation (YAML sidecars, on-disk layout, configuration, and workflow), see [Snapshot Defragmentation](./snapshot-defragmentation).
+
+For detailed information on how snapshots are deleted internally, see the [Snapshot Deletion Lifecycle](./bucket-snapshot-deletion-lifecycle) page.
+ **TODO:** File a subtask under [HDDS-9862](https://issues.apache.org/jira/browse/HDDS-9862) and complete this page or section. diff --git a/versioned_docs/version-2.1.0/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md b/versioned_docs/version-2.1.0/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md deleted file mode 100644 index fdbed49e34..0000000000 --- a/versioned_docs/version-2.1.0/07-system-internals/07-features/06-improve-ozone-snapshot-scale-with-snapshot-defragmentation.md +++ /dev/null @@ -1,170 +0,0 @@ -# Improve Ozone Snapshot Scale with Snapshot Defragmentation - -## Improving Snapshot Scale - -[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003) - -## Problem Statement - -In Apache Ozone, snapshots currently take a checkpoint of the Active Object Store (AOS) RocksDB each time a snapshot is created and track the compaction of SST files over time. This model works efficiently when snapshots are short-lived, as they merely serve as hard links to the AOS RocksDB. However, over time, if an older snapshot persists while significant churn occurs in the AOS RocksDB (due to compactions and writes), the snapshot RocksDB may diverge significantly from both the AOS RocksDB and other snapshot RocksDB instances. This divergence increases storage requirements linearly with the number of snapshots. - -## Solution Proposal - -The primary inefficiency in the current snapshot mechanism stems from constant RocksDB compactions in AOS, which can cause a key, file, or directory entry to appear in multiple SST files. Ideally, each unique key, file, or directory entry should reside in only one SST file, eliminating redundant storage and mitigating the multiplier effect caused by snapshots. If implemented correctly, the total RocksDB size would be proportional to the total number of unique keys in the system rather than the number of snapshots. - -## Snapshot Defragmentation - -Currently, snapshot RocksDBs has automatic RocksDB compaction disabled intentionally to preserve snapshot diff performance, preventing any form of compaction. However, snapshots can be defragmented in the way that the next active snapshot in the chain is a checkpoint of its previous active snapshot plus a diff stored in separate SST files (one SST for each column family changed). The proposed approach involves rewriting snapshots iteratively from the beginning of the snapshot chain and restructuring them in a separate directory. - -Note: Snapshot Defragmentation was previously called Snapshot Compaction earlier during the design phase. It is not RocksDB compaction. Thus the rename to avoid such confusion. We are also not going to enable RocksDB auto compaction on snapshot RocksDBs. - ---- - -### 1. Introducing Last Defragmentation Time - -A new boolean flag (`needsDefrag`), timestamp (`lastDefragTime`), int `version` would be added to snapshot metadata. -`needsDefrag` tells the system whether a snapshot is pending defrag (`true`) or if it is already defragged and up to date (`false`). This helps manage and automate the defrag workflow, ensuring snapshots are efficiently stored and maintained. `needsDefrag` defaults to `false` during initialization and when absent. 
-A new list of `Map>` (`notDefraggedSstFileList`) also would be added to snapshot meta as part of snapshot create operation; this would be storing the original list of SST files in the not defragged copy of the snapshot corresponding to keyTable/fileTable/DirectoryTable. This should be done as part of the snapshot create operation. -Since this is not going to be consistent across all OMs this would have to be written to a local YAML file inside the snapshot directory and this can be maintained in the SnapshotChainManager in memory on startup. So all updates should not go through Ratis. -An additional `Map>>` (`defraggedSstFileList`) also would be added to snapshotMeta. This will be maintaining a list of sstFiles of different versions of defragged snapshots. The key here would be the version number of snapshot DBs. - -### 2. Snapshot Cache Lock for Read Prevention - -A snapshot lock will be introduced in the snapshot cache to prevent reads on a specific snapshot during the last step of defragmentation. This ensures no active reads occur while we are replacing the underlying RocksDB instance. The swap should be instantaneous. - ---- - -### 3. Directory Structure Changes - -Snapshots currently reside under `db.snapshots/checkpointState/` directory. The proposal introduces a `db.snapshots/checkpointStateDefragged/` directory for defragged snapshots. The directory format should be as follows: - -```text -db.snapshots/checkpointState/ -``` - -### 4. Optimized Snapshot Diff Computation - -To compute a snapshot diff: - -- If both snapshots are defragged, their defragged versions will be used. The diff between two defragged snapshot should be present in one SST file. -- If the target snapshot is not defragged & the source snapshot is defragged (other way is not possible as we always defrag snapshots in order) and if the DAG has all the sst files corresponding to the not defragged snapshot version of the defragged snapshot which would be captured as part of the snapshot metadata, then an efficient diff can be performed with the information present in the DAG. Use `notDefraggedSstFileList` from each of the snapshot’s meta -- Otherwise, a full diff will be computed between the defragged source and the defragged target snapshot. Delta SST files would be computed corresponding to the latest version number of the target snapshot(version number of target snapshot would always be greater) -- Changes in the full diff logic is required to check inode ids of sst files and remove the common sst files b/w source and target snapshots. - ---- - -### 5. Snapshot Defragmentation Workflow - -A background snapshot defragmentation service should be added which would be done by iterating through the snapshot chain in the same order as the global snapshot chain. This is to ensure the snapshot created after is always defragged after all the snapshots previously created are defragged. Snapshot defragmentation should only occur once the snapshot has undergone SST filtering. The following steps outline the process: - -- **Create a RocksDB checkpoint** of the path previous snapshot corresponding to the bucket in the chain (if it exists). `version` of previous snapshot should be strictly greater than the current snapshot’s `version` otherwise skip compacting this snapshot in this iteration. - -- **Acquire the `SNAPSHOT_GC_LOCK`** for the snapshot ID to prevent garbage collection during defragmentation[This is to keep contents of deleted Table contents same while defragmentation consistent]. - 1. 
If there is no path previous snapshot then - 1. Take a checkpoint of the same RocksDB instance remove keys that don’t correspond to the bucket from tables `keyTable`, `fileTable`, `directoryTable,deletedTable,deletedDirectoryTable` by running RocksDB delete range api. This should be done if the snapshot has never been defragged before i.e. if `lastDefragTime` is zero or null. Otherwise just update the `needsDefrag` to False. - 2. We can trigger a forced manual compaction on the RocksDB instance(i & ii can be behind a flag where in we can just work with the checkpoint of the RocksDB if the flag is disabled). - - 2. If path previous snapshot exists: - 1. **Compute the diff** between tables (`keyTable`, `fileTable`, `directoryTable`) of the checkpoint and the current snapshot using snapshot diff functionality. - 2. **Flush changed objects** into separate SST files using the SST file writer, categorizing them by table type. - 3. **Ingest these SST files** into the RocksDB checkpoint using the `ingestFile` API. - -- Check if the entire current snapshot has been flushed to disk otherwise wait for the flush to happen. - -- Truncate `deletedTable,deletedDirectoryTable,snapshotRenamedTable etc. (All tables excepting keyTable/fileTable/directoryTable)` in checkpointed RocksDB and ingest the entire table from deletedTable and deletedDirectoryTable from the current snapshot RocksDB. - -- **Acquire the snapshot cache lock** to prevent snapshot access during directory updates.[While performing the snapshot RocksDB directory switch there should be no RocksDB handle with read happening on it]. - -- **Move the checkpoint directory** into `checkpointStateDefragged` with the format: - -```text -om.db-- -``` - -- **Update snapshot metadata**, setting `lastDefragTime` and marking `needsDefrag = false` and set the next snapshot in the chain is marked for defragmentation. If there is no path previous snapshot in the chain then increase `version` by 1 otherwise set `version` which is equal to the previous snapshot in the chain. Based on the sstFiles in the RocksDB compute `Map>` and add this Map to `defraggedSstFileList` corresponding to the `version` of the snapshot. - -- **Delete old not defragged/defragged snapshots**, ensuring unreferenced not defragged/defragged snapshots are purged during OM startup(This is to handle JVM crash after viii). - -- **Release the snapshot cache lock** on the snapshot id. Now the snapshot is ready to be used to read - -#### Visualization - -```mermaid -flowchart TD -A[Start: Not defragged Snapshot Exists] --> B[Has SST Filtering Occurred?] -B -- No --> Z[Wait for SST Filtering] -B -- Yes --> C[Create RocksDB Checkpoint of Previous Snapshot] -C --> D["Defragged Copy Exists?"] -D -- Yes --> E[Update defragTime, set needsDefrag=false] -D -- No --> F[Create Checkpoint in Temp Directory] -E --> G[Acquire SNAPSHOT_GC_LOCK] -F --> G -G --> H[Compute Diff between Checkpoint & Current Snapshot] -H --> I[Flush Changed Objects into SST Files by table] -I --> J[Ingest SST Files into Checkpointed RocksDB] -J --> K[Truncate / Replace Deleted Tables] -K --> L[Acquire Snapshot Cache Lock] -L --> M[Move Checkpoint Dir to checkpointStateDefragged] -M --> N[Update Snapshot Metadata] -N --> O[Delete Old Snapshot DB Dir] -O --> P[Release Snapshot Cache Lock] -P --> Q[Defragged Snapshot Ready] -``` - -### Computing Changed Objects Between Snapshots - -The following steps outline how to compute changed objects between snapshots: - -1. 
**Determine delta SST files** - - Retrieve delta SST files from the DAG if the snapshot was not previously defragmented - and the previous snapshot has a non-defragmented copy. - - Otherwise, compute delta SST files by comparing SST files in both defragmented - RocksDB instances. - -2. **Initialize SST file writers** for the following tables `keyTable`, `directoryTable` and `fileTable` - -3. **Iterate SST files in parallel**, reading and merging keys to maintain sorted order.(Similar to the MinHeapIterator instead of iterating through multiple tables we would be iterating through multiple sst files concurrently). - -4. **Compare keys** between snapshots to determine changes and write updated objects if and only if they have changed into the SST file. - - If the object is present in the target snapshot then do an `sstFileWriter.put()`. - - If the object is present in source snapshot but not present in target snapshot then we just have to write a tombstone entry by calling `sstFileWriter.delete()`. - -5. **Ingest the generated SST files** into the checkpointed RocksDB. - ---- - -#### Visualization - -```mermaid -flowchart TD -A[Start: Need Diff Between Snapshots] --> B[Determine delta SST files] -B -- DAG Info available --> C[Retrieve from DAG] -B -- Otherwise --> D[Compute delta via SST comparison] -C --> E[Initialize SST file writers: keyTable, directoryTable, fileTable] -D --> E -E --> F[Iterate SST files in parallel, merge keys] -F --> G[Compare keys between snapshots] -G --> H["Object in Target Snapshot?"] -H -- Yes --> I[sstFileWriter put] -H -- No --> J[sstFileWriter delete tombstone] -I --> K[Ingest SST files into checkpointed RocksDB] -J --> K -``` - -### Handling Snapshot Purge - -Upon snapshot deletion, the `needsDefrag` flag for the next snapshot in the chain -is set to `true`, ensuring defragmentation propagates incrementally across the snapshot chain. - -#### Visualization - -```mermaid -flowchart TD - A[Snapshot Deletion Requested] - --> B[Mark next snapshot needsDefrag true] - --> C[Next snapshots will be defragmented incrementally] -``` - -## Conclusion - -This approach effectively reduces storage overhead while maintaining efficient snapshot retrieval and diff computation. The total storage would be in the order of total number of keys in the snapshots + AOS by reducing overall redundancy of the objects while also making the snapshot diff computation for even older snapshots more computationally efficient. diff --git a/versioned_docs/version-2.1.0/07-system-internals/07-features/06-snapshot-defragmentation.md b/versioned_docs/version-2.1.0/07-system-internals/07-features/06-snapshot-defragmentation.md new file mode 100644 index 0000000000..bbd37f81a5 --- /dev/null +++ b/versioned_docs/version-2.1.0/07-system-internals/07-features/06-snapshot-defragmentation.md @@ -0,0 +1,467 @@ +--- +sidebar_label: Snapshot Defragmentation +--- + + + +# Snapshot Defragmentation + +Feature documentation from [apache/ozone#10131](https://github.com/apache/ozone/pull/10131) ([HDDS-15113](https://issues.apache.org/jira/browse/HDDS-15113)). + +## Overview + +An Ozone snapshot is created as a RocksDB checkpoint of the active OM DB. A +new snapshot is cheap because its SST files are hard links to the active DB SST +files. Over time, active DB compactions rewrite SST files. Older snapshot +directories continue to pin their original SST files while newer snapshots pin +newer versions of the same metadata. 
+With many long-lived snapshots and high
+metadata churn, the disk usage under the snapshot checkpoint directory can grow
+roughly with the number of snapshots rather than with the number of live unique
+keys.
+
+Snapshot defragmentation rewrites each snapshot into a versioned checkpoint
+that contains only the data needed for that snapshot. It uses the previous
+snapshot in the same bucket path chain plus the changed SST/key ranges for the
+current snapshot, so the newest defragmented copy does not keep a full,
+independent copy of every historical SST file.
+
+Snapshot defragmentation was previously called snapshot compaction during the
+design phase. Snapshot defragmentation is not the same as RocksDB automatic
+compaction of snapshot DBs. Snapshot DB automatic compaction remains disabled
+because the snapshot diff path relies on stable SST metadata.
+
+## Current Implementation
+
+The implementation is centered on these classes:
+
+* `SnapshotDefragService`: background and on-demand service that rewrites
+  snapshot checkpoint directories.
+* `OmSnapshotLocalData` and `OmSnapshotLocalDataYaml`: local per-OM metadata
+  persisted in YAML sidecar files.
+* `OmSnapshotLocalDataManager`: loads YAML files, maintains the in-memory
+  dependency graph for `(snapshotId, version)` nodes, resolves previous
+  snapshot versions, and removes orphaned version metadata.
+* `CompositeDeltaDiffComputer`, `RDBDifferComputer`, and `FullDiffComputer`:
+  compute the SST files that may contain differences between two snapshots.
+* `SstFileSetReader` and `TableMergeIterator`: read candidate keys from delta
+  SST files as a sorted stream and compare the current and previous snapshot
+  tables without issuing an independent point lookup for every candidate key.
+* `OmSnapshotManager`: opens the current snapshot version and deletes old
+  checkpoint directories after a version switch.
+
+The defrag service is local to each OM. The rewritten checkpoint directories
+and YAML files are not Ratis-replicated state. In an HA deployment, each OM has
+its own local snapshot DB directories and must defragment its own copies. The
+admin command can target any OM node.
+
+## On-Disk Layout
+
+The active OM DB lives under the OM metadata directory selected by
+`ozone.om.db.dirs`. If that property is not set, OM falls back to
+`ozone.metadata.dirs`.
+
+For an OM metadata directory `<om-metadata-dir>`, snapshot checkpoint
+directories live under:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/
+```
+
+The current implementation does not place defragmented DBs under a separate
+`checkpointStateDefragged` directory. The original and defragmented versions
+are sibling directories in `checkpointState`:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>-<version>
+```
+
+Version `0` is the original, non-defragmented checkpoint and has no version
+suffix. Versions greater than `0` are produced by snapshot defragmentation.
+Normally only the current version's directory remains after a successful
+defrag cleanup.
+The following paths show how the directory name changes over a
+snapshot's lifetime; they are not expected to coexist in normal steady state:
+
+```text
+# Before first defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62
+
+# After first successful defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62-1
+
+# After the next successful defrag:
+/var/lib/ozone/om/db.snapshots/checkpointState/om.db-3d0a...9f62-2
+```
+
+Older directories can exist briefly during a version switch or after an
+interrupted cleanup, but the normal post-defrag path deletes older checkpoint
+directories for that snapshot DB.
+
+Each snapshot also has one local YAML sidecar next to the version `0`
+directory:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/om.db-3d0a...9f62.yaml
+```
+
+Temporary work is created under:
+
+```text
+<om-metadata-dir>/db.snapshots/checkpointState/tmp_defrag/
+<om-metadata-dir>/db.snapshots/checkpointState/tmp_defrag/differSstFiles/
+```
+
+`SnapshotDefragService` deletes and recreates `tmp_defrag` on service startup
+and deletes it on shutdown.
+
+When an OM DB checkpoint is served to another OM, the checkpoint code uses the
+current version from the local YAML metadata and includes that snapshot DB
+directory. The bootstrap transfer also includes the required
+`om.db-<snapshotId>.yaml` sidecars. The inode-based transfer path explicitly
+archives the YAML files for the snapshots present in the checkpoint and for any
+previous local-data nodes they depend on; the directory-walk transfer path
+includes sidecar files while selecting only the current snapshot DB directories.
+The bootstrap write lock waits for the OM double buffer to flush before
+collecting files, and the inode-based path also holds the snapshot cache lock
+and the local data manager lock while resolving snapshot directories and YAML
+paths. Snapshots in the intermediate `SNAPSHOT_DELETED` state can still be
+copied because they remain in `SnapshotInfo`; fully purged snapshots are no
+longer present there.
+
+## Local Snapshot Metadata
+
+Snapshot defrag metadata is stored in `OmSnapshotLocalData` YAML, not in
+`SnapshotInfo` and not in the Ratis log. Important fields are:
+
+| Field | Meaning |
+| :---- | :------ |
+| `snapshotId` | Snapshot UUID. Must match the checkpoint directory name. |
+| `checksum` | Checksum of the YAML representation, used to detect corrupted local metadata. |
+| `previousSnapshotId` | The previous snapshot in the same bucket path chain that this local data is resolved against. |
+| `version` | Current version to open. `0` means the original checkpoint; `> 0` means a defragmented version. |
+| `needsDefrag` | Explicit local flag that forces the service to defragment the snapshot. |
+| `isSSTFiltered` | YAML marker used by the older `SstFilteringService` path. When defrag is enabled, that service is not started. |
+| `versionSstFileInfos` | Map from snapshot version to `VersionMeta`. This replaces the earlier split between `notDefraggedSstFileList` and `defraggedSstFileList`. |
+| `VersionMeta.previousSnapshotVersion` | The version of `previousSnapshotId` that this version depends on. |
+| `VersionMeta.sstFiles` | SST file metadata for `keyTable`, `directoryTable`, and `fileTable`. Each nested `SstFileInfo` uses `fileName`, `startKey`, `endKey`, and `columnFamily`; `fileName` is stored without the `.sst` extension. |
+| `dbTxSequenceNumber` | Largest RocksDB sequence number observed in tracked SST files when the original snapshot YAML is created. Used by the checkpoint differ. |
On every snapshot creation, OM creates the YAML sidecar and captures live SST
file metadata for `keyTable`, `directoryTable`, and `fileTable` as version `0`.
This metadata is read from the newly created snapshot checkpoint DB, not from
the active OM DB, so an active DB compaction immediately after checkpoint
creation cannot corrupt the snapshot's local SST tracking. This happens even
when the periodic snapshot defrag service is disabled. New snapshots are
committed with `needsDefrag = true`. During upgrade/finalization,
`OmSnapshotLocalDataManager` also creates missing YAML files for snapshots
already present in `SnapshotInfo`; active snapshots get their tracked SST
metadata, and the synthesized YAML is marked `needsDefrag = true`. When a new
defragmented version is added, the current version is incremented, the new
version's SST list is captured from RocksDB, and `needsDefrag` is cleared.

`OmSnapshotLocalDataManager` keeps an in-memory graph of local version
dependencies. Each node is a `(snapshotId, version)` pair and points to the
`(previousSnapshotId, previousSnapshotVersion)` it depends on. The graph is
rebuilt from YAML at OM startup. It is used to:

* reject deletion of a version that is still referenced by another snapshot
  version;
* resolve a snapshot's previous-version dependency when the path chain changes
  after purge;
* identify orphaned versions and YAML files that can be removed after purge.

## Service Configuration

Snapshot defragmentation is disabled by default.

| Property | Default | Meaning |
| :------- | :------ | :------ |
| `ozone.snapshot.defrag.service.interval` | `-1` | Background interval. A value `<= 0` disables the service. |
| `ozone.snapshot.defrag.limit.per.task` | `1` | Maximum number of snapshots defragmented in one service run. |
| `ozone.snapshot.defrag.service.timeout` | `300s` | Timeout for one service run. |
| `ozone.om.snapshot.local.data.manager.service.interval` | `5m` | Interval for the local YAML/version orphan cleanup thread. A value `<= 0` disables the cleanup thread. |

The service is gated by the `SNAPSHOT_DEFRAG` OM layout feature. It also
requires the Rocks tools native library; if the library is unavailable, an
on-demand run returns without defragmenting snapshots.

If defrag is enabled, `KeyManagerImpl` does not start `SstFilteringService`,
even when the SST filtering interval is configured. Defrag already filters the
tracked snapshot tables by bucket prefix while building the rewritten
checkpoint. If defrag is disabled and SST filtering is enabled, the older SST
filtering service still removes irrelevant SST files from version `0` snapshots
and writes the `sstFiltered` marker file.

Manual defrag is exposed through:

```bash
ozone admin om snapshot defrag --service-id=<serviceId> --node-id=<nodeId>
ozone admin om snapshot defrag --service-id=<serviceId> --node-id=<nodeId> --no-wait
```

The command requires the defrag service to be initialized on the target OM. Any
OM in an HA service can run it because the rewritten snapshot DB state is local
to that OM.
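For example, in a hypothetical HA service `omservice` with nodes `om1` to `om3`
(both IDs are illustrative only), a pass can be triggered per node:

```bash
# Kick off a defrag pass on one OM and wait for it to finish.
ozone admin om snapshot defrag --service-id=omservice --node-id=om1

# Trigger the same pass on another OM without waiting for completion.
ozone admin om snapshot defrag --service-id=omservice --node-id=om2 --no-wait
```

Running the command against a single node does not defragment the other OMs'
copies, so each node has to be targeted in turn.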
## Defragmentation Workflow

`SnapshotDefragService` iterates the global snapshot chain in forward order and
processes active snapshots only. For each snapshot, it resolves the path
previous snapshot in the same bucket. Incremental defrag is based on the path
chain, not merely the global creation order.

The service decides that a snapshot needs defrag when either:

* the local `needsDefrag` flag is true; or
* the snapshot's current version depends on an older version of its resolved
  previous snapshot than the previous snapshot's current version.

The second condition is what propagates defrag after a previous snapshot is
rewritten or when snapshot purge changes the path chain.

The main workflow is:

1. Acquire the bootstrap read lock and load `SnapshotInfo` plus local YAML.
2. Create a temporary checkpoint in `tmp_defrag`.
   * If this is the first snapshot in the path chain, checkpoint the current
     snapshot.
   * Otherwise, checkpoint the current version of the path previous snapshot.
3. Drop non-incremental column families from the temporary checkpoint. They are
   reloaded from the current snapshot later.
4. For the first snapshot in the path chain, do a full defrag of `keyTable`,
   `directoryTable`, and `fileTable`:
   * delete ranges outside the bucket prefix;
   * compact each tracked table with forced bottommost-level compaction so that
     range tombstones are removed from the rewritten checkpoint.
5. For later snapshots, do incremental defrag of the same tracked tables:
   * compute delta SST files between the path previous snapshot and the current
     snapshot;
   * group deltas by column family;
   * read candidate keys from the delta SST files, merge them with the previous
     and current snapshot tables, and write only changed keys or tombstones
     into a temporary SST file;
   * ingest the resulting SST file into the temporary checkpoint.
6. Acquire a write `SNAPSHOT_DB_CONTENT_LOCK` for the current snapshot. This is
   the lock that prevents concurrent snapshot content changes while the service
   reloads non-incremental tables and switches versions. Snapshot reads and
   deep-clean writes use `SNAPSHOT_DB_LOCK` or read `SNAPSHOT_DB_CONTENT_LOCK`
   in the same lock hierarchy. The DAG-based lock ordering allows the content
   lock to be acquired before snapshot DB and local-data locks; code paths
   avoid acquiring the content lock while already holding local-data locks.
7. Dump and ingest non-incremental tables from the current snapshot into the
   checkpoint. The tracked tables (`keyTable`, `directoryTable`, `fileTable`)
   are skipped because they were already rebuilt.
8. Close the temporary checkpoint metadata manager and move the checkpoint
   directory to the next version path:

   ```text
   <om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>-<version>
   ```

9. Open the new version, add its live SST metadata to `versionSstFileInfos`,
   update `version`, and clear `needsDefrag`.
10. After a successful version switch, delete older checkpoint directory
    versions for that same snapshot after acquiring the snapshot DB cache write
    lock. For example, after switching from version `1` to version `2`,
    `om.db-<snapshotId>-1` is removed locally once there are no open cached
    handles for that snapshot DB. Version `0` is normally removed after the
    first successful defrag that creates version `1`; if an older directory is
    still present from an interrupted earlier cleanup, the same deletion path
    can remove it. The YAML version metadata may remain longer than the
    directories when another snapshot version still references it.
    `OmSnapshotLocalDataManager` removes orphaned version metadata later.
    This directory deletion intentionally remains under
    `SNAPSHOT_DB_CONTENT_LOCK` so stale cached handles cannot write to an old
    version while it is being removed.
11. Release `SNAPSHOT_DB_CONTENT_LOCK`.

```mermaid
flowchart TD
    A["Select next active snapshot"] --> B["Resolve local data and previous path snapshot"]
    B --> C{"Needs defrag?"}
    C -- "No" --> Z["Skip"]
    C -- "Yes" --> D{"Has path previous snapshot?"}
    D -- "No" --> E["Checkpoint current snapshot in tmp_defrag"]
    D -- "Yes" --> F["Checkpoint previous snapshot current version in tmp_defrag"]
    E --> G["Full defrag tracked tables by bucket prefix"]
    F --> H["Compute and ingest incremental tracked-table deltas"]
    G --> I["Acquire SNAPSHOT_DB_CONTENT_LOCK"]
    H --> I
    I --> J["Ingest non-incremental tables from current snapshot"]
    J --> K["Move checkpoint to om.db-&lt;snapshotId&gt;-&lt;version&gt;"]
    K --> L["Update YAML version metadata and clear needsDefrag"]
    L --> M["Delete old checkpoint directories"]
    M --> N["Release lock"]
```
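The end state of steps 8-10 can be spot-checked from a shell; a minimal sketch,
assuming the default layout used earlier:

```bash
DIR=/var/lib/ozone/om/db.snapshots/checkpointState
SNAP=3d0a...9f62   # abbreviated example snapshot UUID

# After a successful pass, expect a single version-suffixed directory whose
# suffix matches the `version` recorded in the snapshot's YAML sidecar.
ls -d "$DIR/om.db-$SNAP"*
```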
## Delta Computation

Defrag uses only the column families tracked by the checkpoint differ:
`keyTable`, `directoryTable`, and `fileTable`.

`CompositeDeltaDiffComputer` first tries `RDBDifferComputer`. The differ uses
the local `versionSstFileInfos` metadata and the active DB compaction DAG. When
the current snapshot version is `0`, the DAG path can be used to find the SST
files that changed since the previous snapshot. For versions greater than `0`,
the differ falls back to comparing SST file metadata by version because
defragmented versions are already rewritten snapshot DBs rather than raw active
DB checkpoints.

If the DAG-based differ cannot produce a complete answer, the code falls back
to `FullDiffComputer`. The full differ compares relevant SST files by inode
when inode metadata is available, and falls back to comparing full file lists
when inode comparison fails. It considers files unique to either endpoint and
skips common files only when the file identity proves that they are the same
SST.

The delta computers materialize candidate SSTs as hard links under
`tmp_defrag/differSstFiles` before returning them to the defrag service. This
keeps the source SST content stable while the service reads it, even if the
original source path later becomes eligible for cleanup.
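While a pass is in flight, those hard links are visible under the service's
temporary directory; a sketch assuming the default layout:

```bash
# Candidate delta SSTs hard-linked for the current defrag pass; the whole
# tmp_defrag tree is deleted and recreated when SnapshotDefragService starts.
ls /var/lib/ozone/om/db.snapshots/checkpointState/tmp_defrag/differSstFiles/
```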
The delta files identify candidate SSTs, not final row-level changes. The
defrag service still reads keys from those files, compares current and previous
snapshot table values, writes only changed records or tombstones to a new SST
file, and ingests that file into the checkpoint. If there is exactly one delta
file for a table and the current snapshot version is already greater than `0`,
the service can ingest that delta file directly.

`SstFileSetReader` returns candidate keys as a sorted merged stream and can
read tombstones through the raw SST reader. The defrag path uses key-only
iteration, `CodecBuffer`, and direct buffers where possible. Because candidate
keys are sorted, `TableMergeIterator` can walk the current and previous RocksDB
tables with forward iterators and seeks instead of issuing independent point
gets for every candidate key.

## Snapshot Diff Before and After Defrag

The snapshot diff API and report-generation flow do not change after snapshot
defragmentation. `SnapshotDiffManager` still submits a diff job, opens the
current snapshot DB versions through `OmSnapshotManager`, asks
`CompositeDeltaDiffComputer` for candidate SST files, reads candidate keys with
`SstFileSetReader`, compares the from/to snapshot tables with
`TableMergeIterator`, and builds the object-ID maps used to produce the final
diff report.

The internal SST-candidate path changes based on the current local version of
the to-snapshot:

* Before defrag, the to-snapshot is version `0`, which is the original OM DB
  checkpoint. `RDBDifferComputer` can ask `RocksDBCheckpointDiffer` to walk the
  active DB compaction DAG and use the YAML `dbTxSequenceNumber` plus version
  `0` SST metadata to identify changed SSTs. If the DAG cannot provide a
  complete answer, `CompositeDeltaDiffComputer` falls back to
  `FullDiffComputer`.
* After defrag, the to-snapshot's current version is greater than `0`, and that
  version is a rewritten snapshot DB rather than an active DB checkpoint
  produced by normal RocksDB compactions. The differ resolves the from-snapshot
  dependency through `OmSnapshotLocalDataManager`, passes the YAML
  `versionSstFileInfos` version map into `RocksDBCheckpointDiffer`, and
  compares SST metadata for the relevant snapshot versions instead of using the
  compaction-DAG walk. The full-diff fallback is still available, and
  `--forceFullDiff` continues to bypass the DAG path.

## Snapshot Reads

Snapshot reads go through `OmSnapshotManager` and `SnapshotCache`. The cache
loader reads the snapshot's current version from `OmSnapshotLocalDataManager`
and opens:

```text
<om-metadata-dir>/db.snapshots/checkpointState/om.db-<snapshotId>[-<version>]
```

The read path does not scan for the highest directory suffix on disk. The YAML
current version is the source of truth: moving a new checkpoint directory is
not visible to readers until the YAML current version is committed.

Before opening a snapshot cache entry, the loader waits for the snapshot create
transaction recorded in `SnapshotInfo.createTransactionInfo` to flush to the OM
DB. This prevents a follower or a fast reader from opening a snapshot whose
checkpoint directory or YAML sidecar exists in memory or on disk before the
corresponding create transaction is durable.

## Snapshot Purge and Orphan Cleanup

Snapshot delete first marks `SnapshotInfo` as `SNAPSHOT_DELETED`. Later,
`SnapshotDeletingService` submits an internal purge request. Purge updates the
next snapshots' path/global previous IDs in `SnapshotInfo`, removes the purged
snapshot from the chain, records purge `transactionInfo` in the purged
snapshot's local YAML, invalidates the snapshot cache entry, and deletes the
purged snapshot's checkpoint directories.

The purge path does not directly write `needsDefrag = true` into the next
snapshot's YAML. Instead, the next time local data for that snapshot is opened
for defrag, `OmSnapshotLocalDataManager` resolves the updated
`pathPreviousSnapshotId`. If that changes the dependency or if the referenced
previous snapshot version is stale, the provider marks or reports the snapshot
as needing defrag.

Old checkpoint directories for a snapshot are deleted immediately after that
snapshot is successfully defragmented to a newer version. Old version metadata
and YAML files are cleaned separately from checkpoint directories by
`OmSnapshotLocalDataManagerService`, a single-threaded scheduler owned by
`OmSnapshotLocalDataManager`.
On startup, the local data manager loads all `om.db-<snapshotId>.yaml` files,
rebuilds the in-memory version dependency graph, and queues every loaded
snapshot ID for an orphan check. Later commits can queue additional snapshot
IDs:

* when a snapshot gains or removes local versions;
* when a snapshot's resolved `previousSnapshotId` changes after purge updates
  the path chain;
* when purge records `transactionInfo` in a snapshot's YAML.

Each cleanup pass checks the queued snapshot IDs. A version entry can be
removed from YAML when no other local version node depends on it and either:

* the version is not `0` and is not the snapshot's current version; or
* the snapshot itself has been purged.

Version `0` is kept for active snapshots even when it has no dependents,
because a newly created or unresolved snapshot can still depend on the original
version. If a snapshot has purge `transactionInfo` but the purge transaction
has not flushed to the OM DB yet, the cleanup thread keeps the YAML and
re-queues the snapshot for a later pass. When the purge has flushed and no
versions remain, the YAML file is deleted.

## Metrics and Logging

`OmSnapshotInternalMetrics` records defrag progress since the last OM restart:
total defrag operations, total failures, skipped snapshots, full defrag
operations and failures, incrementally defragged snapshots and failures, full
defrag tables compacted, and incremental delta files processed.

`OMPerformanceMetrics` records the latency of the last full defrag operation
and the last incremental defrag operation in milliseconds. With trace logging
enabled, `SnapshotDefragService` also logs before/after directory statistics
for each defragmented snapshot, including total files, SST file count, and
directory byte usage.

## Expected Effect

After a full pass, each active snapshot's current version is a compact,
bucket-scoped checkpoint that reuses the previous path snapshot plus the
snapshot's own changes. This reduces duplicate SST retention for long snapshot
chains while keeping snapshot reads and snapshot-diff computation based on
ordinary RocksDB checkpoints and SST metadata.

For background on the original scale problem and early design discussion, see
[HDDS-13003](https://issues.apache.org/jira/browse/HDDS-13003).