resize: make interrupted resizes safe to resume#14
Merged
Conversation
Contributor
Author
|
@deitch ready for review. Fixes #13: an interrupted resize that was re-run could corrupt the disk (planResizes re-planned a second shrink once the relocated partitions existed, driving the shrink partition's target size negative). Makes planResizes resume-aware and adds a CompareFS-based copy-skip, plus restart-safety tests that interrupt the pipeline at each step and re-run to completion. Stacked on #6/#11 + the go-diskfs bump; the end-to-end tests need diskfs/go-diskfs#410, so CI here stays red until that merges and the pin updates (verified green locally against #410). |
Contributor
|
I like this. idempotency always was the eventual goal. It does need a rebase though. |
dcbca33 to
131c22b
Compare
A resize that was interrupted and re-run could corrupt the disk. planResizes recomputed the plan from the partially-modified disk, and once the relocated partitions had already been created it no longer found room for the grows, so it planned a second shrink of the shrink partition -- driving its target size negative (diskfs#13). planResizes is now resume-aware: a grow whose relocated "<label>_resized2" partition already exists is reused in place and excluded from the space/shrink planning, so no second shrink is computed. copyFilesystems additionally skips the reformat+recopy when the target already holds a matching filesystem (sync.CompareFS). resume_test.go adds restart-safety coverage: the pipeline is interrupted after each step (including mid-copy, with an empty and a stale-but-non-empty target) and re-run to completion, asserting the final disk matches an uninterrupted run; a separate case corrupts the shrink source and asserts the resize aborts on e2fsck failure rather than shrinking a broken filesystem. These end-to-end cases run a real shrink/copy of a multi-GB fixture, so they are guarded by testing.Short and CI's test timeout is raised to 30m. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
131c22b to
c93d703
Compare
Replace the swapPartitions + removePartitions/removeAndRenumberPartitions sequence that ends a resize with a single idempotent updatePartitions. In one partition-table write it gives each relocated target the original partition's identity (name, type GUID, partition GUID, attributes), sets its number (the original number when preserveNumbers, otherwise the number it was created with), and removes the superseded original. The previous swap was not idempotent -- running it twice restored the original arrangement -- so a resize interrupted between the swap and the removal could not be safely re-run. updatePartitions keys on each partition's on-disk start offset, which is stable across this phase (names and numbers change), sets the desired final state directly rather than exchanging values, and treats an already-removed original as a no-op. planResizes additionally skips a grow whose partition is already at the requested size, so a re-run after a fully completed resize converges to a no-op. Together these make every interruption point in the pipeline safe to resume; resume_test.go now asserts the post-finalize case (afterUpdatePartitions) that was previously skipped as a known-broken design TBD. swapPartitions, removePartitions and removeAndRenumberPartitions remain defined but are no longer called by resize; they stay unit-tested for reference. A new TestUpdatePartitions covers the finalize transformation in both numbering modes. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #13.
Problem
A resize interrupted (e.g. the machine reboots mid-operation) and then re-run
could corrupt the disk.
planResizesrecomputes the plan from the currenton-disk state each run; once the relocated
*_resized2partitions have beencreated, that space is occupied, the grows no longer fit, and a second
shrink of the shrink partition is planned — driving its target size negative.
Fix
planResizes: a grow whose relocated*_resized2partitionalready exists is reused in place and excluded from the space/shrink
re-planning, so no second shrink is computed.
copyFilesystemscopy-skip: when the target already holds a filesystemmatching the source (
sync.CompareFS), the reformat+recopy is skipped;otherwise it reformats (over an empty or non-matching target) and recopies.
updatePartitionsfinalize: replaces the non-idempotentswapPartitions+removePartitions/removeAndRenumberPartitionsfinalizewith a single partition-table write that relabels/reindexes each relocated
target and removes the original, keyed on the stable on-disk start offset, so
a re-run converges instead of undoing a completed operation. This makes the
finalize step itself safe to resume.
Tests (
resume_test.go)TestRunResumeAfterInterruptioninterrupts the pipeline after each step andre-runs to completion, asserting the final disk matches an uninterrupted run:
afterShrinkFilesystems,afterShrinkPartitions,afterCreatePartitions,midCopyTargetFsCreated,midCopyTargetFsHasStaleFile,afterCopyFilesystems,and
afterUpdatePartitions(whole resize completed → re-run is a no-op) — inboth
renumberandpreserveNumbersmodes.TestUpdatePartitionscovers thefinalize transformation directly.
TestRunAbortsOnFsckFailureasserts theresize aborts on e2fsck failure rather than shrinking a broken filesystem.
swapPartitions/removePartitions/removeAndRenumberPartitionsremain definedbut are no longer called by
resize.These cases run a real shrink/copy of a multi-GB fixture, so they are guarded by
testing.Short()and the CI test timeout is raised to 30m.