resize: make interrupted resizes safe to resume #14

Merged: deitch merged 2 commits into diskfs:main from eriknordmark:resume-restart-safety on Jun 11, 2026

@eriknordmark eriknordmark commented Jun 10, 2026

Fixes #13.

Problem

A resize that was interrupted (e.g. the machine rebooted mid-operation) and then
re-run could corrupt the disk. planResizes recomputes the plan from the current
on-disk state on each run. Once the relocated *_resized2 partitions have been
created, the space they occupy is no longer free, so the grows no longer fit and
a second shrink of the shrink partition is planned, driving its target size
negative.

Fix

  • Resume-aware planResizes: a grow whose relocated *_resized2 partition
    already exists is reused in place and excluded from the space/shrink
    re-planning, so no second shrink is computed.
  • copyFilesystems copy-skip: when the target already holds a filesystem
    matching the source (sync.CompareFS), the reformat+recopy is skipped;
    otherwise it reformats (over an empty or non-matching target) and recopies.
  • Idempotent updatePartitions finalize: replaces the non-idempotent
    swapPartitions + removePartitions/removeAndRenumberPartitions finalize
    with a single partition-table write that relabels/reindexes each relocated
    target and removes the original, keyed on the stable on-disk start offset, so
    a re-run converges instead of undoing a completed operation. This makes the
    finalize step itself safe to resume.
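The planning failure and the resume-aware fix can be illustrated with a toy model. This is a minimal sketch, not the code in this PR: the Partition and Grow types, the shrinkTarget function, and the GiB numbers are all hypothetical, and it assumes a grow is satisfied by creating a relocated partition of the full target size.

```go
package main

import "fmt"

// Partition is a toy stand-in for a partition-table entry; sizes are in GiB.
type Partition struct {
	Label string
	Size  int64
}

// Grow is a requested grow: the partition called Label should reach Target GiB.
type Grow struct {
	Label  string
	Target int64
}

// relocatedSuffix marks a relocated copy, as in "<label>_resized2".
const relocatedSuffix = "_resized2"

// shrinkTarget returns the size the shrink partition must be reduced to so
// that all pending grows fit. With resumeAware set, a grow whose relocated
// partition already exists is treated as done and excluded from the planning.
// (Hypothetical helper for illustration only.)
func shrinkTarget(parts []Partition, free int64, shrinkLabel string, grows []Grow, resumeAware bool) int64 {
	existing := make(map[string]bool, len(parts))
	var shrinkSize int64
	for _, p := range parts {
		existing[p.Label] = true
		if p.Label == shrinkLabel {
			shrinkSize = p.Size
		}
	}
	var needed int64
	for _, g := range grows {
		if resumeAware && existing[g.Label+relocatedSuffix] {
			continue // relocated by an earlier, interrupted run: reuse in place
		}
		needed += g.Target
	}
	shortfall := needed - free
	if shortfall < 0 {
		shortfall = 0
	}
	return shrinkSize - shortfall
}

func main() {
	grows := []Grow{{Label: "data", Target: 3}}

	// Fresh disk: root is 4 GiB, data is 1 GiB, 1 GiB is free.
	fresh := []Partition{{Label: "root", Size: 4}, {Label: "data", Size: 1}}
	fmt.Println("first run:", shrinkTarget(fresh, 1, "root", grows, false))

	// Interrupted after the relocated partition was created: root is already
	// 2 GiB, data_resized2 occupies 3 GiB, and nothing is free.
	resumed := []Partition{
		{Label: "root", Size: 2},
		{Label: "data", Size: 1},
		{Label: "data_resized2", Size: 3},
	}
	fmt.Println("naive re-run:", shrinkTarget(resumed, 0, "root", grows, false))
	fmt.Println("resume-aware re-run:", shrinkTarget(resumed, 0, "root", grows, true))
}
```

In this model the naive re-run asks for another 3 GiB from a 2 GiB partition, which is the negative target size described in the Problem section; the resume-aware re-run leaves the shrink partition at its already-shrunk size.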

Tests (resume_test.go)

TestRunResumeAfterInterruption interrupts the pipeline at each of the following
points and re-runs it to completion, asserting that the final disk matches an
uninterrupted run: afterShrinkFilesystems, afterShrinkPartitions,
afterCreatePartitions, midCopyTargetFsCreated, midCopyTargetFsHasStaleFile,
afterCopyFilesystems, and afterUpdatePartitions (whole resize completed, so the
re-run is a no-op). Each point is exercised in both renumber and
preserveNumbers modes. TestUpdatePartitions covers the finalize transformation
directly. TestRunAbortsOnFsckFailure asserts the resize aborts on e2fsck failure
rather than shrinking a broken filesystem.
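The interrupt-and-resume check has roughly the following shape. This is a sketch under stated assumptions, not the contents of resume_test.go: the disk struct, the step bodies, and the run and resumeMatches helpers are hypothetical placeholders, and the step names are only inferred from the interruption-point names above. The real tests operate on a multi-GB disk fixture.

```go
package main

import (
	"errors"
	"fmt"
)

// disk is a toy stand-in for on-disk state; the fields are placeholders.
type disk struct {
	shrinkFsSize   int
	shrinkPartSize int
	targetCreated  bool
	targetContents string
	finalName      string
}

// step is one named pipeline stage.
type step struct {
	name string
	run  func(d *disk)
}

var errInterrupted = errors.New("interrupted")

// pipeline returns the stages in order. Each toy stage sets its desired end
// state directly, so running it a second time changes nothing.
func pipeline() []step {
	return []step{
		{"shrinkFilesystems", func(d *disk) { d.shrinkFsSize = 2 }},
		{"shrinkPartitions", func(d *disk) { d.shrinkPartSize = 2 }},
		{"createPartitions", func(d *disk) { d.targetCreated = true }},
		{"copyFilesystems", func(d *disk) { d.targetContents = "payload" }},
		{"updatePartitions", func(d *disk) { d.finalName = "data" }},
	}
}

// run executes the pipeline and simulates an interruption right after the
// stage named stopAfter. An empty stopAfter runs to completion.
func run(d *disk, stopAfter string) error {
	for _, s := range pipeline() {
		s.run(d)
		if s.name == stopAfter {
			return errInterrupted
		}
	}
	return nil
}

// resumeMatches interrupts after stopAfter, re-runs from the start, and
// reports whether the result equals an uninterrupted run.
func resumeMatches(stopAfter string) bool {
	var want, got disk
	if err := run(&want, ""); err != nil {
		return false
	}
	if err := run(&got, stopAfter); !errors.Is(err, errInterrupted) {
		return false
	}
	if err := run(&got, ""); err != nil {
		return false
	}
	return got == want
}

func main() {
	for _, s := range pipeline() {
		fmt.Printf("interrupt after %s: matches=%v\n", s.name, resumeMatches(s.name))
	}
}
```

Comparing against an uninterrupted run, rather than against hand-written expected values, keeps the assertion valid for every interruption point without restating the final layout in each case.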

swapPartitions/removePartitions/removeAndRenumberPartitions remain defined
but are no longer called by resize.

The end-to-end resume cases run a real shrink/copy of a multi-GB fixture, so they
are guarded by testing.Short() and the CI test timeout is raised to 30m.

@eriknordmark eriknordmark marked this pull request as ready for review June 10, 2026 18:13
@eriknordmark

@deitch ready for review. Fixes #13: an interrupted resize that was re-run could corrupt the disk (planResizes re-planned a second shrink once the relocated partitions existed, driving the shrink partition's target size negative). Makes planResizes resume-aware and adds a CompareFS-based copy-skip, plus restart-safety tests that interrupt the pipeline at each step and re-run to completion. Stacked on #6/#11 + the go-diskfs bump; the end-to-end tests need diskfs/go-diskfs#410, so CI here stays red until that merges and the pin updates (verified green locally against #410).

deitch commented Jun 11, 2026

I like this. Idempotency was always the eventual goal. It does need a rebase though.

@eriknordmark eriknordmark marked this pull request as draft June 11, 2026 13:42
@eriknordmark eriknordmark force-pushed the resume-restart-safety branch from dcbca33 to 131c22b Compare June 11, 2026 15:33
@eriknordmark eriknordmark marked this pull request as ready for review June 11, 2026 15:37
A resize that was interrupted and re-run could corrupt the disk. planResizes
recomputed the plan from the partially-modified disk, and once the relocated
partitions had already been created it no longer found room for the grows, so
it planned a second shrink of the shrink partition -- driving its target size
negative (diskfs#13).

planResizes is now resume-aware: a grow whose relocated "<label>_resized2"
partition already exists is reused in place and excluded from the space/shrink
planning, so no second shrink is computed. copyFilesystems additionally skips
the reformat+recopy when the target already holds a matching filesystem
(sync.CompareFS).
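The copy-skip decision can be sketched as follows. The copyIfNeeded function and its two callbacks are hypothetical: matches stands in for a filesystem comparison such as sync.CompareFS, whose actual signature is not shown here, and reformatAndCopy stands in for the format-and-copy step. Treating a comparison error as "does not match" is an assumption of this sketch, not something the commit states.

```go
package main

import "fmt"

// copyIfNeeded skips the reformat+recopy when the target already matches the
// source. Any other outcome (empty target, stale contents, or, by assumption,
// a failed comparison) falls back to reformatting and copying from scratch.
func copyIfNeeded(matches func() (bool, error), reformatAndCopy func() error) (skipped bool, err error) {
	same, cmpErr := matches()
	if cmpErr == nil && same {
		return true, nil // earlier run already finished this copy
	}
	return false, reformatAndCopy()
}

func main() {
	copies := 0
	doCopy := func() error { copies++; return nil }

	// Target already holds a matching filesystem: nothing is copied.
	skipped, _ := copyIfNeeded(func() (bool, error) { return true, nil }, doCopy)
	fmt.Println(skipped, copies)

	// Target is empty or stale: it is reformatted and copied again.
	skipped, _ = copyIfNeeded(func() (bool, error) { return false, nil }, doCopy)
	fmt.Println(skipped, copies)
}
```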

resume_test.go adds restart-safety coverage: the pipeline is interrupted after
each step (including mid-copy, with an empty and a stale-but-non-empty target)
and re-run to completion, asserting the final disk matches an uninterrupted
run; a separate case corrupts the shrink source and asserts the resize aborts
on e2fsck failure rather than shrinking a broken filesystem. These end-to-end
cases run a real shrink/copy of a multi-GB fixture, so they are guarded by
testing.Short and CI's test timeout is raised to 30m.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@eriknordmark eriknordmark force-pushed the resume-restart-safety branch from 131c22b to c93d703 Compare June 11, 2026 16:25
Replace the swapPartitions + removePartitions/removeAndRenumberPartitions
sequence that ends a resize with a single idempotent updatePartitions. In one
partition-table write it gives each relocated target the original partition's
identity (name, type GUID, partition GUID, attributes), sets its number (the
original number when preserveNumbers, otherwise the number it was created with),
and removes the superseded original.

The previous swap was not idempotent -- running it twice restored the original
arrangement -- so a resize interrupted between the swap and the removal could
not be safely re-run. updatePartitions keys on each partition's on-disk start
offset, which is stable across this phase (names and numbers change), sets the
desired final state directly rather than exchanging values, and treats an
already-removed original as a no-op. planResizes additionally skips a grow whose
partition is already at the requested size, so a re-run after a fully completed
resize converges to a no-op. Together these make every interruption point in the
pipeline safe to resume; resume_test.go now asserts the post-finalize case
(afterUpdatePartitions) that was previously skipped as a known-broken design TBD.

swapPartitions, removePartitions and removeAndRenumberPartitions remain defined
but are no longer called by resize; they stay unit-tested for reference. A new
TestUpdatePartitions covers the finalize transformation in both numbering modes.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@deitch deitch merged commit 2be589c into diskfs:main Jun 11, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues.

Resuming after createPartitions re-plans a second shrink and corrupts the disk
