resize: make interrupted resizes safe to resume by eriknordmark · Pull Request #14 · diskfs/partitionresizer

eriknordmark · 2026-06-10T11:55:00Z

Fixes #13.

Problem

A resize interrupted (e.g. the machine reboots mid-operation) and then re-run
could corrupt the disk. planResizes recomputes the plan from the current
on-disk state each run; once the relocated *_resized2 partitions have been
created, that space is occupied, the grows no longer fit, and a second
shrink of the shrink partition is planned — driving its target size negative.

Fix

Resume-aware planResizes: a grow whose relocated *_resized2 partition
already exists is reused in place and excluded from the space/shrink
re-planning, so no second shrink is computed.
copyFilesystems copy-skip: when the target already holds a filesystem
matching the source (sync.CompareFS), the reformat+recopy is skipped;
otherwise it reformats (over an empty or non-matching target) and recopies.
Idempotent updatePartitions finalize: replaces the non-idempotent
swapPartitions + removePartitions/removeAndRenumberPartitions finalize
with a single partition-table write that relabels/reindexes each relocated
target and removes the original, keyed on the stable on-disk start offset, so
a re-run converges instead of undoing a completed operation. This makes the
finalize step itself safe to resume.

Tests (`resume_test.go`)

TestRunResumeAfterInterruption interrupts the pipeline after each step and
re-runs to completion, asserting the final disk matches an uninterrupted run:
afterShrinkFilesystems, afterShrinkPartitions, afterCreatePartitions,
midCopyTargetFsCreated, midCopyTargetFsHasStaleFile, afterCopyFilesystems,
and afterUpdatePartitions (whole resize completed → re-run is a no-op) — in
both renumber and preserveNumbers modes. TestUpdatePartitions covers the
finalize transformation directly. TestRunAbortsOnFsckFailure asserts the
resize aborts on e2fsck failure rather than shrinking a broken filesystem.

swapPartitions/removePartitions/removeAndRenumberPartitions remain defined
but are no longer called by resize.

These cases run a real shrink/copy of a multi-GB fixture, so they are guarded by
testing.Short() and the CI test timeout is raised to 30m.

eriknordmark · 2026-06-10T18:13:52Z

@deitch ready for review. Fixes #13: an interrupted resize that was re-run could corrupt the disk (planResizes re-planned a second shrink once the relocated partitions existed, driving the shrink partition's target size negative). Makes planResizes resume-aware and adds a CompareFS-based copy-skip, plus restart-safety tests that interrupt the pipeline at each step and re-run to completion. Stacked on #6/#11 + the go-diskfs bump; the end-to-end tests need diskfs/go-diskfs#410, so CI here stays red until that merges and the pin updates (verified green locally against #410).

deitch · 2026-06-11T07:58:44Z

I like this. idempotency always was the eventual goal. It does need a rebase though.

A resize that was interrupted and re-run could corrupt the disk. planResizes recomputed the plan from the partially-modified disk, and once the relocated partitions had already been created it no longer found room for the grows, so it planned a second shrink of the shrink partition -- driving its target size negative (diskfs#13). planResizes is now resume-aware: a grow whose relocated "<label>_resized2" partition already exists is reused in place and excluded from the space/shrink planning, so no second shrink is computed. copyFilesystems additionally skips the reformat+recopy when the target already holds a matching filesystem (sync.CompareFS). resume_test.go adds restart-safety coverage: the pipeline is interrupted after each step (including mid-copy, with an empty and a stale-but-non-empty target) and re-run to completion, asserting the final disk matches an uninterrupted run; a separate case corrupts the shrink source and asserts the resize aborts on e2fsck failure rather than shrinking a broken filesystem. These end-to-end cases run a real shrink/copy of a multi-GB fixture, so they are guarded by testing.Short and CI's test timeout is raised to 30m. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the swapPartitions + removePartitions/removeAndRenumberPartitions sequence that ends a resize with a single idempotent updatePartitions. In one partition-table write it gives each relocated target the original partition's identity (name, type GUID, partition GUID, attributes), sets its number (the original number when preserveNumbers, otherwise the number it was created with), and removes the superseded original. The previous swap was not idempotent -- running it twice restored the original arrangement -- so a resize interrupted between the swap and the removal could not be safely re-run. updatePartitions keys on each partition's on-disk start offset, which is stable across this phase (names and numbers change), sets the desired final state directly rather than exchanging values, and treats an already-removed original as a no-op. planResizes additionally skips a grow whose partition is already at the requested size, so a re-run after a fully completed resize converges to a no-op. Together these make every interruption point in the pipeline safe to resume; resume_test.go now asserts the post-finalize case (afterUpdatePartitions) that was previously skipped as a known-broken design TBD. swapPartitions, removePartitions and removeAndRenumberPartitions remain defined but are no longer called by resize; they stay unit-tested for reference. A new TestUpdatePartitions covers the finalize transformation in both numbering modes. Signed-off-by: eriknordmark <erik@zededa.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

eriknordmark mentioned this pull request Jun 10, 2026

resize: EVE-patterned resume tests and SIGKILL chaos harness #15

Merged

eriknordmark marked this pull request as ready for review June 10, 2026 18:13

eriknordmark marked this pull request as draft June 11, 2026 13:42

eriknordmark force-pushed the resume-restart-safety branch from dcbca33 to 131c22b Compare June 11, 2026 15:33

eriknordmark marked this pull request as ready for review June 11, 2026 15:37

eriknordmark mentioned this pull request Jun 11, 2026

resize: integrity-check all source filesystems (e2fsck + fsck.fat) #19

Merged

eriknordmark force-pushed the resume-restart-safety branch from 131c22b to c93d703 Compare June 11, 2026 16:25

deitch merged commit 2be589c into diskfs:main Jun 11, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resize: make interrupted resizes safe to resume#14

resize: make interrupted resizes safe to resume#14
deitch merged 2 commits into
diskfs:mainfrom
eriknordmark:resume-restart-safety

eriknordmark commented Jun 10, 2026 •

edited

Loading

Uh oh!

eriknordmark commented Jun 10, 2026

Uh oh!

deitch commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eriknordmark commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests (resume_test.go)

Uh oh!

eriknordmark commented Jun 10, 2026

Uh oh!

deitch commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eriknordmark commented Jun 10, 2026 •

edited

Loading

Tests (`resume_test.go`)