fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures #8308

Closed
hsubramanianaks wants to merge 1 commit into main from fix/e2e-retry-flaky-tests

Conversation

@hsubramanianaks
Contributor

What

Add automatic retry for failed e2e tests using gotestsum --rerun-fails=1.

Why

The e2e tests have no retry mechanism today. When a transient infrastructure issue causes a single test to fail (e.g., a cloud-init temp mount entering a failed systemd state), the entire pipeline fails and requires a manual re-run.

The VHD build pipelines already set retryCountOnTaskFailure: 3, but the e2e pipeline has neither an ADO-level retry nor a Go-level retry.

Example flaky failure (Build 160089239)

DONE 160 tests, 68 skipped, 1 failure in 528.088s

One test failed because the transient systemd unit run-cloud\x2dinit-tmp-tmpde1rbvp9.mount entered a failed state. None of the other 159 tests failed (68 of them were skipped). A retry would likely have passed.

Changes

  • Added --rerun-fails=1 to the gotestsum command in .pipelines/scripts/e2e_run.sh
  • Only failed tests are rerun (not the entire suite), so overhead is minimal
  • If a failed test passes on retry, the suite passes, which is consistent with how gotestsum handles flaky tests
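
A minimal sketch of what the changed invocation might look like. The real contents of .pipelines/scripts/e2e_run.sh are not shown in this description, so the helper name `gotestsum_args`, the JUnit path, the package pattern, the output format, and the timeout are all assumptions for illustration; only `--rerun-fails=1` comes from this change. The gotestsum documentation notes that when `--rerun-fails` is used, the packages to test should be passed with `--packages` rather than as positional `go test` arguments, which is why the sketch does so.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the gotestsum arguments, one per line.
# Only --rerun-fails=1 is taken from this PR; every other value is an assumption.
gotestsum_args() {
  local junit_file="${1:-junit.xml}"   # assumed JUnit report path
  local packages="${2:-./...}"         # assumed package pattern
  printf '%s\n' \
    --format testname \
    --junitfile "$junit_file" \
    --rerun-fails=1 \
    --packages "$packages" \
    -- \
    -timeout=90m
}

# Usage (requires gotestsum on PATH and bash 4+ for mapfile):
#   mapfile -t args < <(gotestsum_args report.xml ./...)
#   gotestsum "${args[@]}"
```

Everything after the bare `--` is forwarded to `go test`, so test-level flags such as the timeout stay separate from gotestsum's own options.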

Risk

🟢 Low — gotestsum --rerun-fails is a well-established feature. It only reruns failed tests, does not affect passing tests, and the JUnit report correctly reflects the final outcome.
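
The retry semantics described above can be illustrated with a toy model. This is not gotestsum's implementation: `run_test` and `run_suite` are hypothetical stand-ins, with a test named `flaky` that fails on its first attempt and a test named `broken` that always fails. The model also ignores gotestsum's documented `--rerun-fails-max-failures` limit, which skips the rerun when the initial run has too many failures.

```shell
#!/usr/bin/env bash

# Counts how many times the simulated flaky test has been attempted.
flaky_attempts=0

# Hypothetical stand-in for a real test run.
run_test() {
  case "$1" in
    flaky)
      flaky_attempts=$(( flaky_attempts + 1 ))
      [ "$flaky_attempts" -gt 1 ]   # fails on attempt 1, passes afterwards
      ;;
    broken) return 1 ;;             # a genuine failure: never passes
    *)      return 0 ;;             # every other test passes
  esac
}

# Runs every test once, then reruns only the failures, up to max_reruns times.
# Returns 0 only if every test has passed at least once.
run_suite() {
  local max_reruns="$1"; shift
  local failed=() still_failed=() t i
  for t in "$@"; do
    run_test "$t" || failed+=("$t")
  done
  for (( i = 0; i < max_reruns && ${#failed[@]} > 0; i++ )); do
    still_failed=()
    for t in "${failed[@]}"; do
      run_test "$t" || still_failed+=("$t")
    done
    failed=("${still_failed[@]}")
  done
  [ "${#failed[@]}" -eq 0 ]
}

# Usage: a single rerun rescues the flaky test but not the broken one.
#   run_suite 1 ok_a flaky ok_b   && echo "suite passed"
#   run_suite 1 ok_a broken       || echo "suite failed"
```

In this model, passing tests are executed exactly once, so the extra cost of a rerun is proportional to the number of initial failures rather than to the size of the suite.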

Add --rerun-fails=1 to the gotestsum command so that failed tests are
automatically rerun once before reporting failure. This handles transient
infrastructure issues like systemd bookkeeping races (e.g., cloud-init
temp mount units entering failed state) without requiring a full pipeline
re-run.

Only failed tests are rerun, not the entire suite, so the cost is minimal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a lightweight retry mechanism for flaky e2e Go tests by rerunning only failed tests once via gotestsum.

Changes:

  • Adds gotestsum --rerun-fails=1 to rerun only failed e2e tests once
  • Documents the rationale for the retry behavior in the e2e runner script

@hsubramanianaks
Contributor Author

Closing this one due to comments - #8300 (comment)

2 participants