fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures#8300

Closed
hsubramanianaks wants to merge 1 commit into Azure:main from hsubramanianaks:fix/e2e-retry-flaky-tests

Conversation

@hsubramanianaks
Contributor

What

Add automatic retry for failed e2e tests using gotestsum --rerun-fails=1.

Why

E2e tests have no retry mechanism today. When a transient infrastructure issue causes a single test to fail (e.g., a cloud-init temp mount entering failed systemd state), the entire pipeline fails and requires manual re-run.

The VHD build pipelines already have retryCountOnTaskFailure: 3, but the e2e pipeline has nothing — no ADO-level retry and no Go-level retry.

Example flaky failure (Build 160089239)

DONE 160 tests, 68 skipped, 1 failure in 528.088s

One test failed because a transient systemd unit, run-cloud\x2dinit-tmp-tmpde1rbvp9.mount, entered the failed state. None of the other 159 tests failed (68 of them were skipped). A retry would likely have passed.

Changes

  • Added --rerun-fails=1 to the gotestsum command in .pipelines/scripts/e2e_run.sh
  • Only failed tests are rerun (not the entire suite), so overhead is minimal
  • If the test passes on retry, the suite passes — consistent with how gotestsum handles flaky tests

Risk

🟢 Low — gotestsum --rerun-fails is a well-established feature. It only reruns failed tests, does not affect passing tests, and the JUnit report correctly reflects the final outcome.

Add --rerun-fails=1 to the gotestsum command so that failed tests are
automatically rerun once before reporting failure. This handles transient
infrastructure issues like systemd bookkeeping races (e.g., cloud-init
temp mount units entering failed state) without requiring a full pipeline
re-run.

Only failed tests are rerun, not the entire suite, so the cost is minimal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a lightweight retry mechanism for flaky e2e failures by rerunning only failed Go tests once via gotestsum.

Changes:

  • Adds --rerun-fails=1 to the gotestsum invocation in the e2e pipeline script
  • Documents the rationale for the rerun behavior inline (transient/flaky infra issues)

@hsubramanianaks
Contributor Author

Closing: PR validation requires branch to be in Azure/AgentBaker, not from a fork. Will re-create from an internal branch.

 # Run the tests! Yey!
 test_exit_code=0
-./bin/gotestsum --format testdox --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
+./bin/gotestsum --format testdox --rerun-fails=1 --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
Contributor

not sure about directly enabling retries like this - might be worth gating this behind a pipeline flag or something
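
A minimal sketch of such a gate, assuming a hypothetical pipeline variable E2E_RERUN_FAILS (not part of this PR) that carries the retry count. The helper prints the --rerun-fails=N flag only when the variable is a positive integer, so retries stay off unless the pipeline opts in:

```shell
# Sketch only: E2E_RERUN_FAILS is a hypothetical variable, not defined by this PR.
# Prints "--rerun-fails=N" when E2E_RERUN_FAILS is a positive integer,
# and prints nothing otherwise (retries disabled).
rerun_fails_flag() {
  case "${E2E_RERUN_FAILS:-0}" in
    ''|*[!0-9]*|0*)
      # Unset, empty, non-numeric, or leading zero: leave retries off.
      ;;
    *)
      printf '%s\n' "--rerun-fails=${E2E_RERUN_FAILS}"
      ;;
  esac
}

# Possible use in e2e_run.sh (left as a comment so this block runs standalone).
# The substitution is deliberately unquoted so an empty result adds no argument:
#
#   ./bin/gotestsum --format testdox $(rerun_fails_flag) \
#     --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" \
#     --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" \
#     -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
```

With the variable unset the command is identical to the one before this PR. One thing that may need checking: the gotestsum documentation says that when go test arguments follow "--", --rerun-fails expects the package list to be given through --packages, and the command in this PR passes arguments after "--" without it.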

Collaborator

We want quick feedback, and we want to fix flakiness, not hide it.
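
One way to keep retries from hiding flakiness is to surface every rerun. A rough sketch, assuming gotestsum is additionally run with --rerun-fails-report pointing at a file (the path in the comment below is hypothetical) and that the script runs under Azure DevOps, where lines starting with ##vso[task.logissue type=warning] appear as warnings on the build. The report lines are passed through as opaque text; no particular report format is assumed:

```shell
# Sketch only: emits one Azure DevOps warning per non-empty line of a
# rerun report, so tests that needed a retry stay visible on the build.
# $1 is the path to the report file (hypothetical location chosen by the caller).
surface_reruns() {
  # A missing or empty report means nothing was rerun.
  [ -s "$1" ] || return 0
  # The "|| [ -n ... ]" part handles a final line without a trailing newline.
  while IFS= read -r line || [ -n "${line}" ]; do
    if [ -n "${line}" ]; then
      printf '##vso[task.logissue type=warning]flaky e2e test rerun: %s\n' "${line}"
    fi
  done < "$1"
  return 0
}

# Possible use after the gotestsum run (kept as a comment so the block runs standalone):
#
#   surface_reruns "${BUILD_SRC_DIR}/e2e/rerun-report.txt"
```

This keeps the quick feedback of a green run after a transient failure while still leaving a record of which tests were flaky, so they can be tracked and fixed.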

@hsubramanianaks
Contributor Author

Superseded by #8308 (created from internal branch).

4 participants