fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures by hsubramanianaks · Pull Request #8308 · Azure/AgentBaker

hsubramanianaks · 2026-04-14T19:12:00Z

What

Add automatic retry for failed e2e tests using gotestsum --rerun-fails=1.

Why

The e2e tests have no retry mechanism today. When a transient infrastructure issue causes a single test to fail (e.g., a cloud-init temp mount entering a failed systemd state), the entire pipeline fails and requires a manual re-run.

The VHD build pipelines already have retryCountOnTaskFailure: 3, but the e2e pipeline has nothing — no ADO-level retry and no Go-level retry.

Example flaky failure (Build 160089239)

DONE 160 tests, 68 skipped, 1 failure in 528.088s

One test failed because a transient run-cloud\x2dinit-tmp-tmpde1rbvp9.mount systemd unit entered the failed state. The other 159 tests either passed or were skipped. A retry would likely have passed.

Changes

Added --rerun-fails=1 to the gotestsum command in .pipelines/scripts/e2e_run.sh
Only failed tests are rerun (not the entire suite), so overhead is minimal
If the test passes on retry, the suite passes — consistent with how gotestsum handles flaky tests
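
A minimal sketch of what such an invocation might look like. The diff is not shown on this page, so everything except --rerun-fails=1 is an assumption made for illustration: the output format, the JUnit path, the package pattern, the timeout, and the DRY_RUN switch are hypothetical and not taken from .pipelines/scripts/e2e_run.sh. Packages are passed through --packages because, as far as I understand the gotestsum documentation, that is how the tool knows which packages to target when it reruns failures.

```shell
#!/bin/sh
set -eu

# Assumed defaults for illustration only; the real e2e_run.sh may differ.
RERUN_FAILS="${RERUN_FAILS:-1}"        # rerun each failed test at most once
JUNIT_FILE="${JUNIT_FILE:-junit.xml}"  # hypothetical report path
PACKAGES="${PACKAGES:-./...}"          # hypothetical package pattern
DRY_RUN="${DRY_RUN:-1}"                # 1 = print the command instead of running it

run_e2e() {
  # Build the argument list in this function's positional parameters.
  # Everything after the bare "--" is handed to `go test`.
  set -- \
    --format testname \
    --junitfile "$JUNIT_FILE" \
    "--rerun-fails=$RERUN_FAILS" \
    --packages "$PACKAGES" \
    -- -timeout 90m

  if [ "$DRY_RUN" = "1" ]; then
    echo "gotestsum $*"
  else
    gotestsum "$@"
  fi
}

run_e2e
```

Running the sketch with DRY_RUN=1 (the default here) only prints the assembled command; setting DRY_RUN=0 would execute it. gotestsum also has a --rerun-fails-max-failures flag that skips the rerun when the initial run has too many failures, which may be worth checking against the single-failure case described above.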

Risk

🟢 Low — gotestsum --rerun-fails is a well-established feature. It only reruns failed tests, does not affect passing tests, and the JUnit report correctly reflects the final outcome.
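
To make the claimed semantics concrete, here is a small simulation in plain shell. It is not how gotestsum is implemented; it only illustrates the behavior described above: run every test once, rerun only the ones that failed, and fail the suite only if a test fails again. The test functions and the state file are hypothetical stand-ins.

```shell
#!/bin/sh
set -eu

# State file used by the simulated flaky test (empty = not yet run).
STATE_FILE="$(mktemp)"

test_ok()     { return 0; }   # always passes
test_broken() { return 1; }   # always fails (a real failure)
test_flaky() {
  # Fails on the first call, passes on every later call.
  if [ -s "$STATE_FILE" ]; then
    return 0
  fi
  echo "seen" > "$STATE_FILE"
  return 1
}

# run_suite TEST...: run each test once, then rerun only the failures once.
# Returns 0 when no test is still failing after the rerun.
run_suite() {
  failed=""
  for t in "$@"; do
    if "$t"; then echo "PASS $t"; else echo "FAIL $t"; failed="$failed $t"; fi
  done

  still_failed=""
  for t in $failed; do
    if "$t"; then
      echo "PASS $t (rerun)"
    else
      echo "FAIL $t (rerun)"
      still_failed="$still_failed $t"
    fi
  done

  [ -z "$still_failed" ]
}

if run_suite test_ok test_flaky; then echo "suite: PASS"; fi
if ! run_suite test_ok test_broken; then echo "suite: FAIL"; fi
rm -f "$STATE_FILE"
```

In the first suite, test_ok is never run a second time and test_flaky passes on its rerun, so the suite passes. In the second, test_broken fails both attempts, so a genuine failure still fails the suite.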

Add --rerun-fails=1 to the gotestsum command so that failed tests are automatically rerun once before reporting failure. This handles transient infrastructure issues like systemd bookkeeping races (e.g., cloud-init temp mount units entering failed state) without requiring a full pipeline re-run. Only failed tests are rerun, not the entire suite, so the cost is minimal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a lightweight retry mechanism for flaky e2e Go tests by rerunning only failed tests once via gotestsum.

Changes:

Adds gotestsum --rerun-fails=1 to rerun only failed e2e tests once
Documents the rationale for the retry behavior in the e2e runner script

hsubramanianaks · 2026-04-14T19:29:06Z

Closing this one due to comments - #8300 (comment)

Copilot AI review requested due to automatic review settings April 14, 2026 19:12

hsubramanianaks requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 14, 2026 19:12

hsubramanianaks temporarily deployed to test April 14, 2026 19:12 — with GitHub Actions Inactive

hsubramanianaks mentioned this pull request Apr 14, 2026

fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures #8300

Closed

Copilot AI reviewed Apr 14, 2026

hsubramanianaks closed this Apr 14, 2026

Copilot started reviewing on behalf of hsubramanianaks April 14, 2026 21:11

hsubramanianaks wants to merge 1 commit into main from fix/e2e-retry-flaky-tests

2 participants
