Skip to content

[ci] Add surefire fork timeouts to prevent CI hangs#6186

Closed
joewiz wants to merge 2 commits intoeXist-db:developfrom
joewiz:bugfix/ci-surefire-timeouts
Closed

[ci] Add surefire fork timeouts to prevent CI hangs#6186
joewiz wants to merge 2 commits intoeXist-db:developfrom
joewiz:bugfix/ci-surefire-timeouts

Conversation

@joewiz
Copy link
Copy Markdown
Member

@joewiz joewiz commented Mar 26, 2026

Summary

Two CI reliability improvements:

  1. Surefire fork timeouts: When a test like DeadlockIT or MoveResourceTest hangs during CI, the surefire/failsafe forked JVM waits indefinitely — the only protection is the GitHub Actions step timeout at 45 minutes. This burns CI minutes and blocks PR merges. Adding surefire fork timeouts kills hung test JVMs after 10 minutes instead of 45.

  2. Maven Build wagon timeouts + offline license check: Several Maven repositories (repo.exist-db.org, repo.evolvedbinary.com) cause stalled or failed HTTP connections from CI runners (CLOSE_WAIT and 504 errors respectively). This causes the Maven Build step to hang silently for 25+ minutes until the 30-minute step timeout fires.

What changed

exist-parent/pom.xml

Added to both maven-surefire-plugin and maven-failsafe-plugin configuration:

<forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
<forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
  • forkedProcessTimeoutInSeconds=600: Kills the forked JVM after 10 minutes. Clean test runs complete in ~3.5 minutes, so this only fires on hung tests.
  • forkedProcessExitTimeoutInSeconds=60: Gives the fork 60 seconds to flush results before force-kill.

.github/workflows/ci-test.yml

Three changes:

  1. Maven Build: wagon timeout via MAVEN_OPTS. Added wagon connection/read timeouts (10s/30s) to the Maven Build step's MAVEN_OPTS. Setting these in MAVEN_OPTS (not as -D flags on the mvnd command line) ensures they are passed as JVM system properties to the Maven daemon — which is where wagon reads them. Causes stalled connections to fail quickly instead of hanging for 25+ minutes.

  2. License check: run offline. Added --offline to the license:check step. The license job always restores the Maven cache from develop (which has all required artifacts), so no network access is needed. Eliminates 7+ minute timeouts from the same repo issues. Result: ~20s.

  3. Integration test step timeout: 45 → 30 minutes. With surefire killing hung forks at 10 minutes, 30 minutes is ample for the full integration suite.

Evidence the fix works

An earlier iteration of this PR proved the surefire timeout infrastructure works:

Metric Before (no timeout) After (this PR)
DeadlockIT Hung for 44 min, killed by step timeout Completed in 522s (under 600s limit)
Windows integration total 49 min (step timeout) 14 min 55s
Tests lost Fork killed mid-run, results lost All 9 tests passed, 0 failures

Why these values

  • Clean test suite runs complete in ~3.5 min (local) and ~16 min (CI with all modules)
  • Hung tests (DeadlockIT, MoveResourceTest) run indefinitely without intervention
  • 10 minutes (600s) is generous enough to never kill a slow-but-legitimate test run
  • 30s wagon read timeout: legitimate artifact downloads complete in seconds; CLOSE_WAIT connections produce no data

Test plan

  • License check completes in ~20s (offline, no repo network calls)
  • Maven Build completes without repo timeout hangs
  • Integration tests: hung forks killed after 10 min, not 45
  • No regressions: all clean tests still pass within the 600s fork timeout

Configure forkedProcessTimeoutInSeconds=600 and
forkedProcessExitTimeoutInSeconds=60 in both maven-surefire-plugin
and maven-failsafe-plugin in exist-parent/pom.xml. This kills forked
JVMs that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes
instead of waiting indefinitely for the 45-minute GitHub Actions step
timeout.

Also reduce the integration test step timeout from 45 to 30 minutes
in ci-test.yml — with surefire killing hung forks at 10 minutes,
30 minutes is plenty for the full integration suite.

Clean runs complete in ~3.5 minutes; the 600s timeout is a safety
net that only fires on hung tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner March 26, 2026 05:19
@dizzzz dizzzz requested review from a team, duncdrum, line-o and reinhapa March 26, 2026 21:49
joewiz added a commit to joewiz/exist that referenced this pull request Apr 8, 2026
Commit 7db70bf reduced this from 45→30 minutes alongside the surefire
forked-process timeout. However flaky infrastructure tests (MoveResource,
RenameCollection) can each hang for up to 10 minutes before the surefire
timeout kills them; with multiple hung tests the 30-minute step limit is
too tight. Restore to 45 minutes to match the pre-eXist-db#6186 baseline while
keeping the surefire fork timeout to bound individual hangs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joewiz added a commit to joewiz/exist that referenced this pull request Apr 8, 2026
Cherry-picked from PR eXist-db#6186 (bugfix/ci-surefire-timeouts) and extended
with the offline license check fix:

exist-parent/pom.xml:
  Add forkedProcessTimeoutInSeconds=600 and forkedProcessExitTimeoutInSeconds=60
  to both maven-surefire-plugin and maven-failsafe-plugin. Kills forked JVMs
  that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes instead of
  waiting for the 45-minute step timeout.

.github/workflows/ci-test.yml:
  Add --offline to the license:check step. The license job always restores
  the Maven cache from develop (which has all required artifacts), so no
  network access is needed. Eliminates timeouts caused by CLOSE_WAIT connections
  to repo.exist-db.org and repo.evolvedbinary.com returning 504 from CI runners.
  Result: ~20s instead of 7+ minutes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joewiz added a commit to joewiz/exist that referenced this pull request Apr 8, 2026
Cherry-picked from PR eXist-db#6186 (bugfix/ci-surefire-timeouts) and extended
with the offline license check fix:

exist-parent/pom.xml:
  Add forkedProcessTimeoutInSeconds=600 and forkedProcessExitTimeoutInSeconds=60
  to both maven-surefire-plugin and maven-failsafe-plugin. Kills forked JVMs
  that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes instead of
  waiting for the 45-minute step timeout.

.github/workflows/ci-test.yml:
  Add --offline to the license:check step. The license job always restores
  the Maven cache from develop (which has all required artifacts), so no
  network access is needed. Eliminates timeouts caused by CLOSE_WAIT connections
  to repo.exist-db.org and repo.evolvedbinary.com returning 504 from CI runners.
  Result: ~20s instead of 7+ minutes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two CI reliability improvements:

1. Maven Build step: Add wagon timeout flags via MAVEN_OPTS (connectionTimeout=10s,
   readTimeout=30s). Since MAVEN_OPTS is passed to the Maven daemon JVM, these
   become system properties that configure wagon's HTTP transport — causing stalled
   connections to repo.exist-db.org (CLOSE_WAIT) and repo.evolvedbinary.com
   (504/unreachable) to time out after 30s instead of hanging indefinitely.

2. License check step: Add --offline flag. The license job always restores the
   Maven cache from develop (which has all required artifacts), so no network
   access is needed. Eliminates 7+ minute timeouts from the same repo issues.
   Result: ~20s instead of 7+ minutes.

Note: wagon timeout flags must be in MAVEN_OPTS (daemon JVM system properties),
not as -D flags on the mvnd command line (which go to the client, not daemon).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joewiz joewiz force-pushed the bugfix/ci-surefire-timeouts branch from 7805dde to fda23d2 Compare April 8, 2026 20:57
Copy link
Copy Markdown
Contributor

@duncdrum duncdrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I m not quite sure why we keep running into these build issues on CI. Maven central, repo.exist-db and Github are up. It's quite unlikely that all three are suddenly struck with outages.

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
MAVEN_OPTS: >-
-DtrimStackTrace=false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like centralising these options, but we are doing it only half-heartedly e.g. -DtrimStackTrace=false is appears in the env and on every invocation. We should make this DRY. There are more flags similarly repeated

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[This response was co-authored with Claude Code. -Joe]

Agreed — this is a real smell and worth cleaning up properly. I'll address this as part of reworking the transport question; once we settle on native vs wagon the flag set will change anyway, and that's a good moment to centralize everything.

GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
MAVEN_OPTS: >-
-DtrimStackTrace=false
-D'maven.resolver.transport=wagon'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want wagon at all? Its no longer the default https://maven.apache.org/guides/mini/guide-resolver-transport.html for a reason

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[This response was co-authored with Claude Code. -Joe]

Good question, and the git history has an answer: -Dmaven.resolver.transport=wagon was added by Dannes on 2023-03-02 with commit message "testing" — almost certainly a temporary workaround for instability in Maven 3.9.0's native transport when it first became the default. It was never revisited. mvnd 1.0.3 bundles Maven 3.9.11, so whatever prompted the workaround is long fixed.

I'd like to try removing it (and the wagon-specific timeout flags) and replacing with Aether native timeout flags (aether.connector.basic.connectTimeout, aether.connector.basic.requestTimeout). Will open a test branch and report back.

Comment thread exist-parent/pom.xml
<reuseForks>true</reuseForks>
<!-- Kill forked JVMs that hang (e.g. DeadlockIT, MoveResourceTest).
10 minutes is generous; clean runs complete in ~3.5 minutes. -->
<forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea with #6045 was to move slow tests into the IT phase. Unit test should be faster, so I would expect the timeouts here to be lower compared to failsafe tests. If there are slow tests left in surefire, I would rather move them to IT.
Does surefire even fork a jvm? I thought this was a failsafe thing?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[This response was co-authored with Claude Code. -Joe]

Surefire does fork here — forkCount=2C with reuseForks=true in exist-parent/pom.xml, so forkedProcessTimeoutInSeconds is applicable. That said, your point about lower timeouts for unit tests is well taken, and moving slow stragglers to the IT phase (#6045) is the right long-term answer. For now I'll remove the forkedProcessTimeoutInSeconds from surefire and keep it only in failsafe, where it has the clearest justification.

- name: Maven Integration Tests
if: matrix.test-type == 'integration'
timeout-minutes: 45
timeout-minutes: 30
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen 37min successful runs on CI. I would suggest to be more generous with the CI timeout until we can establish a new baseline of successful CI runs, and then lower the workflow timeouts accordingly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[This response was co-authored with Claude Code. -Joe]

Good to know — reverting to 45 minutes. The 30-minute reduction was premature given you've seen legitimate 37-minute runs.

@joewiz
Copy link
Copy Markdown
Member Author

joewiz commented Apr 9, 2026

[This response was co-authored with Claude Code. -Joe]

Fair challenge. What we observed locally was CLOSE_WAIT on TCP connections to repo.exist-db.org — the server accepts the connection then sends FIN, leaving the client socket in CLOSE_WAIT and Maven's HTTP client waiting for data that never arrives. And CI logs showed HTTP 504 from repo.evolvedbinary.com. Neither means the services are "down" in the traditional sense — it's more likely a connection-lifecycle mismatch between wagon's HTTP client and how those servers handle idle connections.

But your question about wagon (see inline comment) led to something more interesting: the -Dmaven.resolver.transport=wagon flag in global MAVEN_OPTS was added by Dannes in March 2023 with commit message "testing" — likely a workaround for early Maven 3.9.0 native transport instability when it first became the default. mvnd 1.0.3 now bundles Maven 3.9.11. That flag is almost certainly stale, and removing it (letting native transport handle everything, with Aether timeout flags) is probably the right fix rather than working around wagon's behavior. Opening a test branch now.

@joewiz
Copy link
Copy Markdown
Member Author

joewiz commented Apr 13, 2026

Superseded by #6224.

@joewiz joewiz closed this Apr 13, 2026
@joewiz joewiz deleted the bugfix/ci-surefire-timeouts branch April 13, 2026 01:51
joewiz added a commit to joewiz/exist that referenced this pull request Apr 14, 2026
… update statuses

- Replace closed eXist-db#6213 (v2/jetty-12-upgrade) with eXist-db#6145 (feature/websocket-core)
- Add CI Health Note explaining known noise: integration hangs, container image
  HTTP 502, XQTS runner Saxon 12 crash, and complementary empty-match failures
  in eXist-db#6212/eXist-db#6218
- Update XQTS runner: eXist-db#45 closed, eXist-db#49 is the active PR
- Update cross-repo PR table accordingly
- Update "Also Ready to Merge" table: mark eXist-db#6142, eXist-db#6146 merged; eXist-db#6186 superseded
  by eXist-db#6224; correct eXist-db#6087 approver; add status notes for eXist-db#6182, eXist-db#6184

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants