[ci] Add surefire fork timeouts to prevent CI hangs#6186
joewiz wants to merge 2 commits into eXist-db:develop from
Conversation
Configure forkedProcessTimeoutInSeconds=600 and forkedProcessExitTimeoutInSeconds=60 in both maven-surefire-plugin and maven-failsafe-plugin in exist-parent/pom.xml. This kills forked JVMs that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes instead of waiting indefinitely for the 45-minute GitHub Actions step timeout.
Also reduce the integration test step timeout from 45 to 30 minutes in ci-test.yml: with surefire killing hung forks at 10 minutes, 30 minutes is plenty for the full integration suite.
Clean runs complete in ~3.5 minutes; the 600s timeout is a safety net that only fires on hung tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit 7db70bf reduced the integration test step timeout from 45 to 30 minutes alongside the surefire forked-process timeout. However, flaky infrastructure tests (MoveResource, RenameCollection) can each hang for up to 10 minutes before the surefire timeout kills them; with multiple hung tests the 30-minute step limit is too tight. Restore to 45 minutes to match the pre-eXist-db#6186 baseline while keeping the surefire fork timeout to bound individual hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cherry-picked from PR eXist-db#6186 (bugfix/ci-surefire-timeouts) and extended with the offline license check fix:
exist-parent/pom.xml: Add forkedProcessTimeoutInSeconds=600 and forkedProcessExitTimeoutInSeconds=60 to both maven-surefire-plugin and maven-failsafe-plugin. Kills forked JVMs that hang (e.g. DeadlockIT, MoveResourceTest) after 10 minutes instead of waiting for the 45-minute step timeout.
.github/workflows/ci-test.yml: Add --offline to the license:check step. The license job always restores the Maven cache from develop (which has all required artifacts), so no network access is needed. Eliminates timeouts caused by CLOSE_WAIT connections to repo.exist-db.org and repo.evolvedbinary.com returning 504 from CI runners. Result: ~20s instead of 7+ minutes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two CI reliability improvements:
1. Maven Build step: Add wagon timeout flags via MAVEN_OPTS (connectionTimeout=10s, readTimeout=30s). Since MAVEN_OPTS is passed to the Maven daemon JVM, these become system properties that configure wagon's HTTP transport, causing stalled connections to repo.exist-db.org (CLOSE_WAIT) and repo.evolvedbinary.com (504/unreachable) to time out after 30s instead of hanging indefinitely.
2. License check step: Add --offline flag. The license job always restores the Maven cache from develop (which has all required artifacts), so no network access is needed. Eliminates 7+ minute timeouts from the same repo issues. Result: ~20s instead of 7+ minutes.
Note: wagon timeout flags must be in MAVEN_OPTS (daemon JVM system properties), not as -D flags on the mvnd command line (which go to the client, not daemon).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7805dde to fda23d2
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| MAVEN_OPTS: >- | ||
| -DtrimStackTrace=false |
I like centralising these options, but we are doing it only half-heartedly, e.g. -DtrimStackTrace=false appears in the env and on every invocation. We should make this DRY. There are more flags similarly repeated.
[This response was co-authored with Claude Code. -Joe]
Agreed — this is a real smell and worth cleaning up properly. I'll address this as part of reworking the transport question; once we settle on native vs wagon the flag set will change anyway, and that's a good moment to centralize everything.
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| MAVEN_OPTS: >- | ||
| -DtrimStackTrace=false | ||
| -D'maven.resolver.transport=wagon' |
Do we want wagon at all? It's no longer the default (https://maven.apache.org/guides/mini/guide-resolver-transport.html) for a reason.
[This response was co-authored with Claude Code. -Joe]
Good question, and the git history has an answer: -Dmaven.resolver.transport=wagon was added by Dannes on 2023-03-02 with commit message "testing" — almost certainly a temporary workaround for instability in Maven 3.9.0's native transport when it first became the default. It was never revisited. mvnd 1.0.3 bundles Maven 3.9.11, so whatever prompted the workaround is long fixed.
I'd like to try removing it (and the wagon-specific timeout flags) and replacing with Aether native timeout flags (aether.connector.basic.connectTimeout, aether.connector.basic.requestTimeout). Will open a test branch and report back.
| <reuseForks>true</reuseForks> | ||
| <!-- Kill forked JVMs that hang (e.g. DeadlockIT, MoveResourceTest). | ||
| 10 minutes is generous; clean runs complete in ~3.5 minutes. --> | ||
| <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds> |
The idea with #6045 was to move slow tests into the IT phase. Unit tests should be faster, so I would expect the timeouts here to be lower than those for the failsafe tests. If there are slow tests left in surefire, I would rather move them to IT.
Does surefire even fork a JVM? I thought this was a failsafe thing?
[This response was co-authored with Claude Code. -Joe]
Surefire does fork here — forkCount=2C with reuseForks=true in exist-parent/pom.xml, so forkedProcessTimeoutInSeconds is applicable. That said, your point about lower timeouts for unit tests is well taken, and moving slow stragglers to the IT phase (#6045) is the right long-term answer. For now I'll remove the forkedProcessTimeoutInSeconds from surefire and keep it only in failsafe, where it has the clearest justification.
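A rough sketch of the arrangement described in this reply, for illustration only: surefire keeps its fork settings but carries no fork timeout, while failsafe carries the timeouts. The placement under pluginManagement and the surrounding elements are assumptions; the forkCount, reuseForks and timeout values are the ones quoted in this thread.

```xml
<!-- Sketch only: the pluginManagement placement is assumed, not taken from the PR diff. -->
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- Unit tests still run in forked JVMs (two per CPU core), reused across test classes. -->
          <forkCount>2C</forkCount>
          <reuseForks>true</reuseForks>
          <!-- No forkedProcessTimeoutInSeconds here: slow tests are meant to move to the IT phase. -->
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <configuration>
          <!-- Integration tests: kill a forked JVM that has not finished after 10 minutes. -->
          <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
          <forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
```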
| - name: Maven Integration Tests | ||
| if: matrix.test-type == 'integration' | ||
| timeout-minutes: 45 | ||
| timeout-minutes: 30 |
I've seen 37min successful runs on CI. I would suggest to be more generous with the CI timeout until we can establish a new baseline of successful CI runs, and then lower the workflow timeouts accordingly.
[This response was co-authored with Claude Code. -Joe]
Good to know — reverting to 45 minutes. The 30-minute reduction was premature given you've seen legitimate 37-minute runs.
[This response was co-authored with Claude Code. -Joe]
Fair challenge. What we observed locally was CLOSE_WAIT on TCP connections to But your question about wagon (see inline comment) led to something more interesting: the
Superseded by #6224.
… update statuses
- Replace closed eXist-db#6213 (v2/jetty-12-upgrade) with eXist-db#6145 (feature/websocket-core)
- Add CI Health Note explaining known noise: integration hangs, container image HTTP 502, XQTS runner Saxon 12 crash, and complementary empty-match failures in eXist-db#6212/eXist-db#6218
- Update XQTS runner: eXist-db#45 closed, eXist-db#49 is the active PR
- Update cross-repo PR table accordingly
- Update "Also Ready to Merge" table: mark eXist-db#6142, eXist-db#6146 merged; eXist-db#6186 superseded by eXist-db#6224; correct eXist-db#6087 approver; add status notes for eXist-db#6182, eXist-db#6184
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Two CI reliability improvements:
1. Surefire fork timeouts: When a test like DeadlockIT or MoveResourceTest hangs during CI, the surefire/failsafe forked JVM waits indefinitely; the only protection is the GitHub Actions step timeout at 45 minutes. This burns CI minutes and blocks PR merges. Adding surefire fork timeouts kills hung test JVMs after 10 minutes instead of 45.
2. Maven Build wagon timeouts + offline license check: Several Maven repositories (repo.exist-db.org, repo.evolvedbinary.com) cause stalled or failed HTTP connections from CI runners (CLOSE_WAIT and 504 errors respectively). This causes the Maven Build step to hang silently for 25+ minutes until the 30-minute step timeout fires.
What changed
exist-parent/pom.xml
Added to both maven-surefire-plugin and maven-failsafe-plugin configuration:
- forkedProcessTimeoutInSeconds=600: Kills the forked JVM after 10 minutes. Clean test runs complete in ~3.5 minutes, so this only fires on hung tests.
- forkedProcessExitTimeoutInSeconds=60: Gives the fork 60 seconds to flush results before force-kill.
.github/workflows/ci-test.yml
Three changes:
- Maven Build: wagon timeout via MAVEN_OPTS. Added wagon connection/read timeouts (10s/30s) to the Maven Build step's MAVEN_OPTS. Setting these in MAVEN_OPTS (not as -D flags on the mvnd command line) ensures they are passed as JVM system properties to the Maven daemon, which is where wagon reads them. Causes stalled connections to fail quickly instead of hanging for 25+ minutes.
- License check: run offline. Added --offline to the license:check step. The license job always restores the Maven cache from develop (which has all required artifacts), so no network access is needed. Eliminates 7+ minute timeouts from the same repo issues. Result: ~20s.
- Integration test step timeout: 45 → 30 minutes. With surefire killing hung forks at 10 minutes, 30 minutes is ample for the full integration suite.
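For reference, the exist-parent/pom.xml change described above might look roughly like the following. This is a sketch rather than the actual diff: the placement under pluginManagement and the surrounding elements are assumptions, and only the two timeout parameters and their values come from this PR.

```xml
<!-- Sketch only: element placement is assumed; the two timeout values are from this PR. -->
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- Kill the forked test JVM if it has not finished after 10 minutes. -->
          <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
          <!-- Allow the fork 60 seconds to exit before it is force-killed. -->
          <forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <configuration>
          <!-- Same limits for the integration-test forks. -->
          <forkedProcessTimeoutInSeconds>600</forkedProcessTimeoutInSeconds>
          <forkedProcessExitTimeoutInSeconds>60</forkedProcessExitTimeoutInSeconds>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
```

Keeping the values identical in both plugins matches the description above; the review thread discusses whether surefire should carry a timeout at all.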
Evidence the fix works
An earlier iteration of this PR proved the surefire timeout infrastructure works:
DeadlockIT
Why these values
DeadlockIT, MoveResourceTest) run indefinitely without intervention
Test plan