fix(discovery-plugin): startup plugin pings are serial, add more fault-tolerance #1496

Merged
andrewazores merged 8 commits into cryostatio:main from andrewazores:registration-herd
May 4, 2026
@andrewazores
Member

Welcome to Cryostat! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: `git fetch upstream && git rebase --force --gpg-sign upstream/main`


Related to #1483

Description of the change:

This PR makes further improvements to the discovery plugin registration cycle. After the recent Cryostat and Agent PRs, registration is stable: Agents register with Cryostat reliably, and Agents that go offline and are replaced are handled as expected. However, if a very large number of Agents come online simultaneously, Cryostat can be overwhelmed by the flood of requests, which can cause thread pool or JDBC connection pool starvation, or even OOM crashes. This PR:

  1. Adds more fault-tolerance annotations and tuning to ensure Cryostat is resilient to registration flooding.
  2. Changes how plugins are verified at startup. When Cryostat restarts (for example, after being OOM-killed by a thundering herd of Agents registering at once), it tries to verify that the DiscoveryPlugin instances already in the database are not stale. Prior to this PR, it did so by creating a new RefreshPluginJob for each plugin and executing them all immediately, which invited the Agents to form a thundering herd and overwhelm Cryostat again. The startup ping is now batched and metered out slowly, so that Cryostat can process the DiscoveryPlugin instances at a sustainable rate, verify that each one is not stale, and recover the state of the system.
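
For readers unfamiliar with the pattern, the batched and metered startup ping described in item 2 can be sketched roughly as follows. This is a minimal illustration in plain Java and not the code from this PR: the class `MeteredStartupPing`, the methods `partition` and `pingAll`, and the batch size and pause values are all hypothetical, and the real change works with Cryostat's DiscoveryPlugin and RefreshPluginJob types rather than a generic callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: names and parameters are assumptions, not Cryostat's API.
class MeteredStartupPing {

    /** Splits items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Pings plugins serially, one batch at a time, sleeping between batches
     * so that the callbacks triggered by the pings arrive at a bounded rate.
     * Returns the number of pings that failed with a RuntimeException.
     */
    public static <T> int pingAll(
            List<T> plugins, int batchSize, long pauseMillis, Consumer<T> ping) {
        int failures = 0;
        List<List<T>> batches = partition(plugins, batchSize);
        for (int b = 0; b < batches.size(); b++) {
            for (T plugin : batches.get(b)) {
                try {
                    ping.accept(plugin);
                } catch (RuntimeException e) {
                    // A failed ping marks one plugin as suspect; keep going.
                    failures++;
                }
            }
            if (b < batches.size() - 1) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop early.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> plugins = List.of("agent-1", "agent-2", "agent-3", "agent-4", "agent-5");
        int failures = pingAll(plugins, 2, 100L, p -> System.out.println("ping " + p));
        System.out.println("failures: " + failures);
    }
}
```

The point of the sketch is the contrast with the previous behaviour: instead of launching one job per plugin all at once, the work is bounded per batch and spaced out in time, so a restart does not itself trigger a new flood of Agent traffic.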

@jtolentino1 left a comment

@andrewazores
Member Author

/build_test

@github-actions

github-actions Bot commented May 4, 2026

Workflow started at 5/4/2026, 4:33:26 PM. View Actions Run.

@github-actions

github-actions Bot commented May 4, 2026

No WebSocket notifications schema changes detected.

@github-actions

github-actions Bot commented May 4, 2026

No GraphQL schema changes detected.

@github-actions

github-actions Bot commented May 4, 2026

No OpenAPI schema changes detected.

@github-actions

github-actions Bot commented May 4, 2026

CI build:
Integration tests pass ✅
Tests run: 46, Failures: 0, Errors: 0, Skipped: 1

https://github.com/cryostatio/cryostat/actions/runs/25342024421

@andrewazores andrewazores merged commit cf2aca9 into cryostatio:main May 4, 2026
13 checks passed
@andrewazores andrewazores deleted the registration-herd branch May 4, 2026 20:56