
fix(discovery): harden Discovery Plugin registration against failed plugins, retry floods #1482

Merged
andrewazores merged 24 commits into cryostatio:main from
andrewazores:discovery-plugin-retry-storm
Apr 27, 2026

Conversation

@andrewazores
Member

@andrewazores andrewazores commented Apr 22, 2026


Related to #189
Related to #406
Fixes #1483
See cryostatio/cryostat-agent#851

Description of the change:

  1. Fixes up some database transaction handling in Discovery, CustomDiscovery, and KubeEndpointSlicesDiscovery, including using locks on the Realm nodes and ensuring Realms have unique names.
  2. Adds consecutiveFailures, lastSuccessfulPing, lastFailedPing, backoffMultiplier, and nextPingAt columns to the DiscoveryPlugin table. These support enhanced logic for detecting when Discovery Plugins (Agents) become unreachable. Previously Cryostat would consider a plugin failed as soon as it failed a single ping check, but in practice pings may fail due to network interruptions, target application overload, and so on, so there should be some leeway. Now Cryostat only considers a plugin failed, and prunes it, once it has failed enough consecutive checks, with exponential backoff between checks.
  3. Ensures that plugin refresh jobs are only started in response to new plugin registrations, so that duplicate jobs are not created and existing jobs are not restarted when plugins refresh their registration.
  4. Adds a ConnectionPoolMonitor that periodically logs debug messages to help with troubleshooting Cryostat instances. This should help us see if database connection pool issues occur again in the future, in particular in the case where Cryostat is stuck and cannot even be used to start profiling itself. If this job is running in the background and Cryostat is already configured at the correct log level, then we may be able to determine the cause of the problem from its logs.
  5. Fixes up various exception handling in discovery, as well as in related call sites such as ActiveRecordingUpdateJob, to ensure the system is more resilient to things like race conditions where a Target may have been lost while a task was in the middle of executing.
  6. Allows idempotent plugin registration. If a plugin tries to register for the first time (without passing an id and token) but uses an existing callback and realm, AND the new plugin passes the callback ping check, then we know that an Agent instance has (somehow) lost its internal record of its own registration information but still appears to be functionally the same as a previously-registered Agent instance. Previously Cryostat would refuse the registration because the plugin appeared to be a duplicate. Now we treat it as a replacement of the same plugin (since it has the same identity and passes our identification checks) and pass it back its ID and a fresh token.
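
To illustrate the failure tracking described in item 2, here is a minimal sketch of how the new columns might interact. The field names match the columns listed above, but the class name `PluginPingState`, the thresholds, the base interval, and the doubling-with-cap policy are all assumptions made for illustration; they are not taken from the Cryostat source.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical per-plugin ping bookkeeping; not the actual Cryostat entity. */
final class PluginPingState {
    // Assumed tuning values, chosen only for this example.
    static final int MAX_CONSECUTIVE_FAILURES = 5;
    static final int MAX_BACKOFF_MULTIPLIER = 16;
    static final Duration BASE_INTERVAL = Duration.ofSeconds(30);

    int consecutiveFailures = 0;
    int backoffMultiplier = 1;
    Instant lastSuccessfulPing;
    Instant lastFailedPing;
    Instant nextPingAt;

    /** A successful ping resets the failure streak and the backoff. */
    void recordSuccess(Instant now) {
        consecutiveFailures = 0;
        backoffMultiplier = 1;
        lastSuccessfulPing = now;
        nextPingAt = now.plus(BASE_INTERVAL);
    }

    /** A failed ping extends the streak and doubles the wait, up to a cap. */
    void recordFailure(Instant now) {
        consecutiveFailures++;
        lastFailedPing = now;
        backoffMultiplier = Math.min(backoffMultiplier * 2, MAX_BACKOFF_MULTIPLIER);
        nextPingAt = now.plus(BASE_INTERVAL.multipliedBy(backoffMultiplier));
    }

    /** True when the plugin has never been pinged or its backoff has elapsed. */
    boolean isDue(Instant now) {
        return nextPingAt == null || !now.isBefore(nextPingAt);
    }

    /** The plugin is pruned only after enough consecutive failures. */
    boolean shouldPrune() {
        return consecutiveFailures >= MAX_CONSECUTIVE_FAILURES;
    }
}
```

In this sketch a single failed ping only delays the next check. A plugin is pruned after the assumed five failures in a row, and any successful ping in between clears the streak.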
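
The idempotent registration flow in item 6 could be sketched roughly as follows. Everything here is hypothetical: `RegistrationSketch`, the in-memory map, the `callbackPing` predicate, and the use of random UUIDs for ids and tokens are stand-ins chosen for illustration, and the real implementation works against the database and its own token scheme.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.function.Predicate;

/** Hypothetical in-memory model of idempotent plugin registration. */
final class RegistrationSketch {
    record Registration(String id, String token) {}

    private record Identity(String realm, String callback) {}

    private final Map<Identity, String> idsByIdentity = new HashMap<>();
    private final Predicate<String> callbackPing; // assumed stand-in for the ping check

    RegistrationSketch(Predicate<String> callbackPing) {
        this.callbackPing = callbackPing;
    }

    /** Handles a first-time registration request, i.e. one without id and token. */
    Optional<Registration> register(String realm, String callback) {
        Identity identity = new Identity(realm, callback);
        String existingId = idsByIdentity.get(identity);
        if (existingId == null) {
            // Genuinely new plugin: assign a new id.
            String id = UUID.randomUUID().toString();
            idsByIdentity.put(identity, id);
            return Optional.of(new Registration(id, UUID.randomUUID().toString()));
        }
        if (!callbackPing.test(callback)) {
            // Same realm and callback but unreachable: still refused as a duplicate.
            return Optional.empty();
        }
        // Same identity and reachable: treat as a replacement, same id, fresh token.
        return Optional.of(new Registration(existingId, UUID.randomUUID().toString()));
    }
}
```

Calling `register` twice with the same realm and callback returns the same id with a different token, provided the ping check passes on the second call. Whether brand-new plugins are also pinged before acceptance is not modelled here.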

@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@mergify mergify Bot requested a review from a team April 22, 2026 20:24
@andrewazores
Member Author

/build_test

@github-actions

Workflow started at 4/22/2026, 4:25:40 PM. View Actions Run.

@github-actions

No WebSocket notifications schema changes detected.

@github-actions

No OpenAPI schema changes detected.

@github-actions

No GraphQL schema changes detected.

@github-actions

CI build:
Integration tests pass ✅
Tests run: 46, Failures: 0, Errors: 0, Skipped: 1

https://github.com/cryostatio/cryostat/actions/runs/24800846425

@andrewazores andrewazores force-pushed the discovery-plugin-retry-storm branch from 2f11eea to 6c39e95 Compare April 24, 2026 19:45
@andrewazores andrewazores requested review from a team and removed request for a team April 24, 2026 20:17
@andrewazores
Member Author

/build_test

@github-actions

Workflow started at 4/24/2026, 4:17:57 PM. View Actions Run.

@github-actions

No WebSocket notifications schema changes detected.

@github-actions

No OpenAPI schema changes detected.

@github-actions

No GraphQL schema changes detected.

@github-actions

CI build:
Integration tests pass ✅
Tests run: 46, Failures: 0, Errors: 0, Skipped: 1

https://github.com/cryostatio/cryostat/actions/runs/24909934429

@andrewazores andrewazores marked this pull request as ready for review April 24, 2026 21:02
@andrewazores andrewazores merged commit 2b4e86c into cryostatio:main Apr 27, 2026
13 checks passed
@andrewazores andrewazores deleted the discovery-plugin-retry-storm branch April 27, 2026 18:15


Development

Successfully merging this pull request may close these issues.

[Bug] Discovery Plugins (Cryostat Agent instances) can become desynchronized over time and fail to re-register

2 participants