Delete shards only when notified via gossip#6504
Conversation
There was a problem hiding this comment.
💡 Codex Review
https://github.com/quickwit-oss/quickwit/blob/5e6c7026730cb6295e48c2e258a9a2c5d980c47f/quickwit-ingest/src/ingest_v2/ingester.rs#L1035-L1037
Preserve EOF deletion for ingester-only nodes
When the target node is running only the ingester role, ShardPositionsUpdate is never republished into that node's local EventBroker because start_shard_positions_service is only started for Indexer or ControlPlane nodes (quickwit-serve/src/lib.rs lines 583-586). In that deployment, the EOF gRPC truncate request is the only signal delivered directly to the ingester, so changing it to just truncate_shard leaves the closed shard and WAL queue in memory/on disk until some later retain-shards sync happens, instead of deleting it when indexing reaches EOF.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ) | ||
| } | ||
| // The queue is empty. | ||
| None => (Position::Beginning, Position::Beginning), |
There was a problem hiding this comment.
discussed offline: this isn't ideal. We'll explore if this is something we can get from the mrecordlog.
There was a problem hiding this comment.
Fixed. Now relying on the queues summary API.
Description
There are paths (deleting empty shards on init, truncation at eof) where an ingester can delete shards without the control-plane eventually knowing. It is actually easy to reproduce locally with a combination of ingest with
commit=forceand abrupt Ctrl+C.This PR makes deleting shards via gossip the only path. There's still a very subtle path where this issue can happen. This will be the object of another PR.
How was this PR tested?
Added unit tests