Skip to content

Delete shards only when notified via gossip#6504

Open
guilload wants to merge 2 commits into
mainfrom
guilload/fix-no-shards-available
Open

Delete shards only when notified via gossip#6504
guilload wants to merge 2 commits into
mainfrom
guilload/fix-no-shards-available

Conversation

@guilload

@guilload guilload commented Jun 11, 2026

Copy link
Copy Markdown
Member

Description

There are paths (deleting empty shards on init, truncation at eof) where an ingester can delete shards without the control-plane eventually knowing. It is actually easy to reproduce locally with a combination of ingest with commit=force and abrupt Ctrl+C.

This PR makes deleting shards via gossip the only path. There's still a very subtle path where this issue can happen. This will be the object of another PR.

How was this PR tested?

Added unit tests

@guilload guilload requested a review from a team as a code owner June 11, 2026 00:41
@guilload guilload requested a review from nadav-govari June 11, 2026 00:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

https://github.com/quickwit-oss/quickwit/blob/5e6c7026730cb6295e48c2e258a9a2c5d980c47f/quickwit-ingest/src/ingest_v2/ingester.rs#L1035-L1037
P2 Badge Preserve EOF deletion for ingester-only nodes

When the target node is running only the ingester role, ShardPositionsUpdate is never republished into that node's local EventBroker because start_shard_positions_service is only started for Indexer or ControlPlane nodes (quickwit-serve/src/lib.rs lines 583-586). In that deployment, the EOF gRPC truncate request is the only signal delivered directly to the ingester, so changing it to just truncate_shard leaves the closed shard and WAL queue in memory/on disk until some later retain-shards sync happens, instead of deleting it when indexing reaches EOF.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@fulmicoton-dd fulmicoton-dd added the bug Something isn't working label Jun 11, 2026
)
}
// The queue is empty.
None => (Position::Beginning, Position::Beginning),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline: this isn't ideal. We'll explore if this is something we can get from the mrecordlog.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now relying on the queues summary API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants