RATIS-2507. Fix java.lang.IllegalStateException: gap between entries #1439

Open
ChenSammi wants to merge 3 commits into apache:master from ChenSammi:RATIS-2507

Conversation

@ChenSammi
Contributor

What changes were proposed in this pull request?

Fail RatisServer when RaftLog end log index is smaller than last snapshot index during startup

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/RATIS-2507

How was this patch tested?

existing unit test

@ChenSammi ChenSammi marked this pull request as draft April 24, 2026 07:59
Comment on lines +267 to +268
// If the end index is smaller than lastIndexInSnapshot, it means the state machine state is inconsistent
// with raft log state, fail the RaftServerImpl.start() to keep the state untacked.
Contributor

... state machine state is inconsistent with raft log state ...

They don't have to be consistent:

  1. leader's log index is 1000, but a (slow) follower only has log index 500
  2. the follower dies
  3. leader purges the log before 700
  4. the follower restarts and reads log up to index 500
  5. the follower installs a snapshot at index 1000
  6. the follower dies again <--- snapshot 1000 but log 500
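
The six steps above can be replayed in a toy model to show that a snapshot index ahead of the log end index is a reachable state after a restart. This is only a minimal sketch, not Ratis code: the class `FollowerModel` and all of its methods are hypothetical, and it assumes that installing a snapshot leaves the local log untouched, as the scenario describes.

```java
/** Toy model of a follower's persisted state; hypothetical, not a Ratis class. */
class FollowerModel {
  private final long logEndIndex;   // highest index in the local log
  private long snapshotIndex = -1;  // last installed snapshot index, -1 if none

  FollowerModel(long logEndIndex) {
    this.logEndIndex = logEndIndex;
  }

  /** Assumption: installing a snapshot does not rewrite the local log. */
  void installSnapshot(long index) {
    snapshotIndex = Math.max(snapshotIndex, index);
  }

  long getLogEndIndex() {
    return logEndIndex;
  }

  long getSnapshotIndex() {
    return snapshotIndex;
  }

  /** True when the entries between the log end and the snapshot are missing locally. */
  boolean snapshotAheadOfLog() {
    return snapshotIndex > logEndIndex;
  }

  /** Replays steps 1-6 with the example indices from the comment. */
  static FollowerModel replayScenario() {
    // Steps 1, 2 and 4: the slow follower dies and restarts with a log ending at 500.
    FollowerModel follower = new FollowerModel(500);
    // Step 3 happens on the leader: entries before 700 are purged,
    // so the follower can no longer catch up from the leader's log alone.
    // Step 5: the follower installs the leader's snapshot at index 1000.
    follower.installSnapshot(1000);
    // Step 6: this is the state the follower finds on its next restart.
    return follower;
  }

  public static void main(String[] args) {
    FollowerModel f = replayScenario();
    System.out.println("snapshot=" + f.getSnapshotIndex()
        + " logEnd=" + f.getLogEndIndex()
        + " snapshotAheadOfLog=" + f.snapshotAheadOfLog());
  }
}
```

In this model the final state is legitimate rather than corrupt, which is the point of the comment: a startup check that rejects it would also reject a follower that simply installed a snapshot before crashing.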

Contributor Author

@szetszwo, the above is exactly the case I saw. For this "snapshot 1000 but log 500" case, current Ratis will still start the RaftServer, which I think is not a good choice.
In Ozone's case, it leads to an OM shutdown due to "java.lang.IllegalStateException: gap between entry term", https://issues.apache.org/jira/browse/HDDS-15103. Do you have a better suggestion for the fix?

Contributor

@ChenSammi, for HDDS-15103, let's try to create a unit test to reproduce it. Then it would be easier to see how to fix it.

Contributor Author

Sorry, not HDDS-15103, but HDDS-15068.

HDDS-15103 is fixed.

Contributor

For HDDS-15068, Ratis should not write to the RaftLog when there is a gap. Would you like to fix it in this PR (or file a new JIRA)? If not, I am happy to do it.
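
One way to read "should not write to the RaftLog when there is a gap" is a contiguity check before each append. The sketch below is hypothetical and is not the change made in this PR: `ContiguousLog` and `tryAppend` are invented names, and it assumes that an empty log may start at any index (for example, right after a snapshot).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory log that refuses non-contiguous appends; hypothetical, not a Ratis class. */
class ContiguousLog {
  private final List<Long> indices = new ArrayList<>();

  /** Returns the last stored index, or -1 when the log is empty. */
  long lastIndex() {
    return indices.isEmpty() ? -1 : indices.get(indices.size() - 1);
  }

  /**
   * Appends the entry only if it directly follows the last stored entry.
   * Assumption: an empty log accepts any starting index.
   * Returns false, without writing anything, when the append would leave a gap.
   */
  boolean tryAppend(long index) {
    if (!indices.isEmpty() && index != lastIndex() + 1) {
      return false;
    }
    indices.add(index);
    return true;
  }

  /** Drops all entries, e.g. when a snapshot supersedes the whole log. */
  void clear() {
    indices.clear();
  }

  public static void main(String[] args) {
    ContiguousLog log = new ContiguousLog();
    log.tryAppend(499);
    log.tryAppend(500);
    System.out.println("append 1001 accepted: " + log.tryAppend(1001));
    System.out.println("last index: " + log.lastIndex());
  }
}
```

With entries ending at 500, `tryAppend(1001)` is rejected and the log is left unchanged; index 1001 is accepted only after the stale entries have been cleared.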

Contributor Author

Sure. We can reuse this jira.

@ChenSammi ChenSammi changed the title RATIS-2507. Fail RatisServer when RaftLog end log index is smaller than last snapshot index during startup RATIS-2507. Fix java.lang.IllegalStateException "gap between entries" Apr 29, 2026
@ChenSammi ChenSammi changed the title RATIS-2507. Fix java.lang.IllegalStateException "gap between entries" RATIS-2507. Fix java.lang.IllegalStateException: gap between entries Apr 29, 2026
@ChenSammi ChenSammi marked this pull request as ready for review April 29, 2026 08:25
Contributor

@szetszwo szetszwo left a comment

+1 the change looks good.

@ChenSammi
Contributor Author

Not sure why the UT failed; checking.
