CNDB-17012: Improve TrieMemoryIndex row count estimation#2262
CNDB-17012: Improve TrieMemoryIndex row count estimation#2262
Conversation
Checklist before you submit for review
|
58dc2a8 to
3ea235f
Compare
| } | ||
| } | ||
|
|
||
| private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyz".toCharArray(); |
There was a problem hiding this comment.
Why a limited alphabet, if we are testing all of UTF-8?
There was a problem hiding this comment.
The problem is this simplified estimation isn't very accurate if the data is not distributed uniformly. This ticket really switched the estimation from a horrible algorithm to a slightly less horrible ;). And this test is just a very basic sanity check that the algorithm works fine if the main assumption (= data distributed uniformly) kinda holds.
If we use a wider character set, there will be likely huge holes between the code points, and modeling real distribution of characters used by some real text (as in some existing natural language) becomes hairy very quickly. I got already much worse results when using a set of {a-z, 0-9}, because of a hole between encodings of letters and numbers.
Btw - this test wasn't passing earlier even for the simplified character set like that.
Im afraid we can't make this test better without further improvement of the estimation algorithm itself. We'd probably need to introduce histograms similar way like we have for sstables, but that is far out of scope of this ticket.
773b3d7 to
a238f83
Compare
Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve order for some types and broke the assertion in TrieMemoryIndex#estimateNumRowsMatchingRange. Additionally, replaces the row count estimation algorithm with a better one: - simpler - more efficient (no need to search and iterate the trie) - more accurate (especially for wider ranges)
a238f83 to
d1532b7
Compare
❌ Build ds-cassandra-pr-gate/PR-2262 rejected by Butler7 regressions found Found 7 new test failures
Found 7 known test failures |
|



Fixes a bug in TermsDistribution#toBigDecimal which didn't preserve
order for some types and broke the assertion
in TrieMemoryIndex#estimateNumRowsMatchingRange.
Additionally, replaces the row count estimation algorithm with a better
one: