Blog post: Redis Query Engine Triemap Port#2850
Conversation
|
This needs a better way of communicating the speedups too. Percentages are all a bit broad and ">100% speedup" is very vague compared to "0.46x the original operation's speed." In general though, this is ready for review @JonasKruckenberg |
|
I'll give this a look next Tuesday if that is alright with you! |
|
Works for me! I'll aim to get the diagrams in before then. |
| #pragma pack(1) | ||
| typedef struct { | ||
| uint16_t labelLen; | ||
| uint16_t numChildren : 9; | ||
| uint8_t flags : 7; | ||
| void *data; | ||
| char label[]; | ||
| //... here come the first letters of each child | ||
| //... now come the children pointers | ||
| } TrieMapNode; | ||
|
|
||
| #pragma pack() | ||
| ``` | ||
|
|
||
| The core property of this type is that the fields, label, and the array of pointers to children take up a **single heap allocation**. This is really important for cache locality and minimizing pointer dereferences, but it also means the size of the type and some of the offsets of the fields of that type are not known at compile time. |
There was a problem hiding this comment.
I think it'd be good to state explicitly that the actual layout is not fully specified with this struct definition, as this is very much a dynamically sized type that is allocated dynamically on the heap. The comments add some info, but what needs to made more explicit is that:
- the length of
labelis stored inlabelLen - the number of "first letters of each child" is as well as the number of "children pointers" is stored in
numChildren - Each "first letter of each child" is stored a
char
This is important as it shows that there is a lot more to it than the type definition shows, and that will need to be handled with care during implementation
|
|
||
| 2. **Debug assertions** | ||
|
|
||
| Debug assertions are checks that only run in debug builds, release builds do not pay the performance cost of them. Debug assertions let us check the invariants we want to build our API around at runtime and give good errors when those assertions fail. |
There was a problem hiding this comment.
They also further specify the invariants in code
|
|
||
| - 3.2: `multiple_unsafe_ops_per_block`: If an unsafe block contains more than one unsafe operation, there will be a warning. | ||
|
|
||
| This pushes developers in the direction of properly documenting the unsafe code they do right, operation by operation, to assure all invariants are documented. |
There was a problem hiding this comment.
And of course, we're denying warnings in CI
There was a problem hiding this comment.
I don't quite follow, could you elaborate?
There was a problem hiding this comment.
We're saying
If an unsafe block contains more than one unsafe operation, there will be a warning.
Which is great and all, but warnings can be ignored. That's why we configured CI such that it will fail if there are any warnings
Co-authored-by: Henk Oordt <hd@oordt.dev>
|
We need to title this one before I can sort an OG image for it. It's based on the "Rewrite, Optimize, Repeat: Our journey porting a triemap from C to Rust" talk, and I don't think an identical title would be appropriate but riffing off it makes sense. Looking for opinions. |
|
@tall-vase can you please confirm that @LukeMathWalker's feedback from Notion has been resolved here? If so, please let us know. Thanks! |
|
The feedback I got from Luca was that the discussion of picking a good candidate from a complex module graph didn't flow well into the TrieMap details, as well as that this should be talking about the Redis Query Engine not Redis. I can confirm I've addressed this feedback, though a second set of eyes on that is always appreciated. |
|
|
||
| A TrieMap is a key-value data structure that has a similar API as a Hashmap / Binary Tree Map, but optimized for compressing keys by their shared prefixes. | ||
|
|
||
|  |
There was a problem hiding this comment.
I believe we're transitioning too fast from "here's the data structure" to "let's discuss the nitty-gritty details of the implementation".
I'd rather have a dedicated section showing how the data structure grows/reshapes itself after each insertion, thus helping the reader build out their intuition as to how a TrieMap works.
We can include a little disclaimer right under the heading that says something along the lines of "This section explains how a TrieMap works. If you're already familiar with this data structure, skip ahead to the ".
My assumption is that few folks know (or remember) how a trie works by heart. I certainly didn't.
|
|
||
| In the final codebase we were finally left with **128** unsafe blocks and **21** unsafe methods. | ||
|
|
||
| Managing this unsafe code requires care and understanding for how developers will interact with it. Preferably a developer won't interact with it at all unless they're a maintainer. |
There was a problem hiding this comment.
The developer/maintainer distinction doesn't apply here. The project isn't really "open contributions", so all developers working on it are maintainers.
It's more about "consuming this unit of code in another module" vs "changing how this unit of code works".
|
|
||
| Note how this type is not fully definable in C's type system, the two lines of comments after `label` are fields that we can't define in a C struct definition and need to compute how to access at runtime. `labelLen` and `numChildren` let us do these computations. | ||
|
|
||
| ```c |
There was a problem hiding this comment.
This needs closer scrutiny (@LukeMathWalker ? @hdoordt ?) as my attempts at getting an MVP working properly mostly ended in segfaults.
There was a problem hiding this comment.
It is tricky to do it right, you need to account for the packed layout too and the fact that the children pointers may not be aligned.
Easier to quote the original code here: https://github.com/RediSearch/RediSearch/blob/dcf009a2327240b24d6efcf100fed577b8a5eef0/deps/triemap/triemap.c#L18
Co-authored-by: Luca Palmieri <20745048+LukeMathWalker@users.noreply.github.com>
Co-authored-by: Luca Palmieri <20745048+LukeMathWalker@users.noreply.github.com>
Co-authored-by: Luca Palmieri <20745048+LukeMathWalker@users.noreply.github.com>
Working draft of the Triemap blog post. Putting the last known draft here before I address feedback on Notion.
Preflight checks before merging: