Conversation
    // Nope, it didn't complete. Pulse CSn.
    ringbuf_entry!(Trace::UnexpectedRotIrq);
-   self.stats.csn_pulses += self.stats.csn_pulses.wrapping_add(1);
+   self.stats.csn_pulses = self.stats.csn_pulses.wrapping_add(1);
take it or leave it: if we wanted to be able to use += here, we could change stats.csn_pulses to Wrapping(u32) and then always get explicitly wrapping behavior when using normal arithmetic operators. we might consider doing this separately in a follow-up PR.
Interesting. I see many uses of x = x.wrapping_add(1) in hubris and no uses of Wrapping(u32) yet. If you think it's useful I'd be happy to open a separate PR that changes all of them at once.
Yeah, I think we may want to do an audit for this everywhere. The nice thing about Wrapping is that we can say "we always want to use explicitly wrapping arithmetic for this type" rather than having to remember to do it at the call site, which feels less error prone. There's also a Saturating type which I think might be a bit newer. I'll go make a separate bug for this.
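A minimal sketch of the `Wrapping` approach (the `Stats` shape here is hypothetical; `core::num::Saturating` works analogously but clamps at the bounds instead of wrapping):

```rust
use core::num::Wrapping;

// Hypothetical stats struct: storing the counter as Wrapping<u32> makes
// every use of the normal arithmetic operators wrap by construction, so
// call sites can write `+=` without remembering `wrapping_add`.
#[derive(Default)]
pub struct Stats {
    pub csn_pulses: Wrapping<u32>,
}

pub fn record_csn_pulse(stats: &mut Stats) {
    // Wraps at u32::MAX instead of panicking in debug builds.
    stats.csn_pulses += Wrapping(1);
}

fn main() {
    let mut stats = Stats { csn_pulses: Wrapping(u32::MAX) };
    record_csn_pulse(&mut stats);
    assert_eq!(stats.csn_pulses.0, 0); // wrapped around
}
```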
Force-pushed from 904b067 to 6baeea1.
andrewjstone left a comment:
It took me a bit to get back up to speed here, as I haven't written any hubris code in a couple of years. I think Claude largely wasted our time with this one, although the defense in depth seems reasonable.
    // The rx fifo contained more bytes than could fit in the buffer, which
    // is more than we ever expect to receive in one request message.
    if bytes_received > rx_buf.len() {
This seems fine, but is also largely unnecessary. We statically size the buffers to always be able to handle a request or response.
I don't believe we've ever actually seen this issue in practice since #1523 merged.
The only reasons I can think of where this would matter are:
1. The SP was misbehaving significantly.
2. We resized the buffers and messages. The SP was updated first and sent a message to the RoT that didn't fit in the old buffer.
Both would be major problems in production. For (2) we can and should avoid it by just not resizing the buffers :). If we decide we must resize then we can do it over two releases depending upon direction:
- If we are increasing buffer size, we just go ahead and do this, but don't add any new messages that need the larger buffer yet. Then in the second release we add the larger messages. This ensures we never overflow.
- If we are decreasing the buffer size, we deprecate messages that are larger and stop sending them in the first release. Then in the second release we decrease the buffer size and remove the messages.
All in all, this is fine as a defensive measure, but I find it largely unnecessary. IIRC the reason I did it the way I did was because, without fault management at the time, I figured it would be easier to spot this major issue (SP misbehavior) from a crash than from an error response and retry. But I'm not sure that's actually true or valid. So I'm generally fine with just not crashing the RoT lol.
However, I don't think we should overload the same error code. This is not a FIFO overrun and it isn't really related to flow control. I'd call it something else like RequestMsgTooLarge or something similar so that it stands out and helps us debug if this ever happens in the future. It would be a good sign to never see this distinct error code occur in practice.
I just pushed a commit that adds IOError::RequestMsgTooLarge and converts it to the existing SprotProtocolError::BadMessageLength error when reporting to the SP. Do you think that's a good approach? I could add a new SprotProtocolError but that would require updating the management-gateway-service repo, and BadMessageLength seems like a pretty good fit even though it's mostly used for a slightly different case.
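A hedged sketch of that mapping (the enum shapes here are invented for illustration; only the variant names come from the discussion above):

```rust
// Hypothetical, simplified versions of the two error enums discussed above.
#[derive(Debug, PartialEq)]
pub enum IoError {
    FlowError,
    RequestMsgTooLarge,
}

#[derive(Debug, PartialEq)]
pub enum SprotProtocolError {
    FlowError,
    BadMessageLength,
}

// Keep a distinct internal code for debugging on the RoT, but fold it into
// an error the SP already understands so management-gateway-service does
// not need to change.
pub fn to_protocol_error(e: IoError) -> SprotProtocolError {
    match e {
        IoError::FlowError => SprotProtocolError::FlowError,
        IoError::RequestMsgTooLarge => SprotProtocolError::BadMessageLength,
    }
}

fn main() {
    assert_eq!(
        to_protocol_error(IoError::RequestMsgTooLarge),
        SprotProtocolError::BadMessageLength
    );
}
```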
    }

-   if total_size > RESPONSE_BUF_SIZE {
+   if total_size > rx_buf.len() {
This is a no-op and I find the new version less clear than the original. In fact I had to go read the code to ensure that they were the same.
The reason I find it harder to understand is the lack of local reasoning with the new version. I know from looking at the capitalized name that it is a constant, fixed buffer size. When looking at rx_buf.len() I have no such guarantee. If, as is the case here, we have an array, then the values must be the same. The length of an array doesn't change. However, if we had used something like ArrayVec and called .len() it would give us the number of elements in the "vector", not the capacity of the array. I couldn't recall what I used in the latest version of the code and so I had to go check and make sure.
For that reason I'd prefer not to make this change.
I take this one back! I was looking at the lpc55 code locally. It's much less clear here in the call chain. In fact we could change to take a shorter slice and the constant would be wrong. Good fix. Sorry for the confusion.
Ok, I'll revert it. My thought was that the new version improved local reasoning, because now the function was more self-contained and would work with any array passed in as rx_buf, instead of only the one array that is defined elsewhere to be RESPONSE_BUF_SIZE bytes long. But it does get more complicated if you consider types other than arrays.
No no! I actually think you are correct here. I tried to alert you in my second comment that I was wrong and you were right. I was looking at the wrong file. I thought my comment only made sense (if it makes sense at all) on the lpc55 side. On the SP side we are passing a slice and it doesn't matter what the type is. I just really haven't looked at this code in a while and got confused.
At this point totally up to you what you want to do. You can leave it reverted or add it back ;) Sorry about the confusion.
Oh lol, I didn't see the second comment. I really wish github made sure to show you all new comments in a thread before letting you reply, instead of hoping you reload the page...
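For the record, the distinction the thread settled on can be sketched like this (the constant's value and the function signatures are hypothetical):

```rust
const RESPONSE_BUF_SIZE: usize = 1024; // illustrative value only

// Checking against the slice the function was actually given stays correct
// for any caller, including one that passes a shorter sub-slice; a check
// against the named constant would silently go stale in that case.
pub fn fits(total_size: usize, rx_buf: &[u8]) -> bool {
    total_size <= rx_buf.len()
}

fn main() {
    let buf = [0u8; RESPONSE_BUF_SIZE];
    assert!(fits(512, &buf));
    // With a shorter slice, comparing against RESPONSE_BUF_SIZE would
    // wrongly claim the message fits:
    assert!(!fits(512, &buf[..256]));
}
```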
@evan-oxide How did you test this code? I think it's good to go, but if you are set up to test it locally that would be best. The best way I've found is to do an upgrade from faux-mgs to the gimletlet. I don't have a local setup anymore though. Everything is packed away currently. In any event I'm sure someone can help get you squared away.
    /// The number of times an SP sent more bytes than expected for one
    /// message. In other words, the number of bytes sent by the SP to the RoT
    /// between CSn assert and CSn de-assert exceeds `REQUEST_BUF_SIZE`.
    pub rx_protocol_error_too_many_bytes: u32,
I think this is fine, but if run with mixed RoT/SP versions may cause issues. I don't think we need to use it until after update though, so is probably fine.
@andrewjstone Thanks for thinking about this code again! I agree this issue was not a serious concern - I'm mostly treating it as a learning exercise in how sprot works and how to use humility.
I flashed a grapefruit's SP with this, making it send too many bytes in the […]. Is there other testing you'd like to see?
Awesome. That's what I had hoped. Sorry if that comment came off as rude. I'm super happy to have you aboard and working on this. There is certainly room for improvement!
That's a very good test. In the past we discovered issues when we tried to do updates, because it's a lot of messages sent back to back. However, I can't see how that would matter for this change except for message changes that are probably unused. I think that your test is likely good enough.

One other possible easy test you can do with your current setup is flashing the RoT first and seeing what happens when the SP receives an error response that it doesn't understand. And then also trying to get the stats from the SP and see what happens. Hopefully the SP task doesn't crash. Even if it does I don't think this is actually a problem for production. We don't expect to actually hit this error at all, and I don't believe we check stats during updates. Only if the update went wrong and we couldn't collect stats would this be a problem.

A somewhat more complicated test would be to actually try to perform updates from faux-mgs with the RoT updated first. I honestly don't recall the ordering, but trying the update in both orders and seeing what happens would be a great test. Up to you whether you think it is worth it. We'd almost certainly catch any problems on dogfood or someone would catch it on a racklette before we ship. I'm probably not the best person to get you set up with this type of testing right now since I haven't done it in a long time, but it shouldn't be hard to set up since you have a grapefruit. Any of a number of hubris folks can help.
It didn't sound rude! I appreciate the context about SP/RoT communication, what we're likely to see happen in practice, and how to think about updates 🙂 I'll try to run the faux-mgs update test, mostly because that sounds like a useful thing to know how to do. If that turns out to be too much of a pain then I'll do the simpler test of flashing the RoT first.
If the SP sent the RoT a request that couldn't fit in its rx buffer array, the RoT's sprot task would panic. Interestingly, the code used to check for this condition and increment an rx_protocol_error_too_many_bytes counter, but the check and the counter were removed a few years ago. I revived them both, and made the sprot task report a flow error instead of panicking. I also tested that the following SPI transaction succeeds. I did a couple of cleanups in the SP's sprot server while I was at it.
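The overall fix can be sketched as follows (the counter name and the length check come from this PR; the function shape and the error value are hypothetical):

```rust
// On overflow, count the event and return an error to surface to the SP
// instead of panicking the RoT's sprot task.
pub fn check_rx_len(
    bytes_received: usize,
    rx_buf_len: usize,
    rx_protocol_error_too_many_bytes: &mut u32,
) -> Result<(), &'static str> {
    if bytes_received > rx_buf_len {
        *rx_protocol_error_too_many_bytes =
            rx_protocol_error_too_many_bytes.wrapping_add(1);
        return Err("request message too large");
    }
    Ok(())
}

fn main() {
    let mut counter = 0u32;
    // An oversized request increments the counter and errors out...
    assert!(check_rx_len(2048, 1024, &mut counter).is_err());
    assert_eq!(counter, 1);
    // ...while a request that fits leaves the counter alone.
    assert!(check_rx_len(512, 1024, &mut counter).is_ok());
    assert_eq!(counter, 1);
}
```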