
sprot: fix panic on rx buffer overflow #2487

Open
evan-oxide wants to merge 7 commits into master from evan/rot-spi-rx-buf-overflow

Conversation

@evan-oxide
Contributor

If the SP sent the RoT a request that couldn't fit in its rx buffer array, the RoT's sprot task would panic. Interestingly, the code used to check for this condition and increment an rx_protocol_error_too_many_bytes counter, but the check and the counter were removed a few years ago. I revived them both, and made the sprot task report a flow error instead of panicking. I also tested that the following SPI transaction succeeds.

I did a couple cleanups in the SP's sprot server while I was at it.

@evan-oxide evan-oxide requested a review from lzrd April 22, 2026 21:36
Comment thread drv/lpc55-sprot-server/src/main.rs Outdated
  // Nope, it didn't complete. Pulse CSn.
  ringbuf_entry!(Trace::UnexpectedRotIrq);
- self.stats.csn_pulses += self.stats.csn_pulses.wrapping_add(1);
+ self.stats.csn_pulses = self.stats.csn_pulses.wrapping_add(1);
Member


take it or leave it: if we wanted to be able to use += here, we could change stats.csn_pulses to Wrapping(u32) and then always get explicitly wrapping behavior when using normal arithmetic operators. we might consider doing this separately in a follow-up PR.

Contributor Author


Interesting. I see many uses of x = x.wrapping_add(1) in hubris and no uses of Wrapping(u32) yet. If you think it's useful I'd be happy to open a separate PR that changes all of them at once.

Member


Yeah, I think we may want to do an audit for this everywhere. The nice thing about Wrapping is that we can say "we always want to use explicitly wrapping arithmetic for this type" rather than having to remember to do it at the call site, which feels less error prone. There's also a Saturating type which I think might be a bit newer. I'll go make a separate bug for this.
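A quick sketch of the two styles being compared (the field and variable names here are illustrative, not the actual hubris stats types):

```rust
use core::num::Wrapping;

fn main() {
    // Plain u32: wrapping must be requested at every call site.
    let mut manual: u32 = u32::MAX;
    manual = manual.wrapping_add(1);
    assert_eq!(manual, 0);

    // Wrapping<u32>: ordinary operators wrap by definition, so `+=`
    // is safe and the intent lives in the type, not the call site.
    let mut counter = Wrapping(u32::MAX);
    counter += Wrapping(1);
    assert_eq!(counter.0, 0);

    println!("both wrapped to 0");
}
```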

@evan-oxide evan-oxide force-pushed the evan/rot-spi-rx-buf-overflow branch from 904b067 to 6baeea1 Compare April 22, 2026 21:41
@labbott labbott requested a review from andrewjstone April 23, 2026 12:37
Contributor

@andrewjstone andrewjstone left a comment


It took me a bit to get back up to speed here, as I haven't written any hubris code in a couple of years. I think Claude largely wasted our time with this one, although the defense in depth seems reasonable.


// The rx fifo contained more bytes than could fit in the buffer, which
// is more than we ever expect to receive in one request message.
if bytes_received > rx_buf.len() {
Contributor


This seems fine, but is also largely unnecessary. We statically size the buffers to always be able to handle a request or response.

I don't believe we've ever actually seen this issue in practice since #1523 merged.

The only reasons I can think of where this would matter are:

  1. The SP was misbehaving significantly
  2. We resized the buffers and messages. The SP was updated first and sent a message to the RoT that didn't fit in the old buffer.

Both would be major problems in production. For (2) we can and should avoid it by just not resizing the buffers :). If we decide we must resize then we can do it over two releases depending upon direction:

  1. If we are increasing buffer size, we just go ahead and do this, but don't add any new messages that need the larger buffer yet. Then in the second release we add the larger messages. This ensures we never overflow.
  2. If we are decreasing the buffer size, we deprecate messages that are larger and stop sending them in the first release. Then in the second release we decrease the buffer size and remove the messages.

All in all, this is fine as a defensive measure, but I find it largely unnecessary. IIRC the reason I did it the way I did was because without fault management at the time I figured it would be easier to spot this major issue (SP misbehavior) occurring from a crash than with an error response and retry. But I'm not sure that's actually true or valid. So I'm generally fine with just not crashing the RoT lol.

However, I don't think we should overload the same error code. This is not a FIFO overrun and it isn't really related to flow control. I'd call it something else like RequestMsgTooLarge or something similar so that it stands out and helps us debug if this ever happens in the future. It would be a good sign to never see this distinct error code occur in practice.
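A minimal sketch of the revived guard with the distinct error variant suggested here (hypothetical shapes; the actual hubris types and names differ):

```rust
// Hypothetical stand-in for the RoT sprot task's internal error enum.
#[derive(Debug, PartialEq)]
enum IoError {
    RequestMsgTooLarge,
}

/// Report an error instead of panicking when the rx fifo delivered
/// more bytes than the request buffer can hold.
fn check_rx_len(bytes_received: usize, rx_buf: &[u8]) -> Result<(), IoError> {
    if bytes_received > rx_buf.len() {
        return Err(IoError::RequestMsgTooLarge);
    }
    Ok(())
}

fn main() {
    let rx_buf = [0u8; 4];
    assert_eq!(check_rx_len(4, &rx_buf), Ok(()));
    assert_eq!(check_rx_len(5, &rx_buf), Err(IoError::RequestMsgTooLarge));
    println!("guard ok");
}
```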

Contributor Author


I just pushed a commit that adds IOError::RequestMsgTooLarge and converts it to the existing SprotProtocolError::BadMessageLength error when reporting to the SP. Do you think that's a good approach? I could add a new SprotProtocolError but that would require updating the management-gateway-service repo, and BadMessageLength seems like a pretty good fit even though it's mostly used for a slightly different case.
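A sketch of that conversion, with both enums reduced to hypothetical one-variant stand-ins (the real definitions in hubris and management-gateway-service have many more variants):

```rust
// Hypothetical stand-ins for the two error types discussed above.
#[derive(Debug, PartialEq)]
enum IOError {
    RequestMsgTooLarge,
}

#[derive(Debug, PartialEq)]
enum SprotProtocolError {
    BadMessageLength,
}

// Translate the RoT-internal error into an existing wire-protocol
// error before reporting to the SP, so the shared protocol definition
// does not need to change.
impl From<IOError> for SprotProtocolError {
    fn from(e: IOError) -> Self {
        match e {
            IOError::RequestMsgTooLarge => SprotProtocolError::BadMessageLength,
        }
    }
}

fn main() {
    let wire: SprotProtocolError = IOError::RequestMsgTooLarge.into();
    assert_eq!(wire, SprotProtocolError::BadMessageLength);
    println!("mapped");
}
```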

Contributor


That seems perfect!

}

- if total_size > RESPONSE_BUF_SIZE {
+ if total_size > rx_buf.len() {
Contributor


This is a no-op and I find the new version less clear than the original. In fact I had to go read the code to ensure that they were the same.

The reason I find it harder to understand is the lack of local reasoning with the new version. I know from looking at the capitalized name that it is a constant, fixed buffer size. When looking at rx_buf.len() I have no such guarantee. If, as is the case here, we have an array, then the values must be the same. The length of an array doesn't change. However, if we had used something like ArrayVec and called .len() it would give us the number of elements in the "vector", not the capacity of the array. I couldn't recall what I used in the latest version of the code and so I had to go check and make sure.

For that reason I'd prefer not to make this change.
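The len-versus-capacity distinction can be shown with std types (`Vec` standing in here for the `ArrayVec` mentioned; purely illustrative):

```rust
fn main() {
    // An array's len() is its fixed size; it can never disagree with
    // the SIZE constant it was declared with.
    let arr_buf: [u8; 8] = [0; 8];
    assert_eq!(arr_buf.len(), 8);

    // A growable container's len() is the current element count, not
    // the capacity, which is the ambiguity the comment describes.
    let vec_buf: Vec<u8> = Vec::with_capacity(8);
    assert_eq!(vec_buf.len(), 0);
    assert!(vec_buf.capacity() >= 8);

    println!("len vs capacity ok");
}
```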

Contributor


I take this one back! I was looking at the lpc55 code locally. It's much less clear here in the call chain. In fact we could change to take a shorter slice and the constant would be wrong. Good fix. Sorry for the confusion.

Contributor Author


Ok, I'll revert it. My thought was that the new version improved local reasoning, because now the function was more self-contained and would work with any array passed in as rx_buf, instead of only the one array that is defined elsewhere to be RESPONSE_BUF_SIZE bytes long. But it does get more complicated if you consider types other than arrays.

Contributor


> Ok, I'll revert it. My thought was that the new version improved local reasoning, because now the function was more self-contained and would work with any array passed in as rx_buf, instead of only the one array that is defined elsewhere to be RESPONSE_BUF_SIZE bytes long. But it does get more complicated if you consider types other than arrays.

No no! I actually think you are correct here. I tried to alert you that I was wrong and you were right in my second comment. I was looking at the wrong file. I thought my comment only made sense (if it makes sense at all) on the lpc55 side. On the SP side we are passing a slice and it doesn't matter what the type is. I just really haven't looked at this code in a while and got confused.

At this point totally up to you what you want to do. You can leave it reverted or add it back ;) Sorry about the confusion.

Contributor Author


Oh lol, I didn't see the second comment. I really wish GitHub made sure to show you all new comments in a thread before letting you reply, instead of hoping you reload the page...

@andrewjstone
Contributor

@evan-oxide How did you test this code? I think it's good to go, but if you are set up to test it locally that would be best. The best way I've found is to do an upgrade from faux-mgs to the gimletlet. I don't have a local setup anymore though. Everything is packed away currently. In any event I'm sure someone can help get you squared away.

Comment thread drv/sprot-api/src/lib.rs Outdated
/// The number of times an SP sent more bytes than expected for one
/// message. In other words, the number of bytes sent by the SP to the RoT
/// between CSn assert and CSn de-assert exceeds `REQUEST_BUF_SIZE`.
pub rx_protocol_error_too_many_bytes: u32,
Contributor


I think this is fine, but if run with mixed RoT/SP versions it may cause issues. I don't think we need to use it until after update though, so it is probably fine.

@evan-oxide
Contributor Author

@andrewjstone Thanks for thinking about this code again! I agree this issue was not a serious concern - I'm mostly treating it as a learning exercise in how sprot works and how to use humility.

> How did you test this code?

I flashed a grapefruit's SP with this, making it send too many bytes in the rot_state request. I then used hiffy to call SpRot.rot_state and saw that the RoT sprot task's generation number increased. After flashing the RoT with the fix, calling SpRot.rot_state did not increase the generation number but did increment the rx_protocol_error_too_many_bytes count. Calls to SpRot.status and SpRot.io_stats always succeeded, before and after the fix.

Is there other testing you'd like to see?

@andrewjstone
Contributor

andrewjstone commented Apr 24, 2026

> @andrewjstone Thanks for thinking about this code again! I agree this issue was not a serious concern - I'm mostly treating it as a learning exercise in how sprot works and how to use humility.

Awesome. That's what I had hoped. Sorry if that comment came off as rude. I'm super happy to have you aboard and working on this. There is certainly room for improvement!

> How did you test this code?

> I flashed a grapefruit's SP with this, making it send too many bytes in the rot_state request. I then used hiffy to call SpRot.rot_state and saw that the RoT sprot task's generation number increased. After flashing the RoT with the fix, calling SpRot.rot_state did not increase the generation number but did increment the rx_protocol_error_too_many_bytes count. Calls to SpRot.status and SpRot.io_stats always succeeded, before and after the fix.

> Is there other testing you'd like to see?

That's a very good test. In the past we discovered issues when we tried to do updates, because it's a lot of messages sent back to back. However, I can't see how that would matter for this change except for message changes that are probably unused. I think that your test is likely good enough.

One other possible easy test you can do with your current setup is flashing the RoT first and seeing what happens when the SP receives an error response that it doesn't understand. And then also trying to get the stats from the SP and see what happens. Hopefully the SP task doesn't crash. Even if it does I don't think this is actually a problem for production. We don't expect to actually hit this error at all, and I don't believe we check stats during updates. Only if the update went wrong and we couldn't collect stats would this be a problem.

A somewhat more complicated test would be to actually try to perform updates from faux-mgs with the RoT updated first. I honestly don't recall the ordering, but trying the update in both orders and seeing what happens would be a great test. Up to you whether you think it is worth it. We'd almost certainly catch any problems on dogfood or someone would catch it on a racklette before we ship.

I'm probably not the best person to get you set up with this type of testing right now since I haven't done it in a long time, but it shouldn't be hard to set up since you have a grapefruit. Any of a number of hubris folks can help.

@evan-oxide
Contributor Author

It didn't sound rude! I appreciate the context about SP/RoT communication, what we're likely to see happen in practice, and how to think about updates 🙂

I'll try to run the faux-mgs update test, mostly because that sounds like a useful thing to know how to do. If that turns out to be too much of a pain then I'll do the simpler test of flashing the RoT first.
