
sprot: fix panic on rx buffer overflow #2487

Open
evan-oxide wants to merge 7 commits into master from evan/rot-spi-rx-buf-overflow

Conversation

@evan-oxide
Contributor

If the SP sent the RoT a request that couldn't fit in its rx buffer array, the RoT's sprot task would panic. Interestingly, the code used to check for this condition and increment an rx_protocol_error_too_many_bytes counter, but the check and the counter were removed a few years ago. I revived them both, and made the sprot task report a flow error instead of panicking. I also tested that the following SPI transaction succeeds.

I did a couple cleanups in the SP's sprot server while I was at it.

@evan-oxide evan-oxide requested a review from lzrd April 22, 2026 21:36
Comment thread drv/lpc55-sprot-server/src/main.rs Outdated
  // Nope, it didn't complete. Pulse CSn.
  ringbuf_entry!(Trace::UnexpectedRotIrq);
- self.stats.csn_pulses += self.stats.csn_pulses.wrapping_add(1);
+ self.stats.csn_pulses = self.stats.csn_pulses.wrapping_add(1);
Member


take it or leave it: if we wanted to be able to use += here, we could change stats.csn_pulses to Wrapping(u32) and then always get explicitly wrapping behavior when using normal arithmetic operators. we might consider doing this separately in a follow-up PR.

Contributor Author


Interesting. I see many uses of x = x.wrapping_add(1) in hubris and no uses of Wrapping(u32) yet. If you think it's useful I'd be happy to open a separate PR that changes all of them at once.

Member


Yeah, I think we may want to do an audit for this everywhere. The nice thing about Wrapping is that we can say "we always want to use explicitly wrapping arithmetic for this type" rather than having to remember to do it at the call site, which feels less error prone. There's also a Saturating type which I think might be a bit newer. I'll go make a separate bug for this.
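A quick sketch of the two styles being compared (the field and variable names here are illustrative, not the actual hubris stats types):

```rust
use core::num::Wrapping;

fn main() {
    // Plain u32: wrapping must be requested at every call site.
    let mut manual: u32 = u32::MAX;
    manual = manual.wrapping_add(1);
    assert_eq!(manual, 0);

    // Wrapping<u32>: ordinary operators wrap by definition, so `+=`
    // is safe and the intent lives in the type, not the call site.
    let mut counter = Wrapping(u32::MAX);
    counter += Wrapping(1);
    assert_eq!(counter.0, 0);

    println!("both wrapped to 0");
}
```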

@evan-oxide evan-oxide force-pushed the evan/rot-spi-rx-buf-overflow branch from 904b067 to 6baeea1 Compare April 22, 2026 21:41
@labbott labbott requested a review from andrewjstone April 23, 2026 12:37
Contributor

@andrewjstone andrewjstone left a comment


It took me a bit to get back up to speed here, as I haven't written any hubris code in a couple of years. I think Claude largely wasted our time with this one, although the defense in depth seems reasonable.


// The rx fifo contained more bytes than could fit in the buffer, which
// is more than we ever expect to receive in one request message.
if bytes_received > rx_buf.len() {
Contributor


This seems fine, but is also largely unnecessary. We statically size the buffers to always be able to handle a request or response.

I don't believe we've ever actually seen this issue in practice since #1523 merged.

The only reasons I can think of where this would matter are:

  1. The SP was misbehaving significantly
  2. We resized the buffers and messages. The SP was updated first and sent a message to the RoT that didn't fit in the old buffer.

Both would be major problems in production. For (2) we can and should avoid it by just not resizing the buffers :). If we decide we must resize then we can do it over two releases depending upon direction:

  1. If we are increasing buffer size, we just go ahead and do this, but don't add any new messages that need the larger buffer yet. Then in the second release we add the larger messages. This ensures we never overflow.
  2. If we are decreasing the buffer size, we deprecate messages that are larger and stop sending them in the first release. Then in the second release we decrease the buffer size and remove the messages.

All in all, this is fine as a defensive measure, but I find it largely unnecessary. IIRC the reason I did it the way I did was because without fault management at the time I figured it would be easier to spot this major issue (SP misbehavior) occurring from a crash than with an error response and retry. But I'm not sure that's actually true or valid. So I'm generally fine with just not crashing the RoT lol.

However, I don't think we should overload the same error code. This is not a FIFO overrun and it isn't really related to flow control. I'd call it something else like RequestMsgTooLarge or something similar so that it stands out and helps us debug if this ever happens in the future. It would be a good sign to never see this distinct error code occur in practice.
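A minimal sketch of the revived guard with the distinct error variant suggested here (hypothetical shapes; the actual hubris types and names differ):

```rust
// Hypothetical stand-in for the RoT sprot task's internal error enum.
#[derive(Debug, PartialEq)]
enum IoError {
    RequestMsgTooLarge,
}

/// Report an error instead of panicking when the rx fifo delivered
/// more bytes than the request buffer can hold.
fn check_rx_len(bytes_received: usize, rx_buf: &[u8]) -> Result<(), IoError> {
    if bytes_received > rx_buf.len() {
        return Err(IoError::RequestMsgTooLarge);
    }
    Ok(())
}

fn main() {
    let rx_buf = [0u8; 4];
    assert_eq!(check_rx_len(4, &rx_buf), Ok(()));
    assert_eq!(check_rx_len(5, &rx_buf), Err(IoError::RequestMsgTooLarge));
    println!("guard ok");
}
```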

Contributor Author


I just pushed a commit that adds IOError::RequestMsgTooLarge and converts it to the existing SprotProtocolError::BadMessageLength error when reporting to the SP. Do you think that's a good approach? I could add a new SprotProtocolError but that would require updating the management-gateway-service repo, and BadMessageLength seems like a pretty good fit even though it's mostly used for a slightly different case.
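A sketch of that conversion, with both enums reduced to hypothetical one-variant stand-ins (the real definitions in hubris and management-gateway-service have many more variants):

```rust
// Hypothetical stand-ins for the two error types discussed above.
#[derive(Debug, PartialEq)]
enum IOError {
    RequestMsgTooLarge,
}

#[derive(Debug, PartialEq)]
enum SprotProtocolError {
    BadMessageLength,
}

// Translate the RoT-internal error into an existing wire-protocol
// error before reporting to the SP, so the shared protocol definition
// does not need to change.
impl From<IOError> for SprotProtocolError {
    fn from(e: IOError) -> Self {
        match e {
            IOError::RequestMsgTooLarge => SprotProtocolError::BadMessageLength,
        }
    }
}

fn main() {
    let wire: SprotProtocolError = IOError::RequestMsgTooLarge.into();
    assert_eq!(wire, SprotProtocolError::BadMessageLength);
    println!("mapped");
}
```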

Contributor


That seems perfect!

}

- if total_size > RESPONSE_BUF_SIZE {
+ if total_size > rx_buf.len() {
Contributor


This is a no-op and I find the new version less clear than the original. In fact I had to go read the code to ensure that they were the same.

The reason I find it harder to understand is the lack of local reasoning with the new version. I know from looking at the capitalized name that it is a constant, fixed buffer size. When looking at rx_buf.len() I have no such guarantee. If, as is the case here, we have an array, then the values must be the same. The length of an array doesn't change. However, if we had used something like ArrayVec and called .len() it would give us the number of elements in the "vector", not the capacity of the array. I couldn't recall what I used in the latest version of the code and so I had to go check and make sure.

For that reason I'd prefer not to make this change.
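The len-versus-capacity distinction can be shown with std types (`Vec` standing in here for the `ArrayVec` mentioned; purely illustrative):

```rust
fn main() {
    // An array's len() is its fixed size; it can never disagree with
    // the SIZE constant it was declared with.
    let arr_buf: [u8; 8] = [0; 8];
    assert_eq!(arr_buf.len(), 8);

    // A growable container's len() is the current element count, not
    // the capacity, which is the ambiguity the comment describes.
    let vec_buf: Vec<u8> = Vec::with_capacity(8);
    assert_eq!(vec_buf.len(), 0);
    assert!(vec_buf.capacity() >= 8);

    println!("len vs capacity ok");
}
```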

Contributor


I take this one back! I was looking at the lpc55 code locally. It's much less clear here in the call chain. In fact we could change to take a shorter slice and the constant would be wrong. Good fix. Sorry for the confusion.

Contributor Author


Ok, I'll revert it. My thought was that the new version improved local reasoning, because now the function was more self-contained and would work with any array passed in as rx_buf, instead of only the one array that is defined elsewhere to be RESPONSE_BUF_SIZE bytes long. But it does get more complicated if you consider types other than arrays.

Contributor


> Ok, I'll revert it. My thought was that the new version improved local reasoning, because now the function was more self-contained and would work with any array passed in as rx_buf, instead of only the one array that is defined elsewhere to be RESPONSE_BUF_SIZE bytes long. But it does get more complicated if you consider types other than arrays.

No no! I actually think you are correct here. I tried to alert you that I was wrong and you were right in my second comment. I was looking at the wrong file. I thought my comment only made sense (if it makes sense at all) on the lpc55 side. On the SP side we are passing a slice and it doesn't matter what the type is. I just really haven't looked at this code in a while and got confused.

At this point totally up to you what you want to do. You can leave it reverted or add it back ;) Sorry about the confusion.

Contributor Author


Oh lol, I didn't see the second comment. I really wish GitHub made sure to show you all new comments in a thread before letting you reply, instead of hoping you reload the page...

@andrewjstone
Contributor

@evan-oxide How did you test this code? I think it's good to go, but if you are set up to test it locally that would be best. The best way I've found is to do an upgrade from faux-mgs to the gimletlet. I don't have a local setup anymore though. Everything is packed away currently. In any event I'm sure someone can help get you squared away.

Comment thread drv/sprot-api/src/lib.rs Outdated
/// The number of times an SP sent more bytes than expected for one
/// message. In other words, the number of bytes sent by the SP to the RoT
/// between CSn assert and CSn de-assert exceeds `REQUEST_BUF_SIZE`.
pub rx_protocol_error_too_many_bytes: u32,
Contributor


I think this is fine, but if run with mixed RoT/SP versions it may cause issues. I don't think we need to use it until after update though, so it is probably fine.

@evan-oxide
Contributor Author

@andrewjstone Thanks for thinking about this code again! I agree this issue was not a serious concern - I'm mostly treating it as a learning exercise in how sprot works and how to use humility.

> How did you test this code?

I flashed a grapefruit's SP with this, making it send too many bytes in the rot_state request. I then used hiffy to call SpRot.rot_state and saw that the RoT sprot task's generation number increased. After flashing the RoT with the fix, calling SpRot.rot_state did not increase the generation number but did increment the rx_protocol_error_too_many_bytes count. Calls to SpRot.status and SpRot.io_stats always succeeded, before and after the fix.

Is there other testing you'd like to see?

@andrewjstone
Contributor

andrewjstone commented Apr 24, 2026

> @andrewjstone Thanks for thinking about this code again! I agree this issue was not a serious concern - I'm mostly treating it as a learning exercise in how sprot works and how to use humility.

Awesome. That's what I had hoped. Sorry if that comment came off as rude. I'm super happy to have you aboard and working on this. There is certainly room for improvement!

> How did you test this code?

> I flashed a grapefruit's SP with this, making it send too many bytes in the rot_state request. I then used hiffy to call SpRot.rot_state and saw that the RoT sprot task's generation number increased. After flashing the RoT with the fix, calling SpRot.rot_state did not increase the generation number but did increment the rx_protocol_error_too_many_bytes count. Calls to SpRot.status and SpRot.io_stats always succeeded, before and after the fix.

> Is there other testing you'd like to see?

That's a very good test. In the past we discovered issues when we tried to do updates, because it's a lot of messages sent back to back. However, I can't see how that would matter for this change except for message changes that are probably unused. I think that your test is likely good enough.

One other possible easy test you can do with your current setup is flashing the RoT first and seeing what happens when the SP receives an error response that it doesn't understand. And then also trying to get the stats from the SP and see what happens. Hopefully the SP task doesn't crash. Even if it does I don't think this is actually a problem for production. We don't expect to actually hit this error at all, and I don't believe we check stats during updates. Only if the update went wrong and we couldn't collect stats would this be a problem.

A somewhat more complicated test would be to actually try to perform updates from faux-mgs with the RoT updated first. I honestly don't recall the ordering, but trying the update in both orders and seeing what happens would be a great test. Up to you whether you think it is worth it. We'd almost certainly catch any problems on dogfood or someone would catch it on a racklette before we ship.

I'm probably not the best person to get you set up with this type of testing right now since I haven't done it in a long time, but it shouldn't be hard to set up since you have a grapefruit. Any of a number of hubris folks can help.

@evan-oxide
Contributor Author

It didn't sound rude! I appreciate the context about SP/RoT communication, what we're likely to see happen in practice, and how to think about updates 🙂

I'll try to run the faux-mgs update test, mostly because that sounds like a useful thing to know how to do. If that turns out to be too much of a pain then I'll do the simpler test of flashing the RoT first.
