Block multiple sled reservations with the same gen #10479

Open: jmpesp wants to merge 1 commit into oxidecomputer:main from jmpesp:instance_state_generation_in_sled_reservation
jmpesp commented May 21, 2026

If multiple instance-start sagas concurrently attempt to allocate for the same instance, this temporarily results in multiple rows in `sled_resource_vmm` with different propolis ids for the same instance id. One of the instance-start sagas will succeed, while the others will unwind (due to an "instance changed state before it could be started" error from `sis_move_to_starting`) and remove the `sled_resource_vmm` record that they added by matching on that saga's propolis id.

There has never been a uniqueness constraint on instance id in the `sled_resource_vmm` table, and there can't be: otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance).

For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication cannot occur, and that if the insert succeeded then the allocation will only be performed once. Because this is not true, the CTE will happily stomp all over the local storage allocation related records, and that leads to the orphaning seen in the linked issue.

The fix is to add a uniqueness constraint to `sled_resource_vmm` that ensures only one record exists for a given instance id plus instance state generation number. This will not affect migration, because the instance state generation is bumped in that case.
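To make the effect of the constraint concrete, here is a minimal in-memory sketch, not the actual schema change or datastore code. It models the table as a map keyed on (instance id, instance state generation); the type names, the integer ids, and the `reserve` method are all hypothetical stand-ins for illustration.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical stand-in for one `sled_resource_vmm` row.
#[derive(Clone, Debug, PartialEq)]
pub struct Reservation {
    pub propolis_id: u32,
    pub sled_id: u32,
}

/// Hypothetical error for a rejected duplicate reservation.
#[derive(Debug, PartialEq)]
pub enum ReserveError {
    AlreadyReserved { generation: u64 },
}

/// In-memory model of the table, keyed the way the proposed constraint
/// keys it: (instance id, instance state generation).
#[derive(Default)]
pub struct SledResourceVmmTable {
    rows: HashMap<(u32, u64), Reservation>,
}

impl SledResourceVmmTable {
    /// Insert a reservation, rejecting a second one for the same
    /// instance at the same state generation.
    pub fn reserve(
        &mut self,
        instance_id: u32,
        generation: u64,
        row: Reservation,
    ) -> Result<(), ReserveError> {
        match self.rows.entry((instance_id, generation)) {
            Entry::Occupied(_) => {
                Err(ReserveError::AlreadyReserved { generation })
            }
            Entry::Vacant(slot) => {
                slot.insert(row);
                Ok(())
            }
        }
    }

    /// Number of reservations currently held.
    pub fn len(&self) -> usize {
        self.rows.len()
    }
}
```

In this model, two concurrent starts at generation 1 collide on the key, so the second insert fails up front instead of unwinding later. A migration inserts at a bumped generation and therefore still succeeds.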

This commit also changes the local storage related unit tests to explicitly specify the ncpus and memory for the fake instances, as inspecting the `sled_resource_vmm` records produced by the tests showed the resources didn't match the instance specification.

Fixes oxidecomputer/customer-support#1184.

@jmpesp jmpesp requested a review from hawkw May 21, 2026 16:56
@hawkw hawkw requested a review from smklein May 22, 2026 17:11
}

#[derive(Clone, Debug)]
pub enum SledResourceVmmInstanceStateGeneration {
super annoying obnoxious nitpick: man, this is a long name... I suppose including the `SledResourceVmm` prefix is necessary because this type is currently re-exported by a `pub use sled_resource_vmm::*;`, so we can't just expect callers to refer to it as `sled_resource_vmm::InstanceStateGeneration`. Which... I dunno if it's worth trying to fix that. I guess this is fine, it just makes me feel a certain type of way!

.sled_reservation_create(
&opctx,
instance_id,
nexus_db_model::Generation::new(),
nit, may not be important: I wonder if, rather than always inserting the reservation at generation 1, we ought to change this function to take an `&Instance` rather than an `InstanceUuid`, and use the generation from the instance record. Clearly, there aren't currently any tests which call this helper multiple times for the same `InstanceUuid`, or else they would have already broken, but it seems like it could be annoying if someone were to add a new test that does so and was surprised to discover that this doesn't actually use the generation from the instance record.

On the other hand, maybe updating all the test code that uses this to pass an `&Instance` is going to be too painful.

Comment on lines 864 to +865
instance_id: InstanceUuid,
instance_state_generation: db::model::Generation,
hmm, so, as written, this allows the caller to attempt to create the reservation at any instance state generation, regardless of what they believe the instance's current generation to be (and, in fact, it seems like we currently have a bunch of tests which always provide generation 1 no matter what). I wonder if it might be a bit more misuse-resistant to change this function to instead take an `&Instance`, and always use the state generation from the instance model. That way, we're sure that the id and generation come from the same snapshot of the instance state.

Of course, this will require changing the callsites, and it may not be worth the effort, but I felt like it was worth mentioning...
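A rough sketch of the shape being suggested, under the assumption that the id and state generation both live on the instance record. `Instance`, `ReservationKey`, and `reservation_key` here are hypothetical, trimmed-down stand-ins, not the real `db::model` types or the real `sled_reservation_create` signature.

```rust
/// Hypothetical, trimmed-down instance record; the real model has many
/// more fields and uses `InstanceUuid` / `Generation` rather than integers.
#[derive(Clone, Debug)]
pub struct Instance {
    pub id: u128,
    pub state_generation: u64,
}

/// The pair that the proposed uniqueness constraint covers.
#[derive(Debug, PartialEq, Eq)]
pub struct ReservationKey {
    pub instance_id: u128,
    pub instance_state_generation: u64,
}

/// Derive both halves of the key from one borrowed snapshot, so a caller
/// cannot pair an instance id with an unrelated generation.
pub fn reservation_key(instance: &Instance) -> ReservationKey {
    ReservationKey {
        instance_id: instance.id,
        instance_state_generation: instance.state_generation,
    }
}
```

The trade-off is the one noted above: every callsite (including test helpers) would need an `Instance` in hand rather than just an id.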

Comment on lines +1233 to +1237
Err(diesel::result::Error::DatabaseError(
diesel::result::DatabaseErrorKind::UniqueViolation,
error_info,
)) if error_info.constraint_name()
== Some(SINGLE_RESERVATION_CONSTRAINT) =>
nit, take it or leave it: i think it might be nice if this logic was stuffed into a function, so you could just say

Suggested change
- Err(diesel::result::Error::DatabaseError(
-     diesel::result::DatabaseErrorKind::UniqueViolation,
-     error_info,
- )) if error_info.constraint_name()
-     == Some(SINGLE_RESERVATION_CONSTRAINT) =>
+ Err(e) if is_single_reservation_constraint_violation(e) =>

or something, here and later on?
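One possible shape for such a helper, sketched against stand-in types so that it compiles on its own. `DbError`, `DbErrorKind`, and `DbErrorInfo` are hypothetical mirrors of the diesel error matched in the diff, and the string value of `SINGLE_RESERVATION_CONSTRAINT` is a placeholder, since the real constraint name is not shown here.

```rust
/// Stand-in for the database error kind matched in the diff.
pub enum DbErrorKind {
    UniqueViolation,
    Other,
}

/// Stand-in for the error information carried by a database error.
pub struct DbErrorInfo {
    pub constraint: Option<String>,
}

impl DbErrorInfo {
    /// Name of the violated constraint, if the database reported one.
    pub fn constraint_name(&self) -> Option<&str> {
        self.constraint.as_deref()
    }
}

/// Stand-in for the query error type.
pub enum DbError {
    DatabaseError(DbErrorKind, DbErrorInfo),
    NotFound,
}

/// Placeholder value: the real constraint name is an assumption here.
pub const SINGLE_RESERVATION_CONSTRAINT: &str = "single_reservation";

/// True only for a unique violation on the single-reservation constraint.
/// Takes a reference so it can be called from a match guard.
pub fn is_single_reservation_constraint_violation(e: &DbError) -> bool {
    matches!(
        e,
        DbError::DatabaseError(DbErrorKind::UniqueViolation, info)
            if info.constraint_name() == Some(SINGLE_RESERVATION_CONSTRAINT)
    )
}

/// Example callsite showing the shortened match arm.
pub fn classify(result: Result<(), DbError>) -> &'static str {
    match result {
        Ok(()) => "reserved",
        Err(e) if is_single_reservation_constraint_violation(&e) => {
            "already reserved"
        }
        Err(_) => "other error",
    }
}
```

Note that the guard passes `&e`: a match guard can only borrow the binding, so the helper has to take the error by reference rather than by value.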

group membership."
)]
RequiredAffinitySledNotValid,
#[error("Instance reservation already made for generation {generation}")]
the rest of the errors here are intended to be user facing (which is why they're kinda big blocks of multi-sentence text). I don't think "Instance reservation already made for generation 69" is something that's particularly meaningful to a user. If this is going to bubble up, could it say something a little less obscure? Maybe "This instance is already running or starting on another sled."?
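As an illustration of that suggestion, the sketch below keeps the generation on the variant (for logs and debugging) but leaves it out of the user-facing text. The enum and variant names are hypothetical, and `Display` is implemented by hand only to keep the example dependency-free; the diff itself uses an `#[error(...)]` attribute.

```rust
use std::fmt;

/// Hypothetical error type; only the message wording comes from the
/// review comment above.
#[derive(Debug)]
pub enum ReservationError {
    /// A reservation already exists for this instance at `generation`.
    AlreadyReserved { generation: u64 },
}

impl fmt::Display for ReservationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // The generation stays available via `Debug`, but is not
            // shown to the user.
            ReservationError::AlreadyReserved { .. } => f.write_str(
                "This instance is already running or starting on another sled.",
            ),
        }
    }
}

impl std::error::Error for ReservationError {}
```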

// Finally, perform the INSERT if it's still valid.
query.sql("
- INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id)
+ INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id, instance_state_generation)
can we line wrap this?
