Block multiple sled reservations with the same gen#10479
Conversation
If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in `sled_resource_vmm` with different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error from `sis_move_to_starting`), and remove the `sled_resource_vmm` record that they added by matching on that saga's propolis id. There's never been a uniqueness constraint for instance id in the `sled_resource_vmm` table, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance). For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue. The fix is to add a uniqueness constraint to `sled_resource_vmm` that ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case. This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the `sled_resource_vmm` records produced by the test showed the resources didn't match the instance specification. Fixes oxidecomputer/customer-support#1184.
| } | ||
|
|
||
| #[derive(Clone, Debug)] | ||
| pub enum SledResourceVmmInstanceStateGeneration { |
There was a problem hiding this comment.
super annoying obnoxious nitpick: man this is a long name...I suppose including the SledResourceVmm prefix is necessary because this type is currently re-exported by a pub use sled_resource_vmm::*; so we can't just expect callers to refer to it as sled_resource_vmm::InstanceStateGeneration. which...I dunno if it's worth trying to fix that. I guess this is fine, it just makes me feel a certain type of way!
| .sled_reservation_create( | ||
| &opctx, | ||
| instance_id, | ||
| nexus_db_model::Generation::new(), |
There was a problem hiding this comment.
nit, may not be important: I wonder if rather than always inserting the reservation at generation 1, we ought to change this function to take an &Instance rather than an InstanceUuid, and use the generation from the instance record. Clearly, there aren't currently any tests which are calling this helper multiple times for the same InstanceUuid, or else they would have already broken, but it seems like it could be potentially annoying if someone were to start adding a new test that does so and was surprised to discover this doesn't actually use the generation from the instance record.
on the other hand, maybe updating all the test code that uses this to pass an &Instance is going to be too painful.
| instance_id: InstanceUuid, | ||
| instance_state_generation: db::model::Generation, |
There was a problem hiding this comment.
hmm, so, as written, this allows the caller to attempt to create the reservation at any instance state generation, regardless of what they believe the instance's current generation to be (and, in fact, it seems like we currently have a bunch of tests which are always providing generation 1 no matter what). I wonder if it might be a bit more misuse-resistant to change this function to instead take an &Instance, and always use the state generation from the instance model. That way, we're sure that these come from the same snapshot of the instance state.
Of course, this will require changing the callsites, and it may not be worth the effort, but I felt like it was worth mentioning...
| Err(diesel::result::Error::DatabaseError( | ||
| diesel::result::DatabaseErrorKind::UniqueViolation, | ||
| error_info, | ||
| )) if error_info.constraint_name() | ||
| == Some(SINGLE_RESERVATION_CONSTRAINT) => |
There was a problem hiding this comment.
nit, take it or leave it: i think it might be nice if this logic was stuffed into a function, so you could just say
| Err(diesel::result::Error::DatabaseError( | |
| diesel::result::DatabaseErrorKind::UniqueViolation, | |
| error_info, | |
| )) if error_info.constraint_name() | |
| == Some(SINGLE_RESERVATION_CONSTRAINT) => | |
| Err(e) if is_single_reservation_constraint_violation(e) => |
or something, here and later on?
| group membership." | ||
| )] | ||
| RequiredAffinitySledNotValid, | ||
| #[error("Instance reservation already made for generation {generation}")] |
There was a problem hiding this comment.
the rest of the errors here are intended to be user facing (which is why they're kinda big blocks of multi-sentence text). I don't think "Instance reservation already made for generation 69" is something that's particularly meaningful to a user. If this is going to bubble up, could it say something a little less obscure? Maybe "This instance is already running or starting on another sled."?
| // Finally, perform the INSERT if it's still valid. | ||
| query.sql(" | ||
| INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id) | ||
| INSERT INTO sled_resource_vmm (id, sled_id, hardware_threads, rss_ram, reservoir_ram, instance_id, instance_state_generation) |
If multiple instance-start sagas are concurrently attempting to allocate for the same instance, this temporarily results in multiple rows in
sled_resource_vmmwith different propolis ids for the same instance id. One of the instance-start sagas will succeed, where the other(s) will unwind (due to an "instance changed state before it could be started" error fromsis_move_to_starting), and remove thesled_resource_vmmrecord that they added by matching on that saga's propolis id.There's never been a uniqueness constraint for instance id in the
sled_resource_vmmtable, because there can't be, otherwise we'd never be able to migrate an instance (which makes a new record on a different sled for the same instance).For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting / updating local storage related records is that this type of duplication could not occur, that if the insert succeeded then it means the allocation will only be performed once. Because this is not true the CTE will happily stomp all over the local storage allocation related records and that leads to the orphaning seen in the linked issue.
The fix is to add a uniqueness constraint to
sled_resource_vmmthat ensures only one record for a given instance id plus the instance state generation number exists. This will not affect migration because the instance state generation is bumped in that case.This commit also changes the local storage related unit tests to clearly specify the ncpus and memory for the fake instances, as inspecting the
sled_resource_vmmrecords produced by the test showed the resources didn't match the instance specification.Fixes oxidecomputer/customer-support#1184.