
Fix CPU sets reconcile for in place vertical scaling with exclusive CPUs #11

Open
lukaszwojciechowski wants to merge 15 commits into esotsal:policy_static from lukaszwojciechowski:fix-cpuset-reconcile

Conversation


@lukaszwojciechowski lukaszwojciechowski commented Feb 11, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR modifies how CPU sets are applied to the container runtime in the CPU manager's reconciliation loop.
It resolves the CPU-usage conflicts that can appear when exclusive CPU resources are resized without a restart for pods with the Guaranteed QoS class, the static CPU policy, and InPlacePodVerticalScaling enabled.

For details about the conflicting scenarios, see the Which issue(s) this PR fixes section below.

PR structure

PR consists of 5 commits:

  • enhance TestReconcileState:
    -- verify that lastUpdateState contains the expected values
    -- enable verification of how the reconcile process completed for multiple containers, not just a single one
    -- extend test cases to cover the above changes and add a simple multi-container test
  • verify CPUSets set in mock runtime:
    -- extend mockRuntimeService in the CPU manager tests to store the CPUSets that were applied to the runtime
    -- verify that exclusive CPUs are assigned to a single container only
    -- verify that the runtime state matches expectations
  • add resize testcases for reconcileState: (NOTE: the added test cases fail because of issues in the algorithm)
    -- add test cases verifying CPUSet reconciliation for containers using exclusive CPUs
  • rework CPUSet reconciliation algorithm:
    -- modify the algorithm (NOTE: test cases pass again)
  • add test cases to cover multi-pass reconcile failures:
    -- add test cases covering UpdateContainerResources failure at different passes
Algorithm:
  1. The loop iterating over all containers in all pods now runs inside the critical section guarded by the CPU manager's lock (the same lock used during Allocate).
  2. During the iteration CPU sets are not yet applied, only collected into local variables:
    exclusiveCPUContainers and nonExclusiveCPUContainers. Holding the lock guarantees a consistent state.
  3. After collection, and outside the critical section, CPU sets are applied to the runtime in three steps:

Step 3.1. remove scaled down exclusive CPUs from containers:

  • containers using exclusive CPUs cannot be scaled down to an empty CPU set, because they always keep their original CPU set, so this step never tries to set an empty set in the runtime
  • after this step, all CPUs that now belong to the default CPU set are no longer used exclusively by any container

Step 3.2. apply CPU Sets for all non-exclusive containers:

  • these containers use the default CPUSet
  • if the default CPUSet shrank since the last reconcileState call because some of its CPUs were allocated as exclusive, those removed CPUs are no longer used by non-exclusive containers after this step completes

Step 3.3. set final CPU Sets for containers using exclusive CPUs
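
Below is a minimal sketch of the collect-then-three-pass idea, assuming hypothetical names (reconciledContainer, reconcileSketch, an injectable updateContainerCPUSet); the real change in pkg/kubelet/cm/cpumanager differs in detail and in error handling:

```go
package main

import (
	"context"
	"sync"

	"k8s.io/utils/cpuset"
)

// Illustrative stand-ins only, not the actual CPU manager types.
type reconciledContainer struct {
	containerID  string
	allocatedSet cpuset.CPUSet // target set from the CPU manager state
	lastApplied  cpuset.CPUSet // set last written to the runtime (lastUpdateState)
}

type manager struct {
	sync.Mutex
	// stand-in for the call that writes a cpuset into the container runtime
	updateContainerCPUSet func(ctx context.Context, containerID string, set cpuset.CPUSet) error
}

func (m *manager) reconcileSketch(ctx context.Context,
	collect func() (exclusive, shared []reconciledContainer)) error {
	// Collect targets inside the same lock Allocate takes, so the snapshot is
	// consistent; apply them to the runtime outside the lock.
	m.Lock()
	exclusive, shared := collect()
	m.Unlock()

	// Pass 1: shrink exclusive containers to allocated ∩ lastApplied, releasing
	// CPUs that returned to the default set. This never produces an empty set,
	// because an exclusive container always keeps its original CPUs.
	for _, c := range exclusive {
		target := c.allocatedSet.Intersection(c.lastApplied)
		if !target.Equals(c.lastApplied) {
			if err := m.updateContainerCPUSet(ctx, c.containerID, target); err != nil {
				return err // error handling simplified; the review discussion below refines it
			}
		}
	}
	// Pass 2: apply the (possibly shrunk) default set to containers without exclusive CPUs.
	for _, c := range shared {
		if err := m.updateContainerCPUSet(ctx, c.containerID, c.allocatedSet); err != nil {
			return err
		}
	}
	// Pass 3: apply the final exclusive sets, now that any CPUs they gained are
	// no longer used by other containers.
	for _, c := range exclusive {
		if !c.allocatedSet.Equals(c.lastApplied) {
			if err := m.updateContainerCPUSet(ctx, c.containerID, c.allocatedSet); err != nil {
				return err
			}
		}
	}
	return nil
}
```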

Which issue(s) this PR fixes:

Resizing integer CPU resources of pods with the Guaranteed QoS class without restarting containers, when the CPU policy is static and InPlacePodVerticalScaling is enabled, requires handling more complex scenarios when applying the allocated CPUs to the containers' runtime.

The algorithm used in reconcileState in the CPU manager has a few drawbacks that may lead to temporary conflicts in CPU usage (exclusive CPUs used by multiple containers):

Consistent state application

The reconcileState function does not apply the CPU sets of all containers as one consistent state. The loop over all pods and containers applies CPU sets one by one without any critical section. While the loop is running, the allocated CPU sets and the default set can be changed by a parallel resize, which executes Allocate in another goroutine and changes the allocations, e.g.

  1. reconcileState applies the default CPU set to container A's runtime
  2. (meanwhile) container B is being resized; Allocate removes some CPUs from the default CPU set and uses them as additional exclusive CPUs for container B
  3. the reconcileState loop continues and applies the new CPU set to container B's runtime

From this moment the CPUs allocated in step 2 are applied to the runtime of both containers A and B.
The situation fixes itself the next time reconcileState is executed (after 10s).

Temporary conflicts when moving CPUs between containers

Consider the following scenario:

  1. container A uses CPUs [1, 2]; container B uses [3, 4]
  2. container B is resized down to its original size, so its allocation changes to [3]
  3. container A is resized up and gets the additional CPU 4, so its allocation is now [1, 2, 4]
  4. reconcileState applies the new CPU set to container A's runtime
    (so container A's runtime uses [1, 2, 4], while container B still uses [3, 4])
  5. reconcileState applies the new CPU set to container B's runtime

Between steps 4 and 5, CPU 4 is applied to both containers. The situation is temporary and usually very short, but when it coincides with a kubelet restart or some delay, it might affect the operation of the containers.

Special notes for your reviewer:

Only the patch "rework CPUSet reconciliation algorithm" contains changes to the code.
The other patches extend tests and add test cases covering 100% of the changes made and the scenarios the new algorithm fixes.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


err := m.updateContainerCPUSet(ctx, rcat.containerID, rcat.allocatedSet)
if err != nil {
	logger.Error(err, "failed to update container", "containerID", rcat.containerID, "cpuSet", rcat.allocatedSet)
	failure = append(failure, rcat.reconciledContainer)

Why not set ok = false in this err case? Do you have another consideration?

Author

You are absolutely right, the line is missing. I will add it.

We cannot allow the 3rd pass to execute if a failure happens in the 2nd pass,
because some of the CPUs currently used by nonExclusiveCPUContainers might not have been released and they might be required by exclusiveCPUContainers.
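
A minimal sketch of that gating, with ok carried across passes; updateAll and the pass lists are illustrative stand-ins for the code under review, not its actual helpers:

```go
package main

import "k8s.io/utils/cpuset"

// Illustrative stand-ins for the containers handled in each pass.
type reconcileTarget struct {
	containerID string
	target      cpuset.CPUSet
}

// updateAll applies the targets and reports whether every update succeeded;
// individual failures would be logged and appended to a failure list.
func updateAll(apply func(id string, set cpuset.CPUSet) error, containers []reconcileTarget) bool {
	ok := true
	for _, c := range containers {
		if err := apply(c.containerID, c.target); err != nil {
			ok = false
		}
	}
	return ok
}

func reconcilePasses(apply func(id string, set cpuset.CPUSet) error,
	exclusiveShrunk, nonExclusive, exclusiveFinal []reconcileTarget) {
	// 1st pass: remove scaled-down exclusive CPUs.
	ok := updateAll(apply, exclusiveShrunk)
	// 2nd pass: runs only if the 1st pass released everything it had to release.
	if ok {
		ok = updateAll(apply, nonExclusive)
	}
	// 3rd pass: final exclusive sets run only if both earlier passes succeeded;
	// otherwise CPUs still held elsewhere could end up assigned to two containers.
	if ok {
		updateAll(apply, exclusiveFinal)
	}
}
```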


The error relationship between the passes is not absolute. If scaling down an exclusive container fails in the 1st pass, a container in a later pass may still apply its cpuset successfully, e.g.
1) 1st pass, exclusive containers scale down: container #0: {1, 2, 3, 4} -> {1, 2}; container #1: {5, 6} -> {5}
2) 2nd pass, non-exclusive cpuset scales up: {7, 8} -> {7, 8, 6}
3) 3rd pass, exclusive containers scale up: container #2: {9} -> {9, 3}; container #3: {10} -> {10, 4}

If container #0 succeeds and container #1 fails, containers #2 and #3 can still succeed; that is, a single failure in the 1st pass does not necessarily block the 3rd pass.

If an ok check is used before each pass, one container update failure affects all containers in the following passes, even if there is no cpuset conflict between them.

How about deleting the ok check before every pass, and changing the process as below?

  1. 1st pass: if an exclusive container's scale-down fails, add this container to the failure list and then continue

  2. 2nd pass: before updateContainerCPUSet, check intersection(non-exclusive cset, Union(lcsets of containers in the failure list))
    2-1) If it is not empty, append all non-exclusive containers to the failure list
    2-2) If it is empty, there is no cpuset conflict; continue with updateContainerCPUSet for each container. Containers for which updateContainerCPUSet fails are appended to the failure list, and processing continues.

  3. 3rd pass: loop over the 3rd pass containers and check intersection(container cset, Union(lcsets of containers in the failure list))
    3-1) If it is not empty, append this container to the failure list and continue.
    3-2) If it is empty, there is no cpuset conflict; call updateContainerCPUSet for the container and continue.

With this design, containers that failed earlier only affect the containers whose cpusets conflict with theirs, and every container whose cpuset update failed is appended to the failure list.
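
A minimal sketch of the proposed intersection check, assuming a hypothetical failure list that records each failed container's last-applied cpuset (lcset); names are illustrative, not the PR's code:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// unionOf combines the last-applied cpusets of all containers on the failure list.
func unionOf(failedLastCPUSets []cpuset.CPUSet) cpuset.CPUSet {
	blocked := cpuset.New()
	for _, s := range failedLastCPUSets {
		blocked = blocked.Union(s)
	}
	return blocked
}

// canApply reports whether a target cpuset is free of conflicts with CPUs that
// earlier, failed containers may still be holding in the runtime.
func canApply(target cpuset.CPUSet, failedLastCPUSets []cpuset.CPUSet) bool {
	return target.Intersection(unionOf(failedLastCPUSets)).IsEmpty()
}

func main() {
	// Numbers follow the example above: container #1 failed to scale down and may still hold {5, 6}.
	failed := []cpuset.CPUSet{cpuset.New(5, 6)}

	fmt.Println(canApply(cpuset.New(7, 8, 6), failed)) // false: CPU 6 may still be in use
	fmt.Println(canApply(cpuset.New(9, 3), failed))    // true: no conflict, safe to apply
}
```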

Author

@shiya0705 Thank you for your design proposal, I think it will be good to implement it.

However, I think we need to consider one more scenario:
What if this is the first run of the reconcile loop after a kubelet restart?
In that situation we don't know what is set in the runtime (lastUpdateState will be an empty set for all containers). If updateContainerCPUSet fails for any container in the 2nd pass and we add the shared-CPU containers to the failure list, the union of their lcsets will remain empty. Because of that, the intersection of any container's cset with this union will also be empty and we will continue with all the exclusive-CPU containers, possibly leading to conflicts, e.g.

  1. initial state:
    defaultSet={0,1,2,3}, allocations={A:{4,5}},
    lastUpdateState={A:{4,5}, B:{0,1,2,3}},
    runtime={A:{4,5}, B:{0,1,2,3}}
  2. scale up of A (just allocation):
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={A:{4,5}, B:{0,1,2,3}},
    runtime={A:{4,5}, B:{0,1,2,3}}
  3. kubelet restart
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={},
    runtime={A:{4,5}, B:{0,1,2,3}}
  4. reconcile of B fails, we continue with setting up A:
    defaultSet={0,1}, allocations={A:{2,3,4,5}},
    lastUpdateState={A:{2,3,4,5}},
    runtime={A:{2,3,4,5}, B:{0,1,2,3}} <-- conflict on CPUs: 2,3
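
A minimal sketch of why the proposed check passes vacuously after a kubelet restart, using k8s.io/utils/cpuset (numbers follow the scenario above):

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// After the kubelet restart lastUpdateState is empty, so the lcset recorded
	// for the failed container B is the empty set, and so is the failure union.
	failedUnion := cpuset.New()

	// Target for exclusive container A after the resize (step 4 above).
	targetA := cpuset.New(2, 3, 4, 5)

	// The conflict check passes vacuously, even though B's runtime still holds {0,1,2,3}.
	fmt.Println(targetA.Intersection(failedUnion).IsEmpty()) // true -> A is applied, conflicting on CPUs 2 and 3
}
```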

@shiya0705 How do you propose to cover this scenario?


@shiya0705 shiya0705 Mar 3, 2026


If lastUpdateState is empty, how about reading the cpuset from the runtime using func (m *kubeGenericRuntimeManager) GetContainerStatus, and using that cpuset to replace lastUpdateState?

Author

Maybe I'm missing something, so please correct me if I don't see a valid possibility, but to me it looks like this:

That would be a huge modification and we would probably need to get the community's opinion first.
The communication is intended to be one-directional only: kubelet -> runtime,
with the exception of requesting status (but limited to only a few basic fields).

Currently, getting the CPUSet from the runtime is not implemented on any layer, so there would be a lot to do to implement it through all the layers.
Also, the runtime is completely separate from the kubelet itself, so different runtimes might not want to implement that, and even if they do, it is an API change and would need to be introduced as a feature.


So without reading the current runtime state, we cannot know what is set and cannot avoid collisions.
Accepting that, I think your approach @shiya0705 is fine. I will implement it.


In func (m *kubeGenericRuntimeManager) GetContainerStatus,
resp, err := m.runtimeService.ContainerStatus(ctx, id.ID, false) reads the container status from the runtime, and the cpuset is already included in resp (of type ContainerStatusResponse):
ContainerStatusResponse
-> ContainerStatus
-> ContainerResources
-> LinuxContainerResources
-> CpusetCpus string
So I think it is possible to read the cpuset from the runtime, but it needs some modification.
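
A minimal sketch of reading the cpuset back through the CRI getters listed above, assuming access to a cri-api RuntimeService client (error handling simplified; this is not existing kubelet code):

```go
package main

import (
	"context"

	cri "k8s.io/cri-api/pkg/apis"
)

// cpusetFromRuntime queries the runtime for a container's status and extracts
// the currently applied cpuset string (e.g. "0-1,4").
func cpusetFromRuntime(ctx context.Context, rs cri.RuntimeService, containerID string) (string, error) {
	resp, err := rs.ContainerStatus(ctx, containerID, false /* verbose */)
	if err != nil {
		return "", err
	}
	// The generated getters are nil-safe, so a runtime that does not report
	// resources simply yields an empty string here.
	return resp.GetStatus().GetResources().GetLinux().GetCpusetCpus(), nil
}
```

The returned string could then be parsed with cpuset.Parse from k8s.io/utils/cpuset if it were used to seed lastUpdateState.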

Another thing we need to consider: if updateContainerCPUSet returns an error, even though it is possible to get the cpuset from the runtime, that get call may also fail and lead to a conflict.

If we accept that, I think this approach is fine.

Author

Thanks @shiya0705

I modified the algorithm as you suggested at the beginning. I also added some more test cases to verify failures during multi-pass reconciliation.

I think we cannot go with adding CPUSetCPUs to the runtime response. That would change the protocol between Kubernetes and the runtime and is a huge change that requires a separate FeatureGate and consultation with the projects implementing runtimes.

Please do another round of review.

continue
}
m.lastUpdateState.SetCPUSet(rca.podUID, rca.containerName, iset)
}

The container is not appended to the success list in this loop when its updateContainerCPUSet succeeds.
Even though it could be updated in the third loop (step 3-3), the prerequisite is that no container fails in the previous loops.

Author

Yes, that's intentional, because the container is not fully set up yet. The exclusiveCPUContainers will end up on the success or failure list in the 3rd pass.

However, since you opened the discussion about it, let's talk it through:

  • First of all, please notice that the success and failure lists are used only in tests.
  • Second, there is no definition of what success or failure means.
    (before my patch it was easier, because every container ended up in one of the two categories)

So here are possible solutions for how we can handle that:

Option 1
Define success as the successful application of the final state to the container.
So a container is on the success list if the application of its CPU sets is complete, on the failure list if an error happened, or not present on either list if it was not processed at all or only partially.

Results:

  • if there is an error in the 1st pass, only the failed containers from exclusiveCPUContainers are reported on the failure list; nonExclusiveCPUContainers are not reported.
  • if there is an error in the 2nd pass (no errors in the first pass), all nonExclusiveCPUContainers are returned as success or failure; none of the exclusiveCPUContainers are reported.
  • if we reach the 3rd pass (no errors in passes 1 and 2), all containers are reported in one of the two categories.

Option 2
Make every container appear on one of the returned lists. That would require adding 2 more lists to the return value: unprocessed and partial.

Results:

  • if there is an error in the 1st pass, the failed containers from exclusiveCPUContainers are reported on the failure list; partially set-up containers from exclusiveCPUContainers are reported on the partial list; nonExclusiveCPUContainers are reported on the unprocessed list.
  • if there is an error in the 2nd pass (no errors in the first pass), all nonExclusiveCPUContainers are returned as success or failure; all exclusiveCPUContainers are reported as partial.
  • if we reach the 3rd pass (no errors in passes 1 and 2), all containers are reported in one of two categories, success or failure; the partial and unprocessed lists are empty.

@shiya0705 , @esotsal please let me know your opinion.
After we decide I will apply changes to the code (or not) and add tests to cover the scenarios.

My opinion is to leave this code as is (so choosing Option 1), but add unit tests to cover all execution paths.


@shiya0705 shiya0705 Mar 2, 2026


Both of these options are based on the rule that if one container fails in an earlier pass, all containers in the following passes stop their cpuset updates. That may not be very friendly.
Option 1 only reports the containers for which updateContainerCPUSet failed; unprocessed and partial containers, whose cpusets were not updated, are simply ignored. Option 2 seems to cover things more comprehensively.

I added a new design option in comment https://github.com/esotsal/kubernetes/pull/11#discussion_r2870674556, let's check if it helps.

@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from 7a6a74b to ff01a41 Compare February 16, 2026 14:09
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from a1182a8 to 7321dd1 Compare February 23, 2026 08:11
@esotsal esotsal force-pushed the policy_static branch 4 times, most recently from 5bd954d to 3452089 Compare February 27, 2026 12:49
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from 7bbfcdd to 3e66885 Compare March 3, 2026 13:00
@esotsal esotsal force-pushed the policy_static branch 8 times, most recently from 044d951 to f52feef Compare March 6, 2026 10:56
// Determine the CPU set to use based on the pass
var targetCPUSet cpuset.CPUSet
if preliminary {
	targetCPUSet = rca.allocatedSet.Intersection(lcset)

preliminary is true only when removing CPUs from containers using exclusive CPUs; in that case
targetCPUSet = rca.allocatedSet.Intersection(lcset) and targetCPUSet = rca.allocatedSet are the same. There is no need for the preliminary condition check here.

Author

We also need to think about the case where the set of CPUs is not only scaled down, but changed.

Let's consider the following scenario for a container:

  1. initial scenario state: allocated: {original: {1}, resized: {1, 2, 3}}, lastUpdateState: {1, 2, 3}
  2. scale down 3 --> 1: allocated: {original: {1}, resized: {1}}, lastUpdateState: {1, 2, 3}
  3. meanwhile some other container is resized and gets allocated CPUs 2 and 3
  4. scale up 1 --> 2: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
  5. only now is reconcileState launched
    The targetCPUSet for the first pass must be {1}, the intersection of allocated and lastUpdateState.
    We don't want to use {1, 4}, as CPU 4 might still be used by some other container.
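
A minimal sketch of the two targets computed for this scenario with k8s.io/utils/cpuset (values follow the numbers above):

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// State when reconcileState finally runs (step 5 above).
	allocated := cpuset.New(1, 4)          // resized allocation after steps 2-4
	lastUpdateState := cpuset.New(1, 2, 3) // what was last applied to the runtime

	// 1st pass (preliminary): only shrink, never hand out CPUs that may still
	// be in use elsewhere, so intersect with the last applied set.
	preliminaryTarget := allocated.Intersection(lastUpdateState)

	// 3rd pass (final): apply the full allocation, once the earlier passes have
	// released the CPUs it gained.
	finalTarget := allocated

	fmt.Println(preliminaryTarget.String()) // "1"
	fmt.Println(finalTarget.String())       // "1,4"
}
```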


I understand your design idea. It seems reasonable.

}

// Check if update is needed
if !targetCPUSet.Equals(lcset) {

In the first pass, updateContainers(exclusiveCPUContainers, true), all the containers in exclusiveCPUContainers that have been resized will update their cpuset, not only the containers whose CPUs were removed.

Author

Yes, they update lastUpdateState each time changes to the runtime are made.
However, the change in the first pass (updateContainers(exclusiveCPUContainers, true)) might not be final: only some CPUs are removed, so in the 3rd pass we still need to check whether another update is required.

Look at the scenario from the comment above.
reconcileState runs with the following state: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
So

  • in the 1st pass there is an update to {1}: only the removal of CPUs 2 and 3
  • in the 3rd pass there is an update to {1, 4}: the final set, once we have the guarantee that CPU 4 is no longer used by any other container


You are right~

@esotsal esotsal force-pushed the policy_static branch 6 times, most recently from aa5c826 to ecd92e6 Compare March 14, 2026 07:31
Owner

esotsal commented Mar 18, 2026

Since we missed the deadline for v1.36 and are now aiming for v1.37, I think it makes sense to merge this into the main PR and continue the discussion there. Please @lukaszwojciechowski rebase to merge your PR.

@esotsal esotsal force-pushed the policy_static branch 16 times, most recently from bc6bdbf to d5ec56f Compare April 11, 2026 12:31
@esotsal esotsal force-pushed the policy_static branch 4 times, most recently from 50e451c to c458676 Compare April 13, 2026 09:58
