Fix CPU sets reconcile for in-place vertical scaling with exclusive CPUs #11
lukaszwojciechowski wants to merge 15 commits into esotsal:policy_static from
Conversation
```go
err := m.updateContainerCPUSet(ctx, rcat.containerID, rcat.allocatedSet)
if err != nil {
	logger.Error(err, "failed to update container", "containerID", rcat.containerID, "cpuSet", rcat.allocatedSet)
	failure = append(failure, rcat.reconciledContainer)
```
Why not set ok = false in this error case? Or do you have another consideration?
You are absolutely right, the line is missing. I will add it.
We cannot allow the 3rd pass to be executed if a failure happens in the 2nd pass,
because some of the CPUs currently used by nonExclusiveCPUContainers might not be released while they are required by exclusiveCPUContainers.
The error relationship between the phases is not absolute. If an exclusive container being scaled down fails in the 1st pass, a container in another pass may still apply its cpuset successfully, e.g.:
1) 1st pass, exclusive containers scale down: container #0: {1, 2, 3, 4} -> {1, 2}; container #1: {5, 6} -> {5}
2) 2nd pass, non-exclusive cpuset scales up: {7, 8} -> {7, 8, 6}
3) 3rd pass, exclusive containers scale up: container #2: {9} -> {9, 3}; container #3: {10} -> {10, 4}
If container #0 succeeds and container #1 fails, containers #2 and #3 can still succeed. In other words, even with one failure in the 1st pass, 3rd-pass containers can still succeed.
If we use an ok check before each pass, one container update failure affects all containers in the following passes, even when there is no cpuset conflict between them.
How about deleting the ok check before every pass and changing the process as below:
- 1st pass: if one exclusive container scale-down fails, add this container to the failure list and then continue.
- 2nd pass: before updateContainerCPUSet, check intersection(non-exclusive cset, Union(lcsets of containers in the failure list)).
  2-1) If it is not null, append all non-exclusive containers to the failure list.
  2-2) If it is null, there is no cpuset conflict; continue to updateContainerCPUSet for each container. Append containers for which updateContainerCPUSet failed to the failure list and continue.
- 3rd pass: loop over the 3rd-pass containers and check intersection(container cset, Union(lcsets of containers in the failure list)).
  3-1) If it is not null, append this container to the failure list and continue.
  3-2) If it is null, there is no cpuset conflict; updateContainerCPUSet for each container and continue.
With this design, failed containers from earlier passes only affect the containers whose cpusets conflict with them, and every container whose cpuset update failed is appended to the failure list.
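The per-container conflict check proposed above can be sketched as follows. This is a minimal illustration, not the kubelet implementation: `cpuSet` here is a hand-rolled stand-in for `k8s.io/utils/cpuset.CPUSet`, and the container sets are the example values from this thread.

```go
package main

import "fmt"

// cpuSet is an illustrative stand-in for cpuset.CPUSet.
type cpuSet map[int]struct{}

func newCPUSet(cpus ...int) cpuSet {
	s := cpuSet{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

// union merges two sets, like CPUSet.Union.
func (s cpuSet) union(o cpuSet) cpuSet {
	r := cpuSet{}
	for c := range s {
		r[c] = struct{}{}
	}
	for c := range o {
		r[c] = struct{}{}
	}
	return r
}

// intersects reports whether the sets share any CPU.
func (s cpuSet) intersects(o cpuSet) bool {
	for c := range s {
		if _, ok := o[c]; ok {
			return true
		}
	}
	return false
}

func main() {
	// lcsets of containers that failed in the 1st pass; here container #1,
	// which failed to shrink {5, 6} -> {5}.
	failedUnion := cpuSet{}.union(newCPUSet(5, 6))

	// Later passes skip only the containers whose target overlaps the union.
	fmt.Println(newCPUSet(9, 3).intersects(failedUnion))  // container #2 {9, 3}: false, safe to apply
	fmt.Println(newCPUSet(10, 6).intersects(failedUnion)) // a container needing CPU 6: true, must go to failure list
}
```

With this check, a failure isolates only the containers that actually depend on the CPUs that were not released.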
@shiya0705 Thank you for your design proposal, I think it will be good to implement it.
However I think we need to think about one more scenario:
What if this is first run of reconcile loop after kubelet restart?
In such a situation we don't know what is set in the runtime (lastUpdateState will be an empty set for all containers). If calling updateContainerCPUSet fails for any of the containers in the 2nd pass and we add the shared-CPU containers to the failure list, the union of their lcsets will remain empty. Because of that, the intersection of a container's cset with this union will also be empty, and we will proceed with all exclusive-CPU containers, possibly leading to conflicts, e.g.:
- initial state:
  defaultSet={0,1,2,3}, allocations={A:{4,5}},
  lastUpdateState={A:{4,5}, B:{0,1,2,3}},
  runtime={A:{4,5}, B:{0,1,2,3}}
- scale up of A (just allocation):
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={A:{4,5}, B:{0,1,2,3}},
  runtime={A:{4,5}, B:{0,1,2,3}}
- kubelet restart:
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={},
  runtime={A:{4,5}, B:{0,1,2,3}}
- reconcile of B fails, we continue with setting up A:
  defaultSet={0,1}, allocations={A:{2,3,4,5}},
  lastUpdateState={A:{2,3,4,5}},
  runtime={A:{2,3,4,5}, B:{0,1,2,3}} <-- conflict on CPUs: 2,3
@shiya0705 How do you propose to cover this scenario?
If lastUpdateState is empty, how about reading the cpuset from the runtime using func (m *kubeGenericRuntimeManager) GetContainerStatus, and using this cpuset to replace lastUpdateState?
Maybe I'm missing something, so please correct me if I don't see a valid possibility, but to me it looks like:
That would be a huge modification and we would probably need to get the community's opinion first.
The communication is intended to be one direction only: kubelet -> runtime,
with the exception of requesting status (but limited to only several basic fields).
Currently, getting the CPUSet from the runtime is not implemented on any layer, so there would be a lot to do to implement it through all the layers.
Also, the runtime is completely separate from kubelet itself, so different runtimes might not want to implement that, and even if they did, that's an API change and would need to be introduced as a feature.
So without reading the current runtime state, we cannot know what is set and cannot avoid collisions.
Accepting that, I think your approach @shiya0705 is fine. I will implement it.
In func (m *kubeGenericRuntimeManager) GetContainerStatus,
resp, err := m.runtimeService.ContainerStatus(ctx, id.ID, false) reads the container status from the runtime, and the cpuset is already included in resp (of type ContainerStatusResponse):
ContainerStatusResponse
-> ContainerStatus
-> ContainerResources
-> LinuxContainerResources
-> CpusetCpus string
So I think it is possible to read the cpuset from the runtime, but it needs some modification.
Another thing we need to consider: even if it is possible to get the cpuset from the runtime after updateContainerCPUSet returns an error, getting the cpuset may also fail and lead to a conflict.
If we accept that, I think this approach is fine.
Thanks @shiya0705
I modified the algorithm as you suggested at the beginning. I also added some more test cases to verify failures during multi-pass reconciliation.
I think we cannot go with adding CPUSetCPUs to the runtime response. That would change the protocol between Kubernetes and the runtime and is a huge change that requires a separate FeatureGate and consultation with the projects implementing runtimes.
Please make another round of review.
```go
		continue
	}
	m.lastUpdateState.SetCPUSet(rca.podUID, rca.containerName, iset)
}
```
The container is not appended to the success list in this loop when its updateContainerCPUSet succeeds.
Even though it could be updated in the third loop (step 3-3), the prerequisite is that no container failed in the previous loop.
Yes, that's intentional, because it is not fully set yet. The exclusiveCPUContainers will get to the success or failure list in the 3rd pass.
However, since you opened the discussion about it, maybe we can chat about it:
- First of all, please notice that the `success` and `failure` lists are used only in tests.
- Second, there is no definition of what success or failure means.
  (Before my patch it was easier, because every container ended up in one of the categories.)
So here are possible solutions for how we can handle that:
Option 1
Define success as a successful application of the final state to the container.
So containers can be on the `success` list if the application of CPU sets is complete, on the `failure` list if an error happened, or not present on any list if they were not processed at all or only partially.
Results:
- if there is an error in the 1st pass, only failed containers from `exclusiveCPUContainers` will be reported on the `failure` list; `nonExclusiveCPUContainers` won't be reported.
- if there is an error in the 2nd pass (1st pass without errors), all `nonExclusiveCPUContainers` will be returned as `success` or `failure`; none of `exclusiveCPUContainers` will be reported.
- if we reach the 3rd pass (no errors in passes 1 and 2), all containers will be reported in one of the two categories.
Option 2
Make every container appear on one of the returned lists. That would require adding 2 more lists to be returned: `unprocessed` and `partial`.
Results:
- if there is an error in the 1st pass, failed containers from `exclusiveCPUContainers` will be reported on the `failure` list; partially set up containers from `exclusiveCPUContainers` will be reported on the `partial` list; `nonExclusiveCPUContainers` will be reported on the `unprocessed` list.
- if there is an error in the 2nd pass (1st pass without errors), all `nonExclusiveCPUContainers` will be returned as `success` or `failure`; all of `exclusiveCPUContainers` will be reported as `partial`.
- if we reach the 3rd pass (no errors in passes 1 and 2), all containers will be reported in one of the two categories `success` and `failure`; the `partial` and `unprocessed` lists will be empty.
@shiya0705, @esotsal please let me know your opinion.
After we decide, I will apply changes to the code (or not) and add tests to cover the scenarios.
My opinion is to leave this code as is (choosing Option 1), but add unit tests to cover all execution paths.
Both of these options are based on the assumption that if one container fails in an earlier pass, all of the containers in the following passes stop their cpuset update. That may not be very friendly.
Option 1 only reports containers for which updateContainerCPUSet failed; unprocessed and partial containers whose cpusets were not updated are ignored. Option 2 seems to cover the cases more comprehensively.
I added a new design option in comment https://github.com/esotsal/kubernetes/pull/11#discussion_r2870674556; let's check if it helps.
```go
// Determine the CPU set to use based on the pass
var targetCPUSet cpuset.CPUSet
if preliminary {
	targetCPUSet = rca.allocatedSet.Intersection(lcset)
```
preliminary is true only in the case of removing CPUs from containers using exclusive CPUs; in that case
targetCPUSet = rca.allocatedSet.Intersection(lcset) and targetCPUSet = rca.allocatedSet are the same. There is no need to add a preliminary condition check here.
We also need to think about the case when the set of CPUs is not only scaled down, but changed.
Let's consider the following scenario for a container:
- initial scenario state:
  allocated: {original: {1}, resized: {1, 2, 3}}, lastUpdateState: {1, 2, 3}
- scale down 3 -> 1:
  allocated: {original: {1}, resized: {1}}, lastUpdateState: {1, 2, 3}
- meanwhile some other container is resized and gets allocated CPUs 2 and 3
- scale up 1 -> 2:
  allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
- only now is reconcileState launched
The targetCPUSet for the first pass must be {1} - the intersection of allocated and lastUpdateState.
We don't want to use {1, 4}, as CPU 4 might still be used by some other container.
I understand your design idea. It seems reasonable.
```go
}

// Check if update is needed
if !targetCPUSet.Equals(lcset) {
```
In the first pass, updateContainers(exclusiveCPUContainers, true), all the containers in exclusiveCPUContainers which have been resized will update their cpuset, not only the containers whose CPUs were removed.
Yes, they update lastUpdateState each time changes to the runtime are made.
However, the change in the first pass (updateContainers(exclusiveCPUContainers, true)) might not be final - only some CPUs were removed - so in the 3rd pass we still need to check whether another update is required.
Look at the scenario from the comment above.
reconcileState is run with the following state: allocated: {original: {1}, resized: {1, 4}}, lastUpdateState: {1, 2, 3}
So:
- in the 1st pass there will be an update to {1} - only the removal of CPUs 2 and 3
- in the 3rd pass there will be an update to {1, 4} - the final set, once we have the guarantee that CPU 4 is no longer used by any other container
Since we missed the deadline for v1.36 and are now aiming for v1.37, I think it makes sense to merge this into the main PR and continue the discussion there. Please @lukaszwojciechowski rebase to merge your PR.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR modifies the algorithm for applying CPU sets to containers' runtime in the CPU manager's reconciliation loop.
It solves conflicts in CPU usage that might appear when resizing exclusive CPU resources without restart, with Guaranteed QoS class, static CPU policy, and InPlacePodVerticalScaling enabled.
For details about the conflicting scenarios, see the "Which issue(s) this PR fixes" section below.
PR structure
PR consists of 5 commits:
- verification that lastUpdateState contains expected values
- enabling verification of how the reconcile process completed for multiple containers, not just a single one
- extending test cases to cover the above changes and adding a simple multi-container test
- extending mockRuntimeService in the CPU manager tests to store CPUSets that were applied to the runtime
- verifying that exclusive CPUs are assigned to a single container only
- verifying that the runtime state matches expectations
- adding test cases for verification of CPUSet reconciliation for containers using exclusive CPUs
- modification of the algorithm (NOTE: test cases pass again)
- adding test cases covering UpdateContainerResources failure at different passes
Algorithm:
The containers are split into `exclusiveCPUContainers` and `nonExclusiveCPUContainers`. Usage of the lock guarantees a consistent state.
Step 3.1. remove scaled-down exclusive CPUs from containers:
Step 3.2. apply CPU sets for all non-exclusive containers:
Step 3.3. set final CPU sets for containers using exclusive CPUs
Which issue(s) this PR fixes:
Resizing integer CPU resources of pods with Guaranteed QoS class without restarting containers, when the CPU policy is static and InPlacePodVerticalScaling is enabled, requires handling more complex scenarios when applying the allocated CPUs to containers' runtime.
The algorithm used in reconcileState in the CPU manager has a few drawbacks that may lead to temporary conflicts of CPU usage (exclusive CPUs used by multiple containers):
Consistent state application
The reconcileState function does not apply the CPU sets of all containers as one consistent state. The loop over all pods and containers applies CPU sets one by one without any critical section. During the iteration, the allocated CPU sets and the default set can be changed by a parallel resize. Such an operation executes Allocate in another goroutine, changing the allocations, e.g.
From that moment, CPUs that were allocated in step 2 are applied to the runtime of both containers A and B.
The situation will fix itself the next time reconcileState is executed (after 10s).
Temporary conflicts when moving CPUs between containers
Consider the following scenario:
1. container A uses [1, 2]; container B uses [3, 4]
2. container B is scaled down to [3]
3. container A is scaled up to [1, 2, 4]
4. container A's runtime is updated to [1, 2, 4] (while container B still uses [3, 4])
5. container B's runtime is updated to [3]
Between steps 4 and 5, CPU 4 is applied in both containers. The situation is temporary and usually very short, but when it coincides with a kubelet restart or some delay, it might affect the operation of the containers.
Special notes for your reviewer:
Only the patch "rework CPUSet reconciliation algorithm" contains changes to the code.
The other patches extend tests and add test cases covering 100% of the changes made and the scenarios that the new algorithm fixes.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: