elastic_kubernetes_service: add management plane operations#6748
elastic_kubernetes_service: add management plane operations#6748ashishsuneja wants to merge 2 commits into
Conversation
- _DiscoverSubnets: discovers cluster subnet IDs (cached) - _DiscoverSubnetsPerAZ: maps AZ to subnet ID (cached) - _DiscoverNodeRoleArn: discovers IAM node role ARN (cached) - _ResolveReleaseVersion: resolves EKS release version for minor - CreateNodePoolAsync: creates nodegroup, returns ng_active handle - DeleteNodePoolAsync: deletes nodegroup, returns ng_gone handle - UpgradeNodePoolAsync: upgrades nodegroup, returns ng_active handle - UpdateClusterAsync: toggles cluster label, returns cluster handle - WaitForOperation: polls nodegroup/cluster until terminal state - ResolveNodePoolVersions: auto-detects initial/target versions - Capacity reservation support flag-gated behind --eks_reserve_capacity_per_az (default: false) - logging.warning -> logging.info/raise per hubatish feedback Tested: - 43/44 tests passing (1 pre-existing failure) - EKS end-to-end: 99 pools, 100% success all 7 scenarios - pyink + lint-diffs clean
53a5035 to
958562d
Compare
| else: | ||
| if nodepool_name not in self.nodepools: | ||
| logging.warning( | ||
| logging.info( |
There was a problem hiding this comment.
revert changes to these pre-existing GetNodePool name functions; they're unrelated to your PR
There was a problem hiding this comment.
Checked the current diff against master — no GetNodePool-named functions are touched in this PR. Looks like this was already cleaned up in an earlier amend.
|
|
||
| def _DiscoverSubnets(self) -> list[str]: | ||
| """Returns the EKS cluster's subnet IDs (cached after first call).""" | ||
| # pylint: disable-next=access-member-before-definition |
There was a problem hiding this comment.
add this as a parameter in init put set it to None. Should not need to pylint out of this
There was a problem hiding this comment.
Went with functools.lru_cache instead (per your alternative suggestion below) — cleaner than either the pylint-disabled getattr pattern or adding init placeholders, since it removes the manual caching entirely. Applied the same fix to _DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn, and _ResolveReleaseVersion (which also dropped its threading.Lock — lru_cache is thread-safe by construction)
| # and nodes launched there cannot reach the EKS API server to join. | ||
| az_map: dict[str, str] = {} | ||
| az_map_private: dict[str, str] = {} | ||
| for s in subnets: |
There was a problem hiding this comment.
definitely need a unit test for this for loops & mapping logic
There was a problem hiding this comment.
Added testDiscoverSubnetsPerAZPrefersPublicSubnets, covering the public-preferred-over-private-in-same-AZ logic and the private-fallback-when-no-public-exists case.
| def _DiscoverSubnetsPerAZ(self) -> dict[str, str]: | ||
| """Returns a mapping of {AZ: subnet_id} for the cluster's subnets. | ||
|
|
||
| Used by CreateNodePoolAsync to distribute nodegroups round-robin across |
There was a problem hiding this comment.
This seems annoyingly complicated/overly specific. Why do you need to do any of this? Does EKS Auto or Karpenter do this for you?
There was a problem hiding this comment.
Correct — EKS Auto Mode (EksAutoCluster) and Karpenter (EksKarpenterCluster) are already supported in PKB as separate cluster classes, and neither needs this AZ round-robin logic since they handle node placement automatically. This benchmark specifically uses the standard EksCluster (managed node groups) instead.
| def _DiscoverNodeRoleArn(self) -> str: | ||
| """Returns a node IAM role ARN by inspecting an existing nodegroup.""" | ||
| # pylint: disable-next=access-member-before-definition | ||
| if getattr(self, '_cached_node_role_arn', None): |
There was a problem hiding this comment.
I've put "add as variable, initialize to None in init" as a comment above. But perhaps since these are all functions, just use functools.lru_cache decorator instead. Then you don't even need an if statement; the second call will just return the results of the first.
There was a problem hiding this comment.
Done — switched all four cached discovery methods (_DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn, _ResolveReleaseVersion) to @functools.lru_cache(maxsize=None), dropping the manual getattr/pylint-disable pattern and the threading.Lock entirely. This also resolves Comment 2 above — no init cache attributes needed at all now.
…hods Replaces manual getattr/pylint-disable caching (and the threading.Lock in _ResolveReleaseVersion) with functools.lru_cache across the four EKS discovery methods: _DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn, _ResolveReleaseVersion. lru_cache is thread-safe by construction, removing the need for manual locking. Also adds testDiscoverSubnetsPerAZPrefersPublicSubnets covering the public-subnet-preferred-over-private-in-same-AZ logic and the private-fallback-when-no-public-exists case. Per hubatish's review on PR GoogleCloudPlatform#6748.
Summary
Adds EKS-specific implementations of management plane async methods.
Main changes
_ResolveReleaseVersion, CreateNodePoolAsync, DeleteNodePoolAsync,
UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
ResolveNodePoolVersions
--eks_reserve_capacity_per_az (default: false)
How tested