Skip to content

elastic_kubernetes_service: add management plane operations#6748

Open
ashishsuneja wants to merge 2 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_eks
Open

elastic_kubernetes_service: add management plane operations#6748
ashishsuneja wants to merge 2 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_eks

Conversation

@ashishsuneja

Copy link
Copy Markdown
Collaborator

Summary
Adds EKS-specific implementations of management plane async methods.

Main changes

  • _DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn,
    _ResolveReleaseVersion, CreateNodePoolAsync, DeleteNodePoolAsync,
    UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
    ResolveNodePoolVersions
  • Capacity reservation support flag-gated behind
    --eks_reserve_capacity_per_az (default: false)
  • logging.warning -> logging.info/raise per hubatish feedback

How tested

  • 43/44 tests passing (1 pre-existing failure unrelated to this PR)
  • EKS end-to-end: 99 pools, 100% success all 7 scenarios
  • pyink + lint-diffs clean

- _DiscoverSubnets: discovers cluster subnet IDs (cached)
- _DiscoverSubnetsPerAZ: maps AZ to subnet ID (cached)
- _DiscoverNodeRoleArn: discovers IAM node role ARN (cached)
- _ResolveReleaseVersion: resolves EKS release version for minor
- CreateNodePoolAsync: creates nodegroup, returns ng_active handle
- DeleteNodePoolAsync: deletes nodegroup, returns ng_gone handle
- UpgradeNodePoolAsync: upgrades nodegroup, returns ng_active handle
- UpdateClusterAsync: toggles cluster label, returns cluster handle
- WaitForOperation: polls nodegroup/cluster until terminal state
- ResolveNodePoolVersions: auto-detects initial/target versions
- Capacity reservation support flag-gated behind
  --eks_reserve_capacity_per_az (default: false)
- logging.warning -> logging.info/raise per hubatish feedback

Tested:
- 43/44 tests passing (1 pre-existing failure)
- EKS end-to-end: 99 pools, 100% success all 7 scenarios
- pyink + lint-diffs clean
@ashishsuneja ashishsuneja force-pushed the mgmt_plane_eks branch 2 times, most recently from 53a5035 to 958562d Compare June 17, 2026 19:28
else:
if nodepool_name not in self.nodepools:
logging.warning(
logging.info(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert changes to these pre-existing GetNodePool name functions; they're unrelated to your PR

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the current diff against master — no GetNodePool-named functions are touched in this PR. Looks like this was already cleaned up in an earlier amend.


def _DiscoverSubnets(self) -> list[str]:
"""Returns the EKS cluster's subnet IDs (cached after first call)."""
# pylint: disable-next=access-member-before-definition

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this as a parameter in init put set it to None. Should not need to pylint out of this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with functools.lru_cache instead (per your alternative suggestion below) — cleaner than either the pylint-disabled getattr pattern or adding init placeholders, since it removes the manual caching entirely. Applied the same fix to _DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn, and _ResolveReleaseVersion (which also dropped its threading.Lock — lru_cache is thread-safe by construction)

# and nodes launched there cannot reach the EKS API server to join.
az_map: dict[str, str] = {}
az_map_private: dict[str, str] = {}
for s in subnets:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely need a unit test for this for loops & mapping logic

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added testDiscoverSubnetsPerAZPrefersPublicSubnets, covering the public-preferred-over-private-in-same-AZ logic and the private-fallback-when-no-public-exists case.

def _DiscoverSubnetsPerAZ(self) -> dict[str, str]:
"""Returns a mapping of {AZ: subnet_id} for the cluster's subnets.

Used by CreateNodePoolAsync to distribute nodegroups round-robin across

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems annoyingly complicated/overly specific. Why do you need to do any of this? Does EKS Auto or Karpenter do this for you?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct — EKS Auto Mode (EksAutoCluster) and Karpenter (EksKarpenterCluster) are already supported in PKB as separate cluster classes, and neither needs this AZ round-robin logic since they handle node placement automatically. This benchmark specifically uses the standard EksCluster (managed node groups) instead.

def _DiscoverNodeRoleArn(self) -> str:
"""Returns a node IAM role ARN by inspecting an existing nodegroup."""
# pylint: disable-next=access-member-before-definition
if getattr(self, '_cached_node_role_arn', None):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put "add as variable, initialize to None in init" as a comment above. But perhaps since these are all functions, just use functools.lru_cache decorator instead. Then you don't even need an if statement; the second call will just return the results of the first.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — switched all four cached discovery methods (_DiscoverSubnets, _DiscoverSubnetsPerAZ, _DiscoverNodeRoleArn, _ResolveReleaseVersion) to @functools.lru_cache(maxsize=None), dropping the manual getattr/pylint-disable pattern and the threading.Lock entirely. This also resolves Comment 2 above — no init cache attributes needed at all now.

…hods

Replaces manual getattr/pylint-disable caching (and the threading.Lock
in _ResolveReleaseVersion) with functools.lru_cache across the four
EKS discovery methods: _DiscoverSubnets, _DiscoverSubnetsPerAZ,
_DiscoverNodeRoleArn, _ResolveReleaseVersion. lru_cache is thread-safe
by construction, removing the need for manual locking.

Also adds testDiscoverSubnetsPerAZPrefersPublicSubnets covering the
public-subnet-preferred-over-private-in-same-AZ logic and the
private-fallback-when-no-public-exists case.

Per hubatish's review on PR GoogleCloudPlatform#6748.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants