google_kubernetes_engine: add management plane operations by ashishsuneja · Pull Request #6747 · GoogleCloudPlatform/PerfKitBenchmarker

ashishsuneja · 2026-06-05T17:11:48Z

Summary
Adds GKE-specific implementations of management plane async methods.

Main changes

_IssueAsync, CreateNodePoolAsync, DeleteNodePoolAsync,
UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
ResolveNodePoolVersions, _GetLatestOperationName, GetNodePoolNames

How tested

Existing GKE test suite passing
pyink + lint-diffs clean

- _IssueAsync: issues gcloud --async commands, returns op name - CreateNodePoolAsync: creates nodepool, returns op handle - DeleteNodePoolAsync: deletes nodepool, returns op handle - UpgradeNodePoolAsync: upgrades nodepool, returns op handle - UpdateClusterAsync: toggles label for non-destructive update - WaitForOperation: polls until DONE/ABORTING - ResolveNodePoolVersions: auto-detects initial/target versions - _GetLatestOperationName: fallback op lookup by type+target - GetNodePoolNames: lists current node pools Tested: - Existing GKE test suite passing - pyink + lint-diffs clean

hubatish · 2026-06-11T20:35:44Z

+        Operation name string, or empty string if none found.
+    """
+    link_target = target_name or self.name
+    if op_start_time:


This is good logic; can we not just do it every time / require op_start_time? Additional conditions when not needed are just additional complexity.

Done — collapsed it. _GetLatestOperationName now always uses the broadened filter (RUNNING/PENDING/DONE with a startTime>= guard) and always takes op_start_time. The guard already prevents matching stale completed ops, so the two-branch conditional was redundant.

hubatish · 2026-06-11T20:37:14Z


+  def _IssueAsync(self, cmd: util.GcloudCommand) -> str:
+    """Issues a gcloud command with --async, returns the operation name."""
+    cmd.args.append('--async')


Ah, I was wondering what made these operations async. got it.

Acknowledged

hubatish · 2026-06-11T20:39:44Z

+        f'targetLink ~ {link_target}'
+        f'{time_filter}'
+    )
+    for attempt in range(1, max_attempts + 1):


what are some examples where you try to call the operation but it doesn't get seen immediately? does it just take a minute for results to show up?

A unit test with semi-real output examples would be helpful.

Added unit tests (GoogleKubernetesEngineAsyncOpsTestCase) with captured-style gcloud output: create returns the op name directly (no fallback), upgrade and update fall back and recover from the operations list, and the no-fallback-configured case raises. On the timing: the operation can take a moment to leave PENDING, and fast ops (the label-update) may already be DONE by the time we query — both handled by the broadened filter. In practice on the 100-pool run it resolved on the first query every time; the 5×3s retry is a safety margin for the PENDING-transition window.

hubatish · 2026-06-11T20:41:19Z

+      raise errors.Resource.CreationError(
+          f'GKE async command returned no operation name; stderr={stderr}'
+      )
+    return op_name


below you have _GetLatestOperationName command.. does this command here not output the operation name? Shouldn't it just be:

start operation, name is returned

wait for operation to finish. now done.

For create and delete, that's exactly the flow — _IssueAsync issues --async, gcloud prints the operation name, we wait on it. The exception is clusters upgrade --node-pool and clusters update: those reliably return success with empty stdout (gcloud just doesn't print the name for those two subcommands). Verified on a 100-pool GKE run — 99/99 upgrades and the cluster-update all came back with no operation name, so _GetLatestOperationName recovers it from the operations list. I didn't find a gcloud flag that makes upgrade/update print it; if one exists I'd happily drop the fallback.

hubatish · 2026-06-11T20:42:53Z

+      # Fallback: gcloud succeeded but printed nothing. Query the operations
+      # list scoped to this specific nodepool to find the operation name.
+      logging.info(
+          'UpgradeNodePoolAsync: ops list fallback for %s: %s',


why/when/how frequently does this happen?

Every time, not intermittently — clusters upgrade --node-pool --async returned no operation name on 99/99 upgrades in the 100-pool run. So the fallback runs on every upgrade/update.

hubatish · 2026-06-11T21:03:29Z

+    try:
+      return self._IssueAsync(cmd)
+    except errors.Resource.CreationError as e:
+      if 'returned no operation name' not in str(e):


this logic is shared with UpgradeNodePoolAsync above. Refactor to only use in one location. Perhaps an optional parameter in _IssueAsync which can handle this "no operation name" case. If ofc this is necessary at all.

Done, exactly as suggested — moved the fallback into _IssueAsync behind optional fallback_op_type / fallback_target params, so the "no op name → query operations list" path lives in one place. UpgradeNodePoolAsync and UpdateClusterAsync are now one-liners

…sueAsync; add tests

hubatish · 2026-06-26T16:55:12Z

/gcbrun

hubatish · 2026-06-26T16:59:47Z

+        f'targetLink ~ {link_target} AND '
+        f'startTime>="{from_time}"'
+    )
+    for attempt in range(1, max_attempts + 1):


have you looked at vm_util.Retry? Used with @vm_util.Retry(poll_interval,max_retries,timeout). I suspect you're rewriting code rather than utilizing PKB standards/logic.

Done — refactored to use vm_util.Retry instead of the manual for-loop/sleep. Since max_attempts/retry_delay are call-time parameters here (not fixed), I apply the decorator to a locally-defined inner function inside the method rather than statically on it — poll_interval/max_retries still fully respect the caller's values via closure. Scoped retryable_exceptions to errors.Resource.GetError so only the "not found yet" case retries, not unrelated failures.

hubatish · 2026-06-26T17:04:03Z

+        fallback_op_type,
+        fallback_target,
+    )
+    return self._GetLatestOperationName(


so this is only for operations that don't return an op name? How common is that? Is it in the output but just harder to parse, or completely absent?

Only UPGRADE_NODES and UPDATE_CLUSTER hit this path — create/delete always print the operation name directly (the fallback is disabled for those via fallback_op_type=None). For upgrade/update it's not intermittent — verified 99/99 upgrades plus the cluster update returned empty stdout in the 100-pool run. It's completely absent from gcloud's output for those two subcommands, not a parsing issue

hubatish · 2026-06-26T17:08:16Z

+    Returns:
+      The operation name.
+    """
+    cmd.args.append('--async')


This method overall looks useful enough & broad enough we should consider putting it in providers/gcp/utils.py's GcloudCommand. The --async feature is supported across many gcloud commands & should be broadly applicable right? Even if the fallback "get lastest" behavior would vary - it could take like an optional "get_latest" lambda as an argument. (and given that said fallback is necessary).

Done — added GcloudCommand.IssueAsync(get_latest_op_fn=...) to util.py. GKE's _GetLatestOperationName is now passed in as the callable, so the GKE-specific filter logic stays in google_kubernetes_engine.py while the async-issue/fallback shape is reusable by any GCP resource. Purely additive — GcloudCommand.Issue() itself is unchanged, so none of the existing 194 call sites across GCP providers are affected. Verified with the full tests/providers/gcp/ suite.

hubatish · 2026-06-26T17:12:02Z

+  def UpgradeNodePoolAsync(self, name: str, target_version: str) -> str:
+    """Initiates node pool upgrade; returns op handle. Does NOT wait.
+
+    `clusters upgrade --node-pool --async` returns success with empty stdout


got it ok I see where we need this now. Are we running multiple of these in parallel? That's where "get latest" would break.

Yes — concurrent_node_pool_ops runs many of these in parallel (concurrent upgrades across node pools). It's safe: the filter includes targetLink ~ {link_target}, and each concurrent call passes its own specific fallback_target (the individual node-pool name) — so concurrent calls each resolve only operations matching their own target, not just "whatever's most recent." Combined with the op_start_time guard, this disambiguates correctly even when many ops start within the same few seconds. I'll add an explicit concurrent-calls test case to make this guarantee visible in the test suite.

hubatish · 2026-06-26T17:18:39Z

+    cmd.flags['update-labels'] = f'k8s-mgmt-ts={int(time.time())}'
+    # GcloudCommand sets --quiet by default; the label update is
+    # non-interactive so it's safe to drop (matches how this was validated).
+    cmd.flags.pop('quiet', None)


yeah this pop logic is very GcloudCommand specific. Seconding the "add async as option in utils.py" suggestion as a refactor. Might move some unit tests over as well.

Done — added GcloudCommandIssueAsyncTestCase to util_test.py, covering direct op-name return, fallback invocation, raise-when-no-fallback, raise-on-command-failure, and the --async/format flag setup.

…sync Moves the --async issue + 'no op name printed, fall back to listing operations' pattern out of GKE-specific _IssueAsync into a reusable GcloudCommand.IssueAsync method, with the resource-specific lookup (_GetLatestOperationName for GKE) passed in as a callable. Per hubatish's review on PR GoogleCloudPlatform#6747.

hubatish · 2026-06-29T17:19:08Z

/gcbrun

hubatish

looks pretty reasonable! Left a few more comments on the move to util.py but generally lgtm. Will start trying to approve/submit after last round of comments & the base is submitted.

hubatish · 2026-06-29T17:05:16Z

+    """
+    self.args.append('--async')
+    self.flags['format'] = 'value(name)'
+    stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs)


Apologies if I asked this already, but why raise_on_failure=False ? That's a big problem for general use - maybe it should be an argument passed from google_kubernetes_engine. Especially given the "if retcode" below it just looks like you're changing the exception type returned.

Done - removed the hardcoded override. IssueAsync now just calls self.Issue(**kwargs), so it inherits Issue()'s own default (raise_on_failure=True) unless a caller explicitly opts out via **kwargs. The manual if retcode: raise is gone too — it was just reimplementing what the default already does.

hubatish · 2026-06-29T17:05:53Z

+    self.flags['format'] = 'value(name)'
+    stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs)
+    if retcode:
+      raise errors.Resource.CreationError(stderr)


Said exception type also doesn't really make sense in a more general location - this IssueAsync command doesn't have to be called in a _Create context.

Agreed — removed errors.Resource.CreationError from this path entirely. Since raise_on_failure now defaults to True, a failing command raises via Issue()/IssueCommand's own exception (errors.VmUtil.IssueCommandError) before we'd ever reach a retcode check here.

hubatish · 2026-06-29T17:06:41Z

+    if op_name:
+      return op_name
+    if get_latest_op_fn is None:
+      raise errors.Resource.CreationError(


raise probably errors.VmUtil.IssueCommandError instead.

Done — switched the "no op name and no fallback" raise to errors.VmUtil.IssueCommandError, matching what the command-failure path now raises by default too.

…ueCommandError Per hubatish's review: - Removed hardcoded raise_on_failure=False -- IssueAsync now inherits Issue()'s own default (True), so callers control this via **kwargs if needed, and the manual retcode check (which just reimplemented the default) is gone. - Switched the remaining 'no op name, no fallback' exception from errors.Resource.CreationError to errors.VmUtil.IssueCommandError, since IssueAsync isn't always called in a _Create context. Updated 3 tests to match: 2 needed the new exception type, and testIssueAsyncRaisesOnCommandFailure now mocks Issue() raising (side_effect) rather than returning a failure tuple, matching real behavior now that retcode isn't checked manually.

Ashish Suneja added 3 commits June 9, 2026 12:47

google_kubernetes_engine: remove leftover commented-out line

9b8ec25

google_kubernetes_engine: remove leftover commented-out code

b78f357

ashishsuneja force-pushed the mgmt_plane_gke branch from 9acc83e to b78f357 Compare June 9, 2026 12:47

ashishsuneja mentioned this pull request Jun 10, 2026

kubernetes_management: add benchmark + base cluster management plane methods #6746

Open

hubatish reviewed Jun 11, 2026

View reviewed changes

google_kubernetes_engine: consolidate async op-name fallback into _Is…

d3eec15

…sueAsync; add tests

ashishsuneja force-pushed the mgmt_plane_gke branch 2 times, most recently from 2cd4d5b to d3eec15 Compare June 17, 2026 19:21

hubatish reviewed Jun 26, 2026

View reviewed changes

google_kubernetes_engine: use vm_util.Retry in _GetLatestOperationName

b0768fc

ashishsuneja mentioned this pull request Jun 29, 2026

providers/gcp: generalize async op handling into GcloudCommand.IssueAsync #6803

Closed

Ashish Suneja added 3 commits June 29, 2026 12:20

providers/gcp: add unit tests for GcloudCommand.IssueAsync

1cba1e7

providers/gcp: fix blank-line spacing in util_test.py

94ae636

hubatish self-assigned this Jun 29, 2026

hubatish approved these changes Jun 29, 2026

View reviewed changes

Uh oh!

Conversation

ashishsuneja commented Jun 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ashishsuneja Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hubatish commented Jun 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hubatish commented Jun 29, 2026

Uh oh!

hubatish left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ashishsuneja Jun 15, 2026 •

edited

Loading