google_kubernetes_engine: add management plane operations#6747
google_kubernetes_engine: add management plane operations#6747ashishsuneja wants to merge 9 commits into
Conversation
- _IssueAsync: issues gcloud --async commands, returns op name - CreateNodePoolAsync: creates nodepool, returns op handle - DeleteNodePoolAsync: deletes nodepool, returns op handle - UpgradeNodePoolAsync: upgrades nodepool, returns op handle - UpdateClusterAsync: toggles label for non-destructive update - WaitForOperation: polls until DONE/ABORTING - ResolveNodePoolVersions: auto-detects initial/target versions - _GetLatestOperationName: fallback op lookup by type+target - GetNodePoolNames: lists current node pools Tested: - Existing GKE test suite passing - pyink + lint-diffs clean
9acc83e to
b78f357
Compare
| Operation name string, or empty string if none found. | ||
| """ | ||
| link_target = target_name or self.name | ||
| if op_start_time: |
There was a problem hiding this comment.
This is good logic; can we not just do it every time / require op_start_time? Additional conditions when not needed are just additional complexity.
There was a problem hiding this comment.
Done — collapsed it. _GetLatestOperationName now always uses the broadened filter (RUNNING/PENDING/DONE with a startTime>= guard) and always takes op_start_time. The guard already prevents matching stale completed ops, so the two-branch conditional was redundant.
|
|
||
| def _IssueAsync(self, cmd: util.GcloudCommand) -> str: | ||
| """Issues a gcloud command with --async, returns the operation name.""" | ||
| cmd.args.append('--async') |
There was a problem hiding this comment.
Ah, I was wondering what made these operations async. got it.
| f'targetLink ~ {link_target}' | ||
| f'{time_filter}' | ||
| ) | ||
| for attempt in range(1, max_attempts + 1): |
There was a problem hiding this comment.
what are some examples where you try to call the operation but it doesn't get seen immediately? does it just take a minute for results to show up?
A unit test with semi-real output examples would be helpful.
There was a problem hiding this comment.
Added unit tests (GoogleKubernetesEngineAsyncOpsTestCase) with captured-style gcloud output: create returns the op name directly (no fallback), upgrade and update fall back and recover from the operations list, and the no-fallback-configured case raises. On the timing: the operation can take a moment to leave PENDING, and fast ops (the label-update) may already be DONE by the time we query — both handled by the broadened filter. In practice on the 100-pool run it resolved on the first query every time; the 5×3s retry is a safety margin for the PENDING-transition window.
| raise errors.Resource.CreationError( | ||
| f'GKE async command returned no operation name; stderr={stderr}' | ||
| ) | ||
| return op_name |
There was a problem hiding this comment.
below you have _GetLatestOperationName command.. does this command here not output the operation name? Shouldn't it just be:
- start operation, name is returned
- wait for operation to finish. now done.
There was a problem hiding this comment.
For create and delete, that's exactly the flow — _IssueAsync issues --async, gcloud prints the operation name, we wait on it. The exception is clusters upgrade --node-pool and clusters update: those reliably return success with empty stdout (gcloud just doesn't print the name for those two subcommands). Verified on a 100-pool GKE run — 99/99 upgrades and the cluster-update all came back with no operation name, so _GetLatestOperationName recovers it from the operations list. I didn't find a gcloud flag that makes upgrade/update print it; if one exists I'd happily drop the fallback.
| # Fallback: gcloud succeeded but printed nothing. Query the operations | ||
| # list scoped to this specific nodepool to find the operation name. | ||
| logging.info( | ||
| 'UpgradeNodePoolAsync: ops list fallback for %s: %s', |
There was a problem hiding this comment.
why/when/how frequently does this happen?
There was a problem hiding this comment.
Every time, not intermittently — clusters upgrade --node-pool --async returned no operation name on 99/99 upgrades in the 100-pool run. So the fallback runs on every upgrade/update.
| try: | ||
| return self._IssueAsync(cmd) | ||
| except errors.Resource.CreationError as e: | ||
| if 'returned no operation name' not in str(e): |
There was a problem hiding this comment.
this logic is shared with UpgradeNodePoolAsync above. Refactor to only use in one location. Perhaps an optional parameter in _IssueAsync which can handle this "no operation name" case. If ofc this is necessary at all.
There was a problem hiding this comment.
Done, exactly as suggested — moved the fallback into _IssueAsync behind optional fallback_op_type / fallback_target params, so the "no op name → query operations list" path lives in one place. UpgradeNodePoolAsync and UpdateClusterAsync are now one-liners
…sueAsync; add tests
2cd4d5b to
d3eec15
Compare
|
/gcbrun |
| f'targetLink ~ {link_target} AND ' | ||
| f'startTime>="{from_time}"' | ||
| ) | ||
| for attempt in range(1, max_attempts + 1): |
There was a problem hiding this comment.
have you looked at vm_util.Retry? Used with @vm_util.Retry(poll_interval,max_retries,timeout). I suspect you're rewriting code rather than utilizing PKB standards/logic.
There was a problem hiding this comment.
Done — refactored to use vm_util.Retry instead of the manual for-loop/sleep. Since max_attempts/retry_delay are call-time parameters here (not fixed), I apply the decorator to a locally-defined inner function inside the method rather than statically on it — poll_interval/max_retries still fully respect the caller's values via closure. Scoped retryable_exceptions to errors.Resource.GetError so only the "not found yet" case retries, not unrelated failures.
| fallback_op_type, | ||
| fallback_target, | ||
| ) | ||
| return self._GetLatestOperationName( |
There was a problem hiding this comment.
so this is only for operations that don't return an op name? How common is that? Is it in the output but just harder to parse, or completely absent?
There was a problem hiding this comment.
Only UPGRADE_NODES and UPDATE_CLUSTER hit this path — create/delete always print the operation name directly (the fallback is disabled for those via fallback_op_type=None). For upgrade/update it's not intermittent — verified 99/99 upgrades plus the cluster update returned empty stdout in the 100-pool run. It's completely absent from gcloud's output for those two subcommands, not a parsing issue
| Returns: | ||
| The operation name. | ||
| """ | ||
| cmd.args.append('--async') |
There was a problem hiding this comment.
This method overall looks useful enough & broad enough we should consider putting it in providers/gcp/utils.py's GcloudCommand. The --async feature is supported across many gcloud commands & should be broadly applicable right? Even if the fallback "get lastest" behavior would vary - it could take like an optional "get_latest" lambda as an argument. (and given that said fallback is necessary).
There was a problem hiding this comment.
Done — added GcloudCommand.IssueAsync(get_latest_op_fn=...) to util.py. GKE's _GetLatestOperationName is now passed in as the callable, so the GKE-specific filter logic stays in google_kubernetes_engine.py while the async-issue/fallback shape is reusable by any GCP resource. Purely additive — GcloudCommand.Issue() itself is unchanged, so none of the existing 194 call sites across GCP providers are affected. Verified with the full tests/providers/gcp/ suite.
| def UpgradeNodePoolAsync(self, name: str, target_version: str) -> str: | ||
| """Initiates node pool upgrade; returns op handle. Does NOT wait. | ||
|
|
||
| `clusters upgrade --node-pool --async` returns success with empty stdout |
There was a problem hiding this comment.
got it ok I see where we need this now. Are we running multiple of these in parallel? That's where "get latest" would break.
There was a problem hiding this comment.
Yes — concurrent_node_pool_ops runs many of these in parallel (concurrent upgrades across node pools). It's safe: the filter includes targetLink ~ {link_target}, and each concurrent call passes its own specific fallback_target (the individual node-pool name) — so concurrent calls each resolve only operations matching their own target, not just "whatever's most recent." Combined with the op_start_time guard, this disambiguates correctly even when many ops start within the same few seconds. I'll add an explicit concurrent-calls test case to make this guarantee visible in the test suite.
| cmd.flags['update-labels'] = f'k8s-mgmt-ts={int(time.time())}' | ||
| # GcloudCommand sets --quiet by default; the label update is | ||
| # non-interactive so it's safe to drop (matches how this was validated). | ||
| cmd.flags.pop('quiet', None) |
There was a problem hiding this comment.
yeah this pop logic is very GcloudCommand specific. Seconding the "add async as option in utils.py" suggestion as a refactor. Might move some unit tests over as well.
There was a problem hiding this comment.
Done — added GcloudCommandIssueAsyncTestCase to util_test.py, covering direct op-name return, fallback invocation, raise-when-no-fallback, raise-on-command-failure, and the --async/format flag setup.
…sync Moves the --async issue + 'no op name printed, fall back to listing operations' pattern out of GKE-specific _IssueAsync into a reusable GcloudCommand.IssueAsync method, with the resource-specific lookup (_GetLatestOperationName for GKE) passed in as a callable. Per hubatish's review on PR GoogleCloudPlatform#6747.
|
/gcbrun |
hubatish
left a comment
There was a problem hiding this comment.
looks pretty reasonable! Left a few more comments on the move to util.py but generally lgtm. Will start trying to approve/submit after last round of comments & the base is submitted.
| """ | ||
| self.args.append('--async') | ||
| self.flags['format'] = 'value(name)' | ||
| stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs) |
There was a problem hiding this comment.
Apologies if I asked this already, but why raise_on_failure=False ? That's a big problem for general use - maybe it should be an argument passed from google_kubernetes_engine. Especially given the "if retcode" below it just looks like you're changing the exception type returned.
There was a problem hiding this comment.
Done - removed the hardcoded override. IssueAsync now just calls self.Issue(**kwargs), so it inherits Issue()'s own default (raise_on_failure=True) unless a caller explicitly opts out via **kwargs. The manual if retcode: raise is gone too — it was just reimplementing what the default already does.
| self.flags['format'] = 'value(name)' | ||
| stdout, stderr, retcode = self.Issue(raise_on_failure=False, **kwargs) | ||
| if retcode: | ||
| raise errors.Resource.CreationError(stderr) |
There was a problem hiding this comment.
Said exception type also doesn't really make sense in a more general location - this IssueAsync command doesn't have to be called in a _Create context.
There was a problem hiding this comment.
Agreed — removed errors.Resource.CreationError from this path entirely. Since raise_on_failure now defaults to True, a failing command raises via Issue()/IssueCommand's own exception (errors.VmUtil.IssueCommandError) before we'd ever reach a retcode check here.
There was a problem hiding this comment.
Agreed — removed errors.Resource.CreationError from this path entirely. Since raise_on_failure now defaults to True, a failing command raises via Issue()/IssueCommand's own exception (errors.VmUtil.IssueCommandError) before we'd ever reach a retcode check here.
| if op_name: | ||
| return op_name | ||
| if get_latest_op_fn is None: | ||
| raise errors.Resource.CreationError( |
There was a problem hiding this comment.
raise probably errors.VmUtil.IssueCommandError instead.
There was a problem hiding this comment.
Done — switched the "no op name and no fallback" raise to errors.VmUtil.IssueCommandError, matching what the command-failure path now raises by default too.
…ueCommandError Per hubatish's review: - Removed hardcoded raise_on_failure=False -- IssueAsync now inherits Issue()'s own default (True), so callers control this via **kwargs if needed, and the manual retcode check (which just reimplemented the default) is gone. - Switched the remaining 'no op name, no fallback' exception from errors.Resource.CreationError to errors.VmUtil.IssueCommandError, since IssueAsync isn't always called in a _Create context. Updated 3 tests to match: 2 needed the new exception type, and testIssueAsyncRaisesOnCommandFailure now mocks Issue() raising (side_effect) rather than returning a failure tuple, matching real behavior now that retcode isn't checked manually.
Summary
Adds GKE-specific implementations of management plane async methods.
Main changes
UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
ResolveNodePoolVersions, _GetLatestOperationName, GetNodePoolNames
How tested