add agentic benchmarking on gke by george-kalisse-sada · Pull Request #6772 · GoogleCloudPlatform/PerfKitBenchmarker

george-kalisse-sada · 2026-06-16T08:13:52Z

Agentic Workload Benchmarking for GKE (PKB Extension)

Summary

Adds a complete benchmarking framework for Agentic Workloads on Google Kubernetes Engine (GKE) — specifically measuring per-operation performance of untrusted Python code execution and headless Chromium browser tasks running under gVisor (GKE Agent Sandbox) isolation.

Motivation

AI agent systems require ephemeral, isolated execution environments (sandboxes) for running untrusted code. Understanding the performance characteristics of these sandboxes under gVisor — including cold-start latency, execution overhead, memory density limits, and scheduling throughput — is critical for production capacity planning.

This framework enables systematic, repeatable measurement of these characteristics across multiple GCP machine families.

Architecture

Benchmark Definitions (7 Use Cases)

Benchmark	Use Case	Measures
`gke_snapshot`	UC-A: Cold Start & Snapshot	Pod snapshot create/restore latency under CRIU
`gke_python_density`	UC-B: Python Density	CEL, TTFE, RSS growth at varying concurrency
`gke_chromium_density`	UC-C: Chromium Density	Interaction latency, screenshot time at scale
`gke_payload`	UC-D: Payload Transfer	Sandbox→orchestrator data transfer saturation
`gke_warmpool`	UC-E: Warmpool Scale-Up	Bulk provisioning speed (0→N pods)
`gke_qps`	UC-F: QPS Saturation	Scheduling throughput until pool drain
`gke_deletion`	UC-G: Deletion & Cleanup	Bulk deletion latency and IP reclamation

Shared Utilities

Module	Purpose
`gke_benchmark_utils.py`	Agent API interaction, kubectl helpers, warm pool management, port-forward manager, sample construction
`gke_deploy_utils.py`	Idempotent workload deployment (CRDs, templates, warm pools, router, ADK agent, PSI reader)
`gke_provision_utils.py`	Full GKE infrastructure lifecycle (VPC, NAT, cluster, node pools, AR, IAM)
`gke_image_build_utils.py`	Container image builds via Cloud Build (ADK agent, Chrome sandbox, Sandbox Router)
`gke_prerequisite_setup.py`	Standalone script for pre-PKB infrastructure (VPC, NAT, AR, SA, images)

Dual Provisioning Modes

custom mode: Direct gcloud calls for full infrastructure control
native mode: Uses PKB's built-in container_cluster provisioner with prerequisite script for resources PKB cannot manage

PKB Provider Extensions

Small additions to support GKE preview features:

--gke_use_beta flag (forces gcloud beta container clusters create)
--gke_additional_flags list (appended to cluster create)
--gke_additional_nodepool_flags list (appended to node pool create)
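As a rough illustration of what these pass-through flags do, the sketch below assembles a cluster-create argv list. It is stdlib-only; `build_cluster_create_cmd` and its parameters are hypothetical names chosen for this example and are not PKB's actual implementation.

```python
from typing import List, Optional


def build_cluster_create_cmd(
    cluster_name: str,
    use_beta: bool = False,
    additional_flags: Optional[List[str]] = None,
) -> List[str]:
    """Assemble a gcloud argv list (hypothetical helper, not PKB code)."""
    cmd = ["gcloud"]
    if use_beta:
        # Mirrors --gke_use_beta: route the call through the beta track.
        cmd.append("beta")
    cmd += ["container", "clusters", "create", cluster_name]
    # Mirrors --gke_additional_flags: extra flags are appended verbatim.
    cmd += list(additional_flags or [])
    return cmd


cmd = build_cluster_create_cmd(
    "demo-cluster",
    use_beta=True,
    additional_flags=["--enable-dataplane-v2", "--enable-ip-alias"],
)
print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when a flag value contains spaces.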

In-Cluster Components

ADK Agent (`workloads/adk_agent/`)

A FastAPI service deployed inside GKE that:

Exposes REST endpoints for each benchmark type (/benchmark/python/density, /benchmark/python/payload, /benchmark/python/qps, /benchmark/chromium/density)
Uses a Mock LLM (no real model calls) to drive the ADK Runner through sandbox claim→execute→release cycles
Connects to sandboxes via DirectConnection (in-cluster) or kubectl port-forward (dev mode)
Measures both orchestrator-side and sandbox-side metrics
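The claim→execute→release cycle can be pictured with a small timing harness. This is a minimal sketch under stated assumptions: `FakeSandboxPool` and `timed_cycle` are hypothetical stand-ins that run in memory, and nothing here calls the real ADK Runner or a real sandbox.

```python
import time
from collections import deque
from typing import Callable, Dict


class FakeSandboxPool:
    """In-memory stand-in for a warm pool (hypothetical, for illustration)."""

    def __init__(self, size: int) -> None:
        self._free = deque(f"sandbox-{i}" for i in range(size))

    def claim(self) -> str:
        return self._free.popleft()

    def release(self, sandbox_id: str) -> None:
        self._free.append(sandbox_id)

    def available(self) -> int:
        return len(self._free)


def timed_cycle(pool: FakeSandboxPool, task: Callable[[], object]) -> Dict[str, float]:
    """Run one claim -> execute -> release cycle and time each phase."""
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    sandbox_id = pool.claim()
    timings["claim_s"] = time.perf_counter() - start
    try:
        exec_start = time.perf_counter()
        task()  # stands in for code running inside the sandbox
        timings["execute_s"] = time.perf_counter() - exec_start
    finally:
        # Always return the sandbox, even if the task raised.
        release_start = time.perf_counter()
        pool.release(sandbox_id)
        timings["release_s"] = time.perf_counter() - release_start
    timings["total_s"] = time.perf_counter() - start
    return timings


pool = FakeSandboxPool(size=2)
result = timed_cycle(pool, lambda: sum(range(10_000)))
print(sorted(result))
```

Timing each phase separately is what lets the orchestrator-side total be compared against whatever the sandbox reports for its own execution time.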

Sandbox Scripts (`sandboxed_apps/`)

benchmark_density.py — CPU-bound, syscall-heavy, and import-heavy tasks with RSS tracking
benchmark_payload.py — Payload generation, serialization, and stdout transfer measurement
benchmark_qps.py — Minimal script proving sandbox liveness
benchmark_density.js — Playwright-driven Chromium interaction benchmark
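To make the three task classes concrete, here is a hedged sketch of a density-style script. It is not the PR's `benchmark_density.py`; the function names and loop sizes are assumptions, and it relies on the Unix-only `resource` module to read peak RSS.

```python
import importlib
import os
import resource
import time


def max_rss() -> int:
    """Peak RSS of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def cpu_bound(n: int = 200_000) -> int:
    # Pure computation; no kernel involvement beyond scheduling.
    return sum(i * i for i in range(n))


def syscall_heavy(n: int = 2_000) -> int:
    # Each os.stat call crosses the kernel boundary, which is where a
    # sandbox runtime such as gVisor intercepts.
    for _ in range(n):
        os.stat(".")
    return n


def import_heavy() -> str:
    # Module import exercises file lookups and bytecode loading.
    return importlib.import_module("json").__name__


def run_tasks() -> dict:
    results = {}
    tasks = (("cpu", cpu_bound), ("syscall", syscall_heavy), ("import", import_heavy))
    for name, task in tasks:
        rss_before = max_rss()
        start = time.perf_counter()
        task()
        results[name] = {
            "seconds": time.perf_counter() - start,
            "rss_growth": max_rss() - rss_before,
        }
    return results


results = run_tasks()
for name, stats in results.items():
    print(name, stats)
```

Because `ru_maxrss` is a high-water mark, `rss_growth` only captures increases in peak memory, not memory that was freed again.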

Vibe Coding Workloads (`workloads/vibe_coding/`)

Startup scripts simulating real-world agentic cold-starts:

startup_pip_fastapi.sh — pip install + FastAPI server boot
startup_npm_vite.sh — npm install + Vite dev server boot

Usage

Prerequisites (once per environment)

python -m perfkitbenchmarker.linux_benchmarks.kubernetes.agentic.gke_prerequisite_setup \
    --project_id=sada-gke-benchmarking2 \
    --region=us-central1 \
    --zone=us-central1-a \
    --machine_type=c4-standard-8

Provision Cluster

python pkb.py --benchmarks=gke_python_density \
    --run_stage=provision \
    --gke_provision_mode=native \
    --project=sada-gke-benchmarking2 \
    --owner=george-kalisse \
    --benchmark_config_file=k8s_agents/config/native_provision_config.yaml \
    --gce_network_name=george-agentic-vpc \
    --gce_subnet_region=us-central1 \
    --zone=us-central1-a \
    --container_cluster_version=1.35.3-gke.1389000 \
    --gke_use_beta=true \
    --gke_additional_flags="--enable-pod-snapshots,--enable-dataplane-v2,--enable-private-nodes,--enable-ip-alias,--master-ipv4-cidr=172.16.0.0/28,--workload-pool=sada-gke-benchmarking2.svc.id.goog,--subnetwork=george-agentic-subnet,--enable-master-authorized-networks,--master-authorized-networks=$(curl -s ifconfig.me)/32" \
    --gke_additional_nodepool_flags="--max-pods-per-node=250" \
    --gke_enable_shielded_nodes=false \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb

Run Benchmark

python pkb.py --benchmarks=gke_python_density \
    --run_stage=prepare,run,cleanup \
    --gke_provision_mode=native \
    --gke_project_id=sada-gke-benchmarking2 \
    --gke_region=us-central1 \
    --gke_zone=us-central1-a \
    --gke_sandbox_machine_type=c4-standard-8 \
    --gke_namespace=agentic \
    --gke_sandbox_version=v0.4.6 \
    --gke_python_density=4 \
    --gke_python_density_sample_count=20 \
    --gke_python_density_sample_warmup=0 \
    --gke_python_density_patch_warmpool=true \
    --gke_python_density_exec_timeout=600 \
    --gke_machine_type=c4-standard-8 \
    --gke_gvisor=true \
    --gke_api_url=http://localhost:8080 \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb

google-cla · 2026-06-16T08:13:57Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

hubatish · 2026-06-16T16:55:30Z

+# Used with --gke_provision_mode=native
+#
+# Prerequisites (run once before PKB):
+#   python tools/agentic-benchmark/scripts/prerequisite_setup.py \


This tool isn't being included & therefore this comment doesn't need to be here.

Done, removed stale references in new commit.

hubatish · 2026-06-16T16:59:36Z

+# For sweeps (cluster pre-exists, PKB skips provision/teardown):
+#   The sweep bridge injects --run_stage=run,cleanup automatically.
+
+gke_python_density:


Internally we put a lot of this info but externally it is useful.. it's probably a good addition.

hubatish · 2026-06-16T17:00:43Z

@@ -0,0 +1,240 @@
+from google.adk.agents import LlmAgent
+from google.adk.code_executors import GkeCodeExecutor


Where is this file run? From the same machine running PKB or a different one?

This is the ADK Agent we're benchmarking against (i.e. calling its FastAPI endpoints). It gets Docker-built and deployed to GKE. PKB benchmarks target it via kubectl port-forward.
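For illustration, a client on the PKB side might build a request against the forwarded port roughly as follows. This is a sketch under assumptions: the POST method, the JSON body, and the `concurrency` field are hypothetical; only the base URL and the endpoint path come from the PR description.

```python
import json
import urllib.request


def build_density_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a request to the agent's density endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/benchmark/python/density",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the endpoint accepts POST
    )


# "concurrency" is a made-up field name for this example.
req = build_density_request("http://localhost:8080", {"concurrency": 4})
print(req.get_method(), req.full_url)

# Sending it requires an active port-forward to the agent, for example:
#   with urllib.request.urlopen(req, timeout=600) as resp:
#       print(resp.status)
```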

hubatish · 2026-06-16T17:04:24Z

 six>=1.13.0
 timeout-decorator
 scipy
+matplotlib


I don't see a reference to this elsewhere with a ctrl-f; is it leftover from an earlier version?
In general we prefer not making many changes to requirements.txt.

Stale, Removed.

hubatish · 2026-06-16T17:11:11Z

    ' beyond the default node pool (e.g. kubernetes_node_scale with 5k nodes).',
 )
+
+GKE_USE_BETA = flags.DEFINE_boolean(


If we add this flag, IMO just make it "gcloud_use_beta" (or actually an enum with values alpha, beta, or None, e.g. "gcloud_beta_version"); referencing it directly from gcp/util.py seems best.

Alternatively we often will say in the provider "if preview feature used, cmd.use_beta_gcloud = True". In general what feature are you using that needs beta?

I've removed this flag. It was for --enable-pod-snapshots on GKE, which was in beta at the time of development.

hubatish · 2026-06-16T17:22:22Z

+by all seven UC benchmark scripts.  Each benchmark's Provision() and
+Teardown() functions delegate to the public functions in this module.
+
+Infrastructure created (in order):


The very premise of this file is incorrect. PKB (and esp eg google_kubernetes_engine.py _Create) should be handling all of the provisioning logic.

I'm not sure how much of this is a) completely unnecessary because it's handled elsewhere in PKB (e.g. we set up subnets and networks automatically if you don't specify a network) or b) indeed necessary but should be located in some other Resource.py class.

+1. Let's set up the cloud infra the PKB-native way.

When this comment was made there were two provisioning approaches: 'Custom' and PKB 'Native'. I removed the 'Custom' option and the code that supported it; PKB-native is now the only mode.

hubatish · 2026-06-16T17:28:38Z

+    chromium_replicas = FLAGS.gke_chromium_replicas
+
+    manifest = """---
+apiVersion: extensions.agents.x-k8s.io/v1alpha1


should go in some .yaml.j2 file

Done, moved all inline-templates to Jinja2 templates.
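The idea of keeping the manifest in a template and substituting values at deploy time can be sketched with the standard library. Note the swap: this uses `string.Template` (`$replicas`) as a stdlib stand-in for Jinja2 (`{{ replicas }}`), and the `kind`, `metadata`, and `spec` fields are placeholders rather than the real resource definition from the PR.

```python
from string import Template

# Stand-in for the contents of a .yaml.j2 file. Only the apiVersion is
# taken from the diff; every other field is a placeholder.
MANIFEST_TEMPLATE = Template("""\
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: $kind
metadata:
  name: $name
  namespace: $namespace
spec:
  replicas: $replicas
""")


def render_manifest(kind: str, name: str, namespace: str, replicas: int) -> str:
    """Fill in the template; substitute() raises KeyError on a missing key."""
    return MANIFEST_TEMPLATE.substitute(
        kind=kind, name=name, namespace=namespace, replicas=replicas
    )


rendered = render_manifest(
    kind="ExampleKind",  # hypothetical kind
    name="example-pool",  # hypothetical name
    namespace="agentic",
    replicas=3,
)
print(rendered)
```

Separating the template from the Python code keeps the manifest reviewable as YAML and leaves only the variable values in the benchmark logic.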

hubatish · 2026-06-16T17:29:48Z

+    return _RunCmd(cmd, check=check, timeout=timeout)
+
+
+def _KubectlApply(manifest_str):


Why have you rewritten kubectl apply & _RunKubectl when implementations already exist in container_service/kubectl.py?

Done, refactored everything to use kubectl.py.

roycaihw · 2026-06-16T23:44:48Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


For easier review and faster iteration, I'd recommend keeping one benchmark in this PR and leaving the other benchmarks for follow-up PRs. My recommendation is to keep the Python density benchmark.

roycaihw · 2026-06-16T23:47:27Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


Let's drop "(Use Case B)" from the description. For the published PKB benchmarks, the documentation should clearly state what the benchmarks are about. The ordering of A,B,C... will become stale and confusing to readers.

roycaihw · 2026-06-16T23:49:19Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


Can we drop "GKE" from the file name and the description? Based on the path this is a Kubernetes benchmark, and presumably this benchmark can be reused for other cloud providers without significant change, right?

The benchmarks have GKE-specific dependencies at the moment, such as Pod Snapshots (podsnapshot.gke.io/v1 CRD) and image building with Cloud Build. Abstracting this coupling would require some research and refactoring, and possibly another PR.

roycaihw · 2026-06-16T23:54:01Z

+# ---------------------------------------------------------------------------
+
+flags.DEFINE_integer(
+    "gke_python_density",


gke_python_density
nit: Shall we name the flag something like "concurrent_sandbox_count"? gke and python can already be implied based on the file name and description of the benchmark.

Renamed to 'gke_python_density_concurrent_sandbox_count'.
Benchmark-specific flags should keep the benchmark name (for example: gke_python_density_) as a prefix, to avoid namespace collisions when multiple benchmarks are imported or executed.
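The collision concern can be demonstrated with a shared flag registry. This sketch uses `argparse` as a stdlib stand-in for PKB's absl flags, which live in one global namespace; `gke_chromium_density_sample_count` is a hypothetical flag name used only to represent a second benchmark.

```python
import argparse

# One parser shared by every benchmark module, standing in for a
# global flag namespace.
shared = argparse.ArgumentParser()

# Two benchmarks that both define an unprefixed "sample_count" collide.
shared.add_argument("--sample_count", type=int, default=20)
collision = None
try:
    shared.add_argument("--sample_count", type=int, default=10)
except argparse.ArgumentError as err:
    collision = str(err)
print("unprefixed:", collision)

# With the benchmark name as a prefix, both definitions coexist.
shared.add_argument("--gke_python_density_sample_count", type=int, default=20)
shared.add_argument("--gke_chromium_density_sample_count", type=int, default=10)

args = shared.parse_args(["--gke_python_density_sample_count=5"])
print(args.gke_python_density_sample_count, args.gke_chromium_density_sample_count)
```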

roycaihw · 2026-06-16T23:56:16Z

+flags.DEFINE_integer(
+    "gke_python_density_sample_warmup",
+    0,
+    "Number of warmup iterations per session (excluded from stats).",


It's unclear what "warmup iterations" means as it's not mentioned before. Shall we document the workflow in the benchmark description?

added a description to the docstring at the top of the file.
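A common reading of "warmup iterations", sketched below, is that the first few runs are executed but left out of the statistics so that one-time costs (cache fills, lazy imports) do not skew the results. The `measure` helper is hypothetical and is not the benchmark's actual code.

```python
import statistics
import time
from typing import Callable, Dict


def measure(task: Callable[[], object], sample_count: int, warmup: int = 0) -> Dict[str, float]:
    """Time warmup + sample_count runs; report stats on the last sample_count."""
    latencies = []
    for _ in range(warmup + sample_count):
        start = time.perf_counter()
        task()
        latencies.append(time.perf_counter() - start)
    measured = latencies[warmup:]  # warmup runs are excluded from stats
    return {
        "count": len(measured),
        "mean_s": statistics.mean(measured),
        "max_s": max(measured),
    }


calls = []
stats = measure(lambda: calls.append(1), sample_count=20, warmup=3)
print(len(calls), stats["count"])
```

With `warmup=0`, as in the default quoted in the diff, every iteration counts toward the reported statistics.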

roycaihw · 2026-06-17T00:00:34Z

+by all seven UC benchmark scripts.  Each benchmark's Provision() and
+Teardown() functions delegate to the public functions in this module.
+
+Infrastructure created (in order):


+1. Let's set up the cloud infra the PKB-native way.

roycaihw · 2026-06-17T00:05:19Z

+# ---------------------------------------------------------------------------
+
+
+def _emit(samples, agg, agg_key, metric_suffix, unit, namespace, extra):


Can you document how the metrics emit works and what the parameters are?
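As an example of the kind of documentation being asked for, here is one plausible shape for such a helper. This is a guess and not the PR's `_emit`: the parameter meanings are inferred from their names, the `python_density_` metric prefix is invented, and `Sample` is a local stand-in for PKB's sample type.

```python
from typing import Any, Dict, List, NamedTuple


class Sample(NamedTuple):
    """Local stand-in for a PKB sample: metric name, value, unit, metadata."""

    metric: str
    value: float
    unit: str
    metadata: Dict[str, Any]


def emit(
    samples: List[Sample],
    agg: Dict[str, float],
    agg_key: str,
    metric_suffix: str,
    unit: str,
    namespace: str,
    extra: Dict[str, Any],
) -> None:
    """Append one sample for agg[agg_key], if that key is present.

    Args:
      samples: Output list that the new sample is appended to.
      agg: Aggregated statistics, e.g. {"p50": 0.12, "p99": 0.40}.
      agg_key: Which statistic in agg to publish.
      metric_suffix: Suffix used to build the metric name.
      unit: Unit string for the value, e.g. "seconds".
      namespace: Kubernetes namespace, recorded in the metadata.
      extra: Additional metadata merged into the sample.
    """
    value = agg.get(agg_key)
    if value is None:
        return  # nothing measured for this key; emit no sample
    metadata = {"namespace": namespace, **extra}
    # The "python_density_" prefix is a placeholder for this sketch.
    samples.append(Sample(f"python_density_{metric_suffix}", float(value), unit, metadata))


samples: List[Sample] = []
agg = {"p50": 0.12, "p99": 0.40}
emit(samples, agg, "p99", "exec_latency_p99", "seconds", "agentic", {"concurrency": 4})
emit(samples, agg, "p999", "exec_latency_p999", "seconds", "agentic", {})
print(samples)
```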

roycaihw · 2026-06-17T16:01:21Z

cc @yuanwang04 @oceanxie1

hubatish · 2026-06-18T15:57:12Z

+# ---------------------------------------------------------------------------
+
+
+def _BuildADKAgentImage(


Probably this whole gke_image_build_utils file is not needed. See GoogleArtifactRegistry in google_kubernetes_engine.py & kubernetes_hpa, with

    container_specs:
      kubernetes_fib:
        image: fibonacci

in the config + data/docker/fibonacci for the Docker image.

We're building 3 images: the ADK image, whose codebase is in the repo, and two other images (Sandbox Router and Chrome Sandbox) whose codebase is not in the repo. They live in https://github.com/kubernetes-sigs/agent-sandbox and are not built or published publicly, so they need to be built per use.

I can try to move the ADK image to PKB-native, but the other 2 will still need to be in a "prereq" Python script, because I do not want to import their static code into PKB.

add agentic benchmarking on gke

f614265

hubatish reviewed Jun 16, 2026

roycaihw reviewed Jun 17, 2026

attend to comments, fixes, and improvements

0338d09

hubatish reviewed Jun 18, 2026
