add agentic benchmarking on gke #6772
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
| # Used with --gke_provision_mode=native | ||
| # | ||
| # Prerequisites (run once before PKB): | ||
| # python tools/agentic-benchmark/scripts/prerequisite_setup.py \ |
This tool isn't being included & therefore this comment doesn't need to be here.
Done, removed stale references in new commit.
| # For sweeps (cluster pre-exists, PKB skips provision/teardown): | ||
| # The sweep bridge injects --run_stage=run,cleanup automatically. | ||
|
|
||
| gke_python_density: |
Internally we put a lot of this info but externally it is useful.. it's probably a good addition.
| @@ -0,0 +1,240 @@ | |||
| from google.adk.agents import LlmAgent | |||
| from google.adk.code_executors import GkeCodeExecutor | |||
Where is this file run? From the same machine running PKB or a different one?
This is the ADK agent we're benchmarking against (i.e., by calling its FastAPI endpoints). It gets Docker-built and deployed to GKE. PKB benchmarks target it via kubectl port-forward.
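As a rough illustration of how a benchmark process might reach the agent through a port-forward, the sketch below only builds the HTTP request and does not send it. The endpoint path and the `http://localhost:8080` URL come from the PR description; the POST method, the `build_density_request` helper, and the payload field are assumptions made for this example, not the agent's confirmed API.

```python
import json
import urllib.request


def build_density_request(api_url: str, payload: dict) -> urllib.request.Request:
    """Builds (but does not send) a JSON request to the density endpoint."""
    url = api_url.rstrip("/") + "/benchmark/python/density"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # Assumption: the endpoint accepts POST with a JSON body.
    )


# Hypothetical payload; the real request schema is defined by the agent.
request = build_density_request("http://localhost:8080", {"sandbox_count": 4})
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` while the port-forward is active.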
| six>=1.13.0 | ||
| timeout-decorator | ||
| scipy | ||
| matplotlib |
I don't see a reference to this elsewhere with a ctrl-f; is it leftover from an earlier version?
In general we prefer not making many changes to requirements.txt.
| ' beyond the default node pool (e.g. kubernetes_node_scale with 5k nodes).', | ||
| ) | ||
|
|
||
| GKE_USE_BETA = flags.DEFINE_boolean( |
If we add this flag, IMO just make it "gcloud_use_beta" (or, better, an enum "gcloud_beta_version" with values alpha, beta, and None) and reference it directly from gcp/util.py.
Alternatively, the provider often says "if a preview feature is used, cmd.use_beta_gcloud = True". In general, which feature are you using that needs beta?
I have removed this flag. It was for --enable-pod-snapshots on GKE, which was in beta at the time of development.
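For readers following the enum suggestion, here is a minimal sketch of selecting a gcloud release track with an enum instead of a boolean. `GcloudTrack` and `build_gcloud_command` are hypothetical names for this example and are not part of PKB; only the `gcloud beta container clusters create` command form comes from the PR description.

```python
import enum


class GcloudTrack(enum.Enum):
    """Release tracks of the gcloud CLI (hypothetical helper type)."""

    GA = ""
    BETA = "beta"
    ALPHA = "alpha"


def build_gcloud_command(args, track=GcloudTrack.GA):
    """Returns the argv list for a gcloud call on the requested track."""
    cmd = ["gcloud"]
    if track is not GcloudTrack.GA:
        cmd.append(track.value)  # e.g. "gcloud beta ..."
    return cmd + list(args)


cluster_cmd = build_gcloud_command(
    ["container", "clusters", "create", "demo-cluster"], GcloudTrack.BETA
)
print(" ".join(cluster_cmd))
```

An enum keeps the three states mutually exclusive, which a pair of booleans would not.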
| by all seven UC benchmark scripts. Each benchmark's Provision() and | ||
| Teardown() functions delegate to the public functions in this module. | ||
|
|
||
| Infrastructure created (in order): |
The very premise of this file is incorrect. PKB (especially, for example, _Create in google_kubernetes_engine.py) should be handling all of the provisioning logic.
I'm not sure how much of this is (a) completely unnecessary because it's handled elsewhere in PKB (for example, we set up subnets and networks automatically if you don't specify a network) or (b) indeed necessary but should be located in some other Resource.py class.
+1. Let's set up the cloud infra using PKB-native way.
There were 2 approaches for Provisioning when this comment was made; 'Custom', and PKB 'Native'. I removed the 'Custom' option and any unnecessary code related to Custom; PKB-Native is the only way now.
| chromium_replicas = FLAGS.gke_chromium_replicas | ||
|
|
||
| manifest = """--- | ||
| apiVersion: extensions.agents.x-k8s.io/v1alpha1 |
should go in some .yaml.j2 file
Done, moved all inline-templates to Jinja2 templates.
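To illustrate the idea of keeping manifests in template files rather than inline strings, here is a small self-contained sketch. It uses the standard library's `string.Template` as a stand-in for Jinja2 so that it runs without extra dependencies; the `render_manifest` helper, the file name, and the trimmed manifest fragment are all hypothetical and are not taken from the PR.

```python
import pathlib
import string
import tempfile


def render_manifest(template_path, **values):
    """Reads a template file and fills in its $placeholders.

    substitute() raises KeyError when a placeholder has no value, so a
    missing setting fails loudly instead of producing a broken manifest.
    """
    text = pathlib.Path(template_path).read_text()
    return string.Template(text).substitute(**values)


with tempfile.TemporaryDirectory() as tmp:
    # In a real repo this file would be checked in next to the benchmark.
    template_path = pathlib.Path(tmp) / "deployment.yaml.tmpl"
    template_path.write_text(
        "apiVersion: apps/v1\n"
        "kind: Deployment\n"
        "metadata:\n"
        "  name: $name\n"
        "spec:\n"
        "  replicas: $replicas\n"  # Fragment only, not a complete Deployment.
    )
    rendered = render_manifest(template_path, name="chromium-sandbox", replicas=3)

print(rendered)
```

With Jinja2 the same split applies: the template lives in a `.yaml.j2` file and the benchmark only passes values in.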
| return _RunCmd(cmd, check=check, timeout=timeout) | ||
|
|
||
|
|
||
| def _KubectlApply(manifest_str): |
Why have you rewritten kubectl apply and _RunKubectl when implementations already exist in container_service/kubectl.py?
Done, refactored everything to use kubectl.py.
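The point of routing every call through one shared helper can be sketched as follows. This is not the API of PKB's kubectl.py; `kubectl_apply` and its `runner` parameter are hypothetical names, and the example swaps in a fake runner so it executes without a cluster.

```python
import subprocess


def kubectl_apply(manifest, namespace=None, runner=subprocess.run):
    """Applies a manifest string by piping it to `kubectl apply -f -`."""
    cmd = ["kubectl", "apply", "-f", "-"]
    if namespace:
        cmd += ["--namespace", namespace]
    # check=True turns a non-zero kubectl exit code into an exception.
    return runner(cmd, input=manifest, text=True, capture_output=True, check=True)


# Dry run with a fake runner that records the call instead of executing it.
calls = []


def fake_runner(cmd, **kwargs):
    calls.append((cmd, kwargs["input"]))


kubectl_apply(
    "apiVersion: v1\nkind: Namespace\nmetadata:\n  name: agentic\n",
    runner=fake_runner,
)
print(calls[0][0])
```

Keeping a single entry point like this means timeouts, retries, and logging only need to be handled in one place.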
| @@ -0,0 +1,362 @@ | |||
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B). | |||
For easier review and faster iteration, I'd recommend keeping one benchmark in this PR and leaving the other benchmarks for follow-up PRs. My recommendation is to keep the Python density benchmark.
| @@ -0,0 +1,362 @@ | |||
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B). | |||
Let's drop "(Use Case B)" from the description. For the published PKB benchmarks, the documentation should clearly state what the benchmarks are about. The ordering of A,B,C... will become stale and confusing to readers.
| @@ -0,0 +1,362 @@ | |||
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B). | |||
Can we drop "GKE" from the file name and the description? Based on the path, this is a Kubernetes benchmark, and presumably it can be reused for other cloud providers without significant change, right?
The benchmarks have GKE-specific dependencies at the moment, such as Pod Snapshots (the podsnapshot.gke.io/v1 CRD) and image building with Cloud Build. Abstracting away this coupling would require some research and refactoring, and possibly another PR.
| # --------------------------------------------------------------------------- | ||
|
|
||
| flags.DEFINE_integer( | ||
| "gke_python_density", |
nit: Shall we rename the `gke_python_density` flag to something like "concurrent_sandbox_count"? `gke` and `python` are already implied by the file name and the description of the benchmark.
renamed to 'gke_python_density_concurrent_sandbox_count'.
Benchmark-/usecase- specific flags should maintain the benchmark name (for example: gke_python_density_) as a prefix in order not to have any potential 'namespace collisions' when multiple benchmarks are imported or executed.
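The collision concern can be demonstrated with a small sketch. It uses `argparse` from the standard library as a stand-in for PKB's flag system, so the exact error type differs from what PKB would raise; `register_density_flags` and the chromium flag name are hypothetical and exist only for this example.

```python
import argparse


def register_density_flags(parser, prefix):
    """Registers one benchmark's flags under its own name prefix."""
    parser.add_argument(f"--{prefix}concurrent_sandbox_count", type=int, default=1)


parser = argparse.ArgumentParser()
register_density_flags(parser, "gke_python_density_")
register_density_flags(parser, "gke_chromium_density_")  # Distinct prefix: no clash.

collision = None
try:
    # Registering the same flag name a second time is rejected.
    register_density_flags(parser, "gke_python_density_")
except argparse.ArgumentError as err:
    collision = err

args = parser.parse_args(["--gke_python_density_concurrent_sandbox_count", "4"])
print(args.gke_python_density_concurrent_sandbox_count, collision is not None)
```

With an unprefixed name such as `concurrent_sandbox_count`, the second benchmark's registration would hit the same conflict.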
| flags.DEFINE_integer( | ||
| "gke_python_density_sample_warmup", | ||
| 0, | ||
| "Number of warmup iterations per session (excluded from stats).", |
It's unclear what "warmup iterations" means as it's not mentioned before. Shall we document the workflow in the benchmark description?
added a description to the docstring at the top of the file.
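For context, "warmup iterations excluded from stats" usually means that the first few measurements of each session are discarded before aggregating. A minimal sketch of that idea follows; `summarize_latencies` and the sample values are made up for illustration and are not the benchmark's actual code or results.

```python
import statistics


def summarize_latencies(latencies_ms, warmup):
    """Drops the first `warmup` samples, then aggregates the rest."""
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    measured = list(latencies_ms)[warmup:]
    if not measured:
        raise ValueError("no samples left after removing warmup iterations")
    return {
        "count": len(measured),
        "mean_ms": statistics.fmean(measured),
        "median_ms": statistics.median(measured),
        "max_ms": max(measured),
    }


# Illustrative values: the first two iterations pay one-time startup costs.
summary = summarize_latencies([950.0, 400.0, 120.0, 110.0, 130.0], warmup=2)
print(summary)
```

With `warmup=0`, as in the flag's default, every iteration counts toward the statistics.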
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def _emit(samples, agg, agg_key, metric_suffix, unit, namespace, extra): |
Can you document how the metric emission works and what the parameters are?
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def _BuildADKAgentImage( |
This whole gke_image_build_utils file is probably not needed. See GoogleArtifactRegistry in google_kubernetes_engine.py and kubernetes_hpa, with
    container_specs:
      kubernetes_fib:
        image: fibonacci
in the config, plus data/docker/fibonacci for the Docker image.
We're building 3 images:
the ADK image, whose codebase is in the repo, and two other images (Sandbox Router and Chromium Sandbox) whose codebase is not in the repo. They live in https://github.com/kubernetes-sigs/agent-sandbox and are not built or published publicly, so they need to be built per use.
I can try to move the ADK image to PKB-native, but the other two will still need to be handled in a "prereq" Python script, because I do not want to import their static code into PKB.
# Agentic Workload Benchmarking for GKE (PKB Extension)

## Summary

Adds a complete benchmarking framework for Agentic Workloads on Google Kubernetes Engine (GKE), specifically measuring per-operation performance of untrusted Python code execution and headless Chromium browser tasks running under gVisor (GKE Agent Sandbox) isolation.

## Motivation

AI agent systems require ephemeral, isolated execution environments (sandboxes) for running untrusted code. Understanding the performance characteristics of these sandboxes under gVisor (including cold-start latency, execution overhead, memory density limits, and scheduling throughput) is critical for production capacity planning.

This framework enables systematic, repeatable measurement of these characteristics across multiple GCP machine families.

## Architecture

### Benchmark Definitions (7 Use Cases)

- `gke_snapshot`
- `gke_python_density`
- `gke_chromium_density`
- `gke_payload`
- `gke_warmpool`
- `gke_qps`
- `gke_deletion`

### Shared Utilities

- `gke_benchmark_utils.py`
- `gke_deploy_utils.py`
- `gke_provision_utils.py`
- `gke_image_build_utils.py`
- `gke_prerequisite_setup.py`

### Dual Provisioning Modes

- `custom` mode: Direct `gcloud` calls for full infrastructure control
- `native` mode: Uses PKB's built-in `container_cluster` provisioner with prerequisite script for resources PKB cannot manage

### PKB Provider Extensions

Small additions to support GKE preview features:

- `--gke_use_beta` flag (forces `gcloud beta container clusters create`)
- `--gke_additional_flags` list (appended to cluster create)
- `--gke_additional_nodepool_flags` list (appended to node pool create)

## In-Cluster Components

### ADK Agent (`workloads/adk_agent/`)

A FastAPI service deployed inside GKE that:

- Exposes benchmark endpoints (`/benchmark/python/density`, `/benchmark/python/payload`, `/benchmark/python/qps`, `/benchmark/chromium/density`)
- Uses `DirectConnection` (in-cluster) or `kubectl port-forward` (dev mode) for connectivity

### Sandbox Scripts (`sandboxed_apps/`)

- `benchmark_density.py`: CPU-bound, syscall-heavy, and import-heavy tasks with RSS tracking
- `benchmark_payload.py`: Payload generation, serialization, and stdout transfer measurement
- `benchmark_qps.py`: Minimal script proving sandbox liveness
- `benchmark_density.js`: Playwright-driven Chromium interaction benchmark

### Vibe Coding Workloads (`workloads/vibe_coding/`)

Startup scripts simulating real-world agentic cold-starts:

- `startup_pip_fastapi.sh`: pip install + FastAPI server boot
- `startup_npm_vite.sh`: npm install + Vite dev server boot

## Usage

### Prerequisites (once per environment)

```bash
python -m perfkitbenchmarker.linux_benchmarks.kubernetes.agentic.gke_prerequisite_setup \
  --project_id=sada-gke-benchmarking2 \
  --region=us-central1 \
  --zone=us-central1-a \
  --machine_type=c4-standard-8
```

### Provision Cluster

```bash
python pkb.py --benchmarks=gke_python_density \
  --run_stage=provision \
  --gke_provision_mode=native \
  --project=sada-gke-benchmarking2 \
  --owner=george-kalisse \
  --benchmark_config_file=k8s_agents/config/native_provision_config.yaml \
  --gce_network_name=george-agentic-vpc \
  --gce_subnet_region=us-central1 \
  --zone=us-central1-a \
  --container_cluster_version=1.35.3-gke.1389000 \
  --gke_use_beta=true \
  --gke_additional_flags="--enable-pod-snapshots,--enable-dataplane-v2,--enable-private-nodes,--enable-ip-alias,--master-ipv4-cidr=172.16.0.0/28,--workload-pool=sada-gke-benchmarking2.svc.id.goog,--subnetwork=george-agentic-subnet,--enable-master-authorized-networks,--master-authorized-networks=$(curl -s ifconfig.me)/32" \
  --gke_additional_nodepool_flags="--max-pods-per-node=250" \
  --gke_enable_shielded_nodes=false \
  --run_uri=test \
  --temp_dir=./testing/pkb/c4-standard-8/ucb
```

### Run Benchmark

```bash
python pkb.py --benchmarks=gke_python_density \
  --run_stage=prepare,run,cleanup \
  --gke_provision_mode=native \
  --gke_project_id=sada-gke-benchmarking2 \
  --gke_region=us-central1 \
  --gke_zone=us-central1-a \
  --gke_sandbox_machine_type=c4-standard-8 \
  --gke_namespace=agentic \
  --gke_sandbox_version=v0.4.6 \
  --gke_python_density=4 \
  --gke_python_density_sample_count=20 \
  --gke_python_density_sample_warmup=0 \
  --gke_python_density_patch_warmpool=true \
  --gke_python_density_exec_timeout=600 \
  --gke_machine_type=c4-standard-8 \
  --gke_gvisor=true \
  --gke_api_url=http://localhost:8080 \
  --run_uri=test \
  --temp_dir=./testing/pkb/c4-standard-8/ucb
```
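To give a feel for what a sandbox script with RSS tracking, as listed under Sandbox Scripts, might measure, here is a minimal sketch of a CPU-bound task that reports its elapsed time and peak resident set size. It is not the PR's `benchmark_density.py`; `run_cpu_task` and its report fields are hypothetical, and it assumes a Unix-like system where the `resource` module is available.

```python
import resource
import time


def run_cpu_task(iterations):
    """Runs a CPU-bound loop and reports timing plus peak RSS."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    elapsed_s = time.perf_counter() - start
    # ru_maxrss is the process's peak RSS; its unit is platform-dependent
    # (kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {"result": total, "elapsed_s": elapsed_s, "peak_rss": peak_rss}


report = run_cpu_task(1000)
print(report)
```

A density benchmark would run many such processes side by side and watch how these numbers degrade as the sandbox count per node grows.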