
[V2] Add GPU support for sandbox via FLYTE_SANDBOX_GPU env var #7177

Open
pingsutw wants to merge 8 commits into v2 from worktree-dgx


@pingsutw pingsutw commented Apr 9, 2026

Summary

  • Add opt-in NVIDIA GPU support for the Flyte sandbox, targeting machines like DGX Spark
  • When FLYTE_SANDBOX_GPU=true is set, the sandbox container is started with --gpus all, and its entrypoint configures K3s with the NVIDIA container runtime and device plugin
  • No impact on non-GPU machines: GPU support is disabled by default

Changes

  • docker/sandbox-bundled/Makefile: Add FLYTE_SANDBOX_GPU env var that conditionally passes --gpus all to Docker
  • docker/sandbox-bundled/bin/k3d-entrypoint-gpu.sh (new): Entrypoint script that, when GPU is enabled:
    • Configures K3s containerd to use nvidia-container-runtime as the default runtime
    • Deploys the NVIDIA k8s-device-plugin DaemonSet so that the node advertises the nvidia.com/gpu resource
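
The opt-in behavior described above can be sketched in shell. This is a minimal illustration, not the Makefile logic from this PR: the function name `sandbox_gpu_flags` is hypothetical, and treating only the literal value `true` as enabled is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the extra `docker run` flags for the sandbox.
# Assumption: only the literal value "true" enables GPU passthrough;
# the PR does not state how other values are handled.
sandbox_gpu_flags() {
  if [ "${1:-}" = "true" ]; then
    printf '%s\n' '--gpus all'
  fi
}

# Derive the flags from the environment variable named in the PR.
GPU_FLAGS=$(sandbox_gpu_flags "${FLYTE_SANDBOX_GPU:-}")
echo "extra docker flags: ${GPU_FLAGS:-<none>}"
```

Keeping the default empty means a plain `make start` adds no GPU flags, which matches the stated goal of leaving non-GPU machines unaffected.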

Usage

# On GPU machine (e.g., DGX Spark)
FLYTE_SANDBOX_GPU=true make start

# On non-GPU machine (default)
make start

Test plan

  • Verify make start still works without FLYTE_SANDBOX_GPU (no regression on non-GPU machines)
  • Verify FLYTE_SANDBOX_GPU=true make start on a machine with an NVIDIA GPU and the NVIDIA Container Toolkit installed
  • Verify nvidia.com/gpu resources are advertised: kubectl describe node | grep nvidia
  • Verify a pod requesting GPU can be scheduled: kubectl run gpu-test --image=nvidia/cuda:12.0-base --restart=Never -- nvidia-smi
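
The kubectl run command in the test plan does not itself set an nvidia.com/gpu limit, so a stricter scheduling check might use a Pod manifest that requests one GPU explicitly. The sketch below is only an illustration: the helper name `gpu_test_pod_manifest` is hypothetical, and the default image tag is an assumption that should be replaced with a CUDA image known to exist for the target architecture.

```shell
#!/bin/sh
# Hypothetical helper: print a Pod manifest that requests one GPU.
# The pod name matches the test plan; the default image tag is an assumption.
gpu_test_pod_manifest() {
  image="${1:-nvidia/cuda:12.0.0-base-ubuntu22.04}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Print the manifest for inspection.
gpu_test_pod_manifest
```

To run the check against the sandbox cluster, the output can be piped into `kubectl apply -f -` and the result read with `kubectl logs gpu-test`. If the device plugin is not advertising nvidia.com/gpu, the pod should stay Pending rather than run.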

Enable NVIDIA GPU passthrough in the sandbox container by setting
FLYTE_SANDBOX_GPU=true. This is useful for running GPU workloads
on machines like DGX Spark.

When enabled:
- Passes --gpus all to the Docker container
- Configures K3s containerd to use the NVIDIA container runtime
- Deploys the NVIDIA k8s-device-plugin DaemonSet for nvidia.com/gpu
  resource advertisement

Usage: FLYTE_SANDBOX_GPU=true make start
Signed-off-by: Kevin Su <pingsutw@apache.org>
@github-actions github-actions bot mentioned this pull request Apr 9, 2026
3 tasks
pingsutw added 7 commits April 8, 2026 17:30
Add Dockerfile.gpu that extends the base sandbox image with NVIDIA
Container Toolkit binaries (runtime, CLI, ctk) copied from the
official NVIDIA container toolkit image.

Add build-and-push-sandbox-bundled-gpu-image job to the CI workflow
that builds and pushes ghcr.io/flyteorg/flyte-sandbox-v2-gpu.
The GPU image is amd64-only since NVIDIA DGX targets x86_64.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
The base sandbox image is only pushed on push/workflow_dispatch events,
so the sha-based tag doesn't exist during PR builds. Use the nightly
tag as fallback for PRs, and only push the GPU image on push/dispatch.

Signed-off-by: Kevin Su <pingsutw@apache.org>
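
The tag fallback described in this commit can be sketched as two small shell functions. This is an illustration under stated assumptions rather than the workflow's actual steps: the function names are hypothetical, the event names follow GitHub Actions conventions, and the `sha-<sha>` tag format is assumed.

```shell
#!/bin/sh
# Hypothetical helper: choose the base sandbox image tag for the GPU build.
# Assumption: sha-based tags look like "sha-<sha>".
base_image_tag() {
  event="$1"
  sha="$2"
  case "$event" in
    # The base image is pushed on these events, so its sha tag exists.
    push|workflow_dispatch) printf 'sha-%s\n' "$sha" ;;
    # PR builds fall back to the nightly tag.
    *) printf 'nightly\n' ;;
  esac
}

# Hypothetical helper: succeed only for events that should push the GPU image.
should_push_gpu_image() {
  case "$1" in
    push|workflow_dispatch) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: a pull request build resolves to the nightly tag.
base_image_tag pull_request abc1234
```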
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Resolve conflicts from docker/sandbox-bundled -> docker/demo-bundled rename.
Move GPU Dockerfile and entrypoint into docker/demo-bundled, and update
the flyte-binary-v2 workflow GPU job to reference the renamed path and
depend on build-and-push-demo-bundled-image.

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw self-assigned this Apr 10, 2026