[V2] Add GPU support for sandbox via FLYTE_SANDBOX_GPU env var#7177
Open
Enable NVIDIA GPU passthrough in the sandbox container by setting FLYTE_SANDBOX_GPU=true. This is useful for running GPU workloads on machines like DGX Spark.

When enabled:
- Passes --gpus all to the Docker container
- Configures K3s containerd to use the NVIDIA container runtime
- Deploys the NVIDIA k8s-device-plugin DaemonSet for nvidia.com/gpu resource advertisement

Usage: FLYTE_SANDBOX_GPU=true make start

Signed-off-by: Kevin Su <pingsutw@apache.org>
Add Dockerfile.gpu that extends the base sandbox image with NVIDIA Container Toolkit binaries (runtime, CLI, ctk) copied from the official NVIDIA container toolkit image. Add build-and-push-sandbox-bundled-gpu-image job to the CI workflow that builds and pushes ghcr.io/flyteorg/flyte-sandbox-v2-gpu. The GPU image is amd64-only since NVIDIA DGX targets x86_64. Signed-off-by: Kevin Su <pingsutw@apache.org>
The base sandbox image is only pushed on push/workflow_dispatch events, so the sha-based tag doesn't exist during PR builds. Use the nightly tag as fallback for PRs, and only push the GPU image on push/dispatch. Signed-off-by: Kevin Su <pingsutw@apache.org>
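The tag fallback described in this commit can be sketched in shell as below. This is only an illustration, not the workflow's actual expressions: the helper names `base_image_tag` and `should_push_gpu_image` are hypothetical, and the `sha-<commit>` tag format is an assumption about how the sha-based tag is spelled.

```shell
#!/bin/sh
# Hypothetical helpers sketching the CI decision described above.
# Assumption: the sha-based tag looks like "sha-<commit>"; the real format may differ.

# Pick the base sandbox image tag for a given GitHub event name and commit sha.
base_image_tag() {
  event="$1"
  sha="$2"
  case "$event" in
    # The base image is only pushed on these events, so its sha tag exists.
    push|workflow_dispatch) printf 'sha-%s\n' "$sha" ;;
    # PR builds (and anything else) fall back to the nightly tag.
    *) printf 'nightly\n' ;;
  esac
}

# Succeed (exit 0) only when the GPU image should be pushed.
should_push_gpu_image() {
  case "$1" in
    push|workflow_dispatch) return 0 ;;
    *) return 1 ;;
  esac
}
```

Keeping the two decisions in separate helpers mirrors the commit message: one governs which base tag to build from, the other whether the result is pushed.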
Resolve conflicts from docker/sandbox-bundled -> docker/demo-bundled rename. Move GPU Dockerfile and entrypoint into docker/demo-bundled, and update the flyte-binary-v2 workflow GPU job to reference the renamed path and depend on build-and-push-demo-bundled-image. Signed-off-by: Kevin Su <pingsutw@apache.org>
Summary
When FLYTE_SANDBOX_GPU=true is set, the sandbox container gets --gpus all and auto-configures K3s with the NVIDIA container runtime and device plugin.

Changes
- docker/sandbox-bundled/Makefile: Add FLYTE_SANDBOX_GPU env var that conditionally passes --gpus all to Docker
- docker/sandbox-bundled/bin/k3d-entrypoint-gpu.sh (new): Entrypoint script that, when GPU is enabled:
  - Configures K3s containerd to use nvidia-container-runtime as default runtime
  - Deploys the NVIDIA k8s-device-plugin DaemonSet for nvidia.com/gpu resource advertisement

Usage
FLYTE_SANDBOX_GPU=true make start
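The conditional flag handling in the Makefile change could look roughly like the shell sketch below. It is not the PR's code: `gpu_docker_flags` is a hypothetical helper, and treating only the literal value `true` as enabled is an assumption based on the usage line.

```shell
#!/bin/sh
# Hypothetical helper: emit extra docker flags based on FLYTE_SANDBOX_GPU.
# Assumption: only the exact value "true" turns GPU passthrough on.
gpu_docker_flags() {
  if [ "${FLYTE_SANDBOX_GPU:-}" = "true" ]; then
    # --gpus all exposes every NVIDIA GPU on the host to the container.
    printf '%s\n' "--gpus all"
  fi
  # When the variable is unset or has any other value, print nothing,
  # so the non-GPU invocation is unchanged.
}
```

A caller would splice the output into its existing `docker run` invocation; leaving the output empty in the default case is what keeps non-GPU machines unaffected.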
Test plan
- make start still works without FLYTE_SANDBOX_GPU (no regression on non-GPU machines)
- FLYTE_SANDBOX_GPU=true make start on a machine with NVIDIA GPU + Container Toolkit
- Verify nvidia.com/gpu resources are advertised: kubectl describe node | grep nvidia
- Run a GPU test pod: kubectl run gpu-test --image=nvidia/cuda:12.0-base --restart=Never -- nvidia-smi
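The resource-advertisement check in the test plan could be scripted along these lines. This is a sketch under assumptions: `gpu_capacity` is a hypothetical helper, and it assumes `kubectl describe node` prints the resource as a `nvidia.com/gpu:` key followed by a count, as it does for other node capacity entries.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe node` output on stdin and
# print the first advertised nvidia.com/gpu count, or nothing if absent.
gpu_capacity() {
  # Match lines whose first field is exactly "nvidia.com/gpu:" and print
  # the value that follows; stop at the first match.
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Intended use (not executed here, since it needs a running cluster):
#   kubectl describe node | gpu_capacity
```

An empty result would suggest the device plugin has not registered the resource yet, which is a cheaper signal than waiting for the `nvidia-smi` pod to schedule.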