diff --git a/docs/user-guide/auto-mnnvl.md b/docs/user-guide/auto-mnnvl.md
index 01922dee1..cf7dbebe7 100644
--- a/docs/user-guide/auto-mnnvl.md
+++ b/docs/user-guide/auto-mnnvl.md
@@ -1,32 +1,34 @@
 # Auto MNNVL (Multi-Node NVLink)
 
-Grove can automatically enable Multi-Node NVLink (MNNVL) acceleration without requiring any manual configuration. This guide explains how to enable, use, and manage the auto MNNVL feature.
+Grove can automatically manage Multi-Node NVLink (MNNVL) acceleration for your GPU workloads. This guide explains how to enable, configure, and manage the auto MNNVL feature using the `grove.io/mnnvl-group` annotation.
 
 ## Overview
 
-**MNNVL (Multi-Node NVLink)** is an NVIDIA technology that extends high-bandwidth, low-latency NVLink GPU-to-GPU communication across multiple physical nodes.
+**MNNVL (Multi-Node NVLink)** is an NVIDIA technology that extends high-bandwidth, low-latency NVLink GPU-to-GPU communication across multiple physical nodes. In Kubernetes, MNNVL is exposed through NVIDIA's Dynamic Resource Allocation (DRA) driver via a custom resource called **ComputeDomain**. A ComputeDomain represents a logical GPU fabric spanning multiple nodes.
 
-Without the auto MNNVL feature, users must manually create ComputeDomain resources and wire up `resourceClaims` in their pod specs. With auto MNNVL enabled, Grove handles this automatically:
+Without the auto MNNVL feature, you must manually create ComputeDomain resources and wire up `resourceClaims` in your pod specs (a sketch of this manual wiring follows the table below). With auto MNNVL enabled, Grove handles this automatically:
 
-When auto MNNVL is enabled, Grove will:
-- Detect GPU containers in your PodCliqueSet
-- Create one ComputeDomain per PCS replica
-- Manage the full ComputeDomain lifecycle (creation, scaling, deletion)
+- Detects GPU containers in your PodCliqueSet (PCS)
+- Creates one ComputeDomain per MNNVL group per PCS replica
+- Injects resource claim references into enrolled PodClique (PCLQ) pod specs
+- Manages the full ComputeDomain lifecycle (creation, scaling, deletion)
 
-| Mode | Description | Best For |
-|------|-------------|----------|
-| **Auto** (default when enabled) | Grove automatically creates ComputeDomains and injects resource claims for GPU workloads | Most GPU workloads on MNNVL-capable clusters |
-| **Opt-out** | User explicitly disables MNNVL per PodCliqueSet via annotation | Custom ComputeDomain configurations, non-MNNVL GPU workloads |
+You control MNNVL participation with a single annotation, `grove.io/mnnvl-group`, which can be placed at any level of the PCS hierarchy:
+
+| Annotation value | Meaning |
+|---|---|
+| A group name (e.g., `"my-group"`, `"workers"`) | Opt in to MNNVL. PodCliques with the same group name share a ComputeDomain per replica. |
+| `"none"` | Explicit opt-out. Overrides a parent layer's group assignment. |
+| Absent | Inherit from the parent layer. If no parent sets it, no MNNVL. |
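+
+For reference, the manual setup that auto MNNVL replaces looks roughly like the sketch below. This is illustrative only: the exact ComputeDomain fields depend on your NVIDIA DRA driver version, and the names (`my-compute-domain`, `my-cd-claims`) are placeholders.
+
+```yaml
+# A ComputeDomain, created by hand. The DRA driver generates a
+# ResourceClaimTemplate with the name given under channel.
+apiVersion: resource.nvidia.com/v1beta1
+kind: ComputeDomain
+metadata:
+  name: my-compute-domain
+spec:
+  numNodes: 2                # nodes the GPU fabric spans
+  channel:
+    resourceClaimTemplate:
+      name: my-cd-claims
+---
+# Every GPU pod must then reference the generated template by hand.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-worker
+spec:
+  resourceClaims:
+  - name: cd-channel
+    resourceClaimTemplateName: my-cd-claims
+  containers:
+  - name: model
+    image: my-model:latest
+    resources:
+      limits:
+        nvidia.com/gpu: "8"
+      claims:
+      - name: cd-channel
+```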
 
 ## Prerequisites and Constraints
 
 Before enabling auto MNNVL, ensure your cluster meets the following requirements:
 
-1. **NVIDIA GPUs with MNNVL support** on all nodes that will run MNNVL workloads
+1. **NVIDIA GPUs with MNNVL support** on the nodes that will run MNNVL workloads.
 2. **NVIDIA DRA driver** installed, which provides the ComputeDomain CRD (`computedomains.resource.nvidia.com`). See the [NVIDIA DRA driver installation guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-dra.html) for setup instructions.
-3. **Homogeneous GPU cluster** -- all nodes must have identical GPU types and NVLink topology. Grove does not validate or enforce cluster homogeneity; it is the cluster administrator's responsibility to ensure this requirement is met. Enabling MNNVL on heterogeneous clusters may result in undefined scheduling behavior.
-4. **Grove operator** deployed via Helm
+3. **Grove operator** deployed via Helm.
 
 ## Enabling the Feature
 
@@ -47,38 +49,51 @@ helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts --version
 
-**Note:** The `grove.io/auto-mnnvl` annotation is **immutable** after PCS creation. Any attempt to add, modify, or remove it on an existing PCS will be rejected. To change MNNVL behavior, delete the PCS and recreate it.
+> **Note:** The `grove.io/mnnvl-group` annotation is **immutable** after PCS creation. Any attempt to add, modify, or remove it on an existing PCS, PCSG, or PCLQ is rejected. To change MNNVL configuration, delete the PCS and recreate it.
 
-## Opting Out
+## Usage Examples
 
-If auto MNNVL is enabled globally but you want a specific PodCliqueSet to **not** use MNNVL, explicitly set the annotation to `"disabled"` at creation time:
+### Simple Opt-In (All GPU PodCliques in One Group)
+
+Add `grove.io/mnnvl-group` at the PCS level. All GPU PodCliques share a single ComputeDomain per replica:
 
 ```yaml
 apiVersion: grove.io/v1alpha1
 kind: PodCliqueSet
 metadata:
-  name: my-non-mnnvl-workload
+  name: my-workload
   annotations:
-    grove.io/auto-mnnvl: "disabled"
+    grove.io/mnnvl-group: "my-group"
 spec:
-  replicas: 1
+  replicas: 2
   template:
     cliques:
     - name: worker
      spec:
+        replicas: 1
        podSpec:
          containers:
          - name: model
@@ -88,22 +103,212 @@ spec:
             nvidia.com/gpu: "8"
 ```
 
-When opting out, the operator will **not** create ComputeDomains or any other MNNVL-related artifacts for that PCS.
+This creates ComputeDomains `my-workload-0-my-group` and `my-workload-1-my-group`; all PodCliques in a replica share that replica's ComputeDomain.
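+
+To confirm what Grove created for this example, list the ComputeDomains and inspect the claim reference injected into one of the pods (the pod name below is a placeholder):
+
+```bash
+# One ComputeDomain per replica, named {pcs-name}-{replica-index}-{group-name}
+kubectl get computedomain my-workload-0-my-group my-workload-1-my-group
+
+# The resource claim reference Grove injected into an enrolled pod
+kubectl get pod <worker-pod-name> -o jsonpath='{.spec.resourceClaims}'
+```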
+
+### Multiple MNNVL Groups
+
+Assign different PodCliques to separate groups. Each group gets its own ComputeDomain per replica:
+
+```yaml
+apiVersion: grove.io/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: training-job
+spec:
+  replicas: 2
+  template:
+    cliques:
+    - name: workers
+      annotations:
+        grove.io/mnnvl-group: "workers"
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: worker
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+    - name: encoders
+      annotations:
+        grove.io/mnnvl-group: "encoders"
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: encoder
+            image: my-encoder:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "4"
+    - name: param-servers
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: ps
+            image: my-ps:latest
+            resources:
+              limits:
+                cpu: "4"
+```
+
+Result per replica:
+- `workers` → group `"workers"` → ComputeDomain `training-job-0-workers` / `training-job-1-workers`
+- `encoders` → group `"encoders"` → ComputeDomain `training-job-0-encoders` / `training-job-1-encoders`
+- `param-servers` → no annotation → no MNNVL
+
+### PCS-Level Default With PCLQ Opt-Out
+
+Set a default group at the PCS level and override it on a specific PodClique:
+
+```yaml
+apiVersion: grove.io/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: my-workload
+  annotations:
+    grove.io/mnnvl-group: "my-group"
+spec:
+  replicas: 1
+  template:
+    cliques:
+    - name: workers
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: worker
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+    - name: monitoring
+      annotations:
+        grove.io/mnnvl-group: "none"
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: monitor
+            image: my-monitor:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "1"
+```
+
+- `workers` → inherits `"my-group"` from PCS → enrolled
+- `monitoring` → `"none"` overrides PCS → no MNNVL
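+
+You can confirm the per-clique resolution by comparing the claims injected into an enrolled pod against an opted-out one (pod names below are placeholders):
+
+```bash
+# Enrolled clique: expect a claim entry referencing the ComputeDomain
+kubectl get pod <workers-pod-name> -o jsonpath='{.spec.resourceClaims}'
+
+# Opted-out clique: expect empty output (nothing injected)
+kubectl get pod <monitoring-pod-name> -o jsonpath='{.spec.resourceClaims}'
+```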
+
+### Three-Layer Hierarchy (PCS → PCSG → PCLQ)
+
+A PCS-level group propagates through PodCliqueScalingGroups down to PodCliques. Each layer can override or opt out:
+
+```yaml
+apiVersion: grove.io/v1alpha1
+kind: PodCliqueSet
+metadata:
+  name: my-workload
+  annotations:
+    grove.io/mnnvl-group: "my-group"
+spec:
+  replicas: 1
+  template:
+    podCliqueScalingGroups:
+    - name: opted-out-sg
+      annotations:
+        grove.io/mnnvl-group: "none"
+      cliques: [gpu-a, gpu-b]
+    - name: inherited-sg
+      cliques: [gpu-c, gpu-d]
+    cliques:
+    - name: gpu-a
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: a
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+    - name: gpu-b
+      annotations:
+        grove.io/mnnvl-group: "my-group"
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: b
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+    - name: gpu-c
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: c
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+    - name: gpu-d
+      annotations:
+        grove.io/mnnvl-group: "none"
+      spec:
+        replicas: 1
+        podSpec:
+          containers:
+          - name: d
+            image: my-model:latest
+            resources:
+              limits:
+                nvidia.com/gpu: "8"
+```
+
+Resolution for each PodClique:
+
+| PodClique | PCLQ annotation | PCSG annotation | PCS annotation | Effective group |
+|---|---|---|---|---|
+| `gpu-a` | Absent | `"none"` | `"my-group"` | No MNNVL — PCSG opts out |
+| `gpu-b` | `"my-group"` | `"none"` | `"my-group"` | `"my-group"` — PCLQ overrides PCSG back into the PCS group |
+| `gpu-c` | Absent | Absent | `"my-group"` | `"my-group"` — inherited through PCSG from PCS |
+| `gpu-d` | `"none"` | Absent | `"my-group"` | No MNNVL — PCLQ opts out |
+
+### Group Name Requirements
+
+The `grove.io/mnnvl-group` value must be either `"none"` (opt-out) or a valid DNS-1123 label: lowercase alphanumeric characters or dashes, starting and ending with an alphanumeric character, max 63 characters. The group name becomes part of the ComputeDomain resource name. Invalid values are rejected at admission time.
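+
+For example (these values are illustrative):
+
+```yaml
+metadata:
+  annotations:
+    # Valid: lowercase alphanumerics and dashes, alphanumeric at both ends
+    grove.io/mnnvl-group: "decode-workers"
+    # Invalid, rejected at admission (uppercase and underscore):
+    # grove.io/mnnvl-group: "Decode_Workers"
+    # Invalid, rejected at admission (leading dash):
+    # grove.io/mnnvl-group: "-workers"
+```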
+
+## ComputeDomain Naming
 
-> **Note:** The annotation value must be exactly `"enabled"` or `"disabled"`. Other values (e.g., `"true"`, `"false"`, empty string) will be rejected.
+Grove creates one ComputeDomain per MNNVL group per PCS replica. The name follows the pattern `{pcs-name}-{replica-index}-{group-name}`.
+
+For a PCS named `my-workload` with `replicas: 2` and the annotation `grove.io/mnnvl-group: "my-group"`:
+
+| Replica | ComputeDomain name |
+|---|---|
+| 0 | `my-workload-0-my-group` |
+| 1 | `my-workload-1-my-group` |
+
+If multiple groups exist within the same PCS (e.g., `"workers"` and `"encoders"`), each group produces its own set of ComputeDomains: `my-workload-0-workers`, `my-workload-0-encoders`, etc.
+
+Group names are scoped to the PCS — different PCS resources can reuse the same group names without conflict.
 
 ## Observability
 
 ### Checking ComputeDomain Status
 
-ComputeDomains have no meaningful status at creation time -- they only become active after pods referencing their ResourceClaimTemplate are scheduled. To check the status of ComputeDomains managed by Grove:
+ComputeDomains have no meaningful status at creation time — they only become active after pods referencing their ResourceClaimTemplate are scheduled. To check the status of ComputeDomains managed by Grove:
 
 ```bash
 # List all ComputeDomains for a specific PCS
-kubectl get computedomain -l app.kubernetes.io/part-of=my-inference
+kubectl get computedomain -l app.kubernetes.io/part-of=my-workload
 
 # Get detailed status for a specific ComputeDomain
-kubectl describe computedomain my-inference-0
+kubectl describe computedomain my-workload-0-my-group
 ```
 
 ### Kubernetes Events
 
@@ -111,19 +316,19 @@ kubectl describe computedomain my-inference-0
 Grove emits Kubernetes events on the PCS resource for ComputeDomain lifecycle operations:
 
 ```bash
-kubectl describe pcs my-inference
+kubectl describe pcs my-workload
 # Events:
-#   Normal   ComputeDomainCreated  ComputeDomain my-inference-0 created
-#   Normal   ComputeDomainCreated  ComputeDomain my-inference-1 created
+#   Normal   ComputeDomainCreated  ComputeDomain my-workload-0-my-group created
+#   Normal   ComputeDomainCreated  ComputeDomain my-workload-1-my-group created
 #   Warning  ComputeDomainFailed   Failed to create ComputeDomain for replica 2:
 ```
 
-### Verifying MNNVL Annotation
+### Verifying MNNVL Group
 
-To check whether a PCS has auto MNNVL enabled:
+To check which MNNVL group a PCS belongs to:
 
 ```bash
-kubectl get pcs my-inference -o jsonpath='{.metadata.annotations.grove\.io/auto-mnnvl}'
+kubectl get pcs my-workload -o jsonpath='{.metadata.annotations.grove\.io/mnnvl-group}'
 ```
 
 ## Scaling Behavior
 
@@ -132,10 +337,10 @@ Auto MNNVL integrates seamlessly with PodCliqueSet scaling:
 
 ### Scale-Out
 
-When you increase the replica count on a PCS, the operator automatically creates new ComputeDomains for the additional replicas. For example, scaling `my-inference` from 2 to 4 replicas creates `my-inference-2` and `my-inference-3`.
+When you increase the replica count on a PCS, the operator automatically creates new ComputeDomains for the additional replicas. For example, scaling `my-workload` from 2 to 4 replicas creates `my-workload-2-my-group` and `my-workload-3-my-group`.
 
 ```bash
-kubectl scale pcs my-inference --replicas=4
+kubectl scale pcs my-workload --replicas=4
 ```
 
 ### Scale-In
 
@@ -143,22 +348,30 @@ kubectl scale pcs my-workload --replicas=4
 When you decrease the replica count, the operator removes ComputeDomains for the excess replicas. Any ComputeDomain with a replica index equal to or greater than the new count is cleaned up.
 
 ```bash
-kubectl scale pcs my-inference --replicas=1
-# ComputeDomains my-inference-1, my-inference-2, my-inference-3 are deleted
+kubectl scale pcs my-workload --replicas=1
+# ComputeDomains my-workload-1-my-group, my-workload-2-my-group, my-workload-3-my-group are deleted
 ```
 
 ### Deletion Protection
 
-Grove adds a finalizer (`grove.io/computedomain-finalizer`) to each ComputeDomain it creates. This prevents accidental deletion of ComputeDomains while pods are actively using them. If a user attempts to delete a ComputeDomain manually, it will remain in `Terminating` state until the owning PCS is deleted or scaled down.
+Grove adds a finalizer (`grove.io/computedomain-finalizer`) to each ComputeDomain it creates. This prevents accidental deletion of ComputeDomains while pods are actively using them. If you attempt to delete a ComputeDomain manually, the controller recreates it as long as the owning PCS still requires it.
 
 ## Backward Compatibility
 
-Existing PodCliqueSet resources created before the MNNVL feature was enabled will not have the `grove.io/auto-mnnvl` annotation. These workloads will continue to operate without ComputeDomains, even after the feature is enabled globally. To enable MNNVL for an existing workload, the PCS must be deleted and recreated.
+The `grove.io/mnnvl-group` annotation replaces the previous `grove.io/auto-mnnvl` annotation. Existing PodCliqueSet resources using `grove.io/auto-mnnvl` must be updated:
+
+| Before | After |
+|---|---|
+| `grove.io/auto-mnnvl: "enabled"` | `grove.io/mnnvl-group: "my-group"` (any valid group name; previously all GPU PodCliques shared one ComputeDomain per replica, so a single PCS-level group reproduces the old behavior) |
+| `grove.io/auto-mnnvl: "disabled"` | `grove.io/mnnvl-group: "none"` or remove the annotation |
+| No annotation | No annotation (no change needed) |
+
+The `grove.io/auto-mnnvl` annotation is no longer recognized. If a PCS still carries it, the annotation is simply ignored.
+
+> **Note:** ComputeDomain naming has changed. Previously, ComputeDomains were named `{pcs-name}-{replica-index}`. Now they are named `{pcs-name}-{replica-index}-{group-name}`. Existing ComputeDomains created under the old naming scheme are not automatically migrated. Delete and recreate affected PCS resources after upgrading.
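+
+Because the annotation is immutable, migrating means recreating the PCS. A minimal sketch, assuming your PCS manifest lives in a file (the file name is a placeholder):
+
+```bash
+# 1. In the manifest, replace the old annotation with the new one, e.g.
+#      grove.io/auto-mnnvl: "enabled"  ->  grove.io/mnnvl-group: "my-group"
+# 2. Recreate the PCS; old ComputeDomains are removed with the PCS and
+#    new-style ones are created on the next reconcile.
+kubectl delete pcs my-workload
+kubectl apply -f my-workload.yaml
+```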
 
 ## Limitations
 
-- **One IMEX channel per node:** Currently, only one IMEX domain can be supported per node. Since, ComputeDomains do not support sharing IMEX domain between workloads, each node can only support pods from at most one MNNVL-enabled workload at a time. MNNVL-enabled pods from different workloads cannot share the same node at the same time.
-- **PCS-level granularity:** Auto-MNNVL feature is uniformly applied to the entire PodCliqueSet. All GPU pods in a PCS will automatically get the IMEX channel setup if the underlying GPUs support it. This feature cannot be special-cased for a subset of PodCliques in a PodCliqueSet.
-- **No ComputeDomain customization:** ComputeDomain and ResourceClaimTemplate configurations are automatically generated and cannot be customized. Opt out if you need custom settings.
-- **Immutable annotation:** The `grove.io/auto-mnnvl` annotation cannot be changed after PCS creation. Delete and recreate the PCS to change MNNVL behavior.
-- **No ComputeDomain status propagation:** Grove does not surface ComputeDomain status in PCS status fields. Inspect the ComputeDomain resource directly using `kubectl get computedomain`.
\ No newline at end of file
+- **One IMEX channel per node:** Only one IMEX domain is currently supported per node, and ComputeDomains cannot share an IMEX domain between workloads, so each node can run pods from at most one MNNVL-enabled workload at a time.
+- **Immutable annotation:** The `grove.io/mnnvl-group` annotation cannot be changed after PCS creation at any level (PCS, PCSG, PCLQ). Delete and recreate the PCS to change MNNVL configuration.
+- **No ComputeDomain status propagation:** Grove does not surface ComputeDomain status in PCS status fields. Inspect the ComputeDomain resource directly using `kubectl get computedomain` (see the example below).
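+
+For example, to see whatever status the DRA driver reports on a ComputeDomain (the name follows the pattern described above):
+
+```bash
+kubectl get computedomain my-workload-0-my-group -o yaml
+```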