88 changes: 73 additions & 15 deletions docs/source/fault_tolerance/usage_guide.rst
Expand Up @@ -109,33 +109,91 @@ Validation behavior:
- Other existing types (e.g., devices/symlinks): performs ``stat`` access


Attribution service integration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Attribution integration
^^^^^^^^^^^^^^^^^^^^^^

Enable artifact analysis (e.g., logs) during rendezvous health checks by pointing to a running attribution service.
The feature is enabled by specifying both host and port.
Enable artifact analysis (e.g., logs) during rendezvous to make RESTART/STOP decisions.
You can configure **one or more backends** (e.g. ``mcp`` for LogSage + flight recorder (FR) analysis via MCP, plus an HTTP URL for a third-party service). The workload is stopped (not restarted) if **any** backend recommends against restarting.
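
The "any backend can veto a restart" rule can be sketched as follows. This is only an illustration, not the launcher's implementation: the ``Decision`` enum and ``combine_decisions`` are hypothetical names, and treating a missing result (``None``) or an empty list as "restart" is an assumption.

```python
from enum import Enum
from typing import Iterable, Optional


class Decision(Enum):
    """Hypothetical per-backend verdict (not a name from the library)."""

    RESTART = "restart"
    STOP = "stop"


def combine_decisions(results: Iterable[Optional[Decision]]) -> Decision:
    """Return STOP if any backend said STOP; otherwise RESTART.

    A None entry stands for a backend that produced no result
    (for example, it timed out) and is ignored. This is an assumption.
    """
    for result in results:
        if result is Decision.STOP:
            return Decision.STOP
    return Decision.RESTART
```

With this rule a single ``STOP`` outweighs any number of ``RESTART`` verdicts, so adding a backend can only make the launcher more conservative.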

* CLI:
Use ``--ft-attribution-backend`` (repeatable) and/or YAML ``attribution_backends``.

* ``mcp``: Log analysis via MCP subprocess (``nvrx-mcp-analysis``).
* **HTTP URL** (no separate keyword): pass the URL as the flag value, e.g.
``--ft-attribution-backend http://127.0.0.1:8000`` or ``--ft-attribution-backend host:port``
(``http://`` is added when you use ``host:port`` form).
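
The backend value forms described above (``mcp``, a full URL, or ``host:port``) could be normalized roughly like this. The helper name ``normalize_backend`` is hypothetical, and accepting ``https://`` URLs unchanged is an assumption; the actual parsing in ``ft_launcher`` may differ.

```python
def normalize_backend(value: str) -> str:
    """Map a backend flag value to either "mcp" or an HTTP(S) URL."""
    value = value.strip()
    if value == "mcp":
        # Keyword backend: MCP subprocess, no URL involved.
        return value
    if value.startswith(("http://", "https://")):
        # Already a full URL: keep as-is.
        return value
    # Bare host:port form: prepend the http:// scheme.
    return "http://" + value
```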

- ``--ft-attrsvc-host <HOST>`` (alias: ``--ft_attrsvc_host``)
- ``--ft-attrsvc-port <PORT>`` (alias: ``--ft_attrsvc_port``)
* CLI:

Example:
- ``--ft-attribution-backend`` (alias: ``--ft_attribution_backend``): Add one backend; repeat for multiple.
Each value is ``mcp`` or an HTTP URL. Combined with YAML ``attribution_backends``.
- ``--ft-attribution-timeout`` (alias: ``--ft_attribution_timeout``): Wait/timeout in seconds;
skip result if exceeded (default: 60).
- ``--ft-attribution-dry-run`` (alias: ``--ft_attribution_dry_run``): Dry run. Run the full
attribution chain (log analysis, Slack, dataflow) but do not apply the restart/stop decision.
Log what would happen instead. Useful for validating the pipeline without affecting behavior.
- ``--ft-slack-token-file`` (alias: ``--ft_slack_token_file``): Path to file containing Slack bot token.
When not set, uses ``SLACK_BOT_TOKEN`` or ``SLACK_BOT_TOKEN_FILE`` env vars.
- ``--ft-slack-channel`` (alias: ``--ft_slack_channel``): Slack channel for alerts.
When not set, uses ``SLACK_CHANNEL`` env var.
- ``--ft-dataflow-index`` (alias: ``--ft_dataflow_index``): Elasticsearch/dataflow index for posting
attribution results (mcp/URL). Requires ``nvdataflow`` (install via ``pip install nvidia-resiliency-ext[dataflow]``).
When not set, dataflow posting is disabled.
- ``--ft-llm-api-key-file`` (alias: ``--ft_llm_api_key_file``): Path to a file containing the LLM API key.
Sets ``LLM_API_KEY_FILE`` in the process before MCP attribution starts. Overrides YAML ``llm_api_key_file`` when both are set.
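
One way to picture the ``--ft-attribution-timeout`` behavior ("skip result if exceeded") is a shared deadline across all backends. The sketch below is an assumption about the semantics, not the launcher's code: ``collect_results`` is a hypothetical helper, each backend is modeled as a plain callable, and a skipped result is represented as ``None``.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Optional


def collect_results(
    backends: Dict[str, Callable[[], str]], timeout_s: float
) -> Dict[str, Optional[str]]:
    """Run every backend concurrently; record None for any that miss the deadline."""
    results: Dict[str, Optional[str]] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(backends)))
    try:
        futures = {name: pool.submit(fn) for name, fn in backends.items()}
        deadline = time.monotonic() + timeout_s
        for name, future in futures.items():
            # All backends share one deadline rather than getting timeout_s each.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = future.result(timeout=remaining)
            except FutureTimeout:
                results[name] = None  # skipped: no verdict from this backend
    finally:
        # Do not block on stragglers; their results are simply discarded.
        pool.shutdown(wait=False)
    return results
```

Errors raised by a backend are not handled here and would propagate to the caller.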

Examples:

.. code-block:: bash

ft_launcher \
--ft-attrsvc-host 127.0.0.1 \
--ft-attrsvc-port 8000 \
train.py
# MCP: log analysis via nvrx-mcp-analysis
ft_launcher --ft-attribution-backend mcp train.py

# URL mode (HTTP attribution service)
ft_launcher --ft-attribution-backend http://127.0.0.1:8000 train.py

# Service with custom timeout
ft_launcher --ft-attribution-backend http://127.0.0.1:8000 --ft-attribution-timeout 90 train.py

# MCP with Slack alerts (token from file; channel from SLACK_CHANNEL env var)
ft_launcher --ft-attribution-backend mcp --ft-slack-token-file /etc/secrets/slack-token train.py

# MCP with explicit Slack channel and dataflow index
ft_launcher --ft-attribution-backend mcp \
--ft-slack-token-file /etc/secrets/slack-token --ft-slack-channel "#alerts" \
--ft-dataflow-index my-attribution-index train.py

# Dry run: exercise full attribution chain without applying restart/stop decision
ft_launcher --ft-attribution-backend mcp --ft-attribution-dry-run train.py

# Multiple backends: MCP plus third-party HTTP service
ft_launcher --ft-attribution-backend mcp --ft-attribution-backend http://127.0.0.1:8000 train.py

* YAML: under the ``fault_tolerance`` section
* YAML: under the ``fault_tolerance`` section use ``attribution_backends`` (list of ``mcp`` and/or URLs),
``attribution_timeout_seconds``, ``slack``, ``dataflow_index``, and optional ``llm_api_key_file``:

.. code-block:: yaml

fault_tolerance:
attrsvc_host: "127.0.0.1"
attrsvc_port: 8000
# Prefer explicit list for multiple backends:
attribution_backends:
- "mcp"
- "http://127.0.0.1:8000"
attribution_timeout_seconds: 60
attribution_dry_run: false # true = run chain but don't apply action; log only
slack:
bot_token_file: "/etc/secrets/slack-token" # or bot_token for inline (less secure)
channel: "#alerts"
dataflow_index: "my-attribution-index" # optional; requires nvdataflow
llm_api_key_file: "/etc/secrets/llm-api-key" # optional; sets LLM_API_KEY_FILE for MCP

* Environment (fallback when CLI/YAML not set):

- ``SLACK_BOT_TOKEN`` or ``SLACK_BOT_TOKEN_FILE``: Slack bot token for mcp/URL alerts.
- ``SLACK_CHANNEL``: Slack channel for alerts.
- **LLM / LogSage API key** (MCP backend): ``LLM_API_KEY`` or ``LLM_API_KEY_FILE``, or default files
``~/.llm_api_key`` / ``~/.config/nvrx/llm_api_key`` (see ``load_llm_api_key`` in
``nvidia_resiliency_ext.attribution.api_keys``). For ``ft_launcher``, use YAML ``llm_api_key_file`` or
``--ft-llm-api-key-file``.

GPU Memory Reclaim
^^^^^^^^^^^^^^^^^^
1 change: 1 addition & 0 deletions pyproject.toml
Expand Up @@ -52,6 +52,7 @@ setproctitle = ">=1.3.0"
logsage = ">=0.1.7"
grpcio = "^1.76.0"
grpcio-tools = "^1.76.0"
httpx = ">=0.24.0"
protobuf = ">=4.22.0"

[tool.poetry.scripts]
10 changes: 5 additions & 5 deletions services/nvrx_attrsvc/ATTRSVC_SPEC.md
Expand Up @@ -94,10 +94,10 @@ Two layers: **library** (`nvidia_resiliency_ext.attribution`) and **service**
**3.1 Environment variables** — Full table and defaults: **README.md** (source of truth).

Summary:
- Prefix **`NVRX_ATTRSVC_`** for service settings (see README for exceptions: NVIDIA
API key, Slack tokens, optional `NVIDIA_API_KEY_FILE` / file paths in `api_keys.py`).
- **`NVIDIA_API_KEY`**: required for attribution; loaded in `config.setup()` after
logging — **empty/missing → log error and process exit**. Slack is optional.
- Prefix **`NVRX_ATTRSVC_`** for service settings (see README for exceptions: LLM
API key, Slack tokens, optional `LLM_API_KEY_FILE` / file paths in `api_keys.py`).
- **`LLM_API_KEY`** / **`LLM_API_KEY_FILE`**: required for attribution (or default key files);
loaded in `config.setup()` after logging — **empty/missing → log error and process exit**. Slack is optional.
- LLM-related env vars are optional; unset → library defaults (`LogAnalyzerConfig`).
- Rate limits: slowapi, `RATE_LIMIT_SUBMIT` / `RATE_LIMIT_ANALYZE` / `RATE_LIMIT_PREVIEW`.

Expand Down Expand Up @@ -144,7 +144,7 @@ Patterns tried in order (scheduler-agnostic where possible): `_(\d+)_date_`,
--------------------------------------------------------------------------------

**Startup (conceptual)**
Load `Settings` → configure logging → **require non-empty NVIDIA API key** → wire
Load `Settings` → configure logging → **require non-empty LLM API key** → wire
postprocessing (`configure`, poster, dataflow index, Slack) → construct
`AttributionService` / **`Analyzer`** → background poll → Uvicorn. Optional cache
import.
14 changes: 7 additions & 7 deletions services/nvrx_attrsvc/README.md
Expand Up @@ -24,8 +24,8 @@ pip install -e .

# Run
export NVRX_ATTRSVC_ALLOWED_ROOT=/path/to/logs
# API key: set env var OR create ~/.nvidia_api_key file
export NVIDIA_API_KEY=nvapi-...
# API key: set env var OR create ~/.llm_api_key file
export LLM_API_KEY=your-llm-api-key-here
nvrx-attrsvc
```

Expand Down Expand Up @@ -57,11 +57,11 @@ Environment variables (prefix: `NVRX_ATTRSVC_`):
| `NVRX_ATTRSVC_COMPUTE_TIMEOUT` | Timeout for analysis in seconds |
| `NVRX_ATTRSVC_ANALYSIS_BACKEND` | `mcp` (subprocess MCP, default) or `lib` (in-process LogSage and flight-recorder analysis). Same setting for both; library behavior: **ARCHITECTURE.md §7**. Legacy env: `NVRX_ATTRSVC_LOG_ANALYSIS_BACKEND`. |

**NVIDIA API Key** (required, checked in order):
1. `NVIDIA_API_KEY` environment variable
2. `NVIDIA_API_KEY_FILE` environment variable (path to file)
3. `~/.nvidia_api_key` file
4. `~/.config/nvrx/nvidia_api_key` file
**LLM API Key** (required, checked in order — see `api_keys.load_llm_api_key`):
1. `LLM_API_KEY` environment variable
2. `LLM_API_KEY_FILE` environment variable (path to file)
3. `~/.llm_api_key` file
4. `~/.config/nvrx/llm_api_key` file
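
The lookup order above can be sketched as below. This is a hypothetical re-implementation for illustration only, not the code of `api_keys.load_llm_api_key`: the name `find_llm_api_key`, its parameters, and the choice to fall through to the next source when a file is missing or empty are all assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_llm_api_key(
    env: Mapping[str, str] = os.environ, home: Optional[Path] = None
) -> Optional[str]:
    """Return the first non-empty key found, following the documented order."""
    home = home or Path.home()

    # 1. Direct value in the environment.
    key = env.get("LLM_API_KEY", "").strip()
    if key:
        return key

    # 2. File named by LLM_API_KEY_FILE, then 3./4. the default key files.
    candidates = []
    key_file = env.get("LLM_API_KEY_FILE", "").strip()
    if key_file:
        candidates.append(Path(key_file).expanduser())
    candidates.append(home / ".llm_api_key")
    candidates.append(home / ".config" / "nvrx" / "llm_api_key")

    for path in candidates:
        try:
            text = path.read_text(encoding="utf-8").strip()
        except OSError:
            continue  # missing or unreadable: try the next source (assumption)
        if text:
            return text
    return None
```

Passing `env` and `home` explicitly keeps the sketch easy to exercise without touching the real environment or home directory.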

**Slack Notifications** (optional; no `NVRX_ATTRSVC_` prefix):

12 changes: 6 additions & 6 deletions services/nvrx_attrsvc/config.py
Expand Up @@ -262,14 +262,14 @@ def setup() -> Settings:
logging.getLogger("nvidia_resiliency_ext.attribution.mcp_integration").setLevel(_root_lvl)
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)

from nvidia_resiliency_ext.attribution.api_keys import load_nvidia_api_key, load_slack_bot_token
from nvidia_resiliency_ext.attribution.api_keys import load_llm_api_key, load_slack_bot_token

nvidia_key = load_nvidia_api_key()
if not nvidia_key:
llm_key = load_llm_api_key()
if not llm_key:
logger.error(
"NVIDIA API key not found or empty. Attribution requires a key. Set NVIDIA_API_KEY "
"or NVIDIA_API_KEY_FILE, or place a key in ~/.nvidia_api_key or "
"~/.config/nvrx/nvidia_api_key. Slack notifications remain optional (SLACK_BOT_TOKEN)."
"LLM API key not found or empty. Attribution requires a key. Set LLM_API_KEY or "
"LLM_API_KEY_FILE, or default key files (~/.llm_api_key, ~/.config/nvrx/llm_api_key). "
"Slack notifications remain optional (SLACK_BOT_TOKEN)."
)
raise SystemExit(1)

2 changes: 1 addition & 1 deletion services/nvrx_attrsvc/deploy/Dockerfile
Expand Up @@ -7,7 +7,7 @@
# docker run -d \
# -p 8000:8000 \
# -e NVRX_ATTRSVC_ALLOWED_ROOT=/data/logs \
# -e NVIDIA_API_KEY=nvapi-... \
# -e LLM_API_KEY=your-llm-api-key-here \
# -v /path/to/logs:/data/logs:ro \
# nvrx-attrsvc
#
6 changes: 3 additions & 3 deletions services/nvrx_attrsvc/deploy/kubernetes.yaml
Expand Up @@ -4,7 +4,7 @@
# kubectl apply -f services/nvrx_attrsvc/deploy/kubernetes.yaml
#
# Prerequisites:
# - Create secret: kubectl create secret generic nvidia-api-key --from-literal=api-key=nvapi-...
# - Create secret: kubectl create secret generic llm-api-key --from-literal=api-key=your-llm-api-key-here
# - Ensure log volume is accessible (update hostPath as needed)
#
# Deployment considerations:
Expand Down Expand Up @@ -54,10 +54,10 @@ spec:
- configMapRef:
name: nvrx-attrsvc-config
env:
- name: NVIDIA_API_KEY
- name: LLM_API_KEY
valueFrom:
secretKeyRef:
name: nvidia-api-key
name: llm-api-key
key: api-key
volumeMounts:
- name: logs
2 changes: 1 addition & 1 deletion services/nvrx_attrsvc/deploy/nvrx-attrsvc.service
Expand Up @@ -7,7 +7,7 @@
# Manual installation:
# 1. Create venv: python3 -m venv /opt/nvrx/venv
# 2. Install: /opt/nvrx/venv/bin/pip install -e services
# 3. Create API key: echo "nvapi-xxx" | sudo tee /etc/nvrx/nvidia_api_key
# 3. Create API key: echo "your-llm-api-key-here" | sudo tee /etc/nvrx/llm_api_key
# 4. Copy service: sudo cp nvrx-attrsvc.service /etc/systemd/system/
# 5. Reload: sudo systemctl daemon-reload
# 6. Enable: sudo systemctl enable nvrx-attrsvc
6 changes: 3 additions & 3 deletions services/nvrx_attrsvc/deploy/run_attrsvc.sh
Expand Up @@ -6,7 +6,7 @@
#
# Required environment variables:
# NVRX_ATTRSVC_ALLOWED_ROOT - Root path for log files to analyze
# NVIDIA_API_KEY - API key for LLM (or NVIDIA_API_KEY_FILE)
# LLM_API_KEY - API key for LLM (or LLM_API_KEY_FILE)
#
# Optional environment variables:
# NVRX_ATTRSVC_PORT - Listen port (default: 8000)
Expand All @@ -17,7 +17,7 @@
#
# Example:
# export NVRX_ATTRSVC_ALLOWED_ROOT=/lustre/logs
# export NVIDIA_API_KEY=nvapi-...
# export LLM_API_KEY=your-llm-api-key-here
# ./run_attrsvc.sh ~/nvrx_logs

set -e
Expand All @@ -38,7 +38,7 @@ PID_FILE="${OUTPUT_DIR}/${PREFIX}_attrsvc.pid"
validate_attrsvc_allowed_root || exit 1

# Setup API key
setup_nvidia_api_key || exit 1
setup_llm_api_key || exit 1

# Create output directory
ensure_directory "${OUTPUT_DIR}" "logs directory" || exit 1
8 changes: 4 additions & 4 deletions services/nvrx_attrsvc/deploy/slurm.sbatch
Expand Up @@ -18,9 +18,9 @@
# NVRX_ATTRSVC_ALLOWED_ROOT - Root path for log files to analyze
#
# API Key Options (in priority order):
# 1. NVIDIA_API_KEY env var (direct key)
# 2. NVIDIA_API_KEY_FILE env var (path to file containing key)
# 3. Default: ~/.nvidia_api_key
# 1. LLM_API_KEY env var (direct key)
# 2. LLM_API_KEY_FILE env var (path to file containing key)
# 3. Default: ~/.llm_api_key or ~/.config/nvrx/llm_api_key
#
# Example:
# NVRX_ATTRSVC_ALLOWED_ROOT=/lustre/logs sbatch --account=myaccount slurm.sbatch
Expand Down Expand Up @@ -48,7 +48,7 @@ export NVRX_ATTRSVC_NVDATAFLOW_PROJECT="${NVRX_ATTRSVC_NVDATAFLOW_PROJECT:-}"
export NVRX_ATTRSVC_CLUSTER_NAME="${NVRX_ATTRSVC_CLUSTER_NAME:-${SLURM_CLUSTER_NAME:-unknown}}"

# Setup API key
setup_nvidia_api_key || exit 1
setup_llm_api_key || exit 1

# Install packages
install_nvrx_packages "attrsvc"
14 changes: 7 additions & 7 deletions services/scripts/README.md
Expand Up @@ -20,8 +20,8 @@ Shared shell scripts for deployment and monitoring.
```bash
# Set required environment
export NVRX_ATTRSVC_ALLOWED_ROOT=/lustre/logs
# API key: set env var OR create ~/.nvidia_api_key file
export NVIDIA_API_KEY=nvapi-...
# API key: set env var OR create ~/.llm_api_key file
export LLM_API_KEY=your-llm-api-key-here

# Install, start, and manage
./scripts/run_services.sh install # Install packages
Expand Down Expand Up @@ -50,10 +50,10 @@ sudo ./scripts/setup_systemd.sh start
### API Key

The API key can be provided in multiple ways (checked in order):
1. `NVIDIA_API_KEY` environment variable
2. `NVIDIA_API_KEY_FILE` environment variable (path to key file)
3. `~/.nvidia_api_key` file
4. `~/.config/nvrx/nvidia_api_key` file
1. `LLM_API_KEY` environment variable
2. `LLM_API_KEY_FILE` environment variable (path to key file)
3. `~/.llm_api_key` file
4. `~/.config/nvrx/llm_api_key` file

**Output files** (in `~/nvrx_logs/` by default):
- `<timestamp>_attrsvc.log` - Attribution service stdout/stderr
Expand Down Expand Up @@ -129,7 +129,7 @@ Shared functions sourced by other scripts:

| Function | Description |
|----------|-------------|
| `setup_nvidia_api_key` | Load API key from env, file, or default location |
| `setup_llm_api_key` | Load LLM API key from env, file, or default location |
| `install_nvrx_packages` | Install NVRX packages from local repo |
| `validate_commands` | Check required commands exist |

4 changes: 2 additions & 2 deletions services/scripts/build_enroot_image.sh
Expand Up @@ -14,7 +14,7 @@
# # Run attribution service
# srun --container-image=/path/to/nvrx-services.sqsh \
# --container-env=NVRX_ATTRSVC_ALLOWED_ROOT=/data \
# --container-env=NVIDIA_API_KEY=${NVIDIA_API_KEY} \
# --container-env=LLM_API_KEY=${LLM_API_KEY} \
# --container-mounts=/path/to/logs:/data:ro \
# nvrx-attrsvc
#
Expand Down Expand Up @@ -150,7 +150,7 @@ echo ""
echo " # Attribution service"
echo " srun --container-image=${OUTPUT_PATH} \\"
echo " --container-env=NVRX_ATTRSVC_ALLOWED_ROOT=/data \\"
echo " --container-env=NVIDIA_API_KEY=\${NVIDIA_API_KEY} \\"
echo " --container-env=LLM_API_KEY=\${LLM_API_KEY} \\"
echo " --container-mounts=/path/to/logs:/data:ro \\"
echo " nvrx-attrsvc"
echo ""