Transolver cae shard tensor #1584
Conversation
Greptile Summary
This PR adds end-to-end ShardTensor / domain-parallelism support to the Transolver example, introducing a 2D …
Important Files Changed
```python
# Debug: show rank, local rank, device, and node info for each process
import socket

hostname = socket.gethostname()
device = dist_manager.device
rank = dist_manager.rank
local_rank = dist_manager.local_rank
world_size = dist_manager.world_size
gpu_name = torch.cuda.get_device_name(device) if torch.cuda.is_available() else "N/A"
print(
    f"[DEBUG] Rank {rank}/{world_size - 1} | "
    f"Local Rank {local_rank} | "
    f"Node: {hostname} | "
    f"Device: {device} | "
    f"GPU: {gpu_name}"
)
if torch.cuda.is_available():
    mem_total = torch.cuda.get_device_properties(device).total_memory / (1024**3)
    print(
        f"[DEBUG] Rank {rank} | GPU Memory: {mem_total:.1f} GB | "
        f"CUDA Version: {torch.version.cuda}"
    )
torch.distributed.barrier()
```
Debug block with barrier should be removed before merging
This block is labelled [DEBUG], imports socket inside main(), and ends with torch.distributed.barrier(). The barrier adds a hard synchronisation point on every run — straggling ranks will block all others there, and in a 64-GPU job it can add measurable wall-clock overhead. Debug print calls from all ranks will also produce interleaved noise in logs.
Please remove this block (or guard it behind a --debug flag / logger.debug) before landing in main.
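If the block is kept, one lightweight pattern is to gate it behind a flag and print only from rank 0, with no barrier at all. A minimal sketch (`--debug`, `maybe_print_debug`, and the rank arguments are illustrative names, not identifiers from this PR):

```python
import argparse


def maybe_print_debug(rank, world_size, device, debug):
    """Build and print the per-rank debug line only when explicitly requested.

    No barrier is needed: each rank prints (or skips) independently, and
    the default path does nothing, so production runs pay no cost.
    """
    if not debug:
        return None
    line = f"[DEBUG] Rank {rank}/{world_size - 1} | Device: {device}"
    if rank == 0:  # avoid interleaved output from every rank
        print(line)
    return line


parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")
args = parser.parse_args([])  # empty argv for illustration: debug stays off

print(maybe_print_debug(0, 8, "cuda:0", args.debug))  # → None
```

The same effect can be had with `logger.debug(...)` plus a log-level flag, which also keeps the formatting machinery out of `main()`.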
```shell
export USER_LUSTRE=/lustre/fsw/portfolios/coreai/users/coreya
export GROUP_LUSTRE=/lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae
export HOME=${USER_LUSTRE}

# Container setup
CONTAINER_IMAGE=$GROUP_LUSTRE/containers/pytorch26.01-py3.sqsh
CONTAINER_MOUNTS="$USER_LUSTRE:/user_data/,$GROUP_LUSTRE:/group_data,$HOME:/root/,/lustre:/lustre,/tmp:/tmp"

# Virtual environment path
VENV_PATH="$USER_LUSTRE/venvs/shard_tensor_benchmarks/"

WORKDIR="$USER_LUSTRE/workdir/shard_tensor_benchmarks/examples/cfd/external_aerodynamics/transformer_models/"
```
Hardcoded NVIDIA-internal paths make this script non-functional for external users
USER_LUSTRE, GROUP_LUSTRE, CONTAINER_IMAGE, VENV_PATH, WORKDIR, ZARR_TRAIN_PATH, and ZARR_VAL_PATH are all hardcoded to a specific user's (coreya) NVIDIA-internal Lustre filesystem. Anyone else cloning the repo will need to change every one of these before the script can run.
Consider replacing these with placeholder values (e.g. YOUR_LUSTRE_HOME, YOUR_CONTAINER_IMAGE) and documenting each variable so the script serves as a usable template.
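One way to structure that template (all paths below are placeholders, not values from this PR) is to read every site-specific root from the environment with an obvious placeholder default, and derive the remaining paths from those roots so a single override cascades:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Site-specific roots: override via the environment; the defaults are
# deliberately non-functional placeholders that document what is needed.
USER_LUSTRE="${USER_LUSTRE:-/path/to/your/user/scratch}"
GROUP_LUSTRE="${GROUP_LUSTRE:-/path/to/your/shared/project/dir}"
CONTAINER_IMAGE="${CONTAINER_IMAGE:-$GROUP_LUSTRE/containers/your_image.sqsh}"

# Derived paths default off the roots above, so overriding USER_LUSTRE
# alone relocates the venv and working directory as well.
VENV_PATH="${VENV_PATH:-$USER_LUSTRE/venvs/shard_tensor_benchmarks}"
WORKDIR="${WORKDIR:-$USER_LUSTRE/workdir/shard_tensor_benchmarks}"

echo "container: $CONTAINER_IMAGE"
echo "venv:      $VENV_PATH"
echo "workdir:   $WORKDIR"
```

Using the stricter `${VAR:?message}` expansion instead of `:-` defaults would make the script abort with a clear message when a required variable is unset.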
```python
step_records.append({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "phase": "train",
    "epoch": epoch,
    "iteration": i,
    "global_step": i + epoch_len * epoch,
    "loss": this_loss,
    "learning_rate": optimizer.param_groups[0]["lr"],
    "duration_s": duration,
    "throughput_per_gpu": images_per_second,
    "memory_reserved_gb": mem_usage,
    **{k: float(v) if hasattr(v, "item") else v for k, v in metrics.items()},
})
```
step_records.append in train_epoch is not guarded by rank 0 — inconsistent with val_epoch
In val_epoch the corresponding step_records.append block is wrapped in if dist_manager.rank == 0:, but here it runs on every rank. Because metrics_log_path is None on non-zero ranks, no file write occurs, but every non-zero rank still silently builds an in-memory list that is never used. Consider adding the same if dist_manager.rank == 0: guard here for consistency.
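A minimal sketch of the suggested guard (`FakeDistManager` and `record_step` are illustrative stand-ins for the PR's actual `dist_manager` and training-loop code):

```python
from dataclasses import dataclass


@dataclass
class FakeDistManager:
    """Stand-in for the distributed manager; only the rank matters here."""
    rank: int


def record_step(dist_manager, step_records, record):
    """Append per-step metrics only on rank 0, mirroring val_epoch.

    Non-zero ranks skip the append entirely, so they no longer
    accumulate an unused in-memory list over a long training run.
    """
    if dist_manager.rank == 0:
        step_records.append(record)


records = []
record_step(FakeDistManager(rank=0), records, {"phase": "train", "loss": 0.5})
record_step(FakeDistManager(rank=3), records, {"phase": "train", "loss": 0.7})
print(len(records))  # → 1, only the rank-0 record lands
```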
PhysicsNeMo Pull Request
This PR adds ShardTensor support to the Transolver example end to end.
Description
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.