
Transolver cae shard tensor #1584

Open

coreyjadams wants to merge 2 commits into NVIDIA:main from coreyjadams:transolver_cae_shard_tensor

Conversation

@coreyjadams
Collaborator

PhysicsNeMo Pull Request

This PR adds ShardTensor support to the Transolver example end to end.

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an
indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but the comments are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR adds end-to-end ShardTensor / domain-parallelism support to the Transolver example, introducing a 2D (ddp, domain) DeviceMesh, FSDP2 model wrapping, ShardTensor-aware subsampling in the datapipe, corrected metric reduction scoped to the data-parallel dimension, and per-step JSONL logging.

  • P1 – debug barrier in main(): A [DEBUG] print block with torch.distributed.barrier() was left in main(). The barrier adds a hard synchronisation point on every run regardless of config.
  • P1 – silent fallback for physics inputs: air_density and stream_velocity default silently to 1.0 when sharded reads drop Zarr attributes, which can corrupt training without any warning to the user.
  • The train_transolver_sharded.sh script hardcodes NVIDIA-internal Lustre paths and should use placeholder variables before merging as a public example.
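
For context, here is a minimal sketch of the kind of 2D (ddp, domain) DeviceMesh the summary refers to, using the public PyTorch API; the sizes and names below are illustrative assumptions, not the PR's actual code.

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Assumes torch.distributed is already initialised (e.g. via torchrun).
# The first mesh dimension is used for data parallelism, the second for
# domain parallelism; domain_parallel_size here is an assumed value.
domain_parallel_size = 2
world_size = dist.get_world_size()
mesh = init_device_mesh(
    "cuda",
    (world_size // domain_parallel_size, domain_parallel_size),
    mesh_dim_names=("ddp", "domain"),
)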

Important Files Changed

Filename | Overview
examples/cfd/external_aerodynamics/transformer_models/src/train.py | Adds domain-parallelism setup, FSDP2 model wrapping, a 2D DeviceMesh, ShardTensor materialisation, and JSONL step logging; contains leftover debug print blocks with a barrier call in main()
physicsnemo/datapipes/cae/transolver_datapipe.py | Adds ShardTensor-aware subsampling paths for surface and volume data, a _promote_to_replicated helper, and silent fallback defaults for air_density/stream_velocity when sharded reads drop Zarr attributes
examples/cfd/external_aerodynamics/transformer_models/src/metrics.py | Updates all_reduce_dict to operate only over the data-parallel mesh dimension, avoiding double-counting from domain-parallel ranks; the logic is correct (see the sketch after this table)
examples/cfd/external_aerodynamics/transformer_models/train_transolver_sharded.sh | New SLURM launch script for sharded training; contains multiple hardcoded NVIDIA-internal paths (user name, project Lustre dirs, container image) that make the script non-functional for other users
examples/cfd/external_aerodynamics/transformer_models/src/conf/training/base.yaml | Learning rate reduced from 1e-3 to 3e-4
examples/cfd/external_aerodynamics/transformer_models/src/conf/transolver_surface.yaml | Adds a domain_parallel_size: 1 config key (default disabled)
examples/cfd/external_aerodynamics/transformer_models/src/conf/transolver_volume.yaml | Adds domain_parallel_size: 1 and disables torch.compile (set to false)
examples/cfd/external_aerodynamics/transformer_models/.gitignore | New .gitignore ignoring *.err, *.out, runs/, outputs/
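
The metric-reduction change in src/metrics.py can be sketched as follows, assuming the mesh from the previous sketch and a locally computed scalar; the real all_reduce_dict implementation may differ.

import torch
import torch.distributed as dist

# Ranks that differ only in their "domain" coordinate hold replicas of the
# same metric, so reduce only across the "ddp" mesh dimension.
# `local_loss` and `mesh` are assumed from surrounding context.
metric = torch.tensor([local_loss], device="cuda")
dist.all_reduce(metric, op=dist.ReduceOp.AVG, group=mesh["ddp"].get_group())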

Comments Outside Diff (2)

  1. physicsnemo/datapipes/cae/transolver_datapipe.py, lines 966-978

    P1 Silent fallback to 1.0 for physics parameters can corrupt training without warning

    When the sharded reader drops Zarr attributes, air_density and stream_velocity silently fall back to 1.0. A stream velocity of 1 m/s is roughly 50–70× smaller than a typical automotive aerodynamics case (~50–70 m/s), which will produce wrong Reynolds-number-scaled inputs and quietly degrade model quality. There is no warnings.warn or logger call to alert the user.

    At minimum, emit a warning when these fallback values are actually used (see the first sketch after this list). Alternatively, fail loudly until read_file_sharded is fixed, so users are not silently training on incorrect data.

  2. examples/cfd/external_aerodynamics/transformer_models/src/train.py, lines 307-318

    P2 broadcast_to returns a non-writable (read-only) view

    features.broadcast_to(...) returns a tensor that shares storage with the original and is marked non-writable. If any downstream code (inside the model or a fused kernel) attempts an in-place operation on features, it will raise a RuntimeError. expand has the same semantics but is the more idiomatic PyTorch call; adding .contiguous() after the broadcast produces a normal writable tensor and avoids potential surprises (see the second sketch after this list).
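
As a sketch of the warning suggested in item 1 (the helper name get_physics_attr and the attrs dict are hypothetical, not the datapipe's actual API):

import warnings

def get_physics_attr(attrs: dict, key: str, default: float = 1.0) -> float:
    """Read a physics attribute, warning loudly when the fallback is used."""
    if key not in attrs:
        warnings.warn(
            f"Zarr attribute '{key}' is missing (possibly dropped by the "
            f"sharded reader); falling back to {default}. Inputs may be "
            f"incorrectly scaled.",
            stacklevel=2,
        )
        return default
    return float(attrs[key])

# Example: an attrs dict that lost stream_velocity.
attrs = {"air_density": 1.205}
air_density = get_physics_attr(attrs, "air_density")          # 1.205, no warning
stream_velocity = get_physics_attr(attrs, "stream_velocity")  # warns, returns 1.0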
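
And a self-contained illustration of the read-only view behaviour from item 2 (plain PyTorch, not the PR's code):

import torch

features = torch.randn(1, 8)
view = features.broadcast_to(4, 8)  # shares storage, marked non-writable
try:
    view.add_(1.0)  # in-place write on the broadcast view
except RuntimeError as err:
    print(f"in-place op on a broadcast view raises: {err}")

# expand has the same semantics; .contiguous() materialises a writable copy.
writable = features.expand(4, 8).contiguous()
writable.add_(1.0)  # fine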

Reviews (1): Last reviewed commit: "Update configs"

Comment on lines +687 to +708
# Debug: show rank, local rank, device, and node info for each process
import socket

hostname = socket.gethostname()
device = dist_manager.device
rank = dist_manager.rank
local_rank = dist_manager.local_rank
world_size = dist_manager.world_size
gpu_name = torch.cuda.get_device_name(device) if torch.cuda.is_available() else "N/A"
print(
    f"[DEBUG] Rank {rank}/{world_size - 1} | "
    f"Local Rank {local_rank} | "
    f"Node: {hostname} | "
    f"Device: {device} | "
    f"GPU: {gpu_name}"
)
if torch.cuda.is_available():
    mem_total = torch.cuda.get_device_properties(device).total_memory / (1024**3)
    print(
        f"[DEBUG] Rank {rank} | GPU Memory: {mem_total:.1f} GB | "
        f"CUDA Version: {torch.version.cuda}"
    )
torch.distributed.barrier()
Contributor

P1 Debug block with barrier should be removed before merging

This block is labelled [DEBUG], imports socket inside main(), and ends with torch.distributed.barrier(). The barrier adds a hard synchronisation point on every run — straggling ranks will block all others there, and in a 64-GPU job it can add measurable wall-clock overhead. Debug print calls from all ranks will also produce interleaved noise in logs.

Please remove this block (or guard it behind a --debug flag / logger.debug) before landing in main.
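
One way to keep these diagnostics while addressing the issue, sketched under the assumption of a boolean debug flag in the Hydra config (the flag name cfg.debug is hypothetical):

if cfg.debug:  # hypothetical config flag, off by default
    import socket

    logger.debug(
        f"Rank {dist_manager.rank}/{dist_manager.world_size - 1} | "
        f"Local Rank {dist_manager.local_rank} | "
        f"Node: {socket.gethostname()} | "
        f"Device: {dist_manager.device}"
    )
    # Synchronise only when explicitly debugging, never on normal runs.
    torch.distributed.barrier()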

Comment on lines +13 to +24
export USER_LUSTRE=/lustre/fsw/portfolios/coreai/users/coreya
export GROUP_LUSTRE=/lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae
export HOME=${USER_LUSTRE}

# Container setup
CONTAINER_IMAGE=$GROUP_LUSTRE/containers/pytorch26.01-py3.sqsh
CONTAINER_MOUNTS="$USER_LUSTRE:/user_data/,$GROUP_LUSTRE:/group_data,$HOME:/root/,/lustre:/lustre,/tmp:/tmp"

# Virtual environment path
VENV_PATH="$USER_LUSTRE/venvs/shard_tensor_benchmarks/"

WORKDIR="$USER_LUSTRE/workdir/shard_tensor_benchmarks/examples/cfd/external_aerodynamics/transformer_models/"
Contributor

P2 Hardcoded NVIDIA-internal paths make this script non-functional for external users

USER_LUSTRE, GROUP_LUSTRE, CONTAINER_IMAGE, VENV_PATH, WORKDIR, ZARR_TRAIN_PATH, and ZARR_VAL_PATH are all hardcoded to a specific user's (coreya) NVIDIA-internal Lustre filesystem. Anyone else cloning the repo will need to change every one of these before the script can run.

Consider replacing these with placeholder values (e.g. YOUR_LUSTRE_HOME, YOUR_CONTAINER_IMAGE) and documenting each variable so the script serves as a usable template.
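
A possible template shape, with placeholders documented inline (every value below is illustrative, not a real path):

# Set these for your site before running; all values are placeholders.
export USER_LUSTRE=/path/to/your/user/storage
export GROUP_LUSTRE=/path/to/your/project/storage

CONTAINER_IMAGE=$GROUP_LUSTRE/containers/your_pytorch_image.sqsh
VENV_PATH="$USER_LUSTRE/venvs/your_venv/"
WORKDIR="$USER_LUSTRE/path/to/physicsnemo/examples/cfd/external_aerodynamics/transformer_models/"
ZARR_TRAIN_PATH="$GROUP_LUSTRE/path/to/train.zarr"
ZARR_VAL_PATH="$GROUP_LUSTRE/path/to/val.zarr"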

Comment on lines +483 to +495
step_records.append({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "phase": "train",
    "epoch": epoch,
    "iteration": i,
    "global_step": i + epoch_len * epoch,
    "loss": this_loss,
    "learning_rate": optimizer.param_groups[0]["lr"],
    "duration_s": duration,
    "throughput_per_gpu": images_per_second,
    "memory_reserved_gb": mem_usage,
    **{k: float(v) if hasattr(v, "item") else v for k, v in metrics.items()},
})
Contributor

P2 step_records.append in train_epoch is not guarded by rank 0 — inconsistent with val_epoch

In val_epoch the corresponding step_records.append block is wrapped in if dist_manager.rank == 0:, but here it runs on every rank. Because metrics_log_path is None on non-zero ranks, no file is written, but every non-zero rank silently builds an in-memory list that is never used. Consider adding the same if dist_manager.rank == 0: guard here for consistency (a sketch follows below).
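
A sketch of the suggested guard, mirroring val_epoch; the names follow the snippet above and the full record dict is elided:

from datetime import datetime, timezone

if dist_manager.rank == 0:
    step_records.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "phase": "train",
        "epoch": epoch,
        "iteration": i,
        # ... remaining fields as in the original block ...
    })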
