
Transolver cae shard tensor #1584

Open

coreyjadams wants to merge 2 commits into NVIDIA:main from coreyjadams:transolver_cae_shard_tensor

Conversation

@coreyjadams
Collaborator

PhysicsNeMo Pull Request

This PR adds ShardTensor support to the Transolver example end to end.

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an
indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but the comments are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR adds end-to-end ShardTensor / domain-parallelism support to the Transolver example, introducing a 2D (ddp, domain) DeviceMesh, FSDP2 model wrapping, ShardTensor-aware subsampling in the datapipe, corrected metric reduction scoped to the data-parallel dimension, and per-step JSONL logging.

  • P1 – debug barrier in main(): A [DEBUG] print block with torch.distributed.barrier() was left in main(). The barrier adds a hard synchronisation point on every run regardless of config.
  • P1 – silent fallback for physics inputs: air_density and stream_velocity default silently to 1.0 when sharded reads drop Zarr attributes, which can corrupt training without any warning to the user.
  • The train_transolver_sharded.sh script hardcodes NVIDIA-internal Lustre paths and should use placeholder variables before merging as a public example.
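
For context, here is a minimal sketch of the kind of 2D (ddp, domain) DeviceMesh the summary refers to, using the public PyTorch API; the sizes and names below are illustrative assumptions, not the PR's actual code.

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Assumes torch.distributed is already initialised (e.g. via torchrun).
# The first mesh dimension is used for data parallelism, the second for
# domain parallelism; domain_parallel_size here is an assumed value.
domain_parallel_size = 2
world_size = dist.get_world_size()
mesh = init_device_mesh(
    "cuda",
    (world_size // domain_parallel_size, domain_parallel_size),
    mesh_dim_names=("ddp", "domain"),
)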

Important Files Changed

Filename | Overview
examples/cfd/external_aerodynamics/transformer_models/src/train.py | Adds domain-parallelism setup, FSDP2 model wrapping, a 2D DeviceMesh, ShardTensor materialisation, and JSONL step logging; contains leftover debug print blocks with a barrier call in main()
physicsnemo/datapipes/cae/transolver_datapipe.py | Adds ShardTensor-aware subsampling paths for surface and volume data, a _promote_to_replicated helper, and silent fallback defaults for air_density/stream_velocity when sharded reads drop Zarr attributes
examples/cfd/external_aerodynamics/transformer_models/src/metrics.py | Updates all_reduce_dict to operate only over the data-parallel mesh dimension, avoiding double-counting from domain-parallel ranks; the logic is correct (see the sketch after this table)
examples/cfd/external_aerodynamics/transformer_models/train_transolver_sharded.sh | New SLURM launch script for sharded training; contains multiple hardcoded NVIDIA-internal paths (user name, project Lustre dirs, container image) that make the script non-functional for other users
examples/cfd/external_aerodynamics/transformer_models/src/conf/training/base.yaml | Learning rate reduced from 1e-3 to 3e-4
examples/cfd/external_aerodynamics/transformer_models/src/conf/transolver_surface.yaml | Adds a domain_parallel_size: 1 config key (default disabled)
examples/cfd/external_aerodynamics/transformer_models/src/conf/transolver_volume.yaml | Adds domain_parallel_size: 1 and disables torch.compile (set to false)
examples/cfd/external_aerodynamics/transformer_models/.gitignore | New .gitignore ignoring *.err, *.out, runs/, outputs/
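
The metric-reduction change in src/metrics.py can be sketched as follows, assuming the mesh from the previous sketch and a locally computed scalar; the real all_reduce_dict implementation may differ.

import torch
import torch.distributed as dist

# Ranks that differ only in their "domain" coordinate hold replicas of the
# same metric, so reduce only across the "ddp" mesh dimension.
# `local_loss` and `mesh` are assumed from surrounding context.
metric = torch.tensor([local_loss], device="cuda")
dist.all_reduce(metric, op=dist.ReduceOp.AVG, group=mesh["ddp"].get_group())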

Comments Outside Diff (2)

  1. physicsnemo/datapipes/cae/transolver_datapipe.py, lines 966-978

    P1 Silent fallback to 1.0 for physics parameters can corrupt training without warning

    When the sharded reader drops Zarr attributes, air_density and stream_velocity silently fall back to 1.0. A stream velocity of 1 m/s is roughly 50–70× smaller than a typical automotive aerodynamics case (~50–70 m/s), which will produce wrong Reynolds-number-scaled inputs and quietly degrade model quality. There is no warnings.warn or logger call to alert the user.

    At minimum, emit a warning when these fallback values are actually used (see the first sketch after this list). Alternatively, fail loudly until read_file_sharded is fixed, so users are not silently training on incorrect data.

  2. examples/cfd/external_aerodynamics/transformer_models/src/train.py, lines 307-318

    P2 broadcast_to returns a non-writable (read-only) view

    features.broadcast_to(...) returns a tensor that shares storage with the original and is marked non-writable. If any downstream code (inside the model or a fused kernel) attempts an in-place operation on features, it will raise a RuntimeError. expand has the same semantics but is the more idiomatic PyTorch call; adding .contiguous() after the broadcast produces a normal writable tensor and avoids potential surprises (see the second sketch after this list).
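
As a sketch of the warning suggested in item 1 (the helper name get_physics_attr and the attrs dict are hypothetical, not the datapipe's actual API):

import warnings

def get_physics_attr(attrs: dict, key: str, default: float = 1.0) -> float:
    """Read a physics attribute, warning loudly when the fallback is used."""
    if key not in attrs:
        warnings.warn(
            f"Zarr attribute '{key}' is missing (possibly dropped by the "
            f"sharded reader); falling back to {default}. Inputs may be "
            f"incorrectly scaled.",
            stacklevel=2,
        )
        return default
    return float(attrs[key])

# Example: an attrs dict that lost stream_velocity.
attrs = {"air_density": 1.205}
air_density = get_physics_attr(attrs, "air_density")          # 1.205, no warning
stream_velocity = get_physics_attr(attrs, "stream_velocity")  # warns, returns 1.0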
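
And a self-contained illustration of the read-only view behaviour from item 2 (plain PyTorch, not the PR's code):

import torch

features = torch.randn(1, 8)
view = features.broadcast_to(4, 8)  # shares storage, marked non-writable
try:
    view.add_(1.0)  # in-place write on the broadcast view
except RuntimeError as err:
    print(f"in-place op on a broadcast view raises: {err}")

# expand has the same semantics; .contiguous() materialises a writable copy.
writable = features.expand(4, 8).contiguous()
writable.add_(1.0)  # fine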

Reviews (1): Last reviewed commit: "Update configs"

Comment on lines +687 to +708
# Debug: show rank, local rank, device, and node info for each process
import socket

hostname = socket.gethostname()
device = dist_manager.device
rank = dist_manager.rank
local_rank = dist_manager.local_rank
world_size = dist_manager.world_size
gpu_name = torch.cuda.get_device_name(device) if torch.cuda.is_available() else "N/A"
print(
    f"[DEBUG] Rank {rank}/{world_size - 1} | "
    f"Local Rank {local_rank} | "
    f"Node: {hostname} | "
    f"Device: {device} | "
    f"GPU: {gpu_name}"
)
if torch.cuda.is_available():
    mem_total = torch.cuda.get_device_properties(device).total_memory / (1024**3)
    print(
        f"[DEBUG] Rank {rank} | GPU Memory: {mem_total:.1f} GB | "
        f"CUDA Version: {torch.version.cuda}"
    )
torch.distributed.barrier()
Contributor

P1 Debug block with barrier should be removed before merging

This block is labelled [DEBUG], imports socket inside main(), and ends with torch.distributed.barrier(). The barrier adds a hard synchronisation point on every run — straggling ranks will block all others there, and in a 64-GPU job it can add measurable wall-clock overhead. Debug print calls from all ranks will also produce interleaved noise in logs.

Please remove this block (or guard it behind a --debug flag / logger.debug) before landing in main.
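
One way to keep these diagnostics while addressing the issue, sketched under the assumption of a boolean debug flag in the Hydra config (the flag name cfg.debug is hypothetical):

if cfg.debug:  # hypothetical config flag, off by default
    import socket

    logger.debug(
        f"Rank {dist_manager.rank}/{dist_manager.world_size - 1} | "
        f"Local Rank {dist_manager.local_rank} | "
        f"Node: {socket.gethostname()} | "
        f"Device: {dist_manager.device}"
    )
    # Synchronise only when explicitly debugging, never on normal runs.
    torch.distributed.barrier()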

Comment on lines +13 to +24
export USER_LUSTRE=/lustre/fsw/portfolios/coreai/users/coreya
export GROUP_LUSTRE=/lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae
export HOME=${USER_LUSTRE}

# Container setup
CONTAINER_IMAGE=$GROUP_LUSTRE/containers/pytorch26.01-py3.sqsh
CONTAINER_MOUNTS="$USER_LUSTRE:/user_data/,$GROUP_LUSTRE:/group_data,$HOME:/root/,/lustre:/lustre,/tmp:/tmp"

# Virtual environment path
VENV_PATH="$USER_LUSTRE/venvs/shard_tensor_benchmarks/"

WORKDIR="$USER_LUSTRE/workdir/shard_tensor_benchmarks/examples/cfd/external_aerodynamics/transformer_models/"
Contributor

P2 Hardcoded NVIDIA-internal paths make this script non-functional for external users

USER_LUSTRE, GROUP_LUSTRE, CONTAINER_IMAGE, VENV_PATH, WORKDIR, ZARR_TRAIN_PATH, and ZARR_VAL_PATH are all hardcoded to a specific user's (coreya) NVIDIA-internal Lustre filesystem. Anyone else cloning the repo will need to change every one of these before the script can run.

Consider replacing these with placeholder values (e.g. YOUR_LUSTRE_HOME, YOUR_CONTAINER_IMAGE) and documenting each variable so the script serves as a usable template.
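
A possible template shape, with placeholders documented inline (every value below is illustrative, not a real path):

# Set these for your site before running; all values are placeholders.
export USER_LUSTRE=/path/to/your/user/storage
export GROUP_LUSTRE=/path/to/your/project/storage

CONTAINER_IMAGE=$GROUP_LUSTRE/containers/your_pytorch_image.sqsh
VENV_PATH="$USER_LUSTRE/venvs/your_venv/"
WORKDIR="$USER_LUSTRE/path/to/physicsnemo/examples/cfd/external_aerodynamics/transformer_models/"
ZARR_TRAIN_PATH="$GROUP_LUSTRE/path/to/train.zarr"
ZARR_VAL_PATH="$GROUP_LUSTRE/path/to/val.zarr"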

Comment on lines +483 to +495
step_records.append({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "phase": "train",
    "epoch": epoch,
    "iteration": i,
    "global_step": i + epoch_len * epoch,
    "loss": this_loss,
    "learning_rate": optimizer.param_groups[0]["lr"],
    "duration_s": duration,
    "throughput_per_gpu": images_per_second,
    "memory_reserved_gb": mem_usage,
    **{k: float(v) if hasattr(v, "item") else v for k, v in metrics.items()},
})
Contributor

P2 step_records.append in train_epoch is not guarded by rank 0 — inconsistent with val_epoch

In val_epoch the corresponding step_records.append block is wrapped in if dist_manager.rank == 0:, but here it runs on every rank. Because metrics_log_path is None on non-zero ranks, no file is written, but every non-zero rank silently builds an in-memory list that is never used. Consider adding the same if dist_manager.rank == 0: guard here for consistency (a sketch follows below).
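
A sketch of the suggested guard, mirroring val_epoch; the names follow the snippet above and the full record dict is elided:

from datetime import datetime, timezone

if dist_manager.rank == 0:
    step_records.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "phase": "train",
        "epoch": epoch,
        "iteration": i,
        # ... remaining fields as in the original block ...
    })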
