contrib/aws: Move g4dn.8xlarge EFA tests to g4dn.12xlarge, SHM tests to g5g.8xlarge #12164
Open
yinliaws wants to merge 1 commit into ofiwg:main
Conversation
Force-pushed from 75687a9 to 8b67a53
shijin-aws previously approved these changes on Apr 21, 2026
sunkuamzn previously approved these changes on Apr 21, 2026
Contributor (Author):
bot:aws:retest

6 similar comments
Force-pushed from 55c6df5 to 30e1af2
Contributor (Author):
bot:aws:retest
shijin-aws reviewed on May 7, 2026
shijin-aws approved these changes on May 7, 2026
sunkuamzn approved these changes on May 8, 2026
…to g5g.8xlarge

g4dn.8xlarge has only 1 T4 GPU (16 GB), which causes cudaMalloc OOM during CUDA fabtests. Move the rhel8-efa stage to g4dn.12xlarge (4 GPUs, 64 GB) to resolve the OOM. Move the SHM stages from g4dn.8xlarge to g5g.8xlarge (single GPU, Graviton): g5g.8xlarge has 1 GPU, so cudaIPC works for single-node SHM tests, and it adds Graviton + CUDA coverage to the PR CI.

Signed-off-by: Yin Li <yinliq@amazon.com>
Force-pushed from 30e1af2 to 2302d5c
Problem
g4dn.8xlarge has only 1 T4 GPU (16 GB), which causes cudaMalloc out-of-memory (OOM) failures during CUDA fabtests when both the server and the client run on the same node.
Solution
Move the rhel8-efa stage to g4dn.12xlarge (4 GPUs, 64 GB) to resolve the OOM. Move the SHM stages to g5g.8xlarge: it has a single GPU, so cudaIPC works for single-node SHM tests, and it adds Graviton + CUDA coverage to the PR CI.
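As a rough illustration of the instance-type change, the selection described above amounts to a per-stage mapping. This is a hedged sketch only: the stage names (`rhel8-efa`, `shm-*`), the helper function, and the default instance type are assumptions for illustration, not the actual contrib/aws script contents.

```shell
#!/bin/sh
# Hypothetical helper: pick an EC2 instance type for a CI stage.
# The stage names and the default are illustrative assumptions,
# not the real contrib/aws configuration.
instance_type_for_stage() {
    case "$1" in
        rhel8-efa)
            # 4 T4 GPUs / 64 GB GPU memory: avoids cudaMalloc OOM when
            # CUDA fabtests run server and client on the same node.
            echo "g4dn.12xlarge" ;;
        shm-*)
            # Single GPU on a Graviton host: cudaIPC works for
            # single-node SHM tests, and adds arm64 + CUDA coverage.
            echo "g5g.8xlarge" ;;
        *)
            # Hypothetical CPU-only default for other stages.
            echo "c5n.18xlarge" ;;
    esac
}

instance_type_for_stage rhel8-efa    # prints g4dn.12xlarge
instance_type_for_stage shm-ubuntu   # prints g5g.8xlarge
```

The point of the mapping is the one stated in the PR: the EFA stage needs more GPU memory than a single 16 GB T4 provides, while the SHM stages only need one GPU (so cudaIPC remains usable) and benefit from running on Graviton.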