contrib/aws: Move g4dn.8xlarge EFA tests to g4dn.12xlarge, SHM tests to g5g.8xlarge #12164

Open
yinliaws wants to merge 1 commit into ofiwg:main from yinliaws:move-g4dn-8x-to-g5g

@yinliaws
Contributor

Problem

g4dn.8xlarge has only 1 T4 GPU (16GB), which causes cudaMalloc out-of-memory (OOM) failures during CUDA fabtests when the server and client both run on the same node.

Solution

  • Move rhel8-efa stage from g4dn.8xlarge to g4dn.12xlarge (4 GPUs, 64GB)
  • Move 3 SHM stages from g4dn.8xlarge to g5g.8xlarge (single GPU, Graviton)

g5g.8xlarge has 1 GPU, so cudaIPC works for single-node SHM tests, and the move adds Graviton + CUDA coverage to the PR CI.
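
For illustration only, the remapping described above can be sketched in Python. This is not the actual `contrib/aws/Jenkinsfile` change. Only the `rhel8-efa` stage name and the instance types come from this description. The three SHM stage names and the `move_stages` helper are hypothetical placeholders.

```python
OLD_INSTANCE = "g4dn.8xlarge"

# Stage -> instance type before the change.
# The SHM stage names are placeholders, not the real Jenkins stage names.
before = {
    "rhel8-efa": OLD_INSTANCE,
    "shm-stage-1": OLD_INSTANCE,
    "shm-stage-2": OLD_INSTANCE,
    "shm-stage-3": OLD_INSTANCE,
}


def move_stages(stages):
    """Return a new mapping with every g4dn.8xlarge stage moved elsewhere.

    EFA stages go to g4dn.12xlarge (4 GPUs); all other stages (SHM here)
    go to g5g.8xlarge (single GPU, Graviton).
    """
    after = {}
    for name, instance in stages.items():
        if instance != OLD_INSTANCE:
            after[name] = instance  # stage is not affected by this change
        elif name.endswith("-efa"):
            after[name] = "g4dn.12xlarge"
        else:
            after[name] = "g5g.8xlarge"
    return after


after = move_stages(before)
for stage, instance in after.items():
    print(f"{stage}: {before[stage]} -> {instance}")
```

After the remapping, no stage in this sketch is left on g4dn.8xlarge.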

yinliaws force-pushed the move-g4dn-8x-to-g5g branch from 75687a9 to 8b67a53 on April 21, 2026 20:03
yinliaws requested review from a-szegel and shijin-aws on April 21, 2026 20:10
shijin-aws previously approved these changes on Apr 21, 2026
sunkuamzn previously approved these changes on Apr 21, 2026
@yinliaws
Contributor Author

bot:aws:retest

6 similar comments
@yinliaws
Contributor Author

bot:aws:retest

@yinliaws
Contributor Author

bot:aws:retest

@yinliaws
Contributor Author

bot:aws:retest

@yinliaws
Contributor Author

bot:aws:retest

@a-szegel
Contributor

a-szegel commented May 4, 2026

bot:aws:retest

@yinliaws
Contributor Author

yinliaws commented May 4, 2026

bot:aws:retest

yinliaws dismissed stale reviews from sunkuamzn and shijin-aws via 4a00386 on May 4, 2026 18:24
yinliaws force-pushed the move-g4dn-8x-to-g5g branch 4 times, most recently from 55c6df5 to 30e1af2, on May 7, 2026 05:50
@yinliaws
Contributor Author

yinliaws commented May 7, 2026

bot:aws:retest

yinliaws requested review from shijin-aws and sunkuamzn on May 7, 2026 07:16
Comment thread contrib/aws/Jenkinsfile
yinliaws requested a review from shijin-aws on May 7, 2026 22:00
…to g5g.8xlarge

g4dn.8xlarge has only 1 T4 GPU (16GB) which causes cudaMalloc OOM
during CUDA fabtests. Move the rhel8-efa stage to g4dn.12xlarge
(4 GPUs, 64GB) to resolve the OOM.

Move SHM stages from g4dn.8xlarge to g5g.8xlarge (single GPU,
Graviton). g5g.8xlarge has 1 GPU so cudaIPC works for single-node
SHM tests, and adds Graviton + CUDA coverage to the PR CI.

Signed-off-by: Yin Li <yinliq@amazon.com>
yinliaws force-pushed the move-g4dn-8x-to-g5g branch from 30e1af2 to 2302d5c on May 8, 2026 17:30

4 participants