contrib/aws: Move g4dn.8xlarge EFA tests to g4dn.12xlarge, SHM tests to g5g.8xlarge by yinliaws · Pull Request #12164 · ofiwg/libfabric

yinliaws · 2026-04-21T17:08:17Z

Problem

g4dn.8xlarge has only one NVIDIA T4 GPU (16 GB). When the CUDA fabtests run the server and client on the same node, cudaMalloc fails with out-of-memory (OOM) errors.

Solution

Move the rhel8-efa stage from g4dn.8xlarge to g4dn.12xlarge (4 GPUs, 64 GB total GPU memory)
Move the 3 SHM stages from g4dn.8xlarge to g5g.8xlarge (1 GPU, Graviton CPU)

g5g.8xlarge has one GPU, so cudaIPC works for the single-node SHM tests. The move also adds Graviton + CUDA coverage to PR CI.

yinliaws · 2026-04-23T22:08:45Z

bot:aws:retest

yinliaws · 2026-04-24T17:37:47Z

bot:aws:retest

yinliaws · 2026-04-24T17:55:53Z

bot:aws:retest

yinliaws · 2026-04-26T22:10:03Z

bot:aws:retest

yinliaws · 2026-04-27T05:30:28Z

bot:aws:retest

a-szegel · 2026-05-04T16:08:39Z

bot:aws:retest

yinliaws · 2026-05-04T17:59:49Z

bot:aws:retest

yinliaws · 2026-05-07T05:59:43Z

bot:aws:retest

…to g5g.8xlarge

g4dn.8xlarge has only 1 T4 GPU (16GB) which causes cudaMalloc OOM during CUDA fabtests. Move the rhel8-efa stage to g4dn.12xlarge (4 GPUs, 64GB) to resolve the OOM.

Move SHM stages from g4dn.8xlarge to g5g.8xlarge (single GPU, Graviton). g5g.8xlarge has 1 GPU so cudaIPC works for single-node SHM tests, and adds Graviton + CUDA coverage to the PR CI.

Signed-off-by: Yin Li <yinliq@amazon.com>

yinliaws force-pushed the move-g4dn-8x-to-g5g branch from 75687a9 to 8b67a53 April 21, 2026 20:03

yinliaws requested review from a-szegel and shijin-aws April 21, 2026 20:10

shijin-aws previously approved these changes Apr 21, 2026

sunkuamzn previously approved these changes Apr 21, 2026

yinliaws dismissed stale reviews from sunkuamzn and shijin-aws via 4a00386 May 4, 2026 18:24

yinliaws force-pushed the move-g4dn-8x-to-g5g branch 4 times, most recently from 55c6df5 to 30e1af2 May 7, 2026 05:50

yinliaws requested review from shijin-aws and sunkuamzn May 7, 2026 07:16

shijin-aws reviewed May 7, 2026

Comment thread: contrib/aws/Jenkinsfile

yinliaws requested a review from shijin-aws May 7, 2026 22:00

shijin-aws approved these changes May 7, 2026

sunkuamzn approved these changes May 8, 2026

yinliaws force-pushed the move-g4dn-8x-to-g5g branch from 30e1af2 to 2302d5c May 8, 2026 17:30

yinliaws wants to merge 1 commit into ofiwg:main from yinliaws:move-g4dn-8x-to-g5g

4 participants