
[FLINK-39482][filesystem] Support configurable maxConnections in S3ClientProvider #27970

Open
Samrat002 wants to merge 3 commits into apache:master from Samrat002:FLINK-39482

Conversation

@Samrat002
Contributor

@Samrat002 Samrat002 commented Apr 19, 2026

What is the purpose of the change

This pull request prevents S3 connection pool exhaustion during RocksDB state restore when using the Native S3 filesystem. When NativeS3BulkCopyHelper fires concurrent downloads via S3TransferManager, each multipart download can consume multiple HTTP connections. With the default pool size of 50 and a batch concurrency of 16, the pool can be exhausted, causing downloads to hang until the SDK's acquire timeout expires. This results in opaque SdkClientException failures during checkpoint restore.

The fix introduces a configurable s3.connection.max option, clamps s3.bulk-copy.max-concurrent to the connection pool size, and raises the connection acquisition timeout from the SDK default to the user-configured connection timeout.
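The clamping part of the fix can be sketched in a few lines. This is a minimal illustration only; the class and method names here are hypothetical, not the actual code in the patch:

```java
// Minimal sketch of the clamping logic described above. Names are
// illustrative; the real change lives in the S3 filesystem factory.
public class ClampSketch {

    /** Clamp bulk-copy concurrency so it can never exceed the HTTP connection pool size. */
    static int clampBulkCopyConcurrency(int requestedConcurrency, int maxConnections) {
        if (requestedConcurrency <= 0) {
            throw new IllegalArgumentException("s3.bulk-copy.max-concurrent must be > 0");
        }
        if (maxConnections <= 0) {
            throw new IllegalArgumentException("s3.connection.max must be > 0");
        }
        return Math.min(requestedConcurrency, maxConnections);
    }

    public static void main(String[] args) {
        // 32 requested against a pool of 10 is clamped down to 10.
        System.out.println(clampBulkCopyConcurrency(32, 10)); // prints 10
        // Within the pool size, the requested value is preserved.
        System.out.println(clampBulkCopyConcurrency(8, 50));  // prints 8
    }
}
```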

Brief change log

Added a configurable maxConnections option.
Added unit tests for connection pool exhaustion.

Verifying this change

  • Added testConnectionPoolExhaustedDetection in NativeS3BulkCopyHelperTest to verify detection of the SDK's connection acquire timeout message in both direct and nested causal chains, as well as false-positive resistance
  • Added testEmptyRequestListIsNoOp in NativeS3BulkCopyHelperTest to verify no NPE when no files are requested
  • Added testInvalidMaxConnectionsThrowsException and testInvalidBulkCopyMaxConcurrentThrowsException in NativeS3FileSystemFactoryTest to verify config validation through the
    factory
  • Added testBulkCopyMaxConcurrentClampedToMaxConnections and testBulkCopyMaxConcurrentPreservedWithinMaxConnections in NativeS3FileSystemFactoryTest to verify the clamping logic end-to-end through factory → filesystem → bulk copy helper
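The causal-chain detection these tests exercise can be sketched roughly as below. Both the helper name and the matched message fragment are assumptions for illustration (the fragment resembles Netty's pool-acquire timeout wording), not lifted from the actual patch:

```java
// Rough sketch of detecting a connection-acquire timeout anywhere in an
// exception's cause chain. Method name and matched message text are
// assumptions, not the actual NativeS3BulkCopyHelper code.
public class PoolExhaustionSketch {

    static boolean isConnectionPoolExhausted(Throwable error) {
        // Walk the cause chain with a depth bound to guard against cycles.
        Throwable current = error;
        for (int depth = 0; current != null && depth < 10; depth++) {
            String message = current.getMessage();
            if (message != null && message.contains("Acquire operation took longer")) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable nested = new RuntimeException("request failed",
                new java.util.concurrent.TimeoutException(
                        "Acquire operation took longer than the configured maximum time"));
        System.out.println(isConnectionPoolExhausted(nested));                    // prints true
        System.out.println(isConnectionPoolExhausted(new RuntimeException("x"))); // prints false
    }
}
```

Checking the whole chain matters because the SDK often wraps the timeout several layers deep inside a SdkClientException.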

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: yes (affects state restore from S3 checkpoints)
  • The S3 file system connector: yes

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot
Collaborator

flinkbot commented Apr 19, 2026

CI report:

Bot commands: The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@Samrat002
Contributor Author

@gaborgsomogyi PTAL

Comment on lines +399 to +400
config.set(NativeS3FileSystemFactory.BULK_COPY_MAX_CONCURRENT, 32);
config.set(NativeS3FileSystemFactory.MAX_CONNECTIONS, 10);
Contributor


Just for my own understanding. BULK_COPY_MAX_CONCURRENT drives bulk-operation concurrency, but then what exactly does MAX_CONNECTIONS drive (I read the config explanation, but that's a bit cloudy)?

Contributor Author


They operate at different layers:

  1. s3.connection.max - the HTTP connection pool size in the underlying HTTP clients (Apache for sync ops, Netty for async ops). This is a shared pool: every S3 API call (GetObject, PutObject, HeadObject, ListObjectsV2, etc.) borrows a connection from it and returns it when done. This maps directly to https://github.com/apache/flink/blob/FLINK-39482/flink-filesystems/flink-s3-fs-native/src/main/java/org/apache/flink/fs/s3native/S3ClientProvider.java#L391 on the Netty async client and maxConnections on the Apache sync client.

  2. s3.bulk-copy.max-concurrent - how many S3 download operations NativeS3BulkCopyHelper fires in parallel during state restore (the batch size in copyFiles).

The root cause of FLINK-39482: these two layers interact because S3TransferManager uses multipart downloads for files larger than 8 MB, and each part consumes a separate HTTP connection from the shared pool. So maxConcurrentCopies=16 files × ~4 parts each ≈ 64 HTTP connections needed, but the pool only has 50 → acquire timeout → opaque SdkClientException.
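The arithmetic in that root-cause description can be checked in a few lines (the numbers are the illustrative ones from the comment above; ~4 parts per file assumes roughly 32 MB files split into 8 MB multipart ranges):

```java
// Back-of-the-envelope check of the pool-exhaustion arithmetic above.
public class PoolMathSketch {
    public static void main(String[] args) {
        int maxConcurrentCopies = 16; // s3.bulk-copy.max-concurrent
        int partsPerFile = 4;         // e.g. a ~32 MB file split into 8 MB multipart ranges
        int defaultPoolSize = 50;     // SDK default connection pool size

        int connectionsNeeded = maxConcurrentCopies * partsPerFile;
        System.out.println(connectionsNeeded);                   // prints 64
        System.out.println(connectionsNeeded > defaultPoolSize); // prints true -> acquire timeout
    }
}
```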

Does this configuration make sense, or are there any suggestions to simplify it?

@gaborgsomogyi
Contributor

After the nit fix + other comments resolution it's good to go from my perspective.

@Samrat002 Samrat002 requested a review from gaborgsomogyi April 20, 2026 18:16
@gaborgsomogyi
Contributor

Intending to merge this unless comments arrive

