[FLINK-36753][runtime] Adaptive Scheduler actively triggers a Checkpoint after all resources are ready #27921
Samrat002 wants to merge 3 commits into apache:master
Conversation
@1996fanrui PTAL whenever you have time.
pnowojski
left a comment
Thanks for the contribution. I've left a couple of comments; however, I don't have the context to review whether this is properly integrated with AdaptiveScheduler and DefaultStateTransitionManager. It would be great for someone else to take a look as well.
public static final ConfigOption<Boolean> SCHEDULER_RESCALE_TRIGGER_ACTIVE_CHECKPOINT_ENABLED =
        key("jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled")
                .booleanType()
                .defaultValue(false)
Is there a downside to using this option? If we expect this to be a generally positive change, and you disable it by default only as a precaution/for backward compatibility, I would actually be fine with setting it to true by default.
Earlier, I chose a defensive approach, but there is no compelling reason to keep it false.
Updated the default value to true.
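For reference, a hypothetical flink-conf.yaml fragment for this option (the key is taken from the diff above; with the default now flipped to true, setting it explicitly is only needed to opt out):

```yaml
# Opt out of active checkpoint triggering on rescale
# (after this PR the option defaults to true)
jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled: false
```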
waitForRunningTasks(restClusterClient, jobId, AFTER_RESCALE_PARALLELISM);
final int expectedFreeSlotCount = NUMBER_OF_SLOTS - AFTER_RESCALE_PARALLELISM;
LOG.info(
        "Waiting for {} slot(s) to become available after scale down.",
        expectedFreeSlotCount);
waitForAvailableSlots(restClusterClient, expectedFreeSlotCount);
If I understand your test correctly, it would still pass after 1h, once the regular periodic checkpoint is triggered, even with your new option disabled, right?
I think you should make sure that the timeouts in waitForRunningTasks/waitForAvailableSlots (or the CI timeout of 4h) are shorter than env.enableCheckpointing(Duration.ofHours(1).toMillis());. So either decrease the waiting timeout to < 30 minutes, or increase the checkpointing interval to 24h (CI will be killed after 4h, AFAIR).
Updated the checkpointing interval to 24 hours, and also added assertions for min-pause.
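The timing constraint discussed above can be made explicit with plain Duration arithmetic. The values come from this thread; the class and method names are illustrative, not from the PR:

```java
import java.time.Duration;

public class RescaleTestTimingSketch {
    // Periodic checkpointing interval chosen so it can never fire during the test.
    static final Duration CHECKPOINT_INTERVAL = Duration.ofHours(24);
    // Upper bound on how long waitForRunningTasks/waitForAvailableSlots may block.
    static final Duration WAIT_TIMEOUT = Duration.ofMinutes(30);
    // CI jobs are killed after roughly 4 hours.
    static final Duration CI_TIMEOUT = Duration.ofHours(4);

    // The test only proves the active trigger worked if the job cannot be
    // "rescued" by a periodic checkpoint before the waits (or CI) time out.
    static boolean timingConstraintsHold() {
        return WAIT_TIMEOUT.compareTo(CHECKPOINT_INTERVAL) < 0
                && CI_TIMEOUT.compareTo(CHECKPOINT_INTERVAL) < 0;
    }

    public static void main(String[] args) {
        System.out.println(timingConstraintsHold()); // prints true
    }
}
```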
+ "rather than waiting for the next periodic checkpoint. "
+ "This reduces rescaling latency, especially when checkpoint intervals are large. "
+ "The active trigger respects %s and will not trigger if a checkpoint is already in progress.",
text("execution.checkpointing.min-pause"))
This should probably be:
code(CheckpointingOptions.MIN_PAUSE_BETWEEN_CHECKPOINTS.key()))
resourceStabilizationTimeout,
firstChangeEventTimestamp,
maxTriggerDelay));
transitionContext.requestActiveCheckpointTrigger();
Why is this call needed here? Isn't it enough to call it in org.apache.flink.runtime.scheduler.adaptive.DefaultStateTransitionManager.Stabilizing#onChange?
private void progressToStabilized(Temporal firstChangeEventTimestamp) {
    progressToPhase(new Stabilized(clock, this, firstChangeEventTimestamp, maxTriggerDelay));
    transitionContext.requestActiveCheckpointTrigger();
Now it looks like the method is called in many places. I wonder if we could control where it is called only from within the Phases, e.g. by moving this into the Stabilized phase?
"When enabled, the Adaptive Scheduler actively triggers a checkpoint when resources change and rescaling is desired, "
        + "rather than waiting for the next periodic checkpoint. "
        + "This reduces rescaling latency, especially when checkpoint intervals are large. "
        + "The active trigger respects %s and will not trigger if a checkpoint is already in progress.",
Does it really respect the min-pause? From what I see, it only ensures that a new checkpoint is not triggered while another is in progress.
Ah, the check for min-pause was missing. PTAL.
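A minimal, self-contained sketch of the kind of min-pause guard being discussed (the class, method, and field names are illustrative; in Flink the real check lives in the CheckpointCoordinator):

```java
/** Illustrative min-pause guard; names are hypothetical, not the PR's actual code. */
public class MinPauseGuard {
    private final long minPauseBetweenCheckpointsMillis;
    private volatile long lastCheckpointCompletionNanos;

    public MinPauseGuard(long minPauseBetweenCheckpointsMillis) {
        this.minPauseBetweenCheckpointsMillis = minPauseBetweenCheckpointsMillis;
    }

    /** Records the completion time of the latest checkpoint. */
    public void onCheckpointCompleted(long completionNanos) {
        this.lastCheckpointCompletionNanos = completionNanos;
    }

    /** True once at least min-pause has elapsed since the last completed checkpoint. */
    public boolean isMinPauseSatisfied(long nowNanos) {
        long elapsedMillis = (nowNanos - lastCheckpointCompletionNanos) / 1_000_000L;
        return elapsedMillis >= minPauseBetweenCheckpointsMillis;
    }

    public static void main(String[] args) {
        MinPauseGuard guard = new MinPauseGuard(1000); // min-pause of 1s
        guard.onCheckpointCompleted(0);
        System.out.println(guard.isMinPauseSatisfied(500_000_000L));   // 500 ms elapsed: false
        System.out.println(guard.isMinPauseSatisfied(1_500_000_000L)); // 1500 ms elapsed: true
    }
}
```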
/**
 * Requests the context to actively trigger a checkpoint to expedite rescaling. Called when
 * the {@link DefaultStateTransitionManager} enters a phase that is ready to accept {@link
Is that true? I see that the method is called in more places: entering Stabilizing, entering Stabilized, and on each onChange event while in Stabilizing.
I have fixed the problem now. PTAL at the revised version.
 * <li>No checkpoint must be currently in progress or being triggered
 * </ul>
 */
private void triggerCheckpointForRescale() {
We need to cover all possible paths by tests.
@flinkbot run azure
@ztison @pnowojski PTAL. I have addressed the review comments: added unit tests, made the IT case more robust, and ensured min-pause is respected.
Thanks for incorporating our improvements. I was on vacation for the last few days, so I haven't responded. I am back and will check the PR today or tomorrow.
ztison
left a comment
I see some issues with the retry logic.
 * satisfy the configured {@code minPauseBetweenCheckpoints}. This can be used by callers that
 * trigger non-periodic checkpoints but still wish to respect the min-pause constraint.
 */
public boolean isMinPauseBetweenCheckpointsSatisfied() {
    return;
}

if (!checkpointCoordinator.isMinPauseBetweenCheckpointsSatisfied()) {
If you return the remaining time to the next checkpoint instead of a boolean, then you can use it directly as the delay in the following call, context.runIfState(this, this::requestActiveCheckpointTrigger, remainingTimeToSatisfyMinPause), and you can get rid of the hardcoded ACTIVE_CHECKPOINT_RETRY_DELAY.
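The suggestion can be sketched as a helper that returns the remaining pause rather than a boolean (a hypothetical standalone method, not the PR's code; in the PR this would live on the CheckpointCoordinator):

```java
import java.time.Duration;

/** Sketch of the reviewer's suggestion: return the remaining pause instead of a boolean. */
public class RemainingPauseSketch {
    /**
     * Returns Duration.ZERO when the min-pause is already satisfied, otherwise the time
     * left until it is, usable directly as the delay for context.runIfState(...).
     */
    static Duration remainingTimeToSatisfyMinPause(
            long lastCompletionMillis, long nowMillis, long minPauseMillis) {
        long remaining = minPauseMillis - (nowMillis - lastCompletionMillis);
        return remaining <= 0 ? Duration.ZERO : Duration.ofMillis(remaining);
    }

    public static void main(String[] args) {
        // 3s of a 10s min-pause elapsed -> 7s remaining.
        System.out.println(remainingTimeToSatisfyMinPause(0, 3_000, 10_000)); // PT7S
        // min-pause already satisfied -> zero delay, can trigger immediately.
        System.out.println(remainingTimeToSatisfyMinPause(0, 12_000, 10_000)); // PT0S
    }
}
```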
.warn(
        "Active checkpoint trigger for rescale failed, scheduling retry.",
        throwable);
context.runIfState(
Do we really want to introduce this endless cycle? I feel we should just log it and give up. If the checkpoint is failing, there is probably a different issue with the job, and we shouldn't retry again and again without any retry cap.
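A bounded-retry variant along the lines the reviewer suggests (illustrative; the cap, class, and method names are not from the PR):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of capping retries instead of retrying forever. */
public class BoundedRetrySketch {
    static final int MAX_TRIGGER_ATTEMPTS = 3; // hypothetical cap

    private final AtomicInteger attempts = new AtomicInteger();

    /** Returns true if another retry may be scheduled, false once the cap is hit. */
    boolean onTriggerFailure(Throwable t) {
        int n = attempts.incrementAndGet();
        if (n >= MAX_TRIGGER_ATTEMPTS) {
            // Give up: a repeatedly failing checkpoint likely signals a deeper job issue.
            System.out.println("giving up after " + n + " failed attempts: " + t.getMessage());
            return false;
        }
        System.out.println("attempt " + n + " failed, scheduling retry");
        return true;
    }

    public static void main(String[] args) {
        BoundedRetrySketch retry = new BoundedRetrySketch();
        Throwable failure = new RuntimeException("checkpoint declined");
        System.out.println(retry.onTriggerFailure(failure)); // true
        System.out.println(retry.onTriggerFailure(failure)); // true
        System.out.println(retry.onTriggerFailure(failure)); // false: cap reached
    }
}
```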
getLogger()
        .debug(
                "Skipping active checkpoint trigger for rescale: min pause between checkpoints not satisfied, scheduling retry.");
context.runIfState(
If we get e.g. 10 onChange events, then we will schedule this method 10 times. We should have some kind of deduplication.
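The deduplication being asked for can be sketched with a single in-flight flag (hypothetical names; in the PR the actual scheduling goes through context.runIfState):

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: collapse N onChange events into at most one scheduled trigger request. */
public class TriggerDeduplicationSketch {
    private final AtomicBoolean triggerScheduled = new AtomicBoolean(false);
    int scheduledCount = 0; // stand-in for counting real context.runIfState(...) calls

    /** Called on every onChange event; schedules at most one pending trigger. */
    void requestActiveCheckpointTrigger() {
        // compareAndSet ensures only the first caller actually schedules.
        if (triggerScheduled.compareAndSet(false, true)) {
            scheduledCount++;
        }
    }

    /** Called when the scheduled trigger actually runs; allows a future reschedule. */
    void onTriggerExecuted() {
        triggerScheduled.set(false);
    }

    public static void main(String[] args) {
        TriggerDeduplicationSketch s = new TriggerDeduplicationSketch();
        for (int i = 0; i < 10; i++) {
            s.requestActiveCheckpointTrigger(); // ten onChange events...
        }
        System.out.println(s.scheduledCount); // ...at most one scheduled trigger: prints 1
    }
}
```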
What is the purpose of the change
FLIP-461 introduced checkpoint-synchronized rescaling where the Adaptive Scheduler waits for a checkpoint to complete before rescaling. However, it passively waits for the next periodic checkpoint, which can delay rescaling significantly when checkpoint intervals are large (e.g., 10 minutes).
This PR makes the Adaptive Scheduler actively trigger a checkpoint when resources change and rescaling is desired. The trigger fires at the right time, i.e., when the DefaultStateTransitionManager enters the Stabilizing or Stabilized phase (that is, when the resource gate is open and the scheduler is waiting for the checkpoint gate). The feature is controlled by a new configuration option jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled (default: true). The feature respects execution.checkpointing.min-pause, skips if a checkpoint is already in progress, and only fires when parallelism has actually changed.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
RescaleOnCheckpointITCase#testRescaleWithActiveCheckpointTrigger, which starts a job with a checkpointing interval of 24 hours, maxTriggerDelay set to infinity, and no manual triggerCheckpoint() call.
Does this pull request potentially affect one of the following parts:
Documentation