[FLINK-39481][tests] Fix flaky WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets by featzhang · Pull Request #27954 · apache/flink

featzhang · 2026-04-17T11:29:01Z

What is the purpose of the change

Fixes a race condition in FailingCollectionSource that causes WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets (and related CUBE/ROLLUP variants) to fail intermittently.

Brief change log

Change the artificial failure trigger condition in FailingCollectionSource.run() from lastCheckpointedEmittedNum >= 1 to lastCheckpointedEmittedNum >= failureAfterNumElements.

Root Cause

The FailingCollectionSource is used in window aggregate IT cases to test the checkpoint + restore path. It artificially fails after emitting the first failureAfterNumElements elements (= numElements / 2), then restarts from the checkpoint and emits the remaining elements.

The previous trigger condition, lastCheckpointedEmittedNum >= 1, allowed the failure to occur once as few as one element had been checkpointed. With windowDataWithTimestamp (11 elements, failureAfterNumElements = 5), the checkpoint could be taken at any of positions 1–4, so the source restarted from that early position. After the restart, the remaining numElements - checkpointPosition elements are re-emitted.
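The numbers in the paragraph above can be checked with a small stand-alone sketch. RestartArithmeticSketch and its methods are hypothetical names used only for illustration and are not part of FailingCollectionSource. The sketch assumes only what the description states: the threshold is numElements / 2, and a restart re-emits numElements - checkpointPosition elements.

```java
/** Hypothetical helper, not part of Flink: models only the restart arithmetic. */
final class RestartArithmeticSketch {

    /** Per the description, the failure threshold is half of the input size. */
    static int failureAfterNumElements(int numElements) {
        return numElements / 2; // integer division, so 11 / 2 == 5
    }

    /** Elements emitted again after restoring from the given checkpointed position. */
    static int reEmittedAfterRestart(int numElements, int checkpointPosition) {
        return numElements - checkpointPosition;
    }

    public static void main(String[] args) {
        int numElements = 11; // size of windowDataWithTimestamp per the description
        int threshold = failureAfterNumElements(numElements);
        System.out.println("failureAfterNumElements = " + threshold);
        // Positions below the threshold are the early restarts the old condition allowed.
        for (int pos = 1; pos < threshold; pos++) {
            System.out.println("restart at " + pos + " re-emits "
                    + reEmittedAfterRestart(numElements, pos) + " elements");
        }
    }
}
```

With the 11-element input, the early positions are 1 through 4, and a restart at the threshold itself (position 5) leaves 6 elements to re-emit.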

For testCumulateWindow_GroupingSets, the CUMULATE windows [00:00:30, 00:00:35/40/45] can only be triggered by MAX_WATERMARK (emitted when the source finishes), since the last event timestamp 00:00:34 only advances the watermark to 00:00:33, which is not enough to close those windows on its own. If the source restarts from an early checkpoint position and emits all remaining data correctly, MAX_WATERMARK is emitted and windows close properly.
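Why only MAX_WATERMARK can close these windows can be illustrated with a simplified model. WindowFiringSketch is a hypothetical class, not Flink code. It assumes the usual event-time rule that a window [start, end) fires once the watermark reaches end - 1 ms, and it takes the 00:00:33 watermark from the description as given.

```java
/** Hypothetical model, not Flink code: when can an event-time window fire? */
final class WindowFiringSketch {

    /** Assumption: a window [start, end) fires once the watermark reaches end - 1 ms. */
    static boolean canFire(long windowEndMillis, long watermarkMillis) {
        return watermarkMillis >= windowEndMillis - 1;
    }

    /** Converts a seconds offset such as 00:00:35 into epoch-style milliseconds. */
    static long seconds(long s) {
        return s * 1000L;
    }

    public static void main(String[] args) {
        long watermarkFromData = seconds(33); // 00:00:33, reached after the 00:00:34 event
        long maxWatermark = Long.MAX_VALUE;   // timestamp carried by MAX_WATERMARK
        long[] cumulateWindowEnds = {seconds(35), seconds(40), seconds(45)};
        for (long end : cumulateWindowEnds) {
            System.out.println("window end " + end + " ms:"
                    + " fires on data watermark = " + canFire(end, watermarkFromData)
                    + ", fires on MAX_WATERMARK = " + canFire(end, maxWatermark));
        }
    }
}
```

Under this model none of the three window ends (00:00:35, 00:00:40, 00:00:45) is reachable from the data-driven watermark, while all of them fire once the source finishes and MAX_WATERMARK is emitted.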

However, due to the non-determinism of checkpoint timing, in some runs the failure is triggered before all failureAfterNumElements elements are checkpointed, leading to non-reproducible behavior in the restore phase.

Fix

Change the trigger condition to lastCheckpointedEmittedNum >= failureAfterNumElements. This ensures:

The failure only occurs after at least failureAfterNumElements elements have been durably snapshotted.
After restart, the source always begins from a position >= failureAfterNumElements, guaranteeing that the remaining elements (including the watermark-advancing tail data) are re-emitted in the restore run.
MAX_WATERMARK is emitted when the source finishes, closing all pending windows deterministically.
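The difference between the two conditions can be sketched in isolation. FailureTriggerSketch is a hypothetical class that models only the comparison named in the change log; the real FailingCollectionSource.run() loop may check additional state that is not shown here.

```java
/** Hypothetical sketch, not the actual FailingCollectionSource code. */
final class FailureTriggerSketch {

    /** Old comparison: fires as soon as any element has been checkpointed. */
    static boolean oldTrigger(long lastCheckpointedEmittedNum) {
        return lastCheckpointedEmittedNum >= 1;
    }

    /** New comparison: fires only once the checkpoint covers the failure threshold. */
    static boolean newTrigger(long lastCheckpointedEmittedNum, long failureAfterNumElements) {
        return lastCheckpointedEmittedNum >= failureAfterNumElements;
    }

    public static void main(String[] args) {
        long failureAfterNumElements = 5; // 11 / 2 for windowDataWithTimestamp
        for (long checkpointed = 0; checkpointed <= 6; checkpointed++) {
            System.out.println("checkpointed=" + checkpointed
                    + " old=" + oldTrigger(checkpointed)
                    + " new=" + newTrigger(checkpointed, failureAfterNumElements));
        }
    }
}
```

For checkpointed counts 1 through 4 the old comparison already allows the failure while the new one does not, which is exactly the range of early restart positions the fix rules out.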

Verifying this change

This is a test-only change. The modified FailingCollectionSource is used by:

WindowDistinctAggregateITCase (the target of this fix)
WindowAggregateITCase
WindowJoinITCase
WindowRankITCase
WindowTableFunctionITCase
WindowDeduplicateITCase
Other window IT cases using failing-source = true

Run the flaky test with repeated retries:

mvn test -pl flink-table/flink-table-planner \
  -Dtest="WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets" \
  -Dsurefire.rerunFailingTestsCount=10

Does this pull request potentially affect one of the following areas?

Dependencies (does it add or upgrade a dependency): No
The public API, since some other components take dependency on it: No
Build tooling: No
Core execution, scheduling, checkpointing or recovery: No
Tests: Yes (test utility class only)

flinkbot · 2026-04-17T11:38:45Z

CI report:

4ae821f Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

…lateWindow_GroupingSets

The FailingCollectionSource had a race condition where the artificial failure could be triggered with lastCheckpointedEmittedNum >= 1, meaning the failure could occur after as few as 1 element was checkpointed.

When restarting from such an early checkpoint, the source would re-emit elements starting from position 1, but windows relying on the final watermark-advancing elements (e.g., timestamps 00:00:32 and 00:00:34 in windowDataWithTimestamp) could still be processed correctly if all data was re-emitted. However, in practice this leads to non-deterministic behavior depending on when the checkpoint barrier arrives relative to the source's emit loop.

Fix by changing the failure trigger condition from lastCheckpointedEmittedNum >= 1 to lastCheckpointedEmittedNum >= failureAfterNumElements. This ensures the failure only occurs after at least failureAfterNumElements elements have been durably checkpointed, so the source always restarts from a consistent position and can emit all remaining elements (including those needed to advance the watermark past window boundaries).

spuru9 reviewed Apr 17, 2026

Comment thread ...nner/src/test/java/org/apache/flink/table/planner/runtime/utils/FailingCollectionSource.java Outdated

featzhang force-pushed the FLINK-39481 branch from 295310b to dfdb014 on April 17, 2026 13:14

spuru9 reviewed Apr 17, 2026

Comment thread ...nner/src/test/java/org/apache/flink/table/planner/runtime/utils/FailingCollectionSource.java Outdated

featzhang force-pushed the FLINK-39481 branch from dfdb014 to 4ae821f on April 17, 2026 14:54

spuru9 approved these changes Apr 17, 2026

github-actions bot added the community-reviewed label (PR has been reviewed by the community) on Apr 17, 2026

featzhang wants to merge 1 commit into apache:master from featzhang:FLINK-39481
