[FLINK-39481][tests] Fix flaky WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets #27954

Open
featzhang wants to merge 1 commit into apache:master from featzhang:FLINK-39481
Conversation

@featzhang
Member

What is the purpose of the change

Fixes a race condition in FailingCollectionSource that causes WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets (and related CUBE/ROLLUP variants) to fail intermittently.

Brief change log

  • Change the artificial failure trigger condition in FailingCollectionSource.run() from lastCheckpointedEmittedNum >= 1 to lastCheckpointedEmittedNum >= failureAfterNumElements.
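The difference between the two conditions can be sketched in isolation. The class and method names below are hypothetical and only model the comparison described in the change log, not the full run() loop of FailingCollectionSource; the numbers (11 elements, failureAfterNumElements = 5) are the ones quoted in the Root Cause section.

```java
/** Hypothetical stand-in for the guard in FailingCollectionSource.run(); not the real class. */
final class FailureTriggerSketch {

    /** Previous guard: one checkpointed element was enough to allow the failure. */
    static boolean oldGuard(long lastCheckpointedEmittedNum) {
        return lastCheckpointedEmittedNum >= 1;
    }

    /** New guard: the failure waits for failureAfterNumElements checkpointed elements. */
    static boolean newGuard(long lastCheckpointedEmittedNum, long failureAfterNumElements) {
        return lastCheckpointedEmittedNum >= failureAfterNumElements;
    }

    public static void main(String[] args) {
        long numElements = 11;
        long failureAfterNumElements = numElements / 2; // integer division gives 5
        for (long checkpointed = 0; checkpointed <= failureAfterNumElements; checkpointed++) {
            System.out.printf(
                    "checkpointed=%d old=%b new=%b%n",
                    checkpointed,
                    oldGuard(checkpointed),
                    newGuard(checkpointed, failureAfterNumElements));
        }
    }
}
```

For checkpointed counts 1 through 4 the old guard already allows the failure while the new one does not; both agree once 5 elements are checkpointed.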

Root Cause

The FailingCollectionSource is used in window aggregate IT cases to test the checkpoint + restore path. It artificially fails after emitting the first failureAfterNumElements elements (= numElements / 2), then restarts from the checkpoint and emits the remaining elements.

The previous trigger condition lastCheckpointedEmittedNum >= 1 allowed the failure to occur after as few as 1 element was checkpointed. When windowDataWithTimestamp (11 elements, failureAfterNumElements = 5) is used, the checkpoint could be taken at position 1–4, causing the source to restart from that early position. After restart, only numElements - checkpointPosition elements are re-emitted.

For testCumulateWindow_GroupingSets, the CUMULATE windows [00:00:30, 00:00:35/40/45] can only be triggered by MAX_WATERMARK (emitted when the source finishes), since the last event timestamp 00:00:34 only advances the watermark to 00:00:33, which is not enough to close those windows on its own. If the source restarts from an early checkpoint position and emits all remaining data correctly, MAX_WATERMARK is emitted and windows close properly.
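The watermark arithmetic in this paragraph can be checked with a small sketch. It rests on two assumptions that are not stated in the description: the watermark trails the largest event timestamp by 1 second (inferred from 00:00:34 giving 00:00:33), and a window fires once the watermark reaches its end minus 1 ms. LocalTime.MAX is used here only as a stand-in for MAX_WATERMARK, and the class name is hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical sketch of the window-firing arithmetic; not Flink code. */
final class CumulateWatermarkSketch {

    // Assumption: watermark = max event time - 1 second.
    static final Duration ASSUMED_WATERMARK_DELAY = Duration.ofSeconds(1);

    static LocalTime watermarkFor(LocalTime maxEventTime) {
        return maxEventTime.minus(ASSUMED_WATERMARK_DELAY);
    }

    // Assumption: a window fires when watermark >= windowEnd - 1 ms.
    static boolean fires(LocalTime windowEnd, LocalTime watermark) {
        LocalTime maxTimestamp = windowEnd.minus(Duration.ofMillis(1));
        return !watermark.isBefore(maxTimestamp);
    }

    public static void main(String[] args) {
        LocalTime watermark = watermarkFor(LocalTime.of(0, 0, 34));
        LocalTime[] windowEnds = {
            LocalTime.of(0, 0, 35), LocalTime.of(0, 0, 40), LocalTime.of(0, 0, 45)
        };
        for (LocalTime end : windowEnds) {
            System.out.printf(
                    "window end %s: fires on data watermark=%b, fires on end of input=%b%n",
                    end, fires(end, watermark), fires(end, LocalTime.MAX));
        }
    }
}
```

Under these assumptions none of the three window ends is reached by the data-driven watermark, so all of them depend on the final watermark emitted when the source finishes.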

However, because checkpoint timing is non-deterministic, in some runs the failure is triggered before failureAfterNumElements elements have been checkpointed. The restart position then varies from run to run, which makes the behavior of the restore phase non-reproducible.

Fix

Change the trigger condition to lastCheckpointedEmittedNum >= failureAfterNumElements. This ensures:

  1. The failure only occurs after at least failureAfterNumElements elements have been durably snapshotted.
  2. After restart, the source always begins from a position >= failureAfterNumElements, guaranteeing that the remaining elements (including the watermark-advancing tail data) are re-emitted in the restore run.
  3. MAX_WATERMARK is emitted when the source finishes, closing all pending windows deterministically.
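The effect on the restart position can be illustrated with a simplified model. It assumes the restart position equals the number of checkpointed elements and that any count from 0 to numElements is a candidate; it ignores how far the real source emits before a checkpoint completes. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of which restart positions each guard admits; not Flink code. */
final class RestartPositionSketch {

    /** Checkpoint positions at which a guard of the form "checkpointed >= threshold" allows the failure. */
    static List<Integer> admissibleRestartPositions(int numElements, int threshold) {
        List<Integer> positions = new ArrayList<>();
        for (int position = 0; position <= numElements; position++) {
            if (position >= threshold) {
                positions.add(position);
            }
        }
        return positions;
    }

    /** Elements emitted again after the restart, as stated in the Root Cause section. */
    static int reEmitted(int numElements, int checkpointPosition) {
        return numElements - checkpointPosition;
    }

    public static void main(String[] args) {
        int numElements = 11;
        int failureAfterNumElements = numElements / 2; // 5
        System.out.println("old guard: " + admissibleRestartPositions(numElements, 1));
        System.out.println("new guard: "
                + admissibleRestartPositions(numElements, failureAfterNumElements));
        System.out.println("re-emitted from position 5: " + reEmitted(numElements, 5));
    }
}
```

In this model the new guard removes restart positions 1 through 4, so the restore run never begins before failureAfterNumElements.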

Verifying this change

This is a test-only change. The modified FailingCollectionSource is used by:

  • WindowDistinctAggregateITCase (the test class this fix targets)
  • WindowAggregateITCase
  • WindowJoinITCase
  • WindowRankITCase
  • WindowTableFunctionITCase
  • WindowDeduplicateITCase
  • Other window IT cases using failing-source = true

Run the flaky test, letting Surefire rerun it up to 10 times if it fails:

mvn test -pl flink-table/flink-table-planner \
  -Dtest="WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets" \
  -Dsurefire.rerunFailingTestsCount=10

Does this pull request potentially affect one of the following areas?

  • Dependencies (does it add or upgrade a dependency): No
  • The public API, since some other components take dependency on it: No
  • Build tooling: No
  • Core execution, scheduling, checkpointing or recovery: No
  • Tests: Yes (test utility class only)

@flinkbot
Collaborator

flinkbot commented Apr 17, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

…lateWindow_GroupingSets

The FailingCollectionSource had a race condition where the artificial
failure could be triggered with lastCheckpointedEmittedNum >= 1,
meaning the failure could occur after as few as 1 element was
checkpointed. When restarting from such an early checkpoint, the source
would re-emit elements starting from position 1, but windows relying on
the final watermark-advancing elements (e.g., timestamps 00:00:32 and
00:00:34 in windowDataWithTimestamp) could still be processed correctly
if all data was re-emitted. However, in practice this leads to
non-deterministic behavior depending on when the checkpoint barrier
arrives relative to the source's emit loop.

Fix by changing the failure trigger condition from
lastCheckpointedEmittedNum >= 1 to
lastCheckpointedEmittedNum >= failureAfterNumElements. This ensures the
failure only occurs after at least failureAfterNumElements elements have
been durably checkpointed, so the source always restarts from a
consistent position and can emit all remaining elements (including those
needed to advance the watermark past window boundaries).
@github-actions (bot) added the community-reviewed label ("PR has been reviewed by the community.") on Apr 17, 2026