
OSIDB-4914: Update rescheduler's querying logic in sync manager #1251

Open
Jincxz wants to merge 1 commit into master from OSIDB-4914-fix-collector-rescheduler

Conversation

@Jincxz
Contributor

@Jincxz Jincxz commented Apr 23, 2026

Adjusted the logic behind sync manager's rescheduling.

The original code loaded every sync manager of a particular class and checked each one in Python to determine whether it should be rescheduled. The new logic instead expresses each rescheduling condition as a database-level filter. This should mitigate the OOM issue and allow the collectors to run.

The rescheduled set prevents duplicate reschedules when the conditions overlap.
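As a rough illustration of that deduplication idea, here is a pure-Python sketch; the three lists stand in for the three Django querysets, and all names and reason strings are illustrative, not the actual OSIDB code:

```python
def check_for_reschedules(scheduled_not_started, started_not_finished,
                          started_failed_not_permanently, do_reschedule):
    rescheduled = set()  # shared across all condition branches

    def reschedule(sync_ids, reason):
        for sync_id in sync_ids:
            if sync_id in rescheduled:
                continue  # already handled by an earlier branch
            do_reschedule(sync_id, reason)
            rescheduled.add(sync_id)

    reschedule(scheduled_not_started, "did not start in time")
    reschedule(started_not_finished, "did not finish in time")
    reschedule(started_failed_not_permanently, "failed, retrying")
    return rescheduled
```

A sync_id matching more than one condition (here "b" and "c") is rescheduled exactly once, by whichever branch sees it first.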

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a429ba2c-4744-434f-8da8-e2dc33554cde

📥 Commits

Reviewing files that changed from the base of the PR and between 00da65e and c4f00f5.

📒 Files selected for processing (1)
  • osidb/sync_manager.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • osidb/sync_manager.py

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Optimized sync manager rescheduling logic for improved efficiency and reliability, enhancing how the system handles reschedule detection and duplicate prevention.

Walkthrough

check_for_reschedules now detects candidates via database queryset filters (using Q/F) and reschedules by sync_id. It defines three reschedule querysets and uses an inner helper with a shared processed set to call cls.reschedule once per sync_id; a "started but not scheduled" workaround logs and marks affected sync_ids.

Changes

Cohort / File(s) Summary
Sync Manager Reschedule Logic
osidb/sync_manager.py
Rewrote check_for_reschedules to use DB-level Q/F filters for three reschedule triggers: (1) scheduled-but-didn't-start, (2) started-but-didn't-finish, (3) started-and-failed-but-not-permanently. Added an inner reschedule(...) helper that centralizes cls.reschedule calls, derives reason strings (constant or per-row lambda), and prevents duplicate reschedules using a shared processed set. The "started but not scheduled" case now selects rows where last_scheduled_dt is null and last_started_dt is non-null, logs each sync_id, and marks them processed to avoid double handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check — ✅ Passed: The title accurately summarizes the main change: updating the rescheduler's querying logic in the sync manager to use database filtering instead of Python iteration.
  • Description check — ✅ Passed: The description clearly explains the change from checking all sync managers in Python to filtering via database queries, mentions the memory optimization goal, and describes the deduplication mechanism.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.




@Jincxz Jincxz requested a review from a team April 23, 2026 14:32

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@osidb/sync_manager.py`:
- Around line 325-327: The filters for stale-run rescheduling currently exclude
rows where last_scheduled_dt is NULL (e.g., started_not_scheduled) because
comparisons like last_scheduled_dt__lt=F(...) don't match NULL; update the
started_not_finished and started_failed_not_permanently filter clauses to
explicitly include NULLs by using Q(last_scheduled_dt__isnull=True) |
Q(last_scheduled_dt__lt=F(...)) (referencing last_scheduled_dt, last_started_dt,
last_failed_dt, F and Q) so managers with last_scheduled_dt=None are
rescheduled, and add unit tests that create managers with last_scheduled_dt=None
plus stale last_started_dt or last_failed_dt to verify they get rescheduled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 364c5ae6-c98c-4d22-8f34-4d00d822ea57

📥 Commits

Reviewing files that changed from the base of the PR and between a71cec7 and b6cc5a9.

📒 Files selected for processing (1)
  • osidb/sync_manager.py

Comment thread osidb/sync_manager.py Outdated
Comment on lines +325 to +327
started_not_scheduled = sync_managers.filter(
last_scheduled_dt__isnull=True, last_started_dt__isnull=False
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the relevant section of the file
wc -l osidb/sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 97


🏁 Script executed:

#!/bin/bash
# Check the specific lines mentioned in the review
sed -n '310,400p' osidb/sync_manager.py | head -100

Repository: RedHatProductSecurity/osidb

Length of output: 3836


🏁 Script executed:

#!/bin/bash
# Find related test files for this sync_manager module
find . -type f -name "*test*sync*" -o -name "*sync*test*" | grep -E "\.py$"

Repository: RedHatProductSecurity/osidb

Length of output: 157


🏁 Script executed:

#!/bin/bash
# Check imports in sync_manager.py to confirm F() and Q() availability
head -50 osidb/sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 1852


🏁 Script executed:

#!/bin/bash
# Read the test file to see if there are tests covering NULL last_scheduled_dt
head -100 osidb/tests/test_sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 3194


🏁 Script executed:

#!/bin/bash
# Get more context around the filters, specifically the complete started_not_finished and started_failed_not_permanently blocks
sed -n '360,410p' osidb/sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 2178


🏁 Script executed:

#!/bin/bash
# Search for any tests that specifically test started_not_scheduled or NULL scheduling scenarios
grep -n "started_not_scheduled\|last_scheduled_dt.*None\|last_scheduled_dt.*null" osidb/tests/test_sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Search for any test methods that cover reschedule behavior with NULL scheduling
grep -n "test_.*reschedule\|def test_" osidb/tests/test_sync_manager.py | head -30

Repository: RedHatProductSecurity/osidb

Length of output: 419


🏁 Script executed:

#!/bin/bash
# Check if there are any test cases that create SyncManager with NULL last_scheduled_dt
grep -B 5 -A 10 "last_scheduled_dt" osidb/tests/test_sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 4549


🏁 Script executed:

#!/bin/bash
# Check if there are any check_for_reschedules tests or tests for the reschedule logic
grep -n "check_for_reschedules\|started_not_finished\|started_failed_not_permanently" osidb/tests/test_sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 53


🏁 Script executed:

#!/bin/bash
# Get the full test file length to see how comprehensive it is
wc -l osidb/tests/test_sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 107


🏁 Script executed:

#!/bin/bash
# Check for any TODO or OSIDB-3131 references in the codebase
grep -r "OSIDB-3131\|started_not_scheduled" --include="*.py"

Repository: RedHatProductSecurity/osidb

Length of output: 308


🏁 Script executed:

#!/bin/bash
# Verify the actual field definitions to confirm NULL is allowed on last_scheduled_dt
grep -B 2 -A 2 "last_scheduled_dt" osidb/sync_manager.py | head -20

Repository: RedHatProductSecurity/osidb

Length of output: 827


🏁 Script executed:

#!/bin/bash
# Let's verify Django's behavior with F() expressions and NULL values by examining database field properties
# and confirm the logic around the filters more carefully
sed -n '320,410p' osidb/sync_manager.py

Repository: RedHatProductSecurity/osidb

Length of output: 3828


🏁 Script executed:

#!/bin/bash
# Let me create a quick verification test to understand Django F() behavior with NULL
cat > /tmp/test_django_null.py << 'EOF'
"""
Django F() expression behavior with NULL values:

In SQL, comparisons involving NULL always return UNKNOWN/NULL:
- NULL < X evaluates to NULL (not TRUE or FALSE)
- When used in a WHERE clause, NULL is treated as FALSE
- So rows with NULL values are excluded from the result set

Example:
  SELECT * FROM table WHERE col < 10
  Will NOT return rows where col IS NULL, even if NULL should logically be "less than" 10

Solution:
  SELECT * FROM table WHERE col IS NULL OR col < 10
  This explicitly includes the NULL case

In Django ORM:
  Model.objects.filter(field__lt=F('other_field'))
  Will exclude rows where field IS NULL

  Model.objects.filter(Q(field__isnull=True) | Q(field__lt=F('other_field')))
  Will include rows where field IS NULL OR the comparison is true
"""
print("Django F() with NULL behavior:")
print("- NULL < X returns NULL (excluded from WHERE clause)")
print("- Need explicit Q(field__isnull=True) | Q(field__lt=F(...)) to include NULLs")
EOF
cat /tmp/test_django_null.py

Repository: RedHatProductSecurity/osidb

Length of output: 1037


🏁 Script executed:

#!/bin/bash
# Search for the MAX_RUN_LENGTH and FAIL_RESCHEDULE_DELAY constants to understand the timing context
grep -n "MAX_RUN_LENGTH\|FAIL_RESCHEDULE_DELAY\|MAX_SCHEDULE_DELAY" osidb/sync_manager.py | head -20

Repository: RedHatProductSecurity/osidb

Length of output: 1366


Handle NULL schedules in the stale-run reschedule filters.

Line 325 explicitly identifies the known bad state where a manager has last_started_dt but no last_scheduled_dt. Those rows are logged but never rescheduled: the later last_scheduled_dt__lt=F(...) filters exclude NULL values, so a long-running or failed manager stuck in that state will remain unscheduled instead of being rescheduled.

In SQL, NULL < X evaluates to NULL (falsy in WHERE clauses), so rows with last_scheduled_dt IS NULL never match filters like last_scheduled_dt__lt=F("last_started_dt"). This leaves stale managers in a permanent hung state.

Use Q(last_scheduled_dt__isnull=True) | Q(last_scheduled_dt__lt=F(...)) in both started_not_finished and started_failed_not_permanently filters to explicitly include NULL cases.

Also add a test case for last_scheduled_dt=None with stale last_started_dt or last_failed_dt to ensure rescheduling coverage.
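The NULL semantics at the heart of this finding can be verified with a small standalone SQL example (sqlite3 here; the table and column values are made up for illustration, with integers standing in for datetimes):

```python
import sqlite3

# Demonstrate that NULL < X is not TRUE in a WHERE clause, so rows with
# a NULL column silently drop out of a comparison filter.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mgr (sync_id TEXT, last_scheduled_dt INTEGER, last_started_dt INTEGER)"
)
con.executemany(
    "INSERT INTO mgr VALUES (?, ?, ?)",
    [
        ("stale", 1, 5),    # scheduled before it started -> stale run
        ("hung", None, 5),  # started but never scheduled -> NULL column
    ],
)

# Plain comparison, as in last_scheduled_dt__lt=F("last_started_dt"):
# the NULL row is excluded.
plain = [r[0] for r in con.execute(
    "SELECT sync_id FROM mgr WHERE last_scheduled_dt < last_started_dt")]

# Explicit IS NULL branch, mirroring
# Q(last_scheduled_dt__isnull=True) | Q(last_scheduled_dt__lt=F(...)):
# both rows match.
fixed = [r[0] for r in con.execute(
    "SELECT sync_id FROM mgr WHERE last_scheduled_dt IS NULL "
    "OR last_scheduled_dt < last_started_dt")]
```

With the plain filter only `stale` is returned; with the explicit NULL branch both `stale` and `hung` are, which is the behavior the fix needs.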

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@osidb/sync_manager.py` around lines 325 - 327, The filters for stale-run
rescheduling currently exclude rows where last_scheduled_dt is NULL (e.g.,
started_not_scheduled) because comparisons like last_scheduled_dt__lt=F(...)
don't match NULL; update the started_not_finished and
started_failed_not_permanently filter clauses to explicitly include NULLs by
using Q(last_scheduled_dt__isnull=True) | Q(last_scheduled_dt__lt=F(...))
(referencing last_scheduled_dt, last_started_dt, last_failed_dt, F and Q) so
managers with last_scheduled_dt=None are rescheduled, and add unit tests that
create managers with last_scheduled_dt=None plus stale last_started_dt or
last_failed_dt to verify they get rescheduled.

@Jincxz Jincxz added the technical label (For PRs that introduce changes not worthy of a CHANGELOG entry) Apr 23, 2026
@Jincxz Jincxz force-pushed the OSIDB-4914-fix-collector-rescheduler branch from b6cc5a9 to 00da65e on April 23, 2026 16:10

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
osidb/sync_manager.py (1)

310-406: Materializing full model rows partially defeats the stated OOM goal.

The PR description targets OOM reduction, but each filtered QuerySet is still fully loaded into memory when iterated by the inner reschedule helper (Django caches the result list by default). For the two reason-constant branches (scheduled_not_started, started_not_finished) only sync_id is used, and for started_failed_not_permanently only sync_id and last_consecutive_failures are needed.

Consider projecting just what's needed and/or streaming with .iterator() to avoid holding entire SyncManager instances in memory:

♻️ Suggested refinement
-        def reschedule(sync_managers, msg=None, msg_fn=None):
-            for sync_manager in sync_managers:
-                sync_id = sync_manager.sync_id
-                if sync_id in rescheduled:
-                    continue
-
-                reason = msg_fn(sync_manager) if msg_fn is not None else msg
-                cls.reschedule(sync_id, reason)
-                rescheduled.add(sync_id)
+        def reschedule(sync_managers, msg=None, msg_fn=None):
+            for sm in sync_managers.iterator():
+                sync_id = sm["sync_id"] if isinstance(sm, dict) else sm.sync_id
+                if sync_id in rescheduled:
+                    continue
+                reason = msg_fn(sm) if msg_fn is not None else msg
+                cls.reschedule(sync_id, reason)
+                rescheduled.add(sync_id)

And at the call sites, narrow the fields, e.g.:

reschedule(
    scheduled_not_started.values_list("sync_id", flat=True).iterator(),
    "Sync did not start after MAX_SCHEDULE_DELAY",
)
...
reschedule(
    started_failed_not_permanently.values("sync_id", "last_consecutive_failures").iterator(),
    msg_fn=lambda sm: f"Failed {sm['last_consecutive_failures']} times",
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@osidb/sync_manager.py` around lines 310 - 406, The QuerySets
(scheduled_not_started, started_not_finished, started_failed_not_permanently)
are being fully materialized when iterated by reschedule; to avoid OOM, project
only the needed fields and stream results (use .values_list("sync_id",
flat=True).iterator() for branches that only need sync_id, and
.values("sync_id","last_consecutive_failures").iterator() for the branch that
needs the failure count). Update the reschedule helper (reschedule) to accept
these streamed iterables and handle both simple sync_id values and dict/row
objects (adjust msg_fn use to read sm for model, sm['last_consecutive_failures']
for dicts or support tuples) so we don't load full SyncManager instances into
memory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@osidb/sync_manager.py`:
- Around line 310-406: The QuerySets (scheduled_not_started,
started_not_finished, started_failed_not_permanently) are being fully
materialized when iterated by reschedule; to avoid OOM, project only the needed
fields and stream results (use .values_list("sync_id", flat=True).iterator() for
branches that only need sync_id, and
.values("sync_id","last_consecutive_failures").iterator() for the branch that
needs the failure count). Update the reschedule helper (reschedule) to accept
these streamed iterables and handle both simple sync_id values and dict/row
objects (adjust msg_fn use to read sm for model, sm['last_consecutive_failures']
for dicts or support tuples) so we don't load full SyncManager instances into
memory.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cc923a12-12e2-44e0-853e-c1c4634fc514

📥 Commits

Reviewing files that changed from the base of the PR and between b6cc5a9 and 00da65e.

📒 Files selected for processing (1)
  • osidb/sync_manager.py

@Jincxz Jincxz removed the request for review from a team April 23, 2026 16:15
@Jincxz Jincxz marked this pull request as draft April 23, 2026 16:15
@Jincxz Jincxz force-pushed the OSIDB-4914-fix-collector-rescheduler branch from 00da65e to c4f00f5 on April 23, 2026 17:19
@Jincxz Jincxz marked this pull request as ready for review April 23, 2026 17:55
@Jincxz Jincxz requested a review from a team April 23, 2026 17:55
Contributor

@JakubFrejlach JakubFrejlach left a comment


Change LGTM! Just a thought about another possible bottleneck.

Comment thread osidb/sync_manager.py
f"Started but not scheduled, this should NEVER happen"
)
continue
sync_managers = SyncManager.objects.filter(name=cls.__name__)
Contributor


The move from Python-level filtering to DB queries is a solid improvement, but I think there's still a potential bottleneck: if many sync managers match the reschedule conditions, we're still calling reschedule() for each one individually (multiple DB queries + a broker call per sync manager). On stage, where we're seeing OOM kills due to a large volume of failed sync managers, this could still be problematic.

Would it make sense to cap each reschedule check to a reasonable batch size? Since most collectors run every minute, we'd eventually process all pending reschedules across successive runs.

To prevent starvation (the same rows getting picked every time while others wait indefinitely), we could order candidates by last_scheduled_dt ascending so the oldest-waiting items get rescheduled first; once rescheduled, their timestamp moves to the back of the queue.
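A toy model of that capped, oldest-first batching idea (pure Python; the cap value and field shapes are illustrative, not a proposed API):

```python
BATCH_SIZE = 2  # illustrative cap; the real value would be tuned per deployment

def pick_batch(candidates, now):
    # candidates: list of (sync_id, last_scheduled_dt) pairs, timestamps
    # as plain integers for the sketch. Oldest last_scheduled_dt first,
    # so long-waiting managers win; rescheduling stamps them with `now`,
    # which pushes them to the back of the queue on the next run.
    ordered = sorted(candidates, key=lambda c: c[1])
    batch = ordered[:BATCH_SIZE]
    return [(sync_id, now) for sync_id, _ in batch]

pending = [("a", 1), ("b", 3), ("c", 2)]
first = pick_batch(pending, now=10)  # oldest two: "a" then "c"
```

Each run drains at most BATCH_SIZE candidates, and because rescheduled items get a fresh timestamp, successive runs work through the backlog without the same rows being picked every time.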

Contributor Author


That sounds like a good idea. Scheduling in batches makes sense. Will convert the rescheduler into a capped queue-like system.

