OSIDB-4914: Update rescheduler's querying logic in sync manager#1251
Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough (summary by CodeRabbit): check_for_reschedules now detects candidates via database queryset filters (using Q/F) and reschedules by sync_id. It defines three reschedule querysets and uses an inner helper with a shared `rescheduled` set to avoid duplicate reschedules.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@osidb/sync_manager.py`:
- Around line 325-327: The filters for stale-run rescheduling currently exclude
rows where last_scheduled_dt is NULL (e.g., started_not_scheduled) because
comparisons like last_scheduled_dt__lt=F(...) don't match NULL; update the
started_not_finished and started_failed_not_permanently filter clauses to
explicitly include NULLs by using Q(last_scheduled_dt__isnull=True) |
Q(last_scheduled_dt__lt=F(...)) (referencing last_scheduled_dt, last_started_dt,
last_failed_dt, F and Q) so managers with last_scheduled_dt=None are
rescheduled, and add unit tests that create managers with last_scheduled_dt=None
plus stale last_started_dt or last_failed_dt to verify they get rescheduled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 364c5ae6-c98c-4d22-8f34-4d00d822ea57
📒 Files selected for processing (1)
osidb/sync_manager.py
```python
started_not_scheduled = sync_managers.filter(
    last_scheduled_dt__isnull=True, last_started_dt__isnull=False
)
```
🧩 Analysis chain (collapsed): shell scripts inspected `osidb/sync_manager.py` and `osidb/tests/test_sync_manager.py`, confirming that `F` and `Q` are imported, that no existing tests cover `check_for_reschedules` or managers with `last_scheduled_dt=None`, and that in SQL `NULL < X` evaluates to NULL (falsy in a WHERE clause), so filters like `last_scheduled_dt__lt=F("last_started_dt")` exclude rows where `last_scheduled_dt` IS NULL.
Handle NULL schedules in the stale-run reschedule filters.
Line 325 explicitly identifies the known bad state where a manager has last_started_dt but no last_scheduled_dt. Those rows are logged but never rescheduled: the later last_scheduled_dt__lt=F(...) filters exclude NULL values, so a long-running or failed manager stuck in that state will remain unscheduled instead of being rescheduled.
In SQL, NULL < X evaluates to NULL (falsy in WHERE clauses), so rows with last_scheduled_dt IS NULL never match filters like last_scheduled_dt__lt=F("last_started_dt"). This leaves stale managers in a permanent hung state.
Use Q(last_scheduled_dt__isnull=True) | Q(last_scheduled_dt__lt=F(...)) in both started_not_finished and started_failed_not_permanently filters to explicitly include NULL cases.
Also add a test case for last_scheduled_dt=None with stale last_started_dt or last_failed_dt to ensure rescheduling coverage.
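The NULL behavior driving this finding can be reproduced with a small standalone sketch (plain sqlite3 rather than the project's Django models; the table and column values are illustrative):

```python
import sqlite3

# Two "sync managers": one with a stale schedule, one never scheduled (NULL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sm (sync_id TEXT, last_scheduled_dt TEXT, last_started_dt TEXT)"
)
conn.executemany(
    "INSERT INTO sm VALUES (?, ?, ?)",
    [
        ("stale", "2024-01-01", "2024-01-02"),    # scheduled before it started
        ("never_scheduled", None, "2024-01-02"),  # started, last_scheduled_dt IS NULL
    ],
)

# NULL < X evaluates to NULL, which is falsy in WHERE: the NULL row drops out.
naive = [r[0] for r in conn.execute(
    "SELECT sync_id FROM sm WHERE last_scheduled_dt < last_started_dt"
)]

# Explicitly including IS NULL matches both rows.
fixed = [r[0] for r in conn.execute(
    "SELECT sync_id FROM sm WHERE last_scheduled_dt IS NULL"
    " OR last_scheduled_dt < last_started_dt"
)]

print(naive)  # ['stale']
print(fixed)  # ['stale', 'never_scheduled']
```

The second query mirrors what `Q(last_scheduled_dt__isnull=True) | Q(last_scheduled_dt__lt=F("last_started_dt"))` compiles to.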
Force-pushed from b6cc5a9 to 00da65e
🧹 Nitpick comments (1)
osidb/sync_manager.py (1)
310-406: Materializing full model rows partially defeats the stated OOM goal. The PR description targets OOM reduction, but each filtered QuerySet is still fully loaded into memory when iterated by the inner `reschedule` helper (Django caches the result list by default). For the two reason-constant branches (`scheduled_not_started`, `started_not_finished`) only `sync_id` is used, and for `started_failed_not_permanently` only `sync_id` and `last_consecutive_failures` are needed. Consider projecting just what's needed and/or streaming with `.iterator()` to avoid holding entire `SyncManager` instances in memory:
♻️ Suggested refinement
```diff
-        def reschedule(sync_managers, msg=None, msg_fn=None):
-            for sync_manager in sync_managers:
-                sync_id = sync_manager.sync_id
-                if sync_id in rescheduled:
-                    continue
-
-                reason = msg_fn(sync_manager) if msg_fn is not None else msg
-                cls.reschedule(sync_id, reason)
-                rescheduled.add(sync_id)
+        def reschedule(sync_managers, msg=None, msg_fn=None):
+            for sm in sync_managers:
+                # sm may be a bare sync_id (values_list), a dict (values),
+                # or a full model instance
+                if isinstance(sm, dict):
+                    sync_id = sm["sync_id"]
+                elif isinstance(sm, str):
+                    sync_id = sm
+                else:
+                    sync_id = sm.sync_id
+                if sync_id in rescheduled:
+                    continue
+                reason = msg_fn(sm) if msg_fn is not None else msg
+                cls.reschedule(sync_id, reason)
+                rescheduled.add(sync_id)
```
And at the call sites, narrow the fields, e.g.:

```python
reschedule(
    scheduled_not_started.values_list("sync_id", flat=True).iterator(),
    "Sync did not start after MAX_SCHEDULE_DELAY",
)
...
reschedule(
    started_failed_not_permanently.values(
        "sync_id", "last_consecutive_failures"
    ).iterator(),
    msg_fn=lambda sm: f"Failed {sm['last_consecutive_failures']} times",
)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@osidb/sync_manager.py` around lines 310 - 406, The QuerySets (scheduled_not_started, started_not_finished, started_failed_not_permanently) are being fully materialized when iterated by reschedule; to avoid OOM, project only the needed fields and stream results (use .values_list("sync_id", flat=True).iterator() for branches that only need sync_id, and .values("sync_id","last_consecutive_failures").iterator() for the branch that needs the failure count). Update the reschedule helper (reschedule) to accept these streamed iterables and handle both simple sync_id values and dict/row objects (adjust msg_fn use to read sm for model, sm['last_consecutive_failures'] for dicts or support tuples) so we don't load full SyncManager instances into memory.
Force-pushed from 00da65e to c4f00f5
JakubFrejlach left a comment
Change LGTM! Just a thought about another possible bottleneck.
```python
        f"Started but not scheduled, this should NEVER happen"
    )
    continue
...
sync_managers = SyncManager.objects.filter(name=cls.__name__)
```
The move from Python-level filtering to DB queries is a solid improvement, but I think there's still a potential bottleneck: if many sync managers match the reschedule conditions, we're still calling reschedule() for each one individually (multiple DB queries + a broker call per sync manager). On stage, where we're seeing OOM kills due to a large volume of failed sync managers, this could still be problematic.
Would it make sense to cap each reschedule check to a reasonable batch size? Since most collectors run every minute, we'd eventually process all pending reschedules across successive runs.
To prevent starvation (the same rows getting picked every time while others wait indefinitely), we could order candidates by last_scheduled_dt ascending so the oldest-waiting items get rescheduled first; once rescheduled, their timestamp moves to the back of the queue.
That sounds like a good idea. Scheduling in batches makes sense. Will convert the rescheduler into a capped queue-like system.
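The capped, oldest-first batching discussed above could be sketched like this (pure Python; `BATCH_SIZE` and the dict shape are illustrative assumptions, not OSIDB code):

```python
BATCH_SIZE = 100  # illustrative cap, not an actual OSIDB setting

def pick_reschedule_batch(candidates, batch_size=BATCH_SIZE):
    """Pick at most batch_size candidates, oldest-waiting first.

    None (never scheduled) sorts before any timestamp so stuck
    managers are not starved behind recently scheduled ones.
    """
    ordered = sorted(
        candidates,
        key=lambda c: (c["last_scheduled_dt"] is not None,
                       c["last_scheduled_dt"] or ""),
    )
    return ordered[:batch_size]

candidates = [
    {"sync_id": "b", "last_scheduled_dt": "2024-01-02"},
    {"sync_id": "a", "last_scheduled_dt": "2024-01-01"},
    {"sync_id": "c", "last_scheduled_dt": None},
]
batch = pick_reschedule_batch(candidates, batch_size=2)
print([c["sync_id"] for c in batch])  # ['c', 'a']
```

Because a rescheduled manager's timestamp is refreshed, it naturally moves to the back of the ordering on the next run, which is the anti-starvation property described above.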
Adjusted the logic behind sync manager's rescheduling.
The original code checked every sync manager, of a particular class, to determine if they should be rescheduled. The new logic filters out each conditional branch for rescheduling instead. This should minimize the OOM issue and allow the collectors to run.
The `rescheduled` set is there to prevent re-reschedules in the case that the conditions overlap.