
fix(grouping): cap event duration at 24h for continuous-monitoring deployments#1268

Open
mihow wants to merge 5 commits into main from fix/group-images-24h-max-duration

Conversation


mihow (Collaborator) commented Apr 23, 2026

Summary

group_images_into_events currently splits images into events only when it sees a gap larger than max_time_gap (default 2h). For continuous-monitoring deployments (no nightly quiet period), the stream never has a 2h gap, so every image in the deployment coalesces into a single multi-month event.

Observed in production on deployment 489 (Camera 263): one "event" spanning Apr 24 → Jul 23 2023 (90 days) containing 355k source_images. The downstream UPDATE on main_sourceimage.event_id for this mega-event also caused stuck PG backends (the UPDATE touches 355k rows × 10 indexes).

Fix

Add a max_event_duration kwarg to group_datetimes_by_gap and thread it through group_images_into_events with a default of 24 hours. When set, a group is also split once its span would exceed the cap, even when no gap triggers a split. None disables the cap (backward compatible for existing callers of the utility).
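For reference, the split rule can be sketched in plain Python. This is a minimal model of `group_datetimes_by_gap`, not the actual implementation (parameter names match the PR; the internals are assumed):

```python
from datetime import datetime, timedelta


def group_datetimes_by_gap(timestamps, max_time_gap=timedelta(hours=2), max_event_duration=None):
    """Split sorted timestamps on a gap > max_time_gap, and additionally
    whenever the current group's span would exceed max_event_duration.
    None disables the duration cap."""
    groups, current = [], []
    for ts in sorted(timestamps):
        if current:
            gap_exceeded = ts - current[-1] > max_time_gap
            cap_exceeded = (
                max_event_duration is not None
                and ts - current[0] > max_event_duration
            )
            if gap_exceeded or cap_exceeded:
                groups.append(current)
                current = []
        current.append(ts)
    if current:
        groups.append(current)
    return groups


# Continuous monitoring: 3 days of 5-minute captures, never a 2h gap.
start = datetime(2023, 4, 24)
stream = [start + timedelta(minutes=5 * i) for i in range(3 * 24 * 12)]
print(len(group_datetimes_by_gap(stream)))  # 1: a single mega-event without the cap
capped = group_datetimes_by_gap(stream, max_event_duration=timedelta(hours=24))
print(len(capped))  # 3: one event per day under the 24h cap
```

In the real code path these groups become Events; the sketch only covers the split decision itself.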

Also removed the stale inline `# @TODO this event grouping needs testing. Still getting events over 24 hours` comment, now that events over 24h are prevented by construction.

Follow-up correctness fixes (commit 637afce2)

The first round of this PR only handled the gap/duration split itself. Reviewing the regroup path against the production deployment surfaced two pre-existing data-correctness bugs that the cap reliably triggers, so they are addressed here rather than as separate PRs:

  1. Occurrence.event_id realignment. Occurrences are bound to an event once at creation time (Detection.associate_new_occurrence, Pipeline.save_results both read source_image.event) and were never re-derived afterward. Without a refresh, a deployment regrouped under the 24h cap kept every occurrence pointing at its original (pre-cap) event regardless of when its detections actually fired, breaking every Occurrence.event-keyed query (the occur_det_proj_evt index, Event.occurrences related-name, the event_ids= filter chain at models.py:4232–4477). Now realigned via Occurrence.event_id = first_detection.source_image.event_id. The pre- and post-refresh event holders are unioned into touched_event_pks so update_calculated_fields_for_events covers both occurrence-losers and -gainers.

  2. Deployment-level cache refresh. The async regroup_events task never went through Deployment.save's calculated-fields refresh, so the deployment list (occurrences_count, taxa_count, captures_count, detections_count) showed pre-regroup numbers until the next save touched the deployment. Replaced the manual events_count-only update at the end of group_images_into_events with a full deployment.update_calculated_fields(save=True) call (which itself uses update_calculated_fields=False so it does not re-enter the regroup path).
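The realignment rule in item 1 can be modeled outside the ORM. This is a hypothetical sketch: `SourceImage` and `Detection` here are plain dataclasses standing in for the Django models, and `realign_occurrences` mirrors the described bookkeeping, including the touched-event union:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SourceImage:
    timestamp: datetime
    event_id: int


@dataclass
class Detection:
    occurrence_id: int
    source_image: SourceImage


def realign_occurrences(occurrence_event_ids, detections):
    """Re-derive each occurrence's event from its earliest detection's
    capture, returning the new mapping plus the union of pre- and
    post-refresh event holders (for the calculated-fields refresh)."""
    # earliest detection per occurrence, by capture timestamp
    first_by_occ = {}
    for det in detections:
        cur = first_by_occ.get(det.occurrence_id)
        if cur is None or det.source_image.timestamp < cur.source_image.timestamp:
            first_by_occ[det.occurrence_id] = det
    new_ids = dict(occurrence_event_ids)
    touched = set()
    for occ_id, det in first_by_occ.items():
        old_event = occurrence_event_ids.get(occ_id)
        new_event = det.source_image.event_id
        if old_event != new_event:
            # both the loser and the gainer need their cached counts redone
            touched.update(e for e in (old_event, new_event) if e is not None)
            new_ids[occ_id] = new_event
    return new_ids, touched


# An occurrence created on the pre-cap mega-event (event 1) whose first
# detection now sits in a daily event (event 7) after regrouping:
t = datetime(2023, 4, 25, 3, 0)
dets = [Detection(occurrence_id=10, source_image=SourceImage(t, event_id=7))]
new_ids, touched = realign_occurrences({10: 1}, dets)
print(new_ids[10], sorted(touched))  # 7 [1, 7]
```

The production fix does this in one correlated UPDATE rather than per object; the sketch only shows the rule and the touched-pk union.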

Test plan

  • Doctest in ami/utils/dates.py for the continuous-monitoring case: 3 days of 5-minute-interval timestamps with no gap > 120 min produces 1 group without the cap, 3 groups (each ≤24h) with max_event_duration=24h.
  • TestImageGrouping.test_continuous_monitoring_capped_at_24_hours — fresh grouping path, asserts exactly 3 daily events ≤24h.
  • TestImageGrouping.test_regrouping_existing_long_event_refreshes_cached_fields — re-grouping path, asserts cached captures_count per event is consistent post-regroup and no captures are orphaned.
  • TestImageGrouping.test_regrouping_realigns_occurrence_event_id — covers the detection→source_image→event chain post-regroup, plus per-event occurrences_count consistency against the live get_occurrences_count helper.
  • Existing test_grouping (3 discrete nights) unchanged — the cap only fires when the gap-based split fails to trigger.
  • Full local test pass: ami.main.tests + ami.exports.tests + ami.ml.tests → 249 tests, OK (2 skipped).
  • python -m doctest ami/utils/dates.py → 28 attempted, 0 failed.
  • black + isort + flake8 + pyupgrade + django-upgrade pre-commit hooks pass.
  • CI test run on merge target.

How to verify post-deploy

After merging and deploying, run on a continuous-monitoring deployment with existing detections/occurrences (e.g. deployment 489) and check each of the three correctness layers in turn. Trigger the regroup via the admin regroup_events action, the regroup_events celery task, or Deployment.save(regroup_async=False).

import datetime
from django.db import models
from ami.main.models import Deployment, Event, SourceImage, Occurrence, Detection

dep = Deployment.objects.get(pk=<deployment_pk>)

# 1. Cap: no event spans > 24h.
over_cap = Event.objects.filter(
    deployment=dep,
    start__lt=models.F("end") - datetime.timedelta(hours=24),
).count()
assert over_cap == 0, f"{over_cap} events still exceed 24h"

# 2. Per-event captures_count: cached counter matches actual captures
#    (the original "stale cache on reused mega-event" bug).
for ev in Event.objects.filter(deployment=dep):
    actual = SourceImage.objects.filter(event=ev).count()
    assert ev.captures_count == actual, (
        f"event {ev.pk}: cached captures_count={ev.captures_count} actual={actual}"
    )

# 3. Occurrence.event_id realignment: every occurrence's event matches its
#    first detection's source_image.event. Sample a few; full pass on a
#    357k-image deployment is fine but takes a minute.
for occ in Occurrence.objects.filter(deployment=dep).order_by("?")[:200]:
    first = (
        Detection.objects.filter(occurrence=occ)
        .select_related("source_image")
        .order_by("source_image__timestamp")
        .first()
    )
    if first is None:
        continue
    assert occ.event_id == first.source_image.event_id, (
        f"occurrence {occ.pk}: stale event_id={occ.event_id} "
        f"(expected {first.source_image.event_id})"
    )

# 4. Deployment cache refresh: list-view counters reflect the regroup.
dep.refresh_from_db()
print(dep.events_count, dep.captures_count, dep.occurrences_count, dep.taxa_count)
# Expect: events_count ≈ days covered (was 1 mega-event), other counts
# unchanged in absolute terms (occurrences/taxa weren't added/removed,
# just re-bucketed across events).

UI spot-checks worth doing:

  • Sessions list / Events list: each event should show a duration ≤ 24h, with non-zero occurrences_count and taxa_count distributed across the daily events (pre-regroup, all counts sat on the single mega-event).
  • Occurrence detail view: the event link / breadcrumb should point at the daily event whose date matches the occurrence's first detection, not the original mega-event date.
  • Deployment list: occurrences_count and taxa_count columns should match the project summary view (which queries Occurrence directly and is unaffected by caching).

Follow-ups (not in this PR)

This PR intentionally stays narrow: one new kwarg, one default, plus the data-correctness refreshes that the cap necessarily triggers on existing deployments. Can be rebased on top of #904 when that lands, or #904 can be rebased on top of this.

Deployment notes

After merge + deploy, regrouping the affected deployment (e.g. via the admin regroup_events action or Deployment.save(regroup_async=False)) will split the existing multi-month event into daily events. Existing events with the same group_by date (the first day of the old mega-event) will be reused and pruned by delete_empty_events if emptied.

Heads-up on existing >24h events. Any deployment that today has events >24h — for any reason, not just the continuous-monitoring pathology — will get those events split on the next regroup. In practice everything we've seen >24h in production is itself an error state (the symptom this PR fixes, or unusual gap configurations), so this is the right behavior. Worth flagging if any project owner has a downstream report keyed on "1 event = 1 capture session" expecting multi-day events to stay intact. The audit warning at audit_event_lengths will stop firing for these deployments after the regroup, which is the intended end state.

Summary by CodeRabbit

  • New Features

    • Event grouping enforces a configurable maximum duration per event (default 24 hours), splitting long continuous streams into multiple events.
  • Bug Fixes

    • Fixed stale event metadata after regrouping so event start/end and capture counts are refreshed correctly when captures are reassigned.
    • Realigned Occurrence.event_id from each occurrence's first detection's source_image.event so event-scoped occurrence queries reflect the post-regroup grouping.
    • Refreshed deployment-level cached counts (occurrences_count, taxa_count, etc.) at the end of regrouping so the deployment list view stays consistent after async regroups.
  • Tests

    • Added tests verifying segmentation over multi-day streams, correct metadata refresh after regrouping, and Occurrence.event_id realignment with per-event occurrences_count consistency.

…ployments

`group_images_into_events` splits images into events when there is a gap
larger than `max_time_gap` (default 2h). For continuous-monitoring
deployments the stream never has a 2h quiet period, so every image in
the deployment coalesces into a single multi-month event. Observed on
production deployment 489: one event spanning Apr 24 → Jul 23 2023
with 355k source_images.

Add a `max_event_duration` kwarg to `group_datetimes_by_gap` and thread
it through `group_images_into_events` with a default of 24 hours. When
set, a group is also split once its span would exceed the cap, even if
no gap triggers the split. `None` disables the cap (backward compatible).

Also removes the stale `# @todo this event grouping needs testing.
Still getting events over 24 hours` comment now that events over 24h
are prevented by construction.

Tests: doctest for the continuous-monitoring case in `dates.py`, plus
a Django test that creates 3 days of 10-minute-interval captures and
asserts grouping produces >=3 events each <=24h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 18:42

netlify Bot commented Apr 23, 2026

Deploy Preview for antenna-preview canceled.

🔨 Latest commit a2e4bb3
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69ec34374d70a50008f3094e


netlify Bot commented Apr 23, 2026

Deploy Preview for antenna-ssec canceled.

🔨 Latest commit a2e4bb3
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69ec343760d0290008e29633


coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 26 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 27 minutes and 26 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: adfcae24-13cf-4307-aecc-f16257ba97f8

📥 Commits

Reviewing files that changed from the base of the PR and between 637afce and a2e4bb3.

📒 Files selected for processing (2)
  • ami/main/models.py
  • ami/main/tests.py
📝 Walkthrough

Walkthrough

group_images_into_events gains an optional max_event_duration (default 24h) forwarded into group_datetimes_by_gap; grouping now splits by inter-capture gap or by a group's total span exceeding the duration cap. Regrouping re-tracks affected events and refreshes their calculated cached fields and deployment counts.

Changes

Cohort / File(s) Summary
Event grouping core logic
ami/main/models.py, ami/utils/dates.py
Added DEFAULT_MAX_EVENT_DURATION and a max_event_duration parameter; group_datetimes_by_gap now also splits when a group's span would exceed the cap; group_images_into_events forwards the cap, tracks events touched by capture reassignment and occurrence realignment, and recomputes cached event/deployment fields.
Tests
ami/main/tests.py
Added deterministic continuous-capture test data and tests asserting continuous streams are split into daily events under the 24h cap; regression test creates a mega-event then reruns grouping with the cap to verify splitting, occurrence→event realignment, and refreshed per-event cached counts.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Grouper as "group_images_into_events\n(function)"
    participant Dates as "group_datetimes_by_gap\n(util)"
    participant DB as "Database\n(Events, SourceImages, Occurrences, Detections)"
    participant Updater as "update_calculated_fields_for_events\n(deployment.update_calculated_fields)"

    Grouper->>DB: query SourceImage timestamps for deployment
    Grouper->>Dates: call group_datetimes_by_gap(timestamps, max_time_gap, max_event_duration)
    Dates-->>Grouper: return grouped timestamp buckets
    Grouper->>DB: reassign SourceImage.event_id per group (capture reassignment)
    Grouper->>DB: collect event PKs touched before/after reassignment
    Grouper->>DB: update Occurrence.event_id using detections' source_image__event_id (subquery)
    Grouper->>DB: re-track any additionally touched events
    Grouper->>Updater: call update_calculated_fields_for_events(touched_event_pks)
    Updater-->>DB: persist refreshed event/deployment cached fields
    Updater-->>Grouper: confirmation

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through timestamps, split the endless day,

Twenty-four hours now keep long events at bay,
I nudged old occurrences to find fresh nests,
Refreshed the counters, tucked each event to rest,
A little rabbit cheers for tidy time-play!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: capping event duration at 24h for continuous-monitoring deployments, matching the primary objective to prevent multi-month events.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description is comprehensive and well-structured, following the provided template with all required sections.





coderabbitai Bot left a comment


🧹 Nitpick comments (2)
ami/main/tests.py (1)

239-269: LGTM — targeted regression test for the 24h cap.

The test cleanly exercises the new max_event_duration path: a gap-free 3-day stream at 10-minute intervals asserts both (a) the grouping produces ≥3 events (not a single mega-event) and (b) no event spans more than 24h. A few minor, non-blocking notes you may want to address:

  • datetime.datetime(2023, 4, 24, 3, 22, 38) is naive; if this project runs with USE_TZ=True you'll likely get RuntimeWarning: DateTimeField ... received a naive datetime. Consider datetime.timezone.utc to silence it.
  • SourceImage.objects.create(...) omits project; this relies on the deployment → project propagation (as seen elsewhere in this file, e.g., _create_test_source_image sets project= explicitly). Worth confirming that's happening on save for this code path so the test isn't accidentally exercising a project=None capture.
  • Given the 3-day span / 24h cap is deterministic, assertEqual(len(events), 3) would be a stronger regression guard than >= 3.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/tests.py` around lines 239 - 269, The test
test_continuous_monitoring_capped_at_24_hours uses a naive datetime and omits
the SourceImage.project while asserting only >=3 events; update the timestamp to
be timezone-aware (e.g., add tzinfo=datetime.timezone.utc or use Django timezone
utilities) when constructing start, ensure each SourceImage.objects.create
includes the correct project (e.g., project=self.deployment.project or use the
helper _create_test_source_image to preserve project propagation), and tighten
the assertion to assertEqual(len(events), 3) to lock the expected 3-day split
produced by group_images_into_events with max_event_duration set to 24h.
ami/main/models.py (1)

1370-1389: Minor: audit_event_lengths 24h warning is now effectively unreachable under the default cap.

With DEFAULT_MAX_EVENT_DURATION = 24h, group_datetimes_by_gap will split any group whose span exceeds 24h, so events_over_24_hours at Line 1373 should always be 0 for normal callers. This isn't wrong (it still protects against the max_event_duration=None path and legacy data), but consider either:

  • Tying the threshold to the cap that was actually used (pass max_event_duration through to audit_event_lengths and compare against that), or
  • Leaving a short comment noting the warning now primarily catches legacy/pre-cap events until a full regroup is run.

Non-blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/models.py` around lines 1370 - 1389, The 24h check in
audit_event_lengths is effectively unreachable when DEFAULT_MAX_EVENT_DURATION
is 24h because group_datetimes_by_gap already splits spans >24h; update
audit_event_lengths to accept and use the actual max_event_duration (e.g., add a
parameter max_event_duration and compare events_over_duration using that value
instead of a hardcoded 24h) or add a short clarifying comment in
audit_event_lengths noting this check only catches legacy or
max_event_duration=None cases; reference audit_event_lengths,
DEFAULT_MAX_EVENT_DURATION, and group_datetimes_by_gap when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d7c794e7-5950-401a-aedb-8f2f7f300c3d

📥 Commits

Reviewing files that changed from the base of the PR and between ae14c7b and ce0c53d.

📒 Files selected for processing (3)
  • ami/main/models.py
  • ami/main/tests.py
  • ami/utils/dates.py


Copilot AI left a comment


Pull request overview

Adds a duration cap to timestamp grouping so continuous-monitoring deployments don’t collapse into multi-month “mega-events”, reducing downstream DB churn and operational risk during regrouping.

Changes:

  • Add max_event_duration support to group_datetimes_by_gap (with doctest coverage for gap-free streams).
  • Thread max_event_duration into group_images_into_events with a default 24h cap.
  • Add a Django test asserting continuous monitoring is split into events with duration ≤ 24h.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ami/utils/dates.py Adds max_event_duration splitting logic + doctest for continuous streams.
ami/main/models.py Introduces a 24h default cap for group_images_into_events and passes it into the date grouping utility.
ami/main/tests.py Adds a regression test for continuous-monitoring deployments producing capped-duration events.


Comment thread ami/main/models.py
Comment thread ami/main/tests.py Outdated
Addresses Copilot review feedback on #1268.

When regrouping a deployment that already has events, `group_images_into_events` reuses any existing event whose `group_by` date matches the first day of a new group. Before this change, a reused event was saved once early in the loop (while still holding its pre-regroup captures) and never refreshed after later iterations moved captures away to new events. Result: the reused event's cached `start` / `end` / `captures_count` / `detections_count` / `occurrences_count` stayed stale — e.g. a 90-day mega-event being re-grouped under a 24h cap would physically hold only ~24h of captures but still report the 90-day span.

Fix: track every event touched by grouping (both newly-created / reused in this pass, and any event that lost captures to the UPDATE), then call `update_calculated_fields_for_events(pks=...)` in a single pass after the loop to refresh cached fields against each event's final capture set.

This narrowly scoped fix coexists with #904, which is expected to rework the `group_by`-based reuse path more thoroughly.

Tests:
- New `test_regrouping_existing_long_event_refreshes_cached_fields`: creates 3 days of continuous captures, groups once with the cap disabled (one mega-event), then re-groups with the 24h cap and asserts (a) no event exceeds 24h and (b) `sum(captures_count)` across events equals the total image count. Avoids asserting on `group_by` specifics so it remains valid under #904's refactor.
- Existing `test_continuous_monitoring_capped_at_24_hours` refactored to share a `_populate_continuous_captures` helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mihow (Collaborator, Author) commented Apr 24, 2026

Claude says: Pushed 847b1c4 with the follow-up fix + regression test. Summary:

  • Track every event touched by grouping (newly created / reused AND any event that lost captures), then call update_calculated_fields_for_events(pks=...) once after the loop to reconcile cached start / end / captures_count / detections_count / occurrences_count against each event's final capture set.
  • New test test_regrouping_existing_long_event_refreshes_cached_fields exercises the regroup-existing-events path: creates 3 days of continuous captures, groups once with the cap disabled (asserts one mega-event >24h), then re-groups with the 24h cap and asserts (a) no event exceeds 24h and (b) sum(captures_count) across events equals total image count. The latter is what catches the stale-counter regression.
  • Test asserts only on observable cap+refresh behavior — no group_by or reuse-mechanics assertions — so it stays valid under #904's refactor ("Fix incorrect session grouping").

Re the CodeRabbit nits: USE_TZ = False in config/settings/base.py, so the naive-datetime warning isn't actually raised in this project. The project= omission in SourceImage.objects.create(...) mirrors the existing create_captures fixture pattern, so keeping it consistent here.

Resolved the two Copilot threads since the feedback is addressed.


coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/main/tests.py`:
- Around line 255-273: The test test_continuous_monitoring_capped_at_24_hours
currently asserts len(events) >= 3 which allows over-splitting to pass; change
the assertion to require the exact expected count (e.g., assert len(events) ==
3) so that continuous captures over 3 days with
self._populate_continuous_captures(days=3, interval_minutes=10) and
group_images_into_events(..., max_event_duration=datetime.timedelta(hours=24))
must produce exactly three daily events; update the assertion line in the test
to compare against the input days (or compute expected_days) rather than using a
>= check.
- Around line 274-319: The test
test_regrouping_existing_long_event_refreshes_cached_fields currently only
checks the sum of event.captures_count which can mask per-event stale caches;
change it to assert per-event cached counts by iterating
Event.objects.filter(deployment=self.deployment) and for each event compare
event.captures_count to the actual related capture count (e.g.,
event.captures.count() or equivalent queryset count) after calling
group_images_into_events, failing with a message referencing event.pk when they
differ; keep the existing duration assertions and total_captures check but add
this per-event check immediately after re-grouping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 68d610bf-87b4-4591-8c45-0fce12236182

📥 Commits

Reviewing files that changed from the base of the PR and between ce0c53d and 847b1c4.

📒 Files selected for processing (2)
  • ami/main/models.py
  • ami/main/tests.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • ami/main/models.py

Comment thread ami/main/tests.py
Comment thread ami/main/tests.py
- Change `>= 3` to `== 3` in both new tests — 432 captures over 3 days
  at 10 min intervals must yield exactly 3 daily events under the 24h cap.
  Catches over-splitting regressions that `>= 3` would miss.
- Add per-event cached-count check after regroup: compare each event's
  cached `captures_count` against its actual related SourceImage count.
  Offsetting errors can hide stale counters in a sum-only check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator Author

mihow commented Apr 25, 2026

Claude says: Validated end-to-end on staging against a production-shaped data set.

Setup

Cloned the project + deployment + S3 storage source from a continuous-monitoring prod deployment onto staging (no captures copied — staging synced fresh from S3 against a narrowed regex). Result: 357,490 source images spanning Apr–Jul 2023, including 355,148 captures in a contiguous Apr 24 → Jul 23 stream that reproduces the exact pathology this PR addresses.

Before fix (ae14c7b0, main)

regroup_events produced 6 events, including one 90-day, 355,148-capture mega-event (Apr 24 → Jul 23). The task hit the 1h celery soft-time-limit during post-grouping work and was killed.

After fix (ad2ea6ff, this branch)

regroup_events produced 99 events, 0 exceeding 24h:

  • All 357,490 captures assigned, no orphans.
  • Σ event.captures_count == 357,490 — per-event cached counters refreshed correctly across reused events (the case the touched-events tracking targets).
  • Task completed in 89 seconds, vs. main's 1h timeout.

Spot check across the dataset:

| Position | start | end | duration | captures |
| --- | --- | --- | --- | --- |
| First | 2023-04-11 21:36 | 2023-04-11 21:41 | 5m | 2 |
| 2nd | 2023-04-12 03:07 | 2023-04-13 03:06 | 24h | 311 |
| 2nd-last full day | 2023-07-22 04:30 | 2023-07-23 04:30 | 24h | 6,808 |
| Last (trailing) | 2023-07-23 04:31 | 2023-07-23 05:30 | 59m | 297 |

All durations ≤ 24h. The cap is strict-greater-than (> max_event_duration), so events spanning exactly 24h are allowed; the next capture past the boundary opens a new event.
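That boundary behavior can be pinned down with a minimal duration-only splitter (a sketch assuming the strict `span > max_event_duration` comparison described above, not the project's actual code):

```python
from datetime import datetime, timedelta


def split_on_duration(timestamps, cap):
    """Minimal duration-only splitter: a new group opens only when the
    running span strictly exceeds the cap."""
    groups, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[0] > cap:
            groups.append(current)
            current = []
        current.append(ts)
    if current:
        groups.append(current)
    return groups


cap = timedelta(hours=24)
t0 = datetime(2023, 7, 22, 4, 30)
exactly = [t0, t0 + cap]  # span == 24h, no strict excess
past = exactly + [t0 + cap + timedelta(minutes=1)]  # first capture past the boundary
print(len(split_on_duration(exactly, cap)))  # 1: an exactly-24h event is allowed
print(len(split_on_duration(past, cap)))     # 2: the next capture opens a new event
```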

Why the timing flipped

Main spent its time recomputing cached fields on a single 355k-capture event — each Event.save() recomputes start/end/captures_count from related captures. The fix splits the same captures across ~90 events, so each save touches a small set, and the post-loop update_calculated_fields_for_events bulk-updates all touched events in one pass.

Out of scope of this validation

  • Detection / Classification cascade behavior under the bulk event_id update — staging deployment was synced metadata-only.
  • The "redundant retry on celery kill" pathology from the original prod incident (separate concern, not addressed by this PR).

…hes after regroup

Occurrences are bound to an event once at creation time (from
detection.source_image.event in Detection.associate_new_occurrence and
Pipeline.save_results) and were never re-derived afterward. Without this
refresh, a deployment regrouped under the 24h cap kept every occurrence
pointing at its original (pre-cap) event regardless of when its
detections actually fired, breaking every Occurrence.event-keyed query
(occur_det_proj_evt index, Event.occurrences related-name, event_ids=
filter chain).

Realign Occurrence.event_id from each occurrence's first detection's
source_image.event_id, and union the pre- and post-refresh event holders
into touched_event_pks so the existing update_calculated_fields_for_events
call covers both occurrence-losers and -gainers.

Also swap the manual events_count-only update at the end of
group_images_into_events for a full Deployment.update_calculated_fields()
call. The async regroup_events task never went through Deployment.save's
calculated-fields refresh, so the deployment list (occurrences_count,
taxa_count, captures_count, detections_count) showed pre-regroup numbers
until the next save touched the deployment.

Adds test_regrouping_realigns_occurrence_event_id covering the
detection->source_image->event chain post-regroup plus per-event
occurrences_count consistency against the live get_occurrences_count
helper that the cache itself uses.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow
Collaborator Author

mihow commented Apr 25, 2026

Claude says: Pushed 637afce2 extending this PR beyond the cap itself to address two pre-existing data-correctness bugs that the cap will reliably trigger on the affected production deployment (which already has detections + occurrences):

  1. Occurrence.event_id realignment. Pre-fix: occurrences were bound to an event once at creation time and never re-derived, so post-regroup every occurrence kept pointing at its original (pre-cap) event. Now: realigned via Occurrence.event_id = first_detection.source_image.event_id after the SourceImage reassignment, with both pre- and post-refresh event holders unioned into touched_event_pks so update_calculated_fields_for_events covers both occurrence-losers and -gainers (catches the case where an occurrence moves off an event that wasn't otherwise touched this regroup).

  2. Deployment-level cache refresh. The async regroup_events celery task never goes through Deployment.save's calculated-fields path, so the deployment list view (occurrences_count, taxa_count, ...) showed pre-regroup numbers until the next save touched the deployment. Replaced the manual events_count-only update at the end of group_images_into_events with a full deployment.update_calculated_fields(save=True) call.

Added test_regrouping_realigns_occurrence_event_id covering the detection→source_image→event chain post-regroup plus per-event occurrences_count consistency. The PR body now includes a "How to verify post-deploy" section with a copy-paste shell-runnable verification script for each correctness layer plus suggested UI spot-checks.

Local run: ami.main.tests + ami.exports.tests + ami.ml.tests → 249 tests, OK (2 skipped). Awaiting CI.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
ami/main/models.py (1)

1497-1501: Make Occurrence→Event realignment deterministic on timestamp ties.

order_by("source_image__timestamp") alone can pick different rows when multiple detections share the same timestamp. Add stable tie-breakers so regrouping is deterministic.

Suggested change
     deployment_occurrences.update(
         event_id=models.Subquery(
             Detection.objects.filter(occurrence_id=models.OuterRef("pk"))
-            .order_by("source_image__timestamp")
+            .order_by("source_image__timestamp", "source_image_id", "pk")
             .values("source_image__event_id")[:1]
         )
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/models.py` around lines 1497 - 1501, the Subquery for event_id
currently orders only by "source_image__timestamp", which can yield
nondeterministic results when timestamps tie; update the ordering in the
Detection Subquery (the Detection.objects.filter(...).order_by(...).values(...)[:1]
used to build event_id) to include stable tie-breakers such as
"source_image__timestamp", "source_image__id" and/or "id" (detection PK) so that
rows with identical timestamps are deterministically ordered; keep the existing
timestamp first and append one or two unique fields (e.g., "source_image__id",
"id") to the order_by to ensure repeatable realignment.
ami/main/tests.py (2)

362-366: Nit: bbox should use absolute pixel coordinates for consistency with project convention.

Based on learnings from PR #1131, Detection.bbox / BoundingBox values in this repo are absolute pixel coordinates, not normalized [0–1] floats. The values don't affect this test's assertions, but using e.g. [10, 10, 20, 20] keeps test fixtures consistent with the canonical schema.

Based on learnings: "Detection.bbox/BoundingBox values use an absolute pixel coordinate space (not normalized [0–1] floats)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/tests.py` around lines 362 - 366, The test creates a Detection with
a normalized bbox ([0.1,0.1,0.2,0.2]) but project convention uses absolute pixel
coordinates; update the Detection.objects.create call (the Detection model's
bbox field in this test) to use absolute pixel coordinates instead (for example
[10, 10, 20, 20]) so the fixture matches the canonical BoundingBox schema and
other tests.

404-411: Optional: simplify the day-0 event exclusion.

occurrences[0] was already refreshed inside the loop above (L390), so the embedded subquery is unnecessary and a little hard to read. A direct scalar works:

♻️ Proposed simplification
-        non_day0_events = Event.objects.filter(deployment=self.deployment).exclude(
-            pk=Occurrence.objects.filter(pk=occurrences[0].pk).values("event_id")[:1]
-        )
-        assert Occurrence.objects.filter(event__in=non_day0_events).count() >= 1
+        day0_event_id = occurrences[0].event_id
+        non_day0_events = Event.objects.filter(deployment=self.deployment).exclude(pk=day0_event_id)
+        assert Occurrence.objects.filter(event__in=non_day0_events).count() >= 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/main/tests.py` around lines 404 - 411, The exclusion subquery is
unnecessary and hard to read — replace the nested
Occurrence.objects.filter(...).values("event_id")[:1] with the scalar event id
already available on occurrences[0] (e.g., use occurrences[0].event_id or
occurrences[0].event.pk) when building non_day0_events; keep the rest of the
logic the same
(Event.objects.filter(deployment=self.deployment).exclude(pk=<scalar_event_id>)).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/main/models.py`:
- Line 1488: A comment contains an EN DASH character in the text "event_ids=
filters at models.py:4232–4477" which triggers RUF003; replace the en dash (–)
with a standard ASCII hyphen (-) so the comment reads "event_ids= filters at
models.py:4232-4477" to satisfy the linter (update the comment in models.py near
the block referencing event_ids to remove the non-ASCII dash).

In `@ami/main/tests.py`:
- Around line 345-359: The test incorrectly selects targets using
captures_per_day = len(captures) // 3 which can fall exactly on 24h event
boundaries and produce uneven event grouping; change the target selection in
this test that calls group_images_into_events to pick captures by timestamp
offset instead of by index (e.g. find the first SourceImage whose timestamp is
strictly > start + N*24h for N=0,1,2) so each target is unambiguously in the
next day/event, and update the downstream expectations/assertions to match the
new, robust selection; refer to the captures list, the timestamp attribute on
SourceImage, and the group_images_into_events invocation to locate where to
change selection logic and assertions.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 27f3f08e-9160-46f4-ad70-e1b7c4455a24

📥 Commits

Reviewing files that changed from the base of the PR and between 847b1c4 and 637afce.

📒 Files selected for processing (2)
  • ami/main/models.py
  • ami/main/tests.py

- ``models.py:1488``: replace EN DASH (–) with ASCII hyphen so Ruff RUF003
  doesn't flag the comment in strict environments.
- ``tests.py:test_regrouping_realigns_occurrence_event_id``: pick targets
  at mid-day offsets (12h / 36h / 60h) instead of by index. The cap's
  strict ``>`` keeps the capture at exactly 24h offset in the previous
  event, so index-based selection (``captures[len // 3]``) lands two of
  three targets in the same event. Mid-day offsets sit well inside each
  daily event's window, away from any boundary, so the test now actually
  exercises 3-way distribution as it claimed.
- Tighten the realignment assertion to require exactly 3 distinct
  ``event_id`` values across the three occurrences (the corrected
  invariant), replacing the looser ``>= 1`` non-day-0 check.

Co-Authored-By: Claude <noreply@anthropic.com>
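The boundary pitfall described in the commit above can be shown with a small standalone sketch (pure timestamps, an assumed re-implementation of the duration split rather than the repo's utility): with a 10-minute cadence, `captures[len // 3]` is the capture at exactly the 24h offset, which the strict `>` keeps in the previous group, while mid-day offsets land squarely inside distinct daily groups.

```python
from datetime import datetime, timedelta

def group_by_duration(stamps, cap=timedelta(hours=24)):
    """Duration-only split: open a new group once the span exceeds cap
    (strict >, so a capture at exactly cap stays in the current group)."""
    groups, current = [], []
    for ts in sorted(stamps):
        if current and ts - current[0] > cap:
            groups.append(current)
            current = []
        current.append(ts)
    if current:
        groups.append(current)
    return groups

start = datetime(2023, 4, 24)
captures = [start + timedelta(minutes=10 * i) for i in range(6 * 24 * 3)]
groups = group_by_duration(captures)

def group_index(ts):
    return next(i for i, g in enumerate(groups) if ts in g)

# Index-based target selection lands on the 0h, 24h, 48h offsets; the
# exact-24h capture stays in the day-0 group, so two of the three
# targets share an event.
by_index = [captures[len(captures) // 3 * n] for n in range(3)]
# Mid-day offsets (12h / 36h / 60h) sit away from every boundary.
by_offset = [start + timedelta(hours=h) for h in (12, 36, 60)]
```

Asserting three distinct groups for the offset-based targets (and only two for the index-based ones) is exactly the tightened invariant the corrected test checks.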
