
fix(jobs): fixes for concurrent ML processing jobs #1261

Merged
mihow merged 15 commits into main from fix/jobs-row-lock-drop-select-for-update
Apr 23, 2026

Conversation

@mihow (Collaborator) commented Apr 21, 2026

Summary

Tuning PR for async_api row-lock and keep-up contention surfaced under
concurrent ML jobs (issue #1256). Scope is what we can safely change while
PR #1259 (the proper JobLog append-only child table) and #1256 option D
(atomic jsonb_set progress update) land separately.

  1. Drop select_for_update() in _update_job_progress — the worker-side
    serialization point. Keeps transaction.atomic() and the max() guard
    that prevents progress regression.

  2. Add JOB_LOG_PERSIST_ENABLED feature flag (default True) — gates
    the per-record UPDATE in JobLogHandler.emit. Setting to False is a
    time-bounded escape hatch until the append-only JobLog child table
    from PR #1259 (Refactor job logging to use separate table) is in place.

  3. Narrow _update_job_progress's save to update_fields — limits the
    write to progress, status, finished_at, updated_at so a concurrent
    worker's mutation of logs / progress.errors isn't clobbered by a
    stale read-modify-write. Addresses PR review feedback.

  4. Wire NATS_TASK_TTR env var, default 300s — the ack-wait timeout in
    nats_queue.py was reading settings.NATS_TASK_TTR via getattr with a
    30 fallback, but the setting was never declared. Effective ack-wait was
    pinned at 30s everywhere, which is too tight for 24 MB images on GPU
    cold-start. Initially raised to 60s, then to 300s (commit 259c6592) to
    match what staging/demo are actually running — 60s was still tight for
    cold-start GPU pipelines on large images. max_deliver (prod=5, demo=2)
    bounds how long a genuinely crashed worker can hold tasks invisibly.
    See "NATS redelivery amplifies Django contention" below.

  5. Docs + script: docs/claude/debugging/row-lock-contention-reproduction.md
    and scripts/load_test_result_endpoint.py for local reproduction.

  6. Fair worker polling via updated_at — then swapped for random shuffle
    (commits 014b8c0c, 7445f6ea). Worker polling through
    ?ids_only=1&incomplete_only=1 was returning jobs in insertion order, so
    concurrent pollers deterministically converged on the same head-of-queue
    job and later jobs starved. Initial fix was ORDER BY updated_at, pk
    leveraging the fact that updated_at freshens on every progress/log
    write, so active jobs drop to the bottom and starved jobs bubble to the
    top. That has a degenerate case at startup (freshly queued jobs share
    near-identical timestamps, tie-broken by pk, so simultaneous polls
    still pick the same job). Replaced with ORDER BY RANDOM() which gives
    probabilistic disjoint assignment across concurrent pollers without
    writing a poll-stamp column. Fairness is probabilistic rather than
    strictly oldest-first; strict fairness is a follow-up (dedicated
    last_polled_at column with UPDATE ... RETURNING semantics).

  7. Default ?ids_only=1 to limit=1 (commit 7b7dfa5d). Worker
    polling used to get a paginated batch of 10 jobs and drain them
    serially in for job_id in jobs: _process_job(job_id) (see
    ami-data-companion/trapdata/antenna/worker.py:106). The cached list
    caused later jobs in the batch to starve until the earlier jobs
    finished. Changing the ids_only response to length-1 forces a re-poll
    per job, which re-samples the random order every iteration. Callers
    that genuinely want a batch can still pass ?limit=N explicitly —
    back-compat preserved for any non-polling ids_only consumers.
    Implemented by overriding JobViewSet.paginator to return a
    LimitOffsetPagination with default_limit=1 when ids_only=1 is
    set and no explicit limit/page_size is in the query string.
    Combined with the random ordering above, each worker poll becomes an
    independent "pick any unfinished job" draw.

  8. Heartbeat on idle ?ids_only=1 polls (commits 44d4c322, 695f7616).
    _mark_pipeline_pull_services_seen only fires from /tasks/ and
    /result/ — i.e., from workers with active work — so a worker sitting
    on GET /jobs/?ids_only=1 between jobs had no heartbeat path and would
    age past PROCESSING_SERVICE_LAST_SEEN_MAX (60s), flipping its UI
    status to offline despite still polling.

    The initial commit (44d4c32) added a project-scoped heartbeat
    (_mark_async_services_seen_for_project). On demo that heartbeat never
    fired: the real ADC worker request is
    GET /jobs/?pipeline__slug__in=<slugs>&ids_only=1&... with no
    project_id in the query string (one worker serves pipelines across
    multiple projects and has no single project to nominate), so
    get_active_project() returned None and the dispatch silently
    short-circuited.

    Follow-up commit 695f7616 adds a sibling path —
    _mark_async_services_seen_for_pipelines +
    update_async_services_seen_for_pipelines — that derives scope from
    the pipeline__slug__in values in the query and bumps
    last_seen / last_seen_live on every async ProcessingService
    whose pipelines match. The list() heartbeat branch now prefers
    the pipeline-slug path when those slugs are present and falls back to
    project scope for callers that do send ?project_id=. Same Redis gate
    (HEARTBEAT_THROTTLE_SECONDS = 30s), keyed on the sorted slug set
    (heartbeat:list:pipelines:<sorted-slugs>) so concurrent pollers for
    the same pipelines share a single dispatch per window.

    Verified on demo post-deploy: last_seen_live flipped from false to
    true on all three async worker entries (gpu-02, gpu-03, gpu-04)
    within one heartbeat window. Narrowing the scope via
    AMI_ANTENNA_SERVICE_NAME (header-based, precise per-service
    identity) is tracked as #1264 (Narrow idle-poll heartbeat scope via
    X-Processing-Service-Name header).

  9. Disable cachalot around process_nats_pipeline_result (commit 826e2bb2).
    The pipeline-result Celery task runs inside a hot UPDATE path on
    jobs_job and triggers cachalot's post-write invalidation across a
    large set of tables on every call. Wrapping the task body in
    cachalot_disabled() skips those invalidations for this specific code
    path — the task doesn't query the cached tables anyway, so there's
    nothing to invalidate. Cuts per-task overhead in concurrent-result
    bursts. No behavior change to callers.

Reviewer takeaway addressed in 9cee66d8: the ?ids_only=1 branch of
list() (random ordering, limit=1 paginator override, heartbeat dispatch)
has quietly turned this endpoint into a claim-next-job call. Correct shape
is a dedicated GET /jobs/next/ action, tracked as #1265. The three
overrides now carry ⚠️ TEMPORARY HACK — remove by 2026-04-24 markers in
code with back-references to #1265, so the expiration stays visible until
/next ships and ADC migrates.

Related follow-up issue on the worker side:
ami-data-companion#143
— once the /jobs endpoint is out of the way, starvation moves one
layer down to the NATS-task consumption pattern: a worker that has
reserved tasks from job A via max_ack_pending then serializes job B's
batch leaves A's reservations idle until ack_wait expires and NATS
redelivers. Orthogonal to this PR.

Context

Under concurrent async_api result processing, three paths all mutate the
same jobs_job row:

  1. ML celery workers — select_for_update in _update_job_progress.
    (addressed by commit 1 here)
  2. JobLogHandler.emit — UPDATE on jobs_job.logs per log line. The dominant
    bottleneck after commit 1 lands, because the view's per-iter
    job.logger.info("Queued pipeline result: …") fires once per result in a
    batched POST body, stacking N UPDATEs in one ATOMIC_REQUESTS tx.
    (escape hatch via flag in commit 2; proper fix in PR #1259, Refactor job logging to use separate table)
  3. ATOMIC_REQUESTS holding the /result and /tasks view tx across broker
    RPCs. (out of scope — future PR with @transaction.non_atomic_requests)

Commit 1 alone shifts the bottleneck from path 1 to path 2 rather than
eliminating it (observed: blocker_chain grows from ~10 deep to ~56, queue still backs
up). The flag in commit 2 provides a way to actually get the workshop
through the weekend while PR #1259 goes through review.

NATS redelivery amplifies Django contention

With ack_wait=30s pinned, any task that took longer than 30s to finish
(S3 download + GPU cold-start + batch inference + synchronous POST) would
be redelivered by NATS before the original worker got a chance to ack.
Observed in the most recent 6-job load test: one 100-image job had 100 of
100 tasks redelivered — every task ran twice end-to-end.

The feedback loop that matters for this PR: each redelivered task produces
a second POST to /api/v2/jobs/<id>/result/, which enqueues a second
process_nats_pipeline_result celery task, which contends for the same
jobs_job row as the first. So a short ack-wait directly multiplies the
very row-lock load that commits 1-3 are trying to relieve — turning a
throughput shortfall on the ML side into doubled Django-side contention.

Wiring NATS_TASK_TTR and raising the default (60s → 300s across the
branch, final value landed in 259c6592) cuts the redelivery rate;
deployments seeing cold-start cost on 24 MB images can still bump further
via env var. Demo runs the code default (300s).
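
For reference, the wiring is small. The consumer-side getattr below is quoted from the PR; the env.int() shape in base.py is an assumption about how the settings file reads env-backed values:

# config/settings/base.py (sketch; env.int() is an assumed django-environ-style helper)
NATS_TASK_TTR = env.int("NATS_TASK_TTR", default=300)  # JetStream ack-wait, in seconds

# nats_queue.py (this read already existed; before this PR the setting above was
# never declared, so the 30s fallback always won)
from django.conf import settings

ack_wait = getattr(settings, "NATS_TASK_TTR", 30)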

Evidence

py-spy before commit 1: 7/8 ForkPoolWorker threads stuck at

wait (psycopg/connection.py:957)
...
_update_job_progress (ami/jobs/tasks.py:493)   ← select_for_update
process_nats_pipeline_result (ami/jobs/tasks.py:224)

py-spy after commit 1, flag still True: same threads now stuck at

save (ami/jobs/models.py:1088)
emit (ami/jobs/models.py:359)
_update_job_progress (ami/jobs/tasks.py:570)   and from
result (ami/jobs/views.py:392)
tasks (ami/jobs/views.py:330)

Staging-equivalent deployment, 5 concurrent async_api jobs (10/100/500
images, same pipeline):

| signal | baseline | commit 1 only | commit 1 + flag=False |
| --- | --- | --- | --- |
| pg_stat_activity blocker-chain depth | ~10 | 56 | 0 |
| idle_in_tx holding UPDATE jobs_job.logs for 60s+ | yes | yes | no |
| ml_results queue under load | climbs unboundedly | climbs to 200+, does not drain | peaks ~1150, drains to 0 |
| 5-job end state at 500s | stuck / silent FAILURE | 4/5 still at 0% after 210s | 5/5 SUCCESS |

Local reproduction (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POST test at 50 results × 10 concurrent):

| flag | blocker_chain | POST success (120s timeout) | p95 latency |
| --- | --- | --- | --- |
| true (default) | 37 | 0/10 | 120s+ |
| false | 1 (transient) | 20/20 | ~5s |

6-job concurrent retest on demo (commits 1-3, flag=False, NATS_TASK_TTR=30
i.e. unwired): all 6 jobs eventually SUCCESS, confirming commits 1-3 are
correctness-safe under concurrent load. First job solo ran at 1.4 img/s
(356s for 500); subsequent concurrent jobs ran 2.3-11× slower with high
redelivery (j172 = 100/100 tasks redelivered, j171 = 37/100, j173 = 93/500).
Motivates commit 4.

Reproduction steps, prereqs, the Postgres query, and the load script are in
docs/claude/debugging/row-lock-contention-reproduction.md. A
single-result-per-POST curl loop does not reproduce the pathology;
batching results per POST is load-bearing because the per-iter
job.logger.info inside the view is where the UPDATE stacking happens.
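
For orientation, a stripped-down sketch of what the batched-POST driver does. The endpoint path, the $ANTENNA_TOKEN env var, and the batching rationale come from the PR; the payload fields, host, auth-header style, and helper names below are placeholders rather than the real scripts/load_test_result_endpoint.py:

# Sketch only; payload fields and the job id are placeholders, not the real result schema.
import concurrent.futures
import json
import os
import time
import urllib.request

HOST = "http://localhost:8000"       # assumed dev host
TOKEN = os.environ["ANTENNA_TOKEN"]  # keep the token out of argv and shell history
JOB_ID, BATCH, CONCURRENCY = 123, 50, 10

def make_body(batch: int) -> bytes:
    # N results per POST is load-bearing: each one triggers a per-iter
    # job.logger.info inside a single ATOMIC_REQUESTS transaction.
    return json.dumps({"results": [{"index": i} for i in range(batch)]}).encode()

def fire_one(idx: int):
    req = urllib.request.Request(
        f"{HOST}/api/v2/jobs/{JOB_ID}/result/",
        data=make_body(BATCH),
        headers={"Authorization": f"Token {TOKEN}", "Content-Type": "application/json"},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return idx, resp.status, time.monotonic() - start, ""
    except Exception as exc:  # keep the error string for diagnostics
        return idx, -1, time.monotonic() - start, repr(exc)

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for idx, status, elapsed, err in pool.map(fire_one, range(CONCURRENCY)):
        print(idx, status, f"{elapsed:.1f}s", err)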

Trade-offs

Commit 1 — count drift under race. The max() guard still prevents
progress-percentage regression between concurrent workers. Accumulated
counts (detections, classifications, captures) can drift by one batch if
two workers race the read-modify-write cycle. Cosmetic: underlying
Detection and Classification rows are written authoritatively by
save_results(), which runs before _update_job_progress. A fully
atomic version with jsonb_set and server-side arithmetic is possible as a
follow-up (#1256 option D).
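
As a rough sketch of what that guard looks like after commit 1 (helper names such as get_stage/update_stage are illustrative; only the dropped lock, the atomic block, the max() guard, and the update_fields list are taken from the PR):

# Illustrative sketch, not the actual _update_job_progress body.
from django.db import transaction

from ami.jobs.models import Job  # assumed import path

def _update_job_progress(job_id: int, stage: str, new_progress: float) -> None:
    with transaction.atomic():            # kept: atomicity, but no row lock
        job = Job.objects.get(pk=job_id)  # was: Job.objects.select_for_update().get(pk=job_id)
        existing = job.progress.get_stage(stage).progress
        # max() keeps the percentage monotonic even when two workers read the
        # same stale value; accumulated counts can still drift by one batch.
        job.progress.update_stage(stage, progress=max(existing, new_progress))
        job.save(update_fields=["progress", "status", "finished_at", "updated_at"])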

Commit 2 — UI log feed goes blank while flag is off. Container stdout
still captures every line via the module-level app logger, so ops
observability is unchanged. Only the per-job UI log view loses new
entries for the duration the flag is disabled. Default stays True so
existing deployments aren't affected.
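
The gate from commit 2 is roughly the following; the handler's real formatting and the exact shape of the logs field are assumptions:

# Sketch of the flag gate in JobLogHandler.emit, not the exact models.py code.
import logging

from django.conf import settings

class JobLogHandler(logging.Handler):
    def __init__(self, job):
        super().__init__()
        self.job = job

    def emit(self, record: logging.LogRecord) -> None:
        # The module-level app logger still writes every line to container stdout;
        # this flag only gates the per-line UPDATE on jobs_job.logs.
        if not settings.JOB_LOG_PERSIST_ENABLED:
            return
        self.job.logs.append(self.format(record))  # illustrative field shape
        self.job.save(update_fields=["logs", "updated_at"])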

Commit 4 — longer ack-wait delays crash detection. If a worker truly
dies mid-task, up to NATS_TASK_TTR seconds are wasted before redelivery.
At the 300s default this is still acceptable for GPU cold-start on 24 MB
images — the redelivery-amplification cost dominates the crash-detection
latency under the workloads we see. Deployments with smaller payloads or
stricter freshness SLAs can tune down via env. max_deliver (prod=5,
demo=2) is the cap that prevents infinite retries.

Rollout plan

  1. Land this PR on main; keep the flag default True and NATS_TASK_TTR
    default 300. Matches what demo/staging are already running.
  2. Deployments hitting the contention pattern set
    JOB_LOG_PERSIST_ENABLED=false on their Django container env and
    restart. Revert by unsetting or switching to true.
  3. Deployments with extreme cold-start cost can bump NATS_TASK_TTR
    further via env; the 300s default is sized for the 24 MB image +
    H100 cold-start case we've been testing.
  4. When PR #1259 (Refactor job logging to use separate table) lands, the
    default-True log path migrates to the new JobLog table write; the flag
    can be removed in a cleanup PR.
  5. When issue #1256's option D (atomic jsonb_set progress update) lands,
    commit 1's count-drift trade-off goes away.

Tested / verified

Unit tests (local, docker compose -f docker-compose.ci.yml run --rm django python manage.py test ami.jobs.tests.test_jobs):

  • 49/49 tests pass, including the new TestListEndpointHeartbeat
    class covering both the project-scoped and pipeline-slug heartbeat
    paths (7 cases). Also green on CI for the latest SHA.

Local reproduction (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
scripts/load_test_result_endpoint.py at 50 results × 10 concurrent):

  • Commit 1 alone shifts contention from select_for_update to
    JobLogHandler.emit; blocker_chain grows from ~10 to ~37. 0/10
    batched POSTs succeed in 120s.
  • Commit 1 + JOB_LOG_PERSIST_ENABLED=false: blocker_chain ≤ 1
    transient, 20/20 batched POSTs succeed, p95 ~5s. Numbers match the
    evidence table above.

Staging-equivalent deployment (Beluga), 5 concurrent async_api jobs
(10/100/500 images, same pipeline):

  • Baseline (main): 4/5 jobs still at 0% after 210s, idle_in_tx
    holding UPDATE jobs_job.logs for 60s+, ml_results queue climbs
    unboundedly.
  • This branch with flag=False: 5/5 SUCCESS in ~500s, queue peaks
    ~1150 then drains to 0, no idle_in_tx holders. Confirms commits 1-3
    are correctness-safe under concurrent load.

Demo env (demo.antenna.insectai.org):

  • 6-job concurrent retest on demo pre-NATS_TASK_TTR (j171-j176,
    mixed sizes 10/10/100/100/500/500): all 6 jobs eventually SUCCESS
    despite high redelivery (j172 = 100/100 tasks redelivered). Motivated
    commit 4.
  • Branch deployed to demo, all six containers recreated on SHA
    9cee66d8. Verified via curl demo.antenna.insectai.org/api/v2/ml/processing_services/?project_id=27
    that last_seen_live flipped from false (stuck for hours) → true
    on all three async worker entries (gpu-02, gpu-03, gpu-04) within one
    heartbeat window post-deploy — confirms the pipeline-slug heartbeat
    path fires for the real ADC worker request shape.

antenna-dev-serbia + antenna-dev-arctia (full-stack dev benches running
the branch end-to-end):

Not yet verified:

  • Overnight soak under real ADC load on demo.
  • Production rollout with JOB_LOG_PERSIST_ENABLED flipped on (prod
    will default True; the flag-off path is covered by the staging
    5-job test above, not by a prod run).

Related

Summary by CodeRabbit

  • New Features

    • Added a configuration flag to toggle per-job log persistence to the database.
    • Added a CLI tool to load-test the job result endpoint.
    • Adjusted job list behavior: ID-only queries get a tighter default pagination.
  • Performance

    • Reduced database row-lock contention during job progress updates.
    • Prevented unnecessary post-write cache invalidation during pipeline result handling.
  • Documentation

    • Added a runbook for reproducing and debugging row-lock contention and testing the log-persistence flag.

…w-lock contention

Under concurrent async_api result processing, every ML result task
queued a contending exclusive lock on the jobs_job row via
`select_for_update()`, stacking behind gunicorn view threads also
holding the row under ATOMIC_REQUESTS (from per-iter job.logger.info
writes in /result).

Observed in a validation environment under 5 concurrent async_api jobs
(2x10 + 2x100 + 1x500 images, one pipeline): py-spy stacks of
celeryworker_ml showed 7 of 8 ForkPoolWorker threads stuck in
`psycopg wait` -> `_update_job_progress (tasks.py:493)`. pg_stat_activity
consistently showed 17-20 backends "idle in transaction" holding
`SELECT ... FOR UPDATE OF jobs_job` for 0.5-5+ seconds at a time.

The `max()` guard below the dropped lock still prevents progress
regression: two racing workers that both read `progress=0.4` and both
compute a new value still write monotonically via `max(existing, new)`.
The trade-off is that accumulated counts (detections, classifications,
captures) can drift by one batch under a read-modify-write race --
cosmetic only, since the underlying Detection/Classification rows are
written authoritatively by `save_results` before this function runs.

Refs: #1256 (log refactor proposal; this PR targets
the parallel progress-update path, not the log-write path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 02:06

netlify Bot commented Apr 21, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 9cee66d |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/69e966eed2111c000813e463 |


netlify Bot commented Apr 21, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 9cee66d |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/69e966eec38c430009e4056f |


coderabbitai Bot commented Apr 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Decorated the Celery task with @cachalot_disabled(), reduced DB locking in job progress updates and limited saved fields, added a runtime flag to disable per-line job log persistence, introduced NATS task TTR setting, added a load-test script, and added a row-lock contention runbook.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Job Progress / Task (ami/jobs/tasks.py) | process_nats_pipeline_result wrapped with @cachalot_disabled(); _update_job_progress stops using select_for_update() (uses Job.objects.get(pk=...) inside transaction.atomic()), and job.save() narrowed to update_fields=["progress","status","finished_at","updated_at"]. |
| Job Log Persistence & Settings (ami/jobs/models.py, config/settings/base.py) | Added JOB_LOG_PERSIST_ENABLED env flag (default True) and gating in JobLogHandler.emit() to skip per-line jobs_job.logs updates when disabled; always emit to stdout. Added NATS_TASK_TTR setting. |
| Load Testing Script (scripts/load_test_result_endpoint.py) | New CLI main() to concurrently POST batched results to /api/v2/jobs/{job_id}/result/, measure latencies and success counts; includes make_body and fire_one. |
| Views / Pagination (ami/jobs/views.py) | JobViewSet.get_queryset adjusts excludes and conditionally orders; added paginator property to override pagination for list when ?ids_only=1 and no explicit limit/page_size. |
| Documentation / Runbook (docs/claude/debugging/row-lock-contention-reproduction.md) | New runbook reproducing jobs_job row-lock contention, steps to test JOB_LOG_PERSIST_ENABLED=false, before/after signatures, and monitoring guidance. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as "Django API"
    participant Worker as "Celery Worker"
    participant DB as "Postgres DB"
    participant Cachalot as "Cachalot Cache"

    Client->>API: POST /api/v2/jobs/{id}/result/
    API->>Worker: enqueue process_nats_pipeline_result(payload)
    Worker->>Cachalot: enter cachalot_disabled() (suspend post-write invalidation)
    Worker->>DB: BEGIN transaction
    Worker->>DB: SELECT Job (no SELECT FOR UPDATE)
    alt JOB_LOG_PERSIST_ENABLED == true
        Worker->>DB: UPDATE Job.logs and progress/status fields
    else
        Worker->>DB: UPDATE Job.progress/status fields only
    end
    DB-->>Worker: COMMIT
    Worker->>Cachalot: exit cachalot_disabled()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

backend, ml

Poem

🐇
I hopped the cache and skipped a write,
kept rows unblocked through day and night.
A quiet flag to calm the log parade,
a script that hurls requests in cavalcade—
carrots munched, the locks all fade! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'fix(jobs): fixes for concurrent ML processing jobs' is directly related to the main changes in the changeset, which address row-lock contention and concurrency issues in job processing. |
| Description check | ✅ Passed | PR description comprehensively covers all required template sections with detailed context, evidence, trade-offs, rollout plan and testing verification. |




Copilot AI left a comment


Pull request overview

This PR aims to reduce Postgres row-lock contention during concurrent async job result processing by removing a blocking SELECT ... FOR UPDATE from _update_job_progress, keeping the surrounding transaction.atomic() and the monotonic-progress guard logic.

Changes:

  • Replaces Job.objects.select_for_update().get(pk=job_id) with Job.objects.get(pk=job_id) in _update_job_progress.
  • Adds an explanatory comment documenting the contention context and trade-offs.



@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ami/jobs/tasks.py (1)

503-571: ⚠️ Potential issue | 🟠 Major

Avoid stale full-row saves after removing the row lock.

With the plain get() on Line 504, two workers can read the same old progress before either commits. The max() guard on Lines 527-529 then compares against stale in-memory state, so the later job.save() on Line 571 can overwrite a newer 100%/SUCCESS update with an older partial stage. This makes the race broader than count drift; progress, status, and finished_at can regress.

Consider switching this path to an optimistic compare-and-retry update, or a server-side atomic JSONB update, so each writer recomputes against the latest committed row without blocking on select_for_update().

One possible optimistic retry shape
+    max_attempts = 5
+    for attempt in range(max_attempts):
-    with transaction.atomic():
-        job = Job.objects.get(pk=job_id)
+        with transaction.atomic():
+            job = Job.objects.get(pk=job_id)
+            loaded_updated_at = job.updated_at
 
         # ... compute progress/status from the freshly loaded job ...
 
-        job.save()
+            update_values = {
+                "progress": job.progress,
+                "updated_at": datetime.datetime.now(),
+            }
+            if job.progress.is_complete():
+                update_values.update(
+                    status=job.status,
+                    finished_at=job.finished_at,
+                )
+
+            updated = Job.objects.filter(
+                pk=job_id,
+                updated_at=loaded_updated_at,
+            ).update(**update_values)
+            if updated:
+                break
+    else:
+        logger.warning("Failed to update job %s progress after concurrent retries", job_id)
+        return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/tasks.py` around lines 503 - 571, The current plain
Job.objects.get(pk=job_id) allows a stale-read and a later job.save() can
overwrite newer progress; change this to an optimistic compare-and-retry update:
read the Job and existing_stage (as you already do via existing_stage =
job.progress.get_stage(stage)), compute the new progress/state and the JSONB
state_params, then attempt a conditional update via the ORM (e.g.
Job.objects.filter(pk=job_id,
progress__<stage_path>__progress=existing_progress).update(progress=<new_json>,
status=<new_status>, finished_at=..., ...) and only fall back to re-reading and
retrying (a few times) if the update returns 0 rows; keep using the same symbols
(job, existing_stage, existing_progress, passed_progress, state_params,
progress_percentage, complete_state, job.save) so you update against the latest
committed row instead of blindly calling job.save(), or alternatively implement
a server-side JSONB update that atomically merges counts into job.progress.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec3cbb4b-e80d-4503-8932-f82254224fe4

📥 Commits

Reviewing files that changed from the base of the PR and between 0a87a3a and 5067744.

📒 Files selected for processing (1)
  • ami/jobs/tasks.py

mihow added 2 commits April 20, 2026 20:04
…ntion

Under concurrent async_api load the per-record UPDATE on jobs_job.logs
becomes the dominant bottleneck after the select_for_update fix in the
previous commit. Observed on demo: 5 concurrent jobs produced a 56-deep
pg_blocking_pids chain, all waiting on UPDATE jobs_job SET logs = ...
from either /result (view, under ATOMIC_REQUESTS) or _update_job_progress
(ml worker).

Short-circuit emit to the module-level logger only (container stdout).
The per-job UI log feed goes blank while this hotfix is active; revisit
once PR #1259 lands the append-only JobLog child table.

Replace the unconditional noop of JobLogHandler.emit from the previous
commit with a feature flag (default True, preserves existing behavior).
Deployments hitting row-lock contention on jobs_job can set
JOB_LOG_PERSIST_ENABLED=false to short-circuit the per-record UPDATE
until PR #1259's append-only JobLog table is in place.

Validated locally (dev compose, WEB_CONCURRENCY=1, 8 ml fork workers,
batched POSTs at 50 results × 10 concurrent):

  flag=true:  blocker_chain=37, 0/10 POSTs complete (120s timeout)
  flag=false: blocker_chain=1 (trivial), 20/20 POSTs complete, p95=5.5s

Also add:
- scripts/load_test_result_endpoint.py — standalone batched-POST driver.
  A single-result-per-POST curl loop does NOT reproduce the contention
  because it skips the per-iter job.logger.info call inside one
  ATOMIC_REQUESTS tx. Batching N>1 results per POST is load-bearing.
- docs/claude/debugging/row-lock-contention-reproduction.md — full
  runbook: pathology, prereqs, steps, pg_stat_activity queries,
  before/after signal table, flag usage.
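
The settings-side declaration is a one-liner; assuming base.py reads env-backed flags the same way as its other booleans, it would look roughly like:

# config/settings/base.py (sketch; env.bool() is an assumed django-environ-style helper)
# Purpose: keep persisting job log lines into jobs_job.logs (default True), or skip
# that per-line UPDATE as a temporary escape hatch under row-lock contention.
JOB_LOG_PERSIST_ENABLED = env.bool("JOB_LOG_PERSIST_ENABLED", default=True)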

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/load_test_result_endpoint.py`:
- Around line 14-18: The script currently accepts the auth token as a positional
argument and lacks validation for numeric flags and the host URL; update the CLI
parsing in load_test_result_endpoint.py (e.g., in main or parse_args) to make
token optional as a --token flag and prefer reading it from an environment
variable or secure prompt when not provided (do not keep it in argv or
prints/logs), validate that --batch, --concurrency, and --rounds are integers >
0 and fail fast with a clear error message and non-zero exit if not, and
validate --host with a strict HTTP(S) URL regex/urllib.parse check before
calling urlopen (reject non-http(s) schemes); also ensure any logging or
exception messages never include the token string.
- Around line 42-56: The fire_one function currently returns (idx, status,
elapsed) and uses -1 for generic exceptions; change it to return a fourth
element carrying error details (e.g., (idx, status, elapsed, error_str)) by
catching exceptions as e in both urllib.error.HTTPError and the generic except
and using str(e) or repr(e) (use empty string or None when there is no error).
Then update the result unpacking where results are processed (previously
unpacking idx, status, elapsed) to accept the extra value (e.g., idx, status,
elapsed, err) and handle or log err for diagnostics without altering the
happy-path behavior. Ensure type hints and any downstream uses expect the new
fourth element.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d21f484-b804-4141-8bdc-ef058fa0a4a2

📥 Commits

Reviewing files that changed from the base of the PR and between 5067744 and 3e20041.

📒 Files selected for processing (4)
  • ami/jobs/models.py
  • config/settings/base.py
  • docs/claude/debugging/row-lock-contention-reproduction.md
  • scripts/load_test_result_endpoint.py
✅ Files skipped from review due to trivial changes (1)
  • docs/claude/debugging/row-lock-contention-reproduction.md

mihow pushed a commit that referenced this pull request Apr 21, 2026
Replaces the jobs_job.logs JSON-field UPDATE path in JobLogHandler.emit
with an INSERT on a new JobLog child table. Under concurrent async_api
load every log line triggered UPDATE jobs_job SET logs = ... on the
shared job row; inside ATOMIC_REQUESTS a single batched /result POST
stacked N such UPDATEs in one tx, blocking ML workers on the same row
(issue #1256).

- JobLog model + migration 0021_joblog (FK job, level, message, created_at)
- JobLogHandler.emit -> JobLog.objects.create (gated by JOB_LOG_PERSIST_ENABLED
  flag introduced in PR #1261)
- JobSerializer.logs -> SerializerMethodField that reads JobLog rows on
  detail endpoint and falls back to the legacy jobs_job.logs JSON for
  jobs created before this table existed. List endpoint keeps the cheap
  legacy shape to avoid N+1.
- No UI changes, no new endpoint. On-wire {stdout, stderr} shape preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow changed the title fix(jobs): drop select_for_update in _update_job_progress to break row-lock contention fix(jobs): disable job logging behind feature flag Apr 21, 2026
Under concurrent async_api load the ML result task is a pure write path:
save_results inserts Detection/Classification rows and _update_job_progress
UPDATEs jobs_job. Cachalot's post-write invalidation added ~2.5s of
overhead per task in production measurement (issue #1256 Path 4), which
throttles the ml_results queue drain when ADC POSTs arrive faster than
Celery can clear them.

Nothing inside the task benefits from the query cache — every read is of
a freshly-written row — so skipping invalidation is strictly a throughput
win. Decorator stack: @celery_app.task wraps the cachalot-wrapped
function, so Celery sees the context manager enter/exit on each task run.

This was prototyped on demo during the 2026-04-20 investigation but
never committed (demo was reset to clean main before PR #1261 was
opened). Restoring it here closes the last documented ML-path
throughput regression.
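
In context-manager form, the change is roughly the following sketch. cachalot_disabled comes from django-cachalot's public API; the Celery app import path, the task options, and the helper holding the real body are placeholders:

# Sketch only; the real task body is elided into a hypothetical helper.
from cachalot.api import cachalot_disabled

from config.celery_app import app as celery_app  # assumed Celery app location

@celery_app.task
def process_nats_pipeline_result(payload: dict) -> None:
    # This path only writes freshly created rows, so it gains nothing from the
    # query cache; disabling cachalot here skips the post-write invalidation
    # fan-out across cached tables on every call.
    with cachalot_disabled():
        _handle_pipeline_result(payload)  # hypothetical helper standing in for the real body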

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
ami/jobs/tasks.py (1)

245-246: Minor: Job.objects.get(pk=job_id) happens twice in the task path.

Line 245 fetches job right after _update_job_progress already fetched it inside its atomic block (line 514). Not a correctness issue, but since _update_job_progress runs every batch in hot async_api paths, consider either returning the refreshed job from _update_job_progress or deferring the fetch until after the SREM on line 287 — one less round-trip per result. Deferrable; flagging only as optional cleanup in line with the PR's throughput focus.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/tasks.py` around lines 245 - 246, The code calls
Job.objects.get(pk=job_id) twice; to avoid an extra DB round-trip, modify the
_update_job_progress function to return the refreshed Job instance and use that
returned job in process_pipeline_result (or alternatively remove the early
Job.objects.get in process_pipeline_result and defer fetching the Job until
after the Redis SREM operation), referencing the existing _update_job_progress
and the current Job.objects.get usage so you reuse the refreshed job rather than
performing a second query.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/jobs/tasks.py`:
- Around line 502-514: The current removal of select_for_update() risks
overwriting unrelated JSON fields on the Job.progress object because job.save()
writes all fields; update the _update_job_progress flow to re-read the latest
job.progress (and specifically progress.errors) inside the transaction.atomic()
block and then call job.save(update_fields=["progress", "status",
"finished_at"]) instead of a full save to limit last-write-wins to only the
intended fields; this change should reference the Job.objects.get(pk=job_id)
read, the subsequent mutation of job.progress, and the job.save() call, and
ensure _reconcile_lost_images’ appends to progress.errors are reconciled by
reading progress.errors after acquiring the atomic block before serializing.
- Around line 162-171: The decorator `@cachalot_disabled`() is invalid because
cachalot_disabled is only a context manager; remove the decorator from the
process_nats_pipeline_result definition and wrap the function body with the
cachalot_disabled context (i.e., use with cachalot_disabled: around the existing
logic inside process_nats_pipeline_result), ensuring you import
cachalot_disabled from django_cachalot if not already present and preserve the
function signature and return behavior.

---

Nitpick comments:
In `@ami/jobs/tasks.py`:
- Around line 245-246: The code calls Job.objects.get(pk=job_id) twice; to avoid
an extra DB round-trip, modify the _update_job_progress function to return the
refreshed Job instance and use that returned job in process_pipeline_result (or
alternatively remove the early Job.objects.get in process_pipeline_result and
defer fetching the Job until after the Redis SREM operation), referencing the
existing _update_job_progress and the current Job.objects.get usage so you reuse
the refreshed job rather than performing a second query.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a1dfd256-87a8-4ef7-966f-1cac1eeb957a

📥 Commits

Reviewing files that changed from the base of the PR and between 3e20041 and 826e2bb.

📒 Files selected for processing (1)
  • ami/jobs/tasks.py

- tasks.py: remove reference to non-existent planning doc; point at
  issue #1256 and PR #1261 instead (Copilot finding).
- load_test_result_endpoint.py: move token to $ANTENNA_TOKEN env var
  (or optional --token) to keep secrets out of argv/shell history;
  validate batch/concurrency/rounds > 0; require http(s) scheme on
  --host; capture error type+reason per-request instead of returning
  bare -1 (CodeRabbit findings).
- docs: update the repro runbook to match the new CLI shape.

Co-Authored-By: Claude <noreply@anthropic.com>
Previously `job.save()` (no update_fields) wrote the entire row, which under
concurrent async_api load meant a stale read-modify-write could clobber
unrelated fields mutated by a concurrent worker — most notably
`progress.errors` (appended by `_reconcile_lost_images`) and `logs`
(appended by JobLogHandler.emit when persistence is enabled).

Narrow the write to the fields this function actually mutates
(progress, status, finished_at) plus updated_at so auto_now still fires.
The underlying read-modify-write race remains — this only limits blast
radius, not the race itself. The counter-drift and `max()` guard semantics
discussed in the block comment above are unchanged.

Address PR #1261 CodeRabbit review feedback on ami/jobs/tasks.py:514.

Co-Authored-By: Claude <noreply@anthropic.com>
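
In isolation, the narrowed call is just the following; the field list is quoted from the commit, and the comment records the auto_now nuance discussed in review:

# updated_at must be listed explicitly: with update_fields, Django only writes
# auto_now values to the DB for fields that appear in the list.
job.save(update_fields=["progress", "status", "finished_at", "updated_at"])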
nats_queue.py already reads getattr(settings, "NATS_TASK_TTR", 30), but
the setting was never declared in base.py — so the ack_wait was pinned
at 30s everywhere regardless of env. Wire it through and raise the
default to 60s: 30s is too tight for 24 MB images + GPU cold start +
synchronous result POST, and caused spurious NATS redeliveries under
concurrent load.

Per-env override via NATS_TASK_TTR (demo bumps this further to 120s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mihow and others added 3 commits April 21, 2026 02:43
Workers polling GET /api/v2/jobs/?ids_only=1&incomplete_only=1 currently
get jobs in PK order (oldest first), so N concurrent pollers all race to
the same first job. Under the workshop load (4 pollers × 4 concurrent
jobs) this caused the drain-one-job-at-a-time pattern observed in the
6-job retest: j192 starved at 0% for ~500s while all pollers hammered
j191, and NATS nearly exhausted max_deliver=2 on j192.

Order ids_only responses by updated_at ASC (least-recently-touched
first). updated_at freshens on every progress write, log write, or
status transition, so actively-worked jobs drop to the bottom and
starved jobs bubble to the top — rotation without a dedicated
last_polled_at field.

Fairness is a property of the serving endpoint, not of the worker.
Third-party workers with their own consumption patterns should still
see evenly-distributed handoffs; keeping the scheduling logic in
Antenna rather than ADC preserves that guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Worker polling (?ids_only=1) now hands back one least-recently-touched
job per request rather than the default paginated batch of 10. Concurrent
workers were caching the full list and draining it serially, which
starved later jobs until the cache drained. Forcing a re-poll per job
lets the updated_at fairness sort rotate work across jobs every
iteration.

Callers who want a batch can still pass ?limit=N or ?page_size=N
explicitly — back-compat preserved for the few non-polling ids_only
consumers.

No worker-side change required: existing clients iterate a list of zero
or more and work identically with a list of length one.

The updated_at sort had a degenerate case: freshly queued jobs all share
near-identical updated_at (ties broken by pk), so simultaneous pollers
deterministically pick the same "oldest" job. Replacing the sort with
ORDER BY RANDOM() gives probabilistic disjoint assignment across
concurrent pollers without writing a poll-stamp column.

Combined with the limit=1 default on ?ids_only=1, each worker poll is
an independent "pick any unfinished job" draw. Fairness is probabilistic
rather than strictly oldest-first, but that's an acceptable trade given
we want to avoid write-on-read.

If we later decide to enforce strict fairness, the right follow-up is a
dedicated last_polled_at column with UPDATE ... RETURNING semantics, not
a return to the updated_at sort.
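
Put together with the limit=1 default from the previous commit, the list() overrides sketch out roughly as below. url_boolean_param and LimitOffsetPaginationWithPermissions are names taken from the review comments; the base-class name and exact structure are assumptions:

# Sketch of the temporary ids_only polling overrides in JobViewSet
# (slated for removal once a dedicated /jobs/next/ action ships, #1265).
class JobViewSet(DefaultViewSet):  # base-class name assumed
    def get_queryset(self):
        qs = super().get_queryset()
        if self.action == "list" and url_boolean_param(self.request, "ids_only", default=False):
            # Django's order_by("?") is ORDER BY RANDOM(): each concurrent poller
            # draws a random unfinished job instead of racing to the head of the queue.
            qs = qs.order_by("?")
        return qs

    @property
    def paginator(self):
        paginator = super().paginator
        if (
            self.action == "list"
            and url_boolean_param(self.request, "ids_only", default=False)
            and "limit" not in self.request.query_params
            and "page_size" not in self.request.query_params
        ):
            paginator = LimitOffsetPaginationWithPermissions()
            paginator.default_limit = 1  # force a re-poll per job; explicit ?limit=N is still honored
        return paginator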

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/jobs/views.py`:
- Around line 256-263: The fairness logic that orders by updated_at in the
ids-only polling path (inside the view handling self.action == "list" with
url_boolean_param(..., "ids_only")) breaks when JOB_LOG_PERSIST_ENABLED is False
because updated_at won't be refreshed by log writes; modify the handler so that
when ids_only is requested you perform an explicit lightweight poll
marker/reservation update (e.g., touch a last_polled_at or reservation timestamp
on the Job row or a dedicated PollMarker record) before returning IDs, or
alternatively gate the updated_at ordering behind JOB_LOG_PERSIST_ENABLED and
document that rotational fairness only applies when job-log persistence is
enabled; ensure you update the same queryset logic
(jobs.order_by("updated_at","pk")) and any transaction/locking around the
reservation to avoid races.
- Around line 268-285: The check that decides whether to force a single-item
paginator mistakenly looks for "page_size" in request.query_params; update the
conditional in the paginator initialization branch to check for "limit" instead
of "page_size" so LimitOffsetPaginationWithPermissions (and any DRF
LimitOffsetPagination-based pagination_class) behaves correctly when callers
pass ?ids_only=1&limit=N. Specifically, modify the conditional that sets
self._paginator (the block that creates LimitOffsetPaginationWithPermissions and
sets default_limit = 1) to verify "limit" not "page_size", keeping all other
logic (self.action == "list", url_boolean_param(self.request, "ids_only",
default=False), and "limit" not in self.request.query_params) intact and
referencing pagination_class/self._paginator as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 08ec2549-c2e2-4464-a0bd-7bf6cfa1035f

📥 Commits

Reviewing files that changed from the base of the PR and between 39f734a and 7b7dfa5.

📒 Files selected for processing (1)
  • ami/jobs/views.py

Existing test_list_jobs_with_ids_only now passes ?limit=10 explicitly to
assert the "list all" behavior. New test_list_jobs_ids_only_pops_one
covers the new default where ?ids_only=1 returns one job to support
pop()-style polling handoff across workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
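
In rough shape, the two tests read like this; the request path is from the PR, while the client/fixture setup and response shape are assumptions based on the existing jobs test class:

# Sketch; assumes the surrounding APITestCase already creates a project, several
# jobs, and an authenticated self.client, as the existing jobs tests do.
def test_list_jobs_with_ids_only(self):
    # Explicit ?limit= keeps the old "list all" behavior for non-polling callers.
    response = self.client.get("/api/v2/jobs/", {"ids_only": 1, "limit": 10})
    self.assertEqual(response.status_code, 200)
    self.assertGreater(len(response.json()["results"]), 1)

def test_list_jobs_ids_only_pops_one(self):
    # New default: ids_only polling hands back a single job per request.
    response = self.client.get("/api/v2/jobs/", {"ids_only": 1})
    self.assertEqual(response.status_code, 200)
    self.assertEqual(len(response.json()["results"]), 1)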
@mihow mihow changed the title fix(jobs): disable job logging behind feature flag fix(jobs): fixes for concurrent ML processing jobs Apr 21, 2026
The pipeline-scoped heartbeat in _mark_pipeline_pull_services_seen only
fires when a worker hits /tasks/ or /result/ on an active job, so workers
idling on GET /jobs/?ids_only=1 between jobs had no heartbeat path. Their
last_seen would age past PROCESSING_SERVICE_LAST_SEEN_MAX and the UI
would flip them offline despite being actively online.

Add _mark_async_services_seen_for_project + matching celery task that
marks every async ProcessingService attached to the polling project as
seen. Same Redis throttle pattern as the existing heartbeat
(HEARTBEAT_THROTTLE_SECONDS = 30s), scoped by project rather than
(pipeline, project) since the list endpoint has no pipeline context.
mihow and others added 2 commits April 21, 2026 17:03
The 60s default worked on push-mode services but was still too tight for
pull-mode ADC workers running cold-start GPU pipelines on 24 MB images
(staging observed a nontrivial redelivery rate at 60s; demo held steady
only after being bumped to 120s, then 300s). Raise the default to 300s
so new deployments inherit the ceiling that staging/demo actually run
at. max_deliver still bounds how long a genuinely crashed worker can
hold tasks invisibly.

…s sent

The project-scoped heartbeat added in 44d4c32 never fires for the actual
ADC worker request: ``GET /api/v2/jobs/?pipeline__slug__in=...&ids_only=1``
does not carry a ``project_id``, so ``get_active_project()`` returns None
and the helper short-circuits. Result: async processing services remained
``last_seen_live=null`` forever on demo despite workers polling every 5s.

The commit's original rationale — that the list endpoint has no pipeline
context — was wrong: the worker's filter slugs are right there in the
query string. Add a sibling path that derives scope from those slugs:

- ``update_async_services_seen_for_pipelines(pipeline_slugs)`` celery task
  bumps ``last_seen`` on async services whose pipelines match any slug.
- ``_mark_async_services_seen_for_pipelines`` throttles dispatch via a
  Redis key scoped on the sorted slug set.
- The ``list()`` heartbeat branch prefers the pipeline-slug path when
  those slugs are present, and falls back to the project-scoped path for
  callers that send ``?project_id=``.

Tests cover the real worker shape (no project_id, slugs in query) and the
task's filter semantics against sync + async + unrelated services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
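
A rough sketch of the throttled dispatch wrapper: the function and task names and the cache-key format are from the PR text, while the cache.add()-based gate is an assumption about how the 30s throttle is implemented:

# Sketch; the real throttle may use a different Redis primitive.
from django.core.cache import cache

HEARTBEAT_THROTTLE_SECONDS = 30

def _mark_async_services_seen_for_pipelines(pipeline_slugs: list[str]) -> None:
    """Redis-throttled wrapper: at most one celery dispatch per slug set per window."""
    slugs = sorted(set(pipeline_slugs))
    key = "heartbeat:list:pipelines:" + ",".join(slugs)
    # cache.add() only succeeds for the first caller in each window, so concurrent
    # pollers for the same pipelines share a single task dispatch.
    if cache.add(key, 1, timeout=HEARTBEAT_THROTTLE_SECONDS):
        update_async_services_seen_for_pipelines.delay(slugs)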
Response to reviewer comments on the async_api concurrency tuning PR:

- views.py: mark the three `ids_only=1` overrides (random ordering,
  paginator default_limit=1, heartbeat dispatch) as temporary — they exist
  only because list() is doubling as a claim-next-job endpoint. A dedicated
  `/next` action is tracked as #1265; comments carry a hard-coded removal
  target of 2026-04-24 to keep the expiration visible.
- views.py: rewrite the `_mark_async_services_seen_for_pipelines` and
  `_mark_async_services_seen_for_project` docstrings so the contract is
  obvious — they are Redis-throttled wrappers around the matching celery
  task. No DB work in the wrapper.
- tasks.py: add TODO on both `update_async_services_seen_*` tasks pointing
  at #1194 (client-ID auth), so the "one poller falsely marks its peers
  live" invariant-break is visible in source, not just in the PR body.
- tasks.py: fix the stale comment in `_update_job_progress` — the previous
  wording claimed the narrowed save() avoids overwriting `updated_at`, but
  `updated_at` is in `update_fields` intentionally (Django skips auto_now
  when update_fields is provided).
- config/settings/base.py: add a one-line plain-language purpose for
  `JOB_LOG_PERSIST_ENABLED` so the flag reads at-a-glance.

No behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mihow mihow merged commit ae14c7b into main Apr 23, 2026
7 checks passed
@mihow mihow deleted the fix/jobs-row-lock-drop-select-for-update branch April 23, 2026 00:41