fix(jobs): fixes for concurrent ML processing jobs #1261
Merged
15 commits:

- `5067744` fix(jobs): drop select_for_update in _update_job_progress to break ro… (mihow)
- `173d1d4` fix(jobs): short-circuit JobLogHandler.emit to break log-UPDATE conte… (mihow)
- `3e20041` feat(jobs): add JOB_LOG_PERSIST_ENABLED flag + local repro docs (mihow)
- `826e2bb` perf(jobs): wrap process_nats_pipeline_result with cachalot_disabled (mihow)
- `9d15813` chore(jobs): address PR #1261 review feedback (mihow)
- `ef78fb5` fix(jobs): narrow job.save() to update_fields in _update_job_progress (mihow)
- `39f734a` feat(settings): wire NATS_TASK_TTR env var with 60s default (mihow)
- `014b8c0` feat(jobs): fair worker polling via last-touched ordering (mihow)
- `7b7dfa5` feat(jobs): default ids_only to limit=1 for pop()-style polling (mihow)
- `7445f6e` fix(jobs): randomize ids_only ordering instead of updated_at sort (mihow)
- `aeca6c1` test(jobs): update ids_only test for pop()-style default (mihow)
- `44d4c32` feat(jobs): heartbeat on idle ids_only polls (mihow)
- `259c659` feat(settings): raise NATS_TASK_TTR default to 300s (mihow)
- `695f761` fix(jobs): heartbeat on ids_only polls when only pipeline__slug__in i… (mihow)
- `9cee66d` docs(jobs): address PR #1261 review feedback (mihow)
`docs/claude/debugging/row-lock-contention-reproduction.md` (172 additions, 0 deletions)
# Reproducing the `jobs_job` row-lock contention locally

Runbook for reproducing, on a local dev stack, the row-lock contention that
affects concurrent `async_api` ML jobs. Context: issue #1256, PR #1261, and
PR #1259 (complementary `JobLog` table refactor).

**Why this matters.** Naive repro attempts with a `curl` loop that fires one
result per POST (`{"results": [{...}]}`) do NOT trigger the pathology. They
only exercise the worker-side `select_for_update` path, which is fixed once
PR #1261 lands. The dominant remaining bottleneck is per-result logging
inside `ATOMIC_REQUESTS` — to see it locally you need **batched POSTs** that
match the real ADC shape (`AMI_LOCALIZATION_BATCH_SIZE=4`,
`AMI_CLASSIFICATION_BATCH_SIZE=150`).

## The pathology

Two mutating paths UPDATE the `jobs_job` row for every log line written via
`job.logger.info(...)`:

1. **View path** (`ami/jobs/views.py` — `result` and `tasks` actions): the
   per-iteration `job.logger.info("Queued pipeline result: ...")` inside the
   POST body loop runs under `ATOMIC_REQUESTS`. A single batched POST with N
   results therefore stacks N UPDATEs on `jobs_job.logs` inside one tx that
   doesn't commit until the view returns. Every other writer on the same row
   (other worker tasks, other POST handlers) blocks behind it.
2. **Worker path** (`ami/jobs/tasks.py` — `_update_job_progress`): each
   `process_nats_pipeline_result` celery task calls `_update_job_progress`,
   which emits its own log lines, each triggering another UPDATE on the same
   row.

The smoking gun in `pg_stat_activity`:

- Root blocker: a backend with `state = idle in transaction`, last query
  `UPDATE "jobs_job" SET "logs" = ...`, held for many seconds.
- Waiters: dozens of backends with `wait_event_type = Lock`,
  `wait_event = tuple` or `transactionid`, all on the same row.
## Prereqs

- Local antenna stack up via the standard dev compose
  (`docker compose up -d`) with postgres, redis, rabbitmq, nats, django,
  celeryworker, and celeryworker_ml healthy.
- A job in a running state (any `async_api` job with `status = STARTED` will
  do — the view accepts results regardless of whether real tasks exist).
- An auth token for a user with permission to POST to
  `/api/v2/jobs/{id}/result/`.
- Python 3.10+ on the host (the load-test script uses stdlib only).

## Scripts

- `scripts/load_test_result_endpoint.py` — fires concurrent batched POSTs.
- `ami/jobs/management/commands/chaos_monkey.py` — adjacent tooling for
  `async_api` chaos scenarios; covered in `chaos-scenarios.md`.
## Step-by-step

### 1. Grab an auth token and a target job

From a shell on the host:

```bash
docker compose exec -T django python manage.py shell <<'PY'
from rest_framework.authtoken.models import Token
from ami.users.models import User
from ami.jobs.models import Job

u = User.objects.filter(is_staff=True).first()
t, _ = Token.objects.get_or_create(user=u)
print("TOKEN=", t.key)

j = Job.objects.filter(status="STARTED", dispatch_mode="async_api").first()
if j is None:
    # Any running job works — create one if there isn't one.
    # Adjust project/collection/pipeline PKs to your local data.
    print("No running async_api job found; create one via the UI or shell.")
else:
    print("JOB_ID=", j.pk)
PY
```

If no running job exists, create one with whatever project/collection/pipeline
are seeded locally. The view does not need real tasks queued behind the
job — it only needs the job row to accept result POSTs.
### 2. Fire batched POSTs

```bash
python scripts/load_test_result_endpoint.py <JOB_ID> <TOKEN> \
    --batch 50 --concurrency 10 --rounds 3
```

`--batch 50` puts 50 `PipelineResultsError` entries in each POST body. Any
batch size >1 will stack UPDATEs; 50 is a comfortable reproduction size
because it makes each POST's tx hold long enough for others to pile up.
`--concurrency 10` fires 10 parallel POSTs per wave. `--rounds 3` fires
three back-to-back waves.
### 3. Monitor Postgres during the test

In a second shell:

```bash
docker exec <postgres-container> psql -U <user> -d <db> <<'SQL'
-- Scalars
SELECT count(*) AS idle_in_tx
FROM pg_stat_activity
WHERE datname = current_database() AND state = 'idle in transaction';

SELECT count(*) AS blocker_chain
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock'
  AND blocked.datname = current_database();

-- Top offenders
SELECT state, wait_event,
       substring(query, 1, 80),
       EXTRACT(EPOCH FROM now() - xact_start) AS xact_age_s
FROM pg_stat_activity
WHERE datname = current_database()
  AND state != 'idle'
  AND (state = 'idle in transaction' OR wait_event_type = 'Lock')
ORDER BY xact_start NULLS LAST
LIMIT 20;
SQL
```
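If you'd rather not babysit a psql prompt, a tiny stdlib poller — a hypothetical convenience, not part of the PR — can re-run the same command on an interval and collect its output:

```python
import subprocess
import time

def watch(cmd: str, interval: float = 1.0, rounds: int = 5) -> list[str]:
    """Run `cmd` through the shell `rounds` times, `interval` seconds apart,
    returning each run's stripped stdout."""
    outputs = []
    for _ in range(rounds):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        outputs.append(proc.stdout.strip())
        time.sleep(interval)
    return outputs
```

Point it at a `docker exec <postgres-container> psql ... -tAc "<one of the scalar queries>"` invocation with `rounds` covering the test duration, and you get a per-second time series of `idle_in_tx` or `blocker_chain` counts.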
### 4. Before/after signatures

Measured on a local dev stack with `WEB_CONCURRENCY=1` (gunicorn default) and
8 celery ML-fork workers, batch=50, concurrency=10.

| Signal | PR #1261 only (`JOB_LOG_PERSIST_ENABLED=true`) | PR #1261 + flag off (`JOB_LOG_PERSIST_ENABLED=false`) |
|---|---|---|
| `blocker_chain` count | 30+ | 0–1 (transient) |
| `idle_in_tx` count | 8–10 | 0 |
| Root-blocker query | `UPDATE jobs_job SET logs = ...` held 2–60s | transient `SELECT`s only |
| POST success (10 concurrent × 50-result batch, 120s timeout) | 0/10 (all timeout) | 10/10 |
| p95 POST latency | 120s+ | ~5s |
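For reference, the p50/p95 figures are the plain index-based percentiles the load-test script computes per round (sort and index, no interpolation), sketched standalone:

```python
def round_percentiles(latencies: list[float]) -> tuple[float, float]:
    """p50/p95 the way scripts/load_test_result_endpoint.py computes them:
    sorted-list indexing with no interpolation. For 10 samples per round,
    p95 is effectively the slowest POST."""
    xs = sorted(latencies)
    return xs[len(xs) // 2], xs[int(len(xs) * 0.95)]

# One saturated round: nine fast POSTs and one that hit the 120s timeout.
print(round_percentiles([0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.4, 2.0, 3.5, 120.0]))
# → (1.1, 120.0)
```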
## The feature flag

Setting `JOB_LOG_PERSIST_ENABLED=false` (env var on the Django container)
causes `JobLogHandler.emit` to write only to the container stdout logger and
skip the per-record UPDATE on `jobs_job.logs`. The per-job UI log feed
stops receiving new entries while the flag is off; container stdout still
captures everything.

Default is `true` — existing deployments keep their current behavior. The
flag is intended as a time-bounded escape hatch until the append-only
`JobLog` child table from PR #1259 is in place.

To test the flag locally, append `JOB_LOG_PERSIST_ENABLED=false` to the
django env file used by your compose (e.g. `.envs/.local/.django`) and
recreate the django container (`docker compose up -d --force-recreate
django`). Verify from a shell:

```bash
docker compose exec -T django python -c \
  "from django.conf import settings; print(settings.JOB_LOG_PERSIST_ENABLED)"
```
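The short-circuit shape, sketched with stand-ins — the real handler lives in the ami codebase; here `persist` plays the role of the settings flag and `append_log` is a hypothetical persistence hook (one UPDATE on `jobs_job.logs` per call):

```python
import logging

class JobLogHandler(logging.Handler):
    """Illustrative sketch, not the ami implementation."""

    def __init__(self, job, persist=True):
        super().__init__()
        self.job = job
        self.persist = persist  # stands in for settings.JOB_LOG_PERSIST_ENABLED

    def emit(self, record):
        msg = self.format(record)
        print(msg)  # container stdout always gets the line
        if not self.persist:
            return  # flag off: skip the per-record DB write entirely
        self.job.append_log(msg)  # hypothetical hook: one row UPDATE per call


class FakeJob:
    def __init__(self):
        self.logs = []

    def append_log(self, msg):
        self.logs.append(msg)


job = FakeJob()
logger = logging.getLogger("job-demo")
logger.setLevel(logging.INFO)
logger.addHandler(JobLogHandler(job, persist=False))
logger.info("Queued pipeline result: demo")
print("persisted lines:", len(job.logs))  # 0 while persistence is off
```

Because `emit` returns before touching the job row, the flag removes every log-driven UPDATE from both the view and worker paths at once.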
## Related

- Issue #1256 — full contention analysis with path breakdown.
- PR #1261 — drops `select_for_update` in `_update_job_progress`; adds the
  `JOB_LOG_PERSIST_ENABLED` flag; this runbook.
- PR #1259 — append-only `JobLog` child table. When merged, the flag can be
  removed in favor of a cutover to the new write path.
- `docs/claude/debugging/chaos-scenarios.md` — adjacent chaos tooling for
  NATS redelivery and retry-path validation.
`scripts/load_test_result_endpoint.py` (91 additions, new file)
```python
#!/usr/bin/env python3
"""Fire concurrent batched POSTs against ``POST /api/v2/jobs/{id}/result/``.

Reproduces the row-lock contention pathology described in
``docs/claude/debugging/row-lock-contention-reproduction.md``. Each POST body
contains N fake ``PipelineResultsError`` entries so the per-result
``job.logger.info(...)`` call inside ``ATOMIC_REQUESTS`` stacks N UPDATEs on
``jobs_job.logs`` in a single view transaction — the shape real ADC workers
produce (``AMI_LOCALIZATION_BATCH_SIZE=4``, ``AMI_CLASSIFICATION_BATCH_SIZE=150``).

A single-result-per-POST loop does NOT reproduce the contention. Batching
is load-bearing.

Usage:

    python scripts/load_test_result_endpoint.py <job_id> <token> \\
        [--batch 50] [--concurrency 10] [--rounds 3] \\
        [--host http://localhost:8000]

Dependencies: Python 3.10+, stdlib only.
"""
import argparse
import concurrent.futures
import json
import time
import urllib.error
import urllib.request
import uuid


def make_body(batch_size: int, prefix: str) -> bytes:
    results = [
        {
            "reply_subject": f"{prefix}.r{i}.{uuid.uuid4().hex[:8]}",
            "result": {"error": "load-test", "image_id": f"img-{prefix}-{i}"},
        }
        for i in range(batch_size)
    ]
    return json.dumps({"results": results}).encode()


def fire_one(url: str, token: str, body: bytes, idx: int) -> tuple[int, int, float]:
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Token {token}", "Content-Type": "application/json"},
        method="POST",
    )
    t0 = time.time()
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return (idx, resp.status, time.time() - t0)
    except urllib.error.HTTPError as e:
        return (idx, e.code, time.time() - t0)
    except Exception:
        return (idx, -1, time.time() - t0)


def main():
    ap = argparse.ArgumentParser(description=__doc__.splitlines()[0])
    ap.add_argument("job_id", type=int, help="Target Job.pk (must be in a running state)")
    ap.add_argument("token", help="DRF auth Token for a user with result-POST permission")
    ap.add_argument("--batch", type=int, default=50, help="results per POST body (default 50)")
    ap.add_argument("--concurrency", type=int, default=10, help="parallel POSTs per round (default 10)")
    ap.add_argument("--rounds", type=int, default=3, help="how many waves to fire (default 3)")
    ap.add_argument("--host", default="http://localhost:8000", help="API host (default localhost:8000)")
    args = ap.parse_args()

    url = f"{args.host}/api/v2/jobs/{args.job_id}/result/"
    print(f"url={url} batch={args.batch} concurrency={args.concurrency} rounds={args.rounds}")

    t_start = time.time()
    for round_idx in range(args.rounds):
        with concurrent.futures.ThreadPoolExecutor(max_workers=args.concurrency) as ex:
            futures = [
                ex.submit(fire_one, url, args.token, make_body(args.batch, f"r{round_idx}_{i}"), i)
                for i in range(args.concurrency)
            ]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]
        good = sum(1 for _, s, _ in results if s == 200)
        latencies = sorted(lat for _, _, lat in results)
        p50 = latencies[len(latencies) // 2]
        p95 = latencies[int(len(latencies) * 0.95)]
        print(
            f"round {round_idx}: ok={good}/{args.concurrency} "
            f"p50={p50:.2f}s p95={p95:.2f}s elapsed={time.time() - t_start:.1f}s"
        )


if __name__ == "__main__":
    main()
```