fix(celery): update worker concurrency defaults #1228
The default celery worker concurrency (os.cpu_count()) underutilises the worker pool for process_nats_pipeline_result and create_detection_images, which are DB/Redis-bound rather than CPU-bound. On a prefork pool sized to CPU count, the pool is idle most of the time while the antenna queue backlogs during high-throughput NATS async_api jobs. Override via CELERY_WORKER_CONCURRENCY env var per deployment; 16 is the new default.
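As a sketch of what that setting looks like in `config/settings/base.py` — the `env.int` call is quoted later in this thread; the default of 16 here matches the description above (a later revision of the PR lowers it to 8), and the surrounding lines are illustrative:

```python
# config/settings/base.py (sketch)
import environ

env = environ.Env()  # the same django-environ reader the settings file already uses

# Prefork pool size for the Celery worker. The antenna-queue tasks
# (process_nats_pipeline_result, create_detection_images) are DB/Redis-bound,
# so the pool can usefully be larger than os.cpu_count().
# Override per deployment with the CELERY_WORKER_CONCURRENCY env var.
CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)
```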
📝 Walkthrough: Added a new Celery worker concurrency setting exposed via an environment variable; the default is read as 8, the local env provides a commented example, and the production example sets it to 16.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pull request overview
This PR adjusts the default Celery worker prefork pool size by introducing an explicit CELERY_WORKER_CONCURRENCY setting in the Django base settings, while keeping it overridable per deployment via an environment variable.
Changes:
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`.
- Document rationale and override behavior inline next to existing worker prefetch settings.
🧹 Nitpick comments (1)
config/settings/base.py (1)
401-401: Consider documenting `CELERY_WORKER_CONCURRENCY` in env templates/runbooks.

Optional, but adding it to `.env.example` / deployment docs will make per-environment tuning easier (especially smaller staging/demo stacks).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` at line 401, Add documentation for the CELERY_WORKER_CONCURRENCY environment variable (used where CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)) to the project's environment templates and deployment/runbook, e.g., update .env.example and relevant runbooks to include the variable name, its purpose (controls Celery worker concurrency), allowed values, and the default of 16, plus a note recommending smaller values for staging/demo and guidance for tuning per-environment.
Drop the default from 16 to 8 so local dev, staging, and demo stacks don't spawn 16 prefork workers by default. Production keeps 16 via the existing `CELERY_WORKER_CONCURRENCY` env var (hotfix already in place). Also add guidance + example values to `.envs/.local/.django` and `.envs/.production/.django-example` so per-environment tuning is discoverable.
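A sketch of what those env-file entries could look like — the comment wording is illustrative, but the 16 in the production example and the commented-out local example are what this PR describes:

```
# .envs/.production/.django-example — production overrides the settings default of 8.
# Antenna-queue tasks are DB/Redis-bound, so more processes than CPU cores helps.
CELERY_WORKER_CONCURRENCY=16

# .envs/.local/.django — left commented out so local dev keeps the settings default of 8.
# CELERY_WORKER_CONCURRENCY=8
```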
Claude says: Addressed the nitpick about env-template documentation in 07a1c2b2.
Summary
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=8)` to `config/settings/base.py`, next to the existing `CELERY_WORKER_PREFETCH_MULTIPLIER` / `CELERY_WORKER_ENABLE_PREFETCH_COUNT_REDUCTION` block.
- Keep it overridable per deployment via the `CELERY_WORKER_CONCURRENCY` env var.
- Commit a production example value of 16 in `.envs/.production/.django-example` with guidance.
- Add a commented example to `.envs/.local/.django`.

Why
The default celery worker concurrency when the setting is unset is
`os.cpu_count()`. On the current production celery worker host (8 cores) this means an 8-process prefork pool. The dominant tasks on the `antenna` queue — `process_nats_pipeline_result` and `create_detection_images` — are DB/Redis-bound rather than CPU-bound: each task spends most of its time waiting on postgres/pgbouncer and Redis round-trips, not crunching numbers.

Direct observation during a high-throughput `async_api` job:

- This was the same run as the #1219 / #1221 trigger (#1219: `async_api` jobs killed by transient Redis errors during `update_state`, with `RedisError` and "state actually missing" conflated into a single fatal path; #1221: fix(cache) — enable SO_KEEPALIVE on django-redis cache connections).
- `consumer_utilisation` on the antenna queue: ~0.0016, i.e. the single AMQP consumer's prefetch window is fully occupied essentially all the time. This is the "worker pool too small" signature, not a broker-side issue.

Raising the prefork pool size directly addresses the bottleneck. A hotfix override of 16 was applied in production via the env var ahead of this PR and confirmed to drain the backlog on the active jobs.
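As a sketch of how that `consumer_utilisation` figure could be read off the broker — assuming a RabbitMQ broker (implied by the AMQP consumer / `consumer_utilisation` wording); the queue name and chosen columns are illustrative:

```bash
# List the antenna queue's consumer count, consumer utilisation and ready backlog.
# Utilisation near 0 with a growing ready count is the "pool too small" signature.
rabbitmqctl list_queues name consumers consumer_utilisation messages_ready | grep antenna
```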
Why default=8, prod override=16
The previous revision of this PR used
`default=16` everywhere. Review flagged the blast radius across local dev laptops and the smaller staging/demo VMs — 16 prefork workers = 16 separate Python processes with imports + persistent DB connections, which is a real memory bump where it isn't needed.

The current default of 8 is Celery's typical out-of-the-box behaviour on an 8-core host (it matches `os.cpu_count()` there) and a safer memory footprint everywhere else. Production overrides to 16 via the env var — committed as an example in `.envs/.production/.django-example` with the rationale inline. This keeps the bottleneck fix in place where it matters without making every other environment pay for it.

16 is the smallest power-of-2 step above 8 that roughly matches the empirical gap between ingress and drain on the production incident that motivated this PR, without risking pgbouncer saturation. A larger default can be considered once we have measured postgres connection-pool headroom (see "What we still need to verify" below).
What this does not change
- The prefetch multiplier stays at `1` — that was already set and fairness behaviour is unchanged.
- Splitting the `antenna` queue into a dedicated "ingest fast path" vs "housekeeping / status-check" queue is a larger follow-up, filed separately.
- The pool implementation stays `prefork`. Switching to `gevent` for this queue may give much higher effective concurrency on an IO-bound workload (see the illustrative command below), but every task on this queue would need to be audited for gevent-safety (blocking C extensions, thread-locals in PyTorch paths, etc.) first. Out of scope here.
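For clarity on what that out-of-scope switch would involve, a purely illustrative worker invocation — the app path, queue name, and concurrency value are assumptions, not anything this PR changes:

```bash
# Hypothetical gevent-pool worker for the IO-bound antenna queue (NOT part of this PR).
# Requires the gevent package; greenlet concurrency can be set much higher than a
# prefork process count, but only after every task on the queue is audited for gevent-safety.
celery -A config.celery_app worker -Q antenna --pool gevent --concurrency 200
```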
Deploy note

The production host is already running with `CELERY_WORKER_CONCURRENCY=16` in its env file from the earlier hotfix — lowering the settings default to 8 does not regress prod. Confirm the env var is still set at next deploy; if for any reason it isn't, prod will drop to 8 until the env is fixed.
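A sketch of how that check could be done at deploy time — the compose service name and Celery app path are assumptions about this deployment, not commands taken from the PR:

```bash
# Confirm the override is present in the worker's environment (service name assumed).
docker compose exec celeryworker env | grep CELERY_WORKER_CONCURRENCY

# Confirm the running pool size; `inspect stats` reports "max-concurrency" per worker.
celery -A config.celery_app inspect stats | grep -i max-concurrency
```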
What we still need to verify

- Postgres connection headroom: 16 prefork workers' persistent connections should still fit within pgbouncer's `default_pool_size`, but worth confirming under load (see the sketch below).
- Memory footprint of the larger pool in production (`--max-tasks-per-child=100` / `--max-memory-per-child=2 GiB` already bound each process).
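One way that connection-pool check could be made, as a sketch — the pgbouncer admin host, port, and user are placeholders, not values from this PR:

```bash
# Connect to the pgbouncer admin console (host/port/user are placeholders) and
# compare default_pool_size against the pools' active + waiting server connections.
psql -h "$PGBOUNCER_HOST" -p 6432 -U "$PGBOUNCER_ADMIN_USER" pgbouncer -c "SHOW CONFIG;" | grep default_pool_size
psql -h "$PGBOUNCER_HOST" -p 6432 -U "$PGBOUNCER_ADMIN_USER" pgbouncer -c "SHOW POOLS;"
```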
Related

- #1219 — `async_api` jobs killed by transient Redis errors during `update_state`; `RedisError` and "state actually missing" are conflated into a single fatal path. Code-path brittleness that lets a single transient Redis error mark an active job FAILURE and delete state (independent of this PR).
- #1221 — fix(cache): enable SO_KEEPALIVE on django-redis cache connections; reduces how often the #1219 path triggers (independent of this PR).
- A follow-up to #1219 (retry transient Redis errors instead of failing the job).