fix(celery): update worker concurrency defaults #1228
The default celery worker concurrency (os.cpu_count()) underutilises the worker pool for process_nats_pipeline_result and create_detection_images, which are DB/Redis-bound rather than CPU-bound. On a prefork pool sized to CPU count, the pool is idle most of the time while the antenna queue backlogs during high-throughput NATS async_api jobs. Override via CELERY_WORKER_CONCURRENCY env var per deployment; 16 is the new default.
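As a sketch of what that setting looks like in `config/settings/base.py` — the `env.int` call is quoted later in this thread; the default of 16 here matches the description above (a later revision of the PR lowers it to 8), and the surrounding lines are illustrative:

```python
# config/settings/base.py (sketch)
import environ

env = environ.Env()  # the same django-environ reader the settings file already uses

# Prefork pool size for the Celery worker. The antenna-queue tasks
# (process_nats_pipeline_result, create_detection_images) are DB/Redis-bound,
# so the pool can usefully be larger than os.cpu_count().
# Override per deployment with the CELERY_WORKER_CONCURRENCY env var.
CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)
```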
📝 Walkthrough: Added a new Celery worker concurrency setting exposed via an environment variable; the default is read as 8, the local env provides a commented example, and the production example sets it to 16.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pull request overview
This PR adjusts the default Celery worker prefork pool size by introducing an explicit CELERY_WORKER_CONCURRENCY setting in the Django base settings, while keeping it overridable per deployment via an environment variable.
Changes:
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`.
- Document rationale and override behavior inline next to existing worker prefetch settings.
🧹 Nitpick comments (1)
config/settings/base.py (1)
401-401: Consider documenting `CELERY_WORKER_CONCURRENCY` in env templates/runbooks.

Optional, but adding it to `.env.example` / deployment docs will make per-environment tuning easier (especially smaller staging/demo stacks).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` at line 401, Add documentation for the CELERY_WORKER_CONCURRENCY environment variable (used where CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)) to the project's environment templates and deployment/runbook, e.g., update .env.example and relevant runbooks to include the variable name, its purpose (controls Celery worker concurrency), allowed values, and the default of 16, plus a note recommending smaller values for staging/demo and guidance for tuning per-environment.
Drop the default from 16 to 8 so local dev, staging, and demo stacks don't spawn 16 prefork workers by default. Production keeps 16 via the existing `CELERY_WORKER_CONCURRENCY` env var (hotfix already in place). Also add guidance + example values to `.envs/.local/.django` and `.envs/.production/.django-example` so per-environment tuning is discoverable.
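A sketch of what those env-file entries could look like — the comment wording is illustrative, but the 16 in the production example and the commented-out local example are what this PR describes:

```
# .envs/.production/.django-example — production overrides the settings default of 8.
# Antenna-queue tasks are DB/Redis-bound, so more processes than CPU cores helps.
CELERY_WORKER_CONCURRENCY=16

# .envs/.local/.django — left commented out so local dev keeps the settings default of 8.
# CELERY_WORKER_CONCURRENCY=8
```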
Claude says: Addressed the nitpick about env-template documentation in 07a1c2b2.
Summary
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=8)` to `config/settings/base.py`, next to the existing `CELERY_WORKER_PREFETCH_MULTIPLIER` / `CELERY_WORKER_ENABLE_PREFETCH_COUNT_REDUCTION` block.
- Keep it overridable per deployment via the `CELERY_WORKER_CONCURRENCY` env var.
- Commit a production example value of 16 in `.envs/.production/.django-example` with guidance.
- Add a commented example to `.envs/.local/.django`.

Why
The default celery worker concurrency when the setting is unset is
`os.cpu_count()`. On the current production celery worker host (8 cores) this means an 8-process prefork pool. The dominant tasks on the `antenna` queue — `process_nats_pipeline_result` and `create_detection_images` — are DB/Redis-bound rather than CPU-bound: each task spends most of its time waiting on postgres/pgbouncer and Redis round-trips, not crunching numbers.

Direct observation during a high-throughput `async_api` job:

- This was the same run as the #1219 / #1221 trigger (#1219: `async_api` jobs killed by transient Redis errors during `update_state`, with `RedisError` and "state actually missing" conflated into a single fatal path; #1221: fix(cache) — enable SO_KEEPALIVE on django-redis cache connections).
- `consumer_utilisation` on the antenna queue: ~0.0016, i.e. the single AMQP consumer's prefetch window is fully occupied essentially all the time. This is the "worker pool too small" signature, not a broker-side issue.

Raising the prefork pool size directly addresses the bottleneck. A hotfix override of 16 was applied in production via the env var ahead of this PR and confirmed to drain the backlog on the active jobs.
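As a sketch of how that `consumer_utilisation` figure could be read off the broker — assuming a RabbitMQ broker (implied by the AMQP consumer / `consumer_utilisation` wording); the queue name and chosen columns are illustrative:

```bash
# List the antenna queue's consumer count, consumer utilisation and ready backlog.
# Utilisation near 0 with a growing ready count is the "pool too small" signature.
rabbitmqctl list_queues name consumers consumer_utilisation messages_ready | grep antenna
```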
Why default=8, prod override=16
The previous revision of this PR used
`default=16` everywhere. Review flagged the blast radius across local dev laptops and the smaller staging/demo VMs — 16 prefork workers = 16 separate Python processes with imports + persistent DB connections, which is a real memory bump where it isn't needed.

The current default of 8 is Celery's typical out-of-the-box behaviour on an 8-core host (it matches `os.cpu_count()` there) and a safer memory footprint everywhere else. Production overrides to 16 via the env var — committed as an example in `.envs/.production/.django-example` with the rationale inline. This keeps the bottleneck fix in place where it matters without making every other environment pay for it.

16 is the smallest power-of-2 step above 8 that roughly matches the empirical gap between ingress and drain on the production incident that motivated this PR, without risking pgbouncer saturation. A larger default can be considered once we have measured postgres connection-pool headroom (see "What we still need to verify" below).
What this does not change
- The prefetch multiplier stays at `1` — that was already set and fairness behaviour is unchanged.
- Splitting the `antenna` queue into a dedicated "ingest fast path" vs "housekeeping / status-check" queue is a larger follow-up, filed separately.
- The pool implementation stays `prefork`. Switching to `gevent` for this queue may give much higher effective concurrency on an IO-bound workload (see the illustrative command below), but every task on this queue would need to be audited for gevent-safety (blocking C extensions, thread-locals in PyTorch paths, etc.) first. Out of scope here.
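For clarity on what that out-of-scope switch would involve, a purely illustrative worker invocation — the app path, queue name, and concurrency value are assumptions, not anything this PR changes:

```bash
# Hypothetical gevent-pool worker for the IO-bound antenna queue (NOT part of this PR).
# Requires the gevent package; greenlet concurrency can be set much higher than a
# prefork process count, but only after every task on the queue is audited for gevent-safety.
celery -A config.celery_app worker -Q antenna --pool gevent --concurrency 200
```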
Deploy note

The production host is already running with `CELERY_WORKER_CONCURRENCY=16` in its env file from the earlier hotfix — lowering the settings default to 8 does not regress prod. Confirm the env var is still set at next deploy; if for any reason it isn't, prod will drop to 8 until the env is fixed.
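A sketch of how that check could be done at deploy time — the compose service name and Celery app path are assumptions about this deployment, not commands taken from the PR:

```bash
# Confirm the override is present in the worker's environment (service name assumed).
docker compose exec celeryworker env | grep CELERY_WORKER_CONCURRENCY

# Confirm the running pool size; `inspect stats` reports "max-concurrency" per worker.
celery -A config.celery_app inspect stats | grep -i max-concurrency
```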
What we still need to verify

- Postgres connection headroom: 16 prefork workers' persistent connections should still fit within pgbouncer's `default_pool_size`, but worth confirming under load (see the sketch below).
- Memory footprint of the larger pool in production (`--max-tasks-per-child=100` / `--max-memory-per-child=2 GiB` already bound each process).
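One way that connection-pool check could be made, as a sketch — the pgbouncer admin host, port, and user are placeholders, not values from this PR:

```bash
# Connect to the pgbouncer admin console (host/port/user are placeholders) and
# compare default_pool_size against the pools' active + waiting server connections.
psql -h "$PGBOUNCER_HOST" -p 6432 -U "$PGBOUNCER_ADMIN_USER" pgbouncer -c "SHOW CONFIG;" | grep default_pool_size
psql -h "$PGBOUNCER_HOST" -p 6432 -U "$PGBOUNCER_ADMIN_USER" pgbouncer -c "SHOW POOLS;"
```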
Related

- #1219 — `async_api` jobs killed by transient Redis errors during `update_state`; `RedisError` and "state actually missing" are conflated into a single fatal path. Code-path brittleness that lets a single transient Redis error mark an active job FAILURE and delete state (independent of this PR).
- #1221 — fix(cache): enable SO_KEEPALIVE on django-redis cache connections; reduces how often the #1219 path triggers (independent of this PR).
- A follow-up to #1219 (retry transient Redis errors instead of failing the job).