Docker healthcheck and auto-restart for Celery workers#1024
Conversation
✅ Deploy Preview for antenna-preview canceled.
📝 Walkthrough

Adds Celery healthcheck scripts and ensures images include procps; updates Celery worker startup controls (`CELERY_DEBUG`, `CELERY_NO_RELOAD`); adds healthchecks and autoheal service entries across compose files; exposes a debug port mapping for celeryworker in the local compose file.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Docker as Docker Engine
    participant Container as Celery Worker Container
    participant Celery as Celery CLI
    participant Broker as Broker (Redis/RabbitMQ)
    participant Autoheal as Autoheal Service
    loop healthcheck interval
        Docker->>Container: run /celery/healthcheck.sh
        Container->>Celery: celery inspect ping --destination "celery@$(hostname)" --timeout 10
        alt ping succeeds
            Celery-->>Container: exit 0 (healthy)
            Container-->>Docker: return healthy
        else ping fails
            Celery-->>Container: exit non-zero (unhealthy)
            Container-->>Docker: return unhealthy
        end
    end
    Docker->>Autoheal: report unhealthy (label=autoheal)
    alt repeated failures
        Autoheal->>Docker: request restart
        Docker->>Container: restart container
    else healthy
        Autoheal-->>Autoheal: continue monitoring
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
compose/local/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

The pattern `"celery.*worker"` could potentially match unrelated processes. Consider using a more specific pattern like `"celery -A config.celery_app worker"` to ensure you're matching the actual worker process.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

compose/production/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

Same issue as the local script: the pattern `"celery.*worker"` could match unrelated processes; prefer `"celery -A config.celery_app worker"`.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

docker-compose.production.yml (2)
46-53: Consider increasing celerybeat start_period to match celeryworker.

The 30-second start period may be insufficient for Celery Beat to fully initialize and begin task scheduling. Celeryworker uses 60 seconds; consider matching this for consistency, or document why Beat requires faster readiness.

Additionally, the `pgrep` check only validates process existence; it does not detect whether Beat is actively scheduling tasks. For production, consider adding a secondary check (e.g., verifying Beat has logged recent task-scheduling activity) if available in the healthcheck script.
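One way to implement such a secondary check is a heartbeat file that the scheduler touches on every loop, with the healthcheck failing when the file goes stale. The sketch below is illustrative only: the file path and 2-minute threshold are assumptions, a freshly created temp file stands in for the real heartbeat so the snippet runs anywhere, and GNU `stat -c %Y` is assumed (Linux containers).

```shell
# Sketch of a beat liveness check: fail if a heartbeat file is stale.
# In a real setup the scheduler process would touch the file on every tick;
# here a freshly created temp file stands in so the sketch is runnable.
HEARTBEAT_FILE=$(mktemp)
MAX_AGE=120  # seconds of staleness tolerated (illustrative threshold)

age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT_FILE") ))
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "unhealthy: heartbeat is ${age}s old" >&2
    status=1
else
    echo "healthy"
    status=0
fi
```

Because the check only reads a file's mtime, it needs no broker connectivity and completes in milliseconds, which suits a tight Docker healthcheck timeout.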
72-81: Pin the autoheal image version and document privileged socket access.

The autoheal service uses the `latest` tag, which risks introducing breaking changes on image updates. For production stability, pin to a specific version:

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```

Additionally, the `/var/run/docker.sock` mount grants significant privileges to the autoheal container (the ability to restart any container). Ensure this security consideration is documented in deployment notes or the README for operators.
📒 Files selected for processing (9)

- compose/local/django/Dockerfile (2 hunks)
- compose/local/django/celery/healthcheck.sh (1 hunks)
- compose/local/django/celery/worker/start (1 hunks)
- compose/production/django/Dockerfile (2 hunks)
- compose/production/django/celery/healthcheck.sh (1 hunks)
- compose/production/django/celery/worker/start (1 hunks)
- docker-compose.production.yml (2 hunks)
- docker-compose.worker.yml (1 hunks)
- docker-compose.yml (1 hunks)
🔇 Additional comments (12)
compose/production/django/celery/worker/start (1)

7-26: LGTM! Excellent resource protection and documentation.

The worker protections are well-configured:

- `--max-tasks-per-child=50` prevents memory leaks from accumulating across tasks
- `--max-memory-per-child=4000000` (4 GB) provides a reasonable ceiling for ML workloads
- Comments clearly explain the interaction between healthcheck, autoheal, and restart policies
compose/local/django/celery/worker/start (1)

15-28: Verify the watchfiles command syntax.

The conditional startup logic is well-structured, but please verify the watchfiles invocation on Line 27 is correct:

```sh
exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker -l INFO ...'
```

Based on watchfiles documentation, the typical syntax is `watchfiles [module] [args]`. Confirm that:

- The module path `celery.__main__.main` is correctly specified
- The `--args` flag properly passes all arguments to Celery
- The command successfully reloads the worker when Python files change

Consider testing this locally with a file change to ensure auto-reload triggers as expected.
compose/production/django/Dockerfile (2)

48-49: LGTM! Correctly adds procps for healthcheck.

The procps package provides `pgrep`, which is used by the healthcheck script to verify the Celery worker process is running.

86-88: LGTM! Properly installs healthcheck script.

The healthcheck directory is copied and made executable, enabling Docker's healthcheck mechanism to monitor worker status.
docker-compose.worker.yml (2)

28-35: LGTM! Well-configured healthcheck parameters.

The healthcheck configuration is appropriate:

- 30s interval provides frequent checks without overhead
- 15s timeout allows for slow responses during high load
- 3 retries (90s total) prevents premature restarts from transient issues
- 60s start_period accommodates worker initialization

37-46: Container names are unique and correctly environment-scoped; no changes needed.

The verification confirms the container names across compose files are distinct: `ami_local_redis` (local), `ami_worker_autoheal` (worker), and `ami_production_autoheal` (production). The naming convention prevents collisions even if the files were deployed on the same host. The autoheal service's global label-based restart behavior is the standard expected behavior for the autoheal image; confirm this is intentional for your use case.

compose/local/django/Dockerfile (2)
44-45: LGTM! Correctly adds procps for healthcheck.

Consistent with the production Dockerfile, this ensures `pgrep` is available for the healthcheck script.

79-81: LGTM! Properly installs healthcheck script.

The healthcheck directory is correctly copied and made executable for local development.
docker-compose.yml (2)

89-107: LGTM! Healthcheck configuration is appropriate for local development.

The healthcheck is properly configured with the same parameters as production. Note that unlike `docker-compose.worker.yml` and `docker-compose.production.yml`, this local compose file:

- Does not include an autoheal service
- Does not set a `restart: always` policy
- Does not have the `autoheal=true` label

This is appropriate for local development, where developers may want manual control over container restarts for debugging purposes.

93-100: Clear documentation for debugging options.

The commented environment variables provide helpful guidance for enabling debugpy remote debugging and controlling auto-reload behavior.
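As a sketch of how such environment toggles typically drive a start script: the variable names come from this PR, but the dispatcher function, the debugpy port, and the exact commands below are assumptions for illustration, not the repo's actual script.

```shell
# Hypothetical dispatcher mirroring the CELERY_DEBUG / CELERY_NO_RELOAD flags.
choose_cmd() {
    if [ "${CELERY_DEBUG:-}" = "1" ]; then
        # Remote debugging via debugpy (listen port is an assumption)
        echo "python -m debugpy --listen 0.0.0.0:5678 -m celery -A config.celery_app worker -l INFO"
    elif [ "${CELERY_NO_RELOAD:-}" = "1" ]; then
        # Plain worker, no file watching
        echo "celery -A config.celery_app worker -l INFO"
    else
        # Default: auto-reload wrapper for local development
        echo "watchfiles --filter python celery.__main__.main"
    fi
}

CELERY_NO_RELOAD=1
unset CELERY_DEBUG
cmd=$(choose_cmd)
echo "$cmd"
```

Keeping the branch selection in one function makes the precedence explicit: debug mode wins over reload suppression, which wins over the default watcher.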
docker-compose.production.yml (2)

32-39: Celeryworker healthcheck configuration is sound.

The script-based approach allows for sophisticated health validation beyond simple process checks. The 60-second start period provides adequate buffer for Celery worker initialization, and the 30-second check interval with 3 retries (90 seconds total to unhealthy) balances responsiveness with stability.

Ensure `/celery/healthcheck.sh` exists in the production Django image and handles cases where the Redis broker is unavailable (per the PR summary, it should work without broker connectivity).

63-70: Flower healthcheck configuration is appropriate.

The HTTP endpoint check is reliable and the timing (30s interval, 5s timeout, 3 retries) is well-suited for Flower's responsiveness characteristics. Internal port 5555 is correctly used despite the service being exposed on 5550 to the host.
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing.

Line 24 uses `${CELERY_BROKER_URL:-redis}` directly as the hostname argument to redis-cli, but `CELERY_BROKER_URL` is typically a full URL like `redis://redis:6379/0`, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.

Apply this diff to properly extract the hostname from the broker URL:

```diff
 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi
```

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.
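The sed expression from the suggested diff can be checked in isolation; the URL below is an example value, not a repo setting.

```shell
# Extract the hostname from a redis:// URL, as in the suggested diff.
CELERY_BROKER_URL="redis://redis:6379/0"  # example value
REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
echo "$REDIS_HOST"  # → redis
```

Note the pattern assumes no userinfo in the URL: a `redis://user:pass@host:6379` form would capture `user` instead of the host, so a hardened version would strip a leading `[user[:pass]@]` segment first.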
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing (same issue in the production script).

Line 24 uses `${CELERY_BROKER_URL:-redis}` directly as the hostname argument to redis-cli, but `CELERY_BROKER_URL` is typically a full URL like `redis://redis:6379/0`, not just a hostname. This will cause the healthcheck to fail when the environment variable is set. Apply the same fix as in the local script:

```diff
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
```

This also adds a 5-second timeout so the healthcheck cannot hang indefinitely if Redis is unresponsive.
Code review

Found 1 issue:

antenna/compose/production/django/celery/worker/start, lines 22 to 26 (at ae584bc): the current main branch includes this flag (added in #1041), but this PR appears to predate that change and the rewritten scripts drop it. The same applies to all three code paths in the local start script.

Note: several other issues scored just below the reporting threshold (75/100), all related to the PR branch predating subsequently merged PRs (#1041 RabbitMQ migration, #1048 debug conventions, #1051 worker limits). A rebase onto main would resolve most of them and surface the remaining conflicts for resolution. In particular, the healthcheck scripts check Redis connectivity, while main has migrated the broker to RabbitMQ (#1041).

🤖 Generated with Claude Code
✅ Deploy Preview for antenna-ssec canceled.
Pull request overview

Adds container-level health monitoring and automated recovery for Celery services to reduce manual intervention when workers/related services become unhealthy.

Changes:

- Introduces `/celery/healthcheck.sh` and wires Docker healthchecks for celeryworker (local + production), plus celerybeat/flower in production.
- Adds an `autoheal` sidecar in worker/production compose files to restart containers marked unhealthy.
- Updates Celery worker start scripts to support env-controlled debugpy and reload behavior, and keeps per-child limits configurable.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| docker-compose.yml | Adds local celeryworker healthcheck and exposes debugpy port. |
| docker-compose.worker.yml | Adds worker healthcheck + autoheal sidecar for worker VM deployments. |
| docker-compose.production.yml | Adds healthchecks for celeryworker/beat/flower + autoheal sidecar. |
| compose/production/django/celery/worker/start | Refactors worker startup command formatting; keeps max-tasks/max-memory settings. |
| compose/production/django/celery/healthcheck.sh | Adds production worker healthcheck script. |
| compose/production/django/Dockerfile | Installs procps and copies /celery healthcheck scripts into the image. |
| compose/local/django/celery/worker/start | Moves debug/reload behavior into script via env vars. |
| compose/local/django/celery/healthcheck.sh | Adds local worker healthcheck script. |
| compose/local/django/Dockerfile | Installs procps and copies /celery healthcheck scripts into the image. |
```sh
MAX_TASKS_PER_CHILD=100
MAX_MEMORY_PER_CHILD=2097152  # 2 GiB in KB
```

The worker protections described in the PR summary (`--max-tasks-per-child=50` / `--max-memory-per-child=4000000`) don't match the values configured here (100 / 2097152). Please reconcile the PR description and the actual configured limits so operators don't deploy with unintended settings.
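The unit mismatch is easy to check: Celery's `--max-memory-per-child` takes kilobytes, so the two values flagged above correspond to quite different ceilings.

```shell
# 2097152 KiB is a binary 2 GiB; 4000000 KB is a decimal 4 GB (~3.8 GiB).
gib=$(( 2097152 / 1024 / 1024 ))
gb=$(( 4000000 / 1000 / 1000 ))
echo "$gib $gb"  # → 2 4
```

Whichever pair of limits is kept, stating the intended ceiling in both the script comment and the PR description avoids this class of drift.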
```yaml
    volumes:
      - ./data/flower/:/data/
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5555/ || exit 1"]
```

The Flower healthcheck uses curl, but the production Django image doesn't install curl (no curl package in the Dockerfile). As written, this healthcheck will immediately fail and mark Flower unhealthy; either install curl (or use an available tool like wget/python) for the check.

```diff
-      test: ["CMD-SHELL", "curl -f http://localhost:5555/ || exit 1"]
+      test: ["CMD-SHELL", "python -c 'import urllib.request, sys; urllib.request.urlopen(\"http://localhost:5555/\")' || exit 1"]
```
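A more defensive variant is to pick the probe tool at runtime. This sketch assumes at least one of curl, wget, or python exists in the image; the `probe` helper is hypothetical, not part of the repo.

```shell
# Define an HTTP probe using whichever client the image actually ships.
if command -v curl > /dev/null 2>&1; then
    probe() { curl -fsS "$1" > /dev/null; }
elif command -v wget > /dev/null 2>&1; then
    probe() { wget -q -O /dev/null "$1"; }
else
    probe() { python -c 'import sys, urllib.request; urllib.request.urlopen(sys.argv[1], timeout=5)' "$1"; }
fi

type probe  # confirm a probe implementation was selected
```

The trade-off: runtime detection keeps the compose file image-agnostic, while the single-line python check above is simpler when the image is known to ship Python (as a Django image does).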
```yaml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Mounting /var/run/docker.sock into the autoheal container grants it root-equivalent control over the Docker host. If this is required, consider pinning the image by digest and documenting/limiting where this compose file is used to reduce the blast radius.
```yaml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Same concern in the second compose file: mounting /var/run/docker.sock grants the autoheal container root-equivalent control over the Docker host; consider pinning the image by digest and documenting where this file is deployed.
```yaml
    healthcheck:
      test: ["CMD-SHELL", "/celery/healthcheck.sh"]
      interval: 30s  # Check every 30 seconds
      timeout: 15s   # Healthcheck must complete within 15s (ping timeout is 10s + overhead)
```

The inline comment on the healthcheck timeout mentions a "ping timeout", but /celery/healthcheck.sh doesn't perform any ping/response check, only a process check. Please update the comment (or the script) so the configuration matches what's actually being measured.

```diff
-      timeout: 15s  # Healthcheck must complete within 15s (ping timeout is 10s + overhead)
+      timeout: 15s  # Healthcheck script must complete within 15s
```
```sh
# This script checks if the Celery worker process is running and responsive.
# It uses two checks:
#   1. Process check - is celery worker process running?
#   2. Redis connectivity - can we connect to the broker?
```

The script text says it checks whether the worker is "running and responsive", but the implemented check is just pgrep for a matching process. This will still pass for SIGSTOP'd/paused workers or workers that are deadlocked and not processing tasks; consider adding a responsiveness check (e.g., `celery -A config.celery_app inspect ping` with a timeout) and/or verifying the process state via ps.
```yaml
      - "autoheal=true"  # Enable autoheal to restart this container when unhealthy

  autoheal:
    image: willfarrell/autoheal:latest
```

Using `willfarrell/autoheal:latest` makes deployments non-deterministic and risks unexpected changes when the image updates. Pin this to a specific version tag (or digest) to keep production behavior reproducible.

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```
```yaml
      - "autoheal=true"

  autoheal:
    image: willfarrell/autoheal:latest
```

Same concern in this compose file: pin `willfarrell/autoheal:latest` to a specific version tag (or digest) to keep behavior reproducible.

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```
```sh
# This script checks if the Celery worker process is running and responsive.
# It uses two checks:
#   1. Process check - is celery worker process running?
#   2. Redis connectivity - can we connect to the broker?
#
# When used with the autoheal container, unhealthy workers will be
# automatically restarted.

set -e

# Check 1: Is the celery worker process running?
if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
    echo "ERROR: Celery worker process not found" >&2
    exit 1
fi

# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```

The Redis connectivity check is currently miswired: CELERY_BROKER_URL is an AMQP RabbitMQ URL in this repo, and the image doesn't install redis-cli/redis-tools, so this check is ineffective/misconfigured. Either remove it, or replace it with a broker-appropriate connectivity check (or install the right client and pass a host/port rather than a URL).

Suggested replacement (drops the ineffective Redis check):

```sh
# This script checks if the Celery worker process is running.
# When used with the autoheal container, unhealthy workers will be
# automatically restarted.

set -e

# Check: Is the celery worker process running?
if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
    echo "ERROR: Celery worker process not found" >&2
    exit 1
fi
```
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
```

The Redis connectivity check is wired to CELERY_BROKER_URL, but in this repo CELERY_BROKER_URL is an AMQP RabbitMQ URL (e.g. `amqp://...`), not a Redis host, and the image doesn't install redis-cli (so this branch will never run). Either remove this check, or implement a broker check appropriate for RabbitMQ / parse host+port correctly and install the needed client tool.
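If a broker check is kept, the host and port can be split out of an AMQP URL with plain parameter expansion. A sketch, assuming the usual `amqp://user:pass@host:port/vhost` shape; the default URL below is an assumption, not a repo setting.

```shell
# Parse host/port out of an amqp:// broker URL (scheme, userinfo, vhost stripped).
url="${CELERY_BROKER_URL:-amqp://guest:guest@rabbitmq:5672//}"
hostport="${url#*://}"       # drop scheme
hostport="${hostport##*@}"   # drop userinfo if present
hostport="${hostport%%/*}"   # drop vhost/path
host="${hostport%%:*}"
port="${hostport#*:}"
if [ "$port" = "$hostport" ]; then
    port=5672                # URL had no explicit port; use AMQP default
fi
echo "$host $port"  # → rabbitmq 5672
```

A raw TCP probe such as `timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port"` could then stand in for redis-cli, though confirming a real AMQP handshake would require an actual client tool.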
mihow
left a comment
Review: Direction and Approach
The worker protections (--max-tasks-per-child, --max-memory-per-child), autoheal pattern, and local dev improvements (CELERY_DEBUG, CELERY_NO_RELOAD) are solid additions. The layered defense approach is well-thought-out.
Key gap: healthcheck strength
The healthcheck currently checks process existence (pgrep) and stopped state (ps stat). This catches crashed workers and SIGSTOP'd processes, but not the harder failure mode this PR describes: stuck workers (deadlocked, stale broker connection, frozen on I/O).
A worker with a stale Redis/RabbitMQ connection will still have a running process and pass the current healthcheck. From docker ps, everything looks healthy while jobs pile up unprocessed.
Suggested upgrade: Add celery inspect ping as a second check. It sends a control message through the broker and waits for the worker to respond — the standard way to verify a Celery worker is actually responsive, not just alive. See inline comment on healthcheck.sh for a concrete implementation. It needs no extra dependencies (Celery and Django settings are already in the container).
Other items
- Duplicate scripts: Local and production healthcheck scripts are identical (except one comment word). Consider sharing a single script.
- Dockerfile COPY scope: The entire `celery/` directory is copied into the image, including start scripts already copied individually above. Should copy only `healthcheck.sh`.
- Comment accuracy: A few comments reference "ping" behavior that isn't implemented yet. See inline suggestions.
Overall the architecture is right — the fixes are incremental.
```sh
set -e

# Check: Is the celery worker process running and not stopped?
CELERY_PIDS=$(pgrep -f "celery.*worker" || true)
```

The pattern `celery.*worker` can match the watchfiles wrapper in local dev — its command line includes both "celery" and "worker" (e.g., `watchfiles ... celery.__main__.main --args '... worker ...'`). If the Celery child process crashes but watchfiles is still alive trying to restart it, pgrep matches the wrapper and reports healthy.

A more specific pattern avoids this:

```diff
-CELERY_PIDS=$(pgrep -f "celery.*worker" || true)
+CELERY_PIDS=$(pgrep -f "[c]elery.*-A config.celery_app worker" || true)
```

The `[c]elery` trick also prevents pgrep from matching itself in edge cases.
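The over-match is easy to reproduce against sample command lines. The two lines below are illustrative stand-ins for `ps` output, and the tighter pattern here anchors on the literal worker invocation rather than a wildcard.

```shell
# Two sample command lines: the watchfiles wrapper and the real worker.
ps_lines='watchfiles --filter python celery.__main__.main --args -A config.celery_app worker -l INFO
celery -A config.celery_app worker -l INFO'

loose=$(printf '%s\n' "$ps_lines" | grep -c 'celery.*worker')
strict=$(printf '%s\n' "$ps_lines" | grep -c 'celery -A config\.celery_app worker')
echo "loose=$loose strict=$strict"  # → loose=2 strict=1
```

The loose pattern counts the wrapper too, because the wrapper's command line also contains "celery" followed later by "worker"; matching the literal `celery -A config.celery_app worker` invocation selects only the worker process itself.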
```sh
fi
done

# All checks passed
```

Consider adding a `celery inspect ping` check here. This is the main value-add — catching stuck/deadlocked workers that pass the process check. The PR description identifies this as the core problem ("workers can become stuck or unresponsive... but Docker doesn't detect this"), and pgrep alone can't solve it.

```sh
# Check 2: Is the worker actually responsive via the broker?
WORKER_NAME="celery@$(hostname)"
if ! timeout 12 celery -A config.celery_app inspect ping \
    --destination "$WORKER_NAME" \
    --timeout 8 > /dev/null 2>&1; then
  echo "ERROR: Celery worker $WORKER_NAME not responding to ping" >&2
  exit 1
fi
```

This uses the Django/Celery config already in the container — no extra deps needed. The `--timeout 8` (Celery-level) plus `timeout 12` (process-level) stays within the 15s healthcheck timeout.

`celery inspect ping` is heavier than pgrep (a roundtrip through the broker), but the 30s interval leaves more than enough headroom.
```diff
@@ -0,0 +1,27 @@
+#!/bin/bash
+#
+# Celery Worker Healthcheck Script (Production)
```

This file is identical to compose/local/django/celery/healthcheck.sh except for "(Production)" on this line. Consider sharing a single script to avoid divergence:

- Place it in `compose/shared/celery/healthcheck.sh` and COPY from there in both Dockerfiles
- Or have the production Dockerfile COPY from the local path

The healthcheck logic is environment-agnostic, so a single source of truth makes maintenance simpler.
```sh
# Example: 8 CPUs × 2 GiB = 16 GiB max total worker memory
#
# These options work in conjunction with the Docker healthcheck:
# - Healthcheck detects STUCK workers (not responding to ping)
```

The healthcheck currently uses pgrep — it detects dead workers (process not running) and stopped workers, not stuck workers that would fail a ping check. If `celery inspect ping` is added later, this comment becomes accurate. For now:

```diff
-# - Healthcheck detects STUCK workers (not responding to ping)
+# - Healthcheck detects DEAD workers (process no longer running)
```
```dockerfile
# Copy celery scripts directory for healthcheck
COPY ./compose/local/django/celery /celery
```

This copies the entire celery/ directory into /celery/, including worker/start, beat/start, and flower/start — which are already individually copied to /start-* on lines 67-77 above. Only the healthcheck script is needed here.

```diff
-# Copy celery scripts directory for healthcheck
-COPY ./compose/local/django/celery /celery
+# Copy healthcheck script for Docker health monitoring
+COPY ./compose/local/django/celery/healthcheck.sh /celery/healthcheck.sh
+RUN chmod +x /celery/healthcheck.sh
```
```dockerfile
# Copy celery scripts directory for healthcheck
COPY --chown=django:django ./compose/production/django/celery /celery
RUN chmod +x /celery/healthcheck.sh
```

Same as local — copy only the healthcheck script, not the entire directory.

```diff
-# Copy celery scripts directory for healthcheck
-COPY --chown=django:django ./compose/production/django/celery /celery
-RUN chmod +x /celery/healthcheck.sh
+# Copy healthcheck script for Docker health monitoring
+COPY --chown=django:django ./compose/production/django/celery/healthcheck.sh /celery/healthcheck.sh
+RUN chmod +x /celery/healthcheck.sh
```
```yaml
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10      # Check container health every 10 seconds
      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
```

This is the delay before the autoheal container itself starts monitoring — not a per-container grace period. Docker's `start_period` in the healthcheck config handles per-container startup grace.

```diff
-      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
+      - AUTOHEAL_START_PERIOD=60  # Wait 60s after autoheal starts before monitoring
```
```yaml
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10      # Check container health every 10 seconds
      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
```

Same clarification as the worker compose — this controls the autoheal container's own startup delay.

```diff
-      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
+      - AUTOHEAL_START_PERIOD=60  # Wait 60s after autoheal starts before monitoring
```
Adds end-to-end recovery for the failure mode observed on 2026-04-16: the celerybeat container on ami-live showed "Up 10 hours" in `docker ps` with four live PIDs and RestartCount=0, but the scheduler thread was hung after a Redis connection blip. `jobs_health_check` stopped firing and stuck job 2421 was never reaped.

Changes:

- celeryworker healthcheck via `celery inspect ping` (detects stuck workers that keep their PID but stop servicing the broker).
- celerybeat healthcheck via a heartbeat file. Since `inspect ping` is a worker control message and DatabaseScheduler has no on-disk schedule file, we add HeartbeatDatabaseScheduler (ami/celery_schedulers.py) that overrides `tick()` to touch /tmp/beat-heartbeat on every loop. If the scheduler hangs, the file goes stale and Docker flips to unhealthy.
- willfarrell/autoheal sidecar on production/staging/worker compose. Docker Compose still has no native restart-on-unhealthy (swarm-only).
- docker-compose.staging.yml now gets the same healthchecks + autoheal (this is what the demo host uses; previously it had zero celery healthchecks).
- Worker start-script env var renamed DEBUGGER -> CELERY_DEBUG, and CELERY_NO_RELOAD added to disable watchfiles in local dev.
- Production compose no longer defines `celeryworker` — workers run on dedicated hosts via docker-compose.worker.yml. This was already in the original PR direction.

Supersedes the `pgrep`-based beat healthcheck in an earlier draft of this PR, which would not have caught the 2026-04-16 hang (the process was alive; the scheduler thread was not).

Co-Authored-By: Claude <noreply@anthropic.com>
Claude says: (on behalf of @mihow, context from the celery beat hang debugged today in #1246) Picking this back up. The shape is good — this should rebase on main and land rather than start over. One load-bearing change needed, plus a few smaller notes.
Force-pushed from 0d0cebb to c940b3e
Summary
Adds Docker healthchecks and automatic restart for the Celery worker and beat services. Addresses the failure mode observed on 2026-04-16, where the celerybeat container showed `Up 10 hours` / `RestartCount=0` in `docker ps`, but its scheduler thread was hung — the `jobs_health_check` beat task hadn't fired in 12+ hours, so stuck job 2421 was never reaped. Docker's default `restart: unless-stopped` only catches process death, not scheduler/worker hangs, so we need an explicit liveness signal plus a mechanism to act on it.
What this changes
- Worker healthcheck — `celery inspect ping` over the broker, via `compose/{local,production}/django/celery/healthcheck.sh`. Catches stuck, deadlocked, and disconnected workers that a PID check would miss.
- Beat healthcheck — file-mtime heartbeat. Beat doesn't respond to `inspect ping` (it's a worker control message), and `DatabaseScheduler` has no on-disk schedule file. So `ami/celery_schedulers.py` adds `HeartbeatDatabaseScheduler`, a subclass of `django_celery_beat`'s `DatabaseScheduler` that touches `/tmp/beat-heartbeat` at the top of every `tick()`. The touch runs inside the beat process, so the file lives in the beat container's filesystem. The healthcheck script (`healthcheck-beat.sh`) fails if the file is older than 2 minutes. If the scheduler hangs anywhere (Redis pool lock, DB call, sync deadlock), the file goes stale and Docker flips the container to `unhealthy` — which would have caught the 2026-04-16 incident.
- Autoheal sidecar — `willfarrell/autoheal:1.2.0` added to production, staging, and worker compose files. Docker Compose still has no native restart-on-unhealthy (that's swarm/k8s territory — compose issue #4826 has been open since 2017). Autoheal watches the docker socket and restarts any container labeled `autoheal=true` that Docker marks unhealthy.
- Staging compose (`docker-compose.staging.yml`) now has the same healthchecks + autoheal. Previously it had zero celery healthchecks. This is what the demo host runs.
- Worker start scripts:
  - `compose/production/django/celery/worker/start`: multi-line format; a comment explains how the healthcheck, `max-tasks-per-child`, autoheal, and `restart: always` layer together
  - `compose/local/django/celery/worker/start`: renames `DEBUGGER=1` → `CELERY_DEBUG=1`, adds `CELERY_NO_RELOAD=1` to disable watchfiles
  - worker limits (`--max-tasks-per-child=100`, `--max-memory-per-child`) kept from earlier draft
- Production compose — `celeryworker` removed; workers run on dedicated hosts via `docker-compose.worker.yml`. This was already the direction.
Files
New:
- `ami/celery_schedulers.py` (scheduler subclass with a docstring explaining the 2026-04-16 context)
- `compose/{local,production}/django/celery/healthcheck.sh` (worker)
- `compose/{local,production}/django/celery/healthcheck-beat.sh` (beat)
Modified:
- `compose/{local,production}/django/Dockerfile` — add `procps`, COPY healthcheck scripts
- `compose/{local,production}/django/celery/worker/start`
- `config/settings/base.py` — switch `CELERY_BEAT_SCHEDULER` to `HeartbeatDatabaseScheduler`
- `docker-compose.yml`, `docker-compose.production.yml`, `docker-compose.worker.yml`, `docker-compose.staging.yml`, `docker-compose.override-example.yml`
compose/{local,production}/django/Dockerfile— addprocps, COPY healthcheck scriptscompose/{local,production}/django/celery/worker/startconfig/settings/base.py— switchCELERY_BEAT_SCHEDULERtoHeartbeatDatabaseSchedulerdocker-compose.yml,docker-compose.production.yml,docker-compose.worker.yml,docker-compose.staging.yml,docker-compose.override-example.ymlVerification
Healthy path:
Simulated hang (either service):
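Both paths come down to polling the container's reported health state until it reaches the expected value. A minimal polling helper as a sketch; in real use the status command would be `docker inspect --format '{{.State.Health.Status}}' <container>`, and a hang can be simulated by pausing the main process (e.g. `docker kill --signal=STOP <container>`), which keeps the PID alive while the heartbeat goes stale — the 2026-04-16 failure shape:

```shell
#!/bin/sh
# wait_for_status CMD EXPECTED [TRIES]: run CMD once per second until its
# output equals EXPECTED; fail after TRIES attempts. In real use CMD would
# be: docker inspect --format '{{.State.Health.Status}}' <container>
wait_for_status() {
  cmd="$1"
  want="$2"
  tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    [ "$($cmd)" = "$want" ] && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

Healthy path: expect `healthy` shortly after `start_period` elapses. Simulated hang: expect `unhealthy` once the check has failed `retries` times, then an autoheal restart.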
Notes
- An earlier draft used `pgrep -f 'celery.*beat'` for the beat healthcheck. That would not have caught the 2026-04-16 incident — the process was alive; the scheduler thread was not.
- `start_period: 90s` on both healthchecks covers the slow production boot (postgres wait + New Relic agent init + DatabaseScheduler's initial sync).
- The autoheal containers get fixed names (`ami_production_autoheal`, `ami_worker_autoheal`); the staging compose omits `container_name` so multiple staging instances on the same host don't collide.
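Taken together, the notes above imply a beat service shape roughly like the following. Only `start_period: 90s`, the heartbeat script, and the autoheal label come from this PR; the script path and the other timings are illustrative assumptions:

```yaml
# Hypothetical compose fragment — interval/timeout/retries and the
# script path are illustrative, not verbatim from this PR.
celerybeat:
  healthcheck:
    test: ["CMD", "/celery/healthcheck-beat.sh"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 90s
  labels:
    - autoheal=true
  restart: unless-stopped
```

`restart: unless-stopped` still handles process death; the healthcheck plus the `autoheal=true` label is what covers the hung-but-alive case.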