Docker healthcheck and auto-restart for Celery workers#1024
Conversation
✅ Deploy Preview for antenna-preview canceled.
📝 Walkthrough

Adds Celery healthcheck scripts and ensures images include procps; updates Celery worker startup controls (`CELERY_DEBUG`, `CELERY_NO_RELOAD`); adds healthchecks and autoheal service entries across compose files; exposes a debug port mapping for celeryworker in the local compose file.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Docker as Docker Engine
    participant Container as Celery Worker Container
    participant Celery as Celery CLI
    participant Broker as Broker (Redis/RabbitMQ)
    participant Autoheal as Autoheal Service
    loop healthcheck interval
        Docker->>Container: run /celery/healthcheck.sh
        Container->>Celery: celery inspect ping --destination "celery@$(hostname)" --timeout 10
        alt ping succeeds
            Celery-->>Container: exit 0 (healthy)
            Container-->>Docker: return healthy
        else ping fails
            Celery-->>Container: exit non-zero (unhealthy)
            Container-->>Docker: return unhealthy
        end
    end
    Docker->>Autoheal: report unhealthy (label=autoheal)
    alt repeated failures
        Autoheal->>Docker: request restart
        Docker->>Container: restart container
    else healthy
        Autoheal-->>Autoheal: continue monitoring
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
compose/local/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

The pattern `"celery.*worker"` could potentially match unrelated processes. Consider using a more specific pattern like `"celery -A config.celery_app worker"` to ensure you're matching the actual worker process.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

compose/production/django/celery/healthcheck.sh (1)

15-19: Consider a more specific process pattern.

Same issue as the local script: the pattern `"celery.*worker"` could match unrelated processes; prefer `"celery -A config.celery_app worker"`.

```diff
-if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
+if ! pgrep -f "celery -A config.celery_app worker" > /dev/null 2>&1; then
     echo "ERROR: Celery worker process not found" >&2
     exit 1
 fi
```

docker-compose.production.yml (2)
46-53: Consider increasing celerybeat start_period to match celeryworker.

The 30-second start period may be insufficient for Celery Beat to fully initialize and begin task scheduling. Celeryworker uses 60 seconds; consider matching this for consistency, or document why Beat requires faster readiness.

Additionally, the `pgrep` check only validates process existence; it does not detect whether Beat is actively scheduling tasks. For production, consider adding a secondary check (e.g., verifying Beat has logged recent task-scheduling activity) if available in the healthcheck script.
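One way to implement such a secondary check is a heartbeat file that the scheduler touches on every loop, with the healthcheck failing when the file goes stale. The sketch below is illustrative only: the file path and 2-minute threshold are assumptions, a freshly created temp file stands in for the real heartbeat so the snippet runs anywhere, and GNU `stat -c %Y` is assumed (Linux containers).

```shell
# Sketch of a beat liveness check: fail if a heartbeat file is stale.
# In a real setup the scheduler process would touch the file on every tick;
# here a freshly created temp file stands in so the sketch is runnable.
HEARTBEAT_FILE=$(mktemp)
MAX_AGE=120  # seconds of staleness tolerated (illustrative threshold)

age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT_FILE") ))
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "unhealthy: heartbeat is ${age}s old" >&2
    status=1
else
    echo "healthy"
    status=0
fi
```

Because the check only reads a file's mtime, it needs no broker connectivity and completes in milliseconds, which suits a tight Docker healthcheck timeout.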
72-81: Pin the autoheal image version and document privileged socket access.

The autoheal service uses the `latest` tag, which risks introducing breaking changes on image updates. For production stability, pin to a specific version:

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```

Additionally, the `/var/run/docker.sock` mount grants significant privileges to the autoheal container (the ability to restart any container). Ensure this security consideration is documented in deployment notes or the README for operators.
📒 Files selected for processing (9)

- compose/local/django/Dockerfile (2 hunks)
- compose/local/django/celery/healthcheck.sh (1 hunks)
- compose/local/django/celery/worker/start (1 hunks)
- compose/production/django/Dockerfile (2 hunks)
- compose/production/django/celery/healthcheck.sh (1 hunks)
- compose/production/django/celery/worker/start (1 hunks)
- docker-compose.production.yml (2 hunks)
- docker-compose.worker.yml (1 hunks)
- docker-compose.yml (1 hunks)
🔇 Additional comments (12)
compose/production/django/celery/worker/start (1)

7-26: LGTM! Excellent resource protection and documentation.

The worker protections are well-configured:

- `--max-tasks-per-child=50` prevents memory leaks from accumulating across tasks
- `--max-memory-per-child=4000000` (4 GB) provides a reasonable ceiling for ML workloads
- Comments clearly explain the interaction between healthcheck, autoheal, and restart policies
compose/local/django/celery/worker/start (1)

15-28: Verify the watchfiles command syntax.

The conditional startup logic is well-structured, but please verify the watchfiles invocation on Line 27 is correct:

```sh
exec watchfiles --filter python celery.__main__.main --args '-A config.celery_app worker -l INFO ...'
```

Based on watchfiles documentation, the typical syntax is `watchfiles [module] [args]`. Confirm that:

- The module path `celery.__main__.main` is correctly specified
- The `--args` flag properly passes all arguments to Celery
- The command successfully reloads the worker when Python files change

Consider testing this locally with a file change to ensure auto-reload triggers as expected.
compose/production/django/Dockerfile (2)

48-49: LGTM! Correctly adds procps for healthcheck.

The procps package provides `pgrep`, which is used by the healthcheck script to verify the Celery worker process is running.

86-88: LGTM! Properly installs healthcheck script.

The healthcheck directory is copied and made executable, enabling Docker's healthcheck mechanism to monitor worker status.
docker-compose.worker.yml (2)

28-35: LGTM! Well-configured healthcheck parameters.

The healthcheck configuration is appropriate:

- 30s interval provides frequent checks without overhead
- 15s timeout allows for slow responses during high load
- 3 retries (90s total) prevents premature restarts from transient issues
- 60s start_period accommodates worker initialization

37-46: Container names are unique and correctly environment-scoped; no changes needed.

The verification confirms the container names across compose files are distinct: `ami_local_redis` (local), `ami_worker_autoheal` (worker), and `ami_production_autoheal` (production). The naming convention prevents collisions even if the files were deployed on the same host. The autoheal service's global label-based restart behavior is the standard expected behavior for the autoheal image; confirm this is intentional for your use case.

compose/local/django/Dockerfile (2)
44-45: LGTM! Correctly adds procps for healthcheck.

Consistent with the production Dockerfile, this ensures `pgrep` is available for the healthcheck script.

79-81: LGTM! Properly installs healthcheck script.

The healthcheck directory is correctly copied and made executable for local development.
docker-compose.yml (2)

89-107: LGTM! Healthcheck configuration is appropriate for local development.

The healthcheck is properly configured with the same parameters as production. Note that unlike `docker-compose.worker.yml` and `docker-compose.production.yml`, this local compose file:

- Does not include an autoheal service
- Does not set a `restart: always` policy
- Does not have the `autoheal=true` label

This is appropriate for local development, where developers may want manual control over container restarts for debugging purposes.

93-100: Clear documentation for debugging options.

The commented environment variables provide helpful guidance for enabling debugpy remote debugging and controlling auto-reload behavior.
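As a sketch of how such environment toggles typically drive a start script: the variable names come from this PR, but the dispatcher function, the debugpy port, and the exact commands below are assumptions for illustration, not the repo's actual script.

```shell
# Hypothetical dispatcher mirroring the CELERY_DEBUG / CELERY_NO_RELOAD flags.
choose_cmd() {
    if [ "${CELERY_DEBUG:-}" = "1" ]; then
        # Remote debugging via debugpy (listen port is an assumption)
        echo "python -m debugpy --listen 0.0.0.0:5678 -m celery -A config.celery_app worker -l INFO"
    elif [ "${CELERY_NO_RELOAD:-}" = "1" ]; then
        # Plain worker, no file watching
        echo "celery -A config.celery_app worker -l INFO"
    else
        # Default: auto-reload wrapper for local development
        echo "watchfiles --filter python celery.__main__.main"
    fi
}

CELERY_NO_RELOAD=1
unset CELERY_DEBUG
cmd=$(choose_cmd)
echo "$cmd"
```

Keeping the branch selection in one function makes the precedence explicit: debug mode wins over reload suppression, which wins over the default watcher.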
docker-compose.production.yml (2)

32-39: Celeryworker healthcheck configuration is sound.

The script-based approach allows for sophisticated health validation beyond simple process checks. The 60-second start period provides adequate buffer for Celery worker initialization, and the 30-second check interval with 3 retries (90 seconds total to unhealthy) balances responsiveness with stability.

Ensure `/celery/healthcheck.sh` exists in the production Django image and handles cases where the Redis broker is unavailable (per the PR summary, it should work without broker connectivity).

63-70: Flower healthcheck configuration is appropriate.

The HTTP endpoint check is reliable and the timing (30s interval, 5s timeout, 3 retries) is well-suited for Flower's responsiveness characteristics. Internal port 5555 is correctly used despite the service being exposed on 5550 to the host.
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing.

Line 24 uses `${CELERY_BROKER_URL:-redis}` directly as the hostname argument to redis-cli, but `CELERY_BROKER_URL` is typically a full URL like `redis://redis:6379/0`, not just a hostname. This will cause the healthcheck to fail when the environment variable is set.

Apply this diff to properly extract the hostname from the broker URL:

```diff
 # Check 2: Can we connect to Redis (the broker)?
 # Use redis-cli if available, otherwise skip
 if command -v redis-cli > /dev/null 2>&1; then
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
         echo "ERROR: Cannot connect to Redis broker" >&2
         exit 1
     fi
 fi
```

This also adds a 5-second timeout to prevent the healthcheck from hanging indefinitely if Redis is unresponsive.
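The sed expression from the suggested diff can be checked in isolation; the URL below is an example value, not a repo setting.

```shell
# Extract the hostname from a redis:// URL, as in the suggested diff.
CELERY_BROKER_URL="redis://redis:6379/0"  # example value
REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
echo "$REDIS_HOST"  # → redis
```

Note the pattern assumes no userinfo in the URL: a `redis://user:pass@host:6379` form would capture `user` instead of the host, so a hardened version would strip a leading `[user[:pass]@]` segment first.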
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```
Fix Redis broker URL parsing (same issue in the production script).

Line 24 uses `${CELERY_BROKER_URL:-redis}` directly as the hostname argument to redis-cli, but `CELERY_BROKER_URL` is typically a full URL like `redis://redis:6379/0`, not just a hostname. This will cause the healthcheck to fail when the environment variable is set. Apply the same fix as in the local script:

```diff
-    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
+    REDIS_HOST=$(echo "${CELERY_BROKER_URL:-redis://redis:6379}" | sed -E 's|^redis://([^:/@]+).*|\1|')
+    if ! timeout 5 redis-cli -h "${REDIS_HOST}" ping > /dev/null 2>&1; then
```

This also adds a 5-second timeout so the healthcheck cannot hang indefinitely if Redis is unresponsive.
Code review

Found 1 issue:

antenna/compose/production/django/celery/worker/start, lines 22 to 26 (at ae584bc): the current main branch includes this flag (added in #1041), but this PR appears to predate that change and the rewritten scripts drop it. The same applies to all three code paths in the local start script.

Note: several other issues scored just below the reporting threshold (75/100), all related to the PR branch predating subsequently merged PRs (#1041 RabbitMQ migration, #1048 debug conventions, #1051 worker limits). A rebase onto main would resolve most of them and surface the remaining conflicts for resolution. In particular, the healthcheck scripts check Redis connectivity, while main has migrated the broker to RabbitMQ (#1041).

🤖 Generated with Claude Code
✅ Deploy Preview for antenna-ssec canceled.
Pull request overview

Adds container-level health monitoring and automated recovery for Celery services to reduce manual intervention when workers/related services become unhealthy.

Changes:

- Introduces `/celery/healthcheck.sh` and wires Docker healthchecks for celeryworker (local + production), plus celerybeat/flower in production.
- Adds an `autoheal` sidecar in worker/production compose files to restart containers marked unhealthy.
- Updates Celery worker start scripts to support env-controlled debugpy and reload behavior, and keeps per-child limits configurable.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| docker-compose.yml | Adds local celeryworker healthcheck and exposes debugpy port. |
| docker-compose.worker.yml | Adds worker healthcheck + autoheal sidecar for worker VM deployments. |
| docker-compose.production.yml | Adds healthchecks for celeryworker/beat/flower + autoheal sidecar. |
| compose/production/django/celery/worker/start | Refactors worker startup command formatting; keeps max-tasks/max-memory settings. |
| compose/production/django/celery/healthcheck.sh | Adds production worker healthcheck script. |
| compose/production/django/Dockerfile | Installs procps and copies /celery healthcheck scripts into the image. |
| compose/local/django/celery/worker/start | Moves debug/reload behavior into script via env vars. |
| compose/local/django/celery/healthcheck.sh | Adds local worker healthcheck script. |
| compose/local/django/Dockerfile | Installs procps and copies /celery healthcheck scripts into the image. |
```sh
MAX_TASKS_PER_CHILD=100
MAX_MEMORY_PER_CHILD=2097152  # 2 GiB in KB
```

The worker protections described in the PR summary (`--max-tasks-per-child=50` / `--max-memory-per-child=4000000`) don't match the values configured here (100 / 2097152). Please reconcile the PR description and the actual configured limits so operators don't deploy with unintended settings.
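The unit mismatch is easy to check: Celery's `--max-memory-per-child` takes kilobytes, so the two values flagged above correspond to quite different ceilings.

```shell
# 2097152 KiB is a binary 2 GiB; 4000000 KB is a decimal 4 GB (~3.8 GiB).
gib=$(( 2097152 / 1024 / 1024 ))
gb=$(( 4000000 / 1000 / 1000 ))
echo "$gib $gb"  # → 2 4
```

Whichever pair of limits is kept, stating the intended ceiling in both the script comment and the PR description avoids this class of drift.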
```yaml
    volumes:
      - ./data/flower/:/data/
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5555/ || exit 1"]
```

The Flower healthcheck uses curl, but the production Django image doesn't install curl (no curl package in the Dockerfile). As written, this healthcheck will immediately fail and mark Flower unhealthy; either install curl (or use an available tool like wget/python) for the check.

```diff
-      test: ["CMD-SHELL", "curl -f http://localhost:5555/ || exit 1"]
+      test: ["CMD-SHELL", "python -c 'import urllib.request, sys; urllib.request.urlopen(\"http://localhost:5555/\")' || exit 1"]
```
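A more defensive variant is to pick the probe tool at runtime. This sketch assumes at least one of curl, wget, or python exists in the image; the `probe` helper is hypothetical, not part of the repo.

```shell
# Define an HTTP probe using whichever client the image actually ships.
if command -v curl > /dev/null 2>&1; then
    probe() { curl -fsS "$1" > /dev/null; }
elif command -v wget > /dev/null 2>&1; then
    probe() { wget -q -O /dev/null "$1"; }
else
    probe() { python -c 'import sys, urllib.request; urllib.request.urlopen(sys.argv[1], timeout=5)' "$1"; }
fi

type probe  # confirm a probe implementation was selected
```

The trade-off: runtime detection keeps the compose file image-agnostic, while the single-line python check above is simpler when the image is known to ship Python (as a Django image does).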
```yaml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Mounting /var/run/docker.sock into the autoheal container grants it root-equivalent control over the Docker host. If this is required, consider pinning the image by digest and documenting/limiting where this compose file is used to reduce the blast radius.
```yaml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Same concern in the second compose file: mounting /var/run/docker.sock grants the autoheal container root-equivalent control over the Docker host; consider pinning the image by digest and documenting where this file is deployed.
```yaml
    healthcheck:
      test: ["CMD-SHELL", "/celery/healthcheck.sh"]
      interval: 30s  # Check every 30 seconds
      timeout: 15s   # Healthcheck must complete within 15s (ping timeout is 10s + overhead)
```

The inline comment on the healthcheck timeout mentions a "ping timeout", but /celery/healthcheck.sh doesn't perform any ping/response check, only a process check. Please update the comment (or the script) so the configuration matches what's actually being measured.

```diff
-      timeout: 15s  # Healthcheck must complete within 15s (ping timeout is 10s + overhead)
+      timeout: 15s  # Healthcheck script must complete within 15s
```
```sh
# This script checks if the Celery worker process is running and responsive.
# It uses two checks:
#   1. Process check - is celery worker process running?
#   2. Redis connectivity - can we connect to the broker?
```

The script text says it checks whether the worker is "running and responsive", but the implemented check is just pgrep for a matching process. This will still pass for SIGSTOP'd/paused workers or workers that are deadlocked and not processing tasks; consider adding a responsiveness check (e.g., `celery -A config.celery_app inspect ping` with a timeout) and/or verifying the process state via ps.
```yaml
      - "autoheal=true"  # Enable autoheal to restart this container when unhealthy

  autoheal:
    image: willfarrell/autoheal:latest
```

Using `willfarrell/autoheal:latest` makes deployments non-deterministic and risks unexpected changes when the image updates. Pin this to a specific version tag (or digest) to keep production behavior reproducible.

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```
```yaml
      - "autoheal=true"

  autoheal:
    image: willfarrell/autoheal:latest
```

Same concern in this compose file: pin `willfarrell/autoheal:latest` to a specific version tag (or digest) to keep behavior reproducible.

```diff
-    image: willfarrell/autoheal:latest
+    image: willfarrell/autoheal:1.2.0
```
```sh
# This script checks if the Celery worker process is running and responsive.
# It uses two checks:
#   1. Process check - is celery worker process running?
#   2. Redis connectivity - can we connect to the broker?
#
# When used with the autoheal container, unhealthy workers will be
# automatically restarted.

set -e

# Check 1: Is the celery worker process running?
if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
    echo "ERROR: Celery worker process not found" >&2
    exit 1
fi

# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
        exit 1
    fi
fi
```

The Redis connectivity check is currently miswired: CELERY_BROKER_URL is an AMQP RabbitMQ URL in this repo, and the image doesn't install redis-cli/redis-tools, so this check is ineffective/misconfigured. Either remove it, or replace it with a broker-appropriate connectivity check (or install the right client and pass a host/port rather than a URL).

Suggested replacement (drops the ineffective Redis check):

```sh
# This script checks if the Celery worker process is running.
# When used with the autoheal container, unhealthy workers will be
# automatically restarted.

set -e

# Check: Is the celery worker process running?
if ! pgrep -f "celery.*worker" > /dev/null 2>&1; then
    echo "ERROR: Celery worker process not found" >&2
    exit 1
fi
```
```sh
# Check 2: Can we connect to Redis (the broker)?
# Use redis-cli if available, otherwise skip
if command -v redis-cli > /dev/null 2>&1; then
    if ! redis-cli -h ${CELERY_BROKER_URL:-redis} ping > /dev/null 2>&1; then
        echo "ERROR: Cannot connect to Redis broker" >&2
```

The Redis connectivity check is wired to CELERY_BROKER_URL, but in this repo CELERY_BROKER_URL is an AMQP RabbitMQ URL (e.g. `amqp://...`), not a Redis host, and the image doesn't install redis-cli (so this branch will never run). Either remove this check, or implement a broker check appropriate for RabbitMQ / parse host+port correctly and install the needed client tool.
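If a broker check is kept, the host and port can be split out of an AMQP URL with plain parameter expansion. A sketch, assuming the usual `amqp://user:pass@host:port/vhost` shape; the default URL below is an assumption, not a repo setting.

```shell
# Parse host/port out of an amqp:// broker URL (scheme, userinfo, vhost stripped).
url="${CELERY_BROKER_URL:-amqp://guest:guest@rabbitmq:5672//}"
hostport="${url#*://}"       # drop scheme
hostport="${hostport##*@}"   # drop userinfo if present
hostport="${hostport%%/*}"   # drop vhost/path
host="${hostport%%:*}"
port="${hostport#*:}"
if [ "$port" = "$hostport" ]; then
    port=5672                # URL had no explicit port; use AMQP default
fi
echo "$host $port"  # → rabbitmq 5672
```

A raw TCP probe such as `timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port"` could then stand in for redis-cli, though confirming a real AMQP handshake would require an actual client tool.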
mihow
left a comment
Review: Direction and Approach
The worker protections (--max-tasks-per-child, --max-memory-per-child), autoheal pattern, and local dev improvements (CELERY_DEBUG, CELERY_NO_RELOAD) are solid additions. The layered defense approach is well-thought-out.
Key gap: healthcheck strength
The healthcheck currently checks process existence (pgrep) and stopped state (ps stat). This catches crashed workers and SIGSTOP'd processes, but not the harder failure mode this PR describes: stuck workers (deadlocked, stale broker connection, frozen on I/O).
A worker with a stale Redis/RabbitMQ connection will still have a running process and pass the current healthcheck. From docker ps, everything looks healthy while jobs pile up unprocessed.
Suggested upgrade: Add celery inspect ping as a second check. It sends a control message through the broker and waits for the worker to respond — the standard way to verify a Celery worker is actually responsive, not just alive. See inline comment on healthcheck.sh for a concrete implementation. It needs no extra dependencies (Celery and Django settings are already in the container).
Other items
- Duplicate scripts: Local and production healthcheck scripts are identical (except one comment word). Consider sharing a single script.
- Dockerfile COPY scope: The entire `celery/` directory is copied into the image, including start scripts already copied individually above. Should copy only `healthcheck.sh`.
- Comment accuracy: A few comments reference "ping" behavior that isn't implemented yet. See inline suggestions.
Overall the architecture is right — the fixes are incremental.
```sh
set -e

# Check: Is the celery worker process running and not stopped?
CELERY_PIDS=$(pgrep -f "celery.*worker" || true)
```

The pattern `celery.*worker` can match the watchfiles wrapper in local dev — its command line includes both "celery" and "worker" (e.g., `watchfiles ... celery.__main__.main --args '... worker ...'`). If the Celery child process crashes but watchfiles is still alive trying to restart it, pgrep matches the wrapper and reports healthy.

A more specific pattern avoids this:

```diff
-CELERY_PIDS=$(pgrep -f "celery.*worker" || true)
+CELERY_PIDS=$(pgrep -f "[c]elery.*-A config.celery_app worker" || true)
```

The `[c]elery` trick also prevents pgrep from matching itself in edge cases.
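The over-match is easy to reproduce against sample command lines. The two lines below are illustrative stand-ins for `ps` output, and the tighter pattern here anchors on the literal worker invocation rather than a wildcard.

```shell
# Two sample command lines: the watchfiles wrapper and the real worker.
ps_lines='watchfiles --filter python celery.__main__.main --args -A config.celery_app worker -l INFO
celery -A config.celery_app worker -l INFO'

loose=$(printf '%s\n' "$ps_lines" | grep -c 'celery.*worker')
strict=$(printf '%s\n' "$ps_lines" | grep -c 'celery -A config\.celery_app worker')
echo "loose=$loose strict=$strict"  # → loose=2 strict=1
```

The loose pattern counts the wrapper too, because the wrapper's command line also contains "celery" followed later by "worker"; matching the literal `celery -A config.celery_app worker` invocation selects only the worker process itself.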
```sh
fi
done

# All checks passed
```

Consider adding a `celery inspect ping` check here. This is the main value-add — catching stuck/deadlocked workers that pass the process check. The PR description identifies this as the core problem ("workers can become stuck or unresponsive... but Docker doesn't detect this"), and pgrep alone can't solve it.

```sh
# Check 2: Is the worker actually responsive via the broker?
WORKER_NAME="celery@$(hostname)"
if ! timeout 12 celery -A config.celery_app inspect ping \
    --destination "$WORKER_NAME" \
    --timeout 8 > /dev/null 2>&1; then
  echo "ERROR: Celery worker $WORKER_NAME not responding to ping" >&2
  exit 1
fi
```

This uses the Django/Celery config already in the container — no extra deps needed. The `--timeout 8` (Celery-level) plus `timeout 12` (process-level) stays within the 15s healthcheck timeout.

`celery inspect ping` is heavier than pgrep (a roundtrip through the broker), but the 30s interval leaves more than enough headroom.
```diff
@@ -0,0 +1,27 @@
+#!/bin/bash
+#
+# Celery Worker Healthcheck Script (Production)
```

This file is identical to compose/local/django/celery/healthcheck.sh except for "(Production)" on this line. Consider sharing a single script to avoid divergence:

- Place it in `compose/shared/celery/healthcheck.sh` and COPY from there in both Dockerfiles
- Or have the production Dockerfile COPY from the local path

The healthcheck logic is environment-agnostic, so a single source of truth makes maintenance simpler.
```sh
# Example: 8 CPUs × 2 GiB = 16 GiB max total worker memory
#
# These options work in conjunction with the Docker healthcheck:
# - Healthcheck detects STUCK workers (not responding to ping)
```

The healthcheck currently uses pgrep — it detects dead workers (process not running) and stopped workers, not stuck workers that would fail a ping check. If `celery inspect ping` is added later, this comment becomes accurate. For now:

```diff
-# - Healthcheck detects STUCK workers (not responding to ping)
+# - Healthcheck detects DEAD workers (process no longer running)
```
```dockerfile
# Copy celery scripts directory for healthcheck
COPY ./compose/local/django/celery /celery
```

This copies the entire celery/ directory into /celery/, including worker/start, beat/start, and flower/start — which are already individually copied to /start-* on lines 67-77 above. Only the healthcheck script is needed here.

```diff
-# Copy celery scripts directory for healthcheck
-COPY ./compose/local/django/celery /celery
+# Copy healthcheck script for Docker health monitoring
+COPY ./compose/local/django/celery/healthcheck.sh /celery/healthcheck.sh
+RUN chmod +x /celery/healthcheck.sh
```
```dockerfile
# Copy celery scripts directory for healthcheck
COPY --chown=django:django ./compose/production/django/celery /celery
RUN chmod +x /celery/healthcheck.sh
```

Same as local — copy only the healthcheck script, not the entire directory.

```diff
-# Copy celery scripts directory for healthcheck
-COPY --chown=django:django ./compose/production/django/celery /celery
-RUN chmod +x /celery/healthcheck.sh
+# Copy healthcheck script for Docker health monitoring
+COPY --chown=django:django ./compose/production/django/celery/healthcheck.sh /celery/healthcheck.sh
+RUN chmod +x /celery/healthcheck.sh
```
```yaml
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10      # Check container health every 10 seconds
      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
```

This is the delay before the autoheal container itself starts monitoring — not a per-container grace period. Docker's `start_period` in the healthcheck config handles per-container startup grace.

```diff
-      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
+      - AUTOHEAL_START_PERIOD=60  # Wait 60s after autoheal starts before monitoring
```
```yaml
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10      # Check container health every 10 seconds
      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
```

Same clarification as the worker compose — this controls the autoheal container's own startup delay.

```diff
-      - AUTOHEAL_START_PERIOD=60  # Don't restart containers for 60s after they start
+      - AUTOHEAL_START_PERIOD=60  # Wait 60s after autoheal starts before monitoring
```
Adds end-to-end recovery for the failure mode observed on 2026-04-16: the celerybeat container on ami-live showed "Up 10 hours" in `docker ps` with four live PIDs and RestartCount=0, but the scheduler thread was hung after a Redis connection blip. `jobs_health_check` stopped firing and stuck job 2421 was never reaped.

Changes:

- celeryworker healthcheck via `celery inspect ping` (detects stuck workers that keep their PID but stop servicing the broker).
- celerybeat healthcheck via a heartbeat file. Since `inspect ping` is a worker control message and DatabaseScheduler has no on-disk schedule file, we add HeartbeatDatabaseScheduler (ami/celery_schedulers.py) that overrides `tick()` to touch /tmp/beat-heartbeat on every loop. If the scheduler hangs, the file goes stale and Docker flips to unhealthy.
- willfarrell/autoheal sidecar on production/staging/worker compose. Docker Compose still has no native restart-on-unhealthy (swarm-only).
- docker-compose.staging.yml now gets the same healthchecks + autoheal (this is what the demo host uses; previously it had zero celery healthchecks).
- Worker start-script env var renamed DEBUGGER -> CELERY_DEBUG, and CELERY_NO_RELOAD added to disable watchfiles in local dev.
- Production compose no longer defines `celeryworker` — workers run on dedicated hosts via docker-compose.worker.yml. This was already in the original PR direction.

Supersedes the `pgrep`-based beat healthcheck in an earlier draft of this PR, which would not have caught the 2026-04-16 hang (the process was alive; the scheduler thread was not).

Co-Authored-By: Claude <noreply@anthropic.com>
Claude says: (on behalf of @mihow, context from the celery beat hang debugged today in #1246) Picking this back up. The shape is good — this should rebase on main and land rather than start over. One load-bearing change needed, plus a few smaller notes.
Force-pushed from 0d0cebb to c940b3e
Summary
Adds Docker healthchecks and automatic restart for the Celery worker and beat services. Addresses the failure mode observed on 2026-04-16, where the celerybeat container showed `Up 10 hours` / `RestartCount=0` in `docker ps`, but its scheduler thread was hung — the `jobs_health_check` beat task hadn't fired in 12+ hours, so stuck job 2421 was never reaped. Docker's default `restart: unless-stopped` only catches process death, not scheduler/worker hangs, so we need an explicit liveness signal plus a mechanism to act on it.
What this changes
- Worker healthcheck — `celery inspect ping` over the broker, via `compose/{local,production}/django/celery/healthcheck.sh`. Catches stuck, deadlocked, and disconnected workers that a PID check would miss.
- Beat healthcheck — file-mtime heartbeat. Beat doesn't respond to `inspect ping` (it's a worker control message), and `DatabaseScheduler` has no on-disk schedule file. So `ami/celery_schedulers.py` adds `HeartbeatDatabaseScheduler`, a subclass of `django_celery_beat`'s `DatabaseScheduler` that touches `/tmp/beat-heartbeat` at the top of every `tick()`. The touch runs inside the beat process, so the file lives in the beat container's filesystem. The healthcheck script (`healthcheck-beat.sh`) fails if the file is older than 2 minutes. If the scheduler hangs anywhere (Redis pool lock, DB call, sync deadlock), the file goes stale and Docker flips the container to `unhealthy` — which would have caught the 2026-04-16 incident.
- Autoheal sidecar — `willfarrell/autoheal:1.2.0` added to production, staging, and worker compose files. Docker Compose still has no native restart-on-unhealthy (that's swarm/k8s territory — compose issue #4826 has been open since 2017). Autoheal watches the docker socket and restarts any container labeled `autoheal=true` that Docker marks unhealthy.
- Staging compose (`docker-compose.staging.yml`) now has the same healthchecks + autoheal. Previously it had zero celery healthchecks. This is what the demo host runs.
- Worker start scripts:
  - `compose/production/django/celery/worker/start`: multi-line format; a comment explains how the healthcheck, `max-tasks-per-child`, autoheal, and `restart: always` layer together
  - `compose/local/django/celery/worker/start`: renames `DEBUGGER=1` → `CELERY_DEBUG=1`, adds `CELERY_NO_RELOAD=1` to disable watchfiles
  - worker limits (`--max-tasks-per-child=100`, `--max-memory-per-child`) kept from earlier draft
- Production compose — `celeryworker` removed; workers run on dedicated hosts via `docker-compose.worker.yml`. This was already the direction.
Files
New:
- `ami/celery_schedulers.py` (scheduler subclass with a docstring explaining the 2026-04-16 context)
- `compose/{local,production}/django/celery/healthcheck.sh` (worker)
- `compose/{local,production}/django/celery/healthcheck-beat.sh` (beat)
Modified:
- `compose/{local,production}/django/Dockerfile` — add `procps`, COPY healthcheck scripts
- `compose/{local,production}/django/celery/worker/start`
- `config/settings/base.py` — switch `CELERY_BEAT_SCHEDULER` to `HeartbeatDatabaseScheduler`
- `docker-compose.yml`, `docker-compose.production.yml`, `docker-compose.worker.yml`, `docker-compose.staging.yml`, `docker-compose.override-example.yml`
compose/{local,production}/django/Dockerfile— addprocps, COPY healthcheck scriptscompose/{local,production}/django/celery/worker/startconfig/settings/base.py— switchCELERY_BEAT_SCHEDULERtoHeartbeatDatabaseSchedulerdocker-compose.yml,docker-compose.production.yml,docker-compose.worker.yml,docker-compose.staging.yml,docker-compose.override-example.ymlVerification
Healthy path:
Simulated hang (either service):
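Both paths come down to polling the container's reported health state until it reaches the expected value. A minimal polling helper as a sketch; in real use the status command would be `docker inspect --format '{{.State.Health.Status}}' <container>`, and a hang can be simulated by pausing the main process (e.g. `docker kill --signal=STOP <container>`), which keeps the PID alive while the heartbeat goes stale — the 2026-04-16 failure shape:

```shell
#!/bin/sh
# wait_for_status CMD EXPECTED [TRIES]: run CMD once per second until its
# output equals EXPECTED; fail after TRIES attempts. In real use CMD would
# be: docker inspect --format '{{.State.Health.Status}}' <container>
wait_for_status() {
  cmd="$1"
  want="$2"
  tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    [ "$($cmd)" = "$want" ] && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

Healthy path: expect `healthy` shortly after `start_period` elapses. Simulated hang: expect `unhealthy` once the check has failed `retries` times, then an autoheal restart.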
Notes
- An earlier draft used `pgrep -f 'celery.*beat'` for the beat healthcheck. That would not have caught the 2026-04-16 incident — the process was alive; the scheduler thread was not.
- `start_period: 90s` on both healthchecks covers the slow production boot (postgres wait + New Relic agent init + DatabaseScheduler's initial sync).
- The autoheal containers get fixed names (`ami_production_autoheal`, `ami_worker_autoheal`); the staging compose omits `container_name` so multiple staging instances on the same host don't collide.
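Taken together, the notes above imply a beat service shape roughly like the following. Only `start_period: 90s`, the heartbeat script, and the autoheal label come from this PR; the script path and the other timings are illustrative assumptions:

```yaml
# Hypothetical compose fragment — interval/timeout/retries and the
# script path are illustrative, not verbatim from this PR.
celerybeat:
  healthcheck:
    test: ["CMD", "/celery/healthcheck-beat.sh"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 90s
  labels:
    - autoheal=true
  restart: unless-stopped
```

`restart: unless-stopped` still handles process death; the healthcheck plus the `autoheal=true` label is what covers the hung-but-alive case.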