Docker healthcheck and auto-restart for Celery workers #1024
base: main
@@ -0,0 +1,74 @@
"""Celery beat scheduler with a Docker-healthcheck-friendly heartbeat file.

Why this module exists
----------------------
Docker's default ``restart: unless-stopped`` only catches process death, not a
frozen scheduler thread. On 2026-04-16 the celerybeat container on ami-live
showed "Up 10 hours" in ``docker ps`` with four live PIDs and
``RestartCount=0``, yet its last log line was "Sending due task
celery.check_processing_services_online" twelve hours earlier — a Redis
connection blip had deadlocked the connection-pool lock and the scheduler
thread never recovered. The 15-minute ``jobs_health_check`` beat task stopped
firing, and stuck job 2421 was never reaped.

To let Docker flip the container to ``unhealthy`` on that failure mode, we
need a heartbeat signal that proves the scheduler's main loop is progressing.
Constraints:

- Beat does not answer ``celery inspect ping`` (that's a worker control
  message), so we can't reuse the worker healthcheck.
- We use ``DatabaseScheduler`` from ``django_celery_beat``, which keeps the
  schedule in Postgres, so there is no on-disk schedule file whose mtime
  would update naturally.
- A plain Celery task written from a worker would touch a file in the
  **worker's** filesystem, not beat's — Docker healthchecks read files
  inside the checked container.

So: override ``DatabaseScheduler.tick()`` to touch ``/tmp/beat-heartbeat``
on every iteration. ``tick()`` runs inside the beat process itself, so the
file lives in the beat container. If the scheduler loop hangs anywhere
(Redis pool lock, DB query, sync deadlock), ``tick()`` stops returning and
the file goes stale within ~60 s. The healthcheck
(``compose/*/django/celery/healthcheck-beat.sh``) fails, Docker marks the
container ``unhealthy``, and autoheal restarts it.

Activation
----------
Wired in via ``CELERY_BEAT_SCHEDULER`` in ``config/settings/base.py``.
"""

from __future__ import annotations

import logging
from pathlib import Path

from django_celery_beat.schedulers import DatabaseScheduler

logger = logging.getLogger(__name__)

HEARTBEAT_PATH = Path("/tmp/beat-heartbeat")


class HeartbeatDatabaseScheduler(DatabaseScheduler):
    """DatabaseScheduler that touches a heartbeat file on every tick.

    Each call to ``tick()`` represents one cycle of the scheduler's main loop:
    evaluate due tasks, enqueue them, return the seconds until the next tick.
    If any step in that cycle hangs (e.g. a Redis or DB call blocks forever),
    ``tick()`` stops returning, the file mtime stops advancing, and the Docker
    healthcheck flips the container to ``unhealthy`` within ~2 minutes.

    We touch the file *before* delegating to ``super().tick()``, so each touch
    proves the previous iteration returned and the loop is still advancing. If
    the heartbeat write ever fails (disk full, permission error), we log at
    warning level but don't re-raise — an I/O problem writing ``/tmp``
    shouldn't take down the scheduler. Docker will eventually mark the
    container unhealthy on the stale file, which is the right outcome.
    """

    def tick(self, *args, **kwargs):
        try:
            HEARTBEAT_PATH.touch()
        except OSError as exc:
            logger.warning("beat heartbeat touch failed: %s", exc)
        return super().tick(*args, **kwargs)
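For reference, activation is a single setting. A minimal sketch of the wiring, assuming a hypothetical dotted path for the class above (the real module path isn't shown in this diff):

```python
# config/settings/base.py — sketch only. "ami.schedulers" is a hypothetical
# location for HeartbeatDatabaseScheduler; substitute the real module path.
# With the usual CELERY_ settings namespace, this maps to Celery's
# beat_scheduler setting, which django-celery-beat honours.
CELERY_BEAT_SCHEDULER = "ami.schedulers.HeartbeatDatabaseScheduler"
```

Beat can also be pointed at a scheduler class via `celery beat --scheduler=...`; keeping it in settings makes every environment use the heartbeat variant by default.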
@@ -0,0 +1,16 @@
#!/bin/bash
# Celerybeat healthcheck: verify the scheduler is alive by checking heartbeat file age.
#
# Beat doesn't respond to `celery inspect ping` (that's a worker control message),
# and with DatabaseScheduler (django_celery_beat) there's no schedule file whose
# mtime we can watch. So we rely on HeartbeatDatabaseScheduler (the module
# above), which touches /tmp/beat-heartbeat on every scheduler tick().
#
# If beat hangs (e.g. scheduler thread deadlocked on a Redis connection blip —
# the 2026-04-16 incident), tick() stops returning and the file goes stale.
# Docker flips the container to `unhealthy`, autoheal restarts it.
#
# Window: the heartbeat is touched about every minute; we tolerate up to 2 min
# of staleness before marking unhealthy (one missed tick is fine; two in a row is a hang).
set -e
find /tmp/beat-heartbeat -mmin -2 2>/dev/null | grep -q . || exit 1
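The `find -mmin -2 | grep -q .` idiom makes `find` print the path only when the file's mtime is under two minutes old, and `grep -q .` turns "any output at all" into the exit status Docker consumes. To spot-check a live deployment (a sketch; the container name below is an assumption, use whatever name your compose project assigns the beat container):

```bash
# Hypothetical container name; adjust to your compose project.
docker exec ami_production_celerybeat stat -c '%y' /tmp/beat-heartbeat    # heartbeat mtime
docker inspect --format '{{.State.Health.Status}}' ami_production_celerybeat  # healthy/unhealthy
```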
@@ -0,0 +1,7 @@
#!/bin/bash
# Celery worker healthcheck: verify the worker is responsive via the broker.
# Catches stuck, deadlocked, and crashed workers — not just process existence.
set -e
exec celery -A config.celery_app inspect ping \
    --destination "celery@$(hostname)" \
    --timeout 10 > /dev/null 2>&1
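`inspect ping` publishes a control message to the broker addressed to this specific node (`celery@<hostname>`) and exits non-zero when no reply arrives within the timeout, which is the signal Docker keys off. Run by hand, a healthy worker replies roughly like this (shape typical of Celery 5; exact wording may vary by version):

```bash
$ celery -A config.celery_app inspect ping --destination "celery@$(hostname)" --timeout 10
->  celery@worker-host: OK
        pong

1 node online.
```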
@@ -22,7 +22,7 @@ services:
  celeryworker:
    environment:
      - DEBUGGER=1
      - CELERY_DEBUG=1
    ports:
      - "5679:5679"
    volumes:
@@ -25,18 +25,21 @@ services:
    scale: 1 # Can't scale until the load balancer is within the compose config
    restart: always

  celeryworker:
    <<: *django
    scale: 1
    ports: []
    command: /start-celeryworker
    restart: always
    # Workers run on dedicated machines via docker-compose.worker.yml (not here).

  celerybeat:
    <<: *django
    ports: []
    command: /start-celerybeat
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "/celery/healthcheck-beat.sh"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 90s
    labels:
      - "autoheal=true"

  flower:
    <<: *django

@@ -47,6 +50,20 @@ services:
    volumes:
      - ./data/flower/:/data/

  autoheal:
    # Docker Compose has no native restart-on-unhealthy (swarm-only feature).
    # willfarrell/autoheal watches the docker socket and restarts any container
    # labeled `autoheal=true` that Docker has marked unhealthy.
    image: willfarrell/autoheal:1.2.0
    container_name: ami_production_autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10 # poll docker for health every 10s
      - AUTOHEAL_START_PERIOD=60 # ignore containers in their start_period
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
  awscli:
    build:
      context: .
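A rough worst-case timeline, using the values configured above: the beat heartbeat tolerates up to 2 minutes of staleness, the probe then has to fail retries=3 times at interval=30s before Docker flips the container to unhealthy, and autoheal polls every 10 seconds, so a hung scheduler should be restarted within roughly 3.5 to 4 minutes of the hang. One operational note, stated as my understanding of Docker rather than anything in this diff: API-triggered restarts (which is how autoheal works) do not increment RestartCount, the counter that stayed at 0 during the incident, so confirm autoheal's actions from its logs:

```bash
# Recent autoheal activity (container_name taken from the compose file above).
docker logs --since 24h ami_production_autoheal

# Current status (including health) of everything labeled for autoheal.
docker ps --filter "label=autoheal=true" --format '{{.Names}}\t{{.Status}}'
```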
@@ -27,3 +27,22 @@ services:
    ports: []
    command: /start-celeryworker
    restart: always
    healthcheck:
      test: ["CMD-SHELL", "/celery/healthcheck.sh"]
      interval: 30s
      timeout: 15s
      retries: 3
      start_period: 90s
    labels:
      - "autoheal=true"

  autoheal:
    image: willfarrell/autoheal:1.2.0
    container_name: ami_worker_autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
      - AUTOHEAL_INTERVAL=10
      - AUTOHEAL_START_PERIOD=60
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
Comment on lines +47 to +48:
The worker protections described in the PR summary (--max-tasks-per-child=50 / --max-memory-per-child=4000000) don’t match the values configured here (100 / 2097152). Please reconcile the PR description and the actual configured limits so operators don’t deploy with unintended settings.
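For reference when reconciling: these limits normally live on the worker command line (in this repo presumably the /start-celeryworker script, whose contents are not shown here), and Celery's --max-memory-per-child is measured in kilobytes, so 2097152 is 2 GiB while the summary's 4000000 is about 3.8 GiB. A hypothetical sketch of the configured values as a celery invocation; only the two limit flags come from this review, everything else is assumed:

```bash
# Hypothetical excerpt of /start-celeryworker.
# --max-tasks-per-child: recycle a pool process after N tasks (guards leaks).
# --max-memory-per-child: recycle once resident memory exceeds N kilobytes
#   (2097152 KB = 2 GiB; the PR summary's 4000000 KB ≈ 3.8 GiB).
exec celery -A config.celery_app worker \
    --loglevel=INFO \
    --max-tasks-per-child=100 \
    --max-memory-per-child=2097152
```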