
Support ML async job cancellation, fail jobs on redis errors#13

Closed
carlosgjs wants to merge 25 commits into main from carlos/redisfail

Conversation

@carlosgjs
Member

Summary

This pull request builds on RolnickLab#1150 and is based on the carlos/redisatomic branch.

This pull request introduces a new chaos-testing management command for fault injection and refactors the async job cleanup logic to improve reliability and resilience. The main changes are a manual chaos-testing utility, improved job log handling to prevent lost logs, and more robust cleanup of async resources (Redis and NATS). The cleanup logic is now consistent and reliable, especially in failure and cancellation scenarios.

  • Added chaos_monkey.py management command for manual fault injection of Redis and NATS, allowing developers to flush or pause these services to simulate outages and test job resilience.
  • Refactored cleanup_async_job_resources to accept job ID and logger instead of a Job instance, ensuring cleanup can occur even if the Job object is unavailable and improving logging consistency.
  • Introduced _fail_job helper to mark jobs as failed and trigger async resource cleanup when Redis state is missing, improving failure handling in NATS pipeline results.
  • Updated job cancellation logic to always trigger async cleanup and correctly set status to REVOKED for async jobs.
  • Improved job log handler to refresh logs from the database before writing, reducing lost logs due to concurrent updates.
  • Ensured logger handler always references the current job instance, preventing stale log writes in worker processes.
  • Added _stream_exists check in NATS queue orchestration to avoid unnecessary stream creation and improve error handling when reserving tasks.
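The `cleanup_async_job_resources` refactor above can be sketched as follows. This is a minimal illustration, not the project's implementation; `_delete_redis_state` and `_delete_nats_stream` are hypothetical stand-ins for the real Redis/NATS teardown calls:

```python
import logging

# Hypothetical stand-ins for the real Redis / NATS teardown calls;
# here they just record what was cleaned up.
cleaned = []

def _delete_redis_state(job_id):
    cleaned.append(("redis", job_id))

def _delete_nats_stream(job_id):
    cleaned.append(("nats", job_id))

def cleanup_async_job_resources(job_id: int, logger: logging.Logger) -> None:
    """Best-effort cleanup keyed by job id, so it still runs when the
    Job row itself is gone or unreadable."""
    for name, teardown in (("redis", _delete_redis_state),
                           ("nats", _delete_nats_stream)):
        try:
            teardown(job_id)
        except Exception:
            # One backend failing should not stop cleanup of the other.
            logger.exception("Failed to clean up %s for job %s", name, job_id)
```

Taking `job_id` and a `logger` rather than a `Job` instance decouples cleanup from ORM state, which is exactly the failure path this PR targets.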

How to Test the Changes

Start a job with e.g.:

docker compose run --rm django python manage.py test_ml_job_e2e --collection "ami-1000" --pipeline quebec_vermont_moths_2023 --dispatch-mode async_api --project 1

Then either cancel it in the UI, or flush/stop Redis:

docker compose run --rm django python manage.py chaos_monkey flush redis # must run in container
docker compose down redis

Screenshots

[two screenshots attached on GitHub]

Known Issues

Occasionally the error logs get overwritten by another worker, so the error won't be displayed; this is a known issue with the job logger.

Deployment Notes

Include instructions if this PR requires specific steps for its deployment (database migrations, config changes, etc.)

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

carlosgjs and others added 18 commits February 24, 2026 15:37
JobState(str, OrderedEnum) was using str's lexicographic __gt__
instead of OrderedEnum's definition-order __gt__, because str
comes first in the MRO. This caused max(FAILURE, SUCCESS) to
return SUCCESS, silently discarding failure state in concurrent
job progress updates.

Fix: __init_subclass__ injects comparison methods directly onto
each subclass so they take MRO priority over data-type mixins.

Also preserve FAILURE status through the progress ternary when
progress < 1.0, so early failure detection isn't overwritten.

Co-Authored-By: Claude <noreply@anthropic.com>
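The bug is reproducible with the standard `enum` module. This sketch uses the definition-order comparison from the Python docs' OrderedEnum recipe, plus the `__init_subclass__` injection the commit describes; only `__gt__` is shown, the real fix would inject all four comparisons:

```python
from enum import Enum

class OrderedEnum(Enum):
    """Ordered enum whose comparisons survive data-type mixins."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Inject the comparison directly onto the subclass so it is
        # found before str.__gt__ (or int.__gt__) during MRO lookup.
        def __gt__(self, other):
            if self.__class__ is other.__class__:
                members = list(self.__class__)  # definition order
                return members.index(self) > members.index(other)
            return NotImplemented

        cls.__gt__ = __gt__

class JobState(str, OrderedEnum):
    SUCCESS = "SUCCESS"   # defined first: compares lowest
    FAILURE = "FAILURE"   # defined last: must win max()

# Without the injection, str comes first in the MRO, so str.__gt__
# compares "SUCCESS" > "FAILURE" lexicographically and max() silently
# discards the failure state.
assert max(JobState.FAILURE, JobState.SUCCESS) is JobState.FAILURE
```

Defining `__gt__` directly in `OrderedEnum`'s body would not help: `str` still precedes `OrderedEnum` in the subclass's MRO, so its lexicographic comparison wins the lookup.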
The NATS message is ACK'd at line 145, before update_state() and
_update_job_progress(). If either of those raises, the except
block was logging "NATS will redeliver" when it won't.

Co-Authored-By: Claude <noreply@anthropic.com>
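A minimal sketch of the corrected log semantics (function and message names are illustrative, not the project's code):

```python
import asyncio
import logging

logger = logging.getLogger("results")
events = []

def update_state(data):
    # Hypothetical stand-in for the real state update; simulates the
    # failure path the commit describes.
    raise RuntimeError("redis unavailable")

async def process_result(msg):
    await msg.ack()  # explicit ack: NATS will NOT redeliver after this
    try:
        update_state(msg.data)
    except Exception:
        # Before the fix this claimed "NATS will redeliver", which is
        # false once the message has been ack'd.
        events.append("logged: message will not be redelivered")
        logger.exception("Failed to process result after ack; "
                         "message will not be redelivered")

class FakeMsg:
    data = b"{}"
    async def ack(self):
        events.append("acked")

asyncio.run(process_result(FakeMsg()))
```

The alternative, acking only after `update_state` succeeds, would make redelivery real, but that is a behavior change; this commit only makes the log message truthful.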
@carlosgjs carlosgjs requested a review from mihow February 26, 2026 19:35
@carlosgjs carlosgjs marked this pull request as ready for review February 26, 2026 19:37
@carlosgjs carlosgjs changed the title Suppport ML async job cancellation, fail jobs on redis errors Support ML async job cancellation, fail jobs on redis errors Feb 26, 2026
mihow and others added 5 commits February 26, 2026 17:56
…livery

For async_api jobs, the Celery task completes after queuing images to NATS,
so task.revoke() has no effect. The worker kept pulling tasks via the /tasks
endpoint because it only checked final_states(), not CANCELING.

- Add JobState.active_states() (STARTED, RETRY) for positive task-serving check
- /tasks endpoint returns empty unless job is in active_states()
- Job.cancel() for async_api jobs: clean up NATS/Redis, then set REVOKED

Co-Authored-By: Claude <noreply@anthropic.com>
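The positive task-serving check can be sketched like this; state names beyond STARTED, RETRY, and CANCELING are assumed, and `reserve_tasks` is a hypothetical stand-in for the /tasks endpoint guard:

```python
from enum import Enum

class JobState(str, Enum):
    PENDING = "PENDING"
    STARTED = "STARTED"
    RETRY = "RETRY"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"

    @classmethod
    def active_states(cls):
        # Positive check: the job is actively running and may be served
        # tasks. Checking final_states() alone misses CANCELING, which
        # is neither active nor final.
        return {cls.STARTED, cls.RETRY}

def reserve_tasks(job_state, queued_tasks):
    """Sketch of the /tasks guard: a CANCELING job drains instead of
    receiving more work."""
    if job_state not in JobState.active_states():
        return []
    return queued_tasks
```

This is why `task.revoke()` alone was insufficient for async_api jobs: the Celery task has already finished queuing, so only the endpoint-side guard can stop the worker from pulling more tasks.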
canRetry now excludes CANCELING so the Retry button stays hidden
during the drain period, matching the backend's transitional state.

Co-Authored-By: Claude <noreply@anthropic.com>
When a job is canceled, NATS/Redis cleanup runs before in-flight results
finish processing. The resulting "Redis state missing" message is expected,
not an error.

Co-Authored-By: Claude <noreply@anthropic.com>
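The expected-vs-error distinction can be sketched as follows (all names hypothetical; `_fail_job` stands in for the helper this PR introduces):

```python
import logging

failed_jobs = []

def _fail_job(job_id):
    # Hypothetical stand-in: mark the job FAILED and trigger cleanup.
    failed_jobs.append(job_id)

def on_missing_redis_state(job_id, canceling, logger):
    """After a cancel, NATS/Redis cleanup races in-flight results, so a
    missing key is expected. Otherwise it means lost job state and the
    job should be failed rather than left hanging."""
    if canceling:
        logger.info("Redis state missing for job %s (expected after cancel)",
                    job_id)
        return "expected"
    logger.error("Redis state missing for job %s; failing job", job_id)
    _fail_job(job_id)
    return "failed"
```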
Covers all monitoring points for NATS async jobs: Django ORM, REST API,
tasks endpoint, NATS consumer state, Redis counters, Docker logs, and
AMI worker logs. Linked from CLAUDE.md and the test_ml_job_e2e command.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests need to set job status to STARTED since the /tasks endpoint
now only serves tasks for jobs in active_states() (STARTED, RETRY).

Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator

@mihow mihow left a comment

Works! I am canceling and retrying on a 1000 image job. Very cool!

I made one fix to keep /tasks from returning tasks while canceling, and added a new active_states group that includes all job states that count as the job still being in motion.

@mihow mihow changed the base branch from carlosg/redisatomic to main February 27, 2026 07:00
@carlosgjs carlosgjs closed this Feb 27, 2026
