
Support ML async job cancellation, fail jobs on redis errors#13

Closed
carlosgjs wants to merge 25 commits into main from carlos/redisfail

Conversation

@carlosgjs
Member

Summary

This pull request builds on RolnickLab#1150 and is based on the carlos/redisatomic branch.

This pull request introduces a new chaos-testing management command for fault injection and refactors the async job cleanup logic to improve reliability and resilience. The main changes are a manual chaos-testing utility, improved job log handling to prevent lost logs, and more robust cleanup of async resources (Redis and NATS). The cleanup logic is now consistent and reliable, especially in failure and cancellation scenarios.

  • Added chaos_monkey.py management command for manual fault injection of Redis and NATS, allowing developers to flush or pause these services to simulate outages and test job resilience.
  • Refactored cleanup_async_job_resources to accept job ID and logger instead of a Job instance, ensuring cleanup can occur even if the Job object is unavailable and improving logging consistency.
  • Introduced _fail_job helper to mark jobs as failed and trigger async resource cleanup when Redis state is missing, improving failure handling in NATS pipeline results.
  • Updated job cancellation logic to always trigger async cleanup and correctly set status to REVOKED for async jobs.
  • Improved job log handler to refresh logs from the database before writing, reducing lost logs due to concurrent updates.
  • Ensured logger handler always references the current job instance, preventing stale log writes in worker processes.
  • Added _stream_exists check in NATS queue orchestration to avoid unnecessary stream creation and improve error handling when reserving tasks.
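The `cleanup_async_job_resources` refactor above can be sketched as follows. This is a minimal illustration, not the project's implementation; `_delete_redis_state` and `_delete_nats_stream` are hypothetical stand-ins for the real Redis/NATS teardown calls:

```python
import logging

# Hypothetical stand-ins for the real Redis / NATS teardown calls;
# here they just record what was cleaned up.
cleaned = []

def _delete_redis_state(job_id):
    cleaned.append(("redis", job_id))

def _delete_nats_stream(job_id):
    cleaned.append(("nats", job_id))

def cleanup_async_job_resources(job_id: int, logger: logging.Logger) -> None:
    """Best-effort cleanup keyed by job id, so it still runs when the
    Job row itself is gone or unreadable."""
    for name, teardown in (("redis", _delete_redis_state),
                           ("nats", _delete_nats_stream)):
        try:
            teardown(job_id)
        except Exception:
            # One backend failing should not stop cleanup of the other.
            logger.exception("Failed to clean up %s for job %s", name, job_id)
```

Taking `job_id` and a `logger` rather than a `Job` instance decouples cleanup from ORM state, which is exactly the failure path this PR targets.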

How to Test the Changes

Start a job with e.g.:

docker compose run --rm django python manage.py test_ml_job_e2e --collection "ami-1000" --pipeline quebec_vermont_moths_2023 --dispatch-mode async_api --project 1

Then either cancel it in the UI, or flush/stop Redis:

docker compose run --rm django python manage.py chaos_monkey flush redis # must run in container
docker compose down redis

Screenshots

[two screenshots attached on GitHub]

Known Issues

Occasionally the error logs get overwritten by another worker, so the error won't be displayed; this is a known issue with the job logger.

Deployment Notes

Include instructions if this PR requires specific steps for its deployment (database migrations, config changes, etc.)

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

carlosgjs and others added 18 commits February 24, 2026 15:37
JobState(str, OrderedEnum) was using str's lexicographic __gt__
instead of OrderedEnum's definition-order __gt__, because str
comes first in the MRO. This caused max(FAILURE, SUCCESS) to
return SUCCESS, silently discarding failure state in concurrent
job progress updates.

Fix: __init_subclass__ injects comparison methods directly onto
each subclass so they take MRO priority over data-type mixins.

Also preserve FAILURE status through the progress ternary when
progress < 1.0, so early failure detection isn't overwritten.

Co-Authored-By: Claude <noreply@anthropic.com>
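The bug is reproducible with the standard `enum` module. This sketch uses the definition-order comparison from the Python docs' OrderedEnum recipe, plus the `__init_subclass__` injection the commit describes; only `__gt__` is shown, the real fix would inject all four comparisons:

```python
from enum import Enum

class OrderedEnum(Enum):
    """Ordered enum whose comparisons survive data-type mixins."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        # Inject the comparison directly onto the subclass so it is
        # found before str.__gt__ (or int.__gt__) during MRO lookup.
        def __gt__(self, other):
            if self.__class__ is other.__class__:
                members = list(self.__class__)  # definition order
                return members.index(self) > members.index(other)
            return NotImplemented

        cls.__gt__ = __gt__

class JobState(str, OrderedEnum):
    SUCCESS = "SUCCESS"   # defined first: compares lowest
    FAILURE = "FAILURE"   # defined last: must win max()

# Without the injection, str comes first in the MRO, so str.__gt__
# compares "SUCCESS" > "FAILURE" lexicographically and max() silently
# discards the failure state.
assert max(JobState.FAILURE, JobState.SUCCESS) is JobState.FAILURE
```

Defining `__gt__` directly in `OrderedEnum`'s body would not help: `str` still precedes `OrderedEnum` in the subclass's MRO, so its lexicographic comparison wins the lookup.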
The NATS message is ACK'd at line 145, before update_state() and
_update_job_progress(). If either of those raises, the except
block was logging "NATS will redeliver" when it won't.

Co-Authored-By: Claude <noreply@anthropic.com>
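A minimal sketch of the corrected log semantics (function and message names are illustrative, not the project's code):

```python
import asyncio
import logging

logger = logging.getLogger("results")
events = []

def update_state(data):
    # Hypothetical stand-in for the real state update; simulates the
    # failure path the commit describes.
    raise RuntimeError("redis unavailable")

async def process_result(msg):
    await msg.ack()  # explicit ack: NATS will NOT redeliver after this
    try:
        update_state(msg.data)
    except Exception:
        # Before the fix this claimed "NATS will redeliver", which is
        # false once the message has been ack'd.
        events.append("logged: message will not be redelivered")
        logger.exception("Failed to process result after ack; "
                         "message will not be redelivered")

class FakeMsg:
    data = b"{}"
    async def ack(self):
        events.append("acked")

asyncio.run(process_result(FakeMsg()))
```

The alternative, acking only after `update_state` succeeds, would make redelivery real, but that is a behavior change; this commit only makes the log message truthful.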
@carlosgjs carlosgjs requested a review from mihow February 26, 2026 19:35
@carlosgjs carlosgjs marked this pull request as ready for review February 26, 2026 19:37
@carlosgjs carlosgjs changed the title Suppport ML async job cancellation, fail jobs on redis errors Support ML async job cancellation, fail jobs on redis errors Feb 26, 2026
mihow and others added 5 commits February 26, 2026 17:56
…livery

For async_api jobs, the Celery task completes after queuing images to NATS,
so task.revoke() has no effect. The worker kept pulling tasks via the /tasks
endpoint because it only checked final_states(), not CANCELING.

- Add JobState.active_states() (STARTED, RETRY) for positive task-serving check
- /tasks endpoint returns empty unless job is in active_states()
- Job.cancel() for async_api jobs: clean up NATS/Redis, then set REVOKED

Co-Authored-By: Claude <noreply@anthropic.com>
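The positive task-serving check can be sketched like this; state names beyond STARTED, RETRY, and CANCELING are assumed, and `reserve_tasks` is a hypothetical stand-in for the /tasks endpoint guard:

```python
from enum import Enum

class JobState(str, Enum):
    PENDING = "PENDING"
    STARTED = "STARTED"
    RETRY = "RETRY"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"

    @classmethod
    def active_states(cls):
        # Positive check: the job is actively running and may be served
        # tasks. Checking final_states() alone misses CANCELING, which
        # is neither active nor final.
        return {cls.STARTED, cls.RETRY}

def reserve_tasks(job_state, queued_tasks):
    """Sketch of the /tasks guard: a CANCELING job drains instead of
    receiving more work."""
    if job_state not in JobState.active_states():
        return []
    return queued_tasks
```

This is why `task.revoke()` alone was insufficient for async_api jobs: the Celery task has already finished queuing, so only the endpoint-side guard can stop the worker from pulling more tasks.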
canRetry now excludes CANCELING so the Retry button stays hidden
during the drain period, matching the backend's transitional state.

Co-Authored-By: Claude <noreply@anthropic.com>
When a job is canceled, NATS/Redis cleanup runs before in-flight results
finish processing. The resulting "Redis state missing" message is expected,
not an error.

Co-Authored-By: Claude <noreply@anthropic.com>
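The expected-vs-error distinction can be sketched as follows (all names hypothetical; `_fail_job` stands in for the helper this PR introduces):

```python
import logging

failed_jobs = []

def _fail_job(job_id):
    # Hypothetical stand-in: mark the job FAILED and trigger cleanup.
    failed_jobs.append(job_id)

def on_missing_redis_state(job_id, canceling, logger):
    """After a cancel, NATS/Redis cleanup races in-flight results, so a
    missing key is expected. Otherwise it means lost job state and the
    job should be failed rather than left hanging."""
    if canceling:
        logger.info("Redis state missing for job %s (expected after cancel)",
                    job_id)
        return "expected"
    logger.error("Redis state missing for job %s; failing job", job_id)
    _fail_job(job_id)
    return "failed"
```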
Covers all monitoring points for NATS async jobs: Django ORM, REST API,
tasks endpoint, NATS consumer state, Redis counters, Docker logs, and
AMI worker logs. Linked from CLAUDE.md and the test_ml_job_e2e command.

Co-Authored-By: Claude <noreply@anthropic.com>
Tests need to set job status to STARTED since the /tasks endpoint
now only serves tasks for jobs in active_states() (STARTED, RETRY).

Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator

@mihow mihow left a comment

Works! I am canceling and retrying on a 1000 image job. Very cool!

I made one fix to keep /tasks from returning tasks while canceling, and added a new active_states group that includes all job states that count as the job still being in motion.

@mihow mihow changed the base branch from carlosg/redisatomic to main February 27, 2026 07:00
@carlosgjs carlosgjs closed this Feb 27, 2026
