
Harden CPU overuse warning checks and add unit tests #5512

Open
avnig05 wants to merge 1 commit into master from issues/5478-cpu-usage-warnings

Conversation

Collaborator

@avnig05 avnig05 commented May 5, 2026

Moves CPU-overuse warning evaluation into Job._check_cpu_usage() so the logic is centralized and easier to test. The check skips warnings when wall-clock runtime is non-positive, warns when an invalid core request is encountered (cores <= 0), and otherwise warns only when CPU consumption exceeds the configured threshold factor.
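The logic described above can be sketched as a standalone function. This is a hypothetical reconstruction from the description: the function name, parameter names, and the 1.05 default for threshold_factor are assumptions, not Toil's actual implementation (the real method lives on Job and logs through the file store).

```python
import logging
from unittest.mock import Mock

# Hypothetical sketch of the described check; names and the default
# threshold_factor are assumptions, not the real Toil code.
def check_cpu_usage(cores, job_time, job_cpu_time, file_store, threshold_factor=1.05):
    if job_time <= 0:
        # Non-positive wall-clock runtime: nothing meaningful to compare.
        return
    if cores is None or cores <= 0:
        # Invalid core request gets its own warning.
        file_store.log_to_leader(
            f"Job has an invalid requested cores value {cores}",
            level=logging.WARNING,
        )
        return
    if job_cpu_time > job_time * cores * threshold_factor:
        # Only warn when CPU use exceeds the threshold factor.
        ratio = job_cpu_time / (job_time * cores)
        file_store.log_to_leader(
            f"Job used {ratio:.2f}x more CPU than the requested {cores} cores",
            level=logging.WARNING,
        )

store = Mock()
check_cpu_usage(2, 10.0, 21.0, store)  # exactly at the threshold: no warning
store.log_to_leader.assert_not_called()
check_cpu_usage(2, 10.0, 25.0, store)  # above the threshold: one warning
store.log_to_leader.assert_called_once()
```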

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@avnig05 avnig05 requested a review from adamnovak May 5, 2026 19:16
Member

@adamnovak adamnovak left a comment


The implementation looks good here.

The tests seem like they were written to pin down exactly what the implementation happens to do right now, rather than what is important about what it promises to do now and in the future. In general we want to test interfaces and not implementation details. For example, if we decide to reword the message to say "used XXX times as much CPU as requested" instead of "used XXXx more CPU than the requested X cores", we don't want the tests to start to fail.

One way to avoid this is to write the bulk of the tests before writing the implementation. If generating tests, it might make sense to provide the generator only with the documentation/signature for the function, and avoid showing it the body, or at least send along exhortations to ignore it.
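The interface-versus-implementation point can be illustrated concretely. The message text, the 'demo' job name, and the log_to_leader call shape here are invented for the example:

```python
import logging
from unittest.mock import Mock

# Simulate a job sending a CPU warning; the wording is invented here.
file_store = Mock()
file_store.log_to_leader(
    "Job 'demo' used 2.10x more CPU than the requested 2 cores",
    level=logging.WARNING,
)

message = file_store.log_to_leader.call_args.args[0]
level = file_store.log_to_leader.call_args.kwargs["level"]

# Brittle: pins the current phrasing, so a harmless reword breaks the test.
#   assert "more CPU than the requested 2 cores" in message
# Robust: constrains only what the interface promises -- a WARNING-level
# message that identifies the job.
assert "demo" in message
assert level == logging.WARNING
```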



class TestJob:


Suggested change

    job = Job()
    file_store = Mock()
    job._check_cpu_usage(
        job_time=0.0, job_cpu_time=100.0, fileStore=file_store)

The layout is inconsistent here for no reason I can see.

Suggested change

    -        job_time=0.0, job_cpu_time=100.0, fileStore=file_store)
    +        job_time=0.0, job_cpu_time=100.0, fileStore=file_store
    +    )

Comment on lines +833 to +844
    with subtests.test("warn for invalid cores value"):
        job = Job()
        file_store = Mock()
        job.cores = 0
        job._check_cpu_usage(
            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
        )
        file_store.log_to_leader.assert_called_once()
        invalid_cores_message = file_store.log_to_leader.call_args.args[0]
        invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
        assert "invalid requested cores value 0" in invalid_cores_message
        assert invalid_cores_level == logging.WARNING

The point that there ought to be a warning for 0 cores is not documented as part of the function's docstring, so it seems odd to me to test it. We end up with a test that just constrains the function to do what it currently happens to do, rather than what it promises to do. In a perfect world, we ought to be able to delete the body of a function and reconstruct a version of it that passes all the tests from just the documentation.

I would do something like this, where we make sure we can run the function through with 0 cores but we are agnostic as to whether it complains or not.

Suggested change

    -    with subtests.test("warn for invalid cores value"):
    -        job = Job()
    -        file_store = Mock()
    -        job.cores = 0
    -        job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
    -        )
    -        file_store.log_to_leader.assert_called_once()
    -        invalid_cores_message = file_store.log_to_leader.call_args.args[0]
    -        invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
    -        assert "invalid requested cores value 0" in invalid_cores_message
    -        assert invalid_cores_level == logging.WARNING
    +    with subtests.test("accepts 0 cores"):
    +        job = Job()
    +        file_store = Mock()
    +        job.cores = 0
    +        job._check_cpu_usage(
    +            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
    +        )

We could alternatively add the point that the function promises to warn on 0 cores to its documentation.

Comment on lines +846 to +865
    with subtests.test("threshold boundary does not warn"):
        threshold_job = Job(cores=2)
        threshold_store = Mock()
        threshold_job._check_cpu_usage(
            job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
        )
        threshold_store.log_to_leader.assert_not_called()

    with subtests.test("just above threshold warns"):
        threshold_job = Job(cores=2)
        threshold_store = Mock()
        threshold_job._check_cpu_usage(
            job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
        )
        threshold_store.log_to_leader.assert_called_once()
        cpu_message = threshold_store.log_to_leader.call_args.args[0]
        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
        assert "used " in cpu_message
        assert "more CPU than the requested 2 cores" in cpu_message
        assert cpu_level == logging.WARNING

These are testing both the behavior at the threshold and the default threshold value, and we don't promise that the threshold will always be exactly 1.05.

It's also not immediately clear that the magic numbers have the right relationship for these tests to work.

I would do something like:

Suggested change

    -    with subtests.test("threshold boundary does not warn"):
    -        threshold_job = Job(cores=2)
    -        threshold_store = Mock()
    -        threshold_job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
    -        )
    -        threshold_store.log_to_leader.assert_not_called()
    -
    -    with subtests.test("just above threshold warns"):
    -        threshold_job = Job(cores=2)
    -        threshold_store = Mock()
    -        threshold_job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
    -        )
    -        threshold_store.log_to_leader.assert_called_once()
    -        cpu_message = threshold_store.log_to_leader.call_args.args[0]
    -        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
    -        assert "used " in cpu_message
    -        assert "more CPU than the requested 2 cores" in cpu_message
    -        assert cpu_level == logging.WARNING
    +    TIME = 10
    +    THRESHOLD = 1.05
    +    CORES = 2
    +
    +    with subtests.test("threshold boundary does not warn"):
    +        threshold_job = Job(cores=CORES)
    +        threshold_store = Mock()
    +        threshold_job._check_cpu_usage(
    +            job_time=TIME,
    +            job_cpu_time=TIME * CORES * THRESHOLD,
    +            fileStore=threshold_store,
    +            threshold_factor=THRESHOLD,
    +        )
    +        threshold_store.log_to_leader.assert_not_called()
    +
    +    with subtests.test("just above threshold warns"):
    +        threshold_job = Job(cores=CORES)
    +        threshold_store = Mock()
    +        threshold_job._check_cpu_usage(
    +            job_time=TIME,
    +            job_cpu_time=TIME * CORES * THRESHOLD + 0.1,
    +            fileStore=threshold_store,
    +            threshold_factor=THRESHOLD,
    +        )
    +        threshold_store.log_to_leader.assert_called_once()
    +        cpu_message = threshold_store.log_to_leader.call_args.args[0]
    +        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
    +        assert str(threshold_job.description) in cpu_message
    +        assert cpu_level == logging.WARNING

(I'm also de-constraining the message text here, to test whether a warning that mentions the job is sent, rather than the specific wording of the message.)

