
Harden CPU overuse warning checks and add unit tests #5512

Open
avnig05 wants to merge 1 commit into master from issues/5478-cpu-usage-warnings

Conversation

Collaborator

@avnig05 avnig05 commented May 5, 2026

Moves CPU-overuse warning evaluation into Job._check_cpu_usage() so the logic is centralized and easier to test. The check skips warnings when wall-clock runtime is non-positive, warns when an invalid core request is encountered (cores <= 0), and otherwise warns only when CPU consumption exceeds the configured threshold factor.
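The logic described above can be sketched as a standalone function. This is a hypothetical reconstruction from the description: the function name, parameter names, and the 1.05 default for threshold_factor are assumptions, not Toil's actual implementation (the real method lives on Job and logs through the file store).

```python
import logging
from unittest.mock import Mock

# Hypothetical sketch of the described check; names and the default
# threshold_factor are assumptions, not the real Toil code.
def check_cpu_usage(cores, job_time, job_cpu_time, file_store, threshold_factor=1.05):
    if job_time <= 0:
        # Non-positive wall-clock runtime: nothing meaningful to compare.
        return
    if cores is None or cores <= 0:
        # Invalid core request gets its own warning.
        file_store.log_to_leader(
            f"Job has an invalid requested cores value {cores}",
            level=logging.WARNING,
        )
        return
    if job_cpu_time > job_time * cores * threshold_factor:
        # Only warn when CPU use exceeds the threshold factor.
        ratio = job_cpu_time / (job_time * cores)
        file_store.log_to_leader(
            f"Job used {ratio:.2f}x more CPU than the requested {cores} cores",
            level=logging.WARNING,
        )

store = Mock()
check_cpu_usage(2, 10.0, 21.0, store)  # exactly at the threshold: no warning
store.log_to_leader.assert_not_called()
check_cpu_usage(2, 10.0, 25.0, store)  # above the threshold: one warning
store.log_to_leader.assert_called_once()
```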

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@avnig05 avnig05 requested a review from adamnovak May 5, 2026 19:16
Member

@adamnovak adamnovak left a comment


The implementation looks good here.

The tests seem like they were written to pin down exactly what the implementation happens to do right now, rather than what is important about what it promises to do now and in the future. In general we want to test interfaces and not implementation details. For example, if we decide to reword the message to say "used XXX times as much CPU as requested" instead of "used XXXx more CPU than the requested X cores", we don't want the tests to start to fail.

One way to avoid this is to write the bulk of the tests before writing the implementation. If generating tests, it might make sense to provide the generator only with the documentation/signature for the function, and avoid showing it the body, or at least send along exhortations to ignore it.
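The interface-versus-implementation point can be illustrated concretely. The message text, the 'demo' job name, and the log_to_leader call shape here are invented for the example:

```python
import logging
from unittest.mock import Mock

# Simulate a job sending a CPU warning; the wording is invented here.
file_store = Mock()
file_store.log_to_leader(
    "Job 'demo' used 2.10x more CPU than the requested 2 cores",
    level=logging.WARNING,
)

message = file_store.log_to_leader.call_args.args[0]
level = file_store.log_to_leader.call_args.kwargs["level"]

# Brittle: pins the current phrasing, so a harmless reword breaks the test.
#   assert "more CPU than the requested 2 cores" in message
# Robust: constrains only what the interface promises -- a WARNING-level
# message that identifies the job.
assert "demo" in message
assert level == logging.WARNING
```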



class TestJob:


Suggested change

    job = Job()
    file_store = Mock()
    job._check_cpu_usage(
        job_time=0.0, job_cpu_time=100.0, fileStore=file_store)

The layout is inconsistent here for no reason I can see.

Suggested change

    -        job_time=0.0, job_cpu_time=100.0, fileStore=file_store)
    +        job_time=0.0, job_cpu_time=100.0, fileStore=file_store
    +    )

Comment on lines +833 to +844
    with subtests.test("warn for invalid cores value"):
        job = Job()
        file_store = Mock()
        job.cores = 0
        job._check_cpu_usage(
            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
        )
        file_store.log_to_leader.assert_called_once()
        invalid_cores_message = file_store.log_to_leader.call_args.args[0]
        invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
        assert "invalid requested cores value 0" in invalid_cores_message
        assert invalid_cores_level == logging.WARNING

The point that there ought to be a warning for 0 cores is not documented as part of the function's docstring, so it seems odd to me to test it. We end up with a test that just constrains the function to do what it currently happens to do, rather than what it promises to do. In a perfect world, we ought to be able to delete the body of a function and reconstruct a version of it that passes all the tests from just the documentation.

I would do something like this, where we make sure we can run the function through with 0 cores but we are agnostic as to whether it complains or not.

Suggested change

    -    with subtests.test("warn for invalid cores value"):
    -        job = Job()
    -        file_store = Mock()
    -        job.cores = 0
    -        job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
    -        )
    -        file_store.log_to_leader.assert_called_once()
    -        invalid_cores_message = file_store.log_to_leader.call_args.args[0]
    -        invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"]
    -        assert "invalid requested cores value 0" in invalid_cores_message
    -        assert invalid_cores_level == logging.WARNING
    +    with subtests.test("accepts 0 cores"):
    +        job = Job()
    +        file_store = Mock()
    +        job.cores = 0
    +        job._check_cpu_usage(
    +            job_time=10.0, job_cpu_time=10.0, fileStore=file_store
    +        )

We could alternatively add the point that the function promises to warn on 0 cores to its documentation.

Comment on lines +846 to +865
    with subtests.test("threshold boundary does not warn"):
        threshold_job = Job(cores=2)
        threshold_store = Mock()
        threshold_job._check_cpu_usage(
            job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
        )
        threshold_store.log_to_leader.assert_not_called()

    with subtests.test("just above threshold warns"):
        threshold_job = Job(cores=2)
        threshold_store = Mock()
        threshold_job._check_cpu_usage(
            job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
        )
        threshold_store.log_to_leader.assert_called_once()
        cpu_message = threshold_store.log_to_leader.call_args.args[0]
        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
        assert "used " in cpu_message
        assert "more CPU than the requested 2 cores" in cpu_message
        assert cpu_level == logging.WARNING

These are testing both the behavior at the threshold and the default threshold value, and we don't promise that the threshold will always be exactly 1.05.

It's also not immediately clear that the magic numbers have the right relationship for these tests to work.

I would do something like:

Suggested change

    -    with subtests.test("threshold boundary does not warn"):
    -        threshold_job = Job(cores=2)
    -        threshold_store = Mock()
    -        threshold_job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store
    -        )
    -        threshold_store.log_to_leader.assert_not_called()
    -
    -    with subtests.test("just above threshold warns"):
    -        threshold_job = Job(cores=2)
    -        threshold_store = Mock()
    -        threshold_job._check_cpu_usage(
    -            job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store
    -        )
    -        threshold_store.log_to_leader.assert_called_once()
    -        cpu_message = threshold_store.log_to_leader.call_args.args[0]
    -        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
    -        assert "used " in cpu_message
    -        assert "more CPU than the requested 2 cores" in cpu_message
    -        assert cpu_level == logging.WARNING
    +    TIME = 10
    +    THRESHOLD = 1.05
    +    CORES = 2
    +
    +    with subtests.test("threshold boundary does not warn"):
    +        threshold_job = Job(cores=CORES)
    +        threshold_store = Mock()
    +        threshold_job._check_cpu_usage(
    +            job_time=TIME,
    +            job_cpu_time=TIME * CORES * THRESHOLD,
    +            fileStore=threshold_store,
    +            threshold_factor=THRESHOLD,
    +        )
    +        threshold_store.log_to_leader.assert_not_called()
    +
    +    with subtests.test("just above threshold warns"):
    +        threshold_job = Job(cores=CORES)
    +        threshold_store = Mock()
    +        threshold_job._check_cpu_usage(
    +            job_time=TIME,
    +            job_cpu_time=TIME * CORES * THRESHOLD + 0.1,
    +            fileStore=threshold_store,
    +            threshold_factor=THRESHOLD,
    +        )
    +        threshold_store.log_to_leader.assert_called_once()
    +        cpu_message = threshold_store.log_to_leader.call_args.args[0]
    +        cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"]
    +        assert str(threshold_job.description) in cpu_message
    +        assert cpu_level == logging.WARNING

(I'm also de-constraining the message text here, to test whether a warning that mentions the job is sent, rather than the specific wording of the message.)

