Harden CPU overuse warning checks and add unit tests#5512
Harden CPU overuse warning checks and add unit tests#5512
Conversation
adamnovak
left a comment
There was a problem hiding this comment.
The implementation looks good here.
The tests seem like they were written to pin down exactly what the implementation happens to do right now, rather than what is important about what it promises to do now and in the future. In general we want to test interfaces and not implementation details. For example, if we decide to reword the message to say used XXX times as much CPU as requested instead of used XXXx more CPU than the requested X cores, we don't want the tests to start to fail.
One way to avoid this is to write the bulk of the tests before writing the implementation. If generating tests, it might make sense to provide the generator only with the documentation/signature for the function, and avoid showing it the body, or at least send along exhortations to ignore it.
|
|
||
|
|
||
| class TestJob: | ||
|
|
| job = Job() | ||
| file_store = Mock() | ||
| job._check_cpu_usage( | ||
| job_time=0.0, job_cpu_time=100.0, fileStore=file_store) |
There was a problem hiding this comment.
The layout is inconsistent here for no reason I can see.
| job_time=0.0, job_cpu_time=100.0, fileStore=file_store) | |
| job_time=0.0, job_cpu_time=100.0, fileStore=file_store | |
| ) |
| with subtests.test("warn for invalid cores value"): | ||
| job = Job() | ||
| file_store = Mock() | ||
| job.cores = 0 | ||
| job._check_cpu_usage( | ||
| job_time=10.0, job_cpu_time=10.0, fileStore=file_store | ||
| ) | ||
| file_store.log_to_leader.assert_called_once() | ||
| invalid_cores_message = file_store.log_to_leader.call_args.args[0] | ||
| invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"] | ||
| assert "invalid requested cores value 0" in invalid_cores_message | ||
| assert invalid_cores_level == logging.WARNING |
There was a problem hiding this comment.
The point that there ought to be a warning for 0 cores is not documented as part of the function's docstring, so it seems odd to me to test it. We end up with a test that just constrains the function to do what it currently happens to do, rather than what it promises to do. In a perfect world, we ought to be able to delete the body of a function and reconstruct a version of it that passes all the tests from just the documentation.
I would do something like this, where we make sure we can run the function through with 0 cores but we are agnostic as to whether it complains or not.
| with subtests.test("warn for invalid cores value"): | |
| job = Job() | |
| file_store = Mock() | |
| job.cores = 0 | |
| job._check_cpu_usage( | |
| job_time=10.0, job_cpu_time=10.0, fileStore=file_store | |
| ) | |
| file_store.log_to_leader.assert_called_once() | |
| invalid_cores_message = file_store.log_to_leader.call_args.args[0] | |
| invalid_cores_level = file_store.log_to_leader.call_args.kwargs["level"] | |
| assert "invalid requested cores value 0" in invalid_cores_message | |
| assert invalid_cores_level == logging.WARNING | |
| with subtests.test("accepts 0 cores"): | |
| job = Job() | |
| file_store = Mock() | |
| job.cores = 0 | |
| job._check_cpu_usage( | |
| job_time=10.0, job_cpu_time=10.0, fileStore=file_store | |
| ) |
We could alternatively add the point that the function promises to warn on 0 cores to its documentation.
| with subtests.test("threshold boundary does not warn"): | ||
| threshold_job = Job(cores=2) | ||
| threshold_store = Mock() | ||
| threshold_job._check_cpu_usage( | ||
| job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store | ||
| ) | ||
| threshold_store.log_to_leader.assert_not_called() | ||
|
|
||
| with subtests.test("just above threshold warns"): | ||
| threshold_job = Job(cores=2) | ||
| threshold_store = Mock() | ||
| threshold_job._check_cpu_usage( | ||
| job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store | ||
| ) | ||
| threshold_store.log_to_leader.assert_called_once() | ||
| cpu_message = threshold_store.log_to_leader.call_args.args[0] | ||
| cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"] | ||
| assert "used " in cpu_message | ||
| assert "more CPU than the requested 2 cores" in cpu_message | ||
| assert cpu_level == logging.WARNING |
There was a problem hiding this comment.
These are testing both the behavior at the threshold and the default threshold value, and we don't promise that the threshold will always be exactly 1.05.
It's also not immediately clear that the magic numbers have the right relationship for these tests to work.
I would do something like:
| with subtests.test("threshold boundary does not warn"): | |
| threshold_job = Job(cores=2) | |
| threshold_store = Mock() | |
| threshold_job._check_cpu_usage( | |
| job_time=10.0, job_cpu_time=21.0, fileStore=threshold_store | |
| ) | |
| threshold_store.log_to_leader.assert_not_called() | |
| with subtests.test("just above threshold warns"): | |
| threshold_job = Job(cores=2) | |
| threshold_store = Mock() | |
| threshold_job._check_cpu_usage( | |
| job_time=10.0, job_cpu_time=21.1, fileStore=threshold_store | |
| ) | |
| threshold_store.log_to_leader.assert_called_once() | |
| cpu_message = threshold_store.log_to_leader.call_args.args[0] | |
| cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"] | |
| assert "used " in cpu_message | |
| assert "more CPU than the requested 2 cores" in cpu_message | |
| assert cpu_level == logging.WARNING | |
| TIME = 10 | |
| THRESHOLD = 1.05 | |
| CORES = 2 | |
| with subtests.test("threshold boundary does not warn"): | |
| threshold_job = Job(cores=CORES) | |
| threshold_store = Mock() | |
| threshold_job._check_cpu_usage( | |
| job_time=TIME, | |
| job_cpu_time=TIME * CORES * THRESHOLD, | |
| fileStore=threshold_store, | |
| threshold_factor=THRESHOLD, | |
| ) | |
| threshold_store.log_to_leader.assert_not_called() | |
| with subtests.test("just above threshold warns"): | |
| threshold_job = Job(cores=CORES) | |
| threshold_store = Mock() | |
| threshold_job._check_cpu_usage( | |
| job_time=TIME, | |
| job_cpu_time=TIME * CORES * THRESHOLD + 0.1, | |
| fileStore=threshold_store, | |
| threshold_factor=THRESHOLD, | |
| ) | |
| threshold_store.log_to_leader.assert_called_once() | |
| cpu_message = threshold_store.log_to_leader.call_args.args[0] | |
| cpu_level = threshold_store.log_to_leader.call_args.kwargs["level"] | |
| assert str(threshold_job.description) in cpu_message | |
| assert cpu_level == logging.WARNING |
(I'm also de-constraining the message text here, to test whether a warning that mentions the job is sent, rather than the specific wording of the message.)
Moves CPU-overuse warning evaluation into Job._check_cpu_usage() so the logic is centralized and easier to test. The check skips warnings when wall-clock runtime is non-positive, warns when an invalid core request is encountered (cores <= 0), and only gives a warning when CPU consumption exceeds the configured threshold factor.
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thingin the Toil repo, or from an external repo.camelCasethat want to be insnake_case.docs/running/{cliOptions,cwl,wdl}.rstMerger Checklist