Retry on ENOLCK from NFS lockd in fcntl-based locks (#4846) #5508

Merged
adamnovak merged 3 commits into master from issues/4846-retry-enolck-nfs
May 7, 2026

Conversation

@annagiroti
Collaborator

When Toil runs on NFS filesystems, fcntl.flock can raise OSError [Errno 37] No locks available (ENOLCK) if the NFS lock daemon (lockd) is temporarily unavailable. Previously, this caused jobs to crash immediately. This PR extends the existing retry logic in safe_lock (which already handled EIO for Ceph) to also retry on ENOLCK with exponential backoff. safe_unlock_and_close is also updated to swallow ENOLCK the same way it does EIO. Unit tests are added for both error cases using mocked fcntl.flock.
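
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Toil's actual safe_lock: the name lock_with_backoff, the attempt count, the initial delay, and the injectable flock/sleep parameters are all assumptions made so the example is self-contained.

```python
import errno
import fcntl
import time
from typing import Callable

# Errnos treated as transient, per the PR description:
# EIO (seen with Ceph) and ENOLCK (seen when NFS lockd is unavailable).
TRANSIENT_ERRNOS = (errno.EIO, errno.ENOLCK)


def lock_with_backoff(
    fd: int,
    max_attempts: int = 5,        # assumed value, not taken from Toil
    initial_delay: float = 0.1,   # assumed value, not taken from Toil
    flock: Callable[[int, int], None] = fcntl.flock,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Take an exclusive flock on fd, retrying transient errors with doubling delays."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            flock(fd, fcntl.LOCK_EX)
            return
        except OSError as e:
            if e.errno not in TRANSIENT_ERRNOS or attempt == max_attempts:
                # Either not a transient error, or we have run out of attempts.
                raise
            sleep(delay)
            delay *= 2  # exponential backoff
```

Passing flock and sleep as parameters is only a convenience for exercising the loop without a real NFS mount; the tests in this PR patch fcntl.flock instead.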

Resolves #4846

Changelog Entry

To be copied to the draft changelog by merger:

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@annagiroti annagiroti requested a review from adamnovak April 30, 2026 21:31
Member

@adamnovak adamnovak left a comment

This looks pretty good, but I think the test code should be deduplicated a bit.

Comment thread src/toil/lib/threading.py Outdated
        )
    else:
        logger.critical(
            "Too many IO errors talking to lock file. If using Ceph, check for MDS deadlocks. See <https://tracker.ceph.com/issues/62123>."
Member

We might wrap this message like the new one.
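
One common way to wrap a long log message like the one quoted above is implicit concatenation of adjacent string literals. This is only a sketch: how the new message in this PR is actually wrapped is not shown in the thread, and the MESSAGE constant is a name introduced here for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Adjacent string literals are joined by the compiler, so the logged text
# is identical to the single-line version; only the source layout changes.
MESSAGE = (
    "Too many IO errors talking to lock file. "
    "If using Ceph, check for MDS deadlocks. "
    "See <https://tracker.ceph.com/issues/62123>."
)

logger.critical(MESSAGE)
```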

Comment thread src/toil/lib/threading.py Outdated
-        if e.errno != errno.EIO:
+        if e.errno not in (errno.EIO, errno.ENOLCK):
             raise
         # Sometimes Ceph produces EIO. We don't need to retry then because
Member

This comment now only mentions one of the two cases its branch needs to implement. We probably could drop that and just talk about how we don't need to retry.
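
A rough sketch of what that branch could look like with a comment that covers both errnos. The function name and the injectable flock/close parameters are hypothetical, and the stated reason for not retrying (closing the descriptor drops the lock anyway, which is how flock-style locks behave once the last descriptor for the open file is closed) is an assumption about the truncated comment above rather than a quote from Toil.

```python
import errno
import fcntl
import os
from typing import Callable


def unlock_and_close_sketch(
    fd: int,
    flock: Callable[[int, int], None] = fcntl.flock,
    close: Callable[[int], None] = os.close,
) -> None:
    """Release the lock on fd if possible, then close fd regardless."""
    try:
        flock(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno not in (errno.EIO, errno.ENOLCK):
            raise
        # No retry needed for either errno: the descriptor is closed below,
        # and after that the file cannot remain locked through it.
    finally:
        close(fd)
```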

Comment thread src/toil/test/src/threadingTest.py Outdated
            "precious"
        ), f"File {filename} still exists"

    # Tests for ENOLCK (toil#4846)
Member

Do we need this? Is it better as a docstring?

Comment thread src/toil/test/src/threadingTest.py Outdated
        ), f"File {filename} still exists"

    # Tests for ENOLCK (toil#4846)
    def testSafeLockRetriesOnENOLCK(self) -> None:
Member

These new function names ought to be snake_case.

Comment thread src/toil/test/src/threadingTest.py Outdated
Comment on lines +75 to +125
    def testSafeLockRetriesOnENOLCK(self) -> None:
        enolck = OSError(errno.ENOLCK, "No locks available")
        # First call raises ENOLCK, second call succeeds
        with patch("fcntl.flock", side_effect=[enolck, None]) as mock_flock:
            safe_lock(0)
            assert mock_flock.call_count == 2

    def testSafeLockFailsAfterMaxRetriesOnENOLCK(self) -> None:
        enolck = OSError(errno.ENOLCK, "No locks available")
        # Every call raises ENOLCK, so the retries are eventually exhausted
        with patch("fcntl.flock", side_effect=enolck):
            with patch("toil.lib.threading.time.sleep"):  # skip the backoff waits
                try:
                    safe_lock(0)
                    assert False, "Expected OSError to be raised"
                except OSError as e:
                    assert e.errno == errno.ENOLCK

    def testSafeLockRetriesOnEIO(self) -> None:
        eio = OSError(errno.EIO, "Input/Output Error")
        # First call raises EIO, second call succeeds
        with patch("fcntl.flock", side_effect=[eio, None]) as mock_flock:
            safe_lock(0)
            assert mock_flock.call_count == 2

    def testSafeLockFailsAfterMaxRetriesOnEIO(self) -> None:
        eio = OSError(errno.EIO, "Input/Output Error")
        # Every call raises EIO, so the retries are eventually exhausted
        with patch("fcntl.flock", side_effect=eio):
            with patch("toil.lib.threading.time.sleep"):  # skip the backoff waits
                try:
                    safe_lock(0)
                    assert False, "Expected OSError to be raised"
                except OSError as e:
                    assert e.errno == errno.EIO

    def testSafeUnlockAndCloseSwallowsENOLCK(self) -> None:
        enolck = OSError(errno.ENOLCK, "No locks available")
        # The unlock raises ENOLCK; it should be swallowed and the fd still closed
        with patch("fcntl.flock", side_effect=enolck):
            with patch("os.close") as mock_close:
                safe_unlock_and_close(0)
                mock_close.assert_called_once_with(0)

    def testSafeUnlockAndCloseSwallowsEIO(self) -> None:
        # The unlock raises EIO; it should be swallowed and the fd still closed
        eio = OSError(errno.EIO, "Input/output error")
        with patch("fcntl.flock", side_effect=eio):
            with patch("os.close") as mock_close:
                safe_unlock_and_close(0)
                mock_close.assert_called_once_with(0)
Member

There are two substantially identical sets of 3 tests here, which differ just on the errno value and message used. We should consolidate them, either using inheritance (one base class with an abstract raise_error that we implement in IOError and NoLocksError subclasses), or using pytest subtests and a loop over the errno-and-message pairs (or over a constant list of pre-constructed exceptions).
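
As an illustration of the loop-based option, here is a sketch that uses the standard library's unittest subTest rather than pytest subtests (a deliberate swap to keep the example dependency-free). The two helper functions are hypothetical stand-ins for Toil's safe_lock and safe_unlock_and_close so that the example runs on its own; the mypy output later in this conversation suggests the merged code used a base class (BaseSafeLockingTest) instead.

```python
import errno
import fcntl
import os
import unittest
from unittest.mock import patch

# (errno, message) pairs the locking helpers are expected to tolerate.
TRANSIENT_ERRORS = [
    (errno.EIO, "Input/output error"),
    (errno.ENOLCK, "No locks available"),
]


def lock_retrying(fd: int, max_attempts: int = 3) -> None:
    """Hypothetical stand-in for safe_lock: retry transient errnos (no backoff)."""
    for attempt in range(1, max_attempts + 1):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            return
        except OSError as e:
            if e.errno not in (errno.EIO, errno.ENOLCK) or attempt == max_attempts:
                raise


def unlock_and_close(fd: int) -> None:
    """Hypothetical stand-in for safe_unlock_and_close."""
    try:
        fcntl.flock(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno not in (errno.EIO, errno.ENOLCK):
            raise
    finally:
        os.close(fd)


class TransientLockErrorTest(unittest.TestCase):
    def test_lock_retries(self) -> None:
        for code, message in TRANSIENT_ERRORS:
            with self.subTest(errno=code):
                error = OSError(code, message)
                # First call raises, second call succeeds.
                with patch("fcntl.flock", side_effect=[error, None]) as mock_flock:
                    lock_retrying(0)
                self.assertEqual(mock_flock.call_count, 2)

    def test_lock_fails_after_max_retries(self) -> None:
        for code, message in TRANSIENT_ERRORS:
            with self.subTest(errno=code):
                # Every call raises, so the error should eventually propagate.
                with patch("fcntl.flock", side_effect=OSError(code, message)):
                    with self.assertRaises(OSError) as caught:
                        lock_retrying(0)
                self.assertEqual(caught.exception.errno, code)

    def test_unlock_and_close_swallows(self) -> None:
        for code, message in TRANSIENT_ERRORS:
            with self.subTest(errno=code):
                with patch("fcntl.flock", side_effect=OSError(code, message)):
                    with patch("os.close") as mock_close:
                        unlock_and_close(0)
                mock_close.assert_called_once_with(0)
```

Each subTest reports its failures separately, so a regression in only the ENOLCK path would still be identified by errno.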

@adamnovak
Member

I like the code now, but if you click on the little red X and then "Details" on the failing Gitlab job, and open up the failing lint step and peruse the log, you can see that the type checking is failing because of this:

src/toil/test/src/threadingTest.py: note: In member "test_safe_lock_fails_after_max_retries" of class "BaseSafeLockingTest":
src/toil/test/src/threadingTest.py:102:39: error: "Exception" has no attribute
"errno"  [attr-defined]
                        assert e.errno == error.errno
                                          ^~~~~~~~~~~
Found 1 error in 1 file (checked 131 source files)

I think the problem is that get_exception() is typed to return Exception, but for us to check things based on the result's errno we need to type it as returning OSError instead.
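
A minimal sketch of the typing change being suggested. Only the name get_exception comes from this thread; the class names and the errno_matches helper are hypothetical and exist just to show why the narrower return type matters.

```python
import errno
from abc import ABC, abstractmethod


class ErrorSource(ABC):
    """Hypothetical base class standing in for the test base class."""

    @abstractmethod
    def get_exception(self) -> OSError:
        # Declaring OSError instead of Exception tells the type checker
        # that the result has an .errno attribute.
        ...


class NoLocksErrorSource(ErrorSource):
    def get_exception(self) -> OSError:
        return OSError(errno.ENOLCK, "No locks available")


def errno_matches(source: ErrorSource, raised: OSError) -> bool:
    """Compare errnos; this type-checks only because get_exception returns OSError."""
    expected = source.get_exception()
    return raised.errno == expected.errno
```

With the return annotation left as Exception, the expected.errno access is the kind of expression mypy flags with [attr-defined].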

@adamnovak
Member

@annagiroti See if you can fix this up so it passes all the CI tests. You should be able to use make mypy to run the type checking locally.

@adamnovak adamnovak merged commit 5904df0 into master May 7, 2026
3 checks passed
@adamnovak adamnovak deleted the issues/4846-retry-enolck-nfs branch May 7, 2026 15:23
Development

Successfully merging this pull request may close these issues.

Handle OSError: [Errno 37] No locks available in file locking

2 participants