Python: Fix MCP plugin hang when initialization fails #13414#13437
Python: Fix MCP plugin hang when initialization fails #13414#13437aryanyk wants to merge 3 commits intomicrosoft:mainfrom
Conversation
|
@microsoft-github-policy-service agree |
Python Test Coverage Report •
Python Unit Test Overview
|
||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 93%
✗ Correctness
The PR replaces asyncio.Event with asyncio.Future for signaling readiness between connect() and _inner_connect(), which correctly enables exception propagation to the caller. However, it introduces a deadlock: if load_tools(), load_prompts(), or session.initialize() in the elif branch throws an exception, the future is never resolved (no set_result or set_exception), causing connect() to hang forever on
await ready_future. The old code was protected from this because load_tools/load_prompts had internal try/except that swallowed all exceptions, but this PR removes those guards without adding equivalent protection around the calls in _inner_connect. Additionally, removing the try/except in load_tools/load_prompts also affects the message_handler notification path (lines 462-465), where failures will now propagate as unhandled exceptions.
✗ Security Reliability
The PR replaces asyncio.Event with asyncio.Future for signaling readiness from the _inner_connect task, which is a sound improvement for propagating exceptions. However, it simultaneously removes the try/except guards around load_tools() and load_prompts(), which are called before ready_future.set_result(None) in _inner_connect. If either method raises, the exception propagates as an unhandled task exception, but the Future is never resolved — causing connect() to hang indefinitely awaiting the Future. This is a deadlock/hang bug introduced by the combination of these two changes.
✗ Test Coverage
This PR replaces asyncio.Event with asyncio.Future for signaling readiness in connect/inner_connect and removes silent exception swallowing in load_tools/load_prompts. The critical test coverage gap is that the behavior change in load_tools() and load_prompts() (from silently swallowing exceptions to propagating them) has no corresponding tests. Worse, testing this would reveal a latent bug: if load_tools() or load_prompts() raises, the exception propagates out of _inner_connect() without ever resolving the ready_future, causing connect() to hang indefinitely on
await ready_future. The three guarded except blocks in _inner_connect correctly set_exception on the future, but exceptions from load_tools/load_prompts between those blocks and set_result are unhandled. Additionally, there are no tests for the session creation failure or session initialization failure paths (second and third except blocks in _inner_connect).
✗ Design Approach
The
asyncio.Event→asyncio.Futurerefactor is a genuine and necessary bug fix: in the original code, exceptions raised inside_inner_connect(a background task) were set on the event and re-raised in the task, butawait ready_event.wait()inconnect()never surfaced the task's exception—meaning theexcept KernelPluginInvalidConfigurationErrorblock inconnect()was dead code and calers never saw connection errors. TheFuture-based approach correctly propagates exceptions. However, the removal of the try/except guards insideload_toolsandload_promptsis too broad. These methods are called in two distinct contexts: (1) during initial connect, where propagating errors is desirable, and (2) frommessage_handlerin response to livenotifications/tools/list_changed/notifications/prompts/list_changedserver notifications. Removing the guards means a transientlist_tools()failure during a live notification will propagate uncaught frommessage_handler, which the MCP client library may not handle gracefully and which could silently break ongoing sessions. The correct fix is to propagate errors during initial connect (which the Future refactor already handles), while keeping resilient error handling in themessage_handlercall sites or inside the load methods themselves.
Flagged Issues
- Deadlock: If load_tools(), load_prompts(), session.initialize() (elif branch), or set_logging_level() raises an exception inside _inner_connect, the ready_future is never resolved—neither set_result nor set_exception is called. This causes connect() to hang forever on
await ready_future. The code between session setup and ready_future.set_result(None) needs a wrapping try/except that calls ready_future.set_exception(exc) on failure, consistent with the earlier error-handling blocks in _inner_connect. - Removing exception handling from load_tools() and load_prompts() is too broad. Both methods are called not only during initial connection but also as notification callbacks from message_handler (for notifications/tools/list_changed and notifications/prompts/list_changed). A transient list_tools()/list_prompts() failure during a live server notification will now propagate as an unhandled exception from message_handler, potentially crashing or degrading the session. The fix should add try/except guards at the message_handler call sites so the initial-connect path propagates errors via the Future while the notification path degrades gracefully.
Suggestions
- Wrap the entire block between session setup and ready_future.set_result(None)—including load_tools, load_prompts, session.initialize (elif branch), and set_logging_level—in a try/except that calls ready_future.set_exception(exc) on failure, ensuring the Future is always resolved.
- When manually setting exc.cause, also set exc.suppress_context = True to match the behavior of
raise X from Yand produce cleaner tracebacks. - Add tests for connection failure paths: session creation failure, session initialization failure, and list_tools()/list_prompts() failure during connect. A test mocking list_tools() to raise would currently hang, confirming the missing error handling.
- The Event→Future conversion is the right approach—keep it. It correctly surfaces errors that were previously silently swallowed via the Event.
Automated review by moonbox3's agents
| @@ -344,7 +356,8 @@ async def _inner_connect(self, ready_event: asyncio.Event) -> None: | |||
| except Exception: | |||
| logger.warning("Failed to set log level to %s", logger.level) | |||
| # Setting up is complete, will now signal the main loop that we are ready | |||
There was a problem hiding this comment.
Deadlock bug: If load_tools(), load_prompts(), session.initialize() (line 332), or set_logging_level() raises an unhandled exception, the task exits but ready_future is never resolved—connect() hangs forever on await ready_future. Wrap this entire block in a try/except that resolves the future on failure, consistent with the earlier error paths in _inner_connect.
| try: | |
| if self.load_tools_flag: | |
| await self.load_tools() | |
| if self.load_prompts_flag: | |
| await self.load_prompts() | |
| if logger.level != logging.NOTSET: | |
| try: | |
| await self.session.set_logging_level( | |
| next(level for level, value in LOG_LEVEL_MAPPING.items() if value == logger.level) | |
| ) | |
| except Exception: | |
| logger.warning("Failed to set log level to %s", logger.level) | |
| # Setting up is complete, will now signal the main loop that we are ready | |
| if not ready_future.done(): | |
| ready_future.set_result(None) | |
| except Exception as ex: | |
| await self._exit_stack.aclose() | |
| if not ready_future.done(): | |
| ready_future.set_exception(ex) | |
| return |
| if not ready_future.done(): | ||
| exc = KernelPluginInvalidConfigurationError( | ||
| "Failed to connect to the MCP server. Please check your configuration." | ||
| ) |
There was a problem hiding this comment.
Unlike raise KernelPluginInvalidConfigurationError(...) from ex, manually setting __cause__ does not set __suppress_context__ = True. This means Python will print 'During handling of the above exception…' instead of 'The above exception was the direct cause of…'. Add exc.__suppress_context__ = True after setting __cause__.
| ) | |
| exc.__cause__ = ex | |
| exc.__suppress_context__ = True | |
| ready_future.set_exception(exc) |
Summary
Issue No : #13414
This PR fixes a deadlock in the Python MCP plugins where connection failures during background initialization could cause connect() to hang indefinitely.
Problem
connect() waits on a readiness future that was never resolved when _inner_connect() failed during session initialization (e.g. HTTP 401/403). The exception was trapped in the background task and never propagated.
Fix
Ensure _inner_connect() always resolves the readiness future with an exception when initialization fails, allowing the error to surface immediately and preventing hangs.
Scope
No changes to MCP protocol behavior
No changes to tool loading or normalization
Minimal diff, focused solely on error propagation
Notes
Some MCP unit tests are environment-dependent and may fail locally, including on main. This PR does not modify test behavior.