Skip to content

Python: Fix MCP plugin hang when initialization fails #13414#13437

Open
aryanyk wants to merge 3 commits intomicrosoft:mainfrom
aryanyk:fix-mcp-hang-on-init
Open

Python: Fix MCP plugin hang when initialization fails #13414#13437
aryanyk wants to merge 3 commits intomicrosoft:mainfrom
aryanyk:fix-mcp-hang-on-init

Conversation

@aryanyk
Copy link
Copy Markdown

@aryanyk aryanyk commented Jan 3, 2026

Summary
Issue No : #13414

This PR fixes a deadlock in the Python MCP plugins where connection failures during background initialization could cause connect() to hang indefinitely.

Problem

connect() waits on a readiness future that was never resolved when _inner_connect() failed during session initialization (e.g. HTTP 401/403). The exception was trapped in the background task and never propagated.

Fix

Ensure _inner_connect() always resolves the readiness future with an exception when initialization fails, allowing the error to surface immediately and preventing hangs.

Scope

No changes to MCP protocol behavior

No changes to tool loading or normalization

Minimal diff, focused solely on error propagation

Notes

Some MCP unit tests are environment-dependent and may fail locally, including on main. This PR does not modify test behavior.

@aryanyk aryanyk requested a review from a team as a code owner January 3, 2026 21:49
@moonbox3 moonbox3 added the python Pull requests for the Python Semantic Kernel label Jan 3, 2026
@github-actions github-actions bot changed the title Fix MCP plugin hang when initialization fails #13414 Python: Fix MCP plugin hang when initialization fails #13414 Jan 3, 2026
@aryanyk
Copy link
Copy Markdown
Author

aryanyk commented Jan 3, 2026

@microsoft-github-policy-service agree

@moonbox3
Copy link
Copy Markdown
Collaborator

moonbox3 commented Jan 7, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
connectors
   mcp.py42820751%75, 87, 95–102, 108–109, 114, 126–133, 141–145, 147–148, 150, 157–159, 248–250, 289–292, 295–297, 300–303, 306–308, 320–321, 324–325, 334–336, 348–349, 353–354, 356, 358, 360–366, 370–372, 374–384, 388–389, 393–394, 398, 400, 404–405, 409, 423, 438–446, 452–453, 456–457, 475–476, 479–480, 483–488, 492–493, 496–497, 500–506, 580, 582, 649, 651, 653, 655, 724, 726, 728, 730, 732, 790, 833–838, 840–843, 848, 901, 903, 906, 919, 939–941, 948–953, 956–959, 962–964, 968–972, 975–976, 985–986, 988–990, 1004–1005, 1007–1008, 1010–1012, 1015, 1022–1025, 1033, 1038–1043, 1048, 1050, 1053–1056
TOTAL28068481282% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
3787 23 💤 0 ❌ 0 🔥 1m 45s ⏱️

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 93%

✗ Correctness

The PR replaces asyncio.Event with asyncio.Future for signaling readiness between connect() and _inner_connect(), which correctly enables exception propagation to the caller. However, it introduces a deadlock: if load_tools(), load_prompts(), or session.initialize() in the elif branch throws an exception, the future is never resolved (no set_result or set_exception), causing connect() to hang forever on await ready_future. The old code was protected from this because load_tools/load_prompts had internal try/except that swallowed all exceptions, but this PR removes those guards without adding equivalent protection around the calls in _inner_connect. Additionally, removing the try/except in load_tools/load_prompts also affects the message_handler notification path (lines 462-465), where failures will now propagate as unhandled exceptions.

✗ Security Reliability

The PR replaces asyncio.Event with asyncio.Future for signaling readiness from the _inner_connect task, which is a sound improvement for propagating exceptions. However, it simultaneously removes the try/except guards around load_tools() and load_prompts(), which are called before ready_future.set_result(None) in _inner_connect. If either method raises, the exception propagates as an unhandled task exception, but the Future is never resolved — causing connect() to hang indefinitely awaiting the Future. This is a deadlock/hang bug introduced by the combination of these two changes.

✗ Test Coverage

This PR replaces asyncio.Event with asyncio.Future for signaling readiness in connect/inner_connect and removes silent exception swallowing in load_tools/load_prompts. The critical test coverage gap is that the behavior change in load_tools() and load_prompts() (from silently swallowing exceptions to propagating them) has no corresponding tests. Worse, testing this would reveal a latent bug: if load_tools() or load_prompts() raises, the exception propagates out of _inner_connect() without ever resolving the ready_future, causing connect() to hang indefinitely on await ready_future. The three guarded except blocks in _inner_connect correctly set_exception on the future, but exceptions from load_tools/load_prompts between those blocks and set_result are unhandled. Additionally, there are no tests for the session creation failure or session initialization failure paths (second and third except blocks in _inner_connect).

✗ Design Approach

The asyncio.Eventasyncio.Future refactor is a genuine and necessary bug fix: in the original code, exceptions raised inside _inner_connect (a background task) were set on the event and re-raised in the task, but await ready_event.wait() in connect() never surfaced the task's exception—meaning the except KernelPluginInvalidConfigurationError block in connect() was dead code and calers never saw connection errors. The Future-based approach correctly propagates exceptions. However, the removal of the try/except guards inside load_tools and load_prompts is too broad. These methods are called in two distinct contexts: (1) during initial connect, where propagating errors is desirable, and (2) from message_handler in response to live notifications/tools/list_changed / notifications/prompts/list_changed server notifications. Removing the guards means a transient list_tools() failure during a live notification will propagate uncaught from message_handler, which the MCP client library may not handle gracefully and which could silently break ongoing sessions. The correct fix is to propagate errors during initial connect (which the Future refactor already handles), while keeping resilient error handling in the message_handler call sites or inside the load methods themselves.

Flagged Issues

  • Deadlock: If load_tools(), load_prompts(), session.initialize() (elif branch), or set_logging_level() raises an exception inside _inner_connect, the ready_future is never resolved—neither set_result nor set_exception is called. This causes connect() to hang forever on await ready_future. The code between session setup and ready_future.set_result(None) needs a wrapping try/except that calls ready_future.set_exception(exc) on failure, consistent with the earlier error-handling blocks in _inner_connect.
  • Removing exception handling from load_tools() and load_prompts() is too broad. Both methods are called not only during initial connection but also as notification callbacks from message_handler (for notifications/tools/list_changed and notifications/prompts/list_changed). A transient list_tools()/list_prompts() failure during a live server notification will now propagate as an unhandled exception from message_handler, potentially crashing or degrading the session. The fix should add try/except guards at the message_handler call sites so the initial-connect path propagates errors via the Future while the notification path degrades gracefully.

Suggestions

  • Wrap the entire block between session setup and ready_future.set_result(None)—including load_tools, load_prompts, session.initialize (elif branch), and set_logging_level—in a try/except that calls ready_future.set_exception(exc) on failure, ensuring the Future is always resolved.
  • When manually setting exc.cause, also set exc.suppress_context = True to match the behavior of raise X from Y and produce cleaner tracebacks.
  • Add tests for connection failure paths: session creation failure, session initialization failure, and list_tools()/list_prompts() failure during connect. A test mocking list_tools() to raise would currently hang, confirming the missing error handling.
  • The Event→Future conversion is the right approach—keep it. It correctly surfaces errors that were previously silently swallowed via the Event.

Automated review by moonbox3's agents

Comment on lines 334 to 358
@@ -344,7 +356,8 @@ async def _inner_connect(self, ready_event: asyncio.Event) -> None:
except Exception:
logger.warning("Failed to set log level to %s", logger.level)
# Setting up is complete, will now signal the main loop that we are ready
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadlock bug: If load_tools(), load_prompts(), session.initialize() (line 332), or set_logging_level() raises an unhandled exception, the task exits but ready_future is never resolved—connect() hangs forever on await ready_future. Wrap this entire block in a try/except that resolves the future on failure, consistent with the earlier error paths in _inner_connect.

Suggested change
try:
if self.load_tools_flag:
await self.load_tools()
if self.load_prompts_flag:
await self.load_prompts()
if logger.level != logging.NOTSET:
try:
await self.session.set_logging_level(
next(level for level, value in LOG_LEVEL_MAPPING.items() if value == logger.level)
)
except Exception:
logger.warning("Failed to set log level to %s", logger.level)
# Setting up is complete, will now signal the main loop that we are ready
if not ready_future.done():
ready_future.set_result(None)
except Exception as ex:
await self._exit_stack.aclose()
if not ready_future.done():
ready_future.set_exception(ex)
return

if not ready_future.done():
exc = KernelPluginInvalidConfigurationError(
"Failed to connect to the MCP server. Please check your configuration."
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike raise KernelPluginInvalidConfigurationError(...) from ex, manually setting __cause__ does not set __suppress_context__ = True. This means Python will print 'During handling of the above exception…' instead of 'The above exception was the direct cause of…'. Add exc.__suppress_context__ = True after setting __cause__.

Suggested change
)
exc.__cause__ = ex
exc.__suppress_context__ = True
ready_future.set_exception(exc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python Pull requests for the Python Semantic Kernel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants