Python: Fix MCP plugin hang when initialization fails #13414 by aryanyk · Pull Request #13437 · microsoft/semantic-kernel

aryanyk · 2026-01-03T21:49:50Z

Summary
Issue No : #13414

This PR fixes a deadlock in the Python MCP plugins where connection failures during background initialization could cause connect() to hang indefinitely.

Problem

connect() waits on a readiness future that was never resolved when _inner_connect() failed during session initialization (e.g. HTTP 401/403). The exception was trapped in the background task and never propagated.

Fix

Ensure _inner_connect() always resolves the readiness future with an exception when initialization fails, allowing the error to surface immediately and preventing hangs.

Scope

No changes to MCP protocol behavior

No changes to tool loading or normalization

Minimal diff, focused solely on error propagation

Notes

Some MCP unit tests are environment-dependent and may fail locally, including on main. This PR does not modify test behavior.

aryanyk · 2026-01-03T21:53:19Z

@microsoft-github-policy-service agree

moonbox3 · 2026-01-07T10:29:20Z

Python Test Coverage Report •

File	Stmts	Miss	Cover	Missing
connectors
mcp.py	428	207	51%	75, 87, 95–102, 108–109, 114, 126–133, 141–145, 147–148, 150, 157–159, 248–250, 289–292, 295–297, 300–303, 306–308, 320–321, 324–325, 334–336, 348–349, 353–354, 356, 358, 360–366, 370–372, 374–384, 388–389, 393–394, 398, 400, 404–405, 409, 423, 438–446, 452–453, 456–457, 475–476, 479–480, 483–488, 492–493, 496–497, 500–506, 580, 582, 649, 651, 653, 655, 724, 726, 728, 730, 732, 790, 833–838, 840–843, 848, 901, 903, 906, 919, 939–941, 948–953, 956–959, 962–964, 968–972, 975–976, 985–986, 988–990, 1004–1005, 1007–1008, 1010–1012, 1015, 1022–1025, 1033, 1038–1043, 1048, 1050, 1053–1056
TOTAL	28068	4812	82%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
3787	23 💤	0 ❌	0 🔥	1m 45s ⏱️

github-actions

Automated Code Review

Reviewers: 4 | Confidence: 93%

✗ Correctness

The PR replaces asyncio.Event with asyncio.Future for signaling readiness between connect() and _inner_connect(), which correctly enables exception propagation to the caller. However, it introduces a deadlock: if load_tools(), load_prompts(), or session.initialize() in the elif branch throws an exception, the future is never resolved (no set_result or set_exception), causing connect() to hang forever on await ready_future. The old code was protected from this because load_tools/load_prompts had internal try/except that swallowed all exceptions, but this PR removes those guards without adding equivalent protection around the calls in _inner_connect. Additionally, removing the try/except in load_tools/load_prompts also affects the message_handler notification path (lines 462-465), where failures will now propagate as unhandled exceptions.

✗ Security Reliability

The PR replaces asyncio.Event with asyncio.Future for signaling readiness from the _inner_connect task, which is a sound improvement for propagating exceptions. However, it simultaneously removes the try/except guards around load_tools() and load_prompts(), which are called before ready_future.set_result(None) in _inner_connect. If either method raises, the exception propagates as an unhandled task exception, but the Future is never resolved — causing connect() to hang indefinitely awaiting the Future. This is a deadlock/hang bug introduced by the combination of these two changes.

✗ Test Coverage

This PR replaces asyncio.Event with asyncio.Future for signaling readiness in connect/inner_connect and removes silent exception swallowing in load_tools/load_prompts. The critical test coverage gap is that the behavior change in load_tools() and load_prompts() (from silently swallowing exceptions to propagating them) has no corresponding tests. Worse, testing this would reveal a latent bug: if load_tools() or load_prompts() raises, the exception propagates out of _inner_connect() without ever resolving the ready_future, causing connect() to hang indefinitely on await ready_future. The three guarded except blocks in _inner_connect correctly set_exception on the future, but exceptions from load_tools/load_prompts between those blocks and set_result are unhandled. Additionally, there are no tests for the session creation failure or session initialization failure paths (second and third except blocks in _inner_connect).

✗ Design Approach

The asyncio.Event → asyncio.Future refactor is a genuine and necessary bug fix: in the original code, exceptions raised inside _inner_connect (a background task) were set on the event and re-raised in the task, but await ready_event.wait() in connect() never surfaced the task's exception—meaning the except KernelPluginInvalidConfigurationError block in connect() was dead code and calers never saw connection errors. The Future-based approach correctly propagates exceptions. However, the removal of the try/except guards inside load_tools and load_prompts is too broad. These methods are called in two distinct contexts: (1) during initial connect, where propagating errors is desirable, and (2) from message_handler in response to live notifications/tools/list_changed / notifications/prompts/list_changed server notifications. Removing the guards means a transient list_tools() failure during a live notification will propagate uncaught from message_handler, which the MCP client library may not handle gracefully and which could silently break ongoing sessions. The correct fix is to propagate errors during initial connect (which the Future refactor already handles), while keeping resilient error handling in the message_handler call sites or inside the load methods themselves.

Flagged Issues

Deadlock: If load_tools(), load_prompts(), session.initialize() (elif branch), or set_logging_level() raises an exception inside _inner_connect, the ready_future is never resolved—neither set_result nor set_exception is called. This causes connect() to hang forever on await ready_future. The code between session setup and ready_future.set_result(None) needs a wrapping try/except that calls ready_future.set_exception(exc) on failure, consistent with the earlier error-handling blocks in _inner_connect.
Removing exception handling from load_tools() and load_prompts() is too broad. Both methods are called not only during initial connection but also as notification callbacks from message_handler (for notifications/tools/list_changed and notifications/prompts/list_changed). A transient list_tools()/list_prompts() failure during a live server notification will now propagate as an unhandled exception from message_handler, potentially crashing or degrading the session. The fix should add try/except guards at the message_handler call sites so the initial-connect path propagates errors via the Future while the notification path degrades gracefully.

Suggestions

Wrap the entire block between session setup and ready_future.set_result(None)—including load_tools, load_prompts, session.initialize (elif branch), and set_logging_level—in a try/except that calls ready_future.set_exception(exc) on failure, ensuring the Future is always resolved.
When manually setting exc.cause, also set exc.suppress_context = True to match the behavior of raise X from Y and produce cleaner tracebacks.
Add tests for connection failure paths: session creation failure, session initialization failure, and list_tools()/list_prompts() failure during connect. A test mocking list_tools() to raise would currently hang, confirming the missing error handling.
The Event→Future conversion is the right approach—keep it. It correctly surfaces errors that were previously silently swallowed via the Event.

Automated review by moonbox3's agents

github-actions · 2026-04-08T00:48:47Z

python/semantic_kernel/connectors/mcp.py

@@ -344,7 +356,8 @@ async def _inner_connect(self, ready_event: asyncio.Event) -> None:
            except Exception:
                logger.warning("Failed to set log level to %s", logger.level)
        # Setting up is complete, will now signal the main loop that we are ready


Deadlock bug: If load_tools(), load_prompts(), session.initialize() (line 332), or set_logging_level() raises an unhandled exception, the task exits but ready_future is never resolved—connect() hangs forever on await ready_future. Wrap this entire block in a try/except that resolves the future on failure, consistent with the earlier error paths in _inner_connect.

Suggested change

try:

if self.load_tools_flag:

await self.load_tools()

if self.load_prompts_flag:

await self.load_prompts()

if logger.level != logging.NOTSET:

try:

await self.session.set_logging_level(

next(level for level, value in LOG_LEVEL_MAPPING.items() if value == logger.level)

)

except Exception:

logger.warning("Failed to set log level to %s", logger.level)

# Setting up is complete, will now signal the main loop that we are ready

if not ready_future.done():

ready_future.set_result(None)

except Exception as ex:

await self._exit_stack.aclose()

if not ready_future.done():

ready_future.set_exception(ex)

return

github-actions · 2026-04-08T00:48:47Z

python/semantic_kernel/connectors/mcp.py

+                if not ready_future.done():
+                    exc = KernelPluginInvalidConfigurationError(
+                        "Failed to connect to the MCP server. Please check your configuration."
+                    )


Unlike raise KernelPluginInvalidConfigurationError(...) from ex, manually setting __cause__ does not set __suppress_context__ = True. This means Python will print 'During handling of the above exception…' instead of 'The above exception was the direct cause of…'. Add exc.__suppress_context__ = True after setting __cause__.

Suggested change

)

exc.__cause__ = ex

exc.__suppress_context__ = True

ready_future.set_exception(exc)

Fix MCP plugin hang when initialization fails

93b77d3

aryanyk requested a review from a team as a code owner January 3, 2026 21:49

moonbox3 added the python Pull requests for the Python Semantic Kernel label Jan 3, 2026

github-actions bot changed the title ~~Fix MCP plugin hang when initialization fails #13414~~ Python: Fix MCP plugin hang when initialization fails #13414 Jan 3, 2026

Merge branch 'main' into fix-mcp-hang-on-init

96b3a6a

github-actions bot reviewed Apr 8, 2026

View reviewed changes

Merge branch 'main' into fix-mcp-hang-on-init

13115ae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python: Fix MCP plugin hang when initialization fails #13414#13437

Python: Fix MCP plugin hang when initialization fails #13414#13437
aryanyk wants to merge 3 commits intomicrosoft:mainfrom
aryanyk:fix-mcp-hang-on-init

aryanyk commented Jan 3, 2026

Uh oh!

aryanyk commented Jan 3, 2026

Uh oh!

moonbox3 commented Jan 7, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot Apr 8, 2026

Uh oh!

github-actions bot Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

+        try:
+            if self.load_tools_flag:
+                await self.load_tools()
+            if self.load_prompts_flag:
+                await self.load_prompts()
+            if logger.level != logging.NOTSET:
+                try:
+                    await self.session.set_logging_level(
+                        next(level for level, value in LOG_LEVEL_MAPPING.items() if value == logger.level)
+                    )
+                except Exception:
+                    logger.warning("Failed to set log level to %s", logger.level)
+            # Setting up is complete, will now signal the main loop that we are ready
+            if not ready_future.done():
+                ready_future.set_result(None)
+        except Exception as ex:
+            await self._exit_stack.aclose()
+            if not ready_future.done():
+                ready_future.set_exception(ex)
+            return

-                    )
+                exc.__cause__ = ex
+                    exc.__suppress_context__ = True
+                    ready_future.set_exception(exc)

Conversation

aryanyk commented Jan 3, 2026

Uh oh!

aryanyk commented Jan 3, 2026

Uh oh!

moonbox3 commented Jan 7, 2026

Python Unit Test Overview

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Automated Code Review

✗ Correctness

✗ Security Reliability

✗ Test Coverage

✗ Design Approach

Flagged Issues

Suggestions

Uh oh!

github-actions bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants