Python 3.14 compatibility + EZSP-over-TCP stability fixes #720

Open
silenthooligan wants to merge 2 commits into zigpy:dev from silenthooligan:fix/python-3.14-ezsp-tcp

@silenthooligan

Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14 and fix several long-standing rough edges with TCP-bridged serial radios (ser2net, ESPHome stream_server / serial_proxy).

Supersedes #711, which was closed unmerged by its author with no review. This PR rebases that work onto current dev, drops the diagnostic-only LOGGER.warning calls #711 carried alongside the fixes, restores the post-#714 NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2, adds the bellows/cli/util.py get_event_loop fix, and is validated end-to-end against real hardware (see below).

Python 3.14 compatibility

  • bellows/thread.py: asyncio.iscoroutinefunction → inspect.iscoroutinefunction. Same semantics, drops the deprecation warning.
  • bellows/thread.py, bellows/uart.py: replace four asyncio.get_event_loop() calls (which on 3.14 raise RuntimeError when no current event loop is set) with asyncio.get_running_loop(). All four sites are reached from async def paths, so a running loop is always available.
  • bellows/cli/util.py: background() decorator wrapped CLI commands with asyncio.get_event_loop().run_until_complete(...), which raises RuntimeError: no current event loop on 3.14 from a fresh sync entry point. Replaced with asyncio.run(...).
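
As a rough illustration of the `background()` change, the sketch below wraps an async command so it can be called from a synchronous entry point. It is not the bellows implementation: the `probe` command and its return string are hypothetical, and the real decorator may do more than this.

```python
import asyncio
import functools
import inspect


def background(f):
    """Run an async CLI command to completion from a synchronous entry point."""
    # inspect.iscoroutinefunction replaces the deprecated asyncio variant.
    if not inspect.iscoroutinefunction(f):
        raise TypeError("background() expects an async def function")

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        # asyncio.run() creates a fresh loop, runs the coroutine, then closes the loop.
        return asyncio.run(f(*args, **kwargs))

    return wrapper


@background
async def probe(port):  # hypothetical command, for illustration only
    # Safe here: this body only ever executes inside a running loop.
    loop = asyncio.get_running_loop()
    await asyncio.sleep(0)
    state = "running" if loop.is_running() else "stopped"
    return f"probed {port} on a {state} loop"
```

Calling `probe("tcp://host:6638")` from plain synchronous code then needs no pre-existing event loop, which is the situation a CLI entry point is in.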

ThreadsafeProxy: surface closed-loop as ConnectionError

When the secondary event-loop thread is gone, ThreadsafeProxy used to log a warning and return None, which propagated as TypeError: 'NoneType' object can't be awaited and (on 3.14) left ZHA in an indefinite retry loop. Raise ConnectionError instead so callers can handle the failure cleanly. Also guards against the loop being closed between the is_closed() check and the run_coroutine_threadsafe() dispatch.
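
The behaviour described above can be sketched roughly as follows. `LoopProxy` and `Radio` are hypothetical, heavily simplified stand-ins rather than the actual `ThreadsafeProxy` code; only the two failure paths (loop already closed, loop closed mid-dispatch) are meant to mirror the PR.

```python
import asyncio
import inspect


class LoopProxy:
    """Hypothetical, simplified stand-in for ThreadsafeProxy (not the bellows code)."""

    def __init__(self, obj, loop):
        self._obj = obj
        self._loop = loop  # event loop owned by the secondary thread

    def __getattr__(self, name):
        func = getattr(self._obj, name)
        if not inspect.iscoroutinefunction(func):
            return func

        def dispatch(*args, **kwargs):
            if self._loop.is_closed():
                # Old behaviour: log a warning and return None.
                raise ConnectionError("Attempted to use a closed event loop")
            coro = func(*args, **kwargs)
            try:
                future = asyncio.run_coroutine_threadsafe(coro, self._loop)
            except RuntimeError as exc:
                # The loop was closed between the check and the dispatch.
                coro.close()  # avoids a "coroutine was never awaited" warning
                raise ConnectionError("Event loop closed during dispatch") from exc
            # Reached only on success; must run inside the caller's running loop.
            return asyncio.wrap_future(future)

        return dispatch


class Radio:  # hypothetical target object living on the secondary loop
    async def ping(self):
        return "pong"
```

A caller can then treat `ConnectionError` as the signal to rebuild the connection, instead of hitting a `TypeError` from awaiting `None`.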

EZSP startup race over TCP

Reset frames from the NCP can arrive faster than wait_for_startup_reset() is reached when the transport is TCP. The first reset gets dropped, enter_failed_state() fires, and the integration never recovers without a manual restart.

  • bellows/uart.py::_connect: pre-create gateway._startup_reset_future before opening the connection so any reset frame received during setup is caught by reset_received().
  • bellows/uart.py::wait_for_startup_reset: replace assert is None with if is None so the pre-created future isn't rejected.
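
A minimal sketch of the race and the fix, assuming a trimmed-down gateway with only the three pieces named above. The `Gateway` class, the `connect_and_wait` helper and the `"RESET_SOFTWARE"` value are illustrative assumptions, not the bellows code.

```python
import asyncio


class Gateway:
    """Hypothetical gateway showing only the startup-reset handshake."""

    def __init__(self):
        self._startup_reset_future = None

    def reset_received(self, code):
        # Called by the transport when the NCP sends a reset frame.
        fut = self._startup_reset_future
        if fut is not None and not fut.done():
            fut.set_result(code)
            return True
        return False  # nobody is waiting, so the frame counts as unexpected

    async def wait_for_startup_reset(self):
        # "if is None" rather than "assert is None": reuse a pre-created future.
        if self._startup_reset_future is None:
            self._startup_reset_future = asyncio.get_running_loop().create_future()
        try:
            return await self._startup_reset_future
        finally:
            self._startup_reset_future = None


async def connect_and_wait(early_reset):
    loop = asyncio.get_running_loop()
    gw = Gateway()
    # Pre-create the future before the connection is opened.
    gw._startup_reset_future = loop.create_future()
    if early_reset:
        gw.reset_received("RESET_SOFTWARE")  # frame arrives before we wait
    else:
        loop.call_later(0.01, gw.reset_received, "RESET_SOFTWARE")
    return await asyncio.wait_for(gw.wait_for_startup_reset(), timeout=1)
```

With the future created up front, an early reset frame resolves it immediately and the later `wait_for_startup_reset()` call returns without waiting; without it, `reset_received()` would report the frame as unexpected.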

TCP serial-bridge connection lifecycle

  • bellows/ash.py::eof_received: return True to suppress the transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHome stream_server / serial_proxy) sometimes signal EOF during initialization without intending to close the socket; the default auto-close orphans the connection.
  • bellows/ezsp/__init__.py::_startup_reset: raise a clear EzspError when self._gw is None.
  • bellows/ezsp/__init__.py::disconnect: tolerate _gw is None, and on ConnectionError from the gateway force-close the underlying TCP socket so ser2net/stream_server releases the port for subsequent attempts.
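
For context on the `eof_received` change: in asyncio, a protocol whose `eof_received()` returns a false value (including `None`, the base-class default) lets the transport close itself, while a true value leaves that decision to the protocol. The class below is a hypothetical sketch of that contract, not the bellows ASH protocol.

```python
import asyncio


class BridgeTolerantProtocol(asyncio.Protocol):
    """Hypothetical protocol that keeps the socket open after the peer signals EOF."""

    def __init__(self):
        self.transport = None
        self.eof_seen = False

    def connection_made(self, transport):
        self.transport = transport

    def eof_received(self):
        # Record the EOF for diagnostics, then return a true value so asyncio
        # does not close the transport automatically.
        self.eof_seen = True
        return True
```

Note that, per the asyncio documentation, transports that do not support half-closed connections (SSL, for example) close the connection regardless of the return value, so this only helps on plain TCP transports.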

Test plan

  • All 112 existing tests pass on Python 3.14.2.
  • tests/test_thread.py::test_proxy_loop_closed now asserts ConnectionError (was: silently returns).
  • tests/test_ezsp.py::test_startup_reset_gw_none covers the new null-gateway guard.
  • tests/test_ezsp.py::test_disconnect_gw_none covers tolerance of null gateway in disconnect().

End-to-end validation against real hardware

Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP) behind an ESPHome stream_server raw-TCP bridge on Python 3.14.2:

  1. bellows.ezsp.EZSP.connect() + startup_reset() complete cleanly. EZSP_VERSION = 13 negotiated. get_board_info() returns NCP identity (Nabu Casa, Connect ZBT).
  2. ZHA boots fresh, forms a network on channel 25, pairs five IAS-Zone water-leak sensors (Samjin model water).
  3. After OTA-ing the dongle to ESPHome serial_proxy (encrypted native API on :6053 only) and reconfiguring ZHA to esphome-hass://esphome/<entry_id>?port_name=..., the existing zigbee.db is reused via setup_strategy_advanced -> reuse_settings. All five sensors auto-rejoin without re-pair, attribute reports flow (battery, temperature, water-leak state).

Without this patch, step 1 fails immediately on Python 3.14 with 'NoneType' object can't be awaited (closed event loop) or AttributeError (get_event_loop no current loop), depending on which call path fires first. Step 3 also exercises the EOF-during-init suppression: the encrypted native-API tunnel issued an EOF during the noise-protocol handshake, and the eof_received -> return True change keeps the connection open instead of letting the transport auto-close.

Provenance

The bulk of the runtime fixes (and their tests) were originally drafted by @aautem in #711. Credited via Co-Authored-By trailer in the commit.

🤖 Validation report and PR drafted with Claude Code.

Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14
and fix several long-standing rough edges with TCP-bridged serial radios
(ser2net, ESPHome stream_server / serial_proxy).

## Python 3.14 compatibility

- `bellows/thread.py`: `asyncio.iscoroutinefunction` is deprecated as of
  3.14 and slated for removal in 3.16; use `inspect.iscoroutinefunction`
  instead. Same semantics, no deprecation warning.
- `bellows/thread.py`, `bellows/uart.py`: replace four call sites of
  `asyncio.get_event_loop()` (deprecated outside a running loop) with
  `asyncio.get_running_loop()`. All four sites are reached from `async
  def` paths so they have a running loop.
- `bellows/cli/util.py`: `background()` decorator wrapped CLI commands
  with `asyncio.get_event_loop().run_until_complete(...)`, which raises
  `RuntimeError: no current event loop` on 3.14 from a fresh sync entry
  point. Replaced with `asyncio.run(...)` which is the documented modern
  equivalent.

## ThreadsafeProxy: surface closed-loop as ConnectionError

When the secondary event-loop thread is gone, `ThreadsafeProxy` used to
log a warning and return `None`, which propagated as
`TypeError: 'NoneType' object can't be awaited` at the call site —
opaque, and on Python 3.14 it left the integration in an indefinite
retry loop without recovery. Raise `ConnectionError` instead so callers
(EZSP startup_reset, retries) can handle the failure cleanly. Also
guards against the loop being closed between the `is_closed()` check
and `run_coroutine_threadsafe()` dispatch.

## EZSP startup race over TCP

Reset frames from the NCP can arrive on the wire faster than
`wait_for_startup_reset()` is reached when the underlying transport is
TCP. The first reset gets dropped, `enter_failed_state()` fires, and the
integration never recovers without a manual restart.

- `bellows/uart.py::_connect`: pre-create `gateway._startup_reset_future`
  before opening the connection so any reset frame received during
  setup is caught by `reset_received()` instead of being treated as
  spurious.
- `bellows/uart.py::wait_for_startup_reset`: replace the
  `assert is None` with `if is None` so the pre-created future isn't
  rejected.

## TCP serial-bridge connection lifecycle

- `bellows/ash.py::eof_received`: return `True` to suppress the
  transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHome
  stream_server / serial_proxy) sometimes signal EOF during
  initialization handshakes without intending to close the socket; the
  default auto-close orphans the connection.
- `bellows/ezsp/__init__.py::_startup_reset`: raise a clear `EzspError`
  when `self._gw` is `None` instead of failing with an opaque
  `AttributeError`.
- `bellows/ezsp/__init__.py::disconnect`: tolerate `_gw is None`, and on
  `ConnectionError` from the gateway's disconnect (secondary loop dead)
  force-close the underlying TCP socket so ser2net/stream_server
  releases the port for subsequent attempts.

## Tests

- `tests/test_thread.py::test_proxy_loop_closed`: now asserts
  `ConnectionError` is raised (was: silently returns).
- `tests/test_ezsp.py::test_startup_reset_gw_none`: covers the new
  null-gateway guard.
- `tests/test_ezsp.py::test_disconnect_gw_none`: covers tolerance of
  null gateway in `disconnect()`.

## Validation

Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and
exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP)
behind an ESPHome `stream_server` raw-TCP bridge on Python 3.14.2:
- `bellows.ezsp.EZSP.connect()` + `startup_reset()` complete; protocol
  version 13 read; `get_board_info()` returns NCP identity.
- ZHA boots, forms a fresh network, pairs 5 IAS-Zone water-leak sensors.
- After OTA-ing the dongle from `stream_server` to ESPHome `serial_proxy`
  (encrypted native API), reconfiguring ZHA to
  `esphome-hass://esphome/<entry_id>?port_name=<port>` reuses the
  existing zigbee.db; all 5 sensors auto-rejoin without re-pair.

The full upstream test suite (112 tests) passes.

## Provenance

The bulk of the runtime fixes (and their tests) were originally drafted
by @aautem in zigpy#711 (closed by author with no review). This PR rebases
that work onto current `dev`, drops the diagnostic-only `LOGGER.warning`
calls that zigpy#711 carried alongside the fixes, restores the post-zigpy#714
`NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2`, and adds the
`bellows/cli/util.py` get_event_loop fix and the EZSP-over-TCP
end-to-end validation against real hardware.

Co-Authored-By: Alex Autem <autem.alex@gmail.com>
@codecov

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.54%. Comparing base (4b97a6d) to head (3f34b5c).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #720   +/-   ##
=======================================
  Coverage   99.54%   99.54%           
=======================================
  Files          61       61           
  Lines        4147     4166   +19     
=======================================
+ Hits         4128     4147   +19     
  Misses         19       19           


silenthooligan pushed a commit to silenthooligan/code-sharing that referenced this pull request May 5, 2026
The ZBT-2 Zigbee YAML now ships ESPHome `serial_proxy` (encrypted
native API on :6053) as the default. ZHA reaches the radio via
  esphome-hass://esphome/<entry_id>?port_name=MG24%20Zigbee%20NCP
on HA 2026.5+. The previous `stream_server` raw-TCP variant moves to
zbt-2-zigbee/legacy-stream-server/ as the pre-2026.5 / pre-bellows-fix
fallback.

Status & dependencies are documented in zbt-2-zigbee/README.md and the
top-level README. Two upstream gates: HA >= 2026.5 (for the
`esphome-hass://` URL handler in homeassistant/components/esphome/
serial_proxy.py) and zigpy/bellows#720 (Python 3.14 + EZSP-over-TCP
runtime fixes; until merged, vendor patched bellows into the HA image).

Validated against HA 2026.5.0b2 + the patched bellows fork. EZSP
startup completes, ZHA pairs IAS-Zone end devices and routers,
attribute reports flow.

Migration walkthrough (existing stream_server -> serial_proxy without
re-pairing devices) is in zbt-2-zigbee/README.md. Key gotcha: ZHA
doesn't support reconfiguring the radio path; you must delete + recreate
using "Advanced -> Reuse settings" so the existing zigbee.db is honored.

Closes #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three focused tests for the new code paths introduced in this PR:

- `tests/test_thread.py::test_proxy_coroutine_loop_closed_mid_dispatch`
  exercises the `RuntimeError` catch around `run_coroutine_threadsafe()`
  when the loop is closed between the `is_closed()` check and dispatch:
  asserts ConnectionError is raised and the orphaned coroutine is closed
  (no un-awaited-coroutine warning).
- `tests/test_ezsp.py::test_disconnect_force_closes_socket_on_connection_error`
  exercises the `disconnect()` path that, on `ConnectionError` from the
  gateway, walks the proxy/transport chain to force-close the underlying
  TCP socket so ser2net / stream_server / serial_proxy releases the
  port.
- `tests/test_ezsp.py::test_disconnect_socket_force_close_swallows_exceptions`
  exercises the inner `except Exception: pass` fallback when the
  proxy/transport chain isn't fully populated (e.g. `_obj` has no
  `_transport` attribute), confirming `_gw` is still cleared.

Lifts patch coverage from 63.6% to 100% on `bellows/thread.py` and
brings the EZSP changes within the project's 99.25% gate.
@puddly
Contributor

puddly commented May 5, 2026

Can you attach some debug logs of the issue you're having and how this PR solves it? ZHA is used with a ton of TCP coordinators and we haven't really had reports of issues like this.

@silenthooligan
Author

silenthooligan commented May 5, 2026

Live logs against a Nabu Casa Connect ZBT-2 (EFR32MG24, ESPHome stream_server raw-TCP bridge on 192.168.40.16:6638) from inside a fresh python:3.14-slim container, comparing stock 0.49.1 vs this PR.

Full bellows DEBUG output for all three runs + the test scripts are in this gist:
https://gist.github.com/silenthooligan/6d1399b94ebad2716a2e0b39a1809c7f

Summary:

01: stock 0.49.1, first connect (no teardown)
EZSP startup completes cleanly, board info reads. The recent #714 startup-wait extension dodges the original startup race on first connect, so the bug doesn't reproduce here on a clean boot. It surfaces on reconnect cycles instead, see runs 02/03.

02: stock 0.49.1, secondary loop teardown then dispatch

>>> Phase 1: connect + startup
>>> connected, EZSP version 13
>>> Phase 2: closing the secondary loop
>>> secondary loop running=False closed=True
>>> Phase 3: dispatch a command via the proxy
WARNING [bellows.thread] Attempted to use a closed event loop
>>> dispatch FAILED: TypeError: 'NoneType' object can't be awaited
>>> Phase 4: clean disconnect
WARNING [bellows.thread] Attempted to use a closed event loop
>>> disconnect FAILED: TypeError: 'NoneType' object can't be awaited

Caller has no signal that the connection died (just a TypeError on None), so the integration's retry path can't tell the difference between a transient command failure and a dead transport. ZHA loops without ever rebuilding the gateway.

03: patched bellows (this PR), same teardown

>>> Phase 1: connect + startup
>>> connected, EZSP version 13
>>> Phase 2: closing the secondary loop
>>> secondary loop running=False closed=True
>>> Phase 3: dispatch a command via the proxy
>>> dispatch FAILED: ConnectionError: Attempted to use a closed event loop, the connection may have been lost
>>> Phase 4: clean disconnect
>>> disconnect OK

ConnectionError is raised on dispatch (catchable, signals "rebuild the connection"), and disconnect() succeeds because the PR's fallback force-closes the underlying TCP socket when the gateway can't be reached.
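
The disconnect fallback can be sketched roughly like this. All classes here are hypothetical test doubles, and the `_obj._transport` attribute chain is an assumption based on the test description earlier in this thread, not a verified view of the bellows internals.

```python
import asyncio


class FakeTransport:
    """Hypothetical transport that only records whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class DeadGateway:
    """Hypothetical gateway proxy whose secondary loop is already closed."""

    def __init__(self, transport=None):
        if transport is not None:
            # Assumed chain: proxy -> wrapped object -> transport.
            self._obj = type("Inner", (), {"_transport": transport})()

    async def disconnect(self):
        raise ConnectionError("Attempted to use a closed event loop")


class Ezsp:
    def __init__(self, gw):
        self._gw = gw

    async def disconnect(self):
        if self._gw is None:
            return  # tolerate a gateway that was never set up
        try:
            await self._gw.disconnect()
        except ConnectionError:
            try:
                # Best effort: close the socket so the bridge releases the port.
                self._gw._obj._transport.close()
            except Exception:
                pass  # chain not fully populated; nothing more to do
        finally:
            self._gw = None
```

In this sketch `_gw` is cleared on every path, so a later reconnect attempt starts from a clean state whether or not the socket could be closed.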

Three conditions for the bug to fire in production: closed secondary loop, caller mid-dispatch, Python 3.14's stricter await None. Stable single-boot ConBee/SkyConnect setups don't see it. Setups that cycle (HA restarts, host reboots, WiFi-bridged dongles flapping) do.

Other bits in this PR (uart.py::wait_for_startup_reset race pre-create, ash.py::eof_received -> True for serial-over-TCP bridges that signal EOF mid-handshake) are TCP-only and don't surface on the first-connect run because #714 already extended the startup wait. They show up on reconnect attempts against ESPHome serial_proxy (encrypted native API) where the noise handshake itself produces an EOF that auto-closes the socket on stock.

@puddly
Contributor

puddly commented May 5, 2026

Can you attach a log from the live device that you're communicating with?
