Skip to content

fix(opcua): prevent BadSessionIdInvalid race condition during session reconnection#2123

Open
palandri wants to merge 1 commit intothingsboard:masterfrom
palandri:fix/opcua-badsessionidinvalid-race-condition
Open

fix(opcua): prevent BadSessionIdInvalid race condition during session reconnection#2123
palandri wants to merge 1 commit intothingsboard:masterfrom
palandri:fix/opcua-badsessionidinvalid-race-condition

Conversation

@palandri
Copy link
Copy Markdown

@palandri palandri commented Apr 4, 2026

fix(opcua): prevent BadSessionIdInvalid race condition on slow servers

Summary

Fixes a race condition in the OPC-UA connector that caused BadSessionIdInvalid errors when reconnecting to slow servers:

  • disconnect_if_connected() now sets __connected = False before awaiting the actual disconnect, preventing the monitor loop from observing a stale True state.
  • _monitor_server_loop skips the server_state.read_value() heartbeat call whenever __connected is False.

Motivation

On slow (or misconfigured) OPC-UA servers, the session timeout might be long (e.g. 1200 s in our case, OPC-UA server not configured by us). The watchdog interval is derived from it (session_timeout / 2), so _monitor_server_loop sleeps for tens of seconds between heartbeats.

When the main loop or an error handler called disconnect_if_connected(), the asyncio scheduler yielded control during await self.__client.disconnect() — but __connected was only set to False later (in the finally block of start_client, and only when __stopped was True). If _monitor_server_loop woke up in that window, it saw __connected = True and attempted read_value() on an already-closed session, producing BadSessionIdInvalid.

Implementation Details

Root cause fix — disconnect_if_connected()

async def disconnect_if_connected(self):
    if self.__connected:
+       self.__connected = False   # atomic in asyncio: no await between check and set
        try:
            ...
            await self.__client.disconnect()

Because asyncio is cooperative, there is no await between the guard check and the flag mutation — the assignment is effectively atomic. Any coroutine that wakes up during await disconnect() will already observe __connected = False.

Defensive guard — _monitor_server_loop()

while not self.__client._closing and not self.__stopped:
    await asyncio.sleep(timeout)
+   if not self.__connected:
+       continue
    _ = await self.__client.nodes.server_state.read_value()

A second layer of protection: even if __connected were not zeroed through disconnect_if_connected() (e.g. set directly by the monitor's own exception handlers), the heartbeat read is unconditionally skipped when not connected.

Behavior Change

Scenario Before After
Monitor wakes up during disconnect Calls read_value()BadSessionIdInvalid Sees __connected = False, skips
disconnect_if_connected() called twice Second call disconnects again Second call is a no-op (idempotent)
Monitor in pending-reconnect state Keeps polling Sleeps idle until reconnected

Backward Compatibility

No breaking changes. The __connected flag semantics are preserved — it is set back to True only after a successful connect() in start_client().

Touched Code

Tests / Verification

  • Existing unit tests: tests/unit/connectors/opcua/35 tests, all passing with this fix applied.
  • Validated against packet capture showing the error eliminated on a slow server with sessionTimeoutInMillis: 120000.
  • Validated in production server (Siemens S7-1200) with 1200s of Max. Session Timeout.

Checklist

  • Root cause identified and fixed (disconnect_if_connected flag order)
  • Defensive guard added in _monitor_server_loop
  • disconnect_if_connected is now idempotent
  • No breaking changes to existing behavior
  • Existing unit tests pass (35/35)
  • Validation on slow server in production environment (Siemens S7-1200)

Move `__connected` flag clearing into `disconnect_if_connected()` to ensure
connection state is properly synchronized. Add guard at monitor loop entry to
skip server reads when not connected, preventing stale data access during
reconnection sequences.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant