14 changes: 13 additions & 1 deletion livekit-agents/livekit/agents/stt/stt.py
@@ -7,7 +7,10 @@
from dataclasses import dataclass, field
from enum import Enum, unique
from types import TracebackType
from typing import Any, Generic, Literal, TypeVar
from typing import TYPE_CHECKING, Any, Generic, Literal, TypeVar

if TYPE_CHECKING:
    from .. import vad as _vad

from pydantic import BaseModel, ConfigDict, Field

@@ -277,6 +280,15 @@ def prewarm(self) -> None:
    """Pre-warm connection to the STT service"""
    pass

def on_vad_event(self, ev: _vad.VADEvent) -> None:
    """Receive VAD events from the session-level VAD, when one is attached.

    Default implementation is a no-op. Plugins may override this to react to
    external VAD signals — for example, to call `finalize()` on END_OF_SPEECH
    when running in an externally-driven turn detection mode.
    """
    pass


class RecognizeStream(ABC):
    class _FlushSentinel:
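The new `on_vad_event` hook above lets a plugin react to session-level VAD signals. A minimal sketch of how a plugin might override it, using stand-in types that mirror the interfaces in the diff (none of this is the actual livekit-agents API):

```python
from enum import Enum, unique


# Stand-in types mirroring the diff above; all names are illustrative.
@unique
class VADEventType(str, Enum):
    START_OF_SPEECH = "start_of_speech"
    END_OF_SPEECH = "end_of_speech"


class VADEvent:
    def __init__(self, type: VADEventType) -> None:
        self.type = type


class STT:
    def on_vad_event(self, ev: VADEvent) -> None:
        # default implementation is a no-op, as in the diff
        pass


class FinalizingSTT(STT):
    """Hypothetical plugin that finalizes its stream on END_OF_SPEECH."""

    def __init__(self) -> None:
        self.finalized = False

    def on_vad_event(self, ev: VADEvent) -> None:
        if ev.type == VADEventType.END_OF_SPEECH:
            self.finalized = True  # stands in for calling finalize()


stt_inst = FinalizingSTT()
stt_inst.on_vad_event(VADEvent(VADEventType.START_OF_SPEECH))
assert not stt_inst.finalized
stt_inst.on_vad_event(VADEvent(VADEventType.END_OF_SPEECH))
assert stt_inst.finalized
```

Because the base implementation is a no-op, plugins that do not care about external VAD need no changes.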
8 changes: 8 additions & 0 deletions livekit-agents/livekit/agents/voice/audio_recognition.py
@@ -880,6 +880,14 @@ async def _on_stt_event(self, ev: stt.SpeechEvent) -> None:

@utils.log_exceptions(logger=logger)
async def _on_vad_event(self, ev: vad.VADEvent) -> None:
    # Forward to the active STT plugin so it can react to session-level VAD
    # (e.g. call finalize() on END_OF_SPEECH for externally-driven modes).
    if (stt_inst := self._session.stt) is not None:
        try:
            stt_inst.on_vad_event(ev)
        except Exception:
            logger.exception("error forwarding VAD event to STT")
Comment on lines +885 to +889
🟡 VAD events forwarded to session-level STT instead of the active STT instance

The code at audio_recognition.py:885 uses self._session.stt to forward VAD events, but the active STT (the one actually processing audio) is resolved by agent_activity.py:3629-3630 as self._agent.stt if is_given(self._agent.stt) else self._session.stt. When a user configures the agent with its own STT via Agent(stt=my_stt), the active STT is the agent's instance, not the session's. In this case, self._session.stt may return a different STT instance (or None if only the agent has an STT), so the on_vad_event call either reaches the wrong instance or doesn't happen at all. This means any plugin that overrides on_vad_event (the stated purpose of this PR) won't receive events when STT is set at the agent level.

Prompt for agents
The issue is in `_on_vad_event` in `audio_recognition.py`. The code forwards VAD events to `self._session.stt`, but the active STT may be the agent-level one (resolved via `agent_activity.stt` property at `agent_activity.py:3629-3630`). 

The `AudioRecognition` class currently only holds a reference to the `AgentSession` (via `self._session`), not the `AgentActivity` or `Agent`. To fix this, you could either:
1. Store a reference to the active STT instance (the `stt.STT` object, not just the `io.STTNode` callable) in `AudioRecognition` and update it when `update_stt` is called. For example, add an optional `stt_instance: stt.STT | None` parameter.
2. Have `AudioRecognition.__init__` or a new setter accept the active STT instance, and have `AgentActivity` pass `self.stt` (which correctly resolves agent vs session STT).
3. Access the active STT through the session's current activity, though this would add coupling.

The goal is to ensure `on_vad_event` is called on the same STT instance that the default `stt_node` uses (i.e., `activity.stt`).
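Option 2 can be sketched as follows. All class and attribute names here are simplified stand-ins for the real classes, shown only to illustrate the wiring; the key point is that `AgentActivity` resolves the active STT once and hands that instance to `AudioRecognition`:

```python
from __future__ import annotations


class STT:
    def __init__(self, name: str) -> None:
        self.name = name
        self.events: list[object] = []

    def on_vad_event(self, ev: object) -> None:
        self.events.append(ev)


class AudioRecognition:
    def __init__(self, stt_instance: STT | None = None) -> None:
        self._stt_instance = stt_instance

    def update_stt(self, stt_instance: STT | None) -> None:
        # kept in sync whenever the active STT changes
        self._stt_instance = stt_instance

    def _on_vad_event(self, ev: object) -> None:
        # forward to the *active* STT, whichever level it was configured at
        if self._stt_instance is not None:
            self._stt_instance.on_vad_event(ev)


class AgentActivity:
    def __init__(self, agent_stt: STT | None, session_stt: STT | None) -> None:
        # mirrors the resolution order described above: agent-level STT
        # wins over the session-level one
        self.stt = agent_stt if agent_stt is not None else session_stt
        self.recognition = AudioRecognition(stt_instance=self.stt)


# Agent configured with its own STT: events must reach it, not the session's.
agent_stt, session_stt = STT("agent"), STT("session")
activity = AgentActivity(agent_stt, session_stt)
activity.recognition._on_vad_event("end_of_speech")
assert agent_stt.events == ["end_of_speech"]
assert session_stt.events == []
```

This keeps `AudioRecognition` decoupled from `Agent` and `AgentActivity` while still guaranteeing that `on_vad_event` fires on the same instance the default `stt_node` uses.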

    if ev.type == vad.VADEventType.START_OF_SPEECH:
        speech_start_time = time.time() - ev.speech_duration - ev.inference_duration
        if not self._vad_speech_started: