@@ -5,7 +5,7 @@ description: "Turn detection using Krisp VIVA SDK"

## Overview

`KrispVivaTurn` is a turn analyzer that uses Krisp's VIVA SDK turn detection (Tt) API to determine when a user has finished speaking. Unlike the [Smart Turn model](/api-reference/server/utilities/turn-detection/smart-turn-overview) which analyzes audio in batches when VAD detects a pause, `KrispVivaTurn` processes audio frame-by-frame in real time using Krisp's streaming model.
`KrispVivaTurn` is a turn analyzer that uses Krisp's VIVA SDK turn detection v3 (Tt) API to determine when a user has finished speaking. The Tt API accepts an external VAD flag with each audio frame, allowing the model to leverage voice activity information for more accurate turn detection. Unlike the [Smart Turn model](/api-reference/server/utilities/turn-detection/smart-turn-overview), which analyzes audio in batches when VAD detects a pause, `KrispVivaTurn` processes audio frame by frame in real time using Krisp's streaming model.

<CardGroup cols={2}>
<Card
@@ -101,10 +101,10 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair(

## How It Works

`KrispVivaTurn` processes audio as a streaming model, analyzing each audio frame in real time:
`KrispVivaTurn` processes audio as a streaming model, analyzing each audio frame in real time with VAD integration:

1. **Frame-by-frame processing**: Each incoming audio frame is processed by the Krisp turn detection model, which outputs a probability that the user's turn is complete.
2. **Speech tracking**: VAD signals are used to track when speech starts and stops.
1. **VAD-enhanced processing**: Each incoming audio frame is processed by the Krisp turn detection v3 model along with a VAD flag indicating whether speech is present. The model uses both the audio and VAD information to output a probability that the user's turn is complete.
2. **Speech tracking**: VAD signals are used to track when speech starts and stops, providing context to the turn detection model.
3. **Threshold crossing**: When the model's probability exceeds the configured `threshold` after speech has been detected, the turn is marked as complete.

This differs from the [Smart Turn model](/api-reference/server/utilities/turn-detection/smart-turn-overview), which buffers audio and runs batch inference when VAD detects a pause. `KrispVivaTurn` makes its decision continuously as audio flows through, which can result in faster turn detection.
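The threshold-crossing behavior described above can be sketched in plain Python. This is an illustrative simulation with hypothetical names, not the Krisp SDK or Pipecat API: each frame carries a VAD flag and a model probability, and the turn completes on the first frame whose probability exceeds the threshold after speech has been observed.

```python
def detect_turn_end(frames, threshold=0.5):
    """Return the index of the first frame whose turn-complete probability
    exceeds `threshold` after speech has been observed, or None.

    `frames` is a sequence of (is_speech, turn_complete_probability) pairs,
    standing in for per-frame VAD flags and model outputs.
    """
    speech_seen = False
    for i, (is_speech, prob) in enumerate(frames):
        if is_speech:
            speech_seen = True  # VAD flag tracks when speech starts
        if speech_seen and prob > threshold:
            return i  # turn marked as complete
    return None

# Silence first (high probability ignored), then speech, then a pause where
# the model's probability crosses the threshold.
frames = [(False, 0.9), (True, 0.1), (True, 0.2), (False, 0.7)]
print(detect_turn_end(frames))  # → 3
```

Note that the high probability on the very first frame is ignored because no speech has been detected yet; the decision only arms once VAD reports speech.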
@@ -204,6 +204,57 @@ async def on_wake_phrase_timeout(strategy):
detection.
</Note>

### KrispVivaIPUserTurnStartStrategy

Uses Krisp's Interruption Prediction (IP) model to distinguish genuine user interruptions from backchannels (e.g., "uh-huh", "yeah"). When VAD detects user speech, this strategy feeds audio frames into the Krisp VIVA IP model, which outputs a probability indicating whether the speech is a genuine interruption. A user turn is triggered only when this probability exceeds the configured threshold.

This strategy is designed to work alongside other start strategies (e.g., `TranscriptionUserTurnStartStrategy` as a fallback).

<ParamField path="model_path" type="Optional[str]" default="None">
Path to the Krisp VIVA IP model file (.kef extension). If None, uses the
`KRISP_VIVA_IP_MODEL_PATH` environment variable.
</ParamField>

<ParamField path="threshold" type="float" default="0.5">
IP probability threshold (0.0 to 1.0). When the model's output exceeds this
value, the speech is classified as a genuine interruption.
</ParamField>

<ParamField path="frame_duration_ms" type="int" default="20">
Frame duration in milliseconds for IP processing. Supported values: 10, 15,
20, 30, 32.
</ParamField>

<ParamField path="api_key" type="str" default='""'>
Krisp SDK API key. If empty, falls back to the `KRISP_VIVA_API_KEY`
environment variable.
</ParamField>

```python
from pipecat.turns.user_start import (
KrispVivaIPUserTurnStartStrategy,
TranscriptionUserTurnStartStrategy,
)

strategy = KrispVivaIPUserTurnStartStrategy(
model_path="/path/to/ip_model.kef",
threshold=0.5,
)

# Use with a fallback strategy
strategies = UserTurnStrategies(
start=[
KrispVivaIPUserTurnStartStrategy(threshold=0.5),
TranscriptionUserTurnStartStrategy(), # Fallback
],
)
```
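The `frame_duration_ms` parameter determines how many audio samples each IP frame contains. A small helper sketch (not part of the SDK; names are hypothetical) shows the arithmetic and the supported-duration check:

```python
# Frame durations the docs list as supported for IP processing.
SUPPORTED_FRAME_MS = (10, 15, 20, 30, 32)

def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of audio samples in one IP frame at a given sample rate."""
    if frame_ms not in SUPPORTED_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    return sample_rate_hz * frame_ms // 1000

print(samples_per_frame(16000, 20))  # → 320
```

At 16 kHz, the default 20 ms frame is 320 samples; passing an unsupported duration such as 25 ms would raise.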

<Note>
Requires the Krisp Python SDK. See the [Krisp VIVA
guide](/pipecat/features/krisp-viva) for installation instructions.
</Note>

### ExternalUserTurnStartStrategy

Delegates turn start detection to an external processor. This strategy listens for `UserStartedSpeakingFrame` frames emitted by other components in the pipeline (such as speech-to-speech services).
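The delegation model can be illustrated with a minimal simulation (hypothetical class, not the actual Pipecat implementation): the strategy does no audio analysis of its own and simply reacts when another component emits a `UserStartedSpeakingFrame`-style event.

```python
class ExternalStartStrategySketch:
    """Toy stand-in for an external turn-start strategy: it only listens
    for speaking-started events emitted elsewhere in the pipeline."""

    def __init__(self):
        self.turn_started = False

    def on_frame(self, frame_name):
        # Delegate the decision: trigger only on the external signal.
        if frame_name == "UserStartedSpeakingFrame":
            self.turn_started = True

strategy = ExternalStartStrategySketch()
strategy.on_frame("AudioRawFrame")          # ignored
strategy.on_frame("UserStartedSpeakingFrame")
print(strategy.turn_started)  # → True
```

This is why the strategy suits speech-to-speech services: the upstream service already decides when the user started speaking, and the strategy just forwards that decision.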
50 changes: 48 additions & 2 deletions pipecat/features/krisp-viva.mdx
@@ -6,10 +6,11 @@ description: "Learn how to integrate Krisp's VIVA voice isolation and turn detec

## Overview

Krisp's VIVA SDK provides three capabilities for Pipecat applications:
Krisp's VIVA SDK provides four capabilities for Pipecat applications:

- **Voice Isolation** — Filter out background noise and voices from the user's audio input stream, yielding clearer audio for fewer false interruptions and better transcription.
- **Turn Detection** — Determine when a user has finished speaking using Krisp's streaming turn detection model, as an alternative to the [Smart Turn model](/api-reference/server/utilities/turn-detection/smart-turn-overview).
- **Interruption Prediction** — Distinguish genuine user interruptions from backchannels (e.g., "uh-huh", "yeah"), preventing the bot from being interrupted by brief acknowledgements.
- **Voice Activity Detection** — Detect speech in audio streams using Krisp's VAD model, supporting sample rates from 8kHz to 48kHz.

You can use any combination of these features together.
@@ -29,6 +30,13 @@ You can use any combination of these features together.
>
API reference for turn detection
</Card>
<Card
title="KrispVivaIPUserTurnStartStrategy"
icon="code"
href="/api-reference/server/utilities/turn-management/user-turn-strategies#krispvivaipuserturnstartstrategy"
>
API reference for interruption prediction
</Card>
<Card
title="KrispVivaVadAnalyzer Reference"
icon="code"
@@ -111,13 +119,17 @@ KRISP_VIVA_FILTER_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-tel-v2.kef
# Turn detection model path
KRISP_VIVA_TURN_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-tt-v2.kef

# Interruption prediction model path
KRISP_VIVA_IP_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-ip-v3.kef

# Voice activity detection model path (optional)
KRISP_VIVA_VAD_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-vad-v2.kef
```

<Note>
Each feature uses a **different model**. Set `KRISP_VIVA_FILTER_MODEL_PATH`
for voice isolation, `KRISP_VIVA_TURN_MODEL_PATH` for turn detection, and
for voice isolation, `KRISP_VIVA_TURN_MODEL_PATH` for turn detection,
`KRISP_VIVA_IP_MODEL_PATH` for interruption prediction, and
`KRISP_VIVA_VAD_MODEL_PATH` for voice activity detection.
</Note>
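Each of these settings follows the same fallback pattern: an explicit argument wins, otherwise the corresponding environment variable is used. A minimal sketch of that pattern (illustrative helper, not the SDK's actual resolution code):

```python
import os

def resolve_model_path(explicit_path, env_var):
    """Return the explicitly passed path, falling back to an env var."""
    return explicit_path or os.getenv(env_var)

# Simulate the environment from the .env example above.
os.environ["KRISP_VIVA_TURN_MODEL_PATH"] = "/models/krisp-viva-tt-v2.kef"

print(resolve_model_path(None, "KRISP_VIVA_TURN_MODEL_PATH"))
# → /models/krisp-viva-tt-v2.kef
print(resolve_model_path("/custom/model.kef", "KRISP_VIVA_TURN_MODEL_PATH"))
# → /custom/model.kef
```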

@@ -182,6 +194,40 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair(

See the [KrispVivaTurn reference](/api-reference/server/utilities/turn-detection/krisp-viva-turn) for configuration options.

## Interruption Prediction

`KrispVivaIPUserTurnStartStrategy` uses Krisp's Interruption Prediction (IP) model to distinguish genuine user interruptions from backchannels. When VAD detects user speech, the IP model analyzes the audio and outputs a probability indicating whether the speech is a real interruption or a brief acknowledgement (e.g., "uh-huh", "yeah").

This prevents the bot from being interrupted unnecessarily by short utterances. Configure it as a user turn start strategy:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
)
from pipecat.turns.user_start import (
KrispVivaIPUserTurnStartStrategy,
TranscriptionUserTurnStartStrategy,
)
from pipecat.turns.user_turn_strategies import UserTurnStrategies

user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(
user_turn_strategies=UserTurnStrategies(
start=[
KrispVivaIPUserTurnStartStrategy(threshold=0.5),
TranscriptionUserTurnStartStrategy(), # Fallback
],
),
vad_analyzer=SileroVADAnalyzer(),
),
)
```

See the [KrispVivaIPUserTurnStartStrategy reference](/api-reference/server/utilities/turn-management/user-turn-strategies#krispvivaipuserturnstartstrategy) for configuration options.

## Voice Activity Detection

`KrispVivaVadAnalyzer` detects speech in audio streams using Krisp's VAD model. It supports sample rates from 8kHz to 48kHz, making it suitable for a wide range of applications including telephony and high-quality audio.
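Since the analyzer's supported range is 8 kHz to 48 kHz, a pipeline can validate its input rate up front. A tiny sketch of such a check (illustrative only; the constant and function are not part of the SDK):

```python
# Sample-rate range the docs state KrispVivaVadAnalyzer supports.
SUPPORTED_RATE_RANGE = (8000, 48000)

def is_supported_rate(sample_rate_hz):
    """True if the given rate falls within the supported VAD range."""
    low, high = SUPPORTED_RATE_RANGE
    return low <= sample_rate_hz <= high

print(is_supported_rate(16000))  # → True  (common telephony/wideband rate)
print(is_supported_rate(96000))  # → False (above the supported range)
```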