Push-to-Talk

Overview

In real conversations, people don’t always speak in clean, uninterrupted sentences. Users may pause to gather their thoughts, correct themselves mid-sentence, or briefly stop talking without actually being done. In these cases, relying purely on automatic speech or silence detection can lead to premature interruptions or incomplete responses.

To address this, we provide Push-to-Talk (PTT) — a mechanism that gives developers explicit control over when user speech starts and ends. Instead of the system guessing when the user is finished speaking, your application tells us exactly when to listen and when to process.

Why Push-to-Talk?

Push-to-Talk is ideal when:

- Your users have non-linear or thoughtful conversational patterns
- Short pauses should not be interpreted as the end of speech
- You want deterministic control over speech boundaries
- You’re implementing custom UI/UX (e.g. press-and-hold, toggle buttons, keyboard shortcuts)

By explicitly defining the speaking window, you ensure that only the intended audio is processed and responded to.

High-Level Flow

The Push-to-Talk interaction follows a simple, explicit lifecycle:

Start Push-to-Talk
- Your application signals that the user is beginning to speak by sending the user.start_push_to_talk event.
- The system starts buffering and recording user audio.
User Speaks
- The user can talk freely, including pauses or corrections.
- No response is generated during this phase.
End Push-to-Talk
- Your application signals that the user has finished speaking by sending the user.stop_push_to_talk event.
- Audio capture stops.
Processing & Response
- The avatar processes only the audio captured between the start and end signals.
- The avatar generates and delivers its response.
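
The lifecycle above can be sketched as a small client-side controller that guards against out-of-order signals. This is a minimal sketch, not the SDK's API: only the two event names come from this page, while the PushToTalkController class, the { type } message shape, and the send callback (standing in for whatever transport your session uses) are assumptions made for illustration.

```typescript
// Sketch only. The two event names come from this page; the { type } payload
// shape and the `send` transport callback are assumptions.
type PttCommand = "user.start_push_to_talk" | "user.stop_push_to_talk";
type PttPhase = "idle" | "speaking";
type SendFn = (message: { type: PttCommand }) => void;

class PushToTalkController {
  private phase: PttPhase = "idle";
  private readonly send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  getPhase(): PttPhase {
    return this.phase;
  }

  // Opens the speaking window. Returns false if a window is already open.
  start(): boolean {
    if (this.phase !== "idle") return false;
    this.send({ type: "user.start_push_to_talk" });
    this.phase = "speaking";
    return true;
  }

  // Closes the speaking window. Returns false if no window is open.
  stop(): boolean {
    if (this.phase !== "speaking") return false;
    this.send({ type: "user.stop_push_to_talk" });
    this.phase = "idle";
    return true;
  }
}
```

A press-and-hold button would call start() on press and stop() on release, while a toggle would call whichever one matches getPhase(). Because both methods ignore calls made in the wrong phase, repeated key-down events or a double tap should not produce duplicate commands.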

Implementation Notes

When starting the session, set interactivity_type to PUSH_TO_TALK.
Your application is responsible for sending the Push-to-Talk command events:
- The start event is user.start_push_to_talk
- The end event is user.stop_push_to_talk
The system will not attempt to infer speech completion while PTT is active.
We emit the corresponding server events:
- user.push_to_talk_started
- user.push_to_talk_start_failed
- user.push_to_talk_stopped
- user.push_to_talk_stop_failed
Any audio outside of a PTT window is ignored.
We recommend mapping PTT to UI gestures such as:
- Press-and-hold buttons
- Toggle switches
- Keyboard shortcuts
- Touch or controller input
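
As a rough illustration of the notes above, the sketch below shows the session setting and one way to keep a microphone indicator in sync with the four server events. The interactivity_type field, its value, and the event names come from this page. Everything else is an assumption: the sessionOptions object shape, the PttUiState type, the applyPttServerEvent helper, and in particular the choice to leave the indicator unchanged on user.push_to_talk_stop_failed, since this page does not say whether the window stays open in that case.

```typescript
// Sketch only. The field name and value come from this page; the surrounding
// session-request shape is an assumption.
const sessionOptions = { interactivity_type: "PUSH_TO_TALK" };

// Hypothetical UI state: whether to show a live microphone, plus the last
// failure event seen (if any).
type PttUiState = { listening: boolean; lastError: string | null };

const initialPttUiState: PttUiState = { listening: false, lastError: null };

// Pure reducer: maps a server event type to the next UI state.
function applyPttServerEvent(state: PttUiState, eventType: string): PttUiState {
  switch (eventType) {
    case "user.push_to_talk_started":
      return { listening: true, lastError: null };
    case "user.push_to_talk_stopped":
      return { listening: false, lastError: null };
    case "user.push_to_talk_start_failed":
      // The window never opened, so do not show a live microphone.
      return { listening: false, lastError: eventType };
    case "user.push_to_talk_stop_failed":
      // Assumption: keep the current indicator and surface the error.
      return { ...state, lastError: eventType };
    default:
      // Unrelated events leave the PTT state untouched.
      return state;
  }
}
```

Driving the indicator from the server events rather than from the button press means the UI only shows "listening" once the start has actually been acknowledged, and a failed start can be surfaced to the user instead of silently dropping their speech.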

Summary

Push-to-Talk shifts control from automatic speech detection to your application. By explicitly defining when speech starts and ends, you gain predictable, interruption-free conversations that better match real human speaking patterns.

Use PTT when conversational timing matters — and let your users speak at their own pace.