Push-To-Talk
Supporting explicit detection of user conversation
Overview
In real conversations, people don’t always speak in clean, uninterrupted sentences. Users may pause to gather their thoughts, correct themselves mid-sentence, or briefly stop talking without actually being done. In these cases, relying purely on automatic speech or silence detection can lead to premature interruptions or incomplete responses.
To address this, we provide Push-to-Talk (PTT) — a mechanism that gives developers explicit control over when user speech starts and ends. Instead of the system guessing when the user is finished speaking, your application tells us exactly when to listen and when to process.
Why Push-to-Talk?
Push-to-Talk is ideal when:
- Your users have non-linear or thoughtful conversational patterns
- Short pauses should not be interpreted as the end of speech
- You want deterministic control over speech boundaries
- You’re implementing custom UI/UX (e.g. press-and-hold, toggle buttons, keyboard shortcuts)
By explicitly defining the speaking window, you ensure that only the intended audio is processed and responded to.
High-Level Flow
The Push-to-Talk interaction follows a simple, explicit lifecycle:
- Start Push-to-Talk
  - Your application signals that the user is beginning to speak.
  - The system starts buffering and recording user audio.
  - Send the start event (user.start_push_to_talk).
- User Speaks
  - The user can talk freely, including pauses or corrections.
  - No response is generated during this phase.
- End Push-to-Talk
  - Your application signals that the user has finished speaking.
  - Audio capture stops.
- Processing & Response
  - The avatar processes only the audio captured between the start and end signals.
  - The avatar generates and delivers its response.
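The lifecycle above can be sketched as a small client-side controller. This is a minimal sketch under stated assumptions, not the SDK's API: the PushToTalkController class, the injected send callback, and the { type: ... } message shape are hypothetical and exist only for illustration. The only names taken from this page are the two command events, user.start_push_to_talk and user.stop_push_to_talk.

```typescript
// Hypothetical message shape: only the two event names come from this page.
type PttCommand = {
  type: "user.start_push_to_talk" | "user.stop_push_to_talk";
};

// Hypothetical transport: whatever your app uses to send events to the session.
type SendFn = (command: PttCommand) => void;

class PushToTalkController {
  private talking = false;
  private readonly send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  // "Start Push-to-Talk": signal that the user is beginning to speak.
  start(): void {
    if (this.talking) return; // ignore repeated presses (e.g. key auto-repeat)
    this.talking = true;
    this.send({ type: "user.start_push_to_talk" });
  }

  // "End Push-to-Talk": signal that the user has finished speaking.
  stop(): void {
    if (!this.talking) return; // no open PTT window, nothing to stop
    this.talking = false;
    this.send({ type: "user.stop_push_to_talk" });
  }

  get isTalking(): boolean {
    return this.talking;
  }
}

// Usage: record what would be sent instead of opening a real connection.
const sent: string[] = [];
const ptt = new PushToTalkController((command) => {
  sent.push(command.type);
});
ptt.start();
ptt.start(); // duplicate press, ignored
ptt.stop();
```

Guarding against duplicate start and stop calls keeps each speaking window to exactly one start event and one stop event, even if the UI fires repeated input events. For a press-and-hold button in a browser, start() and stop() would typically be called from pointerdown and pointerup handlers.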
Implementation Notes
- When starting the session, explicitly set interactivity_type to PUSH_TO_TALK.
- Push-to-Talk command events should be sent by your application.
  - The start event is user.start_push_to_talk
  - The end event is user.stop_push_to_talk
- The system will not attempt to infer speech completion while PTT is active.
- We emit the corresponding server events:
  - user.push_to_talk_started
  - user.push_to_talk_start_failed
  - user.push_to_talk_stopped
  - user.push_to_talk_stop_failed
- Any audio outside of a PTT window is ignored.
- We recommend mapping PTT to UI gestures such as:
  - Press-and-hold buttons
  - Toggle switches
  - Keyboard shortcuts
  - Touch or controller input
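As a rough illustration of the notes above, the sketch below shows a session option that enables Push-to-Talk and a function that turns the four server events into a UI state for a talk button. The field name interactivity_type, its value PUSH_TO_TALK, and the four event names come from this page. Everything else is an assumption: the sessionOptions object, the PttUiState values, and the uiStateFor function are hypothetical, and how the options are passed and how events are delivered depends on your integration.

```typescript
// Hypothetical options object: only the field name and value come from this page.
const sessionOptions = { interactivity_type: "PUSH_TO_TALK" };

// The four server event names listed in the notes above.
type PttServerEvent =
  | "user.push_to_talk_started"
  | "user.push_to_talk_start_failed"
  | "user.push_to_talk_stopped"
  | "user.push_to_talk_stop_failed";

// Hypothetical UI states for a talk button.
type PttUiState = "idle" | "listening" | "error";

// Map a server event to the state the talk button should show.
function uiStateFor(event: PttServerEvent): PttUiState {
  switch (event) {
    case "user.push_to_talk_started":
      return "listening"; // the PTT window is open
    case "user.push_to_talk_stopped":
      return "idle"; // the PTT window is closed
    case "user.push_to_talk_start_failed":
    case "user.push_to_talk_stop_failed":
      return "error"; // surface the failure so the user can retry
  }
}

// Usage: a successful window followed by a failed start.
const events: PttServerEvent[] = [
  "user.push_to_talk_started",
  "user.push_to_talk_stopped",
  "user.push_to_talk_start_failed",
];
const states = events.map((event) => uiStateFor(event));
```

Driving the button state from the server events, rather than from the local press alone, means the UI shows "listening" only once the start has actually been confirmed.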
Summary
Push-to-Talk shifts control from automatic speech detection to your application. By explicitly defining when speech starts and ends, you gain predictable, interruption-free conversations that better match real human speaking patterns.
Use PTT when conversational timing matters — and let your users speak at their own pace.