VisionAgents

VisionAgents is GetStream’s Python framework for composing real-time AI agents from pluggable STT, LLM, TTS, and avatar components. The liveavatar plugin adds a LiveAvatar-rendered avatar to any VisionAgents pipeline, with lip-synced video and audio driven by your agent’s TTS (or Realtime LLM) output.

This page assumes you already have a VisionAgents pipeline running. If not, start with the VisionAgents quickstart, or pick a different integration path.

Prerequisites

A LiveAvatar account and API key. Sign up at app.liveavatar.com.
A GetStream account — VisionAgents uses getstream.Edge() as its transport for the end-user-facing call.
A working VisionAgents project with STT, LLM, and TTS providers configured (or a Realtime LLM).

Installation

With uv:

uv add "vision-agents[liveavatar]"

Or with pip:

pip install "vision-agents[liveavatar]"

Quick Start

import asyncio
from uuid import uuid4
from dotenv import load_dotenv

from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, gemini, getstream, liveavatar

load_dotenv()


async def start_avatar_agent():
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Avatar Agent", id="agent"),
        instructions="You're a friendly AI assistant. Keep responses short.",
        llm=gemini.LLM("gemini-3-flash-preview"),
        tts=deepgram.TTS(),
        stt=deepgram.STT(),
        avatar=liveavatar.Avatar(),
    )

    call = await agent.create_call("default", str(uuid4()))

    async with agent.join(call):
        await agent.simple_response("Hello! I'm your AI assistant with an avatar.")
        await agent.finish()


if __name__ == "__main__":
    asyncio.run(start_avatar_agent())

Required environment:

LIVEAVATAR_API_KEY=...
LIVEAVATAR_AVATAR_ID=...
GEMINI_API_KEY=...
DEEPGRAM_API_KEY=...
STREAM_API_KEY=...
STREAM_API_SECRET=...
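
A missing variable usually surfaces only once the agent tries to connect, so it can help to check the environment up front. The following is a minimal sketch using only the standard library; the helper name `missing_env` is hypothetical and not part of VisionAgents.

```python
import os

# Variable names copied from the "Required environment" list above.
REQUIRED_ENV = (
    "LIVEAVATAR_API_KEY",
    "LIVEAVATAR_AVATAR_ID",
    "GEMINI_API_KEY",
    "DEEPGRAM_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
)


def missing_env(names=REQUIRED_ENV, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env()` right after `load_dotenv()` and raising when the result is non-empty gives an early, readable error.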

Realtime LLM variant

With a Realtime LLM the TTS step is skipped — the LLM’s audio output is forwarded straight to the avatar. Do not pass tts= alongside a Realtime() LLM; the realtime audio path bypasses it entirely.

from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, liveavatar

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Avatar Agent", id="agent"),
    instructions="You're a friendly AI assistant.",
    llm=gemini.Realtime(),
    avatar=liveavatar.Avatar(is_sandbox=False),
)

Configuration

The plugin reads credentials from the environment:

LIVEAVATAR_API_KEY=...
LIVEAVATAR_AVATAR_ID=...

You can also pass api_key="..." and avatar_id="..." directly to liveavatar.Avatar(...).
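
The precedence described here (explicit argument first, environment variable second) can be sketched as below. This is an illustration of the documented fallback, not the plugin's own code; `resolve_setting` is a hypothetical name, and raising `ValueError` when nothing is set is an assumption of the sketch.

```python
import os


def resolve_setting(explicit, env_name, environ=None):
    """Return the explicit value if given, else the environment value.

    Hypothetical helper: the real plugin may report a missing value
    differently.
    """
    env = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    value = env.get(env_name)
    if not value:
        raise ValueError(f"{env_name} is not set and no value was passed")
    return value
```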

Avatar parameters

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| `avatar_id` | string | `None` | LiveAvatar avatar UUID. Falls back to `LIVEAVATAR_AVATAR_ID`. |
| `api_key` | string | `None` | Falls back to `LIVEAVATAR_API_KEY`. |
| `base_url` | string | `None` | Override the default LiveAvatar API endpoint. |
| `is_sandbox` | boolean | `True` | Sandbox sessions are free but duration-capped. Set `False` in production. |
| `max_session_duration` | integer | `None` | Session timeout in seconds. `None` uses LiveAvatar’s default. |
| `video_quality` | string | `"high"` | `low` \| `medium` \| `high` \| `very_high`. |
| `video_encoding` | string | `"H264"` | `H264` \| `VP8`. |
| `width` | integer | `1280` | Video width in pixels. |
| `height` | integer | `720` | Video height in pixels. |
| `fps` | integer | `30` | Frame rate. |
| `buffer_seconds` | float | `1.0` | Video buffer depth ahead of audio. |

width, height, and fps configure the outbound track the bot republishes into the Stream Edge room — not the avatar render itself. The render is sized server-side by video_quality. Setting width=1920 with video_quality="low" upscales a low-resolution render; it does not produce a higher-resolution avatar.

buffer_seconds is not end-to-end latency. It caps the video queue depth (fps * buffer_seconds frames) used by the A/V synchronizer to absorb jitter. Larger values absorb more network variance at the cost of added buffering delay; smaller values are tighter but more sensitive to hiccups.
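
As a worked example of the fps * buffer_seconds cap, the arithmetic can be written out as follows. Both function names are hypothetical, and truncating to a whole number of frames is an assumption of this sketch rather than documented plugin behavior.

```python
def video_queue_capacity(fps: int, buffer_seconds: float) -> int:
    """Frames the video queue can hold: fps * buffer_seconds.

    Truncation to a whole frame count is an assumption of this sketch.
    """
    return int(fps * buffer_seconds)


def buffered_duration_ms(frames: int, fps: int) -> float:
    """Playback time, in milliseconds, covered by `frames` queued frames."""
    return 1000.0 * frames / fps


# With the defaults (fps=30, buffer_seconds=1.0) the cap is 30 frames.
default_capacity = video_queue_capacity(30, 1.0)
```

Halving `buffer_seconds` to 0.5 at 30 fps gives a cap of 15 frames, which covers 500 ms of video.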

Sandbox vs. production

is_sandbox=True (the default) lets you build without burning credits, but sessions are duration-capped. For production set is_sandbox=False and pass your own avatar_id — sessions then run against your account’s credits with no duration limit beyond max_session_duration.

Sandbox mode is pinned to a specific sandbox avatar. Setting LIVEAVATAR_AVATAR_ID to one of your own production avatars while keeping is_sandbox=True will fail. Use the sandbox avatar in sandbox mode, and switch to is_sandbox=False when pointing at your own avatar_id.
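
One way to keep the two modes from being mixed up is to build the Avatar keyword arguments in a single place. The sketch below is illustrative: `avatar_kwargs` is a hypothetical helper, and it only controls the explicit arguments. In sandbox mode the plugin still reads LIVEAVATAR_AVATAR_ID from the environment, so that variable must hold the sandbox avatar.

```python
from typing import Optional


def avatar_kwargs(production: bool, production_avatar_id: Optional[str] = None) -> dict:
    """Build keyword arguments for liveavatar.Avatar(...) (hypothetical helper)."""
    if not production:
        # Sandbox: never forward a production avatar ID explicitly.
        return {"is_sandbox": True}
    if not production_avatar_id:
        raise ValueError("production mode needs your own avatar_id")
    return {"is_sandbox": False, "avatar_id": production_avatar_id}
```

The result can then be unpacked into the constructor, for example `liveavatar.Avatar(**avatar_kwargs(True, "your-avatar-uuid"))`.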

Architecture

VisionAgents uses GetStream’s Edge transport for the user-facing call. The liveavatar.Avatar() component slots into the agent’s avatar parameter: it opens a WebSocket to LiveAvatar, forwards TTS (or Realtime LLM) audio up, receives synchronized avatar audio + video, and republishes the avatar tracks into the Stream Edge room your end user is in. Your pipeline still owns STT, LLM, and TTS. LiveAvatar only renders the avatar.

How it works

Under the hood, liveavatar.Avatar() manages three things: a control connection to LiveAvatar, an inbound media connection from the LiveKit room LiveAvatar provisions, and an A/V sync stage that feeds frames into the Stream Edge track your bot publishes.

Two channels to LiveAvatar

WebSocket (control + audio upstream). TTS chunks coming out of your pipeline arrive as AudioOutputChunk items on the agent’s input_audio_stream. The plugin forwards each chunk via send_audio_frame, signals end_turn when a turn finalizes, and calls interrupt on a flush (barge-in).
LiveKit RTC (avatar A/V downstream). Session start returns a livekit_url + livekit_agent_token. The plugin joins that LiveKit room as a subscriber and receives the rendered avatar’s video frames and PCM audio.

The LiveKit room is LiveAvatar’s internal transport for delivering the rendered avatar — your end user never joins it. They only see the Stream Edge room your bot publishes into.
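
The upstream half of this flow can be sketched with stand-in types. Only the method names `send_audio_frame` and `end_turn` come from the description above; the `Chunk` class, its `final` flag, the `RecordingClient`, and the `None` sentinel are assumptions made so the example runs without a LiveAvatar connection.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for AudioOutputChunk; the `final` flag is an assumption."""
    data: bytes
    final: bool = False


@dataclass
class RecordingClient:
    """Stand-in for the WebSocket client; records calls instead of sending."""
    calls: list = field(default_factory=list)

    async def send_audio_frame(self, data: bytes) -> None:
        self.calls.append(("send_audio_frame", len(data)))

    async def end_turn(self) -> None:
        self.calls.append(("end_turn",))


async def forward_audio(stream: asyncio.Queue, client: RecordingClient) -> None:
    """Drain chunks into the client until a None sentinel arrives."""
    while True:
        chunk = await stream.get()
        if chunk is None:  # sentinel used by this sketch to stop the loop
            break
        await client.send_audio_frame(chunk.data)
        if chunk.final:
            await client.end_turn()


async def demo() -> list:
    """Push two chunks (the second closes the turn) and return the call log."""
    stream: asyncio.Queue = asyncio.Queue()
    client = RecordingClient()
    for item in (Chunk(b"\x00" * 320), Chunk(b"\x00" * 320, final=True), None):
        stream.put_nowait(item)
    await forward_audio(stream, client)
    return client.calls
```

Running `asyncio.run(demo())` shows each chunk forwarded in order, with `end_turn` sent once after the final chunk.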

AVSynchronizer and `buffer_seconds`

Avatar video and audio arrive on separate tracks and can drift. The plugin runs them through an AVSynchronizer that queues frames and emits them in lock-step on the bot’s outbound Stream tracks. buffer_seconds controls the cap on the video queue, which is fps * buffer_seconds frames. Larger values absorb more jitter at the cost of added buffering delay; smaller values are tighter but more sensitive to network hiccups. width, height, and fps configure the outbound track the bot publishes to Stream; the avatar render itself is sized by video_quality on the LiveAvatar side.
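
To make the capped queue concrete, here is a small bounded frame queue. It is not the plugin's AVSynchronizer: the class name is hypothetical, and evicting the oldest frame when the queue is full is an assumption of this sketch, since the text above only states that the queue is capped.

```python
from collections import deque


class BoundedFrameQueue:
    """Video queue capped at fps * buffer_seconds frames (illustrative only)."""

    def __init__(self, fps: int, buffer_seconds: float):
        self.capacity = max(1, int(fps * buffer_seconds))
        self._frames = deque(maxlen=self.capacity)
        self.dropped = 0

    def push(self, frame) -> None:
        if len(self._frames) == self.capacity:
            # deque(maxlen=...) evicts the oldest item on append when full.
            self.dropped += 1
        self._frames.append(frame)

    def pop(self):
        """Return the oldest queued frame, or None when empty."""
        return self._frames.popleft() if self._frames else None

    def __len__(self) -> int:
        return len(self._frames)
```

With `fps=4` and `buffer_seconds=0.5` the capacity is two frames, so pushing a third frame evicts the first.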

Session lifecycle

agent.join(call) calls Avatar.start().
The plugin calls create_session_token (passing avatar_id, is_sandbox, max_session_duration, video_quality, video_encoding), then start_session to get the LiveKit URL, agent token, and WebSocket URL.
RTC + WebSocket connect; an audio-input task starts draining the agent’s audio stream into the WebSocket.
On agent.finish() or context exit, close() cancels the audio task, closes the WebSocket and RTC connection, drains the sync stage, and calls stop_session.
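
The ordering of these steps can be modeled with an async context manager that logs each step. This is a sketch of the sequence only: the step names are labels taken from the description above, nothing here talks to LiveAvatar, and running teardown on an exception is an assumption about how context exit behaves.

```python
import asyncio
from contextlib import asynccontextmanager

START_STEPS = (
    "create_session_token",
    "start_session",
    "connect_rtc",
    "connect_websocket",
    "start_audio_task",
)
CLOSE_STEPS = (
    "cancel_audio_task",
    "close_websocket",
    "close_rtc",
    "drain_sync",
    "stop_session",
)


@asynccontextmanager
async def avatar_session(log: list):
    """Log the start steps, yield, then always log the close steps."""
    log.extend(START_STEPS)
    try:
        yield log
    finally:
        log.extend(CLOSE_STEPS)


async def run_once(fail: bool = False) -> list:
    """Enter the session, optionally fail inside it, and return the log."""
    log: list = []
    try:
        async with avatar_session(log):
            log.append("agent_running")
            if fail:
                raise RuntimeError("simulated failure")
    except RuntimeError:
        pass
    return log
```

Because the teardown sits in a `finally` block, `stop_session` is the last logged step whether the body completes or raises.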

Interruptions

When the agent emits an AudioOutputFlush (e.g. the user barges in and your STT cuts off the current turn), the plugin sends interrupt to LiveAvatar over the WebSocket and flushes the local A/V queue so the avatar stops mid-utterance instead of finishing the cancelled response.
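
The two effects of a flush, sending interrupt upstream and clearing locally queued frames, can be sketched as follows. `InterruptibleAvatar` is a hypothetical stand-in that records messages in a list rather than writing to a WebSocket.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InterruptibleAvatar:
    """Stand-in holding a sent-message log and a local A/V queue."""
    sent: list = field(default_factory=list)
    av_queue: deque = field(default_factory=deque)

    def on_flush(self) -> None:
        """Handle a flush: tell the remote side to stop, then drop local frames."""
        self.sent.append("interrupt")
        self.av_queue.clear()
```

Clearing the local queue matters because frames already received for the cancelled turn would otherwise keep playing after the interrupt is sent.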

Resources

VisionAgents LiveAvatar integration

Upstream reference for the plugin.

Plugin source

Source, examples, and tests for vision-agents[liveavatar].

Runnable example

End-to-end Gemini + Deepgram + LiveAvatar agent you can run via Runner / AgentLauncher.
