Voice & Speech

Voice input (speech-to-text)

The microphone button records audio and transcribes it through POST /api/transcribe. The backend selects the transport (REST or Realtime) automatically from the deployment name.

REST path (zero-config)

For classic Whisper-family models:

WHISPER_DEPLOYMENT_NAME=whisper # or whisper-1 / gpt-4o-transcribe / gpt-4o-mini-transcribe

The backend calls POST /audio/transcriptions synchronously and returns the transcript -- no extra deployments needed.

Realtime path (gpt-realtime-whisper)

gpt-realtime-whisper is transcription-only and runs alongside a voice Realtime model, so two deployments are required:

WHISPER_DEPLOYMENT_NAME=gpt-realtime-whisper
WHISPER_REALTIME_CONNECTION_DEPLOYMENT=gpt-realtime-mini # cheapest voice model

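The transport is chosen by checking whether WHISPER_DEPLOYMENT_NAME contains the substring realtime: names that contain it use the Realtime path, all others use REST. To override the automatic choice, set WHISPER_MODEL_KIND=rest|realtime. Optional knobs: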
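The selection rule can be sketched in a few lines of Python. This is an illustrative sketch only, not the backend's actual code: the function name select_transport is hypothetical, and the case-insensitive match and the error raised for an unknown override value are assumptions.

```python
import os
from typing import Mapping, Optional


def select_transport(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "rest" or "realtime" for the configured transcription model.

    Hypothetical helper: mirrors the documented rule, not the real backend.
    """
    env = os.environ if env is None else env

    # An explicit override wins over the name-based heuristic.
    override = env.get("WHISPER_MODEL_KIND", "").strip().lower()
    if override:
        if override not in ("rest", "realtime"):
            # Assumption: unknown values are rejected rather than ignored.
            raise ValueError(
                f"WHISPER_MODEL_KIND must be 'rest' or 'realtime', got {override!r}"
            )
        return override

    # Otherwise, look for the "realtime" substring in the deployment name.
    # Assumption: the match is case-insensitive.
    name = env.get("WHISPER_DEPLOYMENT_NAME", "")
    return "realtime" if "realtime" in name.lower() else "rest"
```

With this rule, whisper and gpt-4o-transcribe resolve to rest, gpt-realtime-whisper resolves to realtime, and setting WHISPER_MODEL_KIND forces either path regardless of the name.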

# Empty selects the GA URL; set a preview value only for legacy models
# such as gpt-4o-realtime-preview.
AZURE_OPENAI_REALTIME_API_VERSION=
# Browser webm/Opus is resampled to this PCM rate. Allowed: 16000, 24000.
WHISPER_REALTIME_AUDIO_RATE=24000
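To catch misconfiguration early, the Realtime settings above can be validated at startup. The sketch below is a hypothetical helper (RealtimeTranscribeConfig and load_realtime_config are illustrative names, not part of the project). It assumes that an unset WHISPER_REALTIME_AUDIO_RATE falls back to 24000 and that an empty API version is represented as None.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Rates the documentation lists as allowed for the resampled PCM audio.
ALLOWED_RATES = (16000, 24000)


@dataclass(frozen=True)
class RealtimeTranscribeConfig:
    transcription_deployment: str
    connection_deployment: str
    api_version: Optional[str]  # None stands for "empty, use the GA URL"
    audio_rate: int


def load_realtime_config(env: Mapping[str, str]) -> RealtimeTranscribeConfig:
    """Read and validate the Realtime transcription settings (illustrative)."""
    connection = env.get("WHISPER_REALTIME_CONNECTION_DEPLOYMENT", "").strip()
    if not connection:
        # The Realtime path needs a second deployment for the connection.
        raise ValueError("WHISPER_REALTIME_CONNECTION_DEPLOYMENT is required")

    # Assumption: an unset or empty rate falls back to 24000.
    raw_rate = env.get("WHISPER_REALTIME_AUDIO_RATE", "").strip() or "24000"
    try:
        rate = int(raw_rate)
    except ValueError:
        raise ValueError(f"audio rate must be an integer, got {raw_rate!r}") from None
    if rate not in ALLOWED_RATES:
        raise ValueError(f"audio rate must be one of {ALLOWED_RATES}, got {rate}")

    api_version = env.get("AZURE_OPENAI_REALTIME_API_VERSION", "").strip() or None

    return RealtimeTranscribeConfig(
        transcription_deployment=env.get("WHISPER_DEPLOYMENT_NAME", "").strip(),
        connection_deployment=connection,
        api_version=api_version,
        audio_rate=rate,
    )
```

Failing fast here gives a clearer error than a rejected connection later in the request path.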

The POST /api/transcribe contract is byte-for-byte identical across both transports, so the SPA needs no changes when you switch between them.

Text-to-speech

On-demand TTS for any message. The speaker button plays audio and the download button saves an MP3; generated audio is cached so that repeated requests do not trigger duplicate provider calls. Pick the provider with TTS_PROVIDER (default elevenlabs).
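One way such a cache might work is to key the audio on a hash of the provider, voice, and message text, so that the play and download buttons share a single synthesis call. The sketch below is an assumption about the approach, not the project's implementation: TTSCache, the in-memory dictionary, and the choice of key fields are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """In-memory audio cache keyed by provider, voice, and text (illustrative)."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        # `synthesize` stands in for the real provider call (hypothetical).
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key(provider: str, voice: str, text: str) -> str:
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest = hashlib.sha256()
        for part in (provider, voice, text):
            data = part.encode("utf-8")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
        return digest.hexdigest()

    def get_audio(self, provider: str, voice: str, text: str) -> bytes:
        cache_key = self.key(provider, voice, text)
        if cache_key not in self._store:
            # Cache miss: call the provider once and remember the result.
            self._store[cache_key] = self._synthesize(text)
        return self._store[cache_key]
```

Including the provider and voice in the key means a change to TTS_PROVIDER or the voice setting does not serve stale audio generated with the previous configuration.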

Option A -- ElevenLabs:

TTS_PROVIDER=elevenlabs
ELEVENLABS_API_KEY=your-api-key
TTS_MODEL_ID=eleven_multilingual_v2
TTS_VOICE_ID=your-voice-id

Option B -- Azure OpenAI Realtime voice (e.g. gpt-realtime-2), reusing your existing AZURE_OPENAI_ENDPOINT and credentials:

TTS_PROVIDER=azure-realtime
TTS_REALTIME_DEPLOYMENT=gpt-realtime-2
TTS_REALTIME_VOICE=alloy
# TTS_REALTIME_AUDIO_RATE=24000

The Azure Realtime lane reads the message text verbatim (it does not converse), matching ElevenLabs behavior.
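The two provider options can be checked with a small startup validation. This sketch is hypothetical (check_tts_env is not a function in the project) and assumes that every variable shown for a provider above is required; the real backend may treat some of them as optional.

```python
from typing import Dict, List, Mapping, Tuple

# Assumption: each variable listed in the docs for a provider is required.
REQUIRED_VARS: Dict[str, Tuple[str, ...]] = {
    "elevenlabs": ("ELEVENLABS_API_KEY", "TTS_MODEL_ID", "TTS_VOICE_ID"),
    "azure-realtime": (
        "AZURE_OPENAI_ENDPOINT",
        "TTS_REALTIME_DEPLOYMENT",
        "TTS_REALTIME_VOICE",
    ),
}


def check_tts_env(env: Mapping[str, str]) -> Tuple[str, List[str]]:
    """Return the selected provider and the names of any missing variables."""
    # TTS_PROVIDER defaults to elevenlabs when unset or empty.
    provider = env.get("TTS_PROVIDER", "").strip() or "elevenlabs"
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unknown TTS_PROVIDER: {provider!r}")
    missing = [name for name in REQUIRED_VARS[provider] if not env.get(name, "").strip()]
    return provider, missing
```

Calling check_tts_env(os.environ) at startup and logging the returned list makes a missing key visible before the first speaker-button click.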