Voice & Speech
Voice input (speech-to-text)
The microphone button records audio and sends it to POST /api/transcribe for
transcription. The backend picks the transport (REST or Realtime) from the
deployment name.
REST path (zero-config)
For classic Whisper-family models:
WHISPER_DEPLOYMENT_NAME=whisper # or whisper-1 / gpt-4o-transcribe / gpt-4o-mini-transcribe
The backend calls POST /audio/transcriptions synchronously and returns the
transcript -- no extra deployments needed.
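As a rough illustration of the REST path, the sketch below builds the URL such a backend might call. It assumes the classic deployment-scoped Azure OpenAI URL shape and an example API version (2024-06-01); the helper name transcription_url is hypothetical and not taken from this project.

```python
from urllib.parse import quote, urlencode


def transcription_url(endpoint: str, deployment: str,
                      api_version: str = "2024-06-01") -> str:
    """Build a deployment-scoped transcription URL.

    Assumptions: the classic Azure OpenAI URL layout and the default
    api_version are illustrative; the real backend may differ.
    """
    base = endpoint.rstrip("/")  # tolerate a trailing slash on the endpoint
    path = f"/openai/deployments/{quote(deployment, safe='')}/audio/transcriptions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


if __name__ == "__main__":
    print(transcription_url("https://example.openai.azure.com/", "whisper"))
```

The audio itself would then be sent to that URL as a multipart form upload, and the JSON response carries the transcript text.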
Realtime path (gpt-realtime-whisper)
gpt-realtime-whisper is transcription-only and runs alongside a voice Realtime
model, so two deployments are required:
WHISPER_DEPLOYMENT_NAME=gpt-realtime-whisper
WHISPER_REALTIME_CONNECTION_DEPLOYMENT=gpt-realtime-mini # cheapest voice model
The Realtime transport is selected when WHISPER_DEPLOYMENT_NAME contains the
substring realtime; any other name uses the REST path. To force a transport,
set WHISPER_MODEL_KIND=rest|realtime. Optional knobs:
# Empty selects the GA URL; set a preview value only for legacy models
# such as gpt-4o-realtime-preview.
AZURE_OPENAI_REALTIME_API_VERSION=
# Browser webm/Opus is resampled to this PCM rate. Allowed: 16000, 24000.
WHISPER_REALTIME_AUDIO_RATE=24000
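The selection rule and the audio-rate check can be sketched as follows. This is a minimal sketch, not the project's actual code: the function names are hypothetical, and both the case-insensitive matching and the 24000 fallback for an unset rate are assumptions.

```python
from typing import Mapping

ALLOWED_RATES = {16000, 24000}


def select_transport(env: Mapping[str, str]) -> str:
    """Return 'rest' or 'realtime' for the configured deployment."""
    # An explicit WHISPER_MODEL_KIND wins over name-based detection.
    override = env.get("WHISPER_MODEL_KIND", "").strip().lower()
    if override:
        if override not in ("rest", "realtime"):
            raise ValueError(f"WHISPER_MODEL_KIND must be rest or realtime, got {override!r}")
        return override
    # Otherwise look for the 'realtime' substring in the deployment name
    # (case-insensitive matching is an assumption).
    name = env.get("WHISPER_DEPLOYMENT_NAME", "").lower()
    return "realtime" if "realtime" in name else "rest"


def realtime_audio_rate(env: Mapping[str, str]) -> int:
    """Parse WHISPER_REALTIME_AUDIO_RATE; the 24000 fallback is an assumption."""
    raw = env.get("WHISPER_REALTIME_AUDIO_RATE", "").strip() or "24000"
    rate = int(raw)
    if rate not in ALLOWED_RATES:
        raise ValueError(f"WHISPER_REALTIME_AUDIO_RATE must be one of {sorted(ALLOWED_RATES)}")
    return rate


if __name__ == "__main__":
    import os
    print(select_transport(os.environ), realtime_audio_rate(os.environ))
```

Passing the environment in as a mapping keeps the rule easy to unit-test without touching the real process environment.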
The POST /api/transcribe contract is byte-for-byte identical across both
transports, so the SPA needs no changes when you switch between them.
Text-to-speech
On-demand TTS for any message. The speaker button plays audio and the download
button saves an MP3; audio is cached to avoid duplicate calls. Pick the provider
with TTS_PROVIDER (default elevenlabs).
Option A -- ElevenLabs:
TTS_PROVIDER=elevenlabs
ELEVENLABS_API_KEY=your-api-key
TTS_MODEL_ID=eleven_multilingual_v2
TTS_VOICE_ID=your-voice-id
Option B -- Azure OpenAI Realtime voice (e.g. gpt-realtime-2), reusing your
existing AZURE_OPENAI_ENDPOINT and credentials:
TTS_PROVIDER=azure-realtime
TTS_REALTIME_DEPLOYMENT=gpt-realtime-2
TTS_REALTIME_VOICE=alloy
# TTS_REALTIME_AUDIO_RATE=24000
The Azure Realtime lane reads the message text verbatim (it does not converse), matching ElevenLabs behavior.
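The provider switch and the audio cache described above could look roughly like this. It is a sketch under stated assumptions: which variables count as required per provider, the helper names tts_settings and cache_key, and the SHA-256 key scheme are all illustrative and not confirmed by this project.

```python
import hashlib
from typing import Dict, Mapping

# Assumed required variables per provider (illustrative only).
REQUIRED = {
    "elevenlabs": ("ELEVENLABS_API_KEY", "TTS_MODEL_ID", "TTS_VOICE_ID"),
    "azure-realtime": ("TTS_REALTIME_DEPLOYMENT", "TTS_REALTIME_VOICE"),
}


def tts_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Resolve the active provider and collect its settings."""
    provider = env.get("TTS_PROVIDER", "").strip() or "elevenlabs"  # documented default
    if provider not in REQUIRED:
        raise ValueError(f"unknown TTS_PROVIDER: {provider!r}")
    missing = [key for key in REQUIRED[provider] if not env.get(key)]
    if missing:
        raise ValueError(f"missing settings for {provider}: {', '.join(missing)}")
    return {"provider": provider, **{key: env[key] for key in REQUIRED[provider]}}


def cache_key(settings: Mapping[str, str], text: str) -> str:
    """Hash provider, voice settings, and text so repeat requests hit the cache."""
    digest = hashlib.sha256()
    for key, value in sorted(settings.items()):
        if key.endswith("_API_KEY"):
            continue  # keep secrets out of the cache key
        digest.update(f"{key}={value}".encode("utf-8"))
        digest.update(b"\0")  # separator avoids ambiguous concatenation
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    cfg = tts_settings({
        "TTS_PROVIDER": "azure-realtime",
        "TTS_REALTIME_DEPLOYMENT": "gpt-realtime-2",
        "TTS_REALTIME_VOICE": "alloy",
    })
    print(cfg["provider"], cache_key(cfg, "Hello"))
```

Including the provider and voice in the key means that switching TTS_PROVIDER or the voice does not serve stale audio generated with the previous configuration.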