Skip to main content

Models & Reasoning

Multi-model switching

Configure several models and switch between them mid-conversation:

AZURE_OPENAI_MODELS=gpt-4o,o3,gpt-4.1-mini
  • A model selector appears above the chat input (hidden when only one model is configured)
  • Per-session model selection persists across reloads
  • Regenerate with a different model -- click the chevron on Regenerate
  • Each assistant message shows which model generated it
  • All models share the same Tools, Skills, and MCP integrations

Per-model context-window limits use one shared variable across all providers:

MODEL_MAX_CONTEXT_TOKENS=gpt-5.5:1050000,claude-opus-4-8:200000

The context-window progress bar above the input updates automatically when you switch models.

Anthropic (Claude) provider

Enable Claude models alongside Azure OpenAI -- they appear in the same selector and can be picked per turn. Anthropic is disabled by default; leaving ANTHROPIC_MODELS unset is a no-op.

Two hostings, selected by one variable:

# direct (Anthropic public API) | foundry (Anthropic on Azure AI Foundry)
ANTHROPIC_HOSTING=direct
ANTHROPIC_MODELS=claude-sonnet-4-5-20250929,claude-haiku-4-5

Direct hosting:

ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_BASE_URL=https://your-gateway.example.com # optional proxy

Foundry hosting (Anthropic on Azure AI Foundry) -- supply exactly one endpoint and one auth method:

ANTHROPIC_HOSTING=foundry
ANTHROPIC_FOUNDRY_RESOURCE=my-aifoundry # resource name (subdomain) only
# OR: ANTHROPIC_FOUNDRY_BASE_URL=https://my-aifoundry.services.ai.azure.com/anthropic/

# Auth A -- API key (sent as the api-key header):
ANTHROPIC_FOUNDRY_API_KEY=<foundry-key>
# Auth B -- Entra ID: leave the API key EMPTY and reuse AZURE_CREDENTIAL_MODE
# (cli | managed-identity | default) + AZURE_TENANT_ID.
warning

ANTHROPIC_FOUNDRY_RESOURCE is the Azure AI Services resource name (the subdomain), not the Foundry project URL. For Entra ID auth, do not place a token in ANTHROPIC_FOUNDRY_API_KEY -- it is sent verbatim as the api-key header and returns HTTP 401.

Anthropic requires max_tokens on every request as a hard output cap:

ANTHROPIC_MAX_TOKENS=8192

Hosted web search works out of the box for Claude (web_search_20250305). Every other agent feature works on either provider as long as the model supports tool calling. Speech-to-text, text-to-speech, image generation, and RAG embedding run on their own dedicated Azure models, independent of the chat provider.

Per-message reasoning effort

A selector next to the model picker sets how hard the model reasons, per message. Allowed levels and the default follow the selected model and are served by the backend (GET /api/model); there is no environment variable. The chosen effort is shown next to the model name and token usage and persists with the session.

ProviderLevelsDefaultMechanism
Azure OpenAI (gpt-5.5 / gpt-5.4)low, medium, high, xhighmediumreasoning.effort
Anthropic (Opus 4.8 / 4.7)low, medium, high, xhigh, maxxhighadaptive thinking + output_config.effort

Prompt caching (input-token cost)

Every model call re-sends a large, stable prefix -- the system prompt plus the full tool schemas. On a long turn (especially a coding tool loop, where one message fans out into many sequential model calls) that prefix is billed again and again. Prompt caching marks the stable prefix as cacheable so it is billed once and re-read cheaply on the following calls. It is output-transparent: the model's replies are identical -- only billing and latency change.

It is on by default and provider-agnostic:

  • Anthropic (Claude): the backend injects cache_control breakpoints on the system block (which also covers the tool definitions) and the last few conversation messages, so in an agentic/coding tool loop the conversation tail -- tool results, file contents, reasoning -- re-reads at a large discount instead of being re-billed at full price on every model call.
  • Azure OpenAI: prompt caching is automatic for prefixes of ~1024 tokens or more; nothing to configure.

For very long tool loops, the sliding-window history compaction shifts the conversation prefix and lowers the conversation cache hit-rate. If caching is your priority, raise COMPACTION_KEEP_LAST_GROUPS (keeps a wider stable window) and/or set ANTHROPIC_PROMPT_CACHE_TTL=1h (survives multi-minute pauses).

# Master toggle (default true). Set false to disable caching entirely.
PROMPT_CACHE_ENABLED=true

# Anthropic cache lifetime: 5m (default) or 1h (extended cache). Unknown -> 5m.
ANTHROPIC_PROMPT_CACHE_TTL=5m

When the active provider reports cache usage, the per-message token readout includes cache_read_input_tokens / cache_write_input_tokens, so you can see the savings directly.

tip

If credit usage still feels high on Claude, also lower the reasoning effort per message (the selector above). The Anthropic default is xhigh, which spends a lot of output tokens thinking on every step; medium or low is plenty for many coding edits.