Reference Guide

Mistral & DeepSeek

The two leading non-US open-weight model providers. Mistral AI from France — EU residency, frontier and edge models, Apache 2.0 weights. DeepSeek from China — massive Mixture-of-Experts architectures, aggressive pricing, and the V4 release that just shipped a 1M-token open-source frontier model.

← Back to Reference Hub

Mistral's flagship released as part of the Mistral 3 family in December 2025. Multimodal Mixture-of-Experts model with image understanding, available as both base and instruct under Apache 2.0 — a frontier-class open-weight model that any team can self-host or fine-tune.

  • $2.00 / 1M input, $6.00 / 1M output via La Plateforme — roughly 40% below GPT-5.4 and Claude Sonnet output rates
  • 128K token context window
  • Multilingual (French, German, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Chinese, Japanese, Korean, Arabic plus more) and natively multimodal
  • Apache 2.0 weights on Hugging Face — commercial use, fine-tuning, distillation all permitted
  • Available on Microsoft Azure AI Foundry, AWS Bedrock, Google Vertex AI, IBM watsonx, Snowflake Cortex, NVIDIA NIM

Limitations: 128K context lags GPT, Claude, and DeepSeek V4 (1M+). Self-hosting at full precision needs serious GPU infrastructure (4×H100 minimum for production). Reasoning benchmarks trail OpenAI o-series and DeepSeek R1.

FrontierApache 2.0

The middle tier in the Mistral 3 lineup. Designed as the everyday production workhorse — meaningfully cheaper than Large with most of the capability for general chat, RAG, and structured output workloads.

  • $0.40 / 1M input, $2.00 / 1M output — the sweet spot for production deployment
  • 128K token context window
  • Function calling, JSON mode, and structured output
  • Available via La Plateforme, Azure AI Foundry, AWS Bedrock, and Vertex AI
  • Same multilingual coverage as Large 3

Limitations: Not always available as open weights — some Medium-tier releases ship API-only while community gets Small + Large. Confirm weight availability before assuming you can self-host.

Mid-Tier

The high-volume, edge-deployable model in the Mistral 3 family, refreshed in March 2026 as Small 4. Apache 2.0 licensed and small enough to run on a single consumer GPU — positioned as the open replacement for proprietary mid-tier models.

  • $0.15 / 1M input, $0.60 / 1M output via La Plateforme
  • 128K token context window
  • Apache 2.0 weights — runs on a single H100 at full precision, or quantized on consumer hardware
  • Strong instruction-following relative to size; good fit for fine-tuning on domain data
  • Ideal for high-volume agent loops, classification, summarization, and embedded inference

Limitations: Multimodal capability varies by release; check the model card for image support. Reasoning quality is bounded — route hard problems to Large 3 or DeepSeek R1. Not a chat-by-default product; build your own UX or use Le Chat.

Small / EdgeApache 2.0

Mistral's code-completion specialist. 22B parameters — small enough for an RTX 4090 to run full-precision — with a 256K context window (twice the rest of the lineup). Trained on 80+ programming languages with strong fill-in-the-middle for IDE autocomplete.

  • $0.30 / 1M input, $0.90 / 1M output via Mistral's standard API
  • Free Codestral API endpoint for IDE-integration use cases
  • 256K token context — the largest in the Mistral lineup
  • Native fill-in-the-middle (FIM) for autocomplete; code-specific tokenizer
  • Integrations with Cursor, Continue, Tabby, JetBrains AI Assistant

Limitations: Optimized for code completion, not general-purpose chat or reasoning — use Large or Medium for explanation and architectural discussion. Open weights are released under the Mistral Non-Production License (MNPL), which restricts commercial production use without a paid license; the Apache 2.0 family does not include Codestral by default.

Code ModelMNPL license

Mistral's dedicated vision-language line. Pixtral 12B (Sept 2024) is a small open-weight model for multimodal apps; Pixtral Large (Nov 2024) is a 124B vision-first model claiming to beat GPT-4o on chart and document interpretation. Both predate the Mistral 3 multimodal capabilities and remain available for vision-specific workloads.

  • Pixtral 12B — $0.10 / 1M input, $0.10 / 1M output, 128K context, 12B params, Apache 2.0
  • Pixtral Large — 124B params, 128K context, API-only with no public per-token pricing
  • Process screenshots, charts, diagrams, photos, and PDFs alongside text without an external pipeline
  • Pixtral 12B runs on a single A100 / H100 at full precision; great for self-hosted vision RAG

Limitations: Mistral Large 3 now ships native multimodal capability, partially overlapping Pixtral's positioning — pick Pixtral when you specifically want a smaller, vision-focused model or open weights for the 12B size. Pixtral Large's API-only access and lack of public pricing make budget forecasting hard.

Vision12B is Apache 2.0

Mistral's consumer-facing chat product, the European answer to ChatGPT. Free tier at chat.mistral.ai with web search, code interpreter, and image generation. Le Chat Enterprise ships in two variants: SaaS hosted by Mistral in France, or self-hosted on private cloud. As of April 2026, also distributed on AWS, Azure, and GCP marketplaces for enterprise procurement.

  • Le Chat (free) — Mistral Large by default, web search, image generation, document upload
  • Le Chat Pro (€14.99/mo) — higher rate limits, priority access
  • Le Chat Team / Enterprise — SSO, admin console, custom assistants, MCP connectors
  • Le Chat Enterprise can be deployed self-hosted, in private cloud (AWS / Azure / GCP), or as Mistral-hosted SaaS in France
  • Zero data retention enforceable on Enterprise; default consumer retention is 30 days

Limitations: Smaller ecosystem of plugins/integrations than ChatGPT or Claude. Le Chat reasoning quality is bounded by Mistral Large 3 — not as strong as GPT or Claude on hardest problems. Mobile apps shipped later than competitors.

Chat AppFree Tier

Mistral's developer platform — the equivalent of OpenAI's platform.openai.com. The structural pitch versus US providers is data residency: with OpenAI, Anthropic, or Google you opt into EU residency as an enterprise add-on; with Mistral, EU residency is the baseline and US routing is the opt-in.

  • Pay-as-you-go API access to all Mistral models including fine-tuning endpoints
  • EU-hosted by default; DPA available without enterprise upgrade
  • 30-day token retention by default; zero retention available on Enterprise plan
  • Embedding models, OCR, function calling, JSON mode, batch API, structured output
  • Mistral Agents API for building agentic workflows with built-in tools (web search, code interpreter, image generation, document library)
  • SOC 2, ISO 27001, GDPR-aligned by design

Limitations: Smaller third-party ecosystem than OpenAI or Anthropic — fewer pre-built SDKs, eval tools, observability integrations. Latency from US-east is meaningfully higher than calling US-hosted providers; consider routing through hyperscaler marketplace endpoints (Azure / AWS / GCP) if your users are mostly American.

Developer APIEU residency default

The headline 2026 release — an open-source MoE family built from the ground up for million-token context as a default rather than a bolt-on. Two sizes: V4-Pro for frontier-grade work, V4-Flash for high-throughput production. Both shipped with Apache 2.0 weights on Hugging Face the same day they hit the API.

  • V4-Pro: 1.6T total params / 49B activated. $0.145 cache-hit input, $1.74 cache-miss input, $3.48 output per 1M tokens (list)
  • V4-Pro promo — active through 2026-05-31 15:59 UTC: 75% off, so effective rates today are $0.003625 cached input, $0.435 cache-miss input, $0.87 output per 1M tokens. A May-2026 deployment costs ~4x less than the list prices above; expect to revert to list from June 1.
  • V4-Flash: 284B total params / 13B activated. $0.028 cache-hit input, $0.14 cache-miss input, $0.28 output per 1M tokens
  • Native 1M token context window — not a sliding-window approximation
  • At 1M context, V4-Pro uses ~27% of the per-token FLOPs and 10% of the KV cache vs. V3.2; V4-Flash drops to ~10% FLOPs and 7% cache
  • Apache 2.0 open weights for commercial use; available on Hugging Face, Together, Fireworks, OpenRouter, and DeepSeek's own API

Limitations: Less than 24 hours old as of this guide — production benchmarks and third-party tooling support are still maturing. Self-hosting V4-Pro at the full 1.6T parameter count requires a serious GPU cluster; most teams will use it through DeepSeek's API or a hyperscaler. Cache-miss pricing is what most non-repetitive workloads pay; the headline cache-hit prices assume meaningful prompt repetition.

FrontierApache 2.0Just released

The previous-generation flagship that established DeepSeek's reputation. 671B total parameters with 37B activated per token. Trained on 14.8T tokens, refined across V3.1 and V3.2 releases through 2025. Still the safest production choice if you don't want to be the first to deploy V4.

  • Mixture-of-Experts with Multi-Head Latent Attention (MLA) for efficient KV caching
  • 128K token context window in standard deployments (up to 1M in research builds)
  • Apache 2.0 open weights on Hugging Face
  • Wide third-party support — Together, Fireworks, OpenRouter, Groq, Cerebras, NVIDIA NIM
  • Strong general-purpose performance; well-understood failure modes after 18 months in production

Limitations: Now superseded by V4 on capability and context length. Pricing on DeepSeek's own API has been unified into the V4 lineup; some third-party hosts still serve V3.2 at the older rates ($0.27 / $1.10 per 1M historically). Useful as a fallback model in router setups.

ProductionApache 2.0

The first frontier-grade open reasoning model. R1 introduced large-scale reinforcement learning over chain-of-thought traces, producing OpenAI o1-class results on math, code, and science benchmarks — while shipping the weights publicly. The release reset industry expectations for what open models could do on reasoning workloads.

  • Same MoE architecture as V3 (671B / 37B activated)
  • Generates extensive visible reasoning traces before final answers — latency is higher than V3 / V4-Flash but answers are demonstrably better on hard problems
  • Distilled smaller variants (1.5B, 7B, 8B, 14B, 32B, 70B) released as open weights for self-hosting
  • Code released MIT, model weights released under DeepSeek's permissive Model License (commercial use allowed with use-case restrictions on illegal/harmful content)
  • Available on DeepSeek API and every major third-party host

Limitations: Slower and more expensive per query than non-reasoning models — only use it when the task actually benefits from extended thinking. Visible chain-of-thought can be a problem for end-user UX; many apps suppress it. R2 is rumored but unreleased as of April 2026.

ReasoningOpen weights

DeepSeek's dedicated code model, trained from scratch on 6T tokens with 338 programming languages represented. Coder V2 ships in two sizes — a 236B MoE and a 16B MoE Lite — both with open weights and commercial-use rights.

  • Coder V2: 236B total / 21B activated, 128K context, frontier-grade benchmarks on HumanEval, MBPP, LiveCodeBench
  • Coder V2 Lite: 16B total / 2.4B activated, runs locally on consumer GPUs
  • Code MIT-licensed; model weights under DeepSeek's commercial-use Model License
  • Strong fill-in-the-middle, repo-level reasoning, and multi-file refactor performance
  • Available via DeepSeek API, Hugging Face, Ollama, LM Studio, and major inference hosts

Limitations: Many teams now use DeepSeek V4-Flash for general code work and reserve Coder V2 for repository-scale reasoning or self-hosted IDE integrations. Smaller community-maintained tooling than Codestral has inside the JetBrains/VS Code ecosystem.

Code ModelOpen weights

DeepSeek's first-party API at api.deepseek.com. The pricing strategy — aggressive headline rates plus context-cache discounts plus off-peak windows — is the cheapest way to access frontier-grade models for many workloads. OpenAI-compatible endpoints make migration mostly a base-URL change.

  • OpenAI-compatible REST API — drop-in for the OpenAI Python / Node SDKs by changing the base URL
  • Context caching applies a 90% discount automatically when input tokens are served from cache (e.g. repeated system prompts in agent loops)
  • Off-peak window: 16:30–00:30 GMT, 50–75% discount on regular rates — useful for batch / async workloads
  • Function calling, JSON mode, streaming, and the standard OpenAI feature set
  • Hosted in China — latency from the US/EU is meaningfully higher than US-hosted providers; expect 200–400ms first-token latency from the US

Limitations: China-hosted infrastructure raises real data-governance questions for regulated industries and US enterprise procurement — expect security-review headwinds. Use a hyperscaler-hosted endpoint (AWS Bedrock Marketplace, Together, Fireworks, OpenRouter) when residency matters. No SLA tiers comparable to OpenAI / Anthropic enterprise.

Developer APIChina-hosted