Interactive leaderboard

Fastest LLM APIs 2026: Low-Latency AI Models Compared

Rank the fastest LLM APIs in 2026 using throughput and TTFT latency signals with pricing context. Shortlist responsive AI models for real-time apps in the US, Canada, and Australia.

Speed-ranked LLMs with API cost on the same canvas in 2026

Speed scores reflect interactive-class behavior: smaller fast tiers vs. heavy flagships, grounded in benchmark metadata and tier cues—not a single vendor’s marketing latency claim. Product teams across the United States, Canada, and Australia use this tab to protect UX on chat surfaces while still eyeballing what responsiveness costs at production token volumes.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.


Use Cached Pricing

Enable to apply a 50% discount to input tokens where cached rates apply.


Deep Reasoning / Thinking Mode

When enabled, the model's hidden reasoning (extended thinking) tokens are charged like output tokens.


Batch Pricing

Enable for 50% off input and output tokens where batch/async pricing applies.

Preset sliders (estimate · current value · range): ≈ $100.00/mo · 8K · 1K to 1.0M; ≈ $100.00/mo · 2K · 100 to 500K; ≈ $200.00 total · 5K · 10 to 100K

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.
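
The toggles above can be read as simple multipliers on a token-based estimate. The following is a minimal sketch of that arithmetic, not the calculator's actual implementation: the `Workload` fields, the per-million-token prices, and the choice to stack the cached and batch discounts multiplicatively are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical monthly workload description (field names are illustrative)."""
    requests_per_month: int
    input_tokens_per_request: int
    output_tokens_per_request: int
    thinking_tokens_per_request: int = 0  # hidden reasoning tokens, if any


def estimate_monthly_cost(
    workload: Workload,
    input_price_per_m: float,   # assumed USD per 1M input tokens
    output_price_per_m: float,  # assumed USD per 1M output tokens
    use_cached: bool = False,
    use_batch: bool = False,
    thinking_mode: bool = False,
) -> float:
    """Estimate monthly spend in USD under the toggle rules described above."""
    input_tokens = workload.requests_per_month * workload.input_tokens_per_request
    output_tokens = workload.requests_per_month * workload.output_tokens_per_request

    # Thinking mode: reasoning tokens are billed like output tokens.
    if thinking_mode:
        output_tokens += (
            workload.requests_per_month * workload.thinking_tokens_per_request
        )

    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m

    # Cached pricing: 50% off input tokens only.
    if use_cached:
        input_cost *= 0.5

    # Batch pricing: 50% off input and output.
    # Assumption: this stacks with the cached discount; real providers may differ.
    if use_batch:
        input_cost *= 0.5
        output_cost *= 0.5

    return round(input_cost + output_cost, 2)


if __name__ == "__main__":
    demo = Workload(10_000, 8_000, 500, thinking_tokens_per_request=500)
    for flags in ({}, {"use_cached": True}, {"use_batch": True}, {"thinking_mode": True}):
        print(flags, estimate_monthly_cost(demo, 0.10, 0.40, **flags))
```

Whether a given model actually qualifies for cached or batch rates depends on the provider, so treat the discounted figures as upper bounds on savings.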

Magic quadrant (top 15)

X: est. monthly · Y: Speed · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Scoring notes
Inception: Mercury 2 | $17.50 | 40 | 43 | 44 | 100 | 44 | 128K | 44 | Fast-dLLM v2 (Mercury 2) averages 43.5 across HumanEval, GPQA, and MMLU. As a 1B-scale lightweight model, coding and logic map to ~43. Speed is 100 due to >1,000 tokens/sec diffusion generation.
Relace: Relace Apply 3 | $46.50 | 47 | 70 | 65 | 100 | 50 | 256K | 63 | Evidence explicitly states benchmarks are unavailable. As a specialized code-patching model with 10,000 tok/s throughput, speed is scored 100. Coding and logic are inferred moderately (70/60) due to lack of SWE-bench or GPQA data.
Google: Gemini 3.1 Flash Lite Preview | $25.00 | 53 | 55 | 78 | 98 | 70 | 1.0M | 70 | GPQA Diamond at 86.9% maps logic to 85. Coding score of 30.1 maps to 55. MMMU Pro at 76.8% sets multimodal to 77. As a Lite tier, speed is exceptionally high (381 tok/s) mapping to 98.
Google: Gemini 3.1 Flash Lite | $25.00 | 50 | 25 | 83 | 98 | 66 | 1.0M | 64 | SWE-bench Verified at 22% maps to 25 coding. GPQA Diamond at 86.9% maps to 87 logic. As a Lite tier, it excels in speed (381 t/s, 98) but trails flagships in coding.
Llama Guard 3 8B | $19.66 | 46 | 55 | 60 | 98 | 45 | 131K | 55 | Based on Llama 3 8B stats: HumanEval 62.2% (Coding 55), MMLU 68.4% (Logic 60), MATH 30% (Math 45). As an 8B lightweight tier, scores are adjusted down versus flagships. Speed is 98 reflecting 765 tps on Groq.
LiquidAI: LFM2-24B-A2B | $2.40 | 61 | 71 | 44 | 97 | 55 | 128K | 54 | Benchable.ai cites Coding 71%, Reasoning 50%, Instruction 38%, and Speed at 97th percentile. Mapped directly to 0-100 scale. As a lightweight 2B active MoE, it prioritizes speed over flagship-level logic and coding.
OpenAI: GPT-5.4 Nano | $20.50 | 56 | 75 | 75 | 95 | 70 | 400K | 74 | SWE-Bench Pro at 52.4% maps to 75 coding. Humanity's Last Exam at 24.3% maps to 70 logic. As a 'Nano' tier model, it is optimized for speed (95) over deep reasoning, reflecting its lightweight architecture.
Relace: Relace Search | $70.00 | 41 | 50 | 60 | 95 | 40 | 256K | 53 | No standard benchmarks (SWE-bench, GPQA) provided. Scores estimated for a specialized codebase search subagent. Speed rated 95 based on claims of 10,000 tokens/sec. Native reasoning confirmed via explicit mention of reasoning tokens.
xAI: Grok 3 Mini | $17.00 | 63 | 85 | 85 | 95 | 85 | 131K | 85 | Web digest notes Grok 3 Mini outperforms Grok-2 mini (HumanEval >87.2%, MATH >70.2%). Mapped to ~85 for coding/math. As a 'Mini' tier with native reasoning, it prioritizes speed (100 tok/s, mapped to 95) over flagship-level logic.
Anthropic: Claude 3 Haiku | $22.50 | 53 | 65 | 73 | 95 | 65 | 200K | 69 | Claude 3 Haiku (Lightweight tier) scores MMLU 76.7%, HumanEval 75.9%, GSM8K 88.9%, and MMMU 50.2%. Mapped coding to 65 and logic to 70, reflecting its distilled nature compared to flagship models. Speed is rated 95.
Google: Gemini 3.5 Flash | $150.00 | 57 | 85 | 88 | 95 | 75 | 1.0M | 84 | SWE-bench Verified 78.0% and GPQA Diamond 90.4% map to 85 and 90. Despite being a lightweight Flash tier (speed 95), explicit evidence dictates high capability scores, though typically lower than Pro.
Qwen: Qwen3 Coder Flash | $17.55 | 54 | 70 | 68 | 95 | 65 | 1.0M | 68 | Evidence lacks exact benchmark numbers but notes Qwen3 Coder Flash is a speed-optimized, lightweight tier. Scores inferred cautiously for a Flash model, prioritizing speed (95) over coding/logic compared to the flagship Coder Plus.
Google Gemini Flash Latest | $150.00 | 48 | 70 | 65 | 95 | 75 | 1.0M | 69 | Evidence cites HumanEval 74.3% (Coding ~70), GPQA 51.0% (Logic ~60), and MMMU 62.3% (Multimodal ~65). As a lightweight Flash tier, scores are adjusted lower than Pro flagships, while Speed is rated high (~95) for its class.
Qwen: Qwen3.6 Flash | $18.75 | 60 | 85 | 78 | 95 | 80 | 1.0M | 80 | SWE-bench Verified 73.4 (via 35B-A3B base) maps to 85 coding. Flash tier yields 95 speed (119 tok/s). Logic/Math inferred ~75-80 due to missing GPQA/MMLU scores. Native reasoning supported.
OpenAI: GPT-5.1-Codex-Mini | $30.00 | 59 | 88 | 78 | 95 | 80 | 400K | 81 | SWE-bench Verified 55.0% (mapped to 88) and GPQA Diamond 52.0% (mapped to 75) show strong capabilities. As a Mini tier, it prioritizes speed (175 tok/s, mapped to 95) while maintaining solid reasoning and coding performance.
Qwen: Qwen3 Coder Next | $12.40 | 61 | 85 | 72 | 95 | 85 | 262K | 78 | HumanEval 92.7% and GPQA-D 42.4% map to 85 coding and 55 logic. IFEval 89.6% yields 88 instruction. Speed is 95 based on 162 tok/s. As an 80B (3B active) efficient model, logic is appropriately scaled.
xAI: Grok 4 Fast | $13.00 | 41 | 40 | 43 | 95 | 45 | 2.0M | 43 | Evidence cites a 43.5 average across HumanEval, GPQA, and MMLU for this 1B-scale Fast model. Mapped to ~40-45 for coding and logic. As a lightweight tier, speed is rated very high.
Google: Gemini 2.5 Flash Lite | $8.00 | 54 | 45 | 68 | 95 | 65 | 1.0M | 61 | Evidence lacks exact Flash-Lite scores but notes it underperforms Flash (GPQA 78.3%, MMMU 76.7%). As a Lite tier, scores are adjusted downward (Logic 65, Coding 45). Speed is heavily weighted (95) due to 68 tok/s and ultra-low latency.
Mistral: Ministral 3 8B 2512 | $7.50 | 52 | 45 | 58 | 95 | 65 | 262K | 56 | GPQA 47.1 and MMLU 76.1% map to Logic 55. LiveCodeBench 30.3 maps to Coding 45. MATH 62.6% maps to Math 65. As an 8B lightweight tier, it scores lower on reasoning but achieves 161 tok/s (Speed 95).
OpenAI: GPT-5.1 Chat | $150.00 | 54 | 85 | 78 | 95 | 80 | 128K | 80 | HumanEval 91% maps to 85 coding. GPQA 53.6% maps to 75 logic. MMMU 69.1% maps to 75 multimodal. As a lightweight 'Instant' tier model, speed is rated 95, with capabilities adjusted below flagship levels.
Baidu: Qianfan-OCR-Fast | $55.30 | 37 | 40 | 50 | 95 | 40 | 66K | 45 | No benchmarks provided. Inferred scores based on 'Fast' tier and OCR specialization. Speed is exceptional (claimed 1M tokens/s). Multimodal scored high for OCR focus; coding and logic scored lower as a specialized lightweight model.
Claude Haiku 4.5 | $90.00 | 59 | 92 | 85 | 95 | 80 | 200K | 86 | SWE-bench Verified 73.3% maps to 92 coding. Though a lightweight Haiku tier, evidence explicitly states it matches Sonnet 4's reasoning and coding, justifying high logic (85) alongside top-tier speed (95).
Qwen: Qwen3.5-Flash | $5.20 | 75 | 88 | 92 | 95 | 95 | 1.0M | 92 | SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models.
Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview) | $50.00 | 47 | 60 | 65 | 95 | 60 | 131K | 63 | No text benchmarks provided; inferred Coding/Logic at 60 for Flash-tier. Speed set to 95 (102 tok/s). Multimodal set to 95 based on 'Pro-level visual quality' and 88% Graphic Design ELO. Vision price derived from $0.0005/K images.
Google: Gemini 2.5 Flash Lite Preview 09-2025 | $8.00 | 52 | 45 | 67 | 95 | 55 | 1.0M | 58 | Based on GPQA (65.1-70.9%) and LiveCodeBench (64.1-68.8%), logic and coding map to 68 and 45. As a 'Flash Lite' tier, speed is heavily weighted (95), reflecting its ultra-low latency design over flagship-level reasoning.
Meta: Llama 3.2 1B Instruct | $3.09 | 45 | 25 | 35 | 95 | 30 | 131K | 31 | MMLU 49.3%, GSM8K 44.4%, MATH 30.6%. As a 1B lightweight model, Logic (35) and Math (30) reflect sub-50% benchmarks. Coding (25) based on 0.6 index. Speed (95) is maximized for this ultra-small tier.
StepFun: Step 3.5 Flash | $6.60 | 70 | 88 | 83 | 95 | 95 | 262K | 87 | SWE-bench Verified 74.4% maps to 88 coding; AIME 99.8% maps to 95 math. Despite Flash tier, explicit evidence shows frontier-level SWE-bench Verified, elevating coding score. Speed is 95 (143 tok/s).
Mistral: Ministral 3 14B 2512 | $10.00 | 58 | 60 | 72 | 95 | 75 | 262K | 70 | GPQA Diamond 58.6% maps to Logic 65; IFEval 77.3% to Instruction 78; MMMU 55.3% to Multimodal 60. As a 14B lightweight tier, Coding is inferred at 60. Speed is 95 based on reported 2512 tokens/s.
OpenAI: GPT-5.4 Mini | $75.00 | 55 | 70 | 80 | 95 | 80 | 400K | 78 | No exact benchmarks for GPT-5.4 Mini in evidence. Inferred coding (70) and logic (75) based on its 'Mini' tier status and reasoning capabilities. Speed (95) reflects high-throughput optimization. Vision price estimated from $0.75/1M input cost.
Gemini 2.0 Flash (001) | $8.00 | 59 | 65 | 73 | 95 | 70 | 1.0M | 70 | MMLU 76.4%, MMMU 71.7%, and MATH 53.2% map to Logic 76, Multimodal 72, and Math 70. As a Flash-tier model, Coding (65) is adjusted lower than heavyweights, while Speed (95) reflects its highly optimized latency.
Anthropic Claude Haiku Latest | $90.00 | 56 | 88 | 83 | 95 | 75 | 200K | 82 | SWE-bench Verified at 73.3% maps to 88 coding. MATH at 69.4% maps to 75 math. As a lightweight tier, it excels in speed (95) while native reasoning boosts its logic (80) to near-frontier levels.
Morph: Morph V3 Large | $55.00 | 48 | 80 | 65 | 95 | 50 | 262K | 65 | No standard benchmarks available. Evidence cites 98% accuracy for code transformations and ~4,500 tok/s. Mapped coding to 80 for specialized code-edit focus, speed to 95 for extreme throughput. Logic/math inferred cautiously due to missing data.
Google: Gemma 3 4B | $3.00 | 64 | 45 | 68 | 95 | 75 | 131K | 64 | Based on IFEval (90.2%) mapped to 90 Instruction, MATH (75.6%) to 75 Math, and MMLU-Pro (43.6%) to 45 Logic. As a 4B lightweight tier, Coding (MBPP 63.2%) maps to 45, while Speed is rated 95.
Morph: Morph V3 Fast | $44.00 | 45 | 80 | 55 | 95 | 40 | 82K | 58 | Evidence lacks standard SWE-bench/GPQA scores, citing only 96% accuracy for rapid code transformations. Mapped coding to 80 for specialized apply tasks. As a 'Fast' tier model, speed is rated 95 (10,500 tok/s claimed), with logic/math inferred lower.
Google: Gemini 3 Flash Preview | $50.00 | 61 | 88 | 89 | 95 | 85 | 1.0M | 88 | GPQA Diamond 90.4% maps to Logic 92; SWE-bench 78% maps to Coding 88. As a Flash-tier model, Speed is rated very high (95). Multimodal inferred at 80 due to extensive video/audio/image support.
Anthropic: Claude 3.5 Haiku | $72.00 | 52 | 75 | 73 | 95 | 75 | 200K | 74 | SWE-bench Verified 40.6% (Coding ~75), MMLU-Pro 65% (Logic ~70), MATH 69.4% (Math ~75). As a lightweight Haiku tier, it scores lower than flagships on reasoning but achieves exceptional speed.
Meta: Llama 3.2 3B Instruct | $5.39 | 51 | 35 | 55 | 95 | 55 | 131K | 50 | Lightweight 3B tier. Mapped MMLU 63.4% to Logic 40, IFEval 73.9% to Instruction 70, and GSM8K 77.7% to Math 55. Coding inferred low (35) lacking SWE-bench. Speed rated 95 for 3B size.
StepFun: Step 3.7 Flash | $19.50 | 60 | 88 | 75 | 95 | 85 | 256K | 81 | SWE-bench Verified at 74.4% maps to 88 coding. AIME 2025 win implies strong math (85). Speed is 143 tok/s (95). As a Flash tier, logic/instruction are estimated ~75 despite high coding/math peaks.
Google: Nano Banana (Gemini 2.5 Flash Image) | $37.00 | 37 | 40 | 45 | 95 | 40 | 33K | 43 | Evidence explicitly states 'Benchmark not available' for MMLU/MMMU/GSM8K. Inferred Flash-tier baseline scores (40-50). Speed scored 95 due to 172 tok/s throughput. Multimodal inferred at 70 for a lightweight image model.
OpenAI: GPT-4o-mini (2024-07-18) | $12.00 | 58 | 70 | 73 | 95 | 75 | 128K | 73 | Evidence lacks raw benchmarks. Scores inferred cautiously from the 'Mini' lightweight tier profile. Speed is heavily weighted (95), while coding (70) and logic (70) are adjusted downward to reflect its distilled nature compared to flagship models.
Amazon: Nova Micro 1.0 | $2.80 | 66 | 68 | 63 | 95 | 75 | 128K | 67 | HumanEval 81.1% (Coding 68), GPQA 40% (Logic 45), IFEval 87.2% (Instruction 80), GSM8K 92.3% (Math 75). As a 'Micro' tier model, speed is rated very high (95) while coding and logic reflect its lightweight, text-only nature.
OpenAI: GPT-5 Nano | $6.00 | 58 | 60 | 65 | 95 | 65 | 400K | 64 | No exact GPT-5 Nano scores provided; inferred from predecessor GPT-4.1 Nano (GPQA 50.3%, MMLU 80.1%). Mapped Logic to 60, Coding to 60. As a Nano tier, it prioritizes speed (100 tok/s -> 95) over heavyweight reasoning.
OpenAI: gpt-oss-120b | $3.36 | 70 | 70 | 80 | 95 | 75 | 131K | 76 | HumanEval 71% maps to 70 coding. MMLU 66-90% maps to 80 logic. GSM8K 75% maps to 75 math. 500 tok/s throughput maps to 95 speed. Native reasoning supported via OpenRouter reasoning parameter.
Google: Gemini 2.0 Flash Lite | $6.00 | 58 | 65 | 65 | 95 | 65 | 1.0M | 65 | Evidence lacks exact percentages but confirms 2.0 Flash-Lite outperforms 1.5 Flash and trails 2.0 Flash on GPQA, MATH, and MMMU. Scores inferred cautiously for this lightweight tier, prioritizing its high speed and lower reasoning/coding capabilities.
OpenAI: GPT-4.1 Mini | $32.00 | 56 | 68 | 82 | 95 | 75 | 1.0M | 77 | SWE-bench Verified 23.6% (mapped to 68), GPQA Diamond 65% (mapped to 80). As a Mini tier, speed is rated high (95) while coding/logic reflect its lightweight nature compared to flagship models.
OpenAI: GPT-4.1 Nano | $8.00 | 56 | 65 | 65 | 95 | 65 | 1.0M | 65 | GPQA 50.3% and MMLU 80.1% map to ~60 logic. HumanEval 86.6% and Aider 9.8% map to ~65 coding. As a 'Nano' lightweight tier, it prioritizes speed (~95) over flagship reasoning.
Google: Gemini 2.5 Flash | $37.00 | 56 | 70 | 78 | 92 | 85 | 1.0M | 78 | GPQA Diamond 78.3% (Logic 80), LiveCodeBench 63.5% (Coding 70), MMMU 76.7%. As a Flash-tier model, it excels in speed (93 tok/s) and math (AIME 78%), but trails Pro in heavy coding.
MiniMax: MiniMax M2.1 | $21.10 | 59 | 75 | 83 | 92 | 75 | 205K | 79 | Multi-SWE-Bench (49.4%) maps to Coding 75; MMLU-Pro (88.0%) maps to Logic 85. As a 10B lightweight model, speed is heavily weighted (92), while coding and logic reflect its size class despite strong benchmark claims.

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files and start the download—and to email you a copy if the site owner enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Speed).


Agency accelerator

Whitelabel Speed Leaderboard for your site

Embed the interactive speed view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*

How it works

Methodology: How we rank Speed LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

How we infer speed class from benchmarks and model tiers

Our speed rankings synthesize throughput (Tokens Per Second) and Time To First Token (TTFT) metrics derived from benchmark metadata and model tier classifications (e.g., Flash, Haiku, Mini). We prioritize models that can sustain interactive-class latency under production loads, allowing product teams to build snappy, real-time user experiences without sacrificing core conversational quality.
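
As a rough illustration of how the two signals combine, the time to a complete streamed response can be approximated as TTFT plus output length divided by throughput. The sketch below assumes steady generation after the first token and ignores network time; the function name and the example figures in the usage block are hypothetical, not values from the leaderboard.

```python
def estimated_response_seconds(
    ttft_s: float,
    tokens_per_second: float,
    output_tokens: int,
) -> float:
    """Approximate wall-clock time until the last token arrives.

    Assumes throughput is constant once generation starts, which real
    endpoints only approximate under load.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if ttft_s < 0 or output_tokens < 0:
        raise ValueError("ttft_s and output_tokens must be non-negative")
    return ttft_s + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical comparison: a low-TTFT model with modest throughput versus
    # a high-throughput model that takes longer to start.
    for name, ttft, tps in [("quick-start", 0.2, 80.0), ("high-throughput", 0.8, 400.0)]:
        for n_tokens in (20, 400):
            total = estimated_response_seconds(ttft, tps, n_tokens)
            print(f"{name:16s} {n_tokens:4d} tokens -> {total:.2f}s")
```

Short replies are dominated by TTFT and long replies by throughput, which is why a single speed score is best used for shortlisting rather than as a final answer.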

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Trading speed for quality on customer-facing chat

A blazing fast model that fails quality checks still loses users. Use speed to narrow candidates, then validate on your own p95 traces from regional endpoints. Canadian and Australian operators often measure from local POPs; US teams frequently split traffic between a latency-optimized small model and a larger fallback.
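
One way to run that validation is to compute p95 over latency samples collected from your own traces and compare it to a budget. The sketch below uses the nearest-rank percentile definition; the function names, the sample values, and the budget are illustrative assumptions rather than recommended thresholds.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_budget(samples: Sequence[float], budget_ms: float, pct: int = 95) -> bool:
    """True if the chosen percentile of observed latency fits the budget."""
    return percentile(samples, pct) <= budget_ms


if __name__ == "__main__":
    # Hypothetical end-to-end latencies in milliseconds from one regional endpoint.
    traces_ms = [310, 295, 330, 1250, 305, 290, 340, 315, 300, 325]
    print("p95 (ms):", percentile(traces_ms, 95))
    print("fits 800 ms budget:", within_budget(traces_ms, 800))
```

A single slow outlier can dominate p95 on small samples, so collect enough traces per region before using the result to decide between a small model and a larger fallback.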

Production deployment

Real-Time & Voice Applications

How teams in the US, Canada, and Australia deploy these models in production.

Voice agents, real-time autocomplete, and live translation

When building voice-to-voice AI agents or inline code autocomplete, Time To First Token (TTFT) and generation speed are the metrics that matter most once a model clears your quality bar. Developers in the US and Canada use this latency-focused leaderboard to select models that can sustain >80 tokens per second, so that conversational interfaces feel human and autocomplete suggestions appear without a noticeable delay.
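
Both metrics can be measured directly from a streaming response. The following is a minimal sketch that works on any iterable of tokens or chunks; it does not call a real provider SDK, and `measure_stream`, `sustains_target`, and the injected `clock` parameter are hypothetical names introduced here so the timing logic can be tested without a network.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report TTFT and decode throughput.

    Throughput is measured from the first token to the last, so it excludes
    the TTFT wait and reflects generation speed only.
    """
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now
        last_at = now
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    decode_time = last_at - first_at
    # With a single token there is no interval to measure.
    tps = (count - 1) / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": first_at - start, "tokens_per_second": tps, "tokens": count}


def sustains_target(stats: Dict[str, float], min_tokens_per_second: float = 80.0) -> bool:
    """Check measured throughput against the >80 tokens/sec target mentioned above."""
    return stats["tokens_per_second"] > min_tokens_per_second


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a provider's streaming iterator (assumption).
        time.sleep(0.05)
        for word in "hello from a simulated stream".split():
            time.sleep(0.005)
            yield word

    stats = measure_stream(fake_stream())
    print(stats, "meets target:", sustains_target(stats))
```

If a provider streams multi-token chunks, count tokens per chunk rather than chunks, otherwise the throughput figure will be understated.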

Architecture

Latency Reduction Strategies

Strategies to reduce response latency without sacrificing capability.

Geographic POP selection and streaming optimization

Model size isn't the only factor in speed. To optimize latency, teams must stream responses to the client, minimize input payload size, and select API providers with geographic Points of Presence (POPs) close to their users. For teams in Australia, choosing a provider with local Sydney endpoints often yields better real-world speed than picking a theoretically faster model hosted in the US.
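
The Sydney example can be made concrete by adding network time to the model's own TTFT and comparing what the user actually waits for. This is a sketch under simplifying assumptions: one network round trip on a warm connection, and entirely made-up endpoint names and timings. Measure your own values before drawing conclusions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint profile; all numbers are supplied by the caller."""
    name: str
    network_rtt_s: float   # round trip between the user and the endpoint
    model_ttft_s: float    # server-side time to first token


def time_to_first_visible_token(endpoint: Endpoint) -> float:
    """What the user waits before streamed text starts to appear."""
    return endpoint.network_rtt_s + endpoint.model_ttft_s


def pick_endpoint(endpoints: Iterable[Endpoint]) -> Endpoint:
    """Choose the endpoint with the lowest perceived time to first token."""
    candidates = list(endpoints)
    if not candidates:
        raise ValueError("no endpoints to compare")
    return min(candidates, key=time_to_first_visible_token)


if __name__ == "__main__":
    # Illustrative figures for a user in Australia (assumptions, not measurements).
    options = [
        Endpoint("slower-model-sydney", network_rtt_s=0.02, model_ttft_s=0.35),
        Endpoint("faster-model-us", network_rtt_s=0.20, model_ttft_s=0.25),
    ]
    best = pick_endpoint(options)
    print(best.name, f"{time_to_first_visible_token(best):.2f}s")
```

Cold connections add further round trips for connection setup, which widens the gap in favor of the nearby endpoint.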

Embed-ready

Need this live Speed data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Speed leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.


Your visitors compare Speed models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

The 'fastest' model depends heavily on geographic routing and payload size. Models in the 'Flash' or '8B' parameter class typically offer the highest throughput. However, if your users are in Australia, a slightly slower model hosted in a Sydney data center will often feel faster than a theoretically faster model hosted in the US due to network latency.