Interactive leaderboard

Fastest LLM APIs 2026: Low-Latency AI Models Compared

Rank the fastest LLM APIs in 2026 using throughput and TTFT latency signals with pricing context. Shortlist responsive AI models for real-time apps in the US, Canada, and Australia.

Speed-ranked LLMs with API cost on the same canvas in 2026

Speed scores reflect interactive-class behavior: smaller fast tiers vs. heavy flagships, grounded in benchmark metadata and tier cues—not a single vendor’s marketing latency claim. Product teams across the United States, Canada, and Australia use this tab to protect UX on chat surfaces while still eyeballing what responsiveness costs at production token volumes.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.


Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply.


Deep Reasoning / Thinking Mode

When enabled, the model's hidden reasoning / extended-thinking tokens are charged like output tokens.


Batch Pricing

Enable for 50% off input & output tokens where batch/async pricing applies.

Workload sliders: 8K (range 1K–1.0M) ≈ $100.00/mo · 2K (range 100–500K) ≈ $100.00/mo · 5K (range 10–100K) ≈ $200.00 total

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.

Magic quadrant (top 15)

X: est. monthly · Y: Speed · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 327 models.

| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiquidAI: LFM2-24B-A2B | $2.40 | 43 | 0 | 44 | 97 | 0 | 33K | 22 |
| Relace: Relace Search | $70.00 | 52 | 85 | 73 | 95 | 65 | 256K | 74 |
| DeepSeek: DeepSeek V4 Flash | $8.40 | 67 | 90 | 83 | 95 | 80 | 1.0M | 84 |
| Amazon: Nova Micro 1.0 | $2.80 | 72 | 69 | 82 | 95 | 69 | 128K | 76 |
| xAI: Grok 3 Mini Beta | $17.00 | 63 | 90 | 87 | 95 | 77 | 131K | 85 |
| Inception: Mercury 2 | $17.50 | 51 | 67 | 73 | 95 | 38 | 128K | 63 |
| AionLabs: Aion-1.0-Mini | $42.00 | 61 | 85 | 85 | 95 | 88 | 131K | 86 |
| xAI: Grok 3 Mini | $17.00 | 63 | 88 | 86 | 92 | 76 | 131K | 84 |
| Google: Gemini 3.1 Flash Lite Preview | $25.00 | 24 | 0 | 39 | 90 | 0 | 1.0M | 19 |
| Qwen: Qwen3.5-Flash | $5.20 | 23 | 0 | 0 | 90 | 0 | 1.0M | 0 |
| Mistral: Mistral Small Creative | $7.00 | 66 | 88 | 83 | 90 | 69 | 33K | 81 |
| Z.ai: GLM 4.7 Flash | $6.40 | 66 | 85 | 79 | 90 | 75 | 203K | 80 |
| Morph: Morph V3 Fast | $44.00 | 57 | 96 | 78 | 90 | 70 | 82K | 80 |
| Mistral: Mistral 7B Instruct v0.1 | $6.30 | 65 | 85 | 79 | 90 | 70 | 3K | 78 |
| StepFun: Step 3.5 Flash | $7.00 | 62 | 81 | 73 | 90 | 71 | 262K | 74 |
| Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview) | $50.00 | 11 | 0 | 0 | 90 | 0 | 66K | 0 |
| OpenAI: GPT-5.2 Chat | $210.00 | 54 | 85 | 83 | 90 | 73 | 128K | 81 |
| Mancer: Weaver (alpha) | $40.00 | 12 | 0 | 0 | 85 | 0 | 8K | 0 |
| ByteDance Seed: Seed-2.0-Mini | $8.00 | 60 | 85 | 66 | 85 | 70 | 262K | 72 |
| MiniMax: MiniMax M2-her | $24.00 | 14 | 0 | 0 | 85 | 0 | 66K | 0 |
| OpenAI: gpt-oss-20b | $2.60 | 84 | 96 | 97 | 85 | 98 | 131K | 97 |
| Free Models Router | Free | 59 | 43 | 55 | 85 | 55 | 200K | 52 |
| MiniMax: MiniMax M2.5 | $17.50 | 60 | 80 | 83 | 85 | 72 | 197K | 79 |
| Mistral: Ministral 3 3B 2512 | $5.00 | 31 | 0 | 28 | 85 | 0 | 131K | 14 |
| IBM: Granite 4.0 Micro | $1.78 | 78 | 81 | 76 | 85 | 81 | 131K | 78 |
| Mistral: Voxtral Small 24B 2507 | $7.00 | 20 | 0 | 0 | 85 | 0 | 32K | 0 |
| Arcee AI: Trinity Mini | $3.30 | 73 | 82 | 80 | 85 | 80 | 131K | 81 |
| xAI: Grok 4.1 Fast | $13.00 | 50 | 39 | 81 | 85 | 34 | 2.0M | 59 |
| Xiaomi: MiMo-V2-Flash | $6.50 | 56 | 65 | 60 | 85 | 63 | 262K | 62 |
| Writer: Palmyra X5 | $84.00 | 49 | 65 | 70 | 85 | 70 | 1.0M | 69 |
| ByteDance Seed: Seed 1.6 Flash | $6.00 | 66 | 87 | 76 | 85 | 77 | 262K | 79 |
| Xiaomi: MiMo-V2.5 | $36.00 | 57 | 89 | 75 | 85 | 75 | 1.0M | 78 |
| Pareto Code Router | VARIABLE | 74 | 88 | 73 | 85 | 80 | 200K | 78 |
| MiniMax: MiniMax M2 | $20.20 | 50 | 70 | 63 | 85 | 55 | 197K | 63 |
| Mistral: Ministral 3 8B 2512 | $7.50 | 40 | 0 | 38 | 85 | 67 | 262K | 36 |
| Mistral: Codestral 2508 | $21.00 | 24 | 70 | 0 | 85 | 0 | 256K | 18 |
| OpenAI: GPT-5 Mini | $30.00 | 58 | 90 | 78 | 85 | 74 | 400K | 80 |
| Z.ai: GLM 4.7 | $32.60 | 64 | 94 | 87 | 85 | 96 | 203K | 91 |
| NVIDIA: Nemotron Nano 9B V2 | $3.20 | 75 | 72 | 84 | 85 | 98 | 131K | 85 |
| Z.ai: GLM 4.5 Air | $13.70 | 51 | 24 | 71 | 85 | 81 | 131K | 62 |
| Switchpoint Router | VARIABLE | 30 | 0 | 0 | 85 | 0 | 131K | 0 |
| Mistral: Devstral Small 1.1 | $7.00 | 50 | 54 | 53 | 85 | 53 | 131K | 53 |
| Mistral: Mistral Small 3.2 24B | $5.00 | 69 | 92 | 83 | 85 | 69 | 128K | 82 |
| OpenAI: GPT-5.1-Codex-Mini | $30.00 | 61 | 84 | 82 | 85 | 92 | 400K | 85 |
| OpenAI: GPT-5.1 Chat | $150.00 | 58 | 91 | 88 | 85 | 77 | 128K | 86 |
| EssentialAI: Rnj 1 Instruct | $7.50 | 66 | 75 | 81 | 85 | 89 | 33K | 81 |
| TheDrummer: Skyfall 36B V2 | $30.00 | 13 | 0 | 0 | 85 | 0 | 33K | 0 |
| Meta: Llama 3 8B Instruct | $1.60 | 68 | 62 | 69 | 85 | 30 | 8K | 58 |

Need a shareable artifact?

Download a print-ready PDF from the leaderboard and workload above. No email step—lead capture is off.

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Speed).

Instant setup
No CC required


Agency accelerator

Whitelabel Speed Leaderboard for your site

Embed the interactive speed view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*

NO CREDIT CARD REQUIRED

How it works

Methodology: How we rank Speed LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

How we infer speed class from benchmarks and model tiers

Our speed rankings synthesize throughput (tokens per second, TPS) and time to first token (TTFT) metrics derived from benchmark metadata and model tier classifications (e.g., Flash, Haiku, Mini). We prioritize models that sustain interactive-class latency under production loads, so product teams can build snappy, real-time user experiences without sacrificing core conversational quality.
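For intuition, a score in that spirit can be sketched as a weighted blend of normalized throughput and TTFT. Every constant here (the 150 tok/s throughput ceiling, the 2,000 ms TTFT cutoff, the 60/40 weights) is an illustrative assumption, not the production formula, which also folds in tier cues:

```python
def infer_speed_score(tokens_per_sec: float, ttft_ms: float) -> int:
    """Blend normalized throughput and TTFT into a 0-100 speed score.

    Assumed constants for the sketch: throughput saturates at 150 tok/s,
    a TTFT of 2,000 ms or worse scores zero, weights are 60/40.
    """
    tps_score = min(tokens_per_sec / 150.0, 1.0)
    ttft_score = max(0.0, min(1.0, (2000.0 - ttft_ms) / 1900.0))
    return round(100 * (0.6 * tps_score + 0.4 * ttft_score))

# A model at 120 tok/s with 300 ms TTFT lands in the high-80s under
# these assumed weights; a slow-start flagship scores far lower.
```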

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Open Battle Arena · Up to 4 models · Live estimates
Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Trading speed for quality on customer-facing chat

A blazing fast model that fails quality checks still loses users. Use speed to narrow candidates, then validate on your own p95 traces from regional endpoints. Canadian and Australian operators often measure from local POPs; US teams frequently split traffic between a latency-optimized small model and a larger fallback.
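If your trace store already exports per-request latencies, the p95 check is a few lines. A nearest-rank sketch (the helper name is ours, not a LeadsCalc API):

```python
import math

def p95_ms(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest sample covering 95% of requests."""
    ordered = sorted(samples_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

# Feed it the same traffic replayed against each candidate's regional
# endpoint, rather than trusting a provider's global median.
```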

Production deployment

Real-Time & Voice Applications

How teams in the US, Canada, and Australia deploy these models in production.

Voice agents, real-time autocomplete, and live translation

When building voice-to-voice AI agents or inline code autocomplete, Time To First Token (TTFT) and generation speed are the only metrics that matter. Developers in the US and Canada use this latency-focused leaderboard to select models that can sustain >80 tokens per second, ensuring conversational interfaces feel human and autocomplete suggestions appear instantly.
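Both metrics are easy to capture client-side from any streaming response. A generic sketch that works on any iterable of text chunks; in production the iterable would be a provider's streaming response, and the whitespace split is a deliberate stand-in for real tokenization:

```python
import time

def measure_stream(chunks):
    """Return (ttft_seconds, tokens_per_sec) for an iterable of text chunks."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += len(chunk.split())          # crude token count
    elapsed = time.perf_counter() - start
    return ttft, (n_tokens / elapsed if elapsed > 0 else 0.0)
```

Run it against each shortlisted model from the region your users are in; the same model can clear 80 tok/s from one POP and miss it from another.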

Architecture

Latency Reduction Strategies

Strategies to reduce monthly API spend without sacrificing capability.

Geographic POP selection and streaming optimization

Model size isn't the only factor in speed. To optimize latency, teams must stream responses to the client, minimize input payload size, and select API providers with geographic Points of Presence (POPs) close to their users. For teams in Australia, choosing a provider with local Sydney endpoints often yields better real-world speed than picking a theoretically faster model hosted in the US.

Embed-ready

Need this live Speed data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Speed leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this Tool · White-label · No code required
United States · Canada · Australia
Live preview

Your visitors compare Speed models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

The 'fastest' model depends heavily on geographic routing and payload size. Models in the 'Flash' or '8B' parameter class typically offer the highest throughput. However, if your users are in Australia, a slightly slower model hosted in a Sydney data center will often feel faster than a theoretically faster model hosted in the US due to network latency.