Why is Time To First Token (TTFT) so important?

TTFT measures how long it takes for the model to start streaming its response. For voice AI agents or inline code autocomplete, a TTFT above 500ms feels sluggish and breaks the illusion of real-time interaction. Use this speed leaderboard to shortlist models optimized for low-latency streaming.

Does enabling Batch API discounts affect model speed?

Yes, drastically. Batch APIs (which often offer a 50% discount) are asynchronous and typically guarantee turnaround within 24 hours. They are designed for offline data processing, not real-time applications. If speed is your priority, you must use standard synchronous endpoints and pay the list price.

Interactive leaderboard

Fastest LLM APIs 2026: Low-Latency AI Models Compared

Rank the fastest LLM APIs in 2026 using throughput and TTFT latency signals with pricing context. Shortlist responsive AI models for real-time apps in the US, Canada, and Australia.

Speed-ranked LLMs with API cost on the same canvas in 2026

Speed scores estimate interactive responsiveness, separating small, fast tiers from heavy flagships. They are grounded in benchmark metadata and tier cues rather than any single vendor's marketing latency claim. Product teams across the United States, Canada, and Australia use this tab to protect UX on chat surfaces while still seeing what responsiveness costs at production token volumes.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off by default: cost estimates exclude image fees for vision-capable models. Turn on to include image fees.

Use Cached Pricing

Enable to apply 50% off input tokens where cached rates apply.

Deep Reasoning / Thinking Mode

When enabled, hidden reasoning (extended thinking) tokens are charged like output tokens.

Batch Pricing

Enable to apply 50% off input and output tokens where batch/async pricing applies.

Input Tokens: ≈ $100.00/mo (slider range: 1K to 1.0M)

Output Tokens: ≈ $100.00/mo (slider range: 100 to 500K)

Monthly API Requests: ≈ $200.00 total (slider range: 10 to 100K)

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.
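The toggle math above can be sketched in a few lines. This is a minimal illustration, not the calculator's actual implementation: the `Pricing` class, the function name, and the per-million-token prices are hypothetical, and stacking the cached and batch discounts multiplicatively is an assumption, since providers differ on whether the two combine.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical list prices in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_monthly_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    requests: int,
    reasoning_tokens: int = 0,
    use_cached: bool = False,
    use_batch: bool = False,
) -> float:
    """Estimate monthly spend from a per-request token profile."""
    input_rate = pricing.input_per_mtok
    output_rate = pricing.output_per_mtok
    if use_cached:
        input_rate *= 0.5  # 50% off input tokens where cached rates apply
    if use_batch:
        # 50% off input and output; stacking with the cached discount
        # is an assumption made for this sketch.
        input_rate *= 0.5
        output_rate *= 0.5
    # Hidden reasoning tokens are charged like output tokens.
    billed_output = output_tokens + reasoning_tokens
    total_input_cost = input_tokens * requests * input_rate
    total_output_cost = billed_output * requests * output_rate
    return (total_input_cost + total_output_cost) / 1_000_000


# Hypothetical model priced at $1 input / $4 output per million tokens.
example = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)
baseline = estimate_monthly_cost(example, 10_000, 2_500, 10_000)
batched = estimate_monthly_cost(example, 10_000, 2_500, 10_000, use_batch=True)
print(f"list price: ${baseline:.2f}/mo, batch: ${batched:.2f}/mo")
```

With these made-up prices, 10,000 requests of 10,000 input and 2,500 output tokens each split evenly into $100 of input and $100 of output spend, and the batch toggle halves the total.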

Magic quadrant (top 15)

X: est. monthly · Y: Speed · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 327 models.

Model	Est. monthly	ROI score	Coding	Reasoning	Speed	Math	Context	Overall
LiquidAI: LFM2-24B-A2B	$2.40	43	0	44	97	0	33K	22
Relace: Relace Search	$70.00	52	85	73	95	65	256K	74
DeepSeek: DeepSeek V4 Flash	$8.40	67	90	83	95	80	1.0M	84
Amazon: Nova Micro 1.0	$2.80	72	69	82	95	69	128K	76
xAI: Grok 3 Mini Beta	$17.00	63	90	87	95	77	131K	85
Inception: Mercury 2	$17.50	51	67	73	95	38	128K	63
AionLabs: Aion-1.0-Mini	$42.00	61	85	85	95	88	131K	86
xAI: Grok 3 Mini	$17.00	63	88	86	92	76	131K	84
Google: Gemini 3.1 Flash Lite Preview	$25.00	24	0	39	90	0	1.0M	19
Qwen: Qwen3.5-Flash	$5.20	23	0	0	90	0	1.0M	0
Mistral: Mistral Small Creative	$7.00	66	88	83	90	69	33K	81
Z.ai: GLM 4.7 Flash	$6.40	66	85	79	90	75	203K	80
Morph: Morph V3 Fast	$44.00	57	96	78	90	70	82K	80
Mistral: Mistral 7B Instruct v0.1	$6.30	65	85	79	90	70	3K	78
StepFun: Step 3.5 Flash	$7.00	62	81	73	90	71	262K	74
Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview)	$50.00	11	0	0	90	0	66K	0
OpenAI: GPT-5.2 Chat	$210.00	54	85	83	90	73	128K	81
Mancer: Weaver (alpha)	$40.00	12	0	0	85	0	8K	0
ByteDance Seed: Seed-2.0-Mini	$8.00	60	85	66	85	70	262K	72
MiniMax: MiniMax M2-her	$24.00	14	0	0	85	0	66K	0
OpenAI: gpt-oss-20b	$2.60	84	96	97	85	98	131K	97
Free Models Router	Free	59	43	55	85	55	200K	52
MiniMax: MiniMax M2.5	$17.50	60	80	83	85	72	197K	79
Mistral: Ministral 3 3B 2512	$5.00	31	0	28	85	0	131K	14
IBM: Granite 4.0 Micro	$1.78	78	81	76	85	81	131K	78
Mistral: Voxtral Small 24B 2507	$7.00	20	0	0	85	0	32K	0
Arcee AI: Trinity Mini	$3.30	73	82	80	85	80	131K	81
xAI: Grok 4.1 Fast	$13.00	50	39	81	85	34	2.0M	59
Xiaomi: MiMo-V2-Flash	$6.50	56	65	60	85	63	262K	62
Writer: Palmyra X5	$84.00	49	65	70	85	70	1.0M	69
ByteDance Seed: Seed 1.6 Flash	$6.00	66	87	76	85	77	262K	79
Xiaomi: MiMo-V2.5	$36.00	57	89	75	85	75	1.0M	78
Pareto Code Router	VARIABLE	74	88	73	85	80	200K	78
MiniMax: MiniMax M2	$20.20	50	70	63	85	55	197K	63
Mistral: Ministral 3 8B 2512	$7.50	40	0	38	85	67	262K	36
Mistral: Codestral 2508	$21.00	24	70	0	85	0	256K	18
OpenAI: GPT-5 Mini	$30.00	58	90	78	85	74	400K	80
Z.ai: GLM 4.7	$32.60	64	94	87	85	96	203K	91
NVIDIA: Nemotron Nano 9B V2	$3.20	75	72	84	85	98	131K	85
Z.ai: GLM 4.5 Air	$13.70	51	24	71	85	81	131K	62
Switchpoint Router	VARIABLE	30	0	0	85	0	131K	0
Mistral: Devstral Small 1.1	$7.00	50	54	53	85	53	131K	53
Mistral: Mistral Small 3.2 24B	$5.00	69	92	83	85	69	128K	82
OpenAI: GPT-5.1-Codex-Mini	$30.00	61	84	82	85	92	400K	85
OpenAI: GPT-5.1 Chat	$150.00	58	91	88	85	77	128K	86
EssentialAI: Rnj 1 Instruct	$7.50	66	75	81	85	89	33K	81
TheDrummer: Skyfall 36B V2	$30.00	13	0	0	85	0	33K	0
Meta: Llama 3 8B Instruct	$1.60	68	62	69	85	30	8K	58
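One way to use the table is to filter on Speed first, then on quality and budget. The sketch below copies six rows from the leaderboard above; the `shortlist` helper and its thresholds are illustrative assumptions, not part of the leaderboard itself. Rows with an Overall score of 0 appear to lack benchmark data, so a minimum Overall filter drops them.

```python
# (model, est. monthly USD, Speed, Overall), copied from the table above.
ROWS = [
    ("DeepSeek: DeepSeek V4 Flash", 8.40, 95, 84),
    ("Amazon: Nova Micro 1.0", 2.80, 95, 76),
    ("LiquidAI: LFM2-24B-A2B", 2.40, 97, 22),
    ("OpenAI: gpt-oss-20b", 2.60, 85, 97),
    ("OpenAI: GPT-5.2 Chat", 210.00, 90, 81),
    ("Qwen: Qwen3.5-Flash", 5.20, 90, 0),
]


def shortlist(rows, min_speed, min_overall, max_monthly):
    """Keep rows that clear every threshold, fastest first, then cheapest."""
    keep = [
        row for row in rows
        if row[2] >= min_speed and row[3] >= min_overall and row[1] <= max_monthly
    ]
    return sorted(keep, key=lambda row: (-row[2], row[1]))


for model, monthly, speed, overall in shortlist(ROWS, 90, 70, 50.0):
    print(f"{model}: speed {speed}, overall {overall}, ${monthly:.2f}/mo")
```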

Need a shareable artifact?

Download a print-ready PDF from the leaderboard and workload above. No email step—lead capture is off.

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Speed).

Agency accelerator

Whitelabel Speed Leaderboard
for your site

Embed the interactive speed view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync

Custom branding

Branded reports

Lead analytics

Free to start

$0/mo*

How it works

Methodology: How we rank Speed LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

How we infer speed class from benchmarks and model tiers

Our speed rankings synthesize throughput (Tokens Per Second) and Time To First Token (TTFT) metrics derived from benchmark metadata and model tier classifications (e.g., Flash, Haiku, Mini). We prioritize models that can sustain interactive-class latency under production loads, allowing product teams to build snappy, real-time user experiences without sacrificing core conversational quality.

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Trading speed for quality on customer-facing chat

A blazing-fast model that fails quality checks still loses users. Use speed to narrow candidates, then validate against your own p95 latency traces from regional endpoints. Canadian and Australian operators often measure from local POPs; US teams frequently split traffic between a latency-optimized small model and a larger fallback.
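For teams that already log per-request latency, a p95 check per region can be sketched as below. The region names and millisecond samples are made up for illustration, and the nearest-rank method is one common percentile definition among several.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latency traces in milliseconds, keyed by regional endpoint.
traces_ms = {
    "sydney": [180, 210, 190, 640, 200],
    "us-east": [410, 430, 420, 450, 440],
}

for region, samples in traces_ms.items():
    print(f"{region}: p50={percentile(samples, 50)} ms, p95={percentile(samples, 95)} ms")
```

In this made-up data the Sydney endpoint has the better median but the worse tail, which is the kind of gap an average would hide.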

Production deployment

Real-Time & Voice Applications

How teams in the US, Canada, and Australia deploy these models in production.

Voice agents, real-time autocomplete, and live translation

When building voice-to-voice AI agents or inline code autocomplete, Time To First Token (TTFT) and generation speed are the metrics that matter most once a model clears your quality bar. Developers in the US and Canada use this latency-focused leaderboard to select models that can sustain more than 80 tokens per second, so conversational interfaces feel human and autocomplete suggestions appear without noticeable delay.
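Both metrics can be measured on any streaming response. The sketch below is provider-agnostic: `measure_stream` is a hypothetical helper that accepts any iterable of streamed chunks, and it assumes one chunk equals one token, which real SDKs do not guarantee. The `clock` parameter exists only so the timing can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, chunk_count) for a stream.

    Assumes each chunk is one token; adjust the count if your SDK
    packs several tokens into a chunk.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last  # arrival time of the first chunk
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = first - start
    generation_time = last - first
    # Throughput is measured after the first token, so TTFT does not skew it.
    tps = (count - 1) / generation_time if generation_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n: int, first_delay: float, gap: float):
    """Simulated stream standing in for a real streaming API response."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield "tok"


ttft, tps, n = measure_stream(fake_stream(10, first_delay=0.05, gap=0.005))
print(f"TTFT {ttft * 1000:.0f} ms, {tps:.0f} tokens/s over {n} chunks")
```

Run it against real traffic from the regions your users are in, then compare the results to the 500ms TTFT and 80 tokens per second thresholds mentioned on this page.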

Architecture

Latency Reduction Strategies

Strategies to reduce end-to-end latency without sacrificing capability.

Geographic POP selection and streaming optimization

Model size isn't the only factor in speed. To optimize latency, teams must stream responses to the client, minimize input payload size, and select API providers with geographic Points of Presence (POPs) close to their users. For teams in Australia, choosing a provider with local Sydney endpoints often yields better real-world speed than picking a theoretically faster model hosted in the US.
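The Sydney example comes down to simple arithmetic: what the user perceives is roughly network round-trip time plus the model's own TTFT. The sketch below uses made-up numbers and a hypothetical `Deployment` type, and it ignores connection setup, which usually adds further round trips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    model: str
    region: str
    ttft_ms: float  # server-side time to first token
    rtt_ms: float   # network round trip from the user's location

    @property
    def perceived_ttft_ms(self) -> float:
        """Simplified model: one round trip plus server-side TTFT."""
        return self.rtt_ms + self.ttft_ms


def pick_fastest(options):
    """Choose the deployment with the lowest perceived TTFT."""
    return min(options, key=lambda d: d.perceived_ttft_ms)


# Hypothetical figures as seen by a user in Sydney.
options = [
    Deployment("faster-model", "us-east", ttft_ms=200, rtt_ms=220),
    Deployment("slower-model", "sydney", ttft_ms=300, rtt_ms=20),
]

best = pick_fastest(options)
print(f"{best.model} in {best.region}: {best.perceived_ttft_ms:.0f} ms perceived TTFT")
```

Here the nominally slower model wins at 320 ms against 420 ms, because the local endpoint saves 200 ms of network time.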

Embed-ready

Need this live Speed data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Speed leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Live preview

Your visitors compare Speed models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

What is the fastest LLM API for real-time chat?

The 'fastest' model depends heavily on geographic routing and payload size. Models in the 'Flash' or '8B' parameter class typically offer the highest throughput. However, if your users are in Australia, a slightly slower model hosted in a Sydney data center will often feel faster than a theoretically faster model hosted in the US due to network latency.
