Interactive leaderboard

Best Long-Context LLMs 2026: Large-Window AI APIs

Explore long-context LLMs in 2026: 1M+ token context windows, RAG-friendly SKUs, and estimated API pricing. Built for legal, docs, and enterprise RAG in the US, Canada, and Australia.

Long-context LLMs ranked for RAG, docs, and codebases in 2026

Large context reduces chunking pain for legal bundles, multi-file repos, and executive briefs—but token cost scales with what you paste. This view foregrounds window size and fitness for retrieval-heavy stacks while keeping monthly estimates honest for teams in the US, Canada, and Australia planning enterprise rollouts.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off by default, so cost estimates exclude image fees for vision-capable models. Turn on to include image fees.

Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply


Deep Reasoning / Thinking Mode

When enabled, a model's hidden reasoning (extended thinking) tokens are charged like output tokens.

Batch Pricing

Enable for 50% off input & output where batch/async pricing applies


Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.
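
The toggles above can be read as multipliers on a per-token bill. What follows is a minimal sketch of that arithmetic, not the calculator's actual code. The names `Workload` and `estimate_monthly_cost`, the parameter names, and any prices passed in are hypothetical. The 50% figures come from the toggle descriptions on this page, and letting the cached and batch discounts stack is an assumption. Vision fees are left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_month: int
    input_tokens: int          # prompt tokens per request
    output_tokens: int         # visible completion tokens per request
    reasoning_tokens: int = 0  # hidden thinking tokens per request


def estimate_monthly_cost(
    w: Workload,
    input_price_per_m: float,   # USD per 1M input tokens (hypothetical value)
    output_price_per_m: float,  # USD per 1M output tokens (hypothetical value)
    use_cached: bool = False,
    use_batch: bool = False,
    thinking: bool = False,
) -> float:
    """Rough monthly estimate mirroring the toggles described above."""
    input_rate = input_price_per_m / 1_000_000
    output_rate = output_price_per_m / 1_000_000

    if use_cached:
        input_rate *= 0.5   # 50% off input tokens where cached rates apply
    if use_batch:
        input_rate *= 0.5   # 50% off input and output for batch/async
        output_rate *= 0.5  # (assumption: stacks with the cached discount)

    # Thinking tokens are billed like output tokens when the toggle is on.
    billed_output = w.output_tokens + (w.reasoning_tokens if thinking else 0)
    per_request = w.input_tokens * input_rate + billed_output * output_rate
    return round(per_request * w.requests_per_month, 2)


if __name__ == "__main__":
    demo = Workload(requests_per_month=1000, input_tokens=8000,
                    output_tokens=500, reasoning_tokens=1000)
    print(estimate_monthly_cost(demo, 2.0, 8.0))
    print(estimate_monthly_cost(demo, 2.0, 8.0, use_cached=True, thinking=True))
```

Because input tokens dominate long-context workloads, the cached toggle usually moves the estimate more than the thinking toggle does.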

Magic quadrant (top 15)

X: est. monthly · Y: Long context · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall

Auto Router | VARIABLE | 81 | 90 | 90 | 70 | 90 | 2.0M | 90
Auto Router optimizes across models for best output. Evidence cites top models reaching 92.3% MMLU. Mapped to 90 across logic, coding, and math to reflect frontier routing capabilities. Vision price defaulted to $0.007 per tier guidelines.

xAI: Grok 4.1 Fast | $13.00 | 61 | 75 | 83 | 90 | 75 | 2.0M | 79
Evidence lacks raw benchmarks. Inferred scores based on 'Fast' lightweight tier and agentic focus. Speed rated high (90); coding (75) and logic (80) adjusted lower than flagship models. Native reasoning enabled.

xAI: Grok 4 Fast | $13.00 | 41 | 40 | 43 | 95 | 45 | 2.0M | 43
Evidence cites a 43.5 average across HumanEval, GPQA, and MMLU for this 1B-scale Fast model. Mapped to ~40-45 for coding and logic. As a lightweight tier, speed is rated very high.

Llama 4 Scout | $7.00 | 60 | 65 | 73 | 85 | 70 | 10.0M | 70
MMMU 69.4% and ChartQA 88.8% map to 70 multimodal. Lacking SWE-bench or GPQA, coding and logic are inferred (65-70) for this 17B-active lightweight MoE tier. Speed is high (85) due to small active parameter count.

xAI: Grok 4.20 Multi-Agent | $140.00 | 58 | 85 | 87 | 45 | 85 | 2.0M | 86
Evidence lacks explicit Grok benchmarks (listed as 'Not available'). Inferred as a 2026 flagship reasoning model competing with Opus 4.6. Scores estimated for a heavy multi-agent reasoning tier; speed is lower due to 16-agent coordination.

Pareto Code Router | VARIABLE | 78 | 88 | 85 | 70 | 85 | 2.0M | 86
OpenRouter docs state this is a router defaulting to High tier coding models based on Artificial Analysis percentiles. Lacking specific raw benchmarks, scores are mapped to ~85 reflecting flagship-level routed performance. Text-only inputs confirmed.

xAI: Grok 4.20 | $75.00 | 62 | 95 | 91 | 80 | 90 | 2.0M | 92
SWE-bench (78.0%) maps to 95 coding. GPQA Diamond (74.5%) maps to 92 logic. As a flagship model, instruction and math align with frontier scores. Speed is high (80) due to being the low-latency, non-reasoning variant.

OpenAI: GPT-5.4 | $250.00 | 60 | 95 | 91 | 70 | 88 | 1.1M | 91
Evidence states GPT-5.4 outperforms GPT-4.1 (SWE-bench Verified 54.6%, GPQA 66.3%). Mapped coding to 95 and logic to 92 for this frontier flagship. Multimodal inferred from native computer-use screenshots and GPT-4.1's MMMU 74.8%.

OpenAI: GPT-5.5 | $500.00 | 57 | 85 | 89 | 60 | 90 | 1.1M | 88
Evidence states GPT-5.5 outperforms GPT-4.1 (SWE-bench Verified 54.6%, GPQA 66.3%). As a frontier model with native real-time reasoning, scores are mapped to elite flagship tiers (85-90+). Speed is typical for heavyweights.

OpenAI GPT Latest | $500.00 | 55 | 85 | 86 | 77 | 80 | 1.1M | 84
Based on GPT-4.1 data: SWE-bench Verified 54.6% maps to 85 coding, GPQA 66.3% maps to 85 logic, IFEval 87.4% maps to 87 instruction. MMMU 74.8% maps to 80 multimodal. Flagship tier, no lightweight adjustment.

OpenAI: GPT-5.4 Pro | $3,000.00 | 58 | 95 | 93 | 60 | 92 | 1.1M | 93
GPT-5.4 Pro lacks exact SWE-bench/GPQA scores but beats Gemini 3.1 Pro on SWE-Bench Pro and MMMU-Pro. Inferred as a flagship model: Coding ~95, Logic ~92 (GDPval 83%). Multimodal inferred ~85.

OpenAI: GPT-5.5 Pro | $3,000.00 | 58 | 90 | 94 | 60 | 92 | 1.1M | 92
LLM Benchmarks reports 94.8 overall score, mapped to 95 logic. Outperforms in GPQA and MathVista. As a 1T parameter Pro flagship, coding and math are estimated at 90-92. Speed is standard for heavyweights (60).

Owl Alpha | Free | 67 | 65 | 68 | 85 | 60 | 1.0M | 65
No exact scores for Owl Alpha; inferred as a lightweight reasoning model ('fewer parameters', 'designed for speed'). Mapped to mid-tier 0-100 scale (Coding 65, Logic 65) reflecting its agentic focus but smaller size.

Google: Gemini 3.1 Pro Preview Custom Tools | $200.00 | 63 | 98 | 97 | 65 | 95 | 1.0M | 97
SWE-bench Verified at 80.6% maps to 98 coding. GPQA Diamond at 94.3% maps to 98 logic. As a flagship Pro model, it receives high multimodal (95) and math (95) scores, with standard heavyweight speed (65).

Google: Gemini 2.5 Flash | $37.00 | 56 | 70 | 78 | 92 | 85 | 1.0M | 78
GPQA Diamond 78.3% (Logic 80), LiveCodeBench 63.5% (Coding 70), MMMU 76.7%. As a Flash-tier model, it excels in speed (93 tok/s) and math (AIME 78%), but trails Pro in heavy coding.

MiniMax: MiniMax M3 | $24.00 | 61 | 82 | 82 | 80 | 85 | 1.0M | 83
Based on cited evidence: GPQA 54.4% maps to Logic 75, HumanEval 86.9% to Coding 82, IFEval 89.1% to Instruction 89, and MATH 77.4% to Math 85. Native reasoning tokens are supported.

DeepSeek: DeepSeek V4 Pro | $26.10 | 66 | 98 | 91 | 70 | 88 | 1.0M | 92
SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE.

Gemini 2.5 Pro | $150.00 | 58 | 88 | 85 | 50 | 90 | 1.0M | 87
Evidence cites MMLU at 81.7% for Gemini 2.5 Pro, mapping to 85 Logic. It leads GPQA and AIME 2025, mapping to 90 Math. Coding maps to 88 based on major improvements over prior versions. Heavyweight tier.

Google: Gemini 3.5 Flash | $150.00 | 57 | 85 | 88 | 95 | 75 | 1.0M | 84
SWE-bench Verified 78.0% and GPQA Diamond 90.4% map to 85 and 90. Despite being a lightweight Flash tier (speed 95), explicit evidence dictates high capability scores, though typically lower than Pro.

Google Gemini Flash Latest | $150.00 | 48 | 70 | 65 | 95 | 75 | 1.0M | 69
Evidence cites HumanEval 74.3% (Coding ~70), GPQA 51.0% (Logic ~60), and MMMU 62.3% (Multimodal ~65). As a lightweight Flash tier, scores are adjusted lower than Pro flagships, while Speed is rated high (~95) for its class.

Qwen: Qwen3 Coder 480B A35B | $26.80 | 60 | 92 | 85 | 70 | 65 | 1.0M | 82
Evidence shows HumanEval 93.9% and GPQA 61.8%, mapping to high coding (92) and logic (85) for this heavyweight 480B MoE. Math is mapped to 65 based on MATH 39.3%. Speed is 70 (67 tok/s).

Xiaomi: MiMo-V2-Pro | $70.00 | 62 | 95 | 90 | 60 | 88 | 1.0M | 91
SWE-bench Verified at 78.0% maps to 95 coding. GPQA Diamond at 87.0% maps to 95 logic. IFBench at 68.8% maps to 85 instruction. As a 1T+ flagship, speed is lower (60).

Google: Gemini 3.1 Flash Lite Preview | $25.00 | 53 | 55 | 78 | 98 | 70 | 1.0M | 70
GPQA Diamond at 86.9% maps logic to 85. Coding score of 30.1 maps to 55. MMMU Pro at 76.8% sets multimodal to 77. As a Lite tier, speed is exceptionally high (381 tok/s) mapping to 98.

Google: Gemini 2.5 Flash Lite | $8.00 | 54 | 45 | 68 | 95 | 65 | 1.0M | 61
Evidence lacks exact Flash-Lite scores but notes it underperforms Flash (GPQA 78.3%, MMMU 76.7%). As a Lite tier, scores are adjusted downward (Logic 65, Coding 45). Speed is heavily weighted (95) due to 68 tok/s and ultra-low latency.

Google: Gemini 3.1 Flash Lite | $25.00 | 50 | 25 | 83 | 98 | 66 | 1.0M | 64
SWE-bench Verified at 22% maps to 25 coding. GPQA Diamond at 86.9% maps to 87 logic. As a Lite tier, it excels in speed (381 t/s, 98) but trails flagships in coding.

Google: Gemini 3.1 Pro Preview | $200.00 | 58 | 88 | 89 | 65 | 88 | 1.0M | 88
Evidence lacks raw benchmarks. Inferred scores based on 'Pro' flagship tier and 'frontier reasoning' claims, assigning high 80s for coding, logic, and math. Speed estimated at 65 for a heavyweight.

Google: Lyria 3 Pro Preview | Free | 68 | 70 | 70 | 55 | 60 | 1.0M | 68
Evidence notes Lyria 3 Pro scores well on SWE-bench and MMLU without exact figures. Mapped to 70s for Pro tier. Speed is 39.5 tok/s (55). Multimodal audio generation from images supported; default Pro vision price applied.

Google: Gemini 2.5 Pro Preview 06-05 | $150.00 | 63 | 96 | 93 | 45 | 96 | 1.0M | 95
SWE-bench (59.6%) maps to 96 coding. GPQA (86.4%) maps to 96 logic. AIME (88.0%) maps to 96 math. MMMU (82.0%) maps to 90 multimodal. Flagship tier model with native reasoning; speed adjusted for thinking overhead.

Google: Gemini 2.5 Flash Lite Preview 09-2025 | $8.00 | 52 | 45 | 67 | 95 | 55 | 1.0M | 58
Based on GPQA (65.1-70.9%) and LiveCodeBench (64.1-68.8%), logic and coding map to 68 and 45. As a 'Flash Lite' tier, speed is heavily weighted (95), reflecting its ultra-low latency design over flagship-level reasoning.

DeepSeek: DeepSeek V4 Flash | $5.90 | 66 | 68 | 79 | 90 | 85 | 1.0M | 78
V4 flagship claims 80%+ SWE-bench; Flash tier (13B active) lacks explicit scores but is inferred ~68 for coding. Logic and Math scaled down for Flash efficiency. Speed rated 90 for fast inference design.

Google: Gemini 2.5 Pro Preview 05-06 | $150.00 | 62 | 95 | 93 | 45 | 95 | 1.0M | 94
Evidence notes it outperforms Claude 3.5 Sonnet on SWE-bench Verified, GPQA, and MMMU, though exact percentages are omitted. Mapped to frontier-level 95s for coding, logic, and multimodal. Speed is reduced due to mandatory native thought reasoning.

Xiaomi: MiMo-V2.5-Pro | $26.10 | 65 | 95 | 91 | 65 | 88 | 1.0M | 91
SWE-bench Verified 78.9% maps to 95 coding. GPQA Diamond 66.7% maps to 92 logic. Flagship tier model with native reasoning and caching support.

Gemini 2.0 Flash (001) | $8.00 | 59 | 65 | 73 | 95 | 70 | 1.0M | 70
MMLU 76.4%, MMMU 71.7%, and MATH 53.2% map to Logic 76, Multimodal 72, and Math 70. As a Flash-tier model, Coding (65) is adjusted lower than heavyweights, while Speed (95) reflects its highly optimized latency.

Google Gemini Pro Latest | $200.00 | 53 | 84 | 73 | 60 | 88 | 1.0M | 79
HumanEval 84.1% maps to 84 coding. GPQA 59.1% maps to 65 logic. MATH 86.5% maps to 88 math. MMMU 65.9% maps to 75 multimodal. Flagship tier model with native reasoning capabilities.

Xiaomi: MiMo-V2.5 | $8.40 | 67 | 78 | 86 | 60 | 85 | 1.0M | 84
Pro-level agentic performance maps to MiMo-V2-Pro's SWE-bench Verified (78.0%) and GPQA Diamond (87.0%), yielding ~78 Coding and ~87 Logic. Speed reflects 29 tok/s. Multimodal is strong per omnimodal claims.

Google: Gemini 3 Flash Preview | $50.00 | 61 | 88 | 89 | 95 | 85 | 1.0M | 88
GPQA Diamond 90.4% maps to Logic 92; SWE-bench 78% maps to Coding 88. As a Flash-tier model, Speed is rated very high (95). Multimodal inferred at 80 due to extensive video/audio/image support.

Google: Lyria 3 Clip Preview | Free | 31 | 0 | 5 | 50 | 0 | 1.0M | 3
Lyria 3 is a specialized music generation model lacking standard LLM benchmarks (SWE-bench, GPQA). Assigned 0 for coding/logic/math. Speed mapped to 50 from 38 tok/s. Multimodal scored 85 for native image-to-audio generation.

Llama 4 Maverick | $12.00 | 63 | 80 | 87 | 75 | 75 | 1.0M | 82
HumanEval 82.6% maps to coding 80. MMLU 88.5% maps to logic 85. IFEval 92.1% maps to instruction 88. MATH 77% maps to math 75. MMMU 73.4% maps to multimodal 85. 17B active MoE yields mid-tier speed.

Google: Gemini 2.0 Flash Lite | $6.00 | 58 | 65 | 65 | 95 | 65 | 1.0M | 65
Evidence lacks exact percentages but confirms 2.0 Flash-Lite outperforms 1.5 Flash and trails 2.0 Flash on GPQA, MATH, and MMMU. Scores inferred cautiously for this lightweight tier, prioritizing its high speed and lower reasoning/coding capabilities.

OpenAI: GPT-4.1 | $160.00 | 57 | 88 | 86 | 65 | 80 | 1.0M | 85
SWE-bench Verified at 54.6% maps to 88 coding. GPQA Diamond at 66.3% maps to 85 logic. IFEval 87.4% maps to 87 instruction. Flagship tier model; no tier penalty applied.

OpenAI: GPT-4.1 Mini | $32.00 | 56 | 68 | 82 | 95 | 75 | 1.0M | 77
SWE-bench Verified 23.6% (mapped to 68), GPQA Diamond 65% (mapped to 80). As a Mini tier, speed is rated high (95) while coding/logic reflect its lightweight nature compared to flagship models.

OpenAI: GPT-4.1 Nano | $8.00 | 56 | 65 | 65 | 95 | 65 | 1.0M | 65
GPQA 50.3% and MMLU 80.1% map to ~60 logic. HumanEval 86.6% and Aider 9.8% map to ~65 coding. As a 'Nano' lightweight tier, it prioritizes speed (~95) over flagship reasoning.

Writer: Palmyra X5 | $84.00 | 43 | 55 | 43 | 75 | 90 | 1.0M | 58
Mapped BigCodeBench (48.7%) to coding (55), GPQA (38.26%) to logic (45), IFEval (36.57%) to instruction (40), and MATH500 (88.6%) to math (90). Flagship tier model with native reasoning tokens but mixed benchmark performance.

MiniMax: MiniMax-01 | $19.00 | 63 | 85 | 87 | 65 | 85 | 1.0M | 86
HumanEval 86.9%, MMLU 88.5%, and MATH 84.6% map to high 80s. Heavyweight 456B MoE tier dictates strong logic/coding but moderate speed. Vision supported; default frontier price applied.

Qwen: Qwen3.7 Plus | $32.00 | 59 | 80 | 81 | 80 | 82 | 1.0M | 81
Evidence lacks exact Qwen3.7 Plus benchmark scores. Inferred mid-tier capabilities (Coding 80, Logic 80) from 'Plus' designation and 71 tok/s throughput (Speed 80). Multimodal inferred at 70. Always-on reasoning noted in lineage.

xAI: Grok 4.3 | $75.00 | 58 | 82 | 85 | 90 | 88 | 1.0M | 85
Using Grok 4.20's GPQA 78.5% and MATH-500 87.3% as proxies, mapped to Logic 85 and Math 88. As a fast/cost-efficient tier ($0.20/1M), Coding is inferred at 82. Speed is rated 90 for ultra-fast throughput.

Anthropic: Claude Fable Latest | $900.00 | 55 | 85 | 85 | 50 | 85 | 1.0M | 85
Evidence states Fable is competitive with top 2026 models on SWE-bench and MMLU, featuring native reasoning tokens. Lacking exact Fable scores, inferred frontier-level capabilities (~85) across coding and logic. Speed adjusted for reasoning overhead.

Qwen: Qwen3 Coder Flash | $17.55 | 54 | 70 | 68 | 95 | 65 | 1.0M | 68
Evidence lacks exact benchmark numbers but notes Qwen3 Coder Flash is a speed-optimized, lightweight tier. Scores inferred cautiously for a Flash model, prioritizing speed (95) over coding/logic compared to the flagship Coder Plus.



How it works

Methodology: How we rank Long-Context LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

Context window data and how it interacts with price

Our long-context rankings evaluate models based on their maximum supported context window (ranging from 128k to 2M+ tokens) and their proven ability to retrieve information accurately at high fill rates (Needle In A Haystack performance). This view is purpose-built for teams building enterprise RAG systems, legal document analyzers, and repository-scale coding assistants.
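
A needle-in-a-haystack check of the kind mentioned above can be sketched in a few lines. This is an illustrative harness under stated assumptions, not the methodology behind these rankings. `ask_model` is a hypothetical callable wrapping whatever API you use, `truncating_stub` is a stand-in that only reads the tail of the prompt, and the haystack is sized in characters rather than tokens for simplicity.

```python
from typing import Callable


def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Repeat filler to total_chars and insert the needle at a relative depth (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(len(body) * depth)
    return body[:cut] + "\n" + needle + "\n" + body[cut:]


def needle_recall(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    needle: str,
    expected: str,
    question: str,
    total_chars: int,
    depths: list[float],
    filler: str = "The quarterly report was filed on time. ",
) -> float:
    """Fraction of insertion depths at which the answer contains the expected fact."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(filler, needle, total_chars, depth)
        prompt += "\n\nQuestion: " + question
        if expected.lower() in ask_model(prompt).lower():
            hits += 1
    return hits / len(depths)


def truncating_stub(prompt: str) -> str:
    """Stand-in for a real API call: it only 'sees' the last 2,000 characters."""
    visible = prompt[-2000:]
    return "The code is 7421." if "7421" in visible else "I don't know."


if __name__ == "__main__":
    score = needle_recall(
        truncating_stub,
        needle="The vault code is 7421.",
        expected="7421",
        question="What is the vault code?",
        total_chars=10_000,
        depths=[0.0, 0.5, 1.0],
    )
    print(f"recall across depths: {score:.2f}")
```

Sweeping both `total_chars` and `depths` gives the usual fill-rate by depth grid. The stub shows why depth matters: it only finds needles placed near the end of the prompt.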

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

When a mega-context model beats clever chunking

If your prompts repeat the same long system preamble, prompt caching may matter as much as raw window size—toggle cached pricing when supported. Canadian and Australian enterprises often evaluate sovereignty and subprocessors alongside context; US teams may prioritize vendor BAAs and retention policies.

Production deployment

Retrieval-Augmented Generation (RAG)

How teams in the US, Canada, and Australia deploy these models in production.

Needle-in-a-haystack Q&A and book-length summarization

Models with 1M+ token windows are transforming how enterprises handle unstructured data. Legal teams in the US and compliance officers in Canada use these massive-context models to ingest entire case files, financial prospectuses, or sprawling codebases in a single prompt, bypassing the complexity and retrieval loss associated with traditional vector database chunking.
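
Whether a case file or repository can really go into a single prompt comes down to a token budget check. The sketch below is one way to make that decision. The function names are hypothetical, and the four-characters-per-token ratio is a rough rule of thumb for English text, not a real tokenizer; use your provider's token counter for production estimates.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; the 4 chars/token ratio is an assumption, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def plan_prompt(documents: list[str], context_window: int,
                reserve_for_output: int = 8_000) -> dict:
    """Decide whether the whole corpus fits in one prompt or still needs retrieval."""
    budget = context_window - reserve_for_output  # leave room for the answer
    total = sum(estimate_tokens(d) for d in documents)
    return {
        "estimated_tokens": total,
        "budget": budget,
        "strategy": "single_prompt" if total <= budget else "retrieve_or_chunk",
    }


if __name__ == "__main__":
    case_file = ["x" * 400_000] * 5  # five documents of roughly 100K tokens each
    print(plan_prompt(case_file, context_window=1_000_000))
    print(plan_prompt(case_file, context_window=128_000))
```

Under these assumptions the same five-document bundle fits a 1M-token window in one prompt but still needs retrieval or chunking at 128K.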

Architecture

Long-Context Cost Management

Strategies to reduce monthly API spend without sacrificing capability.

Prefix caching economics and context compression

Sending 500,000 tokens per request is financially ruinous without optimization. The key to affordable long-context architecture is prompt caching. By keeping the massive document at the start of the prompt (the prefix), providers can cache it and discount those cached input tokens on subsequent queries by up to 90%, depending on the provider. Toggle 'Use Cached Pricing' on this leaderboard, which applies 50% off input tokens where cached rates apply, for a more realistic forecast of long-context RAG costs.
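
To see how much the prefix discount matters, a worked calculation helps. The sketch below compares input spend for repeated questions over one 500,000-token document. The function name and the $2.00 per 1M input tokens price are hypothetical, the 90% discount is the upper bound quoted above, and cache-write surcharges and cache expiry are deliberately ignored.

```python
def prefix_cache_cost(
    prefix_tokens: int,
    question_tokens: int,
    queries: int,
    input_price_per_m: float,      # USD per 1M input tokens (hypothetical value)
    cached_discount: float = 0.9,  # "up to 90%"; real discounts vary by provider
) -> dict:
    """Compare input spend for repeated queries with and without a cached prefix."""
    if queries < 1:
        raise ValueError("queries must be at least 1")
    rate = input_price_per_m / 1_000_000

    # Without caching, every query pays full price for the whole document.
    uncached = queries * (prefix_tokens + question_tokens) * rate

    # With caching, the first query pays full price for the prefix and
    # later queries pay the discounted rate. Questions are never cached.
    cached = (
        prefix_tokens * rate
        + (queries - 1) * prefix_tokens * rate * (1 - cached_discount)
        + queries * question_tokens * rate
    )
    return {
        "uncached": round(uncached, 2),
        "cached": round(cached, 2),
        "savings_pct": round(100 * (1 - cached / uncached), 1),
    }


if __name__ == "__main__":
    print(prefix_cache_cost(prefix_tokens=500_000, question_tokens=200,
                            queries=100, input_price_per_m=2.0))
```

With these assumed numbers, 100 questions cost roughly $100 in input tokens uncached versus roughly $11 with a cached prefix. The saving approaches the full discount as the number of queries per document grows.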

Embed-ready

Need this live Long-Context data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Long-Context leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.


Your visitors compare Long-Context models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

Models with 1M+ token windows (like Gemini 1.5 Pro) are currently the industry standard for massive document analysis. They allow teams in the US, Canada, and Australia to ingest entire legal briefs or financial prospectuses in a single prompt, bypassing the complexity of traditional vector database chunking.