Best Long-Context LLMs 2026: Large-Window AI APIs
Explore long-context LLMs in 2026: 1M+ token context windows, RAG-friendly SKUs, and estimated API pricing. Built for legal, docs, and enterprise RAG in the US, Canada, and Australia.
Long-context LLMs ranked for RAG, docs, and codebases in 2026
Large context windows reduce chunking pain for legal bundles, multi-file repos, and executive briefs, but token cost scales with what you paste. This view foregrounds window size and fitness for retrieval-heavy stacks while keeping monthly estimates honest for teams in the US, Canada, and Australia planning enterprise rollouts.
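To make the "cost scales with what you paste" point concrete, here is a minimal sketch of a monthly estimate. It assumes simple per-million-token billing; the `Workload` fields, the `monthly_cost_usd` helper, and both prices are hypothetical placeholders, not the formula this calculator uses.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical monthly workload; field names are illustrative only."""
    requests_per_month: int
    input_tokens_per_request: int
    output_tokens_per_request: int


def monthly_cost_usd(w: Workload, input_price: float, output_price: float) -> float:
    """Estimate monthly spend, with prices given in USD per 1M tokens."""
    per_request = (
        w.input_tokens_per_request * input_price
        + w.output_tokens_per_request * output_price
    ) / 1_000_000
    return w.requests_per_month * per_request


# Placeholder prices, not taken from any provider's price list.
INPUT_PRICE, OUTPUT_PRICE = 1.00, 4.00

chunked = Workload(1_000, 8_000, 500)    # retrieve a few chunks per request
pasted = Workload(1_000, 800_000, 500)   # paste most of a large bundle

print(f"chunked: ${monthly_cost_usd(chunked, INPUT_PRICE, OUTPUT_PRICE):,.2f}")
print(f"pasted:  ${monthly_cost_usd(pasted, INPUT_PRICE, OUTPUT_PRICE):,.2f}")
```

With these placeholder numbers the pasted workload costs about 80 times more per month than the chunked one, even though the request count and output length are identical.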
Workload & pricing toggles
Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.
Include Vision / Image Processing
Off — no image fees in cost estimates for vision-capable models.
Turn On to include image fees.
Use Cached Pricing
Enable to apply 50% off input tokens where cached rates apply.
Deep Reasoning / Thinking Mode
When enabled, the model's hidden reasoning (extended thinking) tokens are charged like output tokens.
Batch Pricing
Enable to apply 50% off input and output tokens where batch/async pricing applies.
Cached and batch monthly estimates only change once the pipeline sets supports_caching or supports_batch for a model in Supabase. Enabling either toggle narrows the table to models whose catalog entry or provider typically supports that mode.
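The pricing toggles above can be sketched as adjustments to per-million-token rates. This is an illustration under stated assumptions, not the calculator's actual logic: the `Model` class and both functions are hypothetical, every input token is assumed to hit the cache, and the cached and batch discounts are assumed to stack. Only the flag names `supports_caching` and `supports_batch` come from the page.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical catalog row; prices are USD per 1M tokens."""
    name: str
    input_price: float
    output_price: float
    supports_caching: bool = False
    supports_batch: bool = False


def request_cost_usd(model, input_tokens, output_tokens, reasoning_tokens=0,
                     use_cached=False, use_batch=False):
    """Cost of one request with the toggles applied."""
    input_rate, output_rate = model.input_price, model.output_price
    if use_cached and model.supports_caching:
        input_rate *= 0.5            # 50% off input (assumes full cache hit)
    if use_batch and model.supports_batch:
        input_rate *= 0.5            # 50% off input and output
        output_rate *= 0.5           # (assumes discounts stack)
    billed_output = output_tokens + reasoning_tokens  # thinking billed as output
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000


def visible_models(models, use_cached=False, use_batch=False):
    """Narrow the table to models that support the enabled modes."""
    return [
        m for m in models
        if (not use_cached or m.supports_caching)
        and (not use_batch or m.supports_batch)
    ]


# Placeholder model and prices, for illustration only.
demo = Model("demo-model", input_price=2.00, output_price=8.00, supports_caching=True)
print(request_cost_usd(demo, 1_000_000, 100_000))
print(request_cost_usd(demo, 1_000_000, 100_000, use_cached=True))
print(request_cost_usd(demo, 1_000_000, 100_000, reasoning_tokens=100_000))
```

In this sketch a toggle has no effect on a model whose flag is unset, which mirrors the note that estimates only change after the corresponding flag is set.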
Magic quadrant (top 15)
X: est. monthly · Y: Long context · Dot: provider color · Hover for rank, model & details
Full leaderboard
Showing 48 of 365 models.
| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Notes |
|---|---|---|---|---|---|---|---|---|---|
| Auto Router | VARIABLE | 81 | 90 | 90 | 70 | 90 | 2.0M | 90 | Auto Router optimizes across models for best output. Evidence cites top models reaching 92.3% MMLU. Mapped to 90 across logic, coding, and math to reflect frontier routing capabilities. Vision price defaulted to $0.007 per tier guidelines. |
| xAI: Grok 4.1 Fast | $13.00 | 61 | 75 | 83 | 90 | 75 | 2.0M | 79 | Evidence lacks raw benchmarks. Inferred scores based on 'Fast' lightweight tier and agentic focus. Speed rated high (90); coding (75) and logic (80) adjusted lower than flagship models. Native reasoning enabled. |
| xAI: Grok 4 Fast | $13.00 | 41 | 40 | 43 | 95 | 45 | 2.0M | 43 | Evidence cites a 43.5 average across HumanEval, GPQA, and MMLU for this 1B-scale Fast model. Mapped to ~40-45 for coding and logic. As a lightweight tier, speed is rated very high. |
| Llama 4 Scout | $7.00 | 60 | 65 | 73 | 85 | 70 | 10.0M | 70 | MMMU 69.4% and ChartQA 88.8% map to 70 multimodal. Lacking SWE-bench or GPQA, coding and logic are inferred (65-70) for this 17B-active lightweight MoE tier. Speed is high (85) due to small active parameter count. |
| xAI: Grok 4.20 Multi-Agent | $140.00 | 58 | 85 | 87 | 45 | 85 | 2.0M | 86 | Evidence lacks explicit Grok benchmarks (listed as 'Not available'). Inferred as a 2026 flagship reasoning model competing with Opus 4.6. Scores estimated for a heavy multi-agent reasoning tier; speed is lower due to 16-agent coordination. |
| Pareto Code Router | VARIABLE | 78 | 88 | 85 | 70 | 85 | 2.0M | 86 | OpenRouter docs state this is a router defaulting to High tier coding models based on Artificial Analysis percentiles. Lacking specific raw benchmarks, scores are mapped to ~85 reflecting flagship-level routed performance. Text-only inputs confirmed. |
| xAI: Grok 4.20 | $75.00 | 62 | 95 | 91 | 80 | 90 | 2.0M | 92 | SWE-bench (78.0%) maps to 95 coding. GPQA Diamond (74.5%) maps to 92 logic. As a flagship model, instruction and math align with frontier scores. Speed is high (80) due to being the low-latency, non-reasoning variant. |
| OpenAI: GPT-5.4 | $250.00 | 60 | 95 | 91 | 70 | 88 | 1.1M | 91 | Evidence states GPT-5.4 outperforms GPT-4.1 (SWE-bench Verified 54.6%, GPQA 66.3%). Mapped coding to 95 and logic to 92 for this frontier flagship. Multimodal inferred from native computer-use screenshots and GPT-4.1's MMMU 74.8%. |
| OpenAI: GPT-5.5 | $500.00 | 57 | 85 | 89 | 60 | 90 | 1.1M | 88 | Evidence states GPT-5.5 outperforms GPT-4.1 (SWE-bench Verified 54.6%, GPQA 66.3%). As a frontier model with native real-time reasoning, scores are mapped to elite flagship tiers (85-90+). Speed is typical for heavyweights. |
| OpenAI GPT Latest | $500.00 | 55 | 85 | 86 | 77 | 80 | 1.1M | 84 | Based on GPT-4.1 data: SWE-bench Verified 54.6% maps to 85 coding, GPQA 66.3% maps to 85 logic, IFEval 87.4% maps to 87 instruction. MMMU 74.8% maps to 80 multimodal. Flagship tier, no lightweight adjustment. |
| OpenAI: GPT-5.4 Pro | $3,000.00 | 58 | 95 | 93 | 60 | 92 | 1.1M | 93 | GPT-5.4 Pro lacks exact SWE-bench/GPQA scores but beats Gemini 3.1 Pro on SWE-Bench Pro and MMMU-Pro. Inferred as a flagship model: Coding ~95, Logic ~92 (GDPval 83%). Multimodal inferred ~85. |
| OpenAI: GPT-5.5 Pro | $3,000.00 | 58 | 90 | 94 | 60 | 92 | 1.1M | 92 | LLM Benchmarks reports 94.8 overall score, mapped to 95 logic. Outperforms in GPQA and MathVista. As a 1T parameter Pro flagship, coding and math are estimated at 90-92. Speed is standard for heavyweights (60). |
| Owl Alpha | Free | 67 | 65 | 68 | 85 | 60 | 1.0M | 65 | No exact scores for Owl Alpha; inferred as a lightweight reasoning model ('fewer parameters', 'designed for speed'). Mapped to mid-tier 0-100 scale (Coding 65, Logic 65) reflecting its agentic focus but smaller size. |
| Google: Gemini 3.1 Pro Preview Custom Tools | $200.00 | 63 | 98 | 97 | 65 | 95 | 1.0M | 97 | SWE-bench Verified at 80.6% maps to 98 coding. GPQA Diamond at 94.3% maps to 98 logic. As a flagship Pro model, it receives high multimodal (95) and math (95) scores, with standard heavyweight speed (65). |
| Google: Gemini 2.5 Flash | $37.00 | 56 | 70 | 78 | 92 | 85 | 1.0M | 78 | GPQA Diamond 78.3% (Logic 80), LiveCodeBench 63.5% (Coding 70), MMMU 76.7%. As a Flash-tier model, it excels in speed (93 tok/s) and math (AIME 78%), but trails Pro in heavy coding. |
| MiniMax: MiniMax M3 | $24.00 | 61 | 82 | 82 | 80 | 85 | 1.0M | 83 | Based on cited evidence: GPQA 54.4% maps to Logic 75, HumanEval 86.9% to Coding 82, IFEval 89.1% to Instruction 89, and MATH 77.4% to Math 85. Native reasoning tokens are supported. |
| DeepSeek: DeepSeek V4 Pro | $26.10 | 66 | 98 | 91 | 70 | 88 | 1.0M | 92 | SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE. |
| Gemini 2.5 Pro | $150.00 | 58 | 88 | 85 | 50 | 90 | 1.0M | 87 | Evidence cites MMLU at 81.7% for Gemini 2.5 Pro, mapping to 85 Logic. It leads GPQA and AIME 2025, mapping to 90 Math. Coding maps to 88 based on major improvements over prior versions. Heavyweight tier. |
| Google: Gemini 3.5 Flash | $150.00 | 57 | 85 | 88 | 95 | 75 | 1.0M | 84 | SWE-bench Verified 78.0% and GPQA Diamond 90.4% map to 85 and 90. Despite being a lightweight Flash tier (speed 95), explicit evidence dictates high capability scores, though typically lower than Pro. |
| Google Gemini Flash Latest | $150.00 | 48 | 70 | 65 | 95 | 75 | 1.0M | 69 | Evidence cites HumanEval 74.3% (Coding ~70), GPQA 51.0% (Logic ~60), and MMMU 62.3% (Multimodal ~65). As a lightweight Flash tier, scores are adjusted lower than Pro flagships, while Speed is rated high (~95) for its class. |
| Qwen: Qwen3 Coder 480B A35B | $26.80 | 60 | 92 | 85 | 70 | 65 | 1.0M | 82 | Evidence shows HumanEval 93.9% and GPQA 61.8%, mapping to high coding (92) and logic (85) for this heavyweight 480B MoE. Math is mapped to 65 based on MATH 39.3%. Speed is 70 (67 tok/s). |
| Xiaomi: MiMo-V2-Pro | $70.00 | 62 | 95 | 90 | 60 | 88 | 1.0M | 91 | SWE-bench Verified at 78.0% maps to 95 coding. GPQA Diamond at 87.0% maps to 95 logic. IFBench at 68.8% maps to 85 instruction. As a 1T+ flagship, speed is lower (60). |
| Google: Gemini 3.1 Flash Lite Preview | $25.00 | 53 | 55 | 78 | 98 | 70 | 1.0M | 70 | GPQA Diamond at 86.9% maps logic to 85. Coding score of 30.1 maps to 55. MMMU Pro at 76.8% sets multimodal to 77. As a Lite tier, speed is exceptionally high (381 tok/s) mapping to 98. |
| Google: Gemini 2.5 Flash Lite | $8.00 | 54 | 45 | 68 | 95 | 65 | 1.0M | 61 | Evidence lacks exact Flash-Lite scores but notes it underperforms Flash (GPQA 78.3%, MMMU 76.7%). As a Lite tier, scores are adjusted downward (Logic 65, Coding 45). Speed is heavily weighted (95) due to 68 tok/s and ultra-low latency. |
| Google: Gemini 3.1 Flash Lite | $25.00 | 50 | 25 | 83 | 98 | 66 | 1.0M | 64 | SWE-bench Verified at 22% maps to 25 coding. GPQA Diamond at 86.9% maps to 87 logic. As a Lite tier, it excels in speed (381 t/s, 98) but trails flagships in coding. |
| Google: Gemini 3.1 Pro Preview | $200.00 | 58 | 88 | 89 | 65 | 88 | 1.0M | 88 | Evidence lacks raw benchmarks. Inferred scores based on 'Pro' flagship tier and 'frontier reasoning' claims, assigning high 80s for coding, logic, and math. Speed estimated at 65 for a heavyweight. |
| Google: Lyria 3 Pro Preview | Free | 68 | 70 | 70 | 55 | 60 | 1.0M | 68 | Evidence notes Lyria 3 Pro scores well on SWE-bench and MMLU without exact figures. Mapped to 70s for Pro tier. Speed is 39.5 tok/s (55). Multimodal audio generation from images supported; default Pro vision price applied. |
| Google: Gemini 2.5 Pro Preview 06-05 | $150.00 | 63 | 96 | 93 | 45 | 96 | 1.0M | 95 | SWE-bench (59.6%) maps to 96 coding. GPQA (86.4%) maps to 96 logic. AIME (88.0%) maps to 96 math. MMMU (82.0%) maps to 90 multimodal. Flagship tier model with native reasoning; speed adjusted for thinking overhead. |
| Google: Gemini 2.5 Flash Lite Preview 09-2025 | $8.00 | 52 | 45 | 67 | 95 | 55 | 1.0M | 58 | Based on GPQA (65.1-70.9%) and LiveCodeBench (64.1-68.8%), logic and coding map to 68 and 45. As a 'Flash Lite' tier, speed is heavily weighted (95), reflecting its ultra-low latency design over flagship-level reasoning. |
| DeepSeek: DeepSeek V4 Flash | $5.90 | 66 | 68 | 79 | 90 | 85 | 1.0M | 78 | V4 flagship claims 80%+ SWE-bench; Flash tier (13B active) lacks explicit scores but is inferred ~68 for coding. Logic and Math scaled down for Flash efficiency. Speed rated 90 for fast inference design. |
| Google: Gemini 2.5 Pro Preview 05-06 | $150.00 | 62 | 95 | 93 | 45 | 95 | 1.0M | 94 | Evidence notes it outperforms Claude 3.5 Sonnet on SWE-bench Verified, GPQA, and MMMU, though exact percentages are omitted. Mapped to frontier-level 95s for coding, logic, and multimodal. Speed is reduced due to mandatory native thought reasoning. |
| Xiaomi: MiMo-V2.5-Pro | $26.10 | 65 | 95 | 91 | 65 | 88 | 1.0M | 91 | SWE-bench Verified 78.9% maps to 95 coding. GPQA Diamond 66.7% maps to 92 logic. Flagship tier model with native reasoning and caching support. |
| Gemini 2.0 Flash (001) | $8.00 | 59 | 65 | 73 | 95 | 70 | 1.0M | 70 | MMLU 76.4%, MMMU 71.7%, and MATH 53.2% map to Logic 76, Multimodal 72, and Math 70. As a Flash-tier model, Coding (65) is adjusted lower than heavyweights, while Speed (95) reflects its highly optimized latency. |
| Google Gemini Pro Latest | $200.00 | 53 | 84 | 73 | 60 | 88 | 1.0M | 79 | HumanEval 84.1% maps to 84 coding. GPQA 59.1% maps to 65 logic. MATH 86.5% maps to 88 math. MMMU 65.9% maps to 75 multimodal. Flagship tier model with native reasoning capabilities. |
| Xiaomi: MiMo-V2.5 | $8.40 | 67 | 78 | 86 | 60 | 85 | 1.0M | 84 | Pro-level agentic performance maps to MiMo-V2-Pro's SWE-bench Verified (78.0%) and GPQA Diamond (87.0%), yielding ~78 Coding and ~87 Logic. Speed reflects 29 tok/s. Multimodal is strong per omnimodal claims. |
| Google: Gemini 3 Flash Preview | $50.00 | 61 | 88 | 89 | 95 | 85 | 1.0M | 88 | GPQA Diamond 90.4% maps to Logic 92; SWE-bench 78% maps to Coding 88. As a Flash-tier model, Speed is rated very high (95). Multimodal inferred at 80 due to extensive video/audio/image support. |
| Google: Lyria 3 Clip Preview | Free | 31 | 0 | 5 | 50 | 0 | 1.0M | 3 | Lyria 3 is a specialized music generation model lacking standard LLM benchmarks (SWE-bench, GPQA). Assigned 0 for coding/logic/math. Speed mapped to 50 from 38 tok/s. Multimodal scored 85 for native image-to-audio generation. |
| Llama 4 Maverick | $12.00 | 63 | 80 | 87 | 75 | 75 | 1.0M | 82 | HumanEval 82.6% maps to coding 80. MMLU 88.5% maps to logic 85. IFEval 92.1% maps to instruction 88. MATH 77% maps to math 75. MMMU 73.4% maps to multimodal 85. 17B active MoE yields mid-tier speed. |
| Google: Gemini 2.0 Flash Lite | $6.00 | 58 | 65 | 65 | 95 | 65 | 1.0M | 65 | Evidence lacks exact percentages but confirms 2.0 Flash-Lite outperforms 1.5 Flash and trails 2.0 Flash on GPQA, MATH, and MMMU. Scores inferred cautiously for this lightweight tier, prioritizing its high speed and lower reasoning/coding capabilities. |
| OpenAI: GPT-4.1 | $160.00 | 57 | 88 | 86 | 65 | 80 | 1.0M | 85 | SWE-bench Verified at 54.6% maps to 88 coding. GPQA Diamond at 66.3% maps to 85 logic. IFEval 87.4% maps to 87 instruction. Flagship tier model; no tier penalty applied. |
| OpenAI: GPT-4.1 Mini | $32.00 | 56 | 68 | 82 | 95 | 75 | 1.0M | 77 | SWE-bench Verified 23.6% (mapped to 68), GPQA Diamond 65% (mapped to 80). As a Mini tier, speed is rated high (95) while coding/logic reflect its lightweight nature compared to flagship models. |
| OpenAI: GPT-4.1 Nano | $8.00 | 56 | 65 | 65 | 95 | 65 | 1.0M | 65 | GPQA 50.3% and MMLU 80.1% map to ~60 logic. HumanEval 86.6% and Aider 9.8% map to ~65 coding. As a 'Nano' lightweight tier, it prioritizes speed (~95) over flagship reasoning. |
| Writer: Palmyra X5 | $84.00 | 43 | 55 | 43 | 75 | 90 | 1.0M | 58 | Mapped BigCodeBench (48.7%) to coding (55), GPQA (38.26%) to logic (45), IFEval (36.57%) to instruction (40), and MATH500 (88.6%) to math (90). Flagship tier model with native reasoning tokens but mixed benchmark performance. |
| MiniMax: MiniMax-01 | $19.00 | 63 | 85 | 87 | 65 | 85 | 1.0M | 86 | HumanEval 86.9%, MMLU 88.5%, and MATH 84.6% map to high 80s. Heavyweight 456B MoE tier dictates strong logic/coding but moderate speed. Vision supported; default frontier price applied. |
| Qwen: Qwen3.7 Plus | $32.00 | 59 | 80 | 81 | 80 | 82 | 1.0M | 81 | Evidence lacks exact Qwen3.7 Plus benchmark scores. Inferred mid-tier capabilities (Coding 80, Logic 80) from 'Plus' designation and 71 tok/s throughput (Speed 80). Multimodal inferred at 70. Always-on reasoning noted in lineage. |
| xAI: Grok 4.3 | $75.00 | 58 | 82 | 85 | 90 | 88 | 1.0M | 85 | Using Grok 4.20's GPQA 78.5% and MATH-500 87.3% as proxies, mapped to Logic 85 and Math 88. As a fast/cost-efficient tier ($0.20/1M), Coding is inferred at 82. Speed is rated 90 for ultra-fast throughput. |
| Anthropic: Claude Fable Latest | $900.00 | 55 | 85 | 85 | 50 | 85 | 1.0M | 85 | Evidence states Fable is competitive with top 2026 models on SWE-bench and MMLU, featuring native reasoning tokens. Lacking exact Fable scores, inferred frontier-level capabilities (~85) across coding and logic. Speed adjusted for reasoning overhead. |
| Qwen: Qwen3 Coder Flash | $17.55 | 54 | 70 | 68 | 95 | 65 | 1.0M | 68 | Evidence lacks exact benchmark numbers but notes Qwen3 Coder Flash is a speed-optimized, lightweight tier. Scores inferred cautiously for a Flash model, prioritizing speed (95) over coding/logic compared to the flagship Coder Plus. |
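For readers who export this table and want to shortlist models themselves, the sketch below parses the "Est. monthly" and "Context" cells and filters by a minimum window and a budget. The helper names and the thresholds are hypothetical; the sample rows are copied from the table above, and the treatment of "VARIABLE" as unknown cost is an assumption.

```python
def parse_context(cell: str) -> int:
    """Convert a context cell such as '1.1M' or '200K' to a token count."""
    units = {"K": 1_000, "M": 1_000_000}
    cell = cell.strip().upper()
    if cell and cell[-1] in units:
        return round(float(cell[:-1]) * units[cell[-1]])
    return int(cell)


def parse_monthly(cell: str):
    """Convert '$3,000.00' to 3000.0, 'Free' to 0.0, 'VARIABLE' to None."""
    cell = cell.strip()
    if cell.lower() == "free":
        return 0.0
    if cell.upper() == "VARIABLE":
        return None  # assumption: router pricing is unknown up front
    return float(cell.lstrip("$").replace(",", ""))


def shortlist(rows, min_context: int, max_monthly: float):
    """Return model names meeting both limits, cheapest first."""
    kept = []
    for name, monthly_cell, context_cell in rows:
        monthly = parse_monthly(monthly_cell)
        if monthly is None or monthly > max_monthly:
            continue
        if parse_context(context_cell) >= min_context:
            kept.append((monthly, name))
    return [name for _, name in sorted(kept)]


# (model, est. monthly, context) taken from the table above.
rows = [
    ("Auto Router", "VARIABLE", "2.0M"),
    ("xAI: Grok 4.1 Fast", "$13.00", "2.0M"),
    ("Llama 4 Scout", "$7.00", "10.0M"),
    ("OpenAI: GPT-5.4 Pro", "$3,000.00", "1.1M"),
]
print(shortlist(rows, min_context=2_000_000, max_monthly=50.0))
```

Routers with variable pricing are skipped here because no fixed estimate is available; a different policy, such as listing them separately, would be equally reasonable.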
AI ROI Leaderboard & Discovery by LeadsCalc
Whitelabel Context Leaderboard for your site
Embed the interactive long-context view on your own domain, with whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.