Interactive leaderboard

Cheapest LLM APIs 2026: Low-Cost AI Models Ranked by Workload

Discover the cheapest LLM APIs in 2026: blended input/output cost, batch discounts, and prompt caching. Compare budget AI models for startups and agencies in the US, Canada, and Australia.

Lowest estimated monthly API cost for the same workload in 2026

The cheapest tab ranks models by estimated spend for your exact monthly requests and token pattern, including batch and cached pricing only when our database confirms eligibility. Founders and agencies in the United States, Canada, and Australia use it to protect margins on high-volume chat, summarization, and RAG pipelines without guessing list prices from blog posts.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.

OffOn

Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply

OffOn

Deep Reasoning / Thinking Mode

Model hidden reasoning / extended thinking charged like output tokens when enabled.

OffOn

Batch Pricing

Enable for 50% off input & output where batch/async pricing applies

OffOn
≈ $100.00/mo
8K
1K1.0M
≈ $100.00/mo
2K
100500K
≈ $200.00 total
5K
10100K

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.

Magic quadrant (top 15)

X: est. monthly · Y: Cheapest (est. monthly) · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

PickModelEst. monthlyROI scoreCodingReasoningSpeedMathContextOverall
Auto Router
VARIABLE
81
90
90
70
90
2.0M
90Auto Router optimizes across models for best output. Evidence cites top models reaching 92.3% MMLU. Mapped to 90 across logic, coding, and math to reflect frontier routing capabilities. Vision price defaulted to $0.007 per tier guidelines.
OpenRouter: Fusion
VARIABLE
78
85
85
40
85
128K
85No explicit Fusion scores provided. Inferred as a heavyweight ensemble ('panel of expert models'), mapping to ~85 across coding (SWE-bench) and logic (GPQA). Speed is rated lower (40) due to multi-model deliberation and web search overhead.
Elephant
Free
78
90
83
70
88
262K
86
Body Builder (beta)
VARIABLE
54
45
43
90
40
128K
43Lacking specific benchmarks, inferred from cited 1B-scale Fast-dLLM v2 (43.5 avg across HumanEval, GPQA, GSM8K). Mapped to 40-45 for coding, logic, math. Speed rated 90 for lightweight specialized API tool.
Owl Alpha
Free
67
65
68
85
60
1.0M
65No exact scores for Owl Alpha; inferred as a lightweight reasoning model ('fewer parameters', 'designed for speed'). Mapped to mid-tier 0-100 scale (Coding 65, Logic 65) reflecting its agentic focus but smaller size.
Free Models Router
Free
56
45
48
90
45
200K
46No specific benchmarks cited for the Free Router. Inferred lightweight tier scores (Coding/Logic ~45) based on typical free 8B-class models. Speed rated high (90). Vision price is $0 as the endpoint is free.
Google: Lyria 3 Pro Preview
Free
68
70
70
55
60
1.0M
68Evidence notes Lyria 3 Pro scores well on SWE-bench and MMLU without exact figures. Mapped to 70s for Pro tier. Speed is 39.5 tok/s (55). Multimodal audio generation from images supported; default Pro vision price applied.
Google: Lyria 3 Clip Preview
Free
31
0
5
50
0
1.0M
3Lyria 3 is a specialized music generation model lacking standard LLM benchmarks (SWE-bench, GPQA). Assigned 0 for coding/logic/math. Speed mapped to 50 from 38 tok/s. Multimodal scored 85 for native image-to-audio generation.
Pareto Code Router
VARIABLE
78
88
85
70
85
2.0M
86OpenRouter docs state this is a router defaulting to High tier coding models based on Artificial Analysis percentiles. Lacking specific raw benchmarks, scores are mapped to ~85 reflecting flagship-level routed performance. Text-only inputs confirmed.
inclusionAI: Ling-2.6-flash$0.7082
65
68
90
65
262K
66Evidence cites GPQA, AIME, and LiveCodeBench without raw scores. Mapped to ~65 for coding/logic based on claimed ~40B dense equivalence. Speed scored 90 due to 200+ tokens/s. Flash tier adjustment applied.
Meta: Llama 3.1 8B Instruct$1.1063
45
48
90
25
131K
41MMLU 66.7%, GPQA 8.72%, MATH 15.56%, IFEval 49.22%. As an 8B lightweight tier, scores map to low/moderate logic (45) and math (25). Speed is rated high (90) due to its small, efficient architecture.
Mistral: Mistral Nemo$1.1065
45
50
90
35
131K
45GPQA 5.37% maps to Logic 35. IFEval 63.8% maps to Instruction 65. MATH Lvl 5 is 12.69%. As a 12B lightweight model, it scores lower on coding/logic than flagships but achieves high speed (90).
IBM: Granite 4.0 Micro$1.8065
45
58
85
65
131K
56Based on GPQA 32.15% (Logic ~35) and HumanEval 81.00% (Coding ~45). IFEval averages 84.32% (Instruction ~80). As a 3B 'Micro' tier model, scores reflect lower reasoning capacity compared to flagships, but strong instruction following.
Sao10K: Llama 3 8B Lunaris$2.1062
45
63
90
45
8K
54Evidence lacks specific benchmark scores. Inferred from Llama 3 8B lightweight tier: coding and math mapped to ~45, logic ~60. Speed rated high (90) due to small 8B parameter size.
LiquidAI: LFM2-24B-A2B$2.4061
71
44
97
55
128K
54Benchable.ai cites Coding 71%, Reasoning 50%, Instruction 38%, and Speed at 97th percentile. Mapped directly to 0-100 scale. As a lightweight 2B active MoE, it prioritizes speed over flagship-level logic and coding.
OpenAI: gpt-oss-20b$2.5675
70
80
90
95
131K
81GPQA 71.5% and MMLU 85.3% map to Logic 80. AIME 2025 98.7 maps to Math 95. As a 21B lightweight MoE, Speed is 90. Coding inferred at 70 due to lack of explicit SWE-bench.
Qwen: Qwen2.5 7B Instruct$2.6066
65
60
90
75
131K
65HumanEval 84.8% and GPQA 36.4% map to 65 coding and 45 logic. IFEval 71.2% maps to 75 instruction. As a 7B lightweight model, it scores lower than flagships but achieves high speed (138 tokens/s, mapped to 90).
Qwen: Qwen-Turbo$2.6058
50
50
50
50
131K
50
Mistral: Mistral Small 3$2.8071
75
75
90
75
33K
75HumanEval 88.41% maps to coding 75. GPQA Diamond 45.96% maps to logic 70. As a 24B 'Small' tier model, it scores lower than flagships but achieves high speed (90).
Amazon: Nova Micro 1.0$2.8066
68
63
95
75
128K
67HumanEval 81.1% (Coding 68), GPQA 40% (Logic 45), IFEval 87.2% (Instruction 80), GSM8K 92.3% (Math 75). As a 'Micro' tier model, speed is rated very high (95) while coding and logic reflect its lightweight, text-only nature.
Cohere: Command R7B (12-2024)$3.0052
35
48
90
40
128K
43Evidence explicitly states no benchmark data is available, requiring a conservative rating. As a 7B lightweight model, capabilities are estimated lower than flagships, while speed is rated high (90) due to its small, fast architecture.
IBM: Granite 4.1 8B$3.0065
68
64
90
65
131K
65HumanEval >89.7% (coding ~68), MMLU >66% (logic ~58), IFEval >74.8% (instruction ~70) based on 3.3 baseline. Lightweight 8B tier adjustments applied for high speed and scaled capabilities.
MythoMax 13B$3.0050
35
45
85
30
4K
39Evidence cites SWE-bench and HumanEval without exact scores. As a 13B Llama-2 fine-tune, capabilities are inferred as lightweight tier. Assigned ~35-40 for coding/logic, with high speed (85) reflecting its small parameter count and fast inference claims.
Google: Gemma 3 4B$3.0064
45
68
95
75
131K
64Based on IFEval (90.2%) mapped to 90 Instruction, MATH (75.6%) to 75 Math, and MMLU-Pro (43.6%) to 45 Logic. As a 4B lightweight tier, Coding (MBPP 63.2%) maps to 45, while Speed is rated 95.
Meta: Llama 3.2 1B Instruct$3.0945
25
35
95
30
131K
31MMLU 49.3%, GSM8K 44.4%, MATH 30.6%. As a 1B lightweight model, Logic (35) and Math (30) reflect sub-50% benchmarks. Coding (25) based on 0.6 index. Speed (95) is maximized for this ultra-small tier.
NVIDIA: Nemotron Nano 9B V2$3.2067
70
63
90
85
131K
70Evidence shows GPQA at 64.0% (Logic ~65) and LiveCodeBench at 72.4% (Coding ~70). MATH-500 is 97.8% (Math ~85). As a 9B lightweight reasoning model, it achieves high speed (115 tok/s, Speed ~90) but lacks multimodal support.
Arcee AI: Trinity Mini$3.3076
82
89
90
88
131K
87GPQA Diamond at 92.1% maps to 92 Logic. AIME 2025 at 58.6% maps to 88 Math. LM Market Cap coding score of 82 maps to 82 Coding. As a 'Mini' tier, speed is rated high (90).
OpenAI: gpt-oss-120b$3.3670
70
80
95
75
131K
76HumanEval 71% maps to 70 coding. MMLU 66-90% maps to 80 logic. GSM8K 75% maps to 75 math. 500 tok/s throughput maps to 95 speed. Native reasoning supported via OpenRouter reasoning parameter.
Google: Gemma 3 12B$3.5068
70
70
88
85
131K
74HumanEval 85.4% (Coding ~70), GPQA 40.9% (Logic ~55), IFEval 88.9% (Instruction ~85), MATH 83.8% (Math ~85). As a 12B lightweight tier, scores reflect strong math/instruction but moderate logic/coding compared to flagships.
Google: Gemma 3n 4B$3.6062
60
63
90
65
33K
63ARC-E 81.6% and HellaSwag 78.6% map to ~55 Logic. Outperforms Gemma 2 9B on HumanEval, mapping to ~60 Coding. As a 4B lightweight model, Speed is rated high (~90) while reasoning is scaled down.
Qwen: Qwen3 30B A3B Instruct 2507$3.8665
68
70
85
70
131K
70Evidence cites HumanEval, GPQA, and IFEval testing but lacks exact scores. As a 30B (3.3B active) lightweight MoE, scores are inferred: Coding ~68, Logic ~65. Speed is high (77 tok/s), mapped to 85.
NVIDIA: Nemotron 3 Nano 30B A3B$4.0065
60
68
90
85
262K
70Evidence lacks exact percentages but notes AIME 2025 wins over DeepSeek-V3.1 (Math: 85) and GPQA/SWE-Bench Verified losses to GLM-4.7 (Logic: 70, Coding: 60). As a 3B-active Nano tier, speed is heavily weighted (90).
Microsoft: Phi 4$4.0064
65
68
85
70
16K
68Evidence lacks exact benchmark scores for base Phi-4 14B. Inferred mid-tier scores (Coding 65, Logic 65) based on its 14B parameter size. Multimodal and native reasoning are false as those belong to separate Phi-4-Vision and Phi-4-Reasoning variants.
Qwen: Qwen3 235B A22B Instruct 2507$4.6076
92
93
70
92
262K
92SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed.
Tencent: Hy3 preview$4.6271
80
83
70
85
262K
83No benchmark scores provided in evidence. Inferred coding (80) and logic (85) based on its high-efficiency MoE architecture and explicit support for configurable reasoning modes designed for agentic workflows.
Amazon: Nova Lite 1.0$4.8065
75
73
90
75
300K
74HumanEval 85.4% (Coding ~75), GPQA 42% (Logic ~60), IFEval 89.7% (Instruction ~85), MATH 73.3% (Math ~75). As a 'Lite' tier model, it prioritizes speed (~90) over flagship reasoning, reflecting lower GPQA and coding capabilities.
Google: Gemma 3 27B$4.8062
60
73
75
65
131K
68MMLU-Pro 67.5 and LiveCodeBench 29.7 map to logic 70 and coding 60. Multimodal supported (free API tier). No native reasoning tokens confirmed. Mid-weight 27B tier offers balanced speed and capability.
Reka Edge$5.0049
45
48
85
45
16K
46No exact benchmark scores provided in evidence. Inferred capabilities (Coding 45, Logic 45) based on its 7B lightweight tier. Speed rated high (85) for 7B efficiency. Multimodal scored 65 due to strong vision optimization claims.
Mistral: Mistral Small 3.2 24B$5.0063
75
70
88
70
128K
71GPQA Diamond 46.13% maps to 65 logic; HumanEval+ 92.90% maps to 75 coding; MATH 69.42% maps to 70 math. As a 24B Small tier model, speed is rated high (88) reflecting its lightweight, cost-optimized architecture.
Z.ai: GLM 4 32B$5.0064
67
78
60
65
128K
72HumanEval 67.3% (Coding 67), MMLU 81.0% (Logic 80), MATH 52.1% (Math 65). Mid-tier 32B model; scores reflect solid but non-frontier capabilities. Caching and batch supported.
Mistral: Ministral 3 3B 2512$5.0047
35
45
90
45
131K
43Evidence lacks exact benchmark scores. As a 3B lightweight tier, coding (35) and logic (40) are inferred significantly below Mistral Large. Speed (90) is high due to tiny size. Native reasoning tokens are explicitly supported.
Qwen: Qwen3 235B A22B Thinking 2507$5.0075
88
93
60
95
262K
92GPQA 81.1% maps to Logic 95. MMLU-Pro 84.4% and HMMT25 83.9% map to Instruction 90 and Math 95. Coding inferred high (88) via LiveCodeBench. Speed 55 tok/s maps to 60. Flagship 235B reasoning model.
Qwen: Qwen3.5-Flash$5.2075
88
92
95
95
1.0M
92SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models.
Meta: Llama 3.2 3B Instruct$5.3951
35
55
95
55
131K
50Lightweight 3B tier. Mapped MMLU 63.4% to Logic 40, IFEval 73.9% to Instruction 70, and GSM8K 77.7% to Math 55. Coding inferred low (35) lacking SWE-bench. Speed rated 95 for 3B size.
Qwen: Qwen3.5-9B$5.5069
70
87
85
83
262K
82GPQA Diamond 81.7% maps to 82 logic. IFEval 91.5% maps to 91 instruction. LiveCodeBench 65.6% maps to 70 coding. MMMU 78.4% maps to 78 multimodal. As a 9B lightweight model, speed is high (85).
Qwen: Qwen3 Coder 30B A3B Instruct$5.5058
65
65
90
60
160K
64GPQA 51.6% (Logic 60) and LiveCodeBench 40.3% (Coding 65) reflect its 3B active parameter lightweight MoE architecture. Speed is high at 112 tok/s (Speed 90). Math maps to 60 based on AIME 29.7%.
Baidu: ERNIE 4.5 21B A3B Thinking$5.6057
55
63
70
65
131K
61Evidence lacks specific benchmark scores. Inferred as a lightweight 21B MoE 'thinking' tier. Assigned moderate coding (55) and logic (65) due to small size but native reasoning tokens. Speed (70) reflects lightweight architecture with thinking overhead.
Baidu: ERNIE 4.5 21B A3B$5.6054
50
58
85
60
131K
56Evidence lacks exact SWE-bench/GPQA scores. Mapped 21B (3B active) lightweight MoE tier to Coding 50, Logic 55. Speed rated 85 for sparse architecture. 'Thinking' SKU indicates native reasoning. Multimodal claimed; used small-tier default.

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files and start the download—and to email you a copy if the site owner enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Cheapest (est. monthly)).

Instant setup
No CC required

By submitting, you agree to our Privacy Policy and Terms.

Agency accelerator

Whitelabel Est. monthly Leaderboard
for your site

Embed the interactive cheapest (est. monthly) view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*
GET STARTED

NO CREDIT CARD REQUIRED

How it works

Methodology: How we rank Cost-Optimized LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

How batch and prompt caching affect “cheapest” rankings

Rankings are based on the blended cost of input and output tokens for a comparable monthly workload. Where our database verifies eligibility, we factor in typical 50% batch API discounts and prompt-caching savings (caching is applied in estimates only when our pipeline has confirmed supports_caching for that model). Figures are indicative for buyers comparing vendors transparently—not a quote from any single provider.

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Open Battle ArenaUp to 4 models · Live estimates
Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

When the cheapest model is not the best business choice

Extreme savings can hide weaker reliability or context limits. Use this tab to find a cost floor, then cross-check coding or reasoning tabs for quality gates. Enterprise buyers in Australia and Canada often layer vendor DPAs and region routing on top of list economics; US teams frequently validate support SLAs before migrating traffic.

Production deployment

High-Volume Data Processing

How teams in the US, Canada, and Australia deploy these models in production.

Bulk extraction, sentiment analysis, and content moderation at scale

When processing millions of rows of unstructured data, token economics dictate feasibility. Data engineering teams in the US and Canada use these cost-optimized models for bulk entity extraction, continuous social media sentiment analysis, and automated content moderation, where 'good enough' accuracy at 1/100th the price of a frontier model creates massive business value.

Architecture

Maximizing API Cost Efficiency

Strategies to reduce monthly API spend without sacrificing capability.

100% batch API utilization and aggressive prompt caching

To achieve the absolute lowest cost per million tokens, modern architectures combine two features: Batch APIs (which typically offer a 50% discount for 24-hour turnaround) and Prompt Caching (which discounts large, static system prompts by up to 90%). Filter this leaderboard by 'Batch pricing' to identify vendors that support these aggressive cost-reduction features.

Embed-ready

Need this live Cost-Optimized data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Cost-Optimized leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this ToolWhite-label · No code required
United StatesCanadaAustralia
Live preview

Your visitors compare Cost-Optimized models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

Start with realistic token and request volumes, enable batch pricing only when your integration supports it, and turn on cached pricing when your model and provider document cache tiers. Agencies across the US, Canada, and Australia use this table to shortlist models before negotiating enterprise commits.