Interactive leaderboard

Best LLM ROI 2026: Value vs. API Cost for AI Teams

See the best LLM ROI in 2026: value density from benchmarks divided by estimated API spend. Compare cost-effective AI models for product teams in the US, Canada, and Australia.

Value-for-money LLM rankings you can explain to finance in 2026

ROI mode highlights models that punch above their price: strong benchmark signals relative to estimated monthly API bills. Revenue teams and consultancies in the United States, Canada, and Australia use it to defend model choices in proposals—especially when clients ask why you did not default to the most famous flagship.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.


Use Cached Pricing

Enable to apply 50% off input tokens where cached rates apply.


Deep Reasoning / Thinking Mode

When enabled, the model's hidden reasoning (extended thinking) tokens are charged like output tokens.


Batch Pricing

Enable for 50% off input and output tokens where batch/async pricing applies.


Cached and batch estimated monthly values change only after the pipeline sets supports_caching or supports_batch for a model in Supabase. Until then, the toggles here only narrow the table to models whose catalog or provider typically supports those modes.
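The toggle descriptions above amount to a small piece of arithmetic. The sketch below is one way to express it, not the calculator's actual code: the function name, parameter names, and per-million-token prices are hypothetical, and it assumes the cached discount applies to all input tokens and that the cached and batch discounts stack multiplicatively. Only the 50% figures, the "reasoning billed like output" rule, and the supports_caching / supports_batch flags come from this page.

```python
def estimate_monthly_cost(
    input_price_per_m: float,    # USD per 1M input tokens (hypothetical price)
    output_price_per_m: float,   # USD per 1M output tokens (hypothetical price)
    requests: int,               # requests per month
    input_tokens: int,           # input tokens per request
    output_tokens: int,          # visible output tokens per request
    reasoning_tokens: int = 0,   # hidden thinking tokens per request
    use_cached: bool = False,
    use_batch: bool = False,
    supports_caching: bool = False,
    supports_batch: bool = False,
) -> float:
    """Estimate monthly spend in USD for a fixed per-request token profile."""
    input_rate = input_price_per_m
    output_rate = output_price_per_m

    # Cached pricing: 50% off input tokens, only when the model is flagged.
    # Assumption: every input token is billed at the cached rate.
    if use_cached and supports_caching:
        input_rate *= 0.5

    # Batch pricing: 50% off input and output, only when the model is flagged.
    # Assumption: this stacks multiplicatively with the cached discount.
    if use_batch and supports_batch:
        input_rate *= 0.5
        output_rate *= 0.5

    # Reasoning tokens are charged like output tokens.
    billed_output = output_tokens + reasoning_tokens
    per_request = (input_tokens * input_rate + billed_output * output_rate) / 1_000_000
    return requests * per_request


# Example with made-up prices: 10,000 requests, 2,000 input and 500 output tokens each.
baseline = estimate_monthly_cost(1.0, 4.0, 10_000, 2_000, 500)
discounted = estimate_monthly_cost(
    1.0, 4.0, 10_000, 2_000, 500,
    use_cached=True, use_batch=True,
    supports_caching=True, supports_batch=True,
)
```

Gating each discount on its support flag mirrors the note above: switching a toggle on has no effect on the estimate for a model that has not been flagged as supporting that mode.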

Magic quadrant (top 15)

X: est. monthly · Y: ROI / Value · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Notes
inclusionAI: Ling-2.6-flash | $0.70 | 82 | 65 | 68 | 90 | 65 | 262K | 66 | Evidence cites GPQA, AIME, and LiveCodeBench without raw scores. Mapped to ~65 for coding/logic based on claimed ~40B dense equivalence. Speed scored 90 due to 200+ tokens/s. Flash tier adjustment applied.
Auto Router | VARIABLE | 81 | 90 | 90 | 70 | 90 | 2.0M | 90 | Auto Router optimizes across models for best output. Evidence cites top models reaching 92.3% MMLU. Mapped to 90 across logic, coding, and math to reflect frontier routing capabilities. Vision price defaulted to $0.007 per tier guidelines.
Elephant | Free | 78 | 90 | 83 | 70 | 88 | 262K | 86 |
Pareto Code Router | VARIABLE | 78 | 88 | 85 | 70 | 85 | 2.0M | 86 | OpenRouter docs state this is a router defaulting to High tier coding models based on Artificial Analysis percentiles. Lacking specific raw benchmarks, scores are mapped to ~85 reflecting flagship-level routed performance. Text-only inputs confirmed.
OpenRouter: Fusion | VARIABLE | 78 | 85 | 85 | 40 | 85 | 128K | 85 | No explicit Fusion scores provided. Inferred as a heavyweight ensemble ('panel of expert models'), mapping to ~85 across coding (SWE-bench) and logic (GPQA). Speed is rated lower (40) due to multi-model deliberation and web search overhead.
Switchpoint Router | VARIABLE | 78 | 85 | 85 | 30 | 85 | 131K | 85 | No raw benchmarks provided for the router; inferred frontier-level scores as it routes to top models (evidence cites Opus 4.1 at 74.5% SWE-bench Verified). Speed is scored low due to 2.5-6.0 tok/s reported throughput.
Arcee AI: Trinity Mini | $3.30 | 76 | 82 | 89 | 90 | 88 | 131K | 87 | GPQA Diamond at 92.1% maps to 92 Logic. AIME 2025 at 58.6% maps to 88 Math. LM Market Cap coding score of 82 maps to 82 Coding. As a 'Mini' tier, speed is rated high (90).
Qwen: Qwen3 235B A22B Instruct 2507 | $4.60 | 76 | 92 | 93 | 70 | 92 | 262K | 92 | SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed.
OpenAI: gpt-oss-20b | $2.56 | 75 | 70 | 80 | 90 | 95 | 131K | 81 | GPQA 71.5% and MMLU 85.3% map to Logic 80. AIME 2025 98.7 maps to Math 95. As a 21B lightweight MoE, Speed is 90. Coding inferred at 70 due to lack of explicit SWE-bench.
Qwen: Qwen3 235B A22B Thinking 2507 | $5.00 | 75 | 88 | 93 | 60 | 95 | 262K | 92 | GPQA 81.1% maps to Logic 95. MMLU-Pro 84.4% and HMMT25 83.9% map to Instruction 90 and Math 95. Coding inferred high (88) via LiveCodeBench. Speed 55 tok/s maps to 60. Flagship 235B reasoning model.
Qwen: Qwen3.5-Flash | $5.20 | 75 | 88 | 92 | 95 | 95 | 1.0M | 92 | SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models.
Xiaomi: MiMo-V2-Flash | $7.00 | 72 | 92 | 88 | 90 | 95 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 83.7% maps to 90 logic. AIME 94.1% maps to 95 math. Despite the 'Flash' name, its 309B MoE architecture and native reasoning deliver flagship-level scores.
Mistral: Mistral Small 3 | $2.80 | 71 | 75 | 75 | 90 | 75 | 33K | 75 | HumanEval 88.41% maps to coding 75. GPQA Diamond 45.96% maps to logic 70. As a 24B 'Small' tier model, it scores lower than flagships but achieves high speed (90).
NVIDIA: Nemotron 3 Super | $8.10 | 71 | 92 | 89 | 75 | 95 | 1.0M | 91 | SWE-bench at 60.47% maps to 92 coding. GPQA at 79.23% maps to 92 logic. AIME25 at 90.21% yields 95 math. High scores reflect its 120B frontier reasoning capabilities.
Tencent: Hy3 preview | $4.62 | 71 | 80 | 83 | 70 | 85 | 262K | 83 | No benchmark scores provided in evidence. Inferred coding (80) and logic (85) based on its high-efficiency MoE architecture and explicit support for configurable reasoning modes designed for agentic workflows.
StepFun: Step 3.5 Flash | $6.60 | 70 | 88 | 83 | 95 | 95 | 262K | 87 | SWE-bench Verified 74.4% maps to 88 coding; AIME 99.8% maps to 95 math. Despite Flash tier, explicit evidence shows frontier-level SWE-bench Verified, elevating coding score. Speed is 95 (143 tok/s).
inclusionAI: Ling-2.6-1T | $9.25 | 70 | 92 | 90 | 75 | 92 | 262K | 91 | Evidence claims state-of-the-art on SWE-bench Verified and AIME26 without raw scores. As a 1T flagship, mapped coding and math to 92. Logic and instruction mapped to 90. Speed mapped to 75 due to 'fast execution' claims.
Qwen: Qwen3 30B A3B Thinking 2507 | $7.20 | 70 | 80 | 89 | 80 | 95 | 131K | 88 | GPQA Diamond 71.50 maps to logic 88. IFEval 90.09 maps to instruction 90. AIME 86.67 maps to math 95. Coding inferred at 80 (no SWE-bench). Speed 80 based on 75.5 tok/s throughput.
DeepSeek: DeepSeek V3.2 Speciale | $15.79 | 70 | 95 | 97 | 45 | 96 | 164K | 96 | GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead.
Nex AGI: DeepSeek V3.1 Nex N1 | $10.40 | 70 | 96 | 90 | 70 | 90 | 131K | 92 | Flagship tier. SWE-bench Verified at 70.6 maps to 96 coding. BFCL v4 at 65.3 maps to 90 instruction. Logic and math inferred high (90) from flagship status. Speed set to 70 for heavy MoE.
Qwen: Qwen3 Next 80B A3B Thinking | $11.70 | 70 | 90 | 93 | 45 | 97 | 262K | 93 | GPQA at 77.2% maps to 96 logic. IFEval at 88.9% maps to 90 instruction. AIME 2025 at 87.8% maps to 97 math. As an 80B thinking model, speed is lower (45). No SWE-bench cited; coding estimated at 90.
OpenAI: gpt-oss-120b | $3.36 | 70 | 70 | 80 | 95 | 75 | 131K | 76 | HumanEval 71% maps to 70 coding. MMLU 66-90% maps to 80 logic. GSM8K 75% maps to 75 math. 500 tok/s throughput maps to 95 speed. Native reasoning supported via OpenRouter reasoning parameter.
Qwen: Qwen3 32B | $6.00 | 69 | 78 | 85 | 70 | 88 | 131K | 84 | No exact percentages cited; evidence notes Qwen3 32B outperforms Qwen2.5 72B on GPQA and IFEval. Mapped Logic/Instruction to 85. Speed mapped to 70 based on 57 tok/s. Native reasoning supported via dual-mode architecture.
Gemma 4 31B | $8.40 | 69 | 78 | 89 | 45 | 95 | 262K | 88 | HumanEval 76.8% maps to coding 78. MMLU 87.1% maps to logic 85. IFEval 93.7% maps to instruction 92. GSM8k 97.6% maps to math 95. Speed 8.52 t/s maps to 45. Native reasoning confirmed via 'reasoning_details'.
Arcee AI: Trinity Large Thinking | $17.30 | 69 | 95 | 94 | 45 | 98 | 262K | 95 | SWE-bench Verified at 63.2% maps to 95 coding. GPQA-Diamond at 76.3% maps to 95 logic. AIME 2025 at 96.3% maps to 98 math. As a heavy reasoning model (398B MoE), speed is mapped to 45.
Meta: Llama 3.3 70B Instruct | $7.20 | 69 | 85 | 86 | 75 | 85 | 131K | 86 | SWE-bench Verified 54.6% maps to 85 coding. GPQA 50.5% and MMLU 86% map to 80 logic. IFEval 92.1% yields 92 instruction. MATH 77% gives 85 math. 70B tier implies 75 speed. Text-only model.
Qwen: Qwen3.5-9B | $5.50 | 69 | 70 | 87 | 85 | 83 | 262K | 82 | GPQA Diamond 81.7% maps to 82 logic. IFEval 91.5% maps to 91 instruction. LiveCodeBench 65.6% maps to 70 coding. MMMU 78.4% maps to 78 multimodal. As a 9B lightweight model, speed is high (85).
Gemma 4 26B A4B | $5.70 | 68 | 77 | 81 | 90 | 88 | 262K | 82 | GPQA Diamond 82.3% maps to 82 logic. LiveCodeBench 77.1% maps to 77 coding. AIME 88.3% maps to 88 math. As a 3.8B active MoE, speed is rated 90. MMMU Pro 73.8% yields 78 multimodal.
inclusionAI: Ring-2.6-1T | $9.25 | 68 | 88 | 86 | 60 | 90 | 262K | 88 | Evidence claims SOTA on SWE-bench Verified and AIME26 but lacks exact percentages. As a 1T-parameter flagship MoE, coding and math are scored high (88-90). Speed is 60 based on 54.4 tokens/s throughput.
DeepSeek V3.2 | $12.58 | 68 | 92 | 89 | 45 | 95 | 131K | 91 | SWE-bench at 67.8% maps to 92 for coding. MMLU-Pro 85.0% maps to 90 for logic. AIME 2025 at 89.3% yields 95 for math. As a frontier reasoning model, speed is 45.
Google: Lyria 3 Pro Preview | Free | 68 | 70 | 70 | 55 | 60 | 1.0M | 68 | Evidence notes Lyria 3 Pro scores well on SWE-bench and MMLU without exact figures. Mapped to 70s for Pro tier. Speed is 39.5 tok/s (55). Multimodal audio generation from images supported; default Pro vision price applied.
Google: Gemma 3 12B | $3.50 | 68 | 70 | 70 | 88 | 85 | 131K | 74 | HumanEval 85.4% (Coding ~70), GPQA 40.9% (Logic ~55), IFEval 88.9% (Instruction ~85), MATH 83.8% (Math ~85). As a 12B lightweight tier, scores reflect strong math/instruction but moderate logic/coding compared to flagships.
ByteDance Seed: Seed 1.6 Flash | $6.00 | 68 | 87 | 82 | 85 | 77 | 262K | 82 | Benchable.ai cites Coding 87%, Reasoning 94%, Math 77%, and Instruction 70%, mapped directly to 0-100. As a Flash-tier model, it is optimized for speed (85) with native reasoning tokens, reflecting lightweight capabilities versus flagship models.
DeepSeek: DeepSeek V3.1 | $16.30 | 68 | 90 | 93 | 75 | 92 | 164K | 92 | GPQA Diamond at 74.9% maps to 95 logic. AIME 2025 at 49.8% maps to 92 math. SWE-bench Verified noted as strength, mapping to 90 coding. Flagship 671B MoE tier yields ~75 speed.
Qwen3.6 35B A3B | $15.60 | 67 | 92 | 90 | 85 | 90 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 86.0% maps to 92 logic. MMMU 81.7% maps to 90 multimodal. 3B active MoE architecture ensures high speed (85).
DeepSeek: DeepSeek V3.1 Terminus | $20.30 | 67 | 95 | 93 | 70 | 92 | 164K | 93 | SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found.
NVIDIA: Nemotron Nano 9B V2 | $3.20 | 67 | 70 | 63 | 90 | 85 | 131K | 70 | Evidence shows GPQA at 64.0% (Logic ~65) and LiveCodeBench at 72.4% (Coding ~70). MATH-500 is 97.8% (Math ~85). As a 9B lightweight reasoning model, it achieves high speed (115 tok/s, Speed ~90) but lacks multimodal support.
Xiaomi: MiMo-V2.5 | $8.40 | 67 | 78 | 86 | 60 | 85 | 1.0M | 84 | Pro-level agentic performance maps to MiMo-V2-Pro's SWE-bench Verified (78.0%) and GPQA Diamond (87.0%), yielding ~78 Coding and ~87 Logic. Speed reflects 29 tok/s. Multimodal is strong per omnimodal claims.
Qwen: Qwen3.5-122B-A10B | $31.20 | 67 | 95 | 94 | 45 | 95 | 262K | 95 | SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45.
Qwen: Qwen3.5-35B-A3B | $15.60 | 67 | 85 | 90 | 90 | 95 | 262K | 90 | SWE-bench (69.2%) maps to 85 coding; GPQA Diamond (84.2%) maps to 88 logic. As a 35B MoE activating 3B parameters (lightweight tier), it achieves high efficiency and speed (90) while maintaining strong native reasoning capabilities.
Owl Alpha | Free | 67 | 65 | 68 | 85 | 60 | 1.0M | 65 | No exact scores for Owl Alpha; inferred as a lightweight reasoning model ('fewer parameters', 'designed for speed'). Mapped to mid-tier 0-100 scale (Coding 65, Logic 65) reflecting its agentic focus but smaller size.
Amazon: Nova Micro 1.0 | $2.80 | 66 | 68 | 63 | 95 | 75 | 128K | 67 | HumanEval 81.1% (Coding 68), GPQA 40% (Logic 45), IFEval 87.2% (Instruction 80), GSM8K 92.3% (Math 75). As a 'Micro' tier model, speed is rated very high (95) while coding and logic reflect its lightweight, text-only nature.
MoonshotAI: Kimi K2.5 | $35.00 | 66 | 95 | 93 | 65 | 98 | 262K | 95 | SWE-bench Verified 76.8% maps to 95 coding. GPQA Diamond 87.9% maps to 96 logic. AIME 2025 96.1% maps to 98 math. 1T MoE flagship tier; native reasoning supported.
Qwen: Qwen2.5 7B Instruct | $2.60 | 66 | 65 | 60 | 90 | 75 | 131K | 65 | HumanEval 84.8% and GPQA 36.4% map to 65 coding and 45 logic. IFEval 71.2% maps to 75 instruction. As a 7B lightweight model, it scores lower than flagships but achieves high speed (138 tokens/s, mapped to 90).
Tencent: Hunyuan A13B Instruct | $11.30 | 66 | 80 | 84 | 85 | 95 | 131K | 86 | GPQA-Diamond 71.2 and MMLU 88.17 map to high logic (85). MATH 94.3 indicates elite math (95). IFEval 84.7 maps to solid instruction (82). LiveCodeBench 63.9 maps to strong coding (80). 13B active MoE ensures fast speed (85).
DeepSeek: DeepSeek V4 Flash | $5.90 | 66 | 68 | 79 | 90 | 85 | 1.0M | 78 | V4 flagship claims 80%+ SWE-bench; Flash tier (13B active) lacks explicit scores but is inferred ~68 for coding. Logic and Math scaled down for Flash efficiency. Speed rated 90 for fast inference design.
DeepSeek: R1 0528 | $41.50 | 66 | 92 | 94 | 45 | 98 | 164K | 95 | GPQA 81.0% maps to 98 logic. AIME 2024 91.4% maps to 98 math. LiveCodeBench 73.3% maps to 92 coding. Speed reflects 287 c/s but heavy reasoning overhead (23K thinking tokens).
Z.ai: GLM 4.5 Air | $13.50 | 66 | 88 | 87 | 75 | 85 | 131K | 87 | SWE-bench Verified at 57.6% maps to 88 coding. GPQA at 75.0% maps to 88 logic. As an 'Air' lightweight tier, speed is 50 tokens/s (mapped to 75). Native reasoning is supported via a thinking mode boolean.

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files and start the download—and to email you a copy if the site owner enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: ROI / Value).


Agency accelerator

Whitelabel ROI Score Leaderboard for your site

Embed the interactive ROI / Value view on your own domain, with whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics


How it works

Methodology: How we rank ROI / Value LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

Defining ROI on this leaderboard (quality per dollar)

ROI rankings combine normalized quality benchmarks with estimated monthly API spend for the same interactive workload. The goal is value density: which models deliver the most capability per dollar—helping growth, support, and product teams in the US, Canada, and Australia justify stack decisions with numbers, not hype.
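This page does not publish the exact formula behind the ROI score, so the following is only a minimal sketch of the "capability per dollar" idea, assuming the plainest possible definition: overall quality divided by estimated monthly spend. The function names are hypothetical. The sample figures (overall score and estimated monthly cost) are taken from three rows of the leaderboard table.

```python
def value_density(quality: float, monthly_cost_usd: float) -> float:
    """Quality points per dollar of estimated monthly API spend."""
    if monthly_cost_usd <= 0:
        # Free or variable-priced entries have no fixed dollar cost to divide by.
        raise ValueError("monthly cost must be positive")
    return quality / monthly_cost_usd


def rank_by_value(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return model names sorted from highest to lowest value density."""
    return sorted(models, key=lambda name: value_density(*models[name]), reverse=True)


# name -> (overall score, estimated monthly cost in USD)
sample = {
    "MoonshotAI: Kimi K2.5": (95, 35.00),
    "Qwen: Qwen3.5-Flash": (92, 5.20),
    "Mistral: Mistral Small 3": (75, 2.80),
}
ranking = rank_by_value(sample)
```

Under this naive ratio Mistral Small 3 ranks first, whereas the leaderboard gives Qwen3.5-Flash the higher ROI score (75 versus 71). The published score therefore appears to weight quality more heavily than a straight division does, so treat the sketch as an illustration of the concept and not as a reproduction of the ranking.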

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Interpreting ROI when your workload changes seasonally

ROI shifts when traffic spikes or when you enable vision and reasoning. Re-tune sliders to match launch vs. steady-state; if you bill clients in CAD or AUD, use relative rankings first, then apply your FX and tax reality. Many agencies embed this view so stakeholders see the same ROI story the technical team uses internally.
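As a rough illustration of the launch versus steady-state adjustment, the sketch below turns one steady-state monthly estimate into an annual budget and then applies FX and tax. It assumes spend scales linearly with traffic (reasonable for per-token pricing without volume tiers). The function name, the launch multiplier, the exchange rate, and the tax rate are all hypothetical placeholders to replace with your own figures.

```python
def seasonal_annual_budget(
    steady_monthly_usd: float,   # estimate from the sliders at steady-state traffic
    launch_multiplier: float,    # traffic during launch months relative to steady state
    launch_months: int,          # number of months per year at launch-level traffic
    fx_rate: float = 1.0,        # USD to local currency (hypothetical value in the example)
    tax_rate: float = 0.0,       # local tax as a fraction (hypothetical value in the example)
) -> float:
    """Annual API budget in local currency, assuming cost is linear in traffic."""
    if not 0 <= launch_months <= 12:
        raise ValueError("launch_months must be between 0 and 12")
    steady_months = 12 - launch_months
    annual_usd = steady_monthly_usd * (steady_months + launch_months * launch_multiplier)
    return annual_usd * fx_rate * (1 + tax_rate)


# $100/mo at steady state, two launch months at 3x traffic, reported in USD.
annual_usd = seasonal_annual_budget(100.0, 3.0, 2)

# Same workload with a placeholder FX rate and tax rate applied.
annual_local = seasonal_annual_budget(100.0, 3.0, 2, fx_rate=1.37, tax_rate=0.05)
```

Keeping FX and tax as the final multiplication step matches the advice above: compare models in USD first, then convert only the figure that goes into the proposal.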

Production deployment

B2B SaaS & Customer Experience

How teams in the US, Canada, and Australia deploy these models in production.

Customer support bots, copilot features, and workflow automation

High-ROI models are the backbone of profitable AI features in B2B SaaS. Product teams in the US and Australia deploy these 'sweet spot' models for customer support chatbots, in-app writing copilots, and automated email drafting—use cases that require high reliability and instruction-following, but cannot justify the margin hit of a premium frontier model on every API call.

Architecture

Value-Driven Architecture

Strategies to reduce monthly API spend without sacrificing capability.

Blending models based on query complexity

Maximizing ROI requires dynamic routing. By using a fast, cheap model to classify user intent, you can route 80% of queries to a high-ROI mid-tier model, and only escalate the remaining 20% to an expensive reasoning model. This leaderboard helps you identify the perfect mid-tier anchor for your architecture, balancing acceptable latency with sustainable API margins.
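The 80/20 split described above can be sketched as a small router plus a blended-cost calculation. Everything here is illustrative: the function names, the keyword-based classifier (a stand-in for a fast, cheap classification model), the stub handlers, and the per-query costs are assumptions, not part of this tool.

```python
from typing import Callable


def blended_cost_per_query(
    mid_tier_cost: float,       # cost of one query on the mid-tier model
    premium_cost: float,        # cost of one query on the reasoning model
    classifier_cost: float,     # cost of classifying one query
    escalation_rate: float = 0.20,
) -> float:
    """Expected cost per query when a share of traffic escalates to the premium model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    routed = (1 - escalation_rate) * mid_tier_cost + escalation_rate * premium_cost
    return classifier_cost + routed  # every query pays for classification


def route(
    query: str,
    classify: Callable[[str], str],
    mid_tier: Callable[[str], str],
    premium: Callable[[str], str],
) -> str:
    """Send 'complex' queries to the premium model and everything else to mid-tier."""
    handler = premium if classify(query) == "complex" else mid_tier
    return handler(query)


def toy_classifier(query: str) -> str:
    # Placeholder heuristic; a real system would call a small, cheap model here.
    hard_markers = ("prove", "debug", "step by step")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"


# Hypothetical per-query costs in USD.
blended = blended_cost_per_query(mid_tier_cost=0.002, premium_cost=0.03, classifier_cost=0.0002)
answer = route(
    "Reset my password",
    toy_classifier,
    mid_tier=lambda q: f"[mid-tier] {q}",
    premium=lambda q: f"[premium] {q}",
)
```

With these placeholder costs the blended figure comes out well below sending every query to the premium model, which is the margin argument the paragraph makes. The actual saving depends on how many queries truly need escalation and on how accurate the classifier is.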

Embed-ready

Need this live ROI / Value data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive ROI / Value leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Live preview

Your visitors compare ROI / Value models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

We pair composite quality signals with the monthly cost estimate from your token sliders. That highlights underpriced capability tiers—useful when agencies quote fixed-fee projects in USD, CAD, or AUD and need margin-friendly model choices.