Best LLM ROI 2026: Value vs. API Cost for AI Teams
See the best LLM ROI in 2026: value density from benchmarks divided by estimated API spend. Compare cost-effective AI models for product teams in the US, Canada, and Australia.
Value-for-money LLM rankings you can explain to finance in 2026
ROI mode highlights models that punch above their price: strong benchmark signals relative to estimated monthly API bills. Revenue teams and consultancies in the United States, Canada, and Australia use it to defend model choices in proposals—especially when clients ask why you did not default to the most famous flagship.
Workload & pricing toggles
Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.
Include Vision / Image Processing
Off — no image fees in cost estimates for vision-capable models.
Turn On to include image fees.
Use Cached Pricing
Enable to apply a 50% discount to input tokens where cached rates apply.
Deep Reasoning / Thinking Mode
When enabled, hidden reasoning (extended thinking) tokens are charged at the output-token rate.
Batch Pricing
Enable to apply a 50% discount to input and output tokens where batch/async pricing applies.
The cached and batch estimated monthly values change only after the pipeline sets supports_caching or supports_batch for a model in Supabase. The toggles here narrow the table to models whose catalog entry or provider typically supports those modes.
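As a rough illustration of how the toggles above could combine into a monthly estimate, here is a minimal Python sketch. It is not the calculator's actual implementation: the `Workload` and `Pricing` classes, the per-image fee, and the choice to stack the cached and batch discounts are all assumptions made for the example. Only the 50% discounts, the rule that reasoning tokens bill like output tokens, and the supports_caching / supports_batch flags come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical workload shape; field names are illustrative only.
    requests_per_month: int
    input_tokens_per_request: int
    output_tokens_per_request: int
    reasoning_tokens_per_request: int = 0
    images_per_request: int = 0


@dataclass
class Pricing:
    # Hypothetical price card in USD; real provider rates differ per model.
    input_per_million: float
    output_per_million: float
    image_fee: float = 0.0          # assumed flat fee per image
    supports_caching: bool = False  # mirrors the catalog flag named above
    supports_batch: bool = False    # mirrors the catalog flag named above


def estimate_monthly_cost(w, p, *, vision=False, cached=False,
                          reasoning=False, batch=False):
    """Return an estimated monthly bill in USD, rounded to cents."""
    input_rate = p.input_per_million
    output_rate = p.output_per_million

    # Cached pricing: 50% off input tokens, only where the model supports it.
    if cached and p.supports_caching:
        input_rate *= 0.5

    # Batch pricing: 50% off input and output, only where supported.
    # Assumption: this stacks with the cached discount.
    if batch and p.supports_batch:
        input_rate *= 0.5
        output_rate *= 0.5

    # Reasoning tokens are billed like output tokens when the toggle is on.
    output_tokens = w.output_tokens_per_request
    if reasoning:
        output_tokens += w.reasoning_tokens_per_request

    per_request = (w.input_tokens_per_request * input_rate
                   + output_tokens * output_rate) / 1_000_000

    # Image fees are included only when the vision toggle is on.
    if vision:
        per_request += w.images_per_request * p.image_fee

    return round(w.requests_per_month * per_request, 2)
```

Treating every input token as cached is a simplification; in practice only the repeated prefix of a prompt usually qualifies for cached rates, so a real estimate would need a cache-hit ratio as well.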
Magic quadrant (top 15)
X: est. monthly · Y: ROI / Value · Dot: provider color · Hover for rank, model & details
Full leaderboard
Showing 48 of 365 models.
| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Scoring notes |
|---|---|---|---|---|---|---|---|---|---|
| inclusionAI: Ling-2.6-flash | $0.70 | 82 | 65 | 68 | 90 | 65 | 262K | 66 | Evidence cites GPQA, AIME, and LiveCodeBench without raw scores. Mapped to ~65 for coding/logic based on claimed ~40B dense equivalence. Speed scored 90 due to 200+ tokens/s. Flash tier adjustment applied. |
| Auto Router | VARIABLE | 81 | 90 | 90 | 70 | 90 | 2.0M | 90 | Auto Router optimizes across models for best output. Evidence cites top models reaching 92.3% MMLU. Mapped to 90 across logic, coding, and math to reflect frontier routing capabilities. Vision price defaulted to $0.007 per tier guidelines. |
| Elephant | Free | 78 | 90 | 83 | 70 | 88 | 262K | 86 | |
| Pareto Code Router | VARIABLE | 78 | 88 | 85 | 70 | 85 | 2.0M | 86 | OpenRouter docs state this is a router defaulting to High tier coding models based on Artificial Analysis percentiles. Lacking specific raw benchmarks, scores are mapped to ~85 reflecting flagship-level routed performance. Text-only inputs confirmed. |
| OpenRouter: Fusion | VARIABLE | 78 | 85 | 85 | 40 | 85 | 128K | 85 | No explicit Fusion scores provided. Inferred as a heavyweight ensemble ('panel of expert models'), mapping to ~85 across coding (SWE-bench) and logic (GPQA). Speed is rated lower (40) due to multi-model deliberation and web search overhead. |
| Switchpoint Router | VARIABLE | 78 | 85 | 85 | 30 | 85 | 131K | 85 | No raw benchmarks provided for the router; inferred frontier-level scores as it routes to top models (evidence cites Opus 4.1 at 74.5% SWE-bench Verified). Speed is scored low due to 2.5-6.0 tok/s reported throughput. |
| Arcee AI: Trinity Mini | $3.30 | 76 | 82 | 89 | 90 | 88 | 131K | 87 | GPQA Diamond at 92.1% maps to 92 Logic. AIME 2025 at 58.6% maps to 88 Math. LM Market Cap coding score of 82 maps to 82 Coding. As a 'Mini' tier, speed is rated high (90). |
| Qwen: Qwen3 235B A22B Instruct 2507 | $4.60 | 76 | 92 | 93 | 70 | 92 | 262K | 92 | SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed. |
| OpenAI: gpt-oss-20b | $2.56 | 75 | 70 | 80 | 90 | 95 | 131K | 81 | GPQA 71.5% and MMLU 85.3% map to Logic 80. AIME 2025 98.7 maps to Math 95. As a 21B lightweight MoE, Speed is 90. Coding inferred at 70 due to lack of explicit SWE-bench. |
| Qwen: Qwen3 235B A22B Thinking 2507 | $5.00 | 75 | 88 | 93 | 60 | 95 | 262K | 92 | GPQA 81.1% maps to Logic 95. MMLU-Pro 84.4% and HMMT25 83.9% map to Instruction 90 and Math 95. Coding inferred high (88) via LiveCodeBench. Speed 55 tok/s maps to 60. Flagship 235B reasoning model. |
| Qwen: Qwen3.5-Flash | $5.20 | 75 | 88 | 92 | 95 | 95 | 1.0M | 92 | SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models. |
| Xiaomi: MiMo-V2-Flash | $7.00 | 72 | 92 | 88 | 90 | 95 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 83.7% maps to 90 logic. AIME 94.1% maps to 95 math. Despite the 'Flash' name, its 309B MoE architecture and native reasoning deliver flagship-level scores. |
| Mistral: Mistral Small 3 | $2.80 | 71 | 75 | 75 | 90 | 75 | 33K | 75 | HumanEval 88.41% maps to coding 75. GPQA Diamond 45.96% maps to logic 70. As a 24B 'Small' tier model, it scores lower than flagships but achieves high speed (90). |
| NVIDIA: Nemotron 3 Super | $8.10 | 71 | 92 | 89 | 75 | 95 | 1.0M | 91 | SWE-bench at 60.47% maps to 92 coding. GPQA at 79.23% maps to 92 logic. AIME25 at 90.21% yields 95 math. High scores reflect its 120B frontier reasoning capabilities. |
| Tencent: Hy3 preview | $4.62 | 71 | 80 | 83 | 70 | 85 | 262K | 83 | No benchmark scores provided in evidence. Inferred coding (80) and logic (85) based on its high-efficiency MoE architecture and explicit support for configurable reasoning modes designed for agentic workflows. |
| StepFun: Step 3.5 Flash | $6.60 | 70 | 88 | 83 | 95 | 95 | 262K | 87 | SWE-bench Verified 74.4% maps to 88 coding; AIME 99.8% maps to 95 math. Despite Flash tier, explicit evidence shows frontier-level SWE-bench Verified, elevating coding score. Speed is 95 (143 tok/s). |
| inclusionAI: Ling-2.6-1T | $9.25 | 70 | 92 | 90 | 75 | 92 | 262K | 91 | Evidence claims state-of-the-art on SWE-bench Verified and AIME26 without raw scores. As a 1T flagship, mapped coding and math to 92. Logic and instruction mapped to 90. Speed mapped to 75 due to 'fast execution' claims. |
| Qwen: Qwen3 30B A3B Thinking 2507 | $7.20 | 70 | 80 | 89 | 80 | 95 | 131K | 88 | GPQA Diamond 71.50 maps to logic 88. IFEval 90.09 maps to instruction 90. AIME 86.67 maps to math 95. Coding inferred at 80 (no SWE-bench). Speed 80 based on 75.5 tok/s throughput. |
| DeepSeek: DeepSeek V3.2 Speciale | $15.79 | 70 | 95 | 97 | 45 | 96 | 164K | 96 | GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead. |
| Nex AGI: DeepSeek V3.1 Nex N1 | $10.40 | 70 | 96 | 90 | 70 | 90 | 131K | 92 | Flagship tier. SWE-bench Verified at 70.6 maps to 96 coding. BFCL v4 at 65.3 maps to 90 instruction. Logic and math inferred high (90) from flagship status. Speed set to 70 for heavy MoE. |
| Qwen: Qwen3 Next 80B A3B Thinking | $11.70 | 70 | 90 | 93 | 45 | 97 | 262K | 93 | GPQA at 77.2% maps to 96 logic. IFEval at 88.9% maps to 90 instruction. AIME 2025 at 87.8% maps to 97 math. As an 80B thinking model, speed is lower (45). No SWE-bench cited; coding estimated at 90. |
| OpenAI: gpt-oss-120b | $3.36 | 70 | 70 | 80 | 95 | 75 | 131K | 76 | HumanEval 71% maps to 70 coding. MMLU 66-90% maps to 80 logic. GSM8K 75% maps to 75 math. 500 tok/s throughput maps to 95 speed. Native reasoning supported via OpenRouter reasoning parameter. |
| Qwen: Qwen3 32B | $6.00 | 69 | 78 | 85 | 70 | 88 | 131K | 84 | No exact percentages cited; evidence notes Qwen3 32B outperforms Qwen2.5 72B on GPQA and IFEval. Mapped Logic/Instruction to 85. Speed mapped to 70 based on 57 tok/s. Native reasoning supported via dual-mode architecture. |
| Gemma 4 31B | $8.40 | 69 | 78 | 89 | 45 | 95 | 262K | 88 | HumanEval 76.8% maps to coding 78. MMLU 87.1% maps to logic 85. IFEval 93.7% maps to instruction 92. GSM8k 97.6% maps to math 95. Speed 8.52 t/s maps to 45. Native reasoning confirmed via 'reasoning_details'. |
| Arcee AI: Trinity Large Thinking | $17.30 | 69 | 95 | 94 | 45 | 98 | 262K | 95 | SWE-bench Verified at 63.2% maps to 95 coding. GPQA-Diamond at 76.3% maps to 95 logic. AIME 2025 at 96.3% maps to 98 math. As a heavy reasoning model (398B MoE), speed is mapped to 45. |
| Meta: Llama 3.3 70B Instruct | $7.20 | 69 | 85 | 86 | 75 | 85 | 131K | 86 | SWE-bench Verified 54.6% maps to 85 coding. GPQA 50.5% and MMLU 86% map to 80 logic. IFEval 92.1% yields 92 instruction. MATH 77% gives 85 math. 70B tier implies 75 speed. Text-only model. |
| Qwen: Qwen3.5-9B | $5.50 | 69 | 70 | 87 | 85 | 83 | 262K | 82 | GPQA Diamond 81.7% maps to 82 logic. IFEval 91.5% maps to 91 instruction. LiveCodeBench 65.6% maps to 70 coding. MMMU 78.4% maps to 78 multimodal. As a 9B lightweight model, speed is high (85). |
| Gemma 4 26B A4B | $5.70 | 68 | 77 | 81 | 90 | 88 | 262K | 82 | GPQA Diamond 82.3% maps to 82 logic. LiveCodeBench 77.1% maps to 77 coding. AIME 88.3% maps to 88 math. As a 3.8B active MoE, speed is rated 90. MMMU Pro 73.8% yields 78 multimodal. |
| inclusionAI: Ring-2.6-1T | $9.25 | 68 | 88 | 86 | 60 | 90 | 262K | 88 | Evidence claims SOTA on SWE-bench Verified and AIME26 but lacks exact percentages. As a 1T-parameter flagship MoE, coding and math are scored high (88-90). Speed is 60 based on 54.4 tokens/s throughput. |
| DeepSeek V3.2 | $12.58 | 68 | 92 | 89 | 45 | 95 | 131K | 91 | SWE-bench at 67.8% maps to 92 for coding. MMLU-Pro 85.0% maps to 90 for logic. AIME 2025 at 89.3% yields 95 for math. As a frontier reasoning model, speed is 45. |
| Google: Lyria 3 Pro Preview | Free | 68 | 70 | 70 | 55 | 60 | 1.0M | 68 | Evidence notes Lyria 3 Pro scores well on SWE-bench and MMLU without exact figures. Mapped to 70s for Pro tier. Speed is 39.5 tok/s (55). Multimodal audio generation from images supported; default Pro vision price applied. |
| Google: Gemma 3 12B | $3.50 | 68 | 70 | 70 | 88 | 85 | 131K | 74 | HumanEval 85.4% (Coding ~70), GPQA 40.9% (Logic ~55), IFEval 88.9% (Instruction ~85), MATH 83.8% (Math ~85). As a 12B lightweight tier, scores reflect strong math/instruction but moderate logic/coding compared to flagships. |
| ByteDance Seed: Seed 1.6 Flash | $6.00 | 68 | 87 | 82 | 85 | 77 | 262K | 82 | Benchable.ai cites Coding 87%, Reasoning 94%, Math 77%, and Instruction 70%, mapped directly to 0-100. As a Flash-tier model, it is optimized for speed (85) with native reasoning tokens, reflecting lightweight capabilities versus flagship models. |
| DeepSeek: DeepSeek V3.1 | $16.30 | 68 | 90 | 93 | 75 | 92 | 164K | 92 | GPQA Diamond at 74.9% maps to 95 logic. AIME 2025 at 49.8% maps to 92 math. SWE-bench Verified noted as strength, mapping to 90 coding. Flagship 671B MoE tier yields ~75 speed. |
| Qwen3.6 35B A3B | $15.60 | 67 | 92 | 90 | 85 | 90 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 86.0% maps to 92 logic. MMMU 81.7% maps to 90 multimodal. 3B active MoE architecture ensures high speed (85). |
| DeepSeek: DeepSeek V3.1 Terminus | $20.30 | 67 | 95 | 93 | 70 | 92 | 164K | 93 | SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found. |
| NVIDIA: Nemotron Nano 9B V2 | $3.20 | 67 | 70 | 63 | 90 | 85 | 131K | 70 | Evidence shows GPQA at 64.0% (Logic ~65) and LiveCodeBench at 72.4% (Coding ~70). MATH-500 is 97.8% (Math ~85). As a 9B lightweight reasoning model, it achieves high speed (115 tok/s, Speed ~90) but lacks multimodal support. |
| Xiaomi: MiMo-V2.5 | $8.40 | 67 | 78 | 86 | 60 | 85 | 1.0M | 84 | Pro-level agentic performance maps to MiMo-V2-Pro's SWE-bench Verified (78.0%) and GPQA Diamond (87.0%), yielding ~78 Coding and ~87 Logic. Speed reflects 29 tok/s. Multimodal is strong per omnimodal claims. |
| Qwen: Qwen3.5-122B-A10B | $31.20 | 67 | 95 | 94 | 45 | 95 | 262K | 95 | SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45. |
| Qwen: Qwen3.5-35B-A3B | $15.60 | 67 | 85 | 90 | 90 | 95 | 262K | 90 | SWE-bench (69.2%) maps to 85 coding; GPQA Diamond (84.2%) maps to 88 logic. As a 35B MoE activating 3B parameters (lightweight tier), it achieves high efficiency and speed (90) while maintaining strong native reasoning capabilities. |
| Owl Alpha | Free | 67 | 65 | 68 | 85 | 60 | 1.0M | 65 | No exact scores for Owl Alpha; inferred as a lightweight reasoning model ('fewer parameters', 'designed for speed'). Mapped to mid-tier 0-100 scale (Coding 65, Logic 65) reflecting its agentic focus but smaller size. |
| Amazon: Nova Micro 1.0 | $2.80 | 66 | 68 | 63 | 95 | 75 | 128K | 67 | HumanEval 81.1% (Coding 68), GPQA 40% (Logic 45), IFEval 87.2% (Instruction 80), GSM8K 92.3% (Math 75). As a 'Micro' tier model, speed is rated very high (95) while coding and logic reflect its lightweight, text-only nature. |
| MoonshotAI: Kimi K2.5 | $35.00 | 66 | 95 | 93 | 65 | 98 | 262K | 95 | SWE-bench Verified 76.8% maps to 95 coding. GPQA Diamond 87.9% maps to 96 logic. AIME 2025 96.1% maps to 98 math. 1T MoE flagship tier; native reasoning supported. |
| Qwen: Qwen2.5 7B Instruct | $2.60 | 66 | 65 | 60 | 90 | 75 | 131K | 65 | HumanEval 84.8% and GPQA 36.4% map to 65 coding and 45 logic. IFEval 71.2% maps to 75 instruction. As a 7B lightweight model, it scores lower than flagships but achieves high speed (138 tokens/s, mapped to 90). |
| Tencent: Hunyuan A13B Instruct | $11.30 | 66 | 80 | 84 | 85 | 95 | 131K | 86 | GPQA-Diamond 71.2 and MMLU 88.17 map to high logic (85). MATH 94.3 indicates elite math (95). IFEval 84.7 maps to solid instruction (82). LiveCodeBench 63.9 maps to strong coding (80). 13B active MoE ensures fast speed (85). |
| DeepSeek: DeepSeek V4 Flash | $5.90 | 66 | 68 | 79 | 90 | 85 | 1.0M | 78 | V4 flagship claims 80%+ SWE-bench; Flash tier (13B active) lacks explicit scores but is inferred ~68 for coding. Logic and Math scaled down for Flash efficiency. Speed rated 90 for fast inference design. |
| DeepSeek: R1 0528 | $41.50 | 66 | 92 | 94 | 45 | 98 | 164K | 95 | GPQA 81.0% maps to 98 logic. AIME 2024 91.4% maps to 98 math. LiveCodeBench 73.3% maps to 92 coding. Speed reflects 287 c/s but heavy reasoning overhead (23K thinking tokens). |
| Z.ai: GLM 4.5 Air | $13.50 | 66 | 88 | 87 | 75 | 85 | 131K | 87 | SWE-bench Verified at 57.6% maps to 88 coding. GPQA at 75.0% maps to 88 logic. As an 'Air' lightweight tier, speed is 50 tokens/s (mapped to 75). Native reasoning is supported via a thinking mode boolean. |
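To show how rows like these might be turned into a shortlist for a proposal, the sketch below filters a handful of rows copied from the table by a monthly budget and a minimum coding score, then sorts by ROI score. The `shortlist` helper and the tuple layout are hypothetical and are not part of the leaderboard itself; the figures depend on the workload scenario and toggles selected when the table was captured.

```python
# A few rows copied from the table above:
# (model, est. monthly USD, ROI score, coding, reasoning)
ROWS = [
    ("inclusionAI: Ling-2.6-flash", 0.70, 82, 65, 68),
    ("Arcee AI: Trinity Mini", 3.30, 76, 82, 89),
    ("Qwen: Qwen3 235B A22B Instruct 2507", 4.60, 76, 92, 93),
    ("OpenAI: gpt-oss-20b", 2.56, 75, 70, 80),
    ("MoonshotAI: Kimi K2.5", 35.00, 66, 95, 93),
]


def shortlist(rows, max_monthly, min_coding):
    """Keep rows within budget and above a coding floor, best ROI first."""
    kept = [r for r in rows if r[1] <= max_monthly and r[3] >= min_coding]
    # sorted() is stable, so rows with equal ROI keep their table order.
    return sorted(kept, key=lambda r: r[2], reverse=True)


for model, monthly, roi, coding, reasoning in shortlist(ROWS, 5.00, 80):
    print(f"{model}: ${monthly:.2f}/mo, ROI {roi}, coding {coding}")
```

With a $5.00 budget and a coding floor of 80, this keeps Trinity Mini and Qwen3 235B A22B Instruct 2507, both at ROI 76. Rows priced as Free or VARIABLE are left out of the sample because they have no fixed monthly figure to compare against a budget.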
AI ROI Leaderboard & Discovery by LeadsCalc
Whitelabel ROI score leaderboard for your site
Embed the interactive ROI / Value view on your own domain, with whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.