Best Open-Source LLMs 2026: Open-Weight Models vs. Hosted API Cost
Compare open-weight LLMs in 2026 for self-hosting and dedicated deployments, with pricing context for hosted variants. For teams in the US, Canada, and Australia evaluating open vs. proprietary APIs.
Open-source friendly LLM comparison with TCO in mind in 2026
Open-weight models can unlock on-prem and dedicated-cloud strategies for residency-sensitive workloads. This tab filters the open-weight subset so platform teams in the United States, Canada, and Australia can compare capability signals while still grounding decisions in realistic engineering and GPU spend—not list hype alone.
Workload & pricing toggles
Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.
Include Vision / Image Processing
Off by default: cost estimates for vision-capable models exclude image fees.
Turn on to include image fees.
Use Cached Pricing
Enable for 50% off input tokens where cached rates apply.
Deep Reasoning / Thinking Mode
When enabled, a model's hidden reasoning (extended thinking) tokens are charged like output tokens.
Batch Pricing
Enable for 50% off input and output tokens where batch or async pricing applies.
Cached and batch estimated monthly values change only for models where the pipeline has set supports_caching or supports_batch in Supabase. The toggles here also narrow the table to models whose catalog entry or provider typically supports those modes.
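The toggles above can be read as adjustments to a simple token-cost calculation. The following is a minimal sketch of that arithmetic, not the calculator's actual implementation. The `Workload` fields, the per-million-token prices, and the choice to stack the cache and batch discounts multiplicatively are all assumptions made for illustration; image fees are left out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description (per-request token counts)."""
    requests_per_month: int
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int = 0  # hidden thinking tokens, billed only in thinking mode


def estimate_monthly_cost(
    workload: Workload,
    input_price_per_m: float,
    output_price_per_m: float,
    *,
    thinking: bool = False,
    use_cache: bool = False,
    supports_caching: bool = False,
    use_batch: bool = False,
    supports_batch: bool = False,
) -> float:
    """Return an estimated monthly cost in USD for one model."""
    input_total = workload.requests_per_month * workload.input_tokens

    # Thinking mode: reasoning tokens are charged like output tokens.
    output_per_request = workload.output_tokens
    if thinking:
        output_per_request += workload.reasoning_tokens
    output_total = workload.requests_per_month * output_per_request

    input_cost = input_total / 1_000_000 * input_price_per_m
    output_cost = output_total / 1_000_000 * output_price_per_m

    # Cached pricing: 50% off input, only if the model is flagged as supporting it.
    if use_cache and supports_caching:
        input_cost *= 0.5
    # Batch pricing: 50% off input and output, only if the model supports it.
    # Assumption: this stacks multiplicatively with the cache discount.
    if use_batch and supports_batch:
        input_cost *= 0.5
        output_cost *= 0.5

    return round(input_cost + output_cost, 2)


# Hypothetical workload and prices, not taken from the leaderboard.
moderate = Workload(requests_per_month=10_000, input_tokens=2_000,
                    output_tokens=500, reasoning_tokens=1_000)
print(estimate_monthly_cost(moderate, 0.50, 2.00))
print(estimate_monthly_cost(moderate, 0.50, 2.00, thinking=True))
print(estimate_monthly_cost(moderate, 0.50, 2.00, use_cache=True, supports_caching=True))
```

Note that a toggle has no effect unless the matching support flag is set, which mirrors the behavior described above for supports_caching and supports_batch.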
Magic quadrant (top 15)
X: est. monthly · Y: Open-weight · Dot: provider color
Full leaderboard
Showing 48 of 81 models (open-weight / self-hostable catalog hints).
| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Scoring notes |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek: DeepSeek V3.2 Speciale | $15.79 | 70 | 95 | 97 | 45 | 96 | 164K | 96 | GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead. |
| Qwen: Qwen3.6 Max Preview | $104.00 | 64 | 96 | 95 | 55 | 97 | 262K | 96 | Based on Qwen3.6 Plus scoring SWE-bench 78.8% and GPQA 90.4%, Max (1T MoE flagship) maps to 96 for coding and logic. Native reasoning is supported via `<think>` tags. Vision defaults to frontier pricing. |
| Qwen: Qwen3.5-122B-A10B | $31.20 | 67 | 95 | 94 | 45 | 95 | 262K | 95 | SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45. |
| DeepSeek: DeepSeek V3.1 Terminus | $20.30 | 67 | 95 | 93 | 70 | 92 | 164K | 93 | SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found. |
| Qwen: Qwen3 Max | $70.20 | 63 | 96 | 91 | 55 | 95 | 262K | 93 | SWE-bench Verified at 69.6% maps to 96 coding. SuperGPQA at 65.1% maps to 92 logic. AIME 2025 at 81.6% maps to 95 math. Flagship tier model; speed mapped to 55 based on 26 tok/s throughput. |
| DeepSeek: DeepSeek V4 Pro | $26.10 | 66 | 98 | 91 | 70 | 88 | 1.0M | 92 | SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE. |
| Nex AGI: DeepSeek V3.1 Nex N1 | $10.40 | 70 | 96 | 90 | 70 | 90 | 131K | 92 | Flagship tier. SWE-bench Verified at 70.6 maps to 96 coding. BFCL v4 at 65.3 maps to 90 instruction. Logic and math inferred high (90) from flagship status. Speed set to 70 for heavy MoE. |
| Qwen: Qwen3.6 Plus | $32.50 | 65 | 95 | 91 | 70 | 92 | 1.0M | 92 | OpenRouter docs cite a 78.8% on SWE-bench Verified, mapping to a 95 coding score. As a reasoning-enabled Plus MoE model, logic and math score highly (~92), while speed is balanced (~70). Multimodal capabilities are explicitly noted. |
| DeepSeek: DeepSeek V3.1 | $16.30 | 68 | 90 | 93 | 75 | 92 | 164K | 92 | GPQA Diamond at 74.9% maps to 95 logic. AIME 2025 at 49.8% maps to 92 math. SWE-bench Verified noted as strength, mapping to 90 coding. Flagship 671B MoE tier yields ~75 speed. |
| Qwen: Qwen3 235B A22B Instruct 2507 | $4.60 | 76 | 92 | 93 | 70 | 92 | 262K | 92 | SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed. |
| DeepSeek V3.2 | $12.58 | 68 | 92 | 89 | 45 | 95 | 131K | 91 | SWE-bench at 67.8% maps to 92 for coding. MMLU-Pro 85.0% maps to 90 for logic. AIME 2025 at 89.3% yields 95 for math. As a frontier reasoning model, speed is 45. |
| Qwen3.6 35B A3B | $15.60 | 67 | 92 | 90 | 85 | 90 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 86.0% maps to 92 logic. MMMU 81.7% maps to 90 multimodal. 3B active MoE architecture ensures high speed (85). |
| Qwen: Qwen3.5-35B-A3B | $15.60 | 67 | 85 | 90 | 90 | 95 | 262K | 90 | SWE-bench (69.2%) maps to 85 coding; GPQA Diamond (84.2%) maps to 88 logic. As a 35B MoE activating 3B parameters (lightweight tier), it achieves high efficiency and speed (90) while maintaining strong native reasoning capabilities. |
| Mistral Large (2512) | $35.00 | 63 | 90 | 89 | 65 | 92 | 262K | 90 | Evidence lacks exact Large 3 scores but cites Large 2 MMLU (84%) and Small 3.1 HumanEval (88.4%). As the 675B flagship, scores are inferred higher (Coding 90, Logic 88). Speed reflects heavy MoE architecture. |
| Qwen: Qwen3.5-27B | $23.40 | 65 | 88 | 92 | 85 | 90 | 262K | 90 | SWE-bench Verified at 72.4% maps to 88 coding. GPQA Diamond at 85.5% maps to 88 logic. IFEval 95.0% maps to 95 instruction. As a 27B mid-tier model, speed is rated 85. |
| Qwen: Qwen3 Coder Plus | $58.50 | 61 | 92 | 88 | 60 | 88 | 1.0M | 89 | Evidence lacks exact benchmark percentages but describes a 480B flagship coding model matching GPT-4 on SWE-bench and HumanEval+. Assigned high flagship scores (Coding 92, Logic 88). Speed reflects 29-34 tok/s throughput. Vision supported; default frontier image price applied. |
| Qwen: Qwen3.7 Max | $87.50 | 60 | 88 | 89 | 60 | 90 | 1.0M | 89 | Evidence lacks exact Qwen3.7 Max scores but confirms it as a flagship model evaluated on SWE-bench, GPQA, and MMLU. Scores inferred for a heavyweight tier (e.g., MATH 75-90% for frontier models), mapping to high 80s/90s. |
| Gemma 4 31B | $8.40 | 69 | 78 | 89 | 45 | 95 | 262K | 88 | HumanEval 76.8% maps to coding 78. MMLU 87.1% maps to logic 85. IFEval 93.7% maps to instruction 92. GSM8k 97.6% maps to math 95. Speed 8.52 t/s maps to 45. Native reasoning confirmed via 'reasoning_details'. |
| Qwen: Qwen3 VL 235B A22B Instruct | $16.80 | 65 | 85 | 88 | 65 | 85 | 262K | 87 | Evidence lists wins on MMLU, SuperGPQA, IFEval, and LiveCodeBench without exact percentages. As a 235B flagship MoE, scores are inferred high (85-92). Multimodal is strong (MMMU-Pro). No native reasoning (Instruct version). |
| Meta: Llama 3.3 70B Instruct | $7.20 | 69 | 85 | 86 | 75 | 85 | 131K | 86 | SWE-bench Verified 54.6% maps to 85 coding. GPQA 50.5% and MMLU 86% map to 80 logic. IFEval 92.1% yields 92 instruction. MATH 77% gives 85 math. 70B tier implies 75 speed. Text-only model. |
| Qwen: Qwen3.6 27B | $35.56 | 61 | 82 | 90 | 80 | 80 | 262K | 86 | Inferred from Qwen3.5-27B: IFEval 95.0 maps to Instruction 95. Artificial Analysis Coding 82% maps to Coding 82. Intelligence 85% maps to Logic 85. 27B mid-tier adjustments applied for Math and Speed. |
| Mistral Large 2411 | $140.00 | 57 | 85 | 85 | 65 | 85 | 131K | 85 | Evidence lists SWE-bench, GPQA, and MMLU but omits exact scores. As a flagship 'Large' model upgrading 24.07, scores are inferred high (85s). Speed is estimated at 65 for heavyweights. Vision is explicitly unsupported for this specific SKU. |
| Nous: Hermes 3 405B Instruct | $50.00 | 60 | 87 | 83 | 60 | 85 | 131K | 85 | Evidence shows HumanEval at 89.0% (mapped to 87 coding) and GPQA at 50.7% (mapped to 78 logic). IFEval 88.6% maps to 88 instruction, and MATH 73.8% maps to 85 math. Heavyweight 405B model yields moderate speed (60). |
| Mistral Large 2407 | $140.00 | 57 | 82 | 85 | 65 | 88 | 131K | 85 | Evidence shows HumanEval 92% and LiveCodeBench 26.7 (Coding ~82), GPQA 47.2 and MMLU 84% (Logic ~84), MATH-500 71.4 (Math ~88), MT-Bench 8.63 (Instruction ~85). As a 123B flagship, speed is moderate (~65). No vision or native reasoning. |
| Qwen: Qwen3 Next 80B A3B Instruct | $14.60 | 64 | 88 | 84 | 65 | 82 | 262K | 85 | HumanEval 95.1% maps to 88 coding. GPQA-D 47.0% maps to 75 logic. IFEval 93.4% maps to 93 instruction. As an 80B heavyweight, it lacks native reasoning and vision, prioritizing fast text generation at 32 tok/s. |
| Mistral: Mistral Medium 3 | $36.00 | 60 | 80 | 83 | 75 | 90 | 131K | 84 | HumanEval 92.1% (Coding ~80), GPQA Diamond 57.1% (Logic ~75), IFEval 89.4% (Instruction ~90), Math500 91.0% (Math ~90). Mid-tier model balancing cost and performance; speed ~69 tok/s (Speed ~75). Vision supported but price inferred. |
| Qwen: Qwen3 32B | $6.00 | 69 | 78 | 85 | 70 | 88 | 131K | 84 | No exact percentages cited; evidence notes Qwen3 32B outperforms Qwen2.5 72B on GPQA and IFEval. Mapped Logic/Instruction to 85. Speed mapped to 70 based on 57 tok/s. Native reasoning supported via dual-mode architecture. |
| DeepSeek: DeepSeek V3 0324 | $15.70 | 63 | 84 | 87 | 65 | 75 | 131K | 83 | Based on DeepSeek-V3 baseline (HumanEval 82.6%, GPQA 59.1%, MATH 61.6%, IFEval 86.1%), with evidence noting 0324 improves on GPQA and MATH. Mapped to flagship 0-100 scale. Speed reflects 685B MoE architecture. |
| Qwen: Qwen3.5 397B A17B | $39.00 | 59 | 82 | 84 | 65 | 82 | 262K | 83 | Evidence cites SWE-bench Verified and GPQA without exact percentages. Inferred as 397B flagship MoE: coding and logic mapped to 82, speed to 65 for heavy MoE. |
| NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | $20.00 | 62 | 75 | 80 | 70 | 95 | 131K | 83 | GPQA 71.97% maps to Logic 80. MATH500 97.4% maps to Math 95. LiveCodeBench 73.58% maps to Coding 75. Speed 50.6 tok/s maps to 70. Mid-tier 49B model with explicit reasoning modes and vision support. |
| Mistral: Mistral Medium 3.5 | $135.00 | 56 | 90 | 80 | 70 | 80 | 262K | 83 | Evidence lacks exact 3.5 benchmarks, so scores are inferred above Mistral Small 3.1 (HumanEval 88.4%, GPQA 46%). As a 128B model, it maps to strong mid-tier/frontier performance (Coding 90, Logic 75). |
| Qwen: Qwen3 Coder 480B A35B | $26.80 | 60 | 92 | 85 | 70 | 65 | 1.0M | 82 | Evidence shows HumanEval 93.9% and GPQA 61.8%, mapping to high coding (92) and logic (85) for this heavyweight 480B MoE. Math is mapped to 65 based on MATH 39.3%. Speed is 70 (67 tok/s). |
| Qwen: Qwen3.5-9B | $5.50 | 69 | 70 | 87 | 85 | 83 | 262K | 82 | GPQA Diamond 81.7% maps to 82 logic. IFEval 91.5% maps to 91 instruction. LiveCodeBench 65.6% maps to 70 coding. MMMU 78.4% maps to 78 multimodal. As a 9B lightweight model, speed is high (85). |
| Llama 4 Maverick | $12.00 | 63 | 80 | 87 | 75 | 75 | 1.0M | 82 | HumanEval 82.6% maps to coding 80. MMLU 88.5% maps to logic 85. IFEval 92.1% maps to instruction 88. MATH 77% maps to math 75. MMMU 73.4% maps to multimodal 85. 17B active MoE yields mid-tier speed. |
| Gemma 4 26B A4B | $5.70 | 68 | 77 | 81 | 90 | 88 | 262K | 82 | GPQA Diamond 82.3% maps to 82 logic. LiveCodeBench 77.1% maps to 77 coding. AIME 88.3% maps to 88 math. As a 3.8B active MoE, speed is rated 90. MMMU Pro 73.8% yields 78 multimodal. |
| Qwen2.5 72B Instruct | $18.40 | 61 | 82 | 77 | 65 | 88 | 131K | 81 | Evidence cites HumanEval at 86.6% (mapped to 82 coding) and GPQA at 49.0% (mapped to 70 logic). IFEval is 84.1% (84 instruction) and MATH is 83.1% (88 math). As a 72B heavyweight, speed is standard (65). |
| Qwen: Qwen3.7 Plus | $32.00 | 59 | 80 | 81 | 80 | 82 | 1.0M | 81 | Evidence lacks exact Qwen3.7 Plus benchmark scores. Inferred mid-tier capabilities (Coding 80, Logic 80) from 'Plus' designation and 71 tok/s throughput (Speed 80). Multimodal inferred at 70. Always-on reasoning noted in lineage. |
| Qwen: Qwen3.5 Plus 2026-04-20 | $30.00 | 59 | 80 | 83 | 70 | 80 | 1.0M | 81 | No exact benchmark percentages provided in evidence. Inferred coding (80) and logic (80) based on 'Plus' mid-tier status and strong comparative claims. Speed (70) reflects 45-51 tok/s throughput. Native reasoning explicitly supported. |
| Qwen3 235B A22B | $36.40 | 58 | 65 | 83 | 50 | 88 | 131K | 80 | AA Index reports 20.9% coding (mapped to 65) and 88.3% math (mapped to 88). Speed is 35 t/s (mapped to 50). Heavyweight MoE with native thinking mode. |
| Qwen: Qwen2.5 VL 72B Instruct | $17.50 | 60 | 82 | 75 | 65 | 88 | 131K | 80 | HumanEval 86.6% maps coding to 82. GPQA 49.0% maps logic to 65. IFEval 84.1% maps instruction to 84. GSM8K 95.8% maps math to 88. Heavyweight 72B tier dictates speed ~65 and strong multimodal ~85. |
| Qwen: Qwen3 VL 30B A3B Instruct | $10.40 | 63 | 75 | 86 | 90 | 75 | 262K | 80 | GPQA 70.1% maps to Logic 85; IFEval 85.8% maps to Instruction 86. As a 30B/3B active MoE mid-tier model, coding and math are inferred ~75. Speed is high (~90) due to sparse architecture. |
| Qwen: Qwen VL Max | $41.60 | 57 | 75 | 80 | 60 | 80 | 131K | 79 | Primary source: Evidence lists benchmarks as 'not available'. Inferred from 'Max' flagship tier, mapping coding, logic, and math to ~75-80. Speed reflects heavyweight class (~60). |
| Qwen: Qwen-Plus | $18.20 | 60 | 74 | 83 | 80 | 75 | 1.0M | 79 | No exact Qwen-Plus benchmarks cited; inferred from Qwen-72B proxy (HumanEval 74.2%, MMLU 86.5%, MATH 64.0%). Mapped to mid-tier 0-100 scale. Evidence confirms native reasoning ('reasoning_details' array) and vision support, but lacks caching or batch details. |
| Meta: Llama 3.1 70B Instruct | $20.00 | 59 | 80 | 84 | 70 | 68 | 131K | 79 | HumanEval 80.5% maps to 80 coding. MMLU 83.6% maps to 83 logic. MATH 68% maps to 68 math. 70B heavyweight tier yields 70 speed. No vision or native reasoning supported. |
| Qwen: Qwen3 Coder Next | $12.40 | 61 | 85 | 72 | 95 | 85 | 262K | 78 | HumanEval 92.7% and GPQA-D 42.4% map to 85 coding and 55 logic. IFEval 89.6% yields 88 instruction. Speed is 95 based on 162 tok/s. As an 80B (3B active) efficient model, logic is appropriately scaled. |
| Qwen: Qwen3.5 Plus 2026-02-15 | $26.00 | 57 | 75 | 79 | 80 | 75 | 1.0M | 77 | Evidence notes Qwen3.6 Plus scores 78.8 on SWE-bench, implying Qwen3.5 Plus is slightly lower (mapped to 75). OpenRouter confirms 'reasoning' parameter support. Vision supported; estimated 1K tokens at $0.26/M input = $0.00026 per image. |
| Qwen: Qwen3 VL 32B Instruct | $8.32 | 62 | 75 | 78 | 70 | 75 | 262K | 76 | Evidence lacks exact Instruct scores but cites Qwen2.5 72B (HumanEval 86.6%, GSM8K 95.8%). As a 32B mid-tier model, scores are inferred lower (~75). Multimodal is strong (VL architecture). Speed reflects 32B size. |
| Qwen2.5 Coder 32B Instruct | $36.40 | 55 | 85 | 75 | 80 | 65 | 128K | 75 | Deepranking.ai reports HumanEval 92.7%, MMLU 75.1%, and MATH 57.2%, mapped to Coding 85, Logic 75, and Math 65. As a 32B mid-tier model, speed is rated 80. No vision or native reasoning features cited. |
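One way to use the leaderboard is to shortlist models by monthly budget and minimum context window before looking at capability scores. The sketch below copies five rows from the table and is only an illustration: the `Row` type, the `shortlist` helper, and the reading of `K` as 1,000 tokens and `M` as 1,000,000 tokens are assumptions, and the ordering it applies is not the page's ROI score.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    est_monthly: float  # USD, from the "Est. monthly" column
    overall: int        # from the "Overall" column
    context: str        # label as displayed, e.g. "164K" or "1.0M"


# A handful of rows copied from the leaderboard above.
ROWS = [
    Row("DeepSeek: DeepSeek V3.2 Speciale", 15.79, 96, "164K"),
    Row("Qwen: Qwen3.6 Max Preview", 104.00, 96, "262K"),
    Row("DeepSeek: DeepSeek V4 Pro", 26.10, 92, "1.0M"),
    Row("Qwen: Qwen3 235B A22B Instruct 2507", 4.60, 92, "262K"),
    Row("Meta: Llama 3.3 70B Instruct", 7.20, 86, "131K"),
]


def context_tokens(label: str) -> int:
    """Convert a display label such as '164K' or '1.0M' to a token count.

    Assumption: K means 1,000 and M means 1,000,000 (labels are rounded).
    """
    units = {"K": 1_000, "M": 1_000_000}
    return int(float(label[:-1]) * units[label[-1]])


def shortlist(rows, max_monthly: float, min_context: int):
    """Keep rows within budget and context needs, best Overall first."""
    keep = [
        r for r in rows
        if r.est_monthly <= max_monthly and context_tokens(r.context) >= min_context
    ]
    # Ties on Overall are broken by the cheaper estimated monthly cost.
    return sorted(keep, key=lambda r: (-r.overall, r.est_monthly))


for row in shortlist(ROWS, max_monthly=30.0, min_context=200_000):
    print(row.model, row.est_monthly, row.overall)
```

With a $30 budget and a 200K-token minimum, only the two sampled models that meet both constraints remain, with the cheaper one listed first because their Overall scores tie.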
AI ROI Leaderboard & Discovery by LeadsCalc