Interactive leaderboard

Best Open-Source LLMs 2026: Open-Weight Models vs. Hosted API Cost

Compare open-weight LLMs in 2026 for self-hosting and dedicated deployments, with pricing context for hosted variants. For teams in the US, Canada, and Australia evaluating open vs. proprietary APIs.

Open-source-friendly LLM comparison for 2026, with TCO in mind

Open-weight models can unlock on-prem and dedicated-cloud strategies for residency-sensitive workloads. This tab filters to the open-weight subset so platform teams in the United States, Canada, and Australia can compare capability signals while still grounding decisions in realistic engineering and GPU spend, not list hype alone.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.

Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply

Deep Reasoning / Thinking Mode

When enabled, the model's hidden reasoning (extended thinking) tokens are charged like output tokens.

Batch Pricing

Enable for 50% off input & output where batch/async pricing applies

Slider readouts (labels not captured):
≈ $100.00/mo · value 8K · range 1K to 1.0M
≈ $100.00/mo · value 2K · range 100 to 500K
≈ $200.00 total · value 5K · range 10 to 100K

Cached and batch monthly estimates only change once the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog entry or provider typically supports those modes.
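The toggle rules in this section can be sketched as a small estimator. This is a minimal sketch, not the calculator's actual code: the Workload fields, the function name, and the example prices are hypothetical, and stacking the cached and batch discounts multiplicatively is an assumption, since the page only states each discount on its own.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload shape; all token counts are per request."""
    requests_per_month: int
    input_tokens: int
    output_tokens: int
    thinking_tokens: int = 0  # only billed when thinking mode is on


def estimate_monthly_cost(
    w: Workload,
    input_price_per_m: float,   # USD per 1M input tokens (assumed unit)
    output_price_per_m: float,  # USD per 1M output tokens (assumed unit)
    use_cached: bool = False,
    use_batch: bool = False,
    use_thinking: bool = False,
) -> float:
    input_rate = input_price_per_m
    output_rate = output_price_per_m
    if use_cached:
        # Page rule: 50% off input tokens where cached rates apply.
        input_rate *= 0.5
    if use_batch:
        # Page rule: 50% off input and output where batch pricing applies.
        input_rate *= 0.5
        output_rate *= 0.5
    # Page rule: thinking tokens are charged like output tokens when enabled.
    billed_output = w.output_tokens + (w.thinking_tokens if use_thinking else 0)
    per_request = (w.input_tokens * input_rate + billed_output * output_rate) / 1_000_000
    return w.requests_per_month * per_request


if __name__ == "__main__":
    w = Workload(requests_per_month=10_000, input_tokens=8_000,
                 output_tokens=1_000, thinking_tokens=2_000)
    print(round(estimate_monthly_cost(w, 0.26, 1.00), 2))
    print(round(estimate_monthly_cost(w, 0.26, 1.00, use_thinking=True), 2))
```

Keeping each toggle as an independent multiplier makes it easy to see how much of an estimate comes from a discount that a given provider may not actually offer.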

Magic quadrant (top 15)

X: est. monthly · Y: Open-weight · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 81 models (open-weight / self-hostable catalog hints).

DeepSeek: DeepSeek V3.2 Speciale
Est. monthly $15.79 · ROI score 70 · Coding 95 · Reasoning 97 · Speed 45 · Math 96 · Context 164K · Overall 96
GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead.

Qwen: Qwen3.6 Max Preview
Est. monthly $104.00 · ROI score 64 · Coding 96 · Reasoning 95 · Speed 55 · Math 97 · Context 262K · Overall 96
Based on Qwen3.6 Plus scoring SWE-bench 78.8% and GPQA 90.4%, Max (1T MoE flagship) maps to 96 for coding and logic. Native reasoning is supported via <think> tags. Vision defaults to frontier pricing.

Qwen: Qwen3.5-122B-A10B
Est. monthly $31.20 · ROI score 67 · Coding 95 · Reasoning 94 · Speed 45 · Math 95 · Context 262K · Overall 95
SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45.

DeepSeek: DeepSeek V3.1 Terminus
Est. monthly $20.30 · ROI score 67 · Coding 95 · Reasoning 93 · Speed 70 · Math 92 · Context 164K · Overall 93
SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found.

Qwen: Qwen3 Max
Est. monthly $70.20 · ROI score 63 · Coding 96 · Reasoning 91 · Speed 55 · Math 95 · Context 262K · Overall 93
SWE-bench Verified at 69.6% maps to 96 coding. SuperGPQA at 65.1% maps to 92 logic. AIME 2025 at 81.6% maps to 95 math. Flagship tier model; speed mapped to 55 based on 26 tok/s throughput.

DeepSeek: DeepSeek V4 Pro
Est. monthly $26.10 · ROI score 66 · Coding 98 · Reasoning 91 · Speed 70 · Math 88 · Context 1.0M · Overall 92
SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE.

Nex AGI: DeepSeek V3.1 Nex N1
Est. monthly $10.40 · ROI score 70 · Coding 96 · Reasoning 90 · Speed 70 · Math 90 · Context 131K · Overall 92
Flagship tier. SWE-bench Verified at 70.6 maps to 96 coding. BFCL v4 at 65.3 maps to 90 instruction. Logic and math inferred high (90) from flagship status. Speed set to 70 for heavy MoE.

Qwen: Qwen3.6 Plus
Est. monthly $32.50 · ROI score 65 · Coding 95 · Reasoning 91 · Speed 70 · Math 92 · Context 1.0M · Overall 92
OpenRouter docs cite a 78.8% on SWE-bench Verified, mapping to a 95 coding score. As a reasoning-enabled Plus MoE model, logic and math score highly (~92), while speed is balanced (~70). Multimodal capabilities are explicitly noted.

DeepSeek: DeepSeek V3.1
Est. monthly $16.30 · ROI score 68 · Coding 90 · Reasoning 93 · Speed 75 · Math 92 · Context 164K · Overall 92
GPQA Diamond at 74.9% maps to 95 logic. AIME 2025 at 49.8% maps to 92 math. SWE-bench Verified noted as strength, mapping to 90 coding. Flagship 671B MoE tier yields ~75 speed.

Qwen: Qwen3 235B A22B Instruct 2507
Est. monthly $4.60 · ROI score 76 · Coding 92 · Reasoning 93 · Speed 70 · Math 92 · Context 262K · Overall 92
SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed.

DeepSeek V3.2
Est. monthly $12.58 · ROI score 68 · Coding 92 · Reasoning 89 · Speed 45 · Math 95 · Context 131K · Overall 91
SWE-bench at 67.8% maps to 92 for coding. MMLU-Pro 85.0% maps to 90 for logic. AIME 2025 at 89.3% yields 95 for math. As a frontier reasoning model, speed is 45.

Qwen3.6 35B A3B
Est. monthly $15.60 · ROI score 67 · Coding 92 · Reasoning 90 · Speed 85 · Math 90 · Context 262K · Overall 91
SWE-bench Verified 73.4% maps to 92 coding. GPQA 86.0% maps to 92 logic. MMMU 81.7% maps to 90 multimodal. 3B active MoE architecture ensures high speed (85).

Qwen: Qwen3.5-35B-A3B
Est. monthly $15.60 · ROI score 67 · Coding 85 · Reasoning 90 · Speed 90 · Math 95 · Context 262K · Overall 90
SWE-bench (69.2%) maps to 85 coding; GPQA Diamond (84.2%) maps to 88 logic. As a 35B MoE activating 3B parameters (lightweight tier), it achieves high efficiency and speed (90) while maintaining strong native reasoning capabilities.

Mistral Large (2512)
Est. monthly $35.00 · ROI score 63 · Coding 90 · Reasoning 89 · Speed 65 · Math 92 · Context 262K · Overall 90
Evidence lacks exact Large 3 scores but cites Large 2 MMLU (84%) and Small 3.1 HumanEval (88.4%). As the 675B flagship, scores are inferred higher (Coding 90, Logic 88). Speed reflects heavy MoE architecture.

Qwen: Qwen3.5-27B
Est. monthly $23.40 · ROI score 65 · Coding 88 · Reasoning 92 · Speed 85 · Math 90 · Context 262K · Overall 90
SWE-bench Verified at 72.4% maps to 88 coding. GPQA Diamond at 85.5% maps to 88 logic. IFEval 95.0% maps to 95 instruction. As a 27B mid-tier model, speed is rated 85.

Qwen: Qwen3 Coder Plus
Est. monthly $58.50 · ROI score 61 · Coding 92 · Reasoning 88 · Speed 60 · Math 88 · Context 1.0M · Overall 89
Evidence lacks exact benchmark percentages but describes a 480B flagship coding model matching GPT-4 on SWE-bench and HumanEval+. Assigned high flagship scores (Coding 92, Logic 88). Speed reflects 29-34 tok/s throughput. Vision supported; default frontier image price applied.

Qwen: Qwen3.7 Max
Est. monthly $87.50 · ROI score 60 · Coding 88 · Reasoning 89 · Speed 60 · Math 90 · Context 1.0M · Overall 89
Evidence lacks exact Qwen3.7 Max scores but confirms it as a flagship model evaluated on SWE-bench, GPQA, and MMLU. Scores inferred for a heavyweight tier (e.g., MATH 75-90% for frontier models), mapping to high 80s/90s.

Gemma 4 31B
Est. monthly $8.40 · ROI score 69 · Coding 78 · Reasoning 89 · Speed 45 · Math 95 · Context 262K · Overall 88
HumanEval 76.8% maps to coding 78. MMLU 87.1% maps to logic 85. IFEval 93.7% maps to instruction 92. GSM8k 97.6% maps to math 95. Speed 8.52 t/s maps to 45. Native reasoning confirmed via 'reasoning_details'.

Qwen: Qwen3 VL 235B A22B Instruct
Est. monthly $16.80 · ROI score 65 · Coding 85 · Reasoning 88 · Speed 65 · Math 85 · Context 262K · Overall 87
Evidence lists wins on MMLU, SuperGPQA, IFEval, and LiveCodeBench without exact percentages. As a 235B flagship MoE, scores are inferred high (85-92). Multimodal is strong (MMMU-Pro). No native reasoning (Instruct version).

Meta: Llama 3.3 70B Instruct
Est. monthly $7.20 · ROI score 69 · Coding 85 · Reasoning 86 · Speed 75 · Math 85 · Context 131K · Overall 86
SWE-bench Verified 54.6% maps to 85 coding. GPQA 50.5% and MMLU 86% map to 80 logic. IFEval 92.1% yields 92 instruction. MATH 77% gives 85 math. 70B tier implies 75 speed. Text-only model.

Qwen: Qwen3.6 27B
Est. monthly $35.56 · ROI score 61 · Coding 82 · Reasoning 90 · Speed 80 · Math 80 · Context 262K · Overall 86
Inferred from Qwen3.5-27B: IFEval 95.0 maps to Instruction 95. Artificial Analysis Coding 82% maps to Coding 82. Intelligence 85% maps to Logic 85. 27B mid-tier adjustments applied for Math and Speed.

Mistral Large 2411
Est. monthly $140.00 · ROI score 57 · Coding 85 · Reasoning 85 · Speed 65 · Math 85 · Context 131K · Overall 85
Evidence lists SWE-bench, GPQA, and MMLU but omits exact scores. As a flagship 'Large' model upgrading 24.07, scores are inferred high (85s). Speed is estimated at 65 for heavyweights. Vision is explicitly unsupported for this specific SKU.

Nous: Hermes 3 405B Instruct
Est. monthly $50.00 · ROI score 60 · Coding 87 · Reasoning 83 · Speed 60 · Math 85 · Context 131K · Overall 85
Evidence shows HumanEval at 89.0% (mapped to 87 coding) and GPQA at 50.7% (mapped to 78 logic). IFEval 88.6% maps to 88 instruction, and MATH 73.8% maps to 85 math. Heavyweight 405B model yields moderate speed (60).

Mistral Large 2407
Est. monthly $140.00 · ROI score 57 · Coding 82 · Reasoning 85 · Speed 65 · Math 88 · Context 131K · Overall 85
Evidence shows HumanEval 92% and LiveCodeBench 26.7 (Coding ~82), GPQA 47.2 and MMLU 84% (Logic ~84), MATH-500 71.4 (Math ~88), MT-Bench 8.63 (Instruction ~85). As a 123B flagship, speed is moderate (~65). No vision or native reasoning.

Qwen: Qwen3 Next 80B A3B Instruct
Est. monthly $14.60 · ROI score 64 · Coding 88 · Reasoning 84 · Speed 65 · Math 82 · Context 262K · Overall 85
HumanEval 95.1% maps to 88 coding. GPQA-D 47.0% maps to 75 logic. IFEval 93.4% maps to 93 instruction. As an 80B heavyweight, it lacks native reasoning and vision, prioritizing fast text generation at 32 tok/s.

Mistral: Mistral Medium 3
Est. monthly $36.00 · ROI score 60 · Coding 80 · Reasoning 83 · Speed 75 · Math 90 · Context 131K · Overall 84
HumanEval 92.1% (Coding ~80), GPQA Diamond 57.1% (Logic ~75), IFEval 89.4% (Instruction ~90), Math500 91.0% (Math ~90). Mid-tier model balancing cost and performance; speed ~69 tok/s (Speed ~75). Vision supported but price inferred.

Qwen: Qwen3 32B
Est. monthly $6.00 · ROI score 69 · Coding 78 · Reasoning 85 · Speed 70 · Math 88 · Context 131K · Overall 84
No exact percentages cited; evidence notes Qwen3 32B outperforms Qwen2.5 72B on GPQA and IFEval. Mapped Logic/Instruction to 85. Speed mapped to 70 based on 57 tok/s. Native reasoning supported via dual-mode architecture.

DeepSeek: DeepSeek V3 0324
Est. monthly $15.70 · ROI score 63 · Coding 84 · Reasoning 87 · Speed 65 · Math 75 · Context 131K · Overall 83
Based on DeepSeek-V3 baseline (HumanEval 82.6%, GPQA 59.1%, MATH 61.6%, IFEval 86.1%), with evidence noting 0324 improves on GPQA and MATH. Mapped to flagship 0-100 scale. Speed reflects 685B MoE architecture.

Qwen: Qwen3.5 397B A17B
Est. monthly $39.00 · ROI score 59 · Coding 82 · Reasoning 84 · Speed 65 · Math 82 · Context 262K · Overall 83
Evidence cites SWE-bench Verified and GPQA without exact percentages. Inferred as 397B flagship MoE: coding and logic mapped to 82, speed to 65 for heavy MoE.

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5
Est. monthly $20.00 · ROI score 62 · Coding 75 · Reasoning 80 · Speed 70 · Math 95 · Context 131K · Overall 83
GPQA 71.97% maps to Logic 80. MATH500 97.4% maps to Math 95. LiveCodeBench 73.58% maps to Coding 75. Speed 50.6 tok/s maps to 70. Mid-tier 49B model with explicit reasoning modes and vision support.

Mistral: Mistral Medium 3.5
Est. monthly $135.00 · ROI score 56 · Coding 90 · Reasoning 80 · Speed 70 · Math 80 · Context 262K · Overall 83
Evidence lacks exact 3.5 benchmarks, so scores are inferred above Mistral Small 3.1 (HumanEval 88.4%, GPQA 46%). As a 128B model, it maps to strong mid-tier/frontier performance (Coding 90, Logic 75).

Qwen: Qwen3 Coder 480B A35B
Est. monthly $26.80 · ROI score 60 · Coding 92 · Reasoning 85 · Speed 70 · Math 65 · Context 1.0M · Overall 82
Evidence shows HumanEval 93.9% and GPQA 61.8%, mapping to high coding (92) and logic (85) for this heavyweight 480B MoE. Math is mapped to 65 based on MATH 39.3%. Speed is 70 (67 tok/s).

Qwen: Qwen3.5-9B
Est. monthly $5.50 · ROI score 69 · Coding 70 · Reasoning 87 · Speed 85 · Math 83 · Context 262K · Overall 82
GPQA Diamond 81.7% maps to 82 logic. IFEval 91.5% maps to 91 instruction. LiveCodeBench 65.6% maps to 70 coding. MMMU 78.4% maps to 78 multimodal. As a 9B lightweight model, speed is high (85).

Llama 4 Maverick
Est. monthly $12.00 · ROI score 63 · Coding 80 · Reasoning 87 · Speed 75 · Math 75 · Context 1.0M · Overall 82
HumanEval 82.6% maps to coding 80. MMLU 88.5% maps to logic 85. IFEval 92.1% maps to instruction 88. MATH 77% maps to math 75. MMMU 73.4% maps to multimodal 85. 17B active MoE yields mid-tier speed.

Gemma 4 26B A4B
Est. monthly $5.70 · ROI score 68 · Coding 77 · Reasoning 81 · Speed 90 · Math 88 · Context 262K · Overall 82
GPQA Diamond 82.3% maps to 82 logic. LiveCodeBench 77.1% maps to 77 coding. AIME 88.3% maps to 88 math. As a 3.8B active MoE, speed is rated 90. MMMU Pro 73.8% yields 78 multimodal.

Qwen2.5 72B Instruct
Est. monthly $18.40 · ROI score 61 · Coding 82 · Reasoning 77 · Speed 65 · Math 88 · Context 131K · Overall 81
Evidence cites HumanEval at 86.6% (mapped to 82 coding) and GPQA at 49.0% (mapped to 70 logic). IFEval is 84.1% (84 instruction) and MATH is 83.1% (88 math). As a 72B heavyweight, speed is standard (65).

Qwen: Qwen3.7 Plus
Est. monthly $32.00 · ROI score 59 · Coding 80 · Reasoning 81 · Speed 80 · Math 82 · Context 1.0M · Overall 81
Evidence lacks exact Qwen3.7 Plus benchmark scores. Inferred mid-tier capabilities (Coding 80, Logic 80) from 'Plus' designation and 71 tok/s throughput (Speed 80). Multimodal inferred at 70. Always-on reasoning noted in lineage.

Qwen: Qwen3.5 Plus 2026-04-20
Est. monthly $30.00 · ROI score 59 · Coding 80 · Reasoning 83 · Speed 70 · Math 80 · Context 1.0M · Overall 81
No exact benchmark percentages provided in evidence. Inferred coding (80) and logic (80) based on 'Plus' mid-tier status and strong comparative claims. Speed (70) reflects 45-51 tok/s throughput. Native reasoning explicitly supported.

Qwen3 235B A22B
Est. monthly $36.40 · ROI score 58 · Coding 65 · Reasoning 83 · Speed 50 · Math 88 · Context 131K · Overall 80
AA Index reports 20.9% coding (mapped to 65) and 88.3% math (mapped to 88). Speed is 35 t/s (mapped to 50). Heavyweight MoE with native thinking mode.

Qwen: Qwen2.5 VL 72B Instruct
Est. monthly $17.50 · ROI score 60 · Coding 82 · Reasoning 75 · Speed 65 · Math 88 · Context 131K · Overall 80
HumanEval 86.6% maps coding to 82. GPQA 49.0% maps logic to 65. IFEval 84.1% maps instruction to 84. GSM8K 95.8% maps math to 88. Heavyweight 72B tier dictates speed ~65 and strong multimodal ~85.

Qwen: Qwen3 VL 30B A3B Instruct
Est. monthly $10.40 · ROI score 63 · Coding 75 · Reasoning 86 · Speed 90 · Math 75 · Context 262K · Overall 80
GPQA 70.1% maps to Logic 85; IFEval 85.8% maps to Instruction 86. As a 30B/3B active MoE mid-tier model, coding and math are inferred ~75. Speed is high (~90) due to sparse architecture.

Qwen: Qwen VL Max
Est. monthly $41.60 · ROI score 57 · Coding 75 · Reasoning 80 · Speed 60 · Math 80 · Context 131K · Overall 79
Primary source: Evidence lists benchmarks as 'not available'. Inferred from 'Max' flagship tier, mapping coding, logic, and math to ~75-80. Speed reflects heavyweight class (~60).

Qwen: Qwen-Plus
Est. monthly $18.20 · ROI score 60 · Coding 74 · Reasoning 83 · Speed 80 · Math 75 · Context 1.0M · Overall 79
No exact Qwen-Plus benchmarks cited; inferred from Qwen-72B proxy (HumanEval 74.2%, MMLU 86.5%, MATH 64.0%). Mapped to mid-tier 0-100 scale. Evidence confirms native reasoning ('reasoning_details' array) and vision support, but lacks caching or batch details.

Meta: Llama 3.1 70B Instruct
Est. monthly $20.00 · ROI score 59 · Coding 80 · Reasoning 84 · Speed 70 · Math 68 · Context 131K · Overall 79
HumanEval 80.5% maps to 80 coding. MMLU 83.6% maps to 83 logic. MATH 68% maps to 68 math. 70B heavyweight tier yields 70 speed. No vision or native reasoning supported.

Qwen: Qwen3 Coder Next
Est. monthly $12.40 · ROI score 61 · Coding 85 · Reasoning 72 · Speed 95 · Math 85 · Context 262K · Overall 78
HumanEval 92.7% and GPQA-D 42.4% map to 85 coding and 55 logic. IFEval 89.6% yields 88 instruction. Speed is 95 based on 162 tok/s. As an 80B (3B active) efficient model, logic is appropriately scaled.

Qwen: Qwen3.5 Plus 2026-02-15
Est. monthly $26.00 · ROI score 57 · Coding 75 · Reasoning 79 · Speed 80 · Math 75 · Context 1.0M · Overall 77
Evidence notes Qwen3.6 Plus scores 78.8 on SWE-bench, implying Qwen3.5 Plus is slightly lower (mapped to 75). OpenRouter confirms 'reasoning' parameter support. Vision supported; estimated 1K tokens at $0.26/M input = $0.00026 per image.

Qwen: Qwen3 VL 32B Instruct
Est. monthly $8.32 · ROI score 62 · Coding 75 · Reasoning 78 · Speed 70 · Math 75 · Context 262K · Overall 76
Evidence lacks exact Instruct scores but cites Qwen2.5 72B (HumanEval 86.6%, GSM8K 95.8%). As a 32B mid-tier model, scores are inferred lower (~75). Multimodal is strong (VL architecture). Speed reflects 32B size.

Qwen2.5 Coder 32B Instruct
Est. monthly $36.40 · ROI score 55 · Coding 85 · Reasoning 75 · Speed 80 · Math 65 · Context 128K · Overall 75
Deepranking.ai reports HumanEval 92.7%, MMLU 75.1%, and MATH 57.2%, mapped to Coding 85, Logic 75, and Math 65. As a 32B mid-tier model, speed is rated 80. No vision or native reasoning features cited.
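
The scoring note for Qwen3.5 Plus 2026-02-15 estimates a per-image fee by treating one image as roughly 1K input tokens at $0.26 per million. The arithmetic can be sketched as below; the function names and the monthly image count are hypothetical, and the one-image-equals-1K-tokens equivalence is that note's estimate, not a published provider price.

```python
def image_fee_usd(tokens_per_image: int, input_price_per_m: float) -> float:
    """Per-image fee when an image is billed as a fixed number of input tokens."""
    return tokens_per_image * input_price_per_m / 1_000_000


def monthly_image_fees(images_per_month: int, tokens_per_image: int,
                       input_price_per_m: float) -> float:
    """Monthly image spend; images_per_month is an illustrative assumption."""
    return images_per_month * image_fee_usd(tokens_per_image, input_price_per_m)


if __name__ == "__main__":
    # Figures from the note: 1K tokens per image at $0.26 per 1M input tokens.
    print(f"{image_fee_usd(1_000, 0.26):.5f}")
    print(f"{monthly_image_fees(10_000, 1_000, 0.26):.2f}")
```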

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files, start the download, and email you a copy if the site owner has enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Open-weight).

Agency accelerator

Whitelabel Open-weight Leaderboard for your site

Embed the interactive open-weight view on your own domain, with whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*

How it works

Methodology: How we rank Open-Weight LLMs

Transparent, benchmark-driven rankings, built with the same craft as our single-model deep dives.

What “open-weight” means on this leaderboard

Open-weight rankings filter to models tagged as open-weight / self-hostable in our qualitative catalog, then score within that subset. This helps teams that want to avoid proprietary API lock-in, a common goal among startups and enterprises in the US, Canada, and Australia evaluating on-prem or dedicated cloud deployments.

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view. It is a good fit when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into the high-intent comparisons people search for every day, with the same interactive calculator and zero signup.

Open Battle Arena
Up to 4 models · Live estimates

Signals & spend

Value analysis

Benchmarks vs. estimated API cost: read the story your CFO cares about.

Hosted API vs. self-host: how to read the trade-offs

Self-hosting shifts cost from tokens to hardware, ops, and reliability. Use hosted estimates here as a directional anchor, then build your TCO model. Australian and Canadian buyers often start with sovereignty requirements; US buyers may weigh velocity of managed APIs against control.

Production deployment

Data Sovereignty & Custom Fine-Tuning

How teams in the US, Canada, and Australia deploy these models in production.

Air-gapped enterprise, HIPAA/SOC2 compliance, and domain adaptation

Open-weight models are often the only practical option for organizations with strict data residency requirements. Healthcare providers in the US, government agencies in Canada, and financial institutions in Australia deploy these models in isolated or air-gapped environments so that no prompt or output data is retained by a third party. They also serve as the foundation for LoRA fine-tuning, which lets teams adapt a model to proprietary domain data by training small adapter weights that can later be merged into the base model.
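
As a rough illustration of why LoRA keeps fine-tuning affordable, the sketch below compares trainable parameter counts for one weight matrix. It follows the standard LoRA formulation, in which a frozen d_out x d_in matrix is adapted by two low-rank factors of rank r. The layer size and rank used here are illustrative assumptions, not figures for any model on this leaderboard.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the full matrix size that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_params(d_out, d_in)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with rank 8.
    print(lora_params(4096, 4096, 8))
    print(f"{trainable_fraction(4096, 4096, 8):.4%}")
```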

Architecture

Self-Hosting vs Managed Endpoints

Strategies to reduce monthly API spend without sacrificing capability.

Total Cost of Ownership (TCO) and GPU provisioning

While the weights are free, GPU compute is not. Teams must weigh the Total Cost of Ownership (TCO) of provisioning AWS or Azure instances against using managed serverless endpoints. This leaderboard displays the API cost of hosted open-weight models, providing a baseline to determine if your token volume justifies the engineering overhead of managing your own vLLM or Ollama infrastructure.
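
One way to use the hosted estimates as a baseline is a simple break-even check: find the monthly token volume at which hosted API spend equals a fixed self-hosting bill. The sketch below is a simplification under stated assumptions. The GPU hourly rate, GPU count, ops overhead, and blended hosted price are all hypothetical inputs, and it treats self-hosting as a flat monthly cost that ignores utilization limits and autoscaling.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month (assumed constant)


def self_host_monthly_cost(gpu_hourly_usd: float, gpu_count: int,
                           ops_monthly_usd: float) -> float:
    """Flat monthly cost of always-on GPUs plus ops and engineering overhead."""
    return gpu_hourly_usd * gpu_count * HOURS_PER_MONTH + ops_monthly_usd


def break_even_tokens_per_month(self_host_usd: float,
                                hosted_price_per_m: float) -> float:
    """Monthly tokens at which hosted API spend equals the self-host bill."""
    if hosted_price_per_m <= 0:
        raise ValueError("hosted_price_per_m must be positive")
    return self_host_usd / hosted_price_per_m * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: 4 GPUs at $2.00/hour, $1,000/month of ops time,
    # and a blended hosted price of $0.50 per 1M tokens.
    cost = self_host_monthly_cost(2.0, 4, 1_000.0)
    print(cost)
    print(break_even_tokens_per_month(cost, 0.5))
```

Below the break-even volume the managed endpoint is cheaper on this simplified model; above it, self-hosting starts to pay for its overhead, provided the cluster can actually serve that throughput.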

Embed-ready

Need this live Open-Weight data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Open-Weight leaderboard on your site in about a minute. Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this Tool
White-label · No code required
United States · Canada · Australia
Live preview

Your visitors compare Open-Weight models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

Not automatically: you must account for GPU, ops, and engineering time. Compare the hosted API estimates here against your self-host TCO; Australian and Canadian teams sometimes prefer open-weight models for data sovereignty.