Interactive leaderboard

Best Coding LLMs 2026: AI Models for Software Engineering

Rank the best coding LLMs in 2026 using SWE-bench signals, HumanEval benchmarks, and live API pricing. Compare autonomous editing and repo-scale tasks for US, Canadian, and Australian engineering orgs.

Compare coding LLMs with benchmarks and real API pricing in 2026

The coding tab prioritizes models that perform on real software engineering tasks—not trivia—then layers estimated monthly API cost for the same workload. Engineering leads from San Francisco to Toronto to Sydney use it to pair a daily-driver model with a cheaper tier for bulk codegen, while keeping vendor options defensible for security and procurement reviews.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.

Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply.

Deep Reasoning / Thinking Mode

When enabled, hidden reasoning (extended thinking) tokens are included in the estimate and charged like output tokens.

Batch Pricing

Enable for 50% off input and output tokens where batch/async pricing applies.

Cached and batch monthly estimates change only after the pipeline sets supports_caching or supports_batch for a model in Supabase. The toggles here narrow the table to models whose catalog entry or provider typically supports those modes.
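The toggle arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the Pricing class, the estimate_monthly_cost function, the per-million-token prices, and the workload numbers are all hypothetical. It also assumes that the cached discount applies to every input token and that cached and batch discounts stack multiplicatively, neither of which the page states.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical list prices in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def estimate_monthly_cost(
    pricing: Pricing,
    requests: int,
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int = 0,
    use_cache: bool = False,
    use_batch: bool = False,
) -> float:
    """Estimate monthly spend for one workload under the pricing toggles."""
    input_rate = pricing.input_per_m
    output_rate = pricing.output_per_m
    if use_cache:
        input_rate *= 0.5  # cached pricing: 50% off input tokens
    if use_batch:
        input_rate *= 0.5  # batch pricing: 50% off input and output
        output_rate *= 0.5
    # Thinking mode: hidden reasoning tokens are billed like output tokens.
    billed_output = output_tokens + reasoning_tokens
    per_request = (input_tokens * input_rate + billed_output * output_rate) / 1_000_000
    return requests * per_request


# Hypothetical model: $2 per 1M input tokens, $10 per 1M output tokens.
demo = Pricing(input_per_m=2.0, output_per_m=10.0)
base = estimate_monthly_cost(demo, requests=5_000, input_tokens=8_000, output_tokens=2_000)
batch = estimate_monthly_cost(
    demo, requests=5_000, input_tokens=8_000, output_tokens=2_000, use_batch=True
)
print(f"list: ${base:.2f}/mo, batch: ${batch:.2f}/mo")
# prints "list: $180.00/mo, batch: $90.00/mo"
```

Passing reasoning_tokens=1_000 to the same call raises the billed output from 2,000 to 3,000 tokens per request, which is why thinking mode can dominate the estimate for output-heavy workloads.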

Magic quadrant (top 15)

X: est. monthly · Y: Coding · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Notes
DeepSeek: DeepSeek V4 Pro | $26.10 | 66 | 98 | 91 | 70 | 88 | 1.0M | 92 | SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE.
OpenAI: o3 Pro | $1,600.00 | 60 | 98 | 92 | 40 | 99 | 200K | 95 | SWE-bench Verified at 69.1% maps to 98 coding. GPQA Diamond at 83.3% maps to 98 logic. AIME 2024 at 91.6% maps to 99 math. As a flagship reasoning model, speed is lower (40).
Anthropic: Claude Opus 4.7 (Fast) | $2,700.00 | 60 | 98 | 97 | 85 | 90 | 1.0M | 95 | SWE-bench Verified 87.6% maps to 98 coding. GPQA Diamond 94.2% maps to 98 logic. MATH 500 80.7% maps to 90 math. Fast-mode variant boosts speed to 85. Frontier tier defaults used for missing metrics.
Anthropic: Claude Sonnet 4.5 | $270.00 | 62 | 98 | 94 | 80 | 94 | 1.0M | 95 | SWE-bench Verified at 77.2% maps to 98 coding. Other benchmarks lack explicit scores in evidence, so logic, math, and instruction are inferred from its frontier Sonnet-class tier. Speed reflects balanced mid-tier latency.
Anthropic: Claude Opus 4.7 | $450.00 | 62 | 98 | 97 | 60 | 95 | 1.0M | 97 | SWE-bench Verified at 87.6% maps to 98 coding. GPQA Diamond at 94.2% maps to 98 logic. Heavyweight Opus tier with 53 tok/s yields 60 speed. Native reasoning supported via OpenRouter reasoning parameter.
Google: Gemini 3.1 Pro Preview Custom Tools | $200.00 | 63 | 98 | 97 | 65 | 95 | 1.0M | 97 | SWE-bench Verified at 80.6% maps to 98 coding. GPQA Diamond at 94.3% maps to 98 logic. As a flagship Pro model, it receives high multimodal (95) and math (95) scores, with standard heavyweight speed (65).
Anthropic: Claude Opus 4.5 | $450.00 | 62 | 98 | 97 | 50 | 95 | 200K | 97 | SWE-bench Verified at 80.9% maps to 98 coding. GPQA Diamond at 87.0% maps to 98 logic. MMMU at 80.7% yields 95 multimodal. As a heavyweight reasoning model with an effort parameter, speed is rated lower at 50.
Kwaipilot: KAT-Coder-Pro V2 | $24.00 | 62 | 98 | 83 | 88 | 75 | 256K | 85 | SWE-bench Verified at 79.6% maps to an exceptional 98 for coding. Artificial Analysis Index of 44 maps to 80 for logic. Speed of 109 tokens/s maps to 88. Flagship 72B MoE model.
OpenAI: GPT-5 | $150.00 | 63 | 98 | 93 | 65 | 95 | 400K | 95 | SWE-bench Verified 74.9% maps to 98 coding. MMLU 92.5% maps to 95 logic. MMMU 84.2% maps to 95 multimodal. Flagship tier speed estimated at 65.
Anthropic: Claude Opus 4.6 (Fast) | $2,700.00 | 60 | 98 | 96 | 70 | 95 | 1.0M | 96 | SWE-bench Verified 82.1% maps to 98 coding. GPQA Diamond 88.5% maps to 96 logic. MATH 94.2% maps to 95 math. MMMLU 91.1% maps to 91 multimodal. Heavyweight tier with 60.5 tok/s yields 70 speed.
OpenAI: o3 Deep Research | $800.00 | 61 | 98 | 92 | 35 | 98 | 200K | 95 | SWE-bench Verified 69.1% maps to 98 coding. GPQA Diamond 83.3% maps to 98 logic. MMMU 82.9% maps to 95 multimodal. Speed is 35 due to extended reasoning times.
OpenAI: GPT-5.1-Codex-Max | $150.00 | 62 | 98 | 93 | 85 | 92 | 400K | 94 | SWE-bench Verified at 77.9% maps to 98 coding. Speed of 84.20 tok/s maps to 85. As a flagship reasoning model, logic and math are inferred high (~92-95). Vision supported; price defaulted to frontier average.
OpenAI: GPT-5.3-Codex | $210.00 | 60 | 98 | 90 | 60 | 85 | 400K | 91 | SWE-Bench Pro at 56.8% maps to 98 coding. MMLU at 90.2% maps to 92 logic. MMMU Pro at 83.2% maps to 88 multimodal. Flagship agentic model with native reasoning.
Z.ai: GLM 5.1 | $70.00 | 64 | 98 | 93 | 65 | 97 | 203K | 95 | SWE-bench Pro top score (GLM-5 had 77.8, +3.3 gain) maps to 98 coding. GPQA Diamond 86.2% maps to 95 logic. AIME 95.3% maps to 97 math. Heavyweight MoE tier yields 65 speed.
Claude Opus 4.6 | $450.00 | 62 | 98 | 96 | 50 | 95 | 1.0M | 96 | SWE-bench Verified at 82.1% maps to 98 coding. GPQA Diamond at 91.3% maps to 96 logic. MATH at 94.2% maps to 95 math. As a flagship reasoning model, speed is set to 50.
Anthropic: Claude Opus 4.1 | $1,350.00 | 61 | 98 | 97 | 45 | 95 | 200K | 97 | SWE-bench Verified at 74.5% maps to near-perfect coding (98). GPQA Diamond at 80.9% dictates exceptional logic (98). As a flagship 'Opus' model with extended thinking, speed is lower (45). Vision price defaulted to frontier tier.
Anthropic: Claude Sonnet 4 | $270.00 | 60 | 98 | 93 | 75 | 85 | 1.0M | 92 | SWE-bench Verified 72.7% maps to 98 coding. GPQA Diamond 75.5% maps to 92 logic. MMMU 77.6% maps to 88 multimodal. Mid-tier Sonnet class balances high capability with moderate speed.
OpenAI: GPT-5.2 Pro | $2,520.00 | 61 | 98 | 97 | 50 | 95 | 400K | 97 | SWE-bench Verified at 80.0% maps to 98 coding. GPQA Diamond at 93.2% maps to 98 logic. MMMU-Pro at 80.4% yields 95 multimodal. As a flagship reasoning model, speed is moderate (50).
OpenAI: GPT-5.2-Codex | $210.00 | 60 | 96 | 90 | 65 | 88 | 400K | 91 | Evidence lacks raw benchmarks. Inferred as a frontier coding-specialized model (GPT-5.2 tier). Assigned high coding (96) and logic (90) based on software engineering optimization claims. Speed (65) reflects heavyweight architecture.
MiniMax: MiniMax M2.5 | $15.00 | 64 | 96 | 81 | 85 | 80 | 205K | 85 | SWE-bench Verified at 80.2% maps to 96 for coding. Global MMLU at 81.4% maps to 82 for logic. Speed of 100 tokens/s maps to 85. Flagship 230B MoE model with native reasoning support.
Anthropic: Claude Opus 4 | $1,350.00 | 57 | 96 | 89 | 60 | 88 | 200K | 90 | SWE-bench 72.5% maps to 96 coding. GPQA 50.5% maps to 85 logic. IFEval 92.1% maps to 92 instruction. MATH 77% maps to 88 math. MMMU 76.5% maps to 85 multimodal. Flagship tier dictates 60 speed.
Anthropic: Claude Opus Latest | $450.00 | 58 | 96 | 89 | 60 | 88 | 1.0M | 90 | SWE-bench Verified at 74.5% maps coding to 96. GPQA at 50.5% maps logic to 85. IFEval (92.1%) and MATH (77%) inform instruction and math scores. As a flagship model, speed is moderate (60).
Google: Gemini 2.5 Pro Preview 06-05 | $150.00 | 63 | 96 | 93 | 45 | 96 | 1.0M | 95 | SWE-bench (59.6%) maps to 96 coding. GPQA (86.4%) maps to 96 logic. AIME (88.0%) maps to 96 math. MMMU (82.0%) maps to 90 multimodal. Flagship tier model with native reasoning; speed adjusted for thinking overhead.
Nex AGI: DeepSeek V3.1 Nex N1 | $10.40 | 70 | 96 | 90 | 70 | 90 | 131K | 92 | Flagship tier. SWE-bench Verified at 70.6 maps to 96 coding. BFCL v4 at 65.3 maps to 90 instruction. Logic and math inferred high (90) from flagship status. Speed set to 70 for heavy MoE.
Qwen: Qwen3.6 Max Preview | $104.00 | 64 | 96 | 95 | 55 | 97 | 262K | 96 | Based on Qwen3.6 Plus scoring SWE-bench 78.8% and GPQA 90.4%, Max (1T MoE flagship) maps to 96 for coding and logic. Native reasoning is supported via <think> tags. Vision defaults to frontier pricing.
OpenAI: GPT-5.2 | $210.00 | 63 | 96 | 96 | 50 | 98 | 400K | 97 | SWE-bench Verified 68.1% maps to 96 coding. GPQA 81.4% maps to 97 logic. MMMU 81.6% maps to 95 multimodal. Frontier reasoning model; speed estimated at 50.
Qwen: Qwen3 Max | $70.20 | 63 | 96 | 91 | 55 | 95 | 262K | 93 | SWE-bench Verified at 69.6% maps to 96 coding. SuperGPQA at 65.1% maps to 92 logic. AIME 2025 at 81.6% maps to 95 math. Flagship tier model; speed mapped to 55 based on 26 tok/s throughput.
o3 | $160.00 | 62 | 96 | 91 | 40 | 98 | 200K | 94 | SWE-bench Verified at 69.1% maps to 96 coding. GPQA Diamond at 87.7% maps to 97 logic. AIME 2024 at 96.7% maps to 98 math. Speed is 40 due to extended reasoning times.
MoonshotAI: Kimi K2 Thinking | $49.00 | 59 | 95 | 80 | 45 | 80 | 262K | 84 | SWE-bench Verified (71.3%) maps to 95 coding. GPQA-Diamond (48.1%) maps to 75 logic. MATH (70.2%) maps to 80. As a 1T MoE reasoning model, speed is mapped to 45. No vision support found.
Anthropic Claude Sonnet Latest | $270.00 | 62 | 95 | 94 | 80 | 96 | 1.0M | 95 | Claude 3.7 Sonnet scores SWE-bench Verified 72.7% (Coding: 95), GPQA Diamond 84.8% (Logic: 95), MATH 96.2% (Math: 96), IFEval 93.2% (Instruction: 93), MMMU 75% (Multimodal: 75). Mid-tier speed (~80). Features native extended thinking.
Qwen: Qwen3.5-122B-A10B | $31.20 | 67 | 95 | 94 | 45 | 95 | 262K | 95 | SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45.
Claude Sonnet 4.6 | $270.00 | 61 | 95 | 94 | 80 | 92 | 1.0M | 94 | SWE-bench Verified 79.6% maps to 95 coding. GPQA 89.9% maps to 95 logic. MMMLU 89.3% maps to 90 multimodal. Sonnet-class speed maps to 80.
Xiaomi: MiMo-V2-Pro | $70.00 | 62 | 95 | 90 | 60 | 88 | 1.0M | 91 | SWE-bench Verified at 78.0% maps to 95 coding. GPQA Diamond at 87.0% maps to 95 logic. IFBench at 68.8% maps to 85 instruction. As a 1T+ flagship, speed is lower (60).
Anthropic: Claude Opus 4.8 (Fast) | $900.00 | 56 | 95 | 89 | 85 | 77 | 1.0M | 88 | Claude 4 Opus proxy yields SWE-bench 72.5% (mapped to 95 coding) and MMLU 86% (mapped to 86 logic). IFEval 92.1% maps to 92 instruction. Fast variant throughput of 85 tok/s maps to 85 speed.
MoonshotAI: Kimi K2 0905 | $49.00 | 63 | 95 | 90 | 70 | 90 | 262K | 91 | SWE-bench Verified at 69.2% maps to a 95 coding score. As a 1T parameter flagship MoE, logic, math, and instruction are inferred around 90. No vision, caching, or native reasoning documented for this specific non-thinking SKU.
OpenAI: GPT-5.4 | $250.00 | 60 | 95 | 91 | 70 | 88 | 1.1M | 91 | Evidence states GPT-5.4 outperforms GPT-4.1 (SWE-bench Verified 54.6%, GPQA 66.3%). Mapped coding to 95 and logic to 92 for this frontier flagship. Multimodal inferred from native computer-use screenshots and GPT-4.1's MMMU 74.8%.
Arcee AI: Trinity Large Thinking | $17.30 | 69 | 95 | 94 | 45 | 98 | 262K | 95 | SWE-bench Verified at 63.2% maps to 95 coding. GPQA-Diamond at 76.3% maps to 95 logic. AIME 2025 at 96.3% maps to 98 math. As a heavy reasoning model (398B MoE), speed is mapped to 45.
Mistral: Devstral 2 2512 | $36.00 | 61 | 95 | 85 | 75 | 80 | 262K | 86 | SWE-bench Verified at 72.2% maps to 95 coding. GPQA at 59.4% maps to 85 logic. Speed of 74 tok/s maps to 75. Heavyweight 123B model; no tier adjustment needed.
OpenAI: GPT-5.4 Image 2 | $470.00 | 61 | 95 | 96 | 60 | 95 | 272K | 95 | GPT-5.4 scores 81.2% on MMMU-Pro (Multimodal ~90). As a frontier reasoning model with native computer-use (OSWorld 75%), Logic and Coding map to ~95. Speed is ~60 due to chain-of-thought overhead.
DeepSeek: DeepSeek V3.2 Speciale | $15.79 | 70 | 95 | 97 | 45 | 96 | 164K | 96 | GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead.
Anthropic: Claude Opus 4.8 | $450.00 | 58 | 95 | 89 | 50 | 85 | 1.0M | 89 | SWE-bench 72.5% maps to 95 coding. GPQA 50.5% maps to 85 logic. IFEval 92.1% maps to 92 instruction. MATH 77% maps to 85 math. MMMU 76.5% maps to 85 multimodal. Flagship tier speed is ~50.
MoonshotAI: Kimi K2.5 | $35.00 | 66 | 95 | 93 | 65 | 98 | 262K | 95 | SWE-bench Verified 76.8% maps to 95 coding. GPQA Diamond 87.9% maps to 96 logic. AIME 2025 96.1% maps to 98 math. 1T MoE flagship tier; native reasoning supported.
Qwen: Qwen3.6 Plus | $32.50 | 65 | 95 | 91 | 70 | 92 | 1.0M | 92 | OpenRouter docs cite a 78.8% on SWE-bench Verified, mapping to a 95 coding score. As a reasoning-enabled Plus MoE model, logic and math score highly (~92), while speed is balanced (~70). Multimodal capabilities are explicitly noted.
OpenAI: GPT-5 Chat | $150.00 | 62 | 95 | 92 | 65 | 92 | 128K | 93 | SWE-bench Verified 74.9% maps to 95 coding. MMLU 92.5% maps to 93 logic. MMMU 84.2% maps to 88 multimodal. Flagship tier model with native 'Thinking' mode and high reasoning capabilities.
Google: Gemini 2.5 Pro Preview 05-06 | $150.00 | 62 | 95 | 93 | 45 | 95 | 1.0M | 94 | Evidence notes it outperforms Claude 3.5 Sonnet on SWE-bench Verified, GPQA, and MMMU, though exact percentages are omitted. Mapped to frontier-level 95s for coding, logic, and multimodal. Speed is reduced due to mandatory native thought reasoning.
OpenAI: o3 Mini | $88.00 | 64 | 95 | 93 | 65 | 98 | 200K | 95 | SWE-bench Verified (69.1%) maps to 95 coding. GPQA (83.3%) maps to 92 logic. Despite being a 'Mini' tier model, its native reasoning capabilities yield flagship-level STEM scores, though speed is balanced for chain-of-thought generation.
Xiaomi: MiMo-V2.5-Pro | $26.10 | 65 | 95 | 91 | 65 | 88 | 1.0M | 91 | SWE-bench Verified 78.9% maps to 95 coding. GPQA Diamond 66.7% maps to 92 logic. Flagship tier model with native reasoning and caching support.
DeepSeek: DeepSeek V3.1 Terminus | $20.30 | 67 | 95 | 93 | 70 | 92 | 164K | 93 | SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found.

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files and start the download—and to email you a copy if the site owner enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Coding).

Agency accelerator

Whitelabel Coding Leaderboard for your site

Embed the interactive coding view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*

How it works

Methodology: How we rank Coding LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

Coding rank inputs: code benchmarks vs. list pricing

Our 2026 coding rankings normalize data from SWE-bench Verified, HumanEval, and related code benchmarks. We prioritize models that demonstrate autonomous multi-file editing, refactors, and complex logic under real API constraints—so engineering teams in the US, Canada, and Australia can compare developer throughput, not marketing claims.

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Open Battle Arena · Up to 4 models · Live estimates

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Choosing a coding model when API spend is capped

A high coding score does not automatically mean “affordable at scale.” Use the scatter plot to see which models sit on the efficient frontier: a strong coding score at a controlled monthly estimate. Agencies delivering fixed-fee builds in the US and Australia especially benefit from documenting this trade-off for clients; Canadian teams often add data-residency requirements before final selection.
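One way to read "efficient frontier" here is as the set of models that no other model beats on both cost and coding score. The sketch below is an illustrative assumption about how such a set could be computed, not the method behind the scatter plot; the Model type and efficient_frontier function are hypothetical names, and the five rows are copied from the leaderboard table on this page.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    monthly_usd: float
    coding: int


def efficient_frontier(models: list[Model]) -> list[Model]:
    """Return models not dominated on (lower cost, higher coding score)."""
    frontier: list[Model] = []
    best_coding = -1
    # Walk from cheapest to priciest; on a cost tie, the higher score goes first.
    for m in sorted(models, key=lambda m: (m.monthly_usd, -m.coding)):
        # A pricier model earns a place only if it scores strictly higher.
        if m.coding > best_coding:
            frontier.append(m)
            best_coding = m.coding
    return frontier


rows = [
    Model("Nex AGI: DeepSeek V3.1 Nex N1", 10.40, 96),
    Model("MiniMax: MiniMax M2.5", 15.00, 96),
    Model("Kwaipilot: KAT-Coder-Pro V2", 24.00, 98),
    Model("DeepSeek: DeepSeek V4 Pro", 26.10, 98),
    Model("OpenAI: o3 Pro", 1600.00, 98),
]
for m in efficient_frontier(rows):
    print(f"{m.name}: ${m.monthly_usd:.2f}/mo, coding {m.coding}")
```

On these five rows the sketch keeps only the Nex AGI and Kwaipilot entries: every other model costs more than one of them without scoring higher on coding. A real selection would also weigh reasoning, speed, context, and procurement constraints, which this two-axis view ignores.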

Production deployment

Software Engineering & Autonomous Coding

How teams in the US, Canada, and Australia deploy these models in production.

IDE integrations, PR reviewers, and SWE-bench performance

Top-tier coding models are increasingly deployed as autonomous agents that can navigate entire repositories, review pull requests, and generate complex test suites. Engineering organizations in the US and Canada use this leaderboard to select models for custom VS Code extensions, automated CI/CD code review pipelines, and legacy codebase migrations where zero data retention policies are mandatory.

Architecture

Optimizing Developer Tooling Costs

Strategies to reduce monthly API spend without sacrificing capability.

Context pruning and batching CI/CD reviews

Feeding entire codebases into an LLM context window is prohibitively expensive. Teams cut costs by using AST-based context pruning, embedding-based code retrieval (RAG for code), and shifting non-blocking tasks like nightly code audits to Batch APIs discounted by 50%. Evaluate the context window and batch eligibility columns here to plan your developer tooling architecture.
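As a rough illustration of AST-based context pruning, the sketch below uses Python's standard ast module to keep only the imports and the named top-level definitions from a source file before it is sent to a model. The prune_to_symbols function and the SAMPLE source are hypothetical. How a team decides which symbols are relevant (for example, from a diff or a call graph) is outside this sketch.

```python
import ast


def prune_to_symbols(source: str, wanted: set[str]) -> str:
    """Keep imports plus the top-level functions/classes named in `wanted`."""
    tree = ast.parse(source)
    kept: list[str] = []
    for node in tree.body:
        is_import = isinstance(node, (ast.Import, ast.ImportFrom))
        is_wanted_def = (
            isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
            and node.name in wanted
        )
        if is_import or is_wanted_def:
            # Recover the exact original text for this node.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                kept.append(segment)
    return "\n\n".join(kept)


SAMPLE = '''import math

def area(r):
    return math.pi * r * r

def unused_helper():
    return "not needed for this review"
'''

pruned = prune_to_symbols(SAMPLE, {"area"})
print(pruned)
```

Here the prompt would carry the import and the area function but not unused_helper, so fewer input tokens are billed per request. Note that ast.get_source_segment starts at the def line, so decorators would need separate handling.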

Embed-ready

Need this live Coding data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Coding leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this Tool · White-label · No code required
United States · Canada · Australia
Live preview

Your visitors compare Coding models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

It depends on your workload tokens and whether you use batch or cached pricing. Use the workload sliders above to compare estimated monthly spend; teams in the US, Canada, and Australia often pair a fast “daily driver” model with a cheaper model for bulk codegen once they see blended costs here.
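The blended cost mentioned above can be approximated with a weighted average. This is a simplified sketch under a stated assumption: each model's monthly estimate scales linearly with the share of the workload routed to it. The blended_monthly_cost function and the 50/50 split are hypothetical; the two dollar figures are the table estimates for Anthropic: Claude Sonnet 4.5 and MiniMax: MiniMax M2.5.

```python
def blended_monthly_cost(primary_usd: float, bulk_usd: float, bulk_share: float) -> float:
    """Blend two full-workload estimates by the share of work sent to the bulk model."""
    if not 0.0 <= bulk_share <= 1.0:
        raise ValueError("bulk_share must be between 0 and 1")
    return (1.0 - bulk_share) * primary_usd + bulk_share * bulk_usd


# Daily driver at $270.00/mo, bulk codegen model at $15.00/mo, half the work on each.
blended = blended_monthly_cost(primary_usd=270.00, bulk_usd=15.00, bulk_share=0.5)
print(f"blended estimate: ${blended:.2f}/mo")
# prints "blended estimate: $142.50/mo"
```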