Interactive leaderboard

Best Reasoning LLMs 2026: AI Models for Logic & Multi-Step Tasks

Find top reasoning LLMs in 2026: logic, instruction-following, native thinking modes, and estimated API cost. Ideal for agents, analytics, and complex workflows across the US, Canada, and Australia.

Reasoning-focused LLM rankings with transparent API economics in 2026

Reasoning models power agents, planning, and long chains of tool use—but they often bill more output tokens when “thinking” is enabled. This view highlights logic-heavy capability while showing what that sophistication costs per month for your token mix, helping operators in the US, Canada, and Australia avoid surprise overages.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees in cost estimates for vision-capable models.

Turn On to include image fees.

OffOn

Use Cached Pricing

Enable to get 50% off input tokens where cached rates apply

OffOn

Deep Reasoning / Thinking Mode

Model hidden reasoning / extended thinking charged like output tokens when enabled.

OffOn

Batch Pricing

Enable for 50% off input & output where batch/async pricing applies

OffOn
≈ $100.00/mo
8K
1K1.0M
≈ $100.00/mo
2K
100500K
≈ $200.00 total
5K
10100K

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.

Magic quadrant (top 15)

X: est. monthly · Y: Reasoning · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 365 models.

PickModelEst. monthlyROI scoreCodingReasoningSpeedMathContextOverall
Anthropic: Claude Opus 4.7 (Fast)$2,700.0060
98
97
85
90
1.0M
95SWE-bench Verified 87.6% maps to 98 coding. GPQA Diamond 94.2% maps to 98 logic. MATH 500 80.7% maps to 90 math. Fast-mode variant boosts speed to 85. Frontier tier defaults used for missing metrics.
Anthropic: Claude Opus 4.7$450.0062
98
97
60
95
1.0M
97SWE-bench Verified at 87.6% maps to 98 coding. GPQA Diamond at 94.2% maps to 98 logic. Heavyweight Opus tier with 53 tok/s yields 60 speed. Native reasoning supported via OpenRouter reasoning parameter.
Google: Gemini 3.1 Pro Preview Custom Tools$200.0063
98
97
65
95
1.0M
97SWE-bench Verified at 80.6% maps to 98 coding. GPQA Diamond at 94.3% maps to 98 logic. As a flagship Pro model, it receives high multimodal (95) and math (95) scores, with standard heavyweight speed (65).
DeepSeek: DeepSeek V3.2 Speciale$15.7970
95
97
45
96
164K
96GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead.
Anthropic: Claude Opus 4.5$450.0062
98
97
50
95
200K
97SWE-bench Verified at 80.9% maps to 98 coding. GPQA Diamond at 87.0% maps to 98 logic. MMMU at 80.7% yields 95 multimodal. As a heavyweight reasoning model with an effort parameter, speed is rated lower at 50.
Anthropic: Claude Opus 4.1$1,350.0061
98
97
45
95
200K
97SWE-bench Verified at 74.5% maps to near-perfect coding (98). GPQA Diamond at 80.9% dictates exceptional logic (98). As a flagship 'Opus' model with extended thinking, speed is lower (45). Vision price defaulted to frontier tier.
OpenAI: GPT-5.2 Pro$2,520.0061
98
97
50
95
400K
97SWE-bench Verified at 80.0% maps to 98 coding. GPQA Diamond at 93.2% maps to 98 logic. MMMU-Pro at 80.4% yields 95 multimodal. As a flagship reasoning model, speed is moderate (50).
OpenAI: GPT-5.2$210.0063
96
96
50
98
400K
97SWE-bench Verified 68.1% maps to 96 coding. GPQA 81.4% maps to 97 logic. MMMU 81.6% maps to 95 multimodal. Frontier reasoning model; speed estimated at 50.
OpenAI: GPT-5.4 Image 2$470.0061
95
96
60
95
272K
95GPT-5.4 scores 81.2% on MMMU-Pro (Multimodal ~90). As a frontier reasoning model with native computer-use (OSWorld 75%), Logic and Coding map to ~95. Speed is ~60 due to chain-of-thought overhead.
Anthropic: Claude Opus 4.6 (Fast)$2,700.0060
98
96
70
95
1.0M
96SWE-bench Verified 82.1% maps to 98 coding. GPQA Diamond 88.5% maps to 96 logic. MATH 94.2% maps to 95 math. MMMLU 91.1% maps to 91 multimodal. Heavyweight tier with 60.5 tok/s yields 70 speed.
Claude Opus 4.6$450.0062
98
96
50
95
1.0M
96SWE-bench Verified at 82.1% maps to 98 coding. GPQA Diamond at 91.3% maps to 96 logic. MATH at 94.2% maps to 95 math. As a flagship reasoning model, speed is set to 50.
Qwen: Qwen3.6 Max Preview$104.0064
96
95
55
97
262K
96Based on Qwen3.6 Plus scoring SWE-bench 78.8% and GPQA 90.4%, Max (1T MoE flagship) maps to 96 for coding and logic. Native reasoning is supported via <think> tags. Vision defaults to frontier pricing.
Anthropic Claude Sonnet Latest$270.0062
95
94
80
96
1.0M
95Claude 3.7 Sonnet scores SWE-bench Verified 72.7% (Coding: 95), GPQA Diamond 84.8% (Logic: 95), MATH 96.2% (Math: 96), IFEval 93.2% (Instruction: 93), MMMU 75% (Multimodal: 75). Mid-tier speed (~80). Features native extended thinking.
Qwen: Qwen3.5-122B-A10B$31.2067
95
94
45
95
262K
95SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45.
Claude Sonnet 4.6$270.0061
95
94
80
92
1.0M
94SWE-bench Verified 79.6% maps to 95 coding. GPQA 89.9% maps to 95 logic. MMMLU 89.3% maps to 90 multimodal. Sonnet-class speed maps to 80.
Anthropic: Claude Sonnet 4.5$270.0062
98
94
80
94
1.0M
95SWE-bench Verified at 77.2% maps to 98 coding. Other benchmarks lack explicit scores in evidence, so logic, math, and instruction are inferred from its frontier Sonnet-class tier. Speed reflects balanced mid-tier latency.
Arcee AI: Trinity Large Thinking$17.3069
95
94
45
98
262K
95SWE-bench Verified at 63.2% maps to 95 coding. GPQA-Diamond at 76.3% maps to 95 logic. AIME 2025 at 96.3% maps to 98 math. As a heavy reasoning model (398B MoE), speed is mapped to 45.
DeepSeek: R1 0528$41.5066
92
94
45
98
164K
95GPQA 81.0% maps to 98 logic. AIME 2024 91.4% maps to 98 math. LiveCodeBench 73.3% maps to 92 coding. Speed reflects 287 c/s but heavy reasoning overhead (23K thinking tokens).
OpenAI: GPT-5.5 Pro$3,000.0058
90
94
60
92
1.1M
92LLM Benchmarks reports 94.8 overall score, mapped to 95 logic. Outperforms in GPQA and MathVista. As a 1T parameter Pro flagship, coding and math are estimated at 90-92. Speed is standard for heavyweights (60).
xAI: Grok 3 Beta$270.0061
92
93
50
98
131K
94GPQA Diamond 84.6% and AIME 93.3% (Think mode) map to near-max logic/math. LiveCodeBench 79.4% maps to high coding. MMMU 78% confirms strong vision. Flagship tier; speed inferred moderate due to reasoning.
Qwen: Qwen3 Next 80B A3B Thinking$11.7070
90
93
45
97
262K
93GPQA at 77.2% maps to 96 logic. IFEval at 88.9% maps to 90 instruction. AIME 2025 at 87.8% maps to 97 math. As an 80B thinking model, speed is lower (45). No SWE-bench cited; coding estimated at 90.
Google: Gemini 2.5 Pro Preview 06-05$150.0063
96
93
45
96
1.0M
95SWE-bench (59.6%) maps to 96 coding. GPQA (86.4%) maps to 96 logic. AIME (88.0%) maps to 96 math. MMMU (82.0%) maps to 90 multimodal. Flagship tier model with native reasoning; speed adjusted for thinking overhead.
MoonshotAI: Kimi K2.5$35.0066
95
93
65
98
262K
95SWE-bench Verified 76.8% maps to 95 coding. GPQA Diamond 87.9% maps to 96 logic. AIME 2025 96.1% maps to 98 math. 1T MoE flagship tier; native reasoning supported.
OpenAI: o3 Mini$88.0064
95
93
65
98
200K
95SWE-bench Verified (69.1%) maps to 95 coding. GPQA (83.3%) maps to 92 logic. Despite being a 'Mini' tier model, its native reasoning capabilities yield flagship-level STEM scores, though speed is balanced for chain-of-thought generation.
OpenAI: o1$1,200.0059
88
93
40
98
200K
93SWE-bench Verified 48.9% maps to 88 coding. GPQA 78% maps to 96 logic. MMMU 77.6% maps to 85 multimodal. Speed is 40 due to extended reasoning times. Flagship tier.
OpenAI: GPT-5 Pro$1,800.0059
92
93
77
95
400K
93Evidence lacks exact GPT-5 Pro benchmark scores, so mapped from flagship reasoning tier (Pro/o1-class). Speed mapped from cited 77.4 tps. High coding/logic reflect its deep reasoning mode and 'most advanced model' status.
OpenAI: o1-pro$12,000.0057
85
93
40
98
200K
92GPQA 78.0% maps to 95 logic. SWE-bench Verified 48.9% maps to 85 coding. GSM8K 97.1% maps to 98 math. MMMU 77.6% maps to 85 multimodal. Speed is 40 due to heavy reasoning architecture.
DeepSeek: DeepSeek V3.1$16.3068
90
93
75
92
164K
92GPQA Diamond at 74.9% maps to 95 logic. AIME 2025 at 49.8% maps to 92 math. SWE-bench Verified noted as strength, mapping to 90 coding. Flagship 671B MoE tier yields ~75 speed.
Google: Gemini 2.5 Pro Preview 05-06$150.0062
95
93
45
95
1.0M
94Evidence notes it outperforms Claude 3.5 Sonnet on SWE-bench Verified, GPQA, and MMMU, though exact percentages are omitted. Mapped to frontier-level 95s for coding, logic, and multimodal. Speed is reduced due to mandatory native thought reasoning.
OpenAI: GPT-5$150.0063
98
93
65
95
400K
95SWE-bench Verified 74.9% maps to 98 coding. MMLU 92.5% maps to 95 logic. MMMU 84.2% maps to 95 multimodal. Flagship tier speed estimated at 65.
OpenAI: GPT-5 Image$500.0059
92
93
65
90
400K
92No raw benchmarks provided. Inferred scores based on GPT-5 flagship tier status and claims of major improvements in reasoning and code quality, mapping to frontier-level 90+ scores.
OpenAI: GPT-5.1-Codex-Max$150.0062
98
93
85
92
400K
94SWE-bench Verified at 77.9% maps to 98 coding. Speed of 84.20 tok/s maps to 85. As a flagship reasoning model, logic and math are inferred high (~92-95). Vision supported; price defaulted to frontier average.
Qwen: Qwen3 235B A22B Instruct 2507$4.6076
92
93
70
92
262K
92SWE-bench Verified 55.6% maps to 92 coding. GPQA 77.5% maps to 95 logic. IFEval 93.3% maps to 90 instruction. Flagship 235B MoE tier yields 70 speed.
Z.ai: GLM 5.1$70.0064
98
93
65
97
203K
95SWE-bench Pro top score (GLM-5 had 77.8, +3.3 gain) maps to 98 coding. GPQA Diamond 86.2% maps to 95 logic. AIME 95.3% maps to 97 math. Heavyweight MoE tier yields 65 speed.
DeepSeek: DeepSeek V3.1 Terminus$20.3067
95
93
70
92
164K
93SWE Verified 68.4% maps to 95 coding. GPQA-Diamond 80.7% maps to 95 logic. Flagship tier model with native thinking mode; no vision or caching evidence found.
OpenAI: GPT-5.4 Pro$3,000.0058
95
93
60
92
1.1M
93GPT-5.4 Pro lacks exact SWE-bench/GPQA scores but beats Gemini 3.1 Pro on SWE-Bench Pro and MMMU-Pro. Inferred as a flagship model: Coding ~95, Logic ~92 (GDPval 83%). Multimodal inferred ~85.
Anthropic: Claude Sonnet 4$270.0060
98
93
75
85
1.0M
92SWE-bench Verified 72.7% maps to 98 coding. GPQA Diamond 75.5% maps to 92 logic. MMMU 77.6% maps to 88 multimodal. Mid-tier Sonnet class balances high capability with moderate speed.
NVIDIA: Nemotron 3 Ultra$45.0065
95
93
70
95
1.0M
94SWE-bench at 49.0% maps to 95 coding. MMLU at 89.1% maps to 90 logic. IFBench at 99.2% maps to 95 instruction. AIME25 at 89.1% maps to 95 math. Flagship 550B MoE tier with high throughput yields 70 speed.
Qwen: Qwen3 235B A22B Thinking 2507$5.0075
88
93
60
95
262K
92GPQA 81.1% maps to Logic 95. MMLU-Pro 84.4% and HMMT25 83.9% map to Instruction 90 and Math 95. Coding inferred high (88) via LiveCodeBench. Speed 55 tok/s maps to 60. Flagship 235B reasoning model.
Qwen: Qwen3 Max Thinking$70.2063
90
93
45
96
262K
93Evidence lacks exact Qwen3 Max Thinking scores but notes it's the flagship reasoning model. Inferred from Qwen2.5 72B (HumanEval 86.6%, GSM8K 95.8%) and VL 32B comparisons; mapped Coding to 90, Logic to 92. Speed reflects heavy reasoning tier.
Qwen: Qwen3.5-Flash$5.2075
88
92
95
95
1.0M
92SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models.
OpenAI: GPT-5.1-Codex$150.0061
88
92
60
96
400K
92PricePerToken cites GPQA 86.0%, Math 95.7%, and Coding 36.6% (likely SWE-bench). Mapped GPQA to Logic 94, Coding to 88, and Math to 96. Speed reflects 27.4 tok/s throughput.
Z.ai: GLM 5$43.2063
95
92
55
84
203K
91SWE-bench Verified 77.8 maps to 95 coding. GPQA Diamond 86.0 maps to 96 logic. IFEval 88.0 maps to 88 instruction. AIME 2025 84.0 maps to 84 math. As a 744B flagship MoE, speed is mapped to 55.
OpenAI: o3 Pro$1,600.0060
98
92
40
99
200K
95SWE-bench Verified at 69.1% maps to 98 coding. GPQA Diamond at 83.3% maps to 98 logic. AIME 2024 at 91.6% maps to 99 math. As a flagship reasoning model, speed is lower (40).
Qwen: Qwen3.5-27B$23.4065
88
92
85
90
262K
90SWE-bench Verified at 72.4% maps to 88 coding. GPQA Diamond at 85.5% maps to 88 logic. IFEval 95.0% maps to 95 instruction. As a 27B mid-tier model, speed is rated 85.
OpenAI: GPT-5 Chat$150.0062
95
92
65
92
128K
93SWE-bench Verified 74.9% maps to 95 coding. MMLU 92.5% maps to 93 logic. MMMU 84.2% maps to 88 multimodal. Flagship tier model with native 'Thinking' mode and high reasoning capabilities.
OpenAI: o3 Deep Research$800.0061
98
92
35
98
200K
95SWE-bench Verified 69.1% maps to 98 coding. GPQA Diamond 83.3% maps to 98 logic. MMMU 82.9% maps to 95 multimodal. Speed is 35 due to extended reasoning times.
DeepSeek: DeepSeek V4 Pro$26.1066
98
91
70
88
1.0M
92SWE-bench Verified at 81.0% maps to 98 coding. GPQA Diamond at 66.3% maps to 92 logic. Flagship MoE tier justifies high scores; speed is standard for large MoE.

Need a shareable artifact?

Get a print-ready PDF of your results and a CSV spreadsheet. Tap the button, then enter your work email. We use it to build your files and start the download—and to email you a copy if the site owner enabled that.

AI ROI Leaderboard & Discovery by LeadsCalc

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Reasoning).

Instant setup
No CC required

By submitting, you agree to our Privacy Policy and Terms.

Agency accelerator

Whitelabel Reasoning Leaderboard
for your site

Embed the interactive reasoning view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*
GET STARTED

NO CREDIT CARD REQUIRED

How it works

Methodology: How we rank Reasoning LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

What counts as “reasoning” in our ranking signals

Our reasoning rankings prioritize models that excel at complex logic, multi-step instruction following, and zero-shot problem solving. We heavily weight benchmarks like GPQA (Graduate-Level Google-Proof Q&A) and MATH, while also factoring in native extended thinking capabilities (such as OpenAI's o1-style Chain of Thought or DeepSeek's R1 reasoning tokens). We normalize these scores so engineering teams can directly compare the cognitive capabilities of frontier models against cost-optimized alternatives.

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Open Battle ArenaUp to 4 models · Live estimates
Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

When to pay for a dedicated reasoning SKU

If your workload is mostly short chat, a flagship reasoning model may be overspend. If you run multi-step analysis, compliance checks, or agentic flows, the reasoning column helps justify the premium. Teams in Canada and Australia frequently validate latency and residency after shortlisting here; US teams often pair a reasoning core with a low-latency edge model.

Production deployment

Complex Logic & Agentic Workflows

How teams in the US, Canada, and Australia deploy these models in production.

Legal analysis, medical triage, and o1-style Chain of Thought

Models with native extended thinking (like OpenAI's o-series or DeepSeek's R1-class) excel at zero-shot problem solving that previously required complex agent orchestration. Enterprises in the US and Australia deploy these high-reasoning models for automated contract analysis, regulatory compliance auditing, and multi-step data extraction where logical accuracy is strictly prioritized over latency.

Architecture

Managing Reasoning Token Economics

Strategies to reduce monthly API spend without sacrificing capability.

Controlling max_thinking_tokens and fallback strategies

Because reasoning models bill internal 'thinking' tokens at the output rate, costs can spiral unpredictably. Optimize by setting strict `max_completion_tokens` limits, using prompt engineering to constrain unnecessary verbosity, and building fallback logic that routes simpler sub-tasks to cheaper, non-reasoning models. Use our deep-thinking toggle to forecast these exact token economics.

Embed-ready

Need this live Reasoning data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Reasoning leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this ToolWhite-label · No code required
United StatesCanadaAustralia
Live preview

Your visitors compare Reasoning models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

For autonomous agents or complex data extraction, models with native 'thinking' capabilities (like OpenAI o1 or DeepSeek R1) typically perform best. However, they bill 'thinking' tokens as output tokens, which can get expensive. Teams in the US, Canada, and Australia often use this leaderboard to find the optimal balance—selecting a dedicated reasoning model for complex tasks and routing simpler queries to a cheaper model.