Interactive leaderboard

Best AI Models for Math 2026: Quantitative LLM Workloads

Compare top math LLMs in 2026 with GSM8K and MATH benchmarks alongside live API pricing. For finance, STEM, and analytics teams in the US, Canada, and Australia.

Math-focused LLM rankings with API spend context in 2026

Math-heavy workloads punish silent errors. This tab emphasizes quantitative benchmarks while surfacing estimated API cost so data and engineering groups in the United States, Canada, and Australia can pair accuracy targets with budget reality—before they commit to a model for spreadsheets, tutoring copilots, or internal calculators.

Workload & pricing toggles

Workload presets

Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.

Include Vision / Image Processing

Off — no image fees are included in cost estimates for vision-capable models. Turn it On to include image fees.


Use Cached Pricing

Enable for 50% off input tokens where cached rates apply.


Deep Reasoning / Thinking Mode

When enabled, the model's hidden reasoning / extended thinking is charged like output tokens.


Batch Pricing

Enable for 50% off input & output tokens where batch/async pricing applies.

[Workload slider controls with live estimates: 8K (range 1K–1.0M) ≈ $100.00/mo · 2K (range 100–500K) ≈ $100.00/mo · 5K (range 10–100K) ≈ $200.00 total]

Cached / batch est. monthly values only change after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.
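
As a rough sketch of how these toggles compose into an estimate (the field names, discount stacking, and rates below are illustrative assumptions, not the live LeadsCalc pipeline or Supabase schema):

```typescript
// Illustrative-only sketch: how the cached, batch, and thinking toggles
// could combine into a monthly estimate. Field names and the discount
// stacking are assumptions, not the production schema.
interface ModelRates {
  inputPerM: number;        // USD per 1M input tokens
  outputPerM: number;       // USD per 1M output tokens
  supportsCaching: boolean; // mirrors supports_caching
  supportsBatch: boolean;   // mirrors supports_batch
}

interface Workload {
  requestsPerMonth: number;
  inputTokens: number;    // avg input tokens per request
  outputTokens: number;   // avg output tokens per request
  thinkingTokens: number; // hidden reasoning tokens, billed like output
}

interface Toggles {
  cached: boolean;   // 50% off input where cached rates apply
  batch: boolean;    // 50% off input & output where batch pricing applies
  thinking: boolean; // add reasoning tokens at the output rate
}

function estimateMonthly(m: ModelRates, w: Workload, t: Toggles): number {
  let inputRate = m.inputPerM;
  let outputRate = m.outputPerM;
  if (t.cached && m.supportsCaching) inputRate *= 0.5;
  if (t.batch && m.supportsBatch) {
    inputRate *= 0.5; // assumes discounts stack; verify per provider
    outputRate *= 0.5;
  }
  const outTokens = w.outputTokens + (t.thinking ? w.thinkingTokens : 0);
  const perRequest =
    (w.inputTokens / 1e6) * inputRate + (outTokens / 1e6) * outputRate;
  return perRequest * w.requestsPerMonth;
}
```

Whether cached and batch discounts actually stack varies by provider; check the rate card before budgeting around both.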

Magic quadrant (top 15)

X: est. monthly · Y: Math · Dot: provider color · Hover for rank, model & details

Full leaderboard

Showing 48 of 327 models.

| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: gpt-oss-20b | $2.60 | 84 | 96 | 97 | 85 | 98 | 131K | 97 |
| NVIDIA: Nemotron Nano 9B V2 | $3.20 | 75 | 72 | 84 | 85 | 98 | 131K | 85 |
| DeepSeek: DeepSeek V4 Pro | $104.40 | 56 | 67 | 80 | 55 | 97 | 1.0M | 81 |
| NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | $8.00 | 66 | 74 | 79 | 70 | 97 | 131K | 82 |
| Google: Gemma 4 31B | $9.00 | 72 | 97 | 92 | 70 | 97 | 262K | 94 |
| Claude Sonnet 4.6 | $270.00 | 56 | 93 | 75 | 70 | 96 | 1.0M | 85 |
| Z.ai: GLM 4.7 | $32.60 | 64 | 94 | 87 | 85 | 96 | 203K | 91 |
| Grok 3 | $270.00 | 61 | 93 | 93 | 55 | 96 | 131K | 94 |
| Qwen: Qwen3.5-35B-A3B | $19.50 | 64 | 76 | 89 | 70 | 95 | 262K | 87 |
| Xiaomi: MiMo-V2.5-Pro | $70.00 | 57 | 78 | 78 | 70 | 95 | 1.0M | 82 |
| Meta: Llama 3.1 70B Instruct | $20.00 | 28 | 0 | 0 | 55 | 95 | 131K | 24 |
| Anthropic: Claude Opus 4 | $1,350.00 | 55 | 85 | 81 | 55 | 95 | 200K | 86 |
| Claude Opus 4.6 | $450.00 | 56 | 85 | 79 | 55 | 95 | 1.0M | 85 |
| Anthropic: Claude Opus 4.5 | $450.00 | 57 | 85 | 84 | 55 | 95 | 200K | 87 |
| Tencent: Hunyuan A13B Instruct | $11.30 | 64 | 64 | 86 | 55 | 94 | 131K | 83 |
| Qwen: Qwen3.5 397B A17B | $39.00 | 62 | 85 | 89 | 60 | 92 | 262K | 89 |
| MoonshotAI: Kimi K2 Thinking | $49.00 | 56 | 65 | 79 | 55 | 92 | 262K | 79 |
| Qwen: Qwen3.5-27B | $23.40 | 64 | 80 | 91 | 70 | 92 | 262K | 88 |
| Qwen: Qwen3 32B | $5.60 | 73 | 85 | 89 | 60 | 92 | 41K | 89 |
| OpenAI: GPT-5.1-Codex-Mini | $30.00 | 61 | 84 | 82 | 85 | 92 | 400K | 85 |
| Mistral: Mistral Medium 3 | $36.00 | 63 | 92 | 87 | 70 | 91 | 131K | 89 |
| NVIDIA: Nemotron 3 Nano 30B A3B | $4.00 | 71 | 74 | 79 | 85 | 91 | 262K | 81 |
| Z.ai: GLM 4.5V | $42.00 | 55 | 64 | 77 | 60 | 90 | 66K | 77 |
| Qwen: Qwen3 Max | $70.20 | 61 | 93 | 87 | 70 | 89 | 262K | 89 |
| EssentialAI: Rnj 1 Instruct | $7.50 | 66 | 75 | 81 | 85 | 89 | 33K | 81 |
| AllenAI: Olmo 3 32B Think | $11.00 | 67 | 90 | 84 | 50 | 88 | 66K | 87 |
| Elephant | Free | 78 | 90 | 83 | 70 | 88 | 262K | 86 |
| Nous: Hermes 4 70B | $9.20 | 66 | 85 | 81 | 60 | 88 | 131K | 84 |
| AionLabs: Aion-1.0-Mini | $42.00 | 61 | 85 | 85 | 95 | 88 | 131K | 86 |
| Baidu: ERNIE 4.5 21B A3B | $5.60 | 72 | 85 | 89 | 60 | 87 | 120K | 88 |
| xAI: Grok 4 | $270.00 | 55 | 87 | 80 | 70 | 87 | 256K | 83 |
| Qwen: Qwen3 VL 32B Instruct | $8.32 | 69 | 88 | 88 | 65 | 87 | 131K | 88 |
| Auto Router | VARIABLE | 77 | 84 | 83 | 70 | 86 | 2.0M | 84 |
| Qwen: Qwen3 Coder Next | $13.60 | 62 | 93 | 73 | 65 | 85 | 262K | 81 |
| Z.ai: GLM 5.1 | $77.00 | 61 | 92 | 89 | 70 | 85 | 203K | 89 |
| Z.ai: GLM 4.6V | $21.00 | 53 | 41 | 75 | 70 | 85 | 131K | 69 |
| Prime Intellect: INTELLECT-3 | $19.00 | 60 | 77 | 79 | 65 | 85 | 131K | 80 |
| Qwen: Qwen3.5-122B-A10B | $31.20 | 62 | 81 | 90 | 55 | 85 | 262K | 87 |
| Meta: Llama 3.1 8B Instruct | $1.30 | 59 | 0 | 34 | 85 | 85 | 16K | 38 |
| NVIDIA: Nemotron 3 Super | $8.10 | 65 | 79 | 79 | 55 | 85 | 262K | 80 |
| Z.ai: GLM 5 | $44.80 | 64 | 99 | 93 | 55 | 84 | 203K | 92 |
| Qwen: Qwen3.5-9B | $5.50 | 68 | 65 | 87 | 70 | 83 | 262K | 80 |
| Qwen: Qwen3 235B A22B Thinking 2507 | $20.93 | 62 | 74 | 89 | 50 | 83 | 131K | 84 |
| DeepSeek: DeepSeek V3.2 | $13.86 | 61 | 67 | 84 | 55 | 82 | 131K | 79 |
| Nous: Hermes 4 405B | $70.00 | 56 | 85 | 79 | 70 | 82 | 131K | 81 |
| Qwen: Qwen3 30B A3B Thinking 2507 | $7.20 | 66 | 85 | 79 | 50 | 82 | 131K | 81 |
| Z.ai: GLM 4.5 | $46.00 | 57 | 85 | 79 | 65 | 82 | 131K | 81 |
| Qwen: Qwen3 30B A3B | $6.00 | 67 | 85 | 79 | 60 | 82 | 41K | 81 |

Need a shareable artifact?

Download a print-ready PDF from the leaderboard and workload above. No email step—lead capture is off.

Detailed analysis

PDF Breakdown

Receive a comprehensive native vector PDF of this leaderboard: your workload, filters, top rankings, and a table snapshot (sorted: Math).

Instant setup · No CC required


Agency accelerator

Whitelabel Math Leaderboard for your site

Embed the interactive math view on your own domain — whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.

1-Click CRM sync
Custom branding
Branded reports
Lead analytics

Free to start

$0/mo*
GET STARTED

NO CREDIT CARD REQUIRED

How it works

Methodology: How we rank Math LLMs

Transparent, benchmark-driven rankings—same craft as our single-model deep dives.

Quantitative benchmarks that drive the math axis

Our math rankings isolate performance on rigorous quantitative benchmarks like GSM8K (grade school math) and MATH (competition-level mathematics). We normalize these scores to a 0–100 axis, helping FinTech, EdTech, and data engineering teams identify models that can reliably execute complex calculations, write accurate SQL queries, and parse financial tables without silent hallucinations.
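
The normalization itself can be as simple as min-max scaling; the sketch below shows the general technique (the exact peer set and scaling our pipeline uses are not reproduced here):

```typescript
// Min-max normalization of raw benchmark accuracy onto a 0–100 axis.
// A sketch of the general technique, not the exact production scaling.
function normalizeTo100(peerScores: number[], value: number): number {
  const min = Math.min(...peerScores);
  const max = Math.max(...peerScores);
  if (max === min) return 50; // degenerate case: every model ties
  return ((value - min) / (max - min)) * 100;
}

// e.g. 0.91 raw GSM8K accuracy among peers spanning 0.60–0.95:
console.log(normalizeTo100([0.6, 0.72, 0.91, 0.95], 0.91).toFixed(1)); // "88.6"
```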

Battle Arena

Compare up to four LLMs side by side

Tick up to four models in the leaderboard table, then open Battle Arena for API pricing, benchmarks, and workload math in one view—perfect when you are shortlisting vendors for a pilot in the US, Canada, or Australia.

Prefer a head start? Jump into high-intent comparisons people search for every day—same interactive calculator, zero signup.

Open Battle Arena · Up to 4 models · Live estimates

Signals & spend

Value analysis

Benchmarks vs. estimated API cost—read the story your CFO cares about.

Using math scores without over-trusting a single number

Benchmarks orient you; they do not replace domain validation. For financial or regulated use, run your own golden tests and compliance review. Teams in Australia and Canada often add privacy constraints; US teams may require audit trails regardless of leaderboard position.
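
A golden-test harness can be very small. The sketch below assumes a generic complete(prompt) wrapper around whichever model API you are evaluating; the test cases and tolerance are illustrative only:

```typescript
// Minimal golden-test harness for numeric answers. `complete` is a
// placeholder for your own API wrapper; cases and tolerance are examples.
interface GoldenCase {
  prompt: string;
  expected: number;
  tolerance?: number;
}

const cases: GoldenCase[] = [
  {
    prompt: "What is 17.5% of 2,480? Answer with the number only.",
    expected: 434,
  },
  {
    prompt:
      "A $12,000 loan at 6% simple interest for 3 years accrues how much interest? Number only.",
    expected: 2160,
  },
];

async function runGoldenTests(
  complete: (prompt: string) => Promise<string>
): Promise<void> {
  for (const c of cases) {
    const raw = await complete(c.prompt);
    // Strip currency symbols, commas, and stray words before parsing.
    const got = parseFloat(raw.replace(/[^0-9.+-]/g, ""));
    const pass = Math.abs(got - c.expected) <= (c.tolerance ?? 1e-6);
    console.log(`${pass ? "PASS" : "FAIL"}: expected ${c.expected}, got ${got}`);
  }
}
```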

Production deployment

Quantitative Analysis & EdTech

How teams in the US, Canada, and Australia deploy these models in production.

Financial modeling, STEM tutoring, and data analytics

High math scores correlate strongly with a model's ability to write accurate SQL, parse complex financial tables, and explain STEM concepts step-by-step. FinTech companies in the US and EdTech platforms in Australia rely on these quantitatively rigorous models to power spreadsheet copilots, automated accounting audits, and personalized math tutoring applications.

Architecture

Optimizing Quantitative Accuracy

Strategies to reduce monthly API spend without sacrificing capability.

Zero-shot vs few-shot cost trade-offs

While frontier reasoning models excel at zero-shot math, they are expensive. A highly optimized architecture often uses a cheaper model provided with extensive few-shot examples and a strict system prompt. Use this leaderboard to compare the cost of a premium model against a cheaper model that might require 2,000 extra input tokens of few-shot prompting to achieve the same accuracy.
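
To make that trade-off concrete, here is a hypothetical side-by-side with placeholder per-million-token rates; substitute live rates from the leaderboard above:

```typescript
// Hypothetical cost comparison: premium zero-shot model vs. a cheaper
// model padded with ~2,000 extra few-shot input tokens per request.
// All rates are made-up placeholders, not quotes for any real model.
const REQUESTS = 10_000;  // requests per month
const BASE_INPUT = 1_000; // input tokens per request
const OUTPUT = 500;       // output tokens per request

function monthlyCost(inPerM: number, outPerM: number, extraInput = 0): number {
  const perRequest =
    ((BASE_INPUT + extraInput) / 1e6) * inPerM + (OUTPUT / 1e6) * outPerM;
  return perRequest * REQUESTS;
}

const premium = monthlyCost(5.0, 15.0);      // zero-shot frontier pricing
const budget = monthlyCost(0.5, 1.5, 2_000); // few-shot, cheaper rates

console.log(premium.toFixed(2)); // "125.00"
console.log(budget.toFixed(2));  // "22.50"
```

Even with triple the input tokens, the cheaper model wins on spend in this illustration; whether it matches accuracy is exactly what your own golden tests should verify.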

Embed-ready

Need this live Math data on your website?

Join 500+ agencies in the US and Australia using LeadsCalc to capture high-intent leads. Embed this interactive Math leaderboard on your site in about a minute—Canadian teams use the same flows for CAD-priced proposals and compliance-friendly landing pages.

Customize & Embed this Tool · White-label · No code required
United States · Canada · Australia
Live preview

Your visitors compare Math models without leaving your domain.

Support & clarity

Frequently Asked Questions

Focused on teams in the United States, Canada, and Australia.

Do high math scores translate to structured data tasks like SQL generation?

Models that score highly on the MATH benchmark generally excel at structured data tasks like SQL generation and financial analysis. However, for enterprise deployments in the US, Canada, or Australia, you should pair these models with strict system prompts and few-shot examples to guarantee consistent output formatting, as in the sketch below.
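
One illustrative way to set that up, using the common chat-completions message shape (the table schema, example query, and wording are hypothetical; adapt them to your provider's SDK):

```typescript
// Hypothetical prompt scaffold: a strict system prompt plus one few-shot
// exchange to pin down SQL output formatting. Adapt the message shape
// to your provider's SDK; the schema and query are illustrative.
const messages = [
  {
    role: "system",
    content:
      "You are a SQL generator. Reply with exactly one valid PostgreSQL " +
      "query and nothing else: no prose, no markdown, no explanations.",
  },
  {
    role: "user",
    content:
      "Monthly revenue by region for 2025 from orders(amount, region, created_at).",
  },
  {
    role: "assistant",
    content:
      "SELECT region, date_trunc('month', created_at) AS month, SUM(amount) AS revenue " +
      "FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01' " +
      "GROUP BY region, month ORDER BY region, month;",
  },
  // The real request goes here as the next user turn.
];
```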