Best AI Models for Math 2026: Quantitative LLM Workloads
Compare top math LLMs in 2026 with GSM8K and MATH benchmarks alongside live API pricing. For finance, STEM, and analytics teams in the US, Canada, and Australia.
Math-focused LLM rankings with API spend context in 2026
Math-heavy workloads punish silent errors. This tab emphasizes quantitative benchmarks while surfacing estimated API cost so data and engineering groups in the United States, Canada, and Australia can pair accuracy targets with budget reality—before they commit to a model for spreadsheets, tutoring copilots, or internal calculators.
Workload & pricing toggles
Same three scenarios as the main AI API calculator: moderate traffic, large RAG-style context, or per-request max tokens with a lower request count.
Include Vision / Image Processing
Off: no image fees are included in cost estimates for vision-capable models. Turn on to include image fees.
Use Cached Pricing
Enable for 50% off input tokens where cached rates apply.
Deep Reasoning / Thinking Mode
When enabled, the model's hidden reasoning (extended thinking) tokens are charged like output tokens.
Batch Pricing
Enable for 50% off input and output tokens where batch/async pricing applies.
Cached and batch est. monthly values change only after the pipeline sets supports_caching or supports_batch in Supabase. The toggles here narrow the table to models whose catalog or provider typically supports those modes.
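The toggle rules above can be sketched as a small cost function. This is a minimal sketch, not the calculator's actual formula: the `Model` class, the `estimate_monthly_cost` function, its parameters, and the way the cached and batch discounts stack (multiplicatively) are all assumptions. Only the 50% cached-input discount, the 50% batch discount, the billing of reasoning tokens as output, and the `supports_caching` / `supports_batch` flag names come from the text. Image fees are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Model:
    # Prices are USD per 1M tokens (hypothetical fields, not a real catalog schema).
    input_price: float
    output_price: float
    supports_caching: bool = False  # flag names taken from the page text
    supports_batch: bool = False


def estimate_monthly_cost(model, requests, input_tokens, output_tokens,
                          reasoning_tokens=0, use_cached=False,
                          use_batch=False, thinking=False):
    """Rough monthly estimate in USD under the toggle rules described above."""
    input_rate = model.input_price
    output_rate = model.output_price

    # Cached pricing: 50% off input tokens, only where the model supports it.
    if use_cached and model.supports_caching:
        input_rate *= 0.5

    # Batch pricing: 50% off input and output, only where supported.
    if use_batch and model.supports_batch:
        input_rate *= 0.5
        output_rate *= 0.5

    # Thinking mode: hidden reasoning tokens are billed like output tokens.
    billed_output = output_tokens + (reasoning_tokens if thinking else 0)

    per_request = (input_tokens * input_rate
                   + billed_output * output_rate) / 1_000_000
    return requests * per_request
```

Because a discount applies only when the corresponding flag is set, a model without `supports_caching` returns the same estimate whether or not the cached toggle is on, which matches the note that estimates change only after the pipeline sets those flags.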
Magic quadrant (top 15)
X: est. monthly · Y: Math · Dot: provider color · Hover for rank, model & details
Full leaderboard
Showing 48 of 365 models.
| Model | Est. monthly | ROI score | Coding | Reasoning | Speed | Math | Context | Overall | Notes |
|---|---|---|---|---|---|---|---|---|---|
| OpenAI: o3 Pro | $1,600.00 | 60 | 98 | 92 | 40 | 99 | 200K | 95 | SWE-bench Verified at 69.1% maps to 98 coding. GPQA Diamond at 83.3% maps to 98 logic. AIME 2024 at 91.6% maps to 99 math. As a flagship reasoning model, speed is lower (40). |
| xAI: Grok 3 Beta | $270.00 | 61 | 92 | 93 | 50 | 98 | 131K | 94 | GPQA Diamond 84.6% and AIME 93.3% (Think mode) map to near-max logic/math. LiveCodeBench 79.4% maps to high coding. MMMU 78% confirms strong vision. Flagship tier; speed inferred moderate due to reasoning. |
| Z.ai: GLM 4.5 | $46.00 | 64 | 90 | 90 | 60 | 98 | 131K | 92 | SWE-bench Verified 64.2% maps to 90 coding. GPQA 79.1% maps to 92 logic. MATH-500 98.2% maps to 98 math. Flagship MoE model with native thinking mode; speed mapped to 60 based on 41.5 tokens/s. |
| OpenAI: o1-pro | $12,000.00 | 57 | 85 | 93 | 40 | 98 | 200K | 92 | GPQA 78.0% maps to 95 logic. SWE-bench Verified 48.9% maps to 85 coding. GSM8K 97.1% maps to 98 math. MMMU 77.6% maps to 85 multimodal. Speed is 40 due to heavy reasoning architecture. |
| DeepSeek R1 | $53.00 | 63 | 88 | 89 | 40 | 98 | 164K | 91 | MMLU at 90.8% maps Logic to 92. MATH at 97.3% maps Math to 98. Coding inferred at 88 for o1-class flagship. Speed set to 40 due to 2-4x slower reasoning token generation. No vision supported. |
| Arcee AI: Trinity Large Thinking | $17.30 | 69 | 95 | 94 | 45 | 98 | 262K | 95 | SWE-bench Verified at 63.2% maps to 95 coding. GPQA-Diamond at 76.3% maps to 95 logic. AIME 2025 at 96.3% maps to 98 math. As a heavy reasoning model (398B MoE), speed is mapped to 45. |
| DeepSeek: R1 0528 | $41.50 | 66 | 92 | 94 | 45 | 98 | 164K | 95 | GPQA 81.0% maps to 98 logic. AIME 2024 91.4% maps to 98 math. LiveCodeBench 73.3% maps to 92 coding. Speed reflects 287 c/s but heavy reasoning overhead (23K thinking tokens). |
| MoonshotAI: Kimi K2.5 | $35.00 | 66 | 95 | 93 | 65 | 98 | 262K | 95 | SWE-bench Verified 76.8% maps to 95 coding. GPQA Diamond 87.9% maps to 96 logic. AIME 2025 96.1% maps to 98 math. 1T MoE flagship tier; native reasoning supported. |
| OpenAI: o3 Mini High | $88.00 | 60 | 85 | 87 | 65 | 98 | 200K | 89 | GPQA 79.7% (Logic 88), MATH 97.9% (Math 98), HumanEval 97% (Coding 85). As a Mini tier model with high reasoning effort, it achieves flagship-level STEM scores but operates slower than standard minis due to extended chain-of-thought. |
| OpenAI: o3 Mini | $88.00 | 64 | 95 | 93 | 65 | 98 | 200K | 95 | SWE-bench Verified (69.1%) maps to 95 coding. GPQA (83.3%) maps to 92 logic. Despite being a 'Mini' tier model, its native reasoning capabilities yield flagship-level STEM scores, though speed is balanced for chain-of-thought generation. |
| OpenAI: GPT-5.2 | $210.00 | 63 | 96 | 96 | 50 | 98 | 400K | 97 | SWE-bench Verified 68.1% maps to 96 coding. GPQA 81.4% maps to 97 logic. MMMU 81.6% maps to 95 multimodal. Frontier reasoning model; speed estimated at 50. |
| OpenAI: o1 | $1,200.00 | 59 | 88 | 93 | 40 | 98 | 200K | 93 | SWE-bench Verified 48.9% maps to 88 coding. GPQA 78% maps to 96 logic. MMMU 77.6% maps to 85 multimodal. Speed is 40 due to extended reasoning times. Flagship tier. |
| OpenAI: o3 Deep Research | $800.00 | 61 | 98 | 92 | 35 | 98 | 200K | 95 | SWE-bench Verified 69.1% maps to 98 coding. GPQA Diamond 83.3% maps to 98 logic. MMMU 82.9% maps to 95 multimodal. Speed is 35 due to extended reasoning times. |
| OpenAI: o4 Mini High | $88.00 | 60 | 85 | 85 | 65 | 98 | 200K | 88 | Evidence cites AIME at 99.5% (Math: 98). SWE-bench outpaces o3-mini (Coding: 85). As a Mini-tier reasoning model, Logic (85) and Speed (65) reflect its high reasoning effort and fast architecture. Multimodal (80) supported by MMMU/MathVista claims. |
| o3 | $160.00 | 62 | 96 | 91 | 40 | 98 | 200K | 94 | SWE-bench Verified at 69.1% maps to 96 coding. GPQA Diamond at 87.7% maps to 97 logic. AIME 2024 at 96.7% maps to 98 math. Speed is 40 due to extended reasoning times. |
| Qwen: Qwen3 Next 80B A3B Thinking | $11.70 | 70 | 90 | 93 | 45 | 97 | 262K | 93 | GPQA at 77.2% maps to 96 logic. IFEval at 88.9% maps to 90 instruction. AIME 2025 at 87.8% maps to 97 math. As an 80B thinking model, speed is lower (45). No SWE-bench cited; coding estimated at 90. |
| Qwen: Qwen3.6 Max Preview | $104.00 | 64 | 96 | 95 | 55 | 97 | 262K | 96 | Based on Qwen3.6 Plus scoring SWE-bench 78.8% and GPQA 90.4%, Max (1T MoE flagship) maps to 96 for coding and logic. Native reasoning is supported via `<think>` tags. Vision defaults to frontier pricing. |
| Z.ai: GLM 5.1 | $70.00 | 64 | 98 | 93 | 65 | 97 | 203K | 95 | SWE-bench Pro top score (GLM-5 had 77.8, +3.3 gain) maps to 98 coding. GPQA Diamond 86.2% maps to 95 logic. AIME 95.3% maps to 97 math. Heavyweight MoE tier yields 65 speed. |
| Anthropic Claude Sonnet Latest | $270.00 | 62 | 95 | 94 | 80 | 96 | 1.0M | 95 | Claude 3.7 Sonnet scores SWE-bench Verified 72.7% (Coding: 95), GPQA Diamond 84.8% (Logic: 95), MATH 96.2% (Math: 96), IFEval 93.2% (Instruction: 93), MMMU 75% (Multimodal: 75). Mid-tier speed (~80). Features native extended thinking. |
| Perplexity: Sonar Reasoning Pro | $160.00 | 59 | 85 | 87 | 45 | 96 | 128K | 89 | Evidence cites GPQA Diamond at 62.3% (Logic 88) and MATH-500 at 92.1% (Math 96). MMLU is 83.0%. As an R1-based reasoning model, speed is lower (45), while coding is inferred high (85) from its flagship tier. |
| Z.ai: GLM 4.7 | $33.50 | 65 | 90 | 90 | 65 | 96 | 203K | 92 | SWE-bench Verified 73.8 maps to 90 coding. GPQA Diamond 85.7 maps to 92 logic. IFEval 88.0 maps to 88 instruction. AIME 2025 95.7 maps to 96 math. Flagship tier speed estimated at 65. |
| Google: Gemini 2.5 Pro Preview 06-05 | $150.00 | 63 | 96 | 93 | 45 | 96 | 1.0M | 95 | SWE-bench (59.6%) maps to 96 coding. GPQA (86.4%) maps to 96 logic. AIME (88.0%) maps to 96 math. MMMU (82.0%) maps to 90 multimodal. Flagship tier model with native reasoning; speed adjusted for thinking overhead. |
| DeepSeek: DeepSeek V3.2 Speciale | $15.79 | 70 | 95 | 97 | 45 | 96 | 164K | 96 | GPQA 87.1% maps to 98 logic. LiveCodeBench 89.6% and Aider 88.0% map to 95 coding. HMMT 99.0% maps to 96 math. Flagship reasoning model with native thinking tokens; speed scored 45 due to extended reasoning overhead. |
| OpenAI: GPT-5.1-Codex | $150.00 | 61 | 88 | 92 | 60 | 96 | 400K | 92 | PricePerToken cites GPQA 86.0%, Math 95.7%, and Coding 36.6% (likely SWE-bench). Mapped GPQA to Logic 94, Coding to 88, and Math to 96. Speed reflects 27.4 tok/s throughput. |
| Qwen: Qwen3 Max Thinking | $70.20 | 63 | 90 | 93 | 45 | 96 | 262K | 93 | Evidence lacks exact Qwen3 Max Thinking scores but notes it's the flagship reasoning model. Inferred from Qwen2.5 72B (HumanEval 86.6%, GSM8K 95.8%) and VL 32B comparisons; mapped Coding to 90, Logic to 92. Speed reflects heavy reasoning tier. |
| Qwen: Qwen3.5-35B-A3B | $15.60 | 67 | 85 | 90 | 90 | 95 | 262K | 90 | SWE-bench (69.2%) maps to 85 coding; GPQA Diamond (84.2%) maps to 88 logic. As a 35B MoE activating 3B parameters (lightweight tier), it achieves high efficiency and speed (90) while maintaining strong native reasoning capabilities. |
| OpenAI: GPT-5 Pro | $1,800.00 | 59 | 92 | 93 | 77 | 95 | 400K | 93 | Evidence lacks exact GPT-5 Pro benchmark scores, so mapped from flagship reasoning tier (Pro/o1-class). Speed mapped from cited 77.4 tps. High coding/logic reflect its deep reasoning mode and 'most advanced model' status. |
| Qwen: Qwen3.5-122B-A10B | $31.20 | 67 | 95 | 94 | 45 | 95 | 262K | 95 | SWE-bench Verified at 72.0% maps to 95 coding. GPQA Diamond at 86.6% maps to 96 logic. IFEval 92% maps to 92 instruction. As a native reasoning model, speed is mapped to 45. |
| OpenAI: o4 Mini Deep Research | $160.00 | 59 | 85 | 87 | 85 | 95 | 200K | 88 | SWE-bench Verified (68.1%) and GPQA (81.4%) map to 85 Coding and 88 Logic. As a 'Mini' reasoning model, it achieves flagship-level scores via internal chain-of-thought while maintaining high speed (84 tok/s). |
| Qwen: Qwen3 30B A3B Thinking 2507 | $7.20 | 70 | 80 | 89 | 80 | 95 | 131K | 88 | GPQA Diamond 71.50 maps to logic 88. IFEval 90.09 maps to instruction 90. AIME 86.67 maps to math 95. Coding inferred at 80 (no SWE-bench). Speed 80 based on 75.5 tok/s throughput. |
| Xiaomi: MiMo-V2-Flash | $7.00 | 72 | 92 | 88 | 90 | 95 | 262K | 91 | SWE-bench Verified 73.4% maps to 92 coding. GPQA 83.7% maps to 90 logic. AIME 94.1% maps to 95 math. Despite the 'Flash' name, its 309B MoE architecture and native reasoning deliver flagship-level scores. |
| xAI: Grok 4 | $270.00 | 60 | 88 | 90 | 45 | 95 | 256K | 91 | Evidence lacks exact benchmarks. Inferred as flagship reasoning model based on 'Grok 4... reasoning model'. Estimated frontier-level logic (92) and math (95), with lower speed (45) typical of reasoning models. |
| Anthropic: Claude Opus 4.7 | $450.00 | 62 | 98 | 97 | 60 | 95 | 1.0M | 97 | SWE-bench Verified at 87.6% maps to 98 coding. GPQA Diamond at 94.2% maps to 98 logic. Heavyweight Opus tier with 53 tok/s yields 60 speed. Native reasoning supported via OpenRouter reasoning parameter. |
| Google: Gemini 3.1 Pro Preview Custom Tools | $200.00 | 63 | 98 | 97 | 65 | 95 | 1.0M | 97 | SWE-bench Verified at 80.6% maps to 98 coding. GPQA Diamond at 94.3% maps to 98 logic. As a flagship Pro model, it receives high multimodal (95) and math (95) scores, with standard heavyweight speed (65). |
| OpenAI: GPT-5.4 Image 2 | $470.00 | 61 | 95 | 96 | 60 | 95 | 272K | 95 | GPT-5.4 scores 81.2% on MMMU-Pro (Multimodal ~90). As a frontier reasoning model with native computer-use (OSWorld 75%), Logic and Coding map to ~95. Speed is ~60 due to chain-of-thought overhead. |
| Qwen: Qwen3.5-Flash | $5.20 | 75 | 88 | 92 | 95 | 95 | 1.0M | 92 | SWE-bench Verified at 69.2% maps to 88 coding. GPQA Diamond at 84.2% maps to 92 logic. IFEval 91.9% maps to 92 instruction. As a Flash tier, speed is 95, though its reasoning capabilities rival flagship models. |
| Anthropic: Claude Opus 4.5 | $450.00 | 62 | 98 | 97 | 50 | 95 | 200K | 97 | SWE-bench Verified at 80.9% maps to 98 coding. GPQA Diamond at 87.0% maps to 98 logic. MMMU at 80.7% yields 95 multimodal. As a heavyweight reasoning model with an effort parameter, speed is rated lower at 50. |
| OpenAI: o4 Mini | $88.00 | 60 | 88 | 84 | 85 | 95 | 200K | 88 | SWE-bench Verified at 68.1% maps to 88 coding. GPQA at 81.4% maps to 88 logic. Despite being a Mini tier, its reasoning tokens drive flagship-level math (AIME 99.5%). Speed is high (140 tok/s) mapping to 85. |
| NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | $20.00 | 62 | 75 | 80 | 70 | 95 | 131K | 83 | GPQA 71.97% maps to Logic 80. MATH500 97.4% maps to Math 95. LiveCodeBench 73.58% maps to Coding 75. Speed 50.6 tok/s maps to 70. Mid-tier 49B model with explicit reasoning modes and vision support. |
| StepFun: Step 3.5 Flash | $6.60 | 70 | 88 | 83 | 95 | 95 | 262K | 87 | SWE-bench Verified 74.4% maps to 88 coding; AIME 99.8% maps to 95 math. Despite Flash tier, explicit evidence shows frontier-level SWE-bench Verified, elevating coding score. Speed is 95 (143 tok/s). |
| Google: Gemini 2.5 Pro Preview 05-06 | $150.00 | 62 | 95 | 93 | 45 | 95 | 1.0M | 94 | Evidence notes it outperforms Claude 3.5 Sonnet on SWE-bench Verified, GPQA, and MMMU, though exact percentages are omitted. Mapped to frontier-level 95s for coding, logic, and multimodal. Speed is reduced due to mandatory native thought reasoning. |
| NVIDIA: Nemotron 3 Super | $8.10 | 71 | 92 | 89 | 75 | 95 | 1.0M | 91 | SWE-bench at 60.47% maps to 92 coding. GPQA at 79.23% maps to 92 logic. AIME25 at 90.21% yields 95 math. High scores reflect its 120B frontier reasoning capabilities. |
| OpenAI: GPT-5 | $150.00 | 63 | 98 | 93 | 65 | 95 | 400K | 95 | SWE-bench Verified 74.9% maps to 98 coding. MMLU 92.5% maps to 95 logic. MMMU 84.2% maps to 95 multimodal. Flagship tier speed estimated at 65. |
| Anthropic: Claude Opus 4.6 (Fast) | $2,700.00 | 60 | 98 | 96 | 70 | 95 | 1.0M | 96 | SWE-bench Verified 82.1% maps to 98 coding. GPQA Diamond 88.5% maps to 96 logic. MATH 94.2% maps to 95 math. MMMLU 91.1% maps to 91 multimodal. Heavyweight tier with 60.5 tok/s yields 70 speed. |
| DeepSeek V3.2 | $12.58 | 68 | 92 | 89 | 45 | 95 | 131K | 91 | SWE-bench at 67.8% maps to 92 for coding. MMLU-Pro 85.0% maps to 90 for logic. AIME 2025 at 89.3% yields 95 for math. As a frontier reasoning model, speed is 45. |
| Gemma 4 31B | $8.40 | 69 | 78 | 89 | 45 | 95 | 262K | 88 | HumanEval 76.8% maps to coding 78. MMLU 87.1% maps to logic 85. IFEval 93.7% maps to instruction 92. GSM8k 97.6% maps to math 95. Speed 8.52 t/s maps to 45. Native reasoning confirmed via 'reasoning_details'. |
| MiniMax: MiniMax M1 | $38.00 | 62 | 88 | 85 | 65 | 95 | 1.0M | 88 | SWE-bench Verified (56.0%) maps to 88 coding. GPQA (70.0%) maps to 85 logic. MATH-500 (96.0%) maps to 95 math. Heavyweight 456B MoE model with native reasoning tokens. |
| Claude Opus 4.6 | $450.00 | 62 | 98 | 96 | 50 | 95 | 1.0M | 96 | SWE-bench Verified at 82.1% maps to 98 coding. GPQA Diamond at 91.3% maps to 96 logic. MATH at 94.2% maps to 95 math. As a flagship reasoning model, speed is set to 50. |
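One way to pair an accuracy target with a budget is to filter the leaderboard for the cheapest model that clears a Math threshold. The sketch below does this for a handful of rows copied from the table above; it is illustrative only, the function name `cheapest_meeting` and the thresholds are hypothetical, and the subset is not the full leaderboard.

```python
# (model, estimated monthly USD, Math score): a subset of rows from the table above.
ROWS = [
    ("OpenAI: o3 Pro", 1600.00, 99),
    ("xAI: Grok 3 Beta", 270.00, 98),
    ("Z.ai: GLM 4.5", 46.00, 98),
    ("Arcee AI: Trinity Large Thinking", 17.30, 98),
    ("MoonshotAI: Kimi K2.5", 35.00, 98),
    ("Qwen: Qwen3 Next 80B A3B Thinking", 11.70, 97),
    ("DeepSeek: DeepSeek V3.2 Speciale", 15.79, 96),
    ("Qwen: Qwen3.5-Flash", 5.20, 95),
    ("Xiaomi: MiMo-V2-Flash", 7.00, 95),
]


def cheapest_meeting(rows, min_math, budget=None):
    """Return rows with Math >= min_math (and cost <= budget), cheapest first."""
    eligible = [
        r for r in rows
        if r[2] >= min_math and (budget is None or r[1] <= budget)
    ]
    return sorted(eligible, key=lambda r: r[1])


if __name__ == "__main__":
    # Print the qualifying models for a Math target of 98, cheapest first.
    for name, cost, math_score in cheapest_meeting(ROWS, min_math=98):
        print(f"{name}: ${cost:,.2f}/mo, Math {math_score}")
```

The estimated monthly figures depend on the workload and pricing toggles selected, so a real comparison would rerun this filter against the values for your own scenario.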
AI ROI Leaderboard & Discovery by LeadsCalc
Whitelabel Math Leaderboard for your site
Embed the interactive math view on your own domain: whitelabel branding, lead capture, and the same workload sliders your prospects already use on LeadsCalc.