What changed in 2025
In late 2024, OpenAI released o1. Throughout 2025 we got o3, DeepSeek R1, Claude's Extended Thinking, Gemini Deep Think, and more. The marketing framing was that these models "think before they answer" and produce dramatically better results on hard problems.
The framing is partly true, and the pricing has shifted substantially. OpenAI cut o3's price by 80% in June 2025. Claude's Extended Thinking has no separate price tier — thinking tokens are billed as output tokens, with an effort knob. GPT-5 with thinking uses 50-80% fewer output tokens than o3, per the OpenAI announcement. The economics of "when to use reasoning" are no longer what they were in late 2024.
This post is about how to think about reasoning-mode usage in 2026, with verified pricing and benchmark data.
Actual current pricing
Here is what the frontier reasoning models actually cost, as of April 2026.
OpenAI o3: $2.00 per million input tokens, $8.00 per million output tokens. This is after the 80% price cut OpenAI announced in June 2025. A Flex Mode variant costs $5/$20. Context window 200K.
OpenAI o4-mini: $1.10/$4.40 per million tokens. 200K context. The default "reasoning at a reasonable price" option.
Claude Sonnet 4.6: Base pricing $3/$15 per million tokens. Extended Thinking does not add a separate fee — thinking tokens are billed at the same output rate, and Anthropic offers an effort parameter to control how much the model thinks. Caching reduces cache reads by about 90%, and the Batch API halves the price (Anthropic pricing docs).
Claude Opus 4.7: More expensive base pricing (roughly 5x Sonnet in the typical configuration), with the same Extended Thinking mechanism. Only worth it if you have a task that specifically benefits from Opus-tier capability.
GPT-5 with thinking: The OpenAI launch post claims 50-80% fewer output tokens than o3 on comparable tasks, with reported ~6x fewer hallucinations. GPT-5 standard pricing (per OpenAI's API pricing page) is lower than o3, meaning it is often cheaper per correct answer than o3 despite being a newer, generally-stronger model.
Gemini Deep Think: Available through Google AI Ultra subscription at $124.99/month for consumers (US, English only). Not a standalone API product at retail scale. Gemini 3.1 Pro Deep Think was announced as "coming soon".
DeepSeek R1: $0.55 per million input, $2.19 per million output on the official DeepSeek API. 64K context. Released January 20, 2025 and still available in 2026 through DeepSeek's own API and via OpenRouter. By far the cheapest frontier reasoning model, though with a meaningful performance gap to GPT-5 and Claude on hard benchmarks.
Reading across these: the price per reasoning token varies by well over an order of magnitude between DeepSeek R1 and Claude Opus (over 30x on output rates alone). That is a huge range, and it maps to real capability differences.
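To make the spread concrete, here is the gap on output rates alone, using the numbers quoted above. The Opus 4.7 figure is an assumption (roughly 5x Sonnet's $15/M, as described above), not a published price.

```python
# Output-token prices across the models listed above, USD per million tokens.
# Opus 4.7's rate is an ASSUMPTION (~5x Sonnet), not a published number.
output_price_per_m = {
    "deepseek-r1": 2.19,
    "o4-mini": 4.40,
    "o3": 8.00,
    "claude-sonnet-4.6": 15.00,
    "claude-opus-4.7": 75.00,  # assumed: roughly 5x Sonnet's output rate
}

cheapest = min(output_price_per_m, key=output_price_per_m.get)
priciest = max(output_price_per_m, key=output_price_per_m.get)
spread = output_price_per_m[priciest] / output_price_per_m[cheapest]
print(f"{priciest} vs {cheapest}: {spread:.0f}x")  # roughly 34x under these assumptions
```

Note this is output rates only; factoring in input-token prices and typical thinking-token volumes widens or narrows the gap depending on workload shape.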
Where reasoning models earn their money
There is a reasonably clean set of categories where the math works.
Formal problems with verifiable answers. Math, competitive programming, logic puzzles, physics. On AIME 2025, GPT-5 scored 94.6% vs o3's 88.9%. On SWE-bench Verified, GPT-5 scored 74.9% vs o3's 52.8% (OpenAI GPT-5 launch post; Glean coverage). If your task has a checkable answer, a reasoning model is worth the extra cost.
Long-range code refactors with tight constraints. Multi-file refactors that have to satisfy several invariants simultaneously. The reasoning traces help the model check its work in ways that matter for correctness. Claude Extended Thinking handles this particularly well in my experience, at a meaningfully lower price than o3.
Complex query optimisation and data modelling. Schema design with multiple trade-offs, query optimisation with constraints. Multi-constraint optimisation is where reasoning mode earns its keep.
Research and document synthesis. Deep-research modes in ChatGPT, Claude, Perplexity, and Gemini are all reasoning-heavy. The output is a structured analysis, and the extra thinking time correlates with quality.
Adversarial correctness. Legal review, security audit, financial calculations. Not because the base model cannot do these, but because reasoning mode will flag ambiguities and corner cases the base model skips.
Where reasoning models underperform or waste money
Creative writing. Reasoning models tend to produce more cautious, more structured, less lively output on creative tasks. I do not use reasoning mode for any writing task where voice matters.
Customer-facing chat. Latency matters more than marginal quality for real-time conversation. A reasoning model at 10-70 second latency is a bad experience for a support bot. Stick with faster base models for synchronous chat.
Simple lookups and transformations. "Translate this", "extract dates", "summarise". The base model saturates these. Reasoning mode burns tokens to reach the same answer.
Open-ended brainstorming. Counterintuitively, reasoning models often converge on one or two answers and justify them at length, rather than exploring widely. Base models are more divergent.
Domains where the model is likely wrong about its own reasoning. This is the dangerous case. A reasoning model can produce a confident-looking wrong answer with an elaborate justification, and the justification reads as more reliable than a base-model guess would. I have seen this in medical questions and certain financial analyses. A confidently-reasoned wrong answer is harder to catch than an obviously-guessed wrong answer.
The GPT-5 wrinkle
The OpenAI GPT-5 launch changed the reasoning-model math in a specific way that most people have not absorbed.
GPT-5 with thinking is not priced as "reasoning surcharge on top of the base model". It is just GPT-5, and the decision is whether to turn thinking on for a given call. When thinking is on, GPT-5 uses 50-80% fewer output tokens than o3 does on comparable tasks, meaning the cost per correct answer is often lower, not higher, than using o3. On AIME, on SWE-bench, on many reasoning-heavy benchmarks, GPT-5 dominates o3.
What this means practically: for a lot of tasks where you would have reached for o3 in 2025, GPT-5 with thinking is now the better choice on both quality and cost. o3's 200K context window still matters for certain long-context tasks (GPT-5's oft-quoted 128K figure is an output-token cap, not a context window, so the two numbers are not directly comparable), but as a default, GPT-5-thinking has largely replaced o3 for me.
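The "cheaper per correct answer" claim is simple arithmetic. A sketch, where o3's $8/M output rate and the SWE-bench Verified accuracies (52.8% vs 74.9%) come from the figures above, but GPT-5's $10/M output rate, the per-task token counts, and the 65% token reduction (mid of the quoted 50-80% range) are illustrative assumptions, not measured numbers:

```python
# Expected dollars spent per CORRECT answer, ignoring input tokens for simplicity.
def cost_per_correct(output_price_per_m, output_tokens_per_task, accuracy):
    cost_per_task = (output_tokens_per_task / 1e6) * output_price_per_m
    return cost_per_task / accuracy

# o3: $8/M output, assume 10K reasoning + answer tokens, 52.8% on SWE-bench Verified.
o3_cost = cost_per_correct(8.00, 10_000, 0.528)

# GPT-5-thinking: ASSUMED $10/M output rate, 65% fewer tokens, 74.9% accuracy.
gpt5_cost = cost_per_correct(10.00, 3_500, 0.749)
```

Under these assumptions GPT-5 comes out roughly 3x cheaper per correct answer even at a higher per-token rate, because fewer tokens and higher accuracy both compound in its favour.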
The same principle applies to Claude: Sonnet 4.6 with Extended Thinking is almost always more cost-effective than Opus 4.7 for tasks Sonnet can do well, because Sonnet bills thinking tokens at its standard output rate rather than at a premium.
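Because thinking tokens bill at the same output rate, the cost of a Sonnet Extended Thinking call is flat arithmetic. A minimal sketch, assuming the $3/$15 rates quoted earlier; the token counts are made up for illustration:

```python
# Claude Sonnet 4.6 rates quoted earlier in this post, USD per million tokens.
INPUT_PER_M = 3.00
OUTPUT_PER_M = 15.00  # thinking tokens are billed at this same rate

def call_cost(input_tokens, thinking_tokens, visible_output_tokens):
    """Thinking tokens and visible output tokens share one output rate."""
    billed_output = thinking_tokens + visible_output_tokens
    return (input_tokens / 1e6) * INPUT_PER_M + (billed_output / 1e6) * OUTPUT_PER_M

# e.g. a 4K-token prompt that thinks for 8K tokens and answers in 1K (made-up sizes):
cost = call_cost(4_000, 8_000, 1_000)
```

The practical consequence: the only lever on the bill is how many thinking tokens you allow, which is exactly what the effort parameter controls.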
A practical decision rule
For any given task, the quick rule I apply:
Use reasoning mode when:
- The answer is verifiable (math, code with tests, legal claim check), and you care more about correctness than speed.
- The task has more than two or three reasoning steps that the base model might miss.
- The cost of a wrong answer is substantially higher than the extra $0.30-1.00 the reasoning call will cost.
Skip reasoning mode when:
- The task is creative or voice-sensitive.
- Latency matters (customer chat, synchronous tool use).
- The task is simple enough that the base model already gets it right reliably.
- You are brainstorming and want diverse outputs.
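The two lists above can be collapsed into a toy predicate. The argument names and the 10x wrong-answer threshold are my own framing of the rule, not anything from a vendor API:

```python
# Toy encoding of the decision rule above. Skip conditions take priority;
# the 10x cost multiplier is an ASSUMED threshold for "substantially higher".
def use_reasoning(verifiable, reasoning_steps, wrong_answer_cost_usd,
                  voice_sensitive, latency_sensitive, brainstorming,
                  reasoning_call_cost_usd=1.00):
    if voice_sensitive or latency_sensitive or brainstorming:
        return False
    return bool(verifiable
                or reasoning_steps > 3
                or wrong_answer_cost_usd > 10 * reasoning_call_cost_usd)

# e.g. a multi-file refactor gated by a test suite: reasoning mode is worth it.
use_reasoning(verifiable=True, reasoning_steps=5, wrong_answer_cost_usd=50,
              voice_sensitive=False, latency_sensitive=False, brainstorming=False)
```

The point of writing it down is not to automate the choice but to notice that the skip conditions are about the task's shape, while the use conditions are about the cost of being wrong.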
In practice, the shift for me over the last year has been: less reflexive reasoning-mode usage, more deliberate use of GPT-5-thinking or Claude Extended Thinking for specific task categories, and more willingness to drop to DeepSeek R1 when I want to use reasoning on a high-volume task where the price gap matters.
Monthly numbers
My approximate cost over Q1 2026 for a solo-dev workload mixing all of these:
- GPT-5 calls (with and without thinking): about $40/month on average.
- Claude Sonnet 4.6 via API (mostly with Extended Thinking): about $70/month.
- o3/o4-mini for specific reasoning-heavy tasks: about $15/month.
- DeepSeek R1 for batch/volume tasks where price matters: about $8/month.
Total: around $133/month at my usage, versus what would have been double that in 2024 before the price cuts and the GPT-5 efficiency improvement. The productivity I am getting has not gone down — it has gone up because I am using better models, selectively.
Further reading
- Hidden Costs of Credit-Based AI Pricing on the related pricing-structure question.
- Context Window vs Memory for why long-horizon reasoning remains a failure mode.
- Local LLMs in 2026 for when self-hosted reasoning (DeepSeek R1 locally) makes sense.
Sources
- OpenAI GPT-5 launch: https://openai.com/index/introducing-gpt-5/
- OpenAI API pricing: https://openai.com/api/pricing/
- o3 80% price cut: https://community.openai.com/t/o3-is-80-cheaper-and-introducing-o3-pro/1284925
- o4-mini pricing: https://pricepertoken.com/pricing-page/model/openai-o4-mini
- Anthropic pricing docs: https://platform.claude.com/docs/en/about-claude/pricing
- Claude Sonnet 4.6 pricing details: https://apidog.com/blog/claude-sonnet-4-6-pricing/
- Gemini subscriptions: https://gemini.google/subscriptions/
- Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing
- DeepSeek R1 pricing: https://api-docs.deepseek.com/quick_start/pricing
- DeepSeek R1 on OpenRouter: https://openrouter.ai/deepseek/deepseek-r1
- GPT-5 vs o3 analysis (Glean): https://www.glean.com/blog/open-ai-gpt-5
