
Google's Gemini 2.5 Flash slashes AI costs by up to 600% with smart 'thinking budgets'


  • Google's Gemini 2.5 Flash introduces "thinking budgets" that let users control how much reasoning power is used
  • The model costs $0.60 per million output tokens with thinking turned off vs. $3.50 with full reasoning (600% difference)
  • Thinking budgets can be set from 0 to 24,576 tokens, acting as a maximum limit
  • 2.5 Flash outperforms Claude 3.7 Sonnet on key benchmarks while being more cost-effective
  • Available now in preview through Google AI Studio and Vertex AI





Introduction to Gemini 2.5 Flash

Google just launched Gemini 2.5 Flash, and it's a notable release for anyone who builds with AI. The new model lets you decide how much "thinking" the AI does, which can save you a lot of money. It's like having a brain with a dimmer switch: turn it up when you need deep reasoning, turn it down when you don't.

This is big news for companies worried about AI costs. Until now, if you wanted an AI that could solve hard problems, you paid top dollar all the time, even for simple tasks. Google's answer is a single model that can switch between deep reasoning and quick answers.

Understanding "Thinking Budgets" - The Game-Changing Feature

The star feature of Gemini 2.5 Flash is what Google calls "thinking budgets." It sounds a little odd - do AIs really think? Sort of. When the model tackles a complex problem, it runs extra computation steps before producing an answer. That extra computation is what Google means by "thinking."

"We know cost and latency matter for a number of developer use cases, and so we want to offer developers the flexibility to adapt the amount of the thinking the model does, depending on their needs," explained Tulsee Doshi, Product Director for Gemini Models at Google DeepMind.

The thinking budget works like a limit you can set - from zero (no extra thinking) all the way up to 24,576 tokens of thinking. But here's the clever bit: the model decides how much of that budget to actually use based on how hard the question is. Ask something simple, it hardly uses any. Ask something super complex, it might use the whole budget to work out the answer.
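Since the budget is just an integer cap, it's easy to sanity-check in code. Here's a minimal sketch - the helper name is hypothetical; only the 0 to 24,576 range comes from the article:

```python
MAX_THINKING_BUDGET = 24_576  # documented upper limit for Gemini 2.5 Flash

def clamp_thinking_budget(requested: int) -> int:
    """Keep a requested thinking budget inside the supported 0-24,576 range.

    The budget is a *maximum*: the model may use far fewer thinking tokens
    on easy prompts, so this only bounds the worst case.
    """
    return max(0, min(requested, MAX_THINKING_BUDGET))

print(clamp_thinking_budget(50_000))  # over the cap, clamped to 24576
print(clamp_thinking_budget(-5))      # negative requests become 0 (thinking off)
print(clamp_thinking_budget(1_024))   # in range, returned unchanged
```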

How Gemini 2.5 Flash Cuts AI Costs by 600%

Let's talk money, because that's where things get really interesting. Gemini 2.5 Flash charges $0.15 per million input tokens (the text you feed into the AI). For output tokens (what the AI gives back), the price depends on whether thinking is turned on or off:

  • With thinking turned OFF: $0.60 per million output tokens
  • With thinking turned ON: $3.50 per million output tokens

That's nearly a sixfold (almost 600%) price difference! Turning thinking off for simple tasks can cut the output side of your AI bill by more than 80%.
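To see what the split pricing means in practice, here's a back-of-envelope cost sketch using the per-million-token prices above. It's deliberately simplified: it treats any thinking tokens as already folded into the output count, and the token volumes are made-up examples.

```python
# Published preview prices for Gemini 2.5 Flash (USD per million tokens).
INPUT_PRICE = 0.15
OUTPUT_PRICE_THINKING_OFF = 0.60
OUTPUT_PRICE_THINKING_ON = 3.50   # thinking tokens bill at this rate too

def monthly_cost(input_tokens: int, output_tokens: int, thinking: bool) -> float:
    """Estimate spend in dollars for a given token volume."""
    out_rate = OUTPUT_PRICE_THINKING_ON if thinking else OUTPUT_PRICE_THINKING_OFF
    return (input_tokens * INPUT_PRICE + output_tokens * out_rate) / 1_000_000

# Example: 10M input and 10M output tokens in a month.
print(monthly_cost(10_000_000, 10_000_000, thinking=False))  # 7.5
print(monthly_cost(10_000_000, 10_000_000, thinking=True))   # 36.5
```

At that volume, the same workload costs $7.50 with thinking off versus $36.50 with it on, which is where the headline savings come from.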

"Customers pay for any thinking and output tokens the model generates," Doshi told VentureBeat. "In the AI Studio UX, you can see these thoughts before a response. In the API, we currently don't provide access to the thoughts, but a developer can see how many tokens were generated."

This pricing model makes tons of sense for businesses that need AI for different types of tasks. Why pay for a brain surgeon when you just need a general checkup?

Performance Comparison: How 2.5 Flash Stacks Up Against Competitors

So you might be wondering - does cheaper mean worse? Not necessarily. Google claims that 2.5 Flash performs well on key benchmarks while staying smaller than competing models.

On the notoriously hard "Humanity's Last Exam" benchmark, 2.5 Flash scored 12.1%, putting it ahead of Anthropic's Claude 3.7 Sonnet (8.9%) and DeepSeek R1 (8.6%). It fell short of OpenAI's new o4-mini model (14.3%), but that's still impressive considering the cost savings.

The model also did well on technical tests, scoring 78.3% on GPQA Diamond and between 78% and 88% on the 2024 and 2025 AIME math exams. These results suggest that Google is closing the gap with competitors while keeping prices lower - something businesses watching their budgets will appreciate.

"Companies should choose 2.5 Flash because it provides the best value for its cost and speed," Doshi explained. "It's particularly strong relative to competitors on math, multimodal reasoning, long context, and several other key metrics."


Smart vs. Speedy: When to Use Different Thinking Budget Levels

One of the coolest things about 2.5 Flash is how you can match the thinking level to the task. It's like picking the right tool for the job instead of always using a sledgehammer.

For simple stuff like:

  • Translating languages
  • Answering basic facts
  • Summarizing straightforward content
  • Writing simple emails

You can turn thinking way down or off completely. The model will respond super fast and cost much less.

But for tough tasks like:

  • Solving complex math problems
  • Analyzing detailed financial data
  • Creating nuanced content strategies
  • Debugging complicated code

You can crank up the thinking budget and let the model really work through the problem step by step.
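One way to put that matching into practice is a small task router that assigns a default budget per task category. The categories and budget values below are illustrative assumptions for demonstration, not Google recommendations; tune them against your own quality and cost measurements.

```python
# Hypothetical mapping from task category to a default thinking budget.
SIMPLE_TASKS = {"translate", "fact_lookup", "summarize", "draft_email"}
COMPLEX_TASKS = {"math", "financial_analysis", "content_strategy", "debugging"}

def pick_thinking_budget(task: str) -> int:
    """Return a default thinking budget (in tokens) for a task category."""
    if task in SIMPLE_TASKS:
        return 0        # thinking off: fastest responses, cheapest output rate
    if task in COMPLEX_TASKS:
        return 24_576   # allow the full budget for genuinely hard problems
    return 4_096        # modest middle ground for unclassified tasks (assumed)

print(pick_thinking_budget("translate"))  # 0
print(pick_thinking_budget("debugging"))  # 24576
```

Note that output tokens bill at the higher rate whenever thinking is enabled at all, so routing simple tasks to a zero budget is where most of the savings come from.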

The really smart part? The model figures out how much thinking it actually needs. Ask "How many provinces does Canada have?" and it barely uses any thinking tokens. Ask it to solve a complex engineering problem about beam stress calculations, and it'll use more of its thinking budget to work through the solution properly.

This kind of flexibility is exactly what businesses dealing with economic uncertainty need - the ability to optimize costs while still getting high-quality results when they matter most.

Real-World Applications for Businesses and Developers

So what can you actually do with this new model? Tons of stuff! The adjustable thinking feature opens up new ways to use AI in business without breaking the bank.

Customer service teams could use low-thinking mode for common questions and high-thinking mode for complex troubleshooting. This would keep costs down while still providing great service for difficult problems.

Content creators could use minimal thinking for basic editing and formatting tasks, but switch to deep thinking for creating strategic content that requires careful reasoning and research.

Software developers might use different thinking levels for different stages of development - quick responses during rapid prototyping but deep thinking during security reviews or optimization phases.

Financial analysts could save money by using basic mode for standard reports but switching to full thinking power when analyzing unusual market trends.

The key benefit is that businesses don't have to choose between a cheap, simple AI and an expensive, smart one. With 2.5 Flash, they can have both in the same model and pay only for the brainpower they actually need.

Future Implications of Adjustable AI Reasoning

The introduction of adjustable reasoning marks a big shift in how we think about AI. In the past, AI models were fixed - you got what you paid for, and that was it. Now we're moving toward AI that adapts its capabilities based on need.

This could change how businesses budget for AI. Instead of having to predict exactly how much "smart AI" versus "basic AI" they need, companies can adjust on the fly.

Looking ahead, we might see even more granular control over AI systems. Maybe future models will let you adjust not just thinking budget but also creativity, cautiousness, or other aspects of performance.

Google's approach suggests the AI market is growing up. It's not just about having the smartest model anymore - it's about having the most flexible, cost-effective model that businesses can actually afford to use day to day.

Google's focus on giving users more control over, and transparency into, AI reasoning could also help build trust in these powerful technologies.


Frequently Asked Questions

What exactly is a "thinking budget" in Gemini 2.5 Flash?

A thinking budget is the maximum amount of computational resources (measured in tokens) that you allow the AI to use for reasoning through complex problems before generating a response. You can set it from 0 to 24,576 tokens.

How much can I really save by adjusting the thinking budget?

Turning thinking off drops output pricing from $3.50 to $0.60 per million tokens - nearly a sixfold difference, or roughly an 83% saving on output costs. For applications dominated by simple queries, that can add up to substantial savings.

Is Gemini 2.5 Flash available for everyone to use?

Currently, it's available in preview through Google AI Studio and Vertex AI. Consumers can also access it through the Gemini app as "2.5 Flash (Experimental)" in the model dropdown menu.

Does turning down the thinking budget make the AI dumber?

Not exactly. It makes the AI do less reasoning before responding, which is fine for simple tasks but might affect quality for complex problems. The model tries to use an appropriate amount of thinking based on question complexity.

How does Gemini 2.5 Flash compare to OpenAI's latest models?

It scores lower than OpenAI's o4-mini on some benchmarks (12.1% vs 14.3% on Humanity's Last Exam) but offers more flexibility in controlling costs through thinking budgets.

Can I see what the AI is "thinking" when it uses the thinking budget?

In Google AI Studio, you can see the thoughts before a response. In the API, you can't currently access the thoughts directly, but you can see how many thinking tokens were generated.

What kinds of tasks benefit most from higher thinking budgets?

Complex tasks like mathematical problem-solving, detailed analysis, multi-step reasoning, and nuanced content creation benefit most from higher thinking budgets.

Will this technology impact jobs in AI development?

Like many advances in AI, it could change how developers approach model deployment, with more focus on optimization and less on building separate models for different complexity levels.
