Google’s Gemini 2.5 Pro could be the most important AI model so far this year

Google released its new Gemini 2.5 Pro Experimental AI model late last month, and it’s quickly stacked up top marks on a number of coding, math, and reasoning benchmark tests—making it a contender for the world’s best model right now.

Gemini 2.5 Pro is a “reasoning” model, meaning its answers derive from a mix of training data and real-time reasoning performed in response to the user prompt or question. Like other newer models, Gemini 2.5 Pro can consult the web, but it also contains a fairly recent snapshot of the world’s knowledge: Its training data cuts off at the end of January 2025.

Last year, to boost model performance, AI researchers began shifting toward teaching models to “reason” when they’re live and responding to user prompts. This approach requires models to process and retain ever more data to arrive at accurate answers. (Gemini 2.5 Pro, for example, can handle up to a million tokens.) However, models often struggle with information overload, which makes it difficult for them to extract meaningful insights from all that context.

Google appears to have made progress on this front. The YouTube channel AI Explained points out that Gemini 2.5 Pro fared very well on a new benchmark test called Fiction.liveBench that’s designed to test a model’s ability to remember and comprehend context information. For instance, Fiction.liveBench might ask the model to read a novelette and answer questions that require a deep understanding of the story and characters. Some of the top models, including those from OpenAI and Anthropic, score well when the amount of stored data (the context window) is relatively small. But as the context window increases to 32K, then 60K, then 120K tokens (about the size of a novelette), Gemini 2.5 Pro stands out for its superior comprehension.

That’s important because some of the most productive use cases to date for generative AI involve comprehending and summarizing large amounts of data. A service representative might depend on an AI tool to swim through voluminous manuals in order to help someone struggling with a technical problem out in the field, or a corporate compliance officer might need a long context window to sift through years of regulations and policies. 

Gemini 2.5 Pro also scored much higher than competing reasoning models on a new benchmark called MathArena, which tests models using hard questions from recent math Olympiads and contests. The test also requires that the model clearly show its reasoning as it steps toward an answer. Top models from OpenAI, Anthropic, and DeepSeek failed to break 5% of a perfect score, but Gemini 2.5 Pro scored an impressive 24.4%.

The new Google model also scored high on another extremely hard benchmark called Humanity’s Last Exam, which is meant to show when AI models exceed the knowledge and reasoning of top experts in a given field. Gemini 2.5 Pro scored 18.8%, a score topped only by OpenAI’s Deep Research model. The model also now sits atop the crowdsourced benchmarking leaderboard, LMArena.

Finally, Gemini 2.5 Pro is among the top models for computer coding. It scored 70.4% on the LiveCodeBench benchmark, coming in just behind OpenAI’s o3-mini model, which scored 74.1%. Gemini 2.5 Pro scored 63.8% on SWE-bench, which measures agentic coding, while Anthropic’s latest Claude 3.7 Sonnet scored 70.3%. Google’s model also outscored Anthropic, OpenAI, and xAI models on the MMMU visual reading test by roughly 6 points.

Google initially released its new model to paying subscribers but has now made it available to all users for free.


https://www.fastcompany.com/91311063/google-gemini-2-5-pro-testing?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 5mo ago | Apr 3, 2025, 22:10:02

