Google has reclaimed its spot at the top of the AI leaderboards with the release of Gemini 3.1 Pro on Thursday, February 19, 2026. This “point release” marks a significant intelligence upgrade, integrating the high-precision reasoning engine from the specialized “Deep Think” mode into the broader Pro model.
Gemini 3.1 Pro is now the default model in the Gemini app and NotebookLM, specifically optimized for agentic workflows, complex coding, and fact-grounded reasoning.
Google Gemini 3.1 Pro Benchmarks: The New King of Agentic Reasoning
The release of Gemini 3.1 Pro comes just one week after Google teased its “Deep Think” capabilities. While the previous Gemini 3 was already a top-tier model, the 3.1 update specifically targets the “hallucination gap” in complex, multi-step tasks.
1. Shattering the “Reasoning Ceiling” (ARC-AGI-2)
The most discussed metric in the Google Gemini 3.1 Pro benchmarks is the ARC-AGI-2 score, which tests a model’s ability to solve entirely new logic puzzles it hasn’t seen in its training data.
- Gemini 3.1 Pro Score: 77.1%
- Context: This is more than double the score of Gemini 3 Pro (31.1%) and significantly higher than Claude Opus 4.6 (68.8%) and GPT-5.2 (52.9%).
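For readers unfamiliar with the format, the sketch below shows a toy, ARC-flavored puzzle: the model is given one input/output pair, must infer the hidden transformation rule, and then apply it to a grid it has never seen. This is purely illustrative and not an actual ARC-AGI-2 item; the grids and the rule are invented for the example.

```python
# A toy, ARC-flavored puzzle: infer the transformation from one example pair,
# then apply it to a new grid. Illustrative only; not a real ARC-AGI-2 task.

def transform(grid: list[list[int]]) -> list[list[int]]:
    """The hidden rule for this toy task: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

example_input = [[1, 0, 0],
                 [2, 2, 0]]
example_output = transform(example_input)  # [[0, 0, 1], [0, 2, 2]]

# The model sees (example_input, example_output), must infer the mirroring
# rule, and then produce the answer for an unseen grid:
test_input = [[3, 0, 1]]
print(transform(test_input))               # [[1, 0, 3]]
```

Because the rule is never stated, memorized training data is useless here; that is what makes ARC-AGI-2 a test of reasoning rather than recall.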
2. Dominating “Humanity’s Last Exam” (HLE)
HLE is designed to be the ultimate test of frontier AI, consisting of questions so difficult they often stump human experts.
- Performance: Gemini 3.1 Pro scored 44.4% (without tools), the highest ever recorded for a Pro-tier model.
- The “Tool Use” Catch: Interestingly, when external search and code tools were enabled, Claude Opus 4.6 narrowly edged out Gemini (53.1% vs. 51.4%), suggesting Anthropic still holds a slight lead in “agentic tool orchestration.”
3. The APEX-Agents Leaderboard Victory
Brendan Foody, CEO of Mercor, confirmed that Gemini 3.1 Pro has taken the #1 spot on the APEX-Agents leaderboard.
- The Metric: APEX measures “long-horizon” professional tasks, such as a model acting as a junior analyst for several hours to complete a project.
- The Result: Gemini 3.1 Pro scored 33.5%, surpassing both Opus 4.6 (29.8%) and GPT-5.2 (23.0%).
4. Technical Specs: 1 Million Tokens & Animated SVGs
Beyond raw scores, Gemini 3.1 Pro introduces several functional upgrades:
- Context Window: Maintains a massive 1 million token window with improved “needle-in-a-haystack” recall.
- Animated SVGs: The model can now generate crisp, interactive animated SVG files directly from text prompts, a major win for web developers.
- Thinking Levels: Users can now toggle between “Standard” and “Deep” thinking modes to balance speed and reasoning depth (see the API sketch after this list).
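For developers, the Standard/Deep toggle corresponds to the thinking controls exposed through the Gemini API. Below is a minimal sketch using the google-genai Python SDK that requests an animated SVG with deeper reasoning enabled. Treat the details as assumptions: the model ID “gemini-3.1-pro” is a placeholder, and the `thinking_level` value is inferred from how earlier Gemini releases expose thinking controls, so verify both against the current API docs before copying.

```python
# pip install google-genai
from google import genai
from google.genai import types

# Reads the API key from the GOOGLE_API_KEY environment variable.
client = genai.Client()

# NOTE: "gemini-3.1-pro" is a placeholder model ID, and "high" is an assumed
# thinking-level value; check the current Gemini API docs for the real names.
response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents=(
        "Generate a self-contained animated SVG of a bouncing ball, "
        "using SMIL <animate> elements, with no external CSS or JS."
    ),
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)

# The SVG markup arrives as plain text; save it for a browser to render.
with open("bouncing_ball.svg", "w") as f:
    f.write(response.text)
```

Dialing the thinking level down trades reasoning depth for latency, which mirrors switching the app back to Standard mode.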
Gemini 3.1 Pro vs. Rivals: At a Glance
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | OpenAI GPT-5.2 |
| --- | --- | --- | --- |
| ARC-AGI-2 (Reasoning) | 77.1% | 68.8% | 52.9% |
| GPQA Diamond (Science) | 94.3% | 91.3% | 92.4% |
| SWE-Bench (Coding) | 80.6% | 80.8% | 80.0% |
| LiveCodeBench (Elo) | 2887 | N/A | 2393 |
Conclusion: The Agentic Era Has Arrived
The Google Gemini 3.1 Pro benchmarks confirm that the 2026 AI race has shifted from “knowledge retrieval” to “active problem solving.” By outperforming its peers in nearly every reasoning-heavy category, Google has positioned Gemini 3.1 Pro as the premier engine for developers building autonomous agents and complex enterprise workflows.
Check out our [Home Page] for more AI model comparisons and the latest from Google DeepMind.
Editor’s Choice: Why we recommend Taskade for this workflow
To fully utilize the agentic power shown in the Gemini 3.1 Pro benchmarks, we recommend using Taskade. You can integrate Gemini 3.1 Pro into Taskade’s AI agents to handle complex data synthesis and long-horizon project management, turning raw benchmark performance into real-world business productivity.