
GPT-5 Benchmarks

GPT-5 is finally here. It's showing up strong across benchmarks — math, coding, reasoning, and even multimodal tasks.


We've gathered its latest benchmark scores and lined them up against other top models like Claude 4 Sonnet, Claude 4.1 Opus, Gemini 2.5 Pro, and Grok-4.

Let's dive in.

Humanity's Last Exam

A benchmark designed to test complex, cross-domain reasoning at the highest difficulty levels.

[Chart: GPT-5 Benchmark - Humanity's Last Exam]

GPT-5 sets a new standard here, outperforming most peers across all tested domains.

Math

AIME

Measures advanced competition-level mathematical reasoning.
[Chart: GPT-5 Benchmark - AIME 2025]
GPT-5 pro scored a perfect 100%, while GPT-5 (no tools) achieved 94.6%, the highest score among models without tool assistance.

Closest competitors were Grok-4 at 91.7% and Gemini 2.5 at 83.0%, with Claude 4.1 Opus trailing at 78.0%.

FrontierMath

Evaluates expert-level mathematical reasoning across some of the most challenging, research-grade problems.

[Chart: GPT-5 Benchmark - FrontierMath]

GPT-5 pro (python) reached 32.1%, a notable jump from the previous 27.4% set by the ChatGPT agent with full tool access. Without tools, GPT-5 scored 13.5%, still ahead of earlier generations.

GPQA

Tests PhD-level science and reasoning questions that cannot be easily answered by search.
[Chart: GPT-5 Benchmark - GPQA]

GPT-5 pro (python) reached 89.4%, leading all models, with Grok-4 close behind at 87.5%. Gemini 2.5 Pro scored 83.0%, while Claude 4.1 Opus and Claude 4 Sonnet trailed at 80.9% and 75.4%, respectively.

Coding

SWE-bench Verified

Measures real-world software engineering capability by testing whether a model can resolve GitHub issues using repository context.

[Chart: GPT-5 Benchmark - SWE-bench Verified]

GPT-5 shows a notable leap in coding performance, with the thinking version reaching 74.9%, up from earlier-generation results in the mid-60% range. Even without tool assistance, GPT-5 records 52.8%, indicating stronger baseline problem-solving.

When compared to other models, GPT-5 (thinking) leads the pack, narrowly ahead of Claude 4.1 Opus (74.5%) and Claude 4 Sonnet (72.7%). Gemini 2.5 Pro trails at 67.2%, reflecting a clear advantage for GPT-5 in high-fidelity code reasoning tasks.
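
For readers unfamiliar with how SWE-bench-style evaluation works, the sketch below shows the general shape of the loop: apply a model-generated patch to a repository checkout, run the project's tests, and count the issue as resolved only if they pass. It is a simplified illustration with assumed paths and a placeholder test command, not the official SWE-bench Verified harness.

```python
import subprocess
from pathlib import Path


def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the tests pass.

    A simplified stand-in for a SWE-bench-style check: the real harness also
    pins the base commit, installs dependencies, and distinguishes tests that
    must flip from failing to passing from tests that must keep passing.
    """
    repo = Path(repo_dir).resolve()
    patch = Path(patch_file).resolve()

    # Apply the candidate patch produced by the model.
    applied = subprocess.run(
        ["git", "apply", str(patch)],
        cwd=repo, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run the tests associated with the original GitHub issue.
    tests = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
    return tests.returncode == 0


if __name__ == "__main__":
    # Hypothetical repository checkout, patch file, and test command.
    resolved = evaluate_patch(
        repo_dir="./checkouts/example-project",
        patch_file="./patches/issue-1234.diff",
        test_cmd=["python", "-m", "pytest", "-x", "tests/"],
    )
    print("resolved" if resolved else "not resolved")
```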

Agentic tool use

Tau2-bench

Evaluates autonomous planning and execution via function calling across industries.

[Chart: GPT-5 Benchmark - Tau2-bench]

In this benchmark, GPT-5 demonstrates strong multi-domain performance:

  • Airline: 62.6% with thinking, 55.0% without tools
  • Retail: 81.1% with thinking, 72.8% without tools
  • Telecom: 96.7% with thinking, 38.6% without tools

These results highlight GPT-5's capacity to adapt across industries, with especially high accuracy in telecom workflows when reasoning capabilities are enabled. The large gap between “with thinking” and “without tools” also illustrates the significant impact of extended reasoning in complex tool-use scenarios.
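
To make the function-calling setup concrete, here is a minimal sketch of a single tool-use request with the OpenAI Python SDK. The tool name, JSON schema, and the "gpt-5" model string are illustrative assumptions; Tau2-bench's actual environments define their own domain-specific tools, simulated users, and scoring.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical airline-domain tool, loosely in the spirit of Tau2-bench.
tools = [{
    "type": "function",
    "function": {
        "name": "rebook_flight",
        "description": "Rebook a passenger onto a different flight.",
        "parameters": {
            "type": "object",
            "properties": {
                "reservation_id": {"type": "string"},
                "new_flight_number": {"type": "string"},
            },
            "required": ["reservation_id", "new_flight_number"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5",  # assumed model name; substitute whatever is available to you
    messages=[{"role": "user", "content": "Move reservation ABC123 to flight UA455."}],
    tools=tools,
)

# If the model decides to call the tool, its arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

In a full agentic loop, the application would execute the requested function, append the result as a tool message, and let the model keep planning until the task is done; the benchmark scores above measure how reliably that loop reaches the correct end state.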

Multimodal reasoning

GPT-5 demonstrates strong reasoning capabilities not only with text but also with images, videos, and diagrams.

[Chart: GPT-5 Benchmark - multimodal reasoning]

  • MMMU (College-level visual problem-solving): 84.2% (thinking) and 74.4% (no tools), slightly outperforming OpenAI o3 (82.9%).
  • MMMU Pro (Graduate-level visual problem-solving): 78.4% (thinking) and 62.7% (no tools), showing stable performance on complex visual reasoning tasks.
  • VideoMMMU (Video-based multimodal reasoning): 84.6% (thinking) and 61.6% (no tools), maintaining high accuracy even with a 256-frame limit.

These results indicate that GPT-5 can accurately interpret and reason over non-textual inputs such as visuals, videos, and diagrams, with the “thinking” mode delivering significant improvements on challenging visual tasks.
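
For a rough sense of what reasoning over images looks like in practice, the sketch below sends an image URL alongside a text question using the OpenAI Python SDK. The model string and image URL are placeholders, and the MMMU-family benchmarks above use curated datasets and standardized scoring rather than ad-hoc prompts like this.

```python
from openai import OpenAI

client = OpenAI()

# Ask a question about a diagram supplied as an image URL (placeholder URL).
response = client.chat.completions.create(
    model="gpt-5",  # assumed model name for illustration
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show, and why?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)

print(response.choices[0].message.content)
```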

Accuracy & Reliability

[Chart: GPT-5 Benchmark - hallucination rate]

GPT-5 shows a substantial reduction in factual errors compared to previous models. On production-like queries, its responses are about 45% less likely to contain a factual error than GPT-4o's, and with "thinking" enabled, about 80% less likely than OpenAI o3's.

On open-source factuality benchmarks (LongFact and FActScore), GPT-5 with thinking achieves 6× fewer hallucinations than o3, delivering more consistent and accurate long-form reasoning.

In live traffic, GPT-5 with thinking has the lowest observed error rate (4.8%), compared to 11.6% without thinking, 22.0% for o3, and 20.6% for GPT-4o.

[Chart: GPT-5 Benchmark - error rate]
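
Those live-traffic figures are consistent with the relative reductions quoted above; the quick check below, a small Python sketch using only the numbers in this section, shows that 4.8% versus 22.0% is roughly an 80% reduction relative to o3, and 11.6% versus 20.6% is roughly a 45% reduction relative to GPT-4o.

```python
# Relative error-rate reductions computed from the live-traffic figures above.
def reduction(new_rate: float, baseline_rate: float) -> float:
    """Fractional reduction of new_rate relative to baseline_rate."""
    return 1 - new_rate / baseline_rate

# GPT-5 with thinking (4.8%) vs. o3 (22.0%)
print(f"vs o3:     {reduction(0.048, 0.220):.0%}")  # ~78%, i.e. about 80% fewer errors

# GPT-5 without thinking (11.6%) vs. GPT-4o (20.6%)
print(f"vs GPT-4o: {reduction(0.116, 0.206):.0%}")  # ~44%, i.e. about 45% fewer errors
```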

Conclusion

Across multiple benchmarks, GPT-5 demonstrates clear and consistent improvements over its predecessors. From mathematics, science, and software engineering to multimodal reasoning and agentic tool use, the model shows substantial gains in both stability and accuracy.

Compared to other leading models, GPT-5 consistently delivers higher accuracy and reliability, particularly in complex problem-solving and long-form reasoning. These advancements position GPT-5 as a more dependable AI partner, ready for real-world, high-stakes applications.


Ready to see GPT-5 in action?

Try it free today in Slack, Microsoft Teams, or your favorite workspace — and experience the difference.