
GPT-5 Benchmarks

GPT-5 is finally here. It's showing up strong across benchmarks — math, coding, reasoning, and even multimodal tasks.


We've gathered its latest benchmark scores and lined them up against other top models like Claude 4 Sonnet, Claude 4.1 Opus, Gemini 2.5 Pro, and Grok-4.

Let's dive in.

Humanity's Last Exam

A benchmark designed to test complex, cross-domain reasoning at the highest difficulty levels.

[Chart: GPT-5 Benchmark - Humanity's Last Exam]

GPT-5 sets a new standard here, outperforming most peers across all tested domains.

Math

AIME

Measures advanced competition-level mathematical reasoning.
[Chart: GPT-5 Benchmark - AIME 2025]
GPT-5 pro scored a perfect 100%, while GPT-5 (no tools) achieved 94.6%, the highest score among models without tool assistance.

Closest competitors were Grok-4 at 91.7% and Gemini 2.5 at 83.0%, with Claude 4.1 Opus trailing at 78.0%.

FrontierMath

Evaluates expert-level mathematical reasoning across some of the most challenging, research-grade problems.

[Chart: GPT-5 Benchmark - FrontierMath]

GPT-5 pro (python) reached 32.1%, a notable jump from the previous 27.4% set by the ChatGPT agent with full tool access. Without tools, GPT-5 scored 13.5%, still ahead of earlier generations.

GPQA

Tests PhD-level science and reasoning questions that cannot be easily answered by search.
[Chart: GPT-5 Benchmark - GPQA]

GPT-5 pro (python) reached 89.4%, leading all models, with Grok-4 close behind at 87.5%. Gemini 2.5 Pro scored 83.0%, while Claude 4.1 Opus and Claude 4 Sonnet trailed at 80.9% and 75.4%, respectively.

Coding

SWE-bench Verified

Measures real-world software engineering capability by testing whether a model can resolve GitHub issues using repository context.

[Chart: GPT-5 Benchmark - SWE-bench Verified]

GPT-5 shows a notable leap in coding performance, with the thinking version reaching 74.9%, up from earlier-generation results in the mid-60% range. Even without tool assistance, GPT-5 records 52.8%, indicating stronger baseline problem-solving.

When compared to other models, GPT-5 (thinking) leads the pack, narrowly ahead of Claude 4.1 Opus (74.5%) and Claude 4 Sonnet (72.7%). Gemini 2.5 Pro trails at 67.2%, reflecting a clear advantage for GPT-5 in high-fidelity code reasoning tasks.
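
For readers unfamiliar with how SWE-bench-style evaluation works, the sketch below shows the general shape of the loop: apply a model-generated patch to a repository checkout, run the project's tests, and count the issue as resolved only if they pass. It is a simplified illustration with assumed paths and a placeholder test command, not the official SWE-bench Verified harness.

```python
import subprocess
from pathlib import Path


def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the tests pass.

    A simplified stand-in for a SWE-bench-style check: the real harness also
    pins the base commit, installs dependencies, and distinguishes tests that
    must flip from failing to passing from tests that must keep passing.
    """
    repo = Path(repo_dir).resolve()
    patch = Path(patch_file).resolve()

    # Apply the candidate patch produced by the model.
    applied = subprocess.run(
        ["git", "apply", str(patch)],
        cwd=repo, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run the tests associated with the original GitHub issue.
    tests = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
    return tests.returncode == 0


if __name__ == "__main__":
    # Hypothetical repository checkout, patch file, and test command.
    resolved = evaluate_patch(
        repo_dir="./checkouts/example-project",
        patch_file="./patches/issue-1234.diff",
        test_cmd=["python", "-m", "pytest", "-x", "tests/"],
    )
    print("resolved" if resolved else "not resolved")
```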

Agentic tool use

Tau2-bench

Evaluates autonomous planning and execution via function calling across industries.

[Chart: GPT-5 Benchmark - Tau2-bench]

In this benchmark, GPT-5 demonstrates strong multi-domain performance:

  • Airline: 62.6% with thinking, 55.0% without tools
  • Retail: 81.1% with thinking, 72.8% without tools
  • Telecom: 96.7% with thinking, 38.6% without tools

These results highlight GPT-5's capacity to adapt across industries, with especially high accuracy in telecom workflows when reasoning capabilities are enabled. The large gap between “with thinking” and “without tools” also illustrates the significant impact of extended reasoning in complex tool-use scenarios.
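
To make the function-calling setup concrete, here is a minimal sketch of a single tool-use request with the OpenAI Python SDK. The tool name, JSON schema, and the "gpt-5" model string are illustrative assumptions; Tau2-bench's actual environments define their own domain-specific tools, simulated users, and scoring.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical airline-domain tool, loosely in the spirit of Tau2-bench.
tools = [{
    "type": "function",
    "function": {
        "name": "rebook_flight",
        "description": "Rebook a passenger onto a different flight.",
        "parameters": {
            "type": "object",
            "properties": {
                "reservation_id": {"type": "string"},
                "new_flight_number": {"type": "string"},
            },
            "required": ["reservation_id", "new_flight_number"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5",  # assumed model name; substitute whatever is available to you
    messages=[{"role": "user", "content": "Move reservation ABC123 to flight UA455."}],
    tools=tools,
)

# If the model decides to call the tool, its arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

In a full agentic loop, the application would execute the requested function, append the result as a tool message, and let the model keep planning until the task is done; the benchmark scores above measure how reliably that loop reaches the correct end state.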

Multimodal reasoning

GPT-5 demonstrates strong reasoning capabilities not only with text but also with images, videos, and diagrams.

[Chart: GPT-5 Benchmark - multimodal reasoning]

  • MMMU (College-level visual problem-solving): 84.2% (thinking) and 74.4% (no tools), slightly outperforming OpenAI o3 (82.9%).
  • MMMU Pro (Graduate-level visual problem-solving): 78.4% (thinking) and 62.7% (no tools), showing stable performance on complex visual reasoning tasks.
  • VideoMMMU (Video-based multimodal reasoning): 84.6% (thinking) and 61.6% (no tools), maintaining high accuracy even with a 256-frame limit.

These results indicate that GPT-5 can accurately interpret and reason over non-textual inputs such as visuals, videos, and diagrams, with the “thinking” mode delivering significant improvements on challenging visual tasks.
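
For a rough sense of what reasoning over images looks like in practice, the sketch below sends an image URL alongside a text question using the OpenAI Python SDK. The model string and image URL are placeholders, and the MMMU-family benchmarks above use curated datasets and standardized scoring rather than ad-hoc prompts like this.

```python
from openai import OpenAI

client = OpenAI()

# Ask a question about a diagram supplied as an image URL (placeholder URL).
response = client.chat.completions.create(
    model="gpt-5",  # assumed model name for illustration
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show, and why?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)

print(response.choices[0].message.content)
```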

Accuracy & Reliability

[Chart: GPT-5 Benchmark - hallucination rate]

GPT-5 shows a substantial reduction in factual errors compared to previous models. On production-like queries, its responses are about 45% less likely to contain a factual error than GPT-4o's, and with "thinking" enabled, about 80% less likely than OpenAI o3's.

On open-source factuality benchmarks (LongFact and FActScore), GPT-5 with thinking achieves 6× fewer hallucinations than o3, delivering more consistent and accurate long-form reasoning.

In live traffic, GPT-5 with thinking has the lowest observed error rate (4.8%), compared to 11.6% without thinking, 22.0% for o3, and 20.6% for GPT-4o.

[Chart: GPT-5 Benchmark - error rate]
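
Those live-traffic figures are consistent with the relative reductions quoted above; the quick check below, a small Python sketch using only the numbers in this section, shows that 4.8% versus 22.0% is roughly an 80% reduction relative to o3, and 11.6% versus 20.6% is roughly a 45% reduction relative to GPT-4o.

```python
# Relative error-rate reductions computed from the live-traffic figures above.
def reduction(new_rate: float, baseline_rate: float) -> float:
    """Fractional reduction of new_rate relative to baseline_rate."""
    return 1 - new_rate / baseline_rate

# GPT-5 with thinking (4.8%) vs. o3 (22.0%)
print(f"vs o3:     {reduction(0.048, 0.220):.0%}")  # ~78%, i.e. about 80% fewer errors

# GPT-5 without thinking (11.6%) vs. GPT-4o (20.6%)
print(f"vs GPT-4o: {reduction(0.116, 0.206):.0%}")  # ~44%, i.e. about 45% fewer errors
```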

Conclusion

Across multiple benchmarks, GPT-5 demonstrates clear and consistent improvements over its predecessors. From mathematics, science, and software engineering to multimodal reasoning and agentic tool use, the model shows substantial gains in both stability and accuracy.

Compared to other leading models, GPT-5 consistently delivers higher accuracy and reliability, particularly in complex problem-solving and long-form reasoning. These advancements position GPT-5 as a more dependable AI partner, ready for real-world, high-stakes applications.


Ready to see GPT-5 in action?

Try it free today in Slack, Microsoft Teams, or your favorite workspace — and experience the difference.