OpenAI o3-mini vs Anthropic Sonnet 3.7 Thinking for Bug Detection

Company

Greptile

Date Published

April 3, 2025

Author

Everett Butler

Word count

682

Language

English

Hacker News points

None

URL

www.greptile.com/blog/o3-mini-vs-sonnet-3.7-thinking

Summary

OpenAI's o3-mini outperformed Anthropic's Sonnet 3.7 Thinking in a bug-detection benchmark, catching more bugs overall across multiple programming languages, with its clearest advantage in Python and Rust. Although Sonnet 3.7 Thinking is designed as a "thinking" model with an added planning step, it did not outperform o3-mini overall; its strengths showed in lower-resource languages like Ruby and Go, where logical deduction plays a bigger role. The results suggest that an explicit thinking step has value in certain scenarios, but that Sonnet 3.7 Thinking still needs to perform more consistently across languages to match o3-mini, which is itself a reasoning model.