Anthropic overtakes OpenAI: Claude Opus 4 codes seven hours nonstop, sets record SWE-Bench score and reshapes enterprise AI

22 May 2025 at 16:45

Credit: VentureBeat made with Midjourney

Anthropic's Claude Opus 4 outperforms OpenAI's GPT-4.1 with unprecedented seven-hour autonomous coding sessions and a record-breaking 72.5% SWE-bench score, transforming AI from a quick-response tool into a day-long collaborator.

Microsoft launches Phi-4-Reasoning-Plus, a small, powerful, open weights reasoning model!

VentureBeat

By:Carl Franzen

1 May 2025 at 12:41

The release demonstrates that with carefully curated data and training techniques, small models can deliver strong reasoning performance.

30 seconds vs. 3: The d1 reasoning framework that’s slashing AI response times

VentureBeat

By:Ben Dickson

28 April 2025 at 19:31

The d1 framework boosts diffusion LLMs with novel reinforcement learning, unlocking efficient problem-solving AI possibilities.

New study shows why simulated reasoning AI models don’t yet live up to their billing

Ars Technica

By:Benj Edwards

25 April 2025 at 21:43

There's a curious contradiction at the heart of today's most capable AI models that purport to "reason": They can solve routine math problems with accuracy, yet when faced with formulating deeper mathematical proofs found in competition-level challenges, they often fail.

That's the finding of eye-opening preprint research into simulated reasoning (SR) models, initially listed in March and updated in April, that mostly fell under the news radar. The research serves as an instructive case study on the mathematical limitations of SR models, despite sometimes grandiose marketing claims from AI vendors.

What sets simulated reasoning models apart from traditional large language models (LLMs) is that they have been trained to output a step-by-step "thinking" process (often called "chain-of-thought") to solve problems. Note that "simulated" in this case doesn't mean that the models do not reason at all but rather that they do not necessarily reason using the same techniques as humans. That distinction is important because human reasoning itself is difficult to define.

OpenAI launches o3 and o4-mini, AI models that ‘think with images’ and use tools autonomously

VentureBeat

By:Michael Nuñez

16 April 2025 at 18:38

OpenAI launches groundbreaking o3 and o4-mini AI models that can manipulate and reason with images, representing a major advance in visual problem-solving and tool-using artificial intelligence.

When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

VentureBeat

By:Ben Dickson

15 April 2025 at 23:50

Not all AI scaling strategies are equal. Longer reasoning chains are not a sign of higher intelligence. More compute isn't always the answer.

OpenAI releases new simulated reasoning models with full tool access

Ars Technica

By:Benj Edwards

16 April 2025 at 22:21

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI's reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less capable derivative models named "o3-mini" and "o3-mini-high" have been available. However, the new models replace their predecessors, o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the "Think" option before submitting queries. OpenAI CEO Sam Altman tweeted that "we expect to release o3-pro to the pro tier in a few weeks."

Researchers concerned to find AI models misrepresenting their “reasoning” processes

Ars Technica

By:Benj Edwards

10 April 2025 at 22:37

Remember when teachers demanded that you "show your work" in school? Some new types of AI models promise to do exactly that, but new research suggests that the "work" they show can sometimes be misleading or disconnected from the actual process used to reach the answer.

New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek's R1, and its own Claude series. In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process.

(It's worth noting that OpenAI's o1 and o3 series SR models were excluded from this study.)

Now it’s TikTok parent ByteDance’s turn for a reasoning AI: enter Seed-Thinking-v1.5!

VentureBeat

By:Carl Franzen

11 April 2025 at 19:08

Seed-Thinking-v1.5 achieved an 8.0% higher win rate over DeepSeek R1, suggesting that its strengths generalize beyond just logic or math-heavy challenges.

Deep Cogito emerges from stealth with hybrid AI ‘reasoning’ models

TechCrunch

By:Kyle Wiggers

8 April 2025 at 19:50

A new company, Deep Cogito, has emerged from stealth with a family of openly available AI models that can be switched between “reasoning” and non-reasoning modes. Reasoning models like OpenAI’s o1 have shown great promise in domains like math and physics, thanks to their ability to effectively fact-check themselves by working through complex problems step […]