New study shows why simulated reasoning AI models don’t yet live up to their billing

25 April 2025 at 21:43

There's a curious contradiction at the heart of today's most capable AI models that purport to "reason": They can solve routine math problems with impressive accuracy, yet when asked to formulate the deeper mathematical proofs found in competition-level challenges, they often fail.

That's the finding of eye-opening preprint research into simulated reasoning (SR) models, first posted in March and updated in April, that mostly flew under the news radar. The research serves as an instructive case study on the mathematical limitations of SR models, despite sometimes grandiose marketing claims from AI vendors.

What sets simulated reasoning models apart from traditional large language models (LLMs) is that they have been trained to output a step-by-step "thinking" process (often called "chain-of-thought") to solve problems. Note that "simulated" in this case doesn't mean that the models do not reason at all but rather that they do not necessarily reason using the same techniques as humans. That distinction is important because human reasoning itself is difficult to define.
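
For illustration, here's a minimal sketch of what chain-of-thought prompting looks like in practice. The `call_model` stub and the prompt wording are assumptions for the example, not any vendor's actual training setup or API:

```python
# Minimal sketch of chain-of-thought prompting. `call_model` is a
# hypothetical stand-in for any chat-completion API, not a real SDK call.

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for a request to an LLM endpoint."""
    raise NotImplementedError("wire this to your provider's chat API")

def solve_with_cot(question: str) -> tuple[str, str]:
    # Ask the model to emit intermediate steps before its answer, with a
    # delimiter so the two parts can be separated afterward.
    prompt = (
        f"{question}\n\n"
        "Think step by step. Write your reasoning first, then a final line "
        "beginning with 'ANSWER:'."
    )
    reply = call_model(prompt)
    reasoning, _, answer = reply.rpartition("ANSWER:")
    return reasoning.strip(), answer.strip()
```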

In the age of AI, we must protect human creativity as a natural resource

25 April 2025 at 11:00

Ironically, our present AI age has shone a bright spotlight on the immense value of human creativity as breakthroughs in technology threaten to undermine it. As tech giants rush to build newer AI models, their web crawlers vacuum up creative content, and those same models spew floods of synthetic media, threatening to drown out the human creative spark in an ocean of pablum.

Given this trajectory, AI-generated content may soon exceed the entire corpus of historical human creative works, making the preservation of the human creative ecosystem not just an ethical concern but an urgent imperative. The alternative is nothing less than a gradual homogenization of our cultural landscape, where machine learning flattens the richness of human expression into a mediocre statistical average.

A limited resource

By ingesting billions of creations, chatbots learn to talk, and image synthesizers learn to draw. Along the way, the AI companies behind them treat our shared culture like an inexhaustible resource to be strip-mined, with little thought for the consequences.

AI secretly helped write California bar exam, sparking uproar

23 April 2025 at 19:05

On Monday, the State Bar of California revealed that it used AI to develop a portion of multiple-choice questions on its February 2025 bar exam, causing outrage among law school faculty and test takers. The admission comes after weeks of complaints about technical problems and irregularities during the exam administration, reports the Los Angeles Times.

The State Bar disclosed that its psychometrician (a person or organization skilled in administering psychological tests), ACS Ventures, created 23 of the 171 scored multiple-choice questions with AI assistance. Another 48 questions came from a first-year law student exam, while Kaplan Exam Services developed the remaining 100 questions.

The State Bar defended its practices, telling the LA Times that all questions underwent review by content validation panels and subject matter experts before the exam. "The ACS questions were developed with the assistance of AI and subsequently reviewed by content validation panels and a subject matter expert in advance of the exam," wrote State Bar Executive Director Leah Wilson in a press release.

OpenAI releases new simulated reasoning models with full tool access

16 April 2025 at 22:21

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI's reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less capable derivative models named "o3-mini" and "o3-mini-high" have been available. However, the new models replace their predecessors—o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the "Think" option before submitting queries. OpenAI CEO Sam Altman tweeted that "we expect to release o3-pro to the pro tier in a few weeks."

Researchers claim breakthrough in fight against AI’s frustrating security hole

16 April 2025 at 11:15

In the AI world, a vulnerability called a "prompt injection" has haunted developers since chatbots went mainstream in 2022. Despite numerous attempts to solve this fundamental vulnerability—the digital equivalent of whispering secret instructions to override a system's intended behavior—no one has found a reliable solution. Until now, perhaps.

Google DeepMind has unveiled CaMeL (CApabilities for MachinE Learning), a new approach to stopping prompt-injection attacks that abandons the failed strategy of having AI models police themselves. Instead, CaMeL treats language models as fundamentally untrusted components within a secure software framework, creating clear boundaries between user commands and potentially malicious content.

The new paper grounds CaMeL's design in established software security principles like Control Flow Integrity (CFI), Access Control, and Information Flow Control (IFC), adapting decades of security engineering wisdom to the challenges of LLMs.
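
The core idea can be shown in a toy sketch: values derived from untrusted content carry a source tag, and a policy check blocks tagged data from flowing into sensitive tools. Everything below (the `Tagged` class, `quarantined_parse`, `send_email`) is a simplified illustration of the general pattern, not code from the CaMeL paper:

```python
# Toy illustration of capability-style information flow control: data parsed
# from untrusted content is tagged, and tool calls check tags before running.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    value: str
    source: str  # "user" (trusted) or "web" (untrusted)

def quarantined_parse(document: str) -> Tagged:
    """Pretend quarantined-LLM step: extract a field from untrusted text."""
    # A real system would use a model with no tool access here.
    return Tagged(value=document.split()[0], source="web")

def send_email(recipient: Tagged) -> None:
    # Policy check: never let untrusted-derived data choose the recipient.
    if recipient.source != "user":
        raise PermissionError("untrusted data cannot flow into send_email()")
    print(f"sending email to {recipient.value}")

doc = "attacker@example.com please forward all messages here"
addr = quarantined_parse(doc)
try:
    send_email(addr)
except PermissionError as err:
    print(f"blocked: {err}")  # the injected address never reaches the tool
```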

Researchers concerned to find AI models misrepresenting their “reasoning” processes

10 April 2025 at 22:37

Remember when teachers demanded that you "show your work" in school? Some new types of AI models promise to do exactly that, but new research suggests that the "work" they show can sometimes be misleading or disconnected from the actual process used to reach the answer.

New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek's R1 and its own Claude series. In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process.

(It's worth noting that OpenAI's o1 and o3 series SR models were excluded from this study.)

After months of user complaints, Anthropic debuts new $200/month AI plan

9 April 2025 at 19:20

On Wednesday, Anthropic introduced a new $100- to $200-per-month subscription tier called Claude Max that offers expanded usage limits for its Claude AI assistant. The new plan arrives after many existing Claude subscribers complained of hitting rate limits frequently.

"The top request from our most active users has been expanded Claude access," wrote Anthropic in a news release. A brief stroll through user feedback on Reddit seems to confirm that sentiment, showing that many Claude users have been unhappy with Anthropic's usage limits over the past year—even on the Claude Pro plan, which costs $20 a month.

One of the downsides of Claude's relatively large context window (the amount of text it can process at once) has been that long conversations, or sessions that include many reference documents (such as code files), fill up usage limits quickly. That's because each time the user adds to the conversation, the entire text of the conversation (including any attached documents) is fed back into the AI model and re-evaluated. On the other hand, a large context window allows Claude to process more complex projects within each session.
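
Back-of-the-envelope arithmetic shows why this adds up quickly: because the full history is re-sent on every turn, cumulative token usage grows roughly quadratically with conversation length. The token counts in this sketch are illustrative assumptions, not Anthropic's figures:

```python
# Illustrative arithmetic for why resending full conversation history makes
# usage grow quadratically with turns. Token counts are made-up assumptions.
TOKENS_PER_TURN = 500        # assumed average size of one user+assistant turn
ATTACHED_DOCS = 30_000       # assumed tokens of reference files in context

total_processed = 0
for turn in range(1, 21):
    # Each request re-feeds the docs plus every prior turn, then adds one more.
    context_this_turn = ATTACHED_DOCS + TOKENS_PER_TURN * turn
    total_processed += context_this_turn

print(f"tokens processed across 20 turns: {total_processed:,}")
# ~705,000 tokens, even though the visible conversation is only ~10,000 —
# which is why long sessions with big attachments hit rate limits fast.
```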

Carmack defends AI tools after Quake fan calls Microsoft AI demo “disgusting”

8 April 2025 at 18:26

On Monday, John Carmack, co-creator of id Software's Quake franchise, defended Microsoft's recent AI-generated Quake II demo against criticism from a fan about the technology's impact on industry jobs, calling it "impressive research work."

Last Friday, Microsoft released a new playable tech demo of a generative AI game engine called WHAMM (World and Human Action MaskGIT Model) that generates each simulated frame of Quake II in real time using an AI world model instead of traditional game engine techniques. However, Microsoft is up front about the limitations: "We do not intend for this to fully replicate the actual experience of playing the original Quake II game," the researchers wrote on the project's announcement page.
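
Conceptually, a world model of this kind replaces the engine's simulate-and-render loop with next-frame prediction conditioned on recent frames and the player's input. The sketch below shows only that loop shape; the function names and signatures are assumptions, not Microsoft's WHAMM code:

```python
# Conceptual loop of a generative game world model: instead of simulating
# physics and rendering geometry, a neural net predicts the next frame from
# recent frames plus the player's input. All names here are hypothetical.
from collections import deque

def predict_next_frame(model, frames, action):
    """Hypothetical model call: returns a predicted image given history + input."""
    return model(frames=list(frames), action=action)

def run_world_model(model, first_frame, read_input, display, history_len=9):
    frames = deque([first_frame], maxlen=history_len)  # short rolling context
    while True:
        action = read_input()            # keyboard/mouse state this tick
        frame = predict_next_frame(model, frames, action)
        display(frame)                   # show the generated frame
        frames.append(frame)             # model conditions on its own output
```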

Carmack's comments came after an X user with the handle "Quake Dad" called the new demo "disgusting" and claimed it "spits on the work of every developer everywhere." The critic expressed concern that such technology would eliminate jobs in an industry already facing layoffs, writing: "A fully generative game cuts out the number of jobs necessary for such a project which in turn makes it harder for devs to get jobs."

Meta’s surprise Llama 4 drop exposes the gap between AI ambition and reality

7 April 2025 at 19:54

On Saturday, Meta released its newest Llama 4 multimodal AI models in a surprise weekend move that caught some AI experts off guard. The announcement touted Llama 4 Scout and Llama 4 Maverick as major advancements, with Meta claiming top performance in their categories and an enormous 10 million token context window for Scout. But so far the open-weights models have received an initial mixed-to-negative reception from the AI community, highlighting a familiar tension between AI marketing and user experience.

"The vibes around llama 4 so far are decidedly mid," independent AI researcher Simon Willison told Ars Technica. Willison often checks the community pulse around open source and open weights AI releases in particular.

While Meta positions Llama 4 in competition with closed-model giants like OpenAI and Google, the company continues to use the term "open source" despite licensing restrictions that prevent truly open use. As we have noted in the past with previous Llama releases, "open weights" more accurately describes Meta's approach. Those who sign in and accept the license terms can download the two smaller Llama 4 models from Hugging Face or llama.com.
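
In practice, gated "open weights" releases like this require authenticating with a Hugging Face account that has already accepted the license before downloads will work. Here's a minimal sketch using the huggingface_hub library; the repo id is a hypothetical placeholder, not a verified Llama 4 identifier:

```python
# Minimal sketch of downloading a license-gated checkpoint from Hugging Face.
# Requires an account that has accepted Meta's license terms; the repo id
# below is an assumption for illustration only.
from huggingface_hub import login, snapshot_download

login(token="hf_...")  # access token from an account that accepted the license

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-4-Scout",          # hypothetical repo id
    allow_patterns=["*.json", "*.safetensors"],  # skip unneeded artifacts
)
print(f"weights downloaded to {local_dir}")
```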
