Surge AI CEO Says That Companies Are Optimizing for ‘AI Slop’

AI companies are prioritizing flash over substance, says Surge AI’s CEO.

“I’m worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead,” Edwin Chen said in an episode of “Lenny’s Podcast” published on Sunday.

“We’re basically teaching our models to chase dopamine instead of truth,” he added.

Chen founded the AI training startup Surge in 2020 after working at Twitter, Google, and Meta. Surge runs the gig platform DataAnnotation, which says it pays one million freelancers to train AI models. Surge competes with data-labeling startups such as Scale AI and Mercor and counts Anthropic as a customer.

On Sunday’s podcast, Chen said that companies are prioritizing AI slop because of industry leaderboards.

“Right now, the industry is plagued by these terrible leaderboards like LMArena,” he said, referring to a popular online leaderboard where people vote on which of two AI responses is better.

“They’re not carefully reading or fact-checking,” he said. “They’re skimming these responses for two seconds and picking whatever looks flashiest.”

He added: “It’s literally optimizing your models for the types of people who buy tabloids at the grocery store.”

Still, the Surge CEO said that AI labs have to pay attention to these leaderboards because labs get asked about their rankings during sales meetings.

Like Chen, research scientists have criticized benchmarks for overvaluing superficial traits.

In a March blog post, Dean Valentine, the cofounder and CEO of AI security startup ZeroPath, wrote that “recent AI model progress feels mostly like bullshit.”

Valentine said that he and his team had been evaluating models claiming “some sort of improvement” since the release of Anthropic’s Claude 3.5 Sonnet in June 2024. None of the new models his team tried made a “significant difference” on his company’s internal benchmarks or in developers’ ability to find new bugs, he said.

The newer models might have been “more fun to talk to,” he wrote, but the improvements were “not reflective of economic usefulness or generality.”

In a February paper titled “Can We Trust AI Benchmarks?”, researchers at the European Commission’s Joint Research Centre concluded that there are major issues with how AI models are evaluated today.

The researchers said benchmarking is “fundamentally shaped by cultural, commercial and competitive dynamics that often prioritize state-of-the-art performance at the expense of broader societal concerns.”

Companies have also come under fire for “gaming” these benchmarks.

In April, Meta released two new models in its Llama family that it said delivered “better results” than comparably sized models from Google and French AI lab Mistral. It then faced accusations that it had gamed a benchmark.

LMArena said that Meta “should have made it clearer” that it had submitted a version of Llama 4 Maverick that had been “customized” to perform better for its testing format.

“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena said in an X post.
