An analysis of tens of thousands of research-paper submissions by an academic publisher has revealed a dramatic rise over the past few years in text generated using artificial intelligence (AI).
The American Association for Cancer Research (AACR) found that 23% of manuscript abstracts and 5% of peer-review reports submitted to its journals in 2024 contained text that was probably generated by large language models (LLMs). The publisher also found that fewer than 25% of authors disclosed their use of AI to prepare manuscripts, despite disclosure being mandatory at submission.
To screen manuscripts for signs of AI use, the AACR used an AI tool that was developed by Pangram Labs, based in New York City. When applied to 46,500 abstracts, 46,021 methods sections and 29,544 peer-review comments submitted to 10 AACR journals between 2021 and 2024, the tool flagged a rise in suspected AI-generated text in submissions and review reports since the public release of OpenAI’s chatbot, ChatGPT, in November 2022.
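A screening exercise of this kind amounts to scoring each submitted text and tallying the share flagged per submission year. The sketch below is purely illustrative, assuming a hypothetical detect_ai_probability function in place of the commercial tool; it is not the AACR's or Pangram's actual code.

```python
# Illustrative sketch only -- not the AACR's or Pangram's pipeline.
# `detect_ai_probability` is a hypothetical stand-in for an AI-text detector.
from collections import defaultdict

def detect_ai_probability(text: str) -> float:
    """Hypothetical detector: 0-1 score for how likely the text is AI-generated."""
    return 0.0  # placeholder; a real detection tool would be called here

def flag_rates_by_year(submissions, threshold=0.5):
    """Share of submissions flagged as likely AI-generated, grouped by year.

    `submissions` is an iterable of (year, text) pairs, for example abstracts
    keyed by the year they were submitted.
    """
    flagged, totals = defaultdict(int), defaultdict(int)
    for year, text in submissions:
        totals[year] += 1
        if detect_ai_probability(text) >= threshold:
            flagged[year] += 1
    return {year: flagged[year] / totals[year] for year in sorted(totals)}
```

Tallied this way across abstracts, methods sections and peer-review comments, year-on-year flag rates are what would expose a jump after ChatGPT's public release in late 2022.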
“We were shocked when we saw the Pangram results,” says Daniel Evanko, the AACR’s director of journal operations and systems, who presented the findings at the 10th International Congress on Peer Review and Scientific Publications in Chicago, Illinois, on 3 September.
The analysis found that detections of AI-generated text in peer-review reports dropped by half in late 2023, after the AACR banned peer reviewers from using LLMs. But detections more than doubled by early 2024 and continued to climb.
It “was disconcerting to see people increasing the usage of LLMs for peer review in spite of us prohibiting that usage”, says Evanko. He adds that “our intention is definitely to start screening all incoming manuscripts and all incoming peer review comments”.
The tool “seems to work exceptionally well”, says Adam Day, founder of Clear Skies, a London-based research-integrity firm. However, “there may be bias that we’re not seeing regarding false positive rate, and we should be mindful of that”, he adds.
99.85% accurate
Pangram was trained on 28 million human-written documents from before 2021, including 3 million scientific papers, as well as ‘AI mirrors’ — LLM-generated texts that mimic human-written passages in length, style and tone.
Max Spero, chief executive officer of Pangram Labs, says that adding an active-learning mode to Pangram was “one of the breakthroughs” that enabled it to reduce the false-positive rate — the share of texts incorrectly flagged as being AI-written. He and his team repeatedly retrained the tool, which “reduced our false-positive rate from about one in 100 to about one in 10,000,” he says.
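Pangram's pipeline is not public, but the general idea behind this kind of active learning — retraining on the human-written texts the model wrongly flags — can be sketched roughly as follows. Every name here is hypothetical, and the model is assumed to expose simple fit and predict methods; this is not Pangram's implementation.

```python
# Rough sketch of active learning / hard-negative mining to cut false positives.
# Not Pangram's implementation; the model interface and names are assumptions.

def retrain_on_false_positives(model, train_texts, train_labels, human_pool, rounds=5):
    """Repeatedly fold the model's false positives back into the training set.

    `human_pool` holds texts known to be human-written; any of them the model
    labels "ai" is a false positive and is re-added with the correct label.
    """
    for _ in range(rounds):
        model.fit(train_texts, train_labels)
        mistakes = [t for t in human_pool if model.predict(t) == "ai"]
        if not mistakes:
            break  # no errors left to learn from on this pool
        train_texts = train_texts + mistakes
        train_labels = train_labels + ["human"] * len(mistakes)
    return model
```

Each round, the classifier sees more of the human-written passages it previously confused with machine text, which is broadly how repeated retraining can push a false-positive rate down by orders of magnitude.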
In a preprint posted last year1, Spero and his colleagues showed that Pangram’s accuracy was 99.85%, with error rates 38 times lower than those of other currently available AI-detection tools.
When the AACR tested the tool on manuscripts submitted before ChatGPT’s release in November 2022, it flagged only seven abstracts, and no methods sections or peer-review reports, as containing potentially AI-generated text. “From there on, the detections just increased linearly and at what we would think is a very high rate,” says Evanko.
The tool can also distinguish between different LLMs, including ChatGPT models, DeepSeek, LLaMa and Claude. “We’re only able to do this because we’ve generated our entire training set ourselves, so we know the exact provenance, we know what model the training data came from,” explains Spero.
The current model of Pangram cannot distinguish between passages that are fully generated by AI and those that are written by humans but edited using AI.
Language aid
The AACR used Pangram to analyze the 2024 submissions, which included 11,959 abstracts, 11,875 methods sections and 7,211 peer-review reports.
Their analysis found that authors at institutions in countries where English is not the primary language were more than twice as likely to use LLMs as those at institutions in English-speaking countries.
“I was personally shocked at just how high the usage was in the methods,” says Evanko. “Asking an LLM to improve the language of the methods section could introduce errors … because those details need to be exact in terms of how you did something and if you rephrase something, it might not be correct anymore,” he adds.