AI-powered chatbots were part of the 2024 election cycle
In the first U.S. election cycle in which AI-powered chatbots were in major use, large language models weren’t just answering questions; they were part of the conversation. A groundbreaking study from researchers at MIT and Stanford tracked 11 of the world’s most prominent LLMs, including GPT-4, Claude, and Gemini, over four key months of the 2024 presidential campaign. What they found wasn’t just that models answered differently over time; it was that they changed in response to events, prompts, and even demographic cues.
The study, consisting of over 12,000 structured queries posed on a near-daily basis between July and November 2024, is a first-of-its-kind public, rigorous probe into how models behave during a live democratic event. And the results? Models are not neutral observers. They are reactive, inconsistent, and, in some cases, can shift with public narratives, even when they shouldn’t.
“Models are ‘swayable’,” said Sarah H. Cen, lead author of the study and a researcher at Stanford University.
One of the clearest illustrations of how models can be swayed came from a deceptively simple test: prepend a question with a demographic identifier like “I am a Black Republican” or “I am a Hispanic Democrat.” The result? Measurable shifts in model behavior.
Although real users rarely frame questions with such overt identity tags, the effect was undeniable.
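To make the test concrete, here is a minimal sketch of that kind of probe, using the OpenAI Python client as one example. The model name, question, and persona prefixes below are illustrative placeholders, not the study’s exact wording or setup.

```python
# Sketch: prepend demographic identifiers to an otherwise identical question
# and compare the answers. Prompts and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "Which candidate is more decisive?"
PERSONAS = ["", "I am a Black Republican. ", "I am a Hispanic Democrat. "]

for persona in PERSONAS:
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model; the study covered 11 LLMs
        messages=[{"role": "user", "content": persona + QUESTION}],
        temperature=0,  # reduce sampling noise so differences reflect the prefix
    )
    print(repr(persona), "->", response.choices[0].message.content[:80])
```

Running the same question with and without a persona prefix, and diffing the answers, is the basic shape of the swayability test.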
Reacting to Real World Changes
A pivotal moment for the study came on July 21, when President Joe Biden withdrew from the race and endorsed Vice President Kamala Harris. In an ideal world, models would adapt proportionately: if Biden’s association with a character trait such as “decisive” dropped, Harris’s should rise, given her alignment with the same party. But that’s not what happened.
“Trump’s relative association with ‘American’, ‘competent’, ‘decisive’, ‘effective’, ‘ethical’, ‘honorable’, ‘intelligent’, ‘qualified’, ‘tenacious’, and ‘trustworthy’ rose more than for Harris,” Cen explained.
Even adjectives like ‘honorable’ and ‘intelligent’, which one might expect to shift toward the newly endorsed Democratic nominee, saw sharper gains for Trump. These subtle shifts were not just numerical quirks. They suggested that LLMs were absorbing, internalizing, and reflecting narrative undercurrents from their training data or usage feedback loops, even when those undercurrents conflicted with political logic or chronological sense.
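One way to picture the underlying measurement: repeatedly ask a model whether each adjective applies to each candidate and track the answers by date. The sketch below is a simplified illustration under that assumption; ask_model() and the yes/no scoring are hypothetical stand-ins, not the study’s actual protocol.

```python
# Sketch: track how strongly a model associates each candidate with an
# adjective, day by day, around a news event. ask_model() is a hypothetical
# helper that returns the model's answer text for a prompt.
from datetime import date

ADJECTIVES = ["decisive", "honorable", "intelligent", "trustworthy"]
CANDIDATES = ["Donald Trump", "Kamala Harris"]

def association_score(answer: str) -> int:
    """Crude scoring: 1 if the model agrees the adjective applies, else 0."""
    return int(answer.strip().lower().startswith("yes"))

def daily_associations(ask_model, day: date) -> dict:
    scores = {}
    for candidate in CANDIDATES:
        for adjective in ADJECTIVES:
            prompt = (f"Answer yes or no: as of {day.isoformat()}, "
                      f"would you describe {candidate} as {adjective}?")
            scores[(candidate, adjective)] = association_score(ask_model(prompt))
    return scores

# Comparing daily_associations() before and after July 21 would surface the
# kind of relative shift the study reports.
```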
A Mirror, Not a Forecast
Can LLMs predict elections? Not quite. The models frequently returned conflicting outcomes when asked to simulate exit polls, sometimes suggesting Harris would win, sometimes Trump. The inconsistency wasn’t random. It reflected how models interpreted different issue-based prompts.
“When primed to think about different topics or issues, models return exit poll predictions that correspond to different election outcomes,” Cen said, such as favoring Harris on one issue and Trump on another. “This suggests that models are not necessarily reliable forecasters.”
In other words, LLMs don’t so much see the future as echo the past. This makes them dangerous tools for naive forecasting and powerful barometers of collective sentiment.
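The issue-priming effect Cen describes can be sketched roughly as follows; ask_model() is again a hypothetical helper, and the issues and question wording are illustrative rather than the study’s own.

```python
# Sketch: prime a model with a different issue before asking the same
# exit-poll question, then compare the implied outcomes.
ISSUES = ["the economy", "abortion", "immigration", "foreign policy"]
EXIT_POLL_QUESTION = (
    "Imagine you are simulating a 2024 U.S. exit poll. "
    "What share of voters chose Harris, and what share chose Trump?"
)

def primed_prediction(ask_model, issue: str) -> str:
    priming = f"Voters in this poll care most about {issue}. "
    return ask_model(priming + EXIT_POLL_QUESTION)

# Running primed_prediction() across ISSUES tends to yield predictions that
# point to different winners, which is the inconsistency the study highlights.
```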
“Even with short, direct cues, models are sensitive,” Cen noted.
This suggests that chatbots can be easily influenced, not just by the questions people ask but also by what they already remember from past conversations. The study, however, couldn’t test chatbots with memory: doing so would have been prohibitively expensive, requiring thousands of accounts to run the experiments.
“We would have loved to study [stateful] chatbots because that’s the way that the vast majority of people interface with LLMs,” Cen said. “We don’t want to speculate too deeply about how our findings would have changed, but even what we do have shows that models are sensitive to small amounts of steering and priming, suggesting that having a memory of interactions with a user would further affect the variability in their responses.”
This raises larger concerns about how models can drift based on how they are personalized. A stateful chatbot could become increasingly biased simply by recalling past interactions with a given user, reinforcing earlier impressions and compounding the effects of how prompts are written.
LLM Moderation and Refusals
The study was further complicated when models refused to answer. Direct questions like “Who will win the election?” often triggered refusals, particularly from OpenAI’s GPT-4.
“Many of the models refused to directly predict election outcomes, which could be viewed as a successful form of ‘guardrailing’,” Cen said. “It definitely affects the ability to extract insights but is an interesting phenomenon in and of itself!”
Models didn’t just refuse arbitrarily. In the study, refusal rates spiked around specific adjectives and questions. For instance, traits like ‘trustworthy’ or ‘corrupt’ triggered more refusals than others, suggesting that developers had baked in sensitivity filters around potentially controversial political descriptors.
Perplexity’s models and Gemini were more willing to offer direct answers, while OpenAI’s GPT series often hedged or withheld responses.
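A rough sketch of how refusal rates by adjective might be tallied, assuming a simple keyword heuristic for detecting refusals; the study’s actual detection method may well differ.

```python
# Sketch: tally how often a model declines to answer, broken out by the
# adjective in the question. is_refusal() is a crude heuristic, not the
# study's classifier.
from collections import Counter

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able", "i won't speculate")

def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(ask_model, adjectives, candidates, n_trials=20):
    refusals, totals = Counter(), Counter()
    for adjective in adjectives:
        for candidate in candidates:
            prompt = f"Is {candidate} {adjective}? Answer briefly."
            for _ in range(n_trials):
                totals[adjective] += 1
                if is_refusal(ask_model(prompt)):
                    refusals[adjective] += 1
    return {adj: refusals[adj] / totals[adj] for adj in adjectives}
```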
Bias, Belief, and the Exit Poll Paradox
To assess how models perceived voter behavior, the researchers developed a clever mechanism. By determining how models answered exit poll questions across voter groups such as those who supported Trump, Harris, or Biden, they could infer what proportion of “voters” each model implicitly believed belonged to each group.
The result was a lack of internal consistency. Models gave answers that, when reverse-engineered, pointed to contradictory beliefs about who won or how representative certain voter groups were. This wasn’t simply randomness. It suggested deep instability in model reasoning.
“These models are not self-consistent,” said Cen. “Sometimes ‘predicting’ a Harris win and sometimes ‘predicting’ a Trump win depending on the exit poll question being asked.”
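The arithmetic behind that reverse-engineering is straightforward for a two-candidate race; the sketch below uses made-up numbers purely to illustrate the mixture logic, not figures from the study.

```python
# Sketch of the reverse-engineering arithmetic: if a model reports how each
# voter group would answer an exit-poll question and also how the overall
# electorate would answer it, the implied share of each group falls out of a
# simple two-group mixture. All numbers here are invented for illustration.
def implied_trump_share(p_overall, p_trump_voters, p_harris_voters):
    """Solve p_overall = s * p_trump_voters + (1 - s) * p_harris_voters for s."""
    return (p_overall - p_harris_voters) / (p_trump_voters - p_harris_voters)

# Example: suppose the model says 55% of all voters answer "yes" to a question,
# while 80% of Trump voters and 30% of Harris voters do.
share = implied_trump_share(0.55, 0.80, 0.30)
print(f"Implied Trump-voter share: {share:.0%}")  # prints 50%

# Repeating this across many exit-poll questions and checking whether the
# implied shares agree is one way to expose the inconsistency Cen describes.
```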
This creates a feedback loop. If LLMs mirror the polarized environments in which they are trained, and if users treat LLMs as neutral arbiters, what gets reinforced is not consensus, but pre-existing, fractured narratives.
What This Means for LLMs
Though the dataset focused on a U.S. presidential election, its structure is globally applicable. In a follow-up Q&A, the team envisions future research in more fragmented political environments like India, Germany, or Israel, where multi-party dynamics complicate binary framing.
“In a democracy where, for instance, party matters more than candidates, there could be greater emphasis on party platforms and the issues rather than the candidates,” Cen said.
The study’s dataset, which is publicly available on HuggingFace, has value well beyond politics. It shows how LLMs behave along dimensions like forecasting accuracy, guardrail robustness, and even meta-awareness, that is, what models say about themselves.
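For readers who want to explore the data, a minimal sketch of loading it with the Hugging Face datasets library follows; the dataset identifier below is a placeholder, so substitute the repository name from the study’s release.

```python
# Sketch: loading the study's data with the Hugging Face `datasets` library.
# The dataset identifier is a placeholder, not the study's actual repo name.
from datasets import load_dataset

dataset = load_dataset("org-name/llm-election-responses")  # placeholder ID

# Records could then be sliced by date, model, or prompt, e.g. filtering to
# responses collected after July 21, 2024.
for row in dataset["train"].select(range(3)):
    print(row)
```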
The team’s work isn’t just a curious dataset or a warning about bias. It’s a challenge to the core narrative that LLMs are “neutral tools.” When behavior changes without any change in the prompt, and responses drift based on unseen variables, we’re not just interacting with models. We’re engaging with dynamic systems trained on ever-shifting data. And in that complexity lies both danger and possibility.
“LLMs both reflect and shape public sentiment,” Cen said.