GPT-4V shows human-like ability to interpret social scenes, study finds

A new study published in Imaging Neuroscience has found that large language models with visual processing abilities, such as GPT-4V, can evaluate and describe social interactions in images and short videos in a way that closely matches human perception. The research suggests that artificial intelligence can not only identify individual social cues but also capture the underlying structure of how humans perceive social information.

Large language models (LLMs) are advanced machine learning systems that can generate human-like responses to text inputs. Over the past few years, LLMs have become capable of passing professional exams, emulating personality traits, and simulating theory of mind. More recently, models such as GPT-4V have gained the ability to process visual inputs, making it possible for them to “see” and describe scenes, objects, and people.

This leap in visual capability opens new possibilities for psychological research. Human social perception depends heavily on our ability to make quick inferences from visual input—interpreting facial expressions, body posture, and interactions between people.

If AI models can match or approximate these human judgments, they may offer scalable tools for behavioral science and cognitive neuroscience. But the key question remains: How well can AI interpret the nuanced, often ambiguous social signals that humans rely on?

To explore this question, researchers at the University of Turku used OpenAI’s GPT-4V to evaluate a set of 468 static images and 234 short video clips, all depicting scenes with rich social content drawn from Hollywood films. The goal was to see whether GPT-4V could detect the presence of 138 different social features—ranging from concrete behaviors like “laughing” or “touching someone” to abstract traits like “dominant” or “empathetic.”

These same images and videos had previously been annotated by a large group of human participants. In total, over 2,200 individuals contributed more than 980,000 perceptual judgments using a sliding scale from “not at all” to “very much” to rate each feature. The human evaluations were used as a reference point to assess how closely GPT-4V’s ratings aligned with the consensus of real observers.

For each image or video, the researchers prompted GPT-4V to generate numerical ratings for the full set of social features, repeating the process five times to account for the model's run-to-run variability and averaging the results. Because GPT-4V cannot yet process motion directly, each video clip was represented by eight representative frames together with its transcribed dialogue.
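For readers curious about the mechanics, the sketch below shows roughly what such a rating loop might look like, assuming the OpenAI Python client and a JSON-formatted reply. The feature list is trimmed to four of the 138 features, and the prompt wording and model name are illustrative, not the study's actual pipeline.

```python
import base64
import json
import statistics
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Four of the study's 138 features, for brevity
FEATURES = ["laughing", "touching someone", "dominant", "empathetic"]

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def rate_image(path: str) -> dict[str, float]:
    """Ask the model to score each feature from 0 (not at all) to 100 (very much)."""
    prompt = (
        "Rate how strongly each of these social features is present in the image, "
        f"from 0 (not at all) to 100 (very much): {', '.join(FEATURES)}. "
        "Reply with only a JSON object mapping each feature to a number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the study used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

def averaged_ratings(path: str, n_runs: int = 5) -> dict[str, float]:
    """Query n_runs times and average, mirroring the study's five repetitions."""
    runs = [rate_image(path) for _ in range(n_runs)]
    return {f: statistics.mean(run[f] for run in runs) for f in FEATURES}
```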

The results showed a high level of agreement between GPT-4V and human observers. The correlation between AI and human ratings was 0.79 for both images and videos—a level that approaches the reliability seen between individual human participants. In fact, GPT-4V outperformed single human raters for 95% of the social features in images and 85% in videos.
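As a worked illustration of how one such agreement score can be computed, the snippet below correlates AI ratings against human consensus ratings across all stimuli for a single feature. The data here are synthetic placeholders, not the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_stimuli = 468                                  # one rating per image, for one feature

human_mean = rng.uniform(0, 100, n_stimuli)      # placeholder human consensus ratings
ai = human_mean + rng.normal(0, 20, n_stimuli)   # placeholder AI ratings with noise

r, _ = pearsonr(human_mean, ai)                  # agreement for this single feature
print(f"AI-human agreement: r = {r:.2f}")
```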

However, GPT-4V’s ratings did not always match group-level consensus. When compared to the average of five human raters, the AI’s agreement was slightly lower, particularly for video clips. This suggests that while GPT-4V provides a strong approximation of human perception, its reliability may not yet match the collective judgment of multiple human observers working together.

The study also examined whether GPT-4V captured the deeper structure of how humans organize social information. Using statistical techniques such as principal coordinate analysis, the researchers found that the dimensions GPT-4V used to represent the social world—such as dominant vs. empathetic or playful vs. sexual—were strikingly similar to those found in human data.
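Principal coordinate analysis (classical multidimensional scaling) recovers low-dimensional axes from a matrix of pairwise dissimilarities. A compact, self-contained version is sketched below; the dissimilarity measure, usage data, and number of dimensions are illustrative choices rather than the study's exact settings.

```python
import numpy as np

def pcoa(dist: np.ndarray, n_dims: int = 2) -> np.ndarray:
    """Classical multidimensional scaling (principal coordinate analysis).

    dist is a symmetric (n x n) matrix of pairwise dissimilarities between
    social features. Returns n points in n_dims dimensions whose distances
    approximate the input dissimilarities.
    """
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]    # keep the largest eigenvalues
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Hypothetical usage: embed 138 features by the dissimilarity of their rating profiles
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 100, (138, 468))         # placeholder feature-by-stimulus ratings
coords = pcoa(1 - np.corrcoef(ratings))           # 2-D map of the "social space"
```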

This suggests that the model is not only mimicking surface-level judgments but may be tapping into similar patterns of representation that humans use to make sense of social interactions.

To take the comparison one step further, the researchers used GPT-4V’s social feature annotations as predictors in a functional MRI (fMRI) study. Ninety-seven participants had previously watched a medley of 96 short, socially rich video clips while undergoing brain scans. By linking the social features present in each video to patterns of brain activity, the researchers could map which areas of the brain respond to which types of social information.

Remarkably, GPT-4V-based stimulus models produced brain activation maps nearly identical to those generated using human annotations. The correlation between the two sets of maps was extremely high (r = 0.95), and both identified a similar network of regions, including the superior temporal sulcus, temporoparietal junction, and fusiform gyrus, as being involved in processing social cues.
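In spirit, this kind of analysis regresses each voxel's response on the feature ratings and then compares the resulting coefficient maps. The toy sketch below does that with simulated data and ridge regression; the study's actual pipeline (hemodynamic modeling, preprocessing, statistical thresholding) is far more involved, so treat this only as a conceptual outline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_clips, n_features, n_voxels = 96, 138, 1000     # clip/feature counts match the study

X_human = rng.uniform(0, 100, (n_clips, n_features))          # placeholder human ratings
X_ai = X_human + rng.normal(0, 10, X_human.shape)             # placeholder GPT-4V ratings
true_w = rng.normal(0, 1, (n_features, n_voxels))
Y = X_human @ true_w + rng.normal(0, 5, (n_clips, n_voxels))  # simulated BOLD responses

def beta_maps(X, Y):
    """One regularized regression per voxel; coefficients form the activation maps."""
    return Ridge(alpha=10.0).fit(X, Y).coef_.T    # shape: (n_features, n_voxels)

similarity = np.corrcoef(beta_maps(X_ai, Y).ravel(),
                         beta_maps(X_human, Y).ravel())[0, 1]
print(f"AI-based vs. human-based map similarity: r = {similarity:.2f}")
```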

This finding provides evidence that GPT-4V’s judgments can be used to model how the brain perceives and organizes social information. It also suggests that AI models could assist in designing and interpreting future neuroimaging experiments, especially in cases where manual annotation would be time-consuming or expensive.

These findings open several possible directions for future research and real-world applications. In neuroscience, LLMs like GPT-4V could help generate high-dimensional annotations of complex stimuli, allowing researchers to reanalyze existing brain data or design new experiments with greater precision. In behavioral science, AI could serve as a scalable tool for labeling emotional and social content in large datasets.

Outside the lab, this technology could support mental health care by identifying signs of distress in patient interactions, or improve customer service by analyzing emotional cues in video calls. It could also be used in surveillance systems to detect potential conflicts or flag unusual social behaviors in real time.

At the same time, the study’s authors caution that these models are not perfect replacements for human judgment. GPT-4V performed worse on some social features that involve more subjective or ambiguous judgments, such as “ignoring someone” or “harassing someone.” These types of evaluations may require contextual understanding that AI systems still lack, or may be influenced by training data biases or content moderation filters.

The model also tended to rate low-level features more conservatively than humans—possibly due to its probabilistic nature or its safeguards against generating controversial outputs. In some cases, the AI refused to evaluate scenes containing sexual or violent content, highlighting the constraints imposed by platform-level safety policies.

While the results are promising, some limitations should be noted. The AI ratings were compared against a relatively small number of human raters per stimulus, and larger datasets could provide a more robust benchmark. The model was also tested on short, scripted film clips rather than real-world or live interactions, so its performance in more natural settings remains an open question.

Future work could explore whether tailoring LLMs to specific demographic perspectives improves their alignment with particular groups. Researchers might also investigate how AI models form these judgments—what internal processes or representations they use—and whether these resemble the mechanisms underlying human social cognition.

The study, “GPT-4V shows human-like social perceptual capabilities at phenomenological and neural levels,” was authored by Severi Santavirta, Yuhang Wu, Lauri Suominen, and Lauri Nummenmaa.
