A modest amount of human behavioral data—just tens of thousands of simple similarity judgments—can teach video AI models to see social interactions the way people do, even surpassing sentence-embedding models that rely on captions.
The Research
Kathy Garcia and Leyla Isik from Johns Hopkins University tested whether video foundation models like V-JEPA 2 could predict how humans judge similarity between social video clips. They found that every pre-trained vision model performed worse than MPNet, a simple sentence-embedding model applied to captions of the videos. To close this gap, they introduced behavioral geometric supervision (BGS), which fine-tunes models using a hybrid objective that aligns pairwise embedding geometry with human similarity judgments.
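To make the idea of a hybrid objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The two terms (an odd-one-out likelihood term and a pairwise similarity-matching term), the use of cosine similarity, the triplet format, and the names `odd_one_out_nll`, `pairwise_geometry_mse`, `hybrid_loss`, `alpha`, and `temperature` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out_nll(emb, triplet, odd, temperature=1.0):
    """Negative log-likelihood of a human odd-one-out choice (assumed form).

    triplet is (i, j, k); odd is the clip index the person picked.
    The pair that excludes the chosen clip should be the most similar one.
    """
    i, j, k = triplet
    # Pair similarities, indexed by the position of the clip each pair leaves out.
    pair_sims = [cosine(emb[j], emb[k]), cosine(emb[i], emb[k]), cosine(emb[i], emb[j])]
    logits = [s / temperature for s in pair_sims]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))  # stable log-sum-exp
    return -(logits[triplet.index(odd)] - log_z)


def pairwise_geometry_mse(emb, human_sim):
    """Mean squared gap between model and human similarity over all clip pairs."""
    n = len(emb)
    errors = [
        (cosine(emb[a], emb[b]) - human_sim[a][b]) ** 2
        for a in range(n)
        for b in range(a + 1, n)
    ]
    return sum(errors) / len(errors)


def hybrid_loss(emb, judgments, human_sim, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    triplet_term = sum(odd_one_out_nll(emb, t, o) for t, o in judgments) / len(judgments)
    geometry_term = pairwise_geometry_mse(emb, human_sim)
    return alpha * triplet_term + (1 - alpha) * geometry_term


# Toy usage: clips 0 and 1 are similar, clip 2 is the odd one out.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
human_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
loss = hybrid_loss(emb, [((0, 1, 2), 2)], human_sim)
```

In a real fine-tuning run the embeddings would come from the video model and the loss would be minimized by gradient descent. The sketch only shows how a judgment-level term and a geometry-level term could be combined.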
They collected 49,484 odd-one-out judgments on 250 naturalistic social video clips. Using low-rank adaptation across four backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, and CLIP), the best fine-tuned model, V-JEPA 2.1, nearly tripled its performance compared to the pre-trained baseline, approaching the noise ceiling and exceeding the MPNet baseline. Critically, the fine-tuned models also captured unique variance not found in caption-based language embeddings and developed interpretable social-affective attributes (valence, arousal, dominance) without being trained on those attributes explicitly. The fine-tuned models also transferred zero-shot to abstract social interactions in a separate dataset and shifted attention to socially informative regions like faces and interacting bodies. A matched language-distillation control confirmed that these gains came from the behavioral signal, not caption transfer.
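For readers curious how odd-one-out judgments can become a similarity score that a model is compared against, here is a small illustrative sketch. It assumes each trial shows three clips and records the one picked as the odd one. It estimates pair similarity as the share of trials in which a pair was kept together, then uses a Spearman rank correlation as the agreement measure. The summary above does not say which estimator or metric the authors used, so both choices and the function names are assumptions.

```python
import math
from itertools import combinations


def similarity_from_odd_one_out(judgments):
    """Share of trials in which each pair was shown and neither clip was the odd one.

    judgments: iterable of ((i, j, k), odd), where odd is the chosen clip index.
    Returns {(a, b): value in [0, 1]} with a < b; unseen pairs are omitted.
    """
    shown, together = {}, {}
    for triplet, odd in judgments:
        for pair in combinations(sorted(triplet), 2):
            shown[pair] = shown.get(pair, 0) + 1
            if odd not in pair:
                together[pair] = together.get(pair, 0) + 1
    return {pair: together.get(pair, 0) / count for pair, count in shown.items()}


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for idx in order[pos:end + 1]:
            result[idx] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy usage with five made-up trials over four clips.
trials = [((0, 1, 2), 2), ((0, 1, 2), 2), ((0, 1, 2), 0), ((0, 1, 3), 3), ((1, 2, 3), 1)]
human = similarity_from_odd_one_out(trials)
pairs = sorted(human)
model_sims = [0.8, 0.1, 0.2, 0.3, 0.1, 0.6]  # hypothetical model similarities, same pair order
agreement = spearman([human[p] for p in pairs], model_sims)
```

With real data, the model similarities would be computed from the video embeddings for the same clip pairs, and the resulting correlation could be compared against a noise ceiling estimated from agreement between participants.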
Why It Matters
Our brains excel at reading social cues from dynamic scenes—a skill that has been notoriously difficult to replicate in AI. This research shows that a small amount of human behavioral data can steer video models toward more human-like social perception. For cognitive science, it suggests that social understanding may be learnable from relational similarity structure rather than requiring explicit labels. For AI safety and human-computer interaction, it offers a pathway to make video models more attuned to social context, which could improve everything from assistive technologies to content moderation.
What You Can Do
You can train your own social perception by practicing identifying emotions and intentions in video clips. Try watching short scenes with the sound off and describing the social dynamics. Engaging with diverse social content may sharpen your ability to read subtle cues—a skill that supports empathy and communication.
Source: arXiv q-bio.NC