
KokoMind: Can LLMs Understand Social Interactions?

Authors: Weiyan Shi* and Liang Qiu* and Dehong Xu and Pengwei Sui and Pan Lu and Zhou Yu

   TL;DR: We introduce KokoMind, a dataset with multi-party social interactions to evaluate LLMs' social understanding abilities. GPT-4 tops the list, followed by Claude.

Introduction: Imagine an AI 🤖 at a cocktail party 🍻

[Download KokoMind to evaluate your own models!]

Imagine this: you're at a vibrant cocktail party 🍹, filled with the buzz of conversation and the clink of glasses 🍻. You're a laid-back observer 👀, tucked comfortably in a corner. Yet, you can still easily figure out the social relations between different people, understand what's going on, and even provide social suggestions by reading people's verbal and non-verbal cues.

If a large language model (LLM) could replicate this level of social aptitude, then we could say that it possesses certain social abilities. Curious how different LLMs perform when it comes to understanding and navigating social interactions? Check out these demos processed by AI models!
These videos are transcribed by Whisper.
ChatGPT, a text-only model, predicts emotions and answers questions by reading the interaction.
ChatGPT also reads the conversation and answers different questions related to social understanding.
For reference, CLIP, a vision-only model, also predicts facial expressions by looking at the face.

Shrinking S01 E03 I Cried 4 Times Today

Shrinking S01 E06 Just a Normal Day

Shrinking S01 E07 You Wish

KokoMind: A Multifaceted Evaluation Dataset of Social Interactions

KokoMind contains 150 complex multi-party social interactions (50 per source) with free-text questions and answers. To ensure diversity and scalability and to avoid data contamination, all the social interactions, questions, and answers are generated by GPT-4 and then verified by human experts. These generations are based on three different sources:

For each social interaction, we ask various questions designed to probe the following aspects of social understanding.

Check out these examples in KokoMind!


Jun (nervously sipping coffee Male 28 marketing executive): I simply don't understand why things got so heated...
Kyung (unsure Female 32 nurse): Maybe it's just a misunderstanding. Don't worry too much about it.
Min (displeased Male 24 student): I couldn't even hear what Ji-ho was whispering about earlier.
Hana (surprised Female 45 housewife): I must've missed that too.
Ji-ho (whispering to Hae-won Male 40 manager): You've noticed the tension too, right?
Hae-won (whispers back Female 35 teacher): Yes, Ji-ho. But I think it's best if we don't get involved.
Dae (jovial Male 55 gardener): Oh, come on, everyone! We're here to enjoy our time together.
Jin (quietly concerned Female 60 retired): I hope nothing worse happens out of this situation.
Eun (unaware of what's happening Male 22 barista): Enjoy your coffee and pastries!
Soo-min (troubled but smiling Female 52 artist): Let's try to put this behind us and think positively.
Question: In Hae-won's mind, what does she think Ji-ho is feeling about the tension in the group?
Answer: Hae-won thinks Ji-ho is concerned about the tension in the group.
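
For readers who want to process such examples programmatically, here is a minimal sketch of parsing one dialogue turn. The layout `Speaker (cue Gender Age occupation): utterance` is inferred only from the example above, not from a documented schema, and the names `Turn` and `parse_turn` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    """One dialogue turn; the field layout is an assumption based on the example."""
    speaker: str
    cue: str          # non-verbal cue, e.g. "nervously sipping coffee"
    gender: str
    age: int
    occupation: str
    utterance: str


# Assumed layout: Speaker (cue Gender Age occupation): utterance
_TURN = re.compile(
    r"^(?P<speaker>[^()]+?)\s*"
    r"\((?P<cue>[^()]*?)\s+(?P<gender>Male|Female)\s+(?P<age>\d+)\s+"
    r"(?P<occupation>[^()]+)\):\s*"
    r"(?P<utterance>.*)$"
)


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn, or None if the line does not follow the assumed layout."""
    match = _TURN.match(line.strip())
    if match is None:
        return None
    return Turn(
        speaker=match["speaker"],
        cue=match["cue"],
        gender=match["gender"],
        age=int(match["age"]),
        occupation=match["occupation"].strip(),
        utterance=match["utterance"],
    )


turn = parse_turn(
    "Ji-ho (whispering to Hae-won Male 40 manager): "
    "You've noticed the tension too, right?"
)
print(turn)
```

Lines that carry no parenthetical annotation, such as the `Question:` and `Answer:` lines, return `None` under this sketch and would need separate handling.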

Here is the data distribution!

Results: GPT-4 tops the list, followed by Claude in many cases

We evaluated different models following AlpacaEval, with text-davinci-003 as the reference, and performed an ablation study in which we removed the non-verbal cues in parentheses (e.g., "nervously sipping coffee") from the context. Here are some interesting takeaways.
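
As an illustration of the ablation, the sketch below strips the non-verbal cue from each speaker annotation. It is not the authors' preprocessing code: the function name `strip_nonverbal_cues` is hypothetical, the annotation layout `(cue Gender Age occupation)` is inferred from the KokoMind example on this page, and keeping the demographic fields while dropping only the cue is an assumption.

```python
import re

# Assumed layout of a turn prefix: Speaker (cue Gender Age occupation):
# Only the cue is dropped; keeping the demographic fields is an assumption.
_PREFIX = re.compile(
    r"^(?P<speaker>[^()\n]+?)\s*"
    r"\((?P<cue>[^()\n]*?)\s+"
    r"(?P<demo>(?:Male|Female)\s+\d+\s+[^()\n]+)\):",
    flags=re.MULTILINE,
)


def strip_nonverbal_cues(context: str) -> str:
    """Remove the non-verbal cue from every annotated turn in a context.

    Lines that do not match the assumed layout are left unchanged.
    """
    return _PREFIX.sub(r"\g<speaker> (\g<demo>):", context)


context = (
    "Jun (nervously sipping coffee Male 28 marketing executive): "
    "I simply don't understand why things got so heated...\n"
    "Dae (jovial Male 55 gardener): Oh, come on, everyone! "
    "We're here to enjoy our time together."
)
print(strip_nonverbal_cues(context))
```

Running the same questions against the original and the stripped context is one way to compare model behavior with and without non-verbal cues.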

Play around with these results!


Our project, while exciting in many respects, does have certain limitations. First, KokoMind is relatively small, which could limit the broad applicability and comprehensiveness of our conclusions. Second, all of the interactions in KokoMind are generated by GPT-4 and require human verification, which makes it hard to scale up the dataset. Third, although KokoMind provides human-verified answers, we did not use these answers as a reference during evaluation. Because these answers are generated by GPT-4, they may be biased towards GPT-4; future research can focus on how to evaluate models with human-verified, machine-generated reference answers. Finally, all the models we evaluated were versions released prior to June 1, 2023, so newly released models may perform better. Despite these limitations, we envision KokoMind as a springboard for future investigations related to social intelligence, multi-modal language models, etc.


This project is an early-stage research showcase, designed solely for non-commercial purposes. It adheres to OpenAI's data usage terms and ShareGPT's privacy practices. Let us know if you spot any potential violations. The software's code is available under the Apache License 2.0.


We would like to thank Yejin Choi from UW, Louis-Philippe Morency from CMU, Jason Weston from Meta, and Diyi Yang from Stanford for their enlightening dialogue and constructive input. The theoretical foundation of KokoMind is based on Liang's PhD research with Song-Chun Zhu from Peking University, Tsinghua University and Beijing Institute for General Artificial Intelligence (BIGAI) and Ying Nian Wu from UCLA.


   title = {KokoMind: Can Large Language Models Understand Social Interactions?},
   url = {https://chats-lab.github.io/KokoMind/},
   author = {Shi, Weiyan and Qiu, Liang and Xu, Dehong and Sui, Pengwei and Lu, Pan and Yu, Zhou},
   month = {July},
   year = {2023}