KokoMind: Can Large Language Models Understand Social Interactions?

Introduction: Imagine an AI 🤖 at a cocktail party 🍻

[Download KokoMind to evaluate your own models!]

Imagine this: you're at a vibrant cocktail party 🍹, filled with the buzz of conversation and the clink of glasses 🍻. You're a laid-back observer 👀, tucked comfortably in a corner. Yet, you can still easily figure out the social relations between different people, understand what's going on, and even provide social suggestions by reading people's verbal and non-verbal cues.

If a large language model (LLM) could replicate this level of social aptitude, we could say that it possesses certain social abilities. Curious how different LLMs perform when it comes to understanding and navigating social interactions? Check out these demos processed by AI models♦!
♦ The videos are transcribed by Whisper.
♦ ChatGPT, a text-only model, predicts emotions and answers questions by reading the transcribed interaction.
♦ ChatGPT also reads the conversation and answers different questions related to social understanding.
♦ For reference, CLIP, a vision-only model, also predicts facial expressions by looking at the face.

Demo 1

Shrinking S01 E03 I Cried 4 Times Today

Demo 2

Shrinking S01 E06 Just a Normal Day

Demo 3

Shrinking S01 E07 You Wish

KokoMind: A Multifaceted Evaluation Dataset of Social Interactions

KokoMind contains 150 complex multi-party social interactions (50 per source) with free-text questions and answers. To ensure diversity and scalability and avoid data contamination, all the social interactions, questions, and answers are generated by GPT-4 and verified by human experts later. These generations are based on three different sources:

🤖 GPT-4-only: This subset is created solely by GPT-4 through prompting, without grounding on existing sources.
🎦 Movie-based: To avoid data contamination, this portion of the data is grounded on diverse scenarios pulled from movies released after 2022. GPT-4 shapes these situations, maintaining the core essence while adding its own elements.
🧠 ToMi-based: This segment is built on ToMi, a simulated dataset in which physical objects are moved to different places, a classic test of theory of mind. These social interactions are again embellished and expanded by GPT-4.

For each social interaction, we ask various questions designed to probe the following aspects of social understanding.

🧠 Theory of Mind: Questions evaluating understanding of others' mental states and perspectives.
👍 Social Norm: Questions aiming to discern societal values and norms within the situations.
😃 Emotion Recognition: Questions targeted at identifying and understanding emotional elements within the context.
👨‍👩‍👧 Social Relation: Queries focusing on interpersonal dynamics and relationships.
🤔 Counterfactual Questions: Hypothetical queries designed to explore alternative outcomes or possibilities.
📝 Social Advice: Questions eliciting advice or action recommendations relevant to the given situation.

Check out these examples in KokoMind!

Interaction:

Jun (nervously sipping coffee Male 28 marketing executive): I simply don't understand why things got so heated...
Kyung (unsure Female 32 nurse): Maybe it's just a misunderstanding. Don't worry too much about it.
Min (displeased Male 24 student): I couldn't even hear what Ji-ho was whispering about earlier.
Hana (surprised Female 45 housewife): I must've missed that too.
Ji-ho (whispering to Hae-won Male 40 manager): You've noticed the tension too, right?
Hae-won (whispers back Female 35 teacher): Yes, Ji-ho. But I think it's best if we don't get involved.
Dae (jovial Male 55 gardener): Oh, come on, everyone! We're here to enjoy our time together.
Jin (quietly concerned Female 60 retired): I hope nothing worse happens out of this situation.
Eun (unaware of what's happening Male 22 barista): Enjoy your coffee and pastries!
Soo-min (troubled but smiling Female 52 artist): Let's try to put this behind us and think positively.

Question: In Hae-won's mind, what does she think Ji-ho is feeling about the tension in the group?

Answer: Hae-won thinks Ji-ho is concerned about the tension in the group.
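
For readers who want to load interactions like the one above into their own evaluation scripts, here is a minimal sketch of how a single turn could be parsed. It assumes every turn follows the shape seen in this example, "Name (non-verbal cue Gender Age occupation): utterance". The Turn class, its field names, and the regular expression are hypothetical and are not part of the released dataset format.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for lines shaped like the example interaction:
#   Name (non-verbal cue Gender Age occupation): utterance
TURN_PATTERN = re.compile(
    r"^(?P<speaker>[^()]+?)\s*"
    r"\((?P<cue>.*?)\s+(?P<gender>Male|Female)\s+(?P<age>\d+)\s+(?P<occupation>[^()]+)\):\s*"
    r"(?P<utterance>.*)$"
)


@dataclass
class Turn:
    """One speaker turn; field names are illustrative assumptions."""
    speaker: str
    cue: str
    gender: str
    age: int
    occupation: str
    utterance: str


def parse_turn(line: str) -> Turn:
    """Split one interaction line into speaker, cue, demographics, and utterance."""
    match = TURN_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized turn format: {line!r}")
    fields = match.groupdict()
    fields["age"] = int(fields["age"])  # the regex captures the age as text
    return Turn(**fields)


turn = parse_turn(
    "Ji-ho (whispering to Hae-won Male 40 manager): You've noticed the tension too, right?"
)
print(turn.speaker, "|", turn.cue, "|", turn.utterance)
```

Lines that do not match the assumed shape raise a ValueError rather than being silently skipped, which makes format drift easy to spot.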

Here is the data distribution!

Results: GPT-4 tops the list, followed by Claude in many cases

We evaluated different models following AlpacaEval, with text-davinci-003 as the reference, and performed an ablation study in which we removed the non-verbal cues in parentheses (e.g., "nervously sipping coffee") from the context. Here are some interesting takeaways.
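
The ablation can be approximated with a small preprocessing step. The sketch below is one possible reading, not the authors' actual script: it assumes each turn looks like "Name (cue Gender Age occupation): utterance" and that only the cue phrase is dropped while the speaker's demographics stay in the parentheses. The function name and pattern are hypothetical.

```python
import re

# Assumption: the cue is the text inside the parentheses that precedes the
# "Male"/"Female" marker; the real preprocessing may differ.
CUE_PATTERN = re.compile(
    r"\((?P<cue>[^()]*?)\s+(?P<demo>(?:Male|Female)\s+\d+\s+[^()]+)\)"
)


def remove_nonverbal_cue(line: str) -> str:
    """Drop the cue phrase from the first parenthetical, keeping demographics."""
    return CUE_PATTERN.sub(lambda m: f"({m.group('demo')})", line, count=1)


original = (
    "Jun (nervously sipping coffee Male 28 marketing executive): "
    "I simply don't understand why things got so heated..."
)
print(remove_nonverbal_cue(original))
```

Lines that do not match the assumed shape are returned unchanged, so the function can be mapped over a whole interaction safely.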

Of the two LLM-based evaluators, GPT-4 is more certain and confident than Claude in identifying the winning model.
When the context has no non-verbal cues and the interaction is either generated solely by GPT-4 or grounded in movies, Claude performs better than GPT-4 (both evaluators agree); when the context contains non-verbal cues, GPT-4 is always better than Claude. One possible explanation is that GPT-4 is a multi-modal model and may therefore make better use of the extra non-verbal information. Also note that when non-verbal cues are present, the LLM-based evaluators judge the top-performing models to have a larger advantage over weaker models.
One may wonder: if the social interactions are generated by GPT-4, can GPT-4 already answer these questions? It seems that the source plays a smaller role than the type of question.
LLM-based evaluators appear to find it easier to determine the superior model in tasks unrelated to theory of mind, especially on samples generated from the ToMi dataset. This may be because even the LLM evaluator struggles to discern the correct answer in theory-of-mind contexts.
Claude is good at giving social advice in many cases.
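
Comparisons like these rest on pairwise judgments: for each question, an LLM evaluator decides whether a model's answer or the reference answer is better. A minimal sketch of how such judgments could be tallied into win rates, overall and per question type, is shown below. The label names, the record layout, and the choice to count a tie as half a win are assumptions made for illustration, not details confirmed by this page.

```python
from collections import defaultdict

VALID_LABELS = {"model", "reference", "tie"}  # hypothetical judgment labels


def win_rate(judgments: list[str]) -> float:
    """Fraction of comparisons the model wins; a tie counts as half a win (assumption)."""
    if not judgments:
        raise ValueError("No judgments to aggregate")
    unknown = set(judgments) - VALID_LABELS
    if unknown:
        raise ValueError(f"Unknown judgment labels: {sorted(unknown)}")
    wins = sum(1.0 for j in judgments if j == "model")
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)


def win_rate_by_category(records: list[tuple[str, str]]) -> dict[str, float]:
    """Group (question_category, judgment) pairs and compute a win rate per category."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for category, judgment in records:
        grouped[category].append(judgment)
    return {category: win_rate(js) for category, js in grouped.items()}


# Made-up judgments purely to show the call shape; these are not KokoMind results.
example = [
    ("Theory of Mind", "model"),
    ("Theory of Mind", "reference"),
    ("Social Advice", "model"),
    ("Social Advice", "tie"),
]
print(win_rate_by_category(example))
```

Splitting by question category mirrors how the takeaways above are phrased, since the type of question appears to matter more than the source of the interaction.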

Play around with these results!

Limitations

Our project, while exciting in many respects, has certain limitations. First, KokoMind is relatively small, which could limit the broad applicability and comprehensiveness of our conclusions. Second, all of the interactions in KokoMind are generated by GPT-4 and require human verification, which makes it hard to scale up the dataset. Third, although KokoMind provides human-verified answers, we did not use them as references during evaluation. Because these answers are generated by GPT-4, they may be biased towards GPT-4; future research can explore how to evaluate models with human-verified, machine-generated reference answers. Finally, all the models we evaluated were versions prior to June 1, 2023, and newly released models may perform better. Despite these limitations, we envision KokoMind as a springboard for future investigations into social intelligence, multi-modal language models, and related topics.

License

This project is an early-stage research showcase, designed solely for non-commercial purposes. It adheres to OpenAI's data usage terms and ShareGPT's privacy practices. Let us know if you spot any potential violations. The software's code is available under the Apache License 2.0.

Acknowledgement

We would like to thank Yejin Choi from UW, Louis-Philippe Morency from CMU, Jason Weston from Meta, and Diyi Yang from Stanford for their enlightening dialogue and constructive input. The theoretical foundation of KokoMind is based on Liang's PhD research with Song-Chun Zhu from Peking University, Tsinghua University and Beijing Institute for General Artificial Intelligence (BIGAI) and Ying Nian Wu from UCLA.

Citation

@misc{kokomind2023,
   title = {KokoMind: Can Large Language Models Understand Social Interactions?},
   url = {https://chats-lab.github.io/KokoMind/},
   author = {Shi, Weiyan and Qiu, Liang and Xu, Dehong and Sui, Pengwei and Lu, Pan and Yu, Zhou},
   month = {July},
   year = {2023}
}

Authors: Weiyan Shi*, Liang Qiu*, Dehong Xu, Pengwei Sui, Pan Lu, and Zhou Yu