Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Sep 18, 2025
Chandler Smith, Marwa Abdulhai, Manfred Diaz, Marko Tesic, Rakshit Trivedi, Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar A. Duéñez-Guzmán, John P Agapiou, Jayd Matyas, Danny Karmon, Beining Zhang, Jim Dilkes, Akash Kundu, Jord Nguyen, Emanuel Tewolde, Jebish Purbey, Ram Mohan, Rao Kadiyala, Siddhant Gupta, Aliaksei Korshuk, Buyantuev Alexander, Ilya Makarov, Gang Zhao, Rolando Fernandez, Zhihan Wang, Caroline Wang, Jiaxun Cui, Lingyun Xiao, Di Yang Shi, Yoonchang Sung, Arrasy Rahman, Peter Stone, Yipeng Kang, Hyeonggeun Yun, Ananya Ananya, Taehun Cha, Zhiqiang Wu, Elizaveta Tennant, Olivia Macmillan-Scott, Marta Emili García Segura, Diana Riazi, Fuyang Cui, Sriram Ganapathi Subramanian, Toryn Q. Klassen, Nico Schiavone, Mogtaba Alim, Sheila A. McIlraith, Manuel Sebastian Rios Beltran, Oswaldo Peña, Carlos Saith Rodriguez Rojas, Manuela Chacon-Chamorro, Ruben Manrique, Luis Felipe Giraldo, Nicanor Quijano, Yiding Wang, Yuxuan Chen, Fangwei Zhong, Mengmeng Wang, Wenming Tu, Zhaowei Zhang, Ziang Chen, Zixia Jia, Xue Feng, Zilong Zheng, Chichen Lin, Weijian Fan, Chenao Liu, Sneheel Sarangi, Ziyan Wang, Shuqing Shi, Yali Du, Avinaash Anand Kulandaivel, Yang Liu, Wu Ruiyang, Chetan Talele, 陆孙嘉, Gema Parreño Piqueras, Shamika Dhuri, Bain McHale, Tim Baarslag, Dylan Hadfield-Menell, Natasha Jaques, Jose Hernandez-Orallo, Joel Z Leibo
NeurIPS 2025 Datasets and Benchmarks Track

Abstract. Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent’s ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
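The evaluation idea in the abstract — scoring an agent by how well it achieves mutual gains when dropped zero-shot into diverse scenarios against unfamiliar partners — can be sketched as an average of the focal agent's payoffs over every scenario/partner pairing. This is a minimal illustrative sketch, not the Concordia API: `evaluate_agent`, `play_episode`, and the scenario dictionaries below are hypothetical names introduced here for exposition.

```python
from statistics import mean

def evaluate_agent(agent, scenarios, play_episode):
    """Average the focal agent's payoff across all scenario/partner pairings.

    Each scenario pairs the focal agent with one or more background
    populations it has never trained against (the zero-shot condition).
    `play_episode` runs one simulated episode and returns the focal
    agent's payoff for it.
    """
    payoffs = []
    for scenario in scenarios:
        for partners in scenario["background_populations"]:
            payoffs.append(play_episode(agent, partners, scenario["name"]))
    return mean(payoffs)

# Toy usage with stubbed-out scenarios and payoffs: two negotiation
# episodes (one per background population) and one collective-action
# episode contribute to the final score.
scenarios = [
    {"name": "negotiation", "background_populations": [["bot_a"], ["bot_b"]]},
    {"name": "collective_action", "background_populations": [["bot_a", "bot_b"]]},
]
stub_payoffs = {"negotiation": 1.0, "collective_action": 0.5}
score = evaluate_agent("focal_agent", scenarios,
                       lambda agent, partners, name: stub_payoffs[name])
print(score)  # mean payoff over the three episodes
```

Averaging across partners and contexts, rather than reporting per-scenario scores alone, is what rewards the general cooperative intelligence the paper targets: an agent that only exploits one fixed partner type scores poorly overall.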