Existing work on long-term open-domain dialogue focuses on evaluating model responses within contexts spanning no more than five chat sessions.
We introduce a machine-human pipeline for generating high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding the dialogues in personas and temporal event graphs. We also equip each agent with the ability to share and react to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding in the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing an average of 300 turns and 9K tokens, spread over up to 35 sessions.
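To make the pipeline concrete, the sketch below shows one way persona- and event-graph-grounded turn generation could be structured. It is an illustrative outline under our own assumptions, not the released code; the dataclass fields and the `call_llm` helper are hypothetical placeholders.

```python
# Minimal sketch (not the released pipeline) of persona- and event-graph-grounded
# dialogue generation. Field names and `call_llm` are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Event:
    """A node in a speaker's temporal event graph."""
    date: str                        # ISO date, e.g. "2023-05-14"
    description: str                 # e.g. "adopted a rescue dog"
    caused_by: list[int] = field(default_factory=list)  # indices of earlier events


@dataclass
class AgentState:
    persona: str                     # free-text persona summary
    events: list[Event]              # chronologically ordered event-graph nodes


def generate_turn(agent: AgentState, session_date: str, history: list[str]) -> str:
    """Produce one conversational turn grounded in the persona and past events."""
    # Only events that have already "happened" by the current session are visible;
    # ISO dates compare correctly as strings.
    visible = [e for e in agent.events if e.date <= session_date]
    prompt = (
        f"Persona: {agent.persona}\n"
        f"Life events so far: {'; '.join(e.description for e in visible)}\n"
        "Conversation so far:\n" + "\n".join(history[-10:]) +
        "\nReply in character, referencing your events where natural:"
    )
    return call_llm(prompt)          # hypothetical LLM call


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API here")
```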
Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks.
An example of a conversation in LoCoMo is shown to the right.
In this study, we present a holistic evaluation framework to assess an agent’s proficiency in managing and responding within long-term contexts.
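As a rough illustration of how the question-answering portion of such an evaluation could be scored, the sketch below computes token-level F1 over one conversation's QA pairs. The JSON field names (`conversation`, `qa`, `question`, `answer`) are assumptions made for illustration, not the exact LoCoMo schema.

```python
# Sketch of QA scoring over a LoCoMo-style conversation file.
# Field names are assumed; token-level F1 is the standard extractive-QA metric.
import json
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate_qa(sample_path: str, answer_fn) -> float:
    """Average F1 of `answer_fn(context, question)` over a conversation's QA pairs."""
    with open(sample_path) as f:
        sample = json.load(f)
    # Flatten the multi-session dialogue into a single long context string.
    context = "\n".join(sample["conversation"])
    scores = [
        token_f1(answer_fn(context, qa["question"]), qa["answer"])
        for qa in sample["qa"]
    ]
    return sum(scores) / len(scores)
```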
We present extensive experimental results on the LoCoMo benchmark using instruction-based LLMs, long-context LLMs, and RAG techniques. Our findings include:
@article{maharana2024locomo,
  author  = {Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
  title   = {Evaluating Very Long-Term Conversational Memory of LLM Agents},
  journal = {arXiv preprint},
  year    = {2024},
}