Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. This creates an avenue for continually learning from interactions without additional annotations. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.
We deploy an LLM policy \(\pi_{\theta_{\rho}}(a \vert x)\) to interact with users over multi-turn interactions. Following each round, the LLM reasons retrospectively about each of its actions (highlighted in blue) to decode feedback given the interaction context, including follow-up utterances. The model is then retrained on all data aggregated so far, \(D_{\leq \rho}\), and improves over time without any external annotations. The plot on the right shows the performance curve in our experiments: the LLM improves from a 31% to an 82% task completion rate over six rounds.
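For concreteness, the following is a minimal Python sketch of this round-based loop, assuming hypothetical helpers: deploy_round collects one round of human interactions with the current policy, decode_feedback is the policy's own retrospective feedback decoder, and retrain fine-tunes on the aggregated data. These names and data classes are illustrative assumptions, not the released implementation.

from dataclasses import dataclass, field

@dataclass
class Turn:
    context: str   # interaction history around this action, incl. follow-up utterances
    action: str    # the action the policy took (e.g., which shapes it selected)

@dataclass
class Interaction:
    turns: list = field(default_factory=list)

def respect(policy, deploy_round, decode_feedback, retrain, num_rounds=6):
    aggregated = []  # D_{<= rho}: every (context, action, decoded feedback) collected so far
    for rho in range(num_rounds):
        # 1. Deploy pi_{theta_rho} to interact with users for one round.
        interactions = deploy_round(policy)

        # 2. Retrospection: the policy decodes implicit feedback for each of its
        #    past actions from the surrounding interaction context.
        for interaction in interactions:
            for turn in interaction.turns:
                feedback = decode_feedback(policy, turn.context, turn.action)
                aggregated.append((turn.context, turn.action, feedback))

        # 3. Retrain on all data aggregated so far; no external annotations are used.
        policy = retrain(policy, aggregated)
    return policy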
Multiref is a multi-turn reference game. A speaker and a listener both observe the same set of tangram shapes, but in different orders. The speaker's goal is to describe a subset of target shapes for the listener to select. Because the targets comprise multiple abstract shapes, humans often communicate them gradually over multiple turns. As the interaction progresses, the speaker naturally produces implicit feedback signals that validate or reject the listener's actions.
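To make the feedback structure concrete, here is an invented toy exchange in the style of this game; the shape names and utterances are hypothetical and do not come from the dataset.

# Implicit feedback lives in the speaker's follow-up utterances, which react to
# the listener's selections. All content below is made up for illustration.
example_interaction = [
    {"role": "speaker",  "utterance": "select the one that looks like a flying bird"},
    {"role": "listener", "action": {"select": ["tangram_C"]}},
    {"role": "speaker",  "utterance": "no, not that one; the bird has both wings spread"},   # negative signal
    {"role": "listener", "action": {"deselect": ["tangram_C"], "select": ["tangram_F"]}},
    {"role": "speaker",  "utterance": "yes! now find the one like a dancing person"},        # positive signal
]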
We present deployment results across three rounds for six concurrent systems, and three more rounds for the top system (B-SUP), together with human-human reference performance (HH) and a redeployment of the initial policy \(\pi_{\theta_0}\) (CONTROL). Left: interaction-level success rate (\(\uparrow\), higher is better). Center: interaction-level efficiency, measured by the number of turns per interaction (\(\downarrow\), lower is better). Right: micro-level performance, measured by click accuracy (\(\uparrow\)).
More granularly, we present the turn-level performance of B-SUP and the controls, evaluated by post-hoc human annotation. Left: % of turns where the policy's action \(\hat a\) exactly matches the human listener's action \(a^*\) (\(\uparrow\)). Center: similarity between the policy's action and the human listener's action (\(\uparrow\)); even actions that receive negative feedback in deployment (NEG FB) become increasingly similar to human actions. Right: % of turns annotated as having received positive implicit feedback from human listeners (\(\uparrow\)).
@misc{chen2024retrospective,
title={Retrospective Learning from Interactions},
author={Zizhao Chen and Mustafa Omer Gul and Yiwei Chen and Gloria Geng and Anne Wu and Yoav Artzi},
year={2024},
eprint={2410.13852},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.13852},
}