Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies whether this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency in exploration, which leads to rapid model degeneration. We propose an algorithm that addresses this deficiency by increasing test-time compute, as well as a compute-bound approximation of it. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.
This work explores how large language models (LLMs) can learn in context through reinforcement learning (ICRL), rather than the more traditional in-context learning (ICL) based on supervised learning. While ICL relies on providing models with correct input-output demonstrations, in ICRL the model generates its own predictions and learns only from the reward signals it receives after each interaction. We outline three approaches to ICRL: Naive ICRL, which simply conditions the model on all of its past predictions and rewards; Explorative ICRL, which addresses the resulting exploration deficiency at the cost of increased test-time compute; and Approximate ICRL, a compute-bound approximation of Explorative ICRL.
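To make the interaction loop concrete, below is a minimal Python sketch contrasting a naive loop that conditions on every past episode with an exploration-oriented variant. The `llm_predict` and `reward` functions, the prompt format, the `keep_prob` parameter, and the positive-episode filtering are illustrative assumptions, not the paper's exact algorithms; the sketch only conveys the difference between accumulating all past episodes and stochastically subsampling successful ones.

```python
import random

def llm_predict(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a label for the final query in the prompt."""
    return random.choice(["label_a", "label_b"])

def reward(prediction: str, gold: str) -> int:
    """Binary reward: 1 if the prediction matches the hidden gold label, else 0."""
    return int(prediction == gold)

def naive_icrl(stream):
    """Naive ICRL: condition on every past (input, prediction, reward) episode."""
    episodes = []
    for x, gold in stream:
        context = "".join(
            f"Input: {xi}\nPrediction: {yi}\nReward: {ri}\n\n" for xi, yi, ri in episodes
        )
        y = llm_predict(context + f"Input: {x}\nPrediction:")
        r = reward(y, gold)              # the gold label itself is never shown to the model
        episodes.append((x, y, r))
    return episodes

def explorative_icrl(stream, keep_prob=0.5):
    """Exploration-oriented variant (illustrative only): keep positive-reward episodes
    and stochastically subsample them on every call, trading prompt variation and
    extra test-time compute for better exploration."""
    positives = []
    for x, gold in stream:
        shown = [ep for ep in positives if random.random() < keep_prob]
        context = "".join(f"Input: {xi}\nPrediction: {yi}\n\n" for xi, yi in shown)
        y = llm_predict(context + f"Input: {x}\nPrediction:")
        if reward(y, gold) == 1:         # only successful episodes join the in-context pool
            positives.append((x, y))
    return positives

# Toy usage on a stream of (input, hidden gold label) pairs.
toy_stream = [(f"query {i}", random.choice(["label_a", "label_b"])) for i in range(10)]
print(len(naive_icrl(toy_stream)), "episodes accumulated by Naive ICRL")
print(len(explorative_icrl(toy_stream)), "positive episodes kept by the explorative variant")
```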
Explorative ICRL significantly improves over zero-shot performance across all tasks and models. For example, with Llama 3.1 8B Instruct, Explorative ICRL boosts accuracy over zero-shot by +48.8% on Banking77, +56.8% on CLINC150, and +36.8% on NLU, while Phi 3.5 Mini sees gains of +46.2% on Banking77 and +55.2% on CLINC150. Naive ICRL generally performs worse than zero-shot because it fails to explore effectively, showing that simply conditioning on past predictions and rewards is not sufficient for in-context reinforcement learning. Approximate ICRL performs comparably to Explorative ICRL with Llama 3.1 8B Instruct but struggles with Phi 3.5 Mini, which requires less approximation to avoid degeneration. Overall, Explorative ICRL improves continually over time, demonstrating effective in-context learning from rewards alone.