Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies whether this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency in exploration, which leads to rapid model degeneration. We propose an algorithm that addresses this deficiency by increasing test-time compute, as well as a compute-bound approximation of it. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.
This work explores how large language models (LLMs) can learn in context through reinforcement learning (ICRL), rather than the more traditional in-context learning (ICL) based on supervised learning. While ICL relies on providing models with correct input-output demonstrations, in ICRL the model generates its own predictions and learns only from the reward signals it receives after each interaction. We outline three approaches to ICRL: Naive ICRL, which simply conditions the model on all of its past predictions and rewards; Explorative ICRL, which addresses the resulting exploration deficiency at the cost of increased test-time compute; and Approximate ICRL, a compute-bound approximation of Explorative ICRL.
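To make the interaction loop concrete, below is a minimal Python sketch contrasting a naive loop that conditions on every past episode with an exploration-oriented variant. The `llm_predict` and `reward` functions, the prompt format, the `keep_prob` parameter, and the positive-episode filtering are illustrative assumptions, not the paper's exact algorithms; the sketch only conveys the difference between accumulating all past episodes and stochastically subsampling successful ones.

```python
import random

def llm_predict(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a label for the final query in the prompt."""
    return random.choice(["label_a", "label_b"])

def reward(prediction: str, gold: str) -> int:
    """Binary reward: 1 if the prediction matches the hidden gold label, else 0."""
    return int(prediction == gold)

def naive_icrl(stream):
    """Naive ICRL: condition on every past (input, prediction, reward) episode."""
    episodes = []
    for x, gold in stream:
        context = "".join(
            f"Input: {xi}\nPrediction: {yi}\nReward: {ri}\n\n" for xi, yi, ri in episodes
        )
        y = llm_predict(context + f"Input: {x}\nPrediction:")
        r = reward(y, gold)              # the gold label itself is never shown to the model
        episodes.append((x, y, r))
    return episodes

def explorative_icrl(stream, keep_prob=0.5):
    """Exploration-oriented variant (illustrative only): keep positive-reward episodes
    and stochastically subsample them on every call, trading prompt variation and
    extra test-time compute for better exploration."""
    positives = []
    for x, gold in stream:
        shown = [ep for ep in positives if random.random() < keep_prob]
        context = "".join(f"Input: {xi}\nPrediction: {yi}\n\n" for xi, yi in shown)
        y = llm_predict(context + f"Input: {x}\nPrediction:")
        if reward(y, gold) == 1:         # only successful episodes join the in-context pool
            positives.append((x, y))
    return positives

# Toy usage on a stream of (input, hidden gold label) pairs.
toy_stream = [(f"query {i}", random.choice(["label_a", "label_b"])) for i in range(10)]
print(len(naive_icrl(toy_stream)), "episodes accumulated by Naive ICRL")
print(len(explorative_icrl(toy_stream)), "positive episodes kept by the explorative variant")
```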
Explorative ICRL significantly improves over zero-shot performance across all tasks and models. For example, with Llama 3.1 8B Instruct, Explorative ICRL boosts accuracy over zero-shot by +48.8% on Banking77, +56.8% on CLINC150, and +36.8% on NLU, while Phi 3.5 Mini sees gains of +46.2% on Banking77 and +55.2% on CLINC150. Naive ICRL generally performs worse than zero-shot because it fails to explore effectively, showing that simply conditioning on past predictions and rewards is not sufficient for in-context reinforcement learning. Approximate ICRL performs comparably to Explorative ICRL with Llama 3.1 8B Instruct but struggles with Phi 3.5 Mini, which requires less approximation to avoid degeneration. Overall, Explorative ICRL improves continually over time, demonstrating effective in-context learning from rewards alone.