Microsoft Research has introduced a new technique for better aligning Large Language Models (LLMs) with human intentions through active preference elicitation. The approach aims to make alignment more efficient and precise by having the model actively seek out responses with potentially high reward rather than relying on passively collected feedback.
Harnessing Human Feedback
Historically, Reinforcement Learning from Human Feedback (RLHF) has been the go-to method for aligning LLMs with user expectations. In this approach, a reward function is learned from human judgments over prompt-response pairs, and the model is then optimized against it. Diversity in the responses being compared is crucial for developing flexible language models, ensuring that optimization does not collapse onto a narrow set of locally optimal responses.
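As a reminder of the standard formulation (the notation here, r_φ for the reward model, π_ref for the reference policy, and β for the KL weight, is ours rather than the paper's): the reward model is first fit to pairwise preferences under a Bradley-Terry model, and the policy is then optimized against it with a KL penalty toward the reference model.

```latex
% Reward learning from pairwise human preferences (Bradley-Terry model)
\max_{\phi}\; \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% KL-regularized policy optimization against the learned reward
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```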
The alignment process can run either offline or online. Offline alignment generates a variety of responses for a fixed set of prompts, but such datasets rarely capture the full breadth of natural language. Online alignment, by contrast, works iteratively, collecting new preference data from feedback on responses the LLM itself generates. This lets the model explore language it has not yet encountered, but because responses are sampled passively from the current policy, exploration remains limited and the model risks overfitting to the data it has already seen. A simplified round of this online loop is sketched below.
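The following is a minimal sketch of one round of online preference collection; the helper functions (generate_responses, human_preference, update_policy) are hypothetical placeholders, not part of any particular library.

```python
# Minimal sketch of one round of online preference collection.
# generate_responses, human_preference, and update_policy are hypothetical
# placeholders standing in for real sampling, annotation, and training steps.
def online_alignment_round(policy, prompts):
    preference_data = []
    for x in prompts:
        # Sample two candidate responses from the current policy.
        y_a, y_b = generate_responses(policy, x, num_samples=2)
        # Ask annotators (or a preference oracle) which response is better.
        chosen, rejected = human_preference(x, y_a, y_b)
        preference_data.append((x, chosen, rejected))
    # Update the policy on the newly collected preferences (e.g., with DPO).
    return update_policy(policy, preference_data)
```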
Introduction of Self-Exploring Language Models (SELM)
To overcome these shortcomings, the Microsoft researchers propose a bilevel objective that optimistically favors responses with potentially high reward. Dubbed Self-Exploring Language Models (SELM), the method incorporates the reward function directly into the LLM, thereby eliminating the need for a separate reward model. Compared to Direct Preference Optimization (DPO), SELM aims to explore more efficiently while avoiding indiscriminate preference for unseen extrapolations. The rough shape of such an objective is sketched below.
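The exact SELM objective is derived in the paper; as an illustration only, the sketch below combines a standard DPO loss with an optimism-style bonus on the implicit reward of the chosen response. The weight alpha and the specific form of the bonus are assumptions made for this example, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def dpo_with_optimism_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,      # KL-regularization strength, as in DPO
    alpha: float = 0.01,    # hypothetical weight on the exploration bonus
):
    # Implicit rewards under the DPO reparameterization.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: prefer the chosen response over the rejected one.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Optimism-style bonus (illustrative, not the paper's exact term):
    # nudge the policy toward responses whose implicit reward could be high.
    exploration_bonus = -alpha * chosen_rewards

    return (dpo_loss + exploration_bonus).mean()
```

In this shape, the bonus term pushes the policy toward responses whose implicit reward could be high, which is what drives the active exploration the method is named for.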
Initial experiments show that SELM boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0 when applied to models like Zephyr-7B-SFT and Llama-3-8B-Instruct. SELM also performs strongly across a range of standard academic benchmarks in different settings.
This method helps LLMs not only follow instructions more accurately but also consider a broader spectrum of possible responses. It marks a significant step forward in aligning LLMs with user intentions, promising more reliable and capable language models. For those interested in the specifics, the research paper is available on arXiv.