Open-source large language models (LLMs) have matured past the point of simple manually built prompts and completion fine-tuning. They can now follow complex instructions and hold useful multi-turn conversations with your users and customers, on par with market-leading commercial LLMs.
In this article, we explore a technique called reinforcement learning from human feedback that transforms raw LLMs into capable conversationalists and instruction-following autonomous agents.
Out of the box, pretrained large language models are only capable of completions, i.e., they can only add text to a supplied prompt. You can't talk back and forth with a pretrained LLM like it's a chatbot.
To enable them to conduct dialogue with people, you must first train and finetune them for instruction-following and conversations.
Reinforcement Learning from Human Feedback (RLHF) is a training process that teaches an LLM instruction-following and multi-turn conversation using reinforcement learning.
The best way to understand what RLHF can do is to compare the output of a stock LLM with its RLHF-enhanced version.
In the demo below, we ask a question to a stock Llama 2 LLM that hasn't been finetuned for conversations or instructions:
The simple question in the prompt elicits almost random text as output. That's because the LLM just sees a prompt ending in the word "mean" and predicts the next token purely from the statistics of the text corpora it was trained on. It doesn't understand that the prompt is a question it must answer.
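If you want to try this yourself, a minimal sketch with the Hugging Face transformers library looks something like this (assuming you have access to the gated Llama 2 base weights; the exact prompt is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # pretrained base model (gated; requires access)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The base model just continues the text; it doesn't treat the prompt as a question.
prompt = "What does RLHF mean"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```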
Now compare the result from an RLHF-enhanced Llama 2 chat LLM:
Thanks to RLHF, the LLM understands that it's being asked a question and that its generated text must not only be a meaningful answer but also have the tone and cadence of an answer.
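A similar sketch against the RLHF-tuned chat variant, again assuming access to the gated weights and wrapping the question in Llama 2's [INST] prompt format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # RLHF-tuned chat variant (gated; requires access)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the question in Llama 2's [INST] chat format so the model treats it as a question to answer.
prompt = "[INST] What does RLHF mean? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```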
In the rest of this article, we explore how to implement RLHF for open-source LLMs.
When the first set of open-source LLMs like LLaMA came out in early 2023, open-source RLHF datasets and techniques weren't available. Only the large commercial LLMs were implementing RLHF. But by the third quarter of 2023, RLHF-powered open-source models as well as software and datasets to implement it were readily available.
As of September 2023, you can use these RLHF-trained or RLHF-ready open-source LLMs.
The diagram below shows the typical process to finetune an LLM for instruction-following and multi-turn dialogue using RLHF.
We explain each of these steps and datasets in the sections below.
LLM pretraining is the initial training of an untrained LLM on large text corpora. All LLMs start out like this, including popular ones like the legacy GPT-3 davinci-002 model and the Llama 2 base models. Since so many pretrained, open-source, ready-for-commercial-use LLMs are readily available, you rarely have to implement this step from scratch.
SFT is often the first step in creating an LLM capable of instruction-following or multi-turn dialogue. It takes a pretrained LLM that's only capable of completions and trains it to follow instructions or chat with a user.
Instruction-following refers to the ability of an LLM to follow instructions for specific tasks like summarization or code generation. These LLMs typically contain "instruct" in their names (like InstructGPT, for example).
In contrast, multi-turn dialogue is an LLM's ability to follow any number of diverse instructions and answer a variety of questions through back-and-forth conversation with a user while remembering all the information exchanged so far. It's a superset of instruction-following. Such LLMs are typically labeled as chat or chatbot LLMs (for example, Llama-2-70b-chat).
SFT's goal is to make a pretrained LLM reply to a user's prompt with specialized information in a suitable tone instead of simply adding text to the prompt like a raw LLM. It's similar to LLM pretraining but with differences in these aspects:
In the next section, we examine some datasets useful for SFT.
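Before that, to make the SFT step concrete, here's a minimal sketch using Hugging Face TRL's SFTTrainer on the Databricks Dolly 15k dataset covered below. The model ID, prompt template, and hyperparameters are illustrative assumptions rather than a tuned recipe:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Dolly 15k rows have instruction, context, response, and category fields.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_batch(batch):
    # Concatenate instruction, optional context, and response into one training text.
    texts = []
    for instruction, context, response in zip(batch["instruction"], batch["context"], batch["response"]):
        ctx = f"\n\nContext: {context}" if context else ""
        texts.append(f"Instruction: {instruction}{ctx}\n\nResponse: {response}")
    return texts

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # assumption: any pretrained causal LM you can access
    train_dataset=dataset,
    formatting_func=format_batch,
    max_seq_length=1024,
    args=TrainingArguments(output_dir="llama2-sft", per_device_train_batch_size=2, num_train_epochs=1),
)
trainer.train()
```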
For SFT, you can use these public datasets.
Databricks-dolly-15k is an instruction-following dataset created by Databricks employees through an internal gamified crowdsourcing process. It has these characteristics:
The Stanford Alpaca dataset has these characteristics:
The OpenAssistant conversations dataset has these characteristics:
SFT teaches an LLM to generate plausible and meaningful responses to instructions and questions. Unfortunately, people may not perceive many of its responses as relevant, helpful, or safe. To overcome that problem, additional training using RLHF is necessary.
RLHF is a model training technique that aligns an LLM with human preferences in terms of relevance, helpfulness, and safety. The technique combines reinforcement learning with a machine learning model that mimics human preferences, as explained in the sections below.
The standard technique to combine human preferences or behavior with a machine learning model is reinforcement learning (RL).
The typical RL paradigm works like this: In response to an input, a model acts according to a policy. The policy can be a complex non-linear function implemented by a neural network. A separate reward model (RM) then evaluates its action and either rewards or penalizes it. This reward or penalty is fed back to the policy to optimize it such that the probability of future reward is maximized.
Applying the RL algorithm to an LLM, the language model becomes the policy and its action is the generation of the next token. The RM rewards the policy if the generated tokens satisfy human preferences and penalizes it if not.
This reward is folded into the fine-tuning loss function: a reward reduces the training loss while a penalty increases it. Accordingly, the gradient-based optimizer adjusts the policy's (i.e., the LLM's) unfrozen weights so that expected future rewards are maximized.
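As a simplified illustration of that idea, a REINFORCE-style loss scales the negative log-likelihood of the generated tokens by the reward, so a positive reward pulls the loss down and a penalty pushes it up. This is a conceptual sketch, not the exact RLHF objective:

```python
import torch
import torch.nn.functional as F

def reinforce_style_loss(logits, generated_ids, reward):
    # Log-probabilities of the tokens the policy actually generated.
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, generated_ids.unsqueeze(-1)).squeeze(-1)
    # A positive reward lowers the loss (reinforces these tokens);
    # a negative reward (penalty) raises it (discourages them).
    return -(reward * token_log_probs.sum())
```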
We first explore how to implement RLHF reward models in more depth.
There are many ways to implement reward models that mimic human preferences.
The RM can be a multivariate regression model that produces a composite numerical score as a reward (or penalty) for an action. This regression model combines scores on various aspects like:
The training data for such a model is provided by thousands of human evaluators who score the LLM responses on the above aspects on some scale like one to five.
However, such scoring is time-consuming and expensive. Besides, numerical scores bring a lot of subjectivity into the process and require additional quality control checks like measuring inter-rater agreement using Fleiss' Kappa or similar metrics.
A faster, simpler, and cheaper approach is to present just two alternative responses to each evaluator and ask them to select the better one. In the training data created this way, each row will have a prompt, a winning response, and a losing response.
The reward function calculates a scalar score for the winning and the losing response. The loss is a function of the difference between these scores, known as a pairwise ranking loss, and training drives the winning response's score above the loser's.
An example of this is the Llama 2 RM's loss function:
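In the Llama 2 paper, this takes the form L_ranking = −log(σ(r_θ(x, y_c) − r_θ(x, y_r) − m(r))), where y_c is the chosen (winning) response, y_r the rejected one, and m(r) an optional margin. A minimal PyTorch sketch of such a pairwise ranking loss:

```python
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores, rejected_scores, margin=0.0):
    # -log(sigmoid(r_chosen - r_rejected - margin)), averaged over the batch.
    # The optional margin (used in Llama 2 for clearly better responses)
    # pushes the winner's score further above the loser's.
    return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()
```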
Proximal policy optimization (PPO) is a technique to optimally update the policy (i.e., the LLM's unfrozen layers).
Normally, policy gradient optimization performs just one gradient update per sampled batch of data. PPO instead allows multiple epochs of minibatch updates on the same batch of samples, using a clipped objective that keeps the updated policy close to the old one. This makes it more sample-efficient as well as simpler to implement than many alternative RL algorithms.
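Here's a minimal sketch of a single PPO step with Hugging Face TRL's PPOTrainer (circa-2023 API). The model ID, the prompts, and the score_response() reward helper are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM you have access to
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(model_name=model_id, learning_rate=1.41e-5, batch_size=2, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

prompts = ["What does RLHF mean?", "Summarize the benefits of fine-tuning."]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Generate responses with the current policy and keep only the newly generated tokens.
response_tensors = []
for query in query_tensors:
    output = ppo_trainer.generate(query, max_new_tokens=64, do_sample=True)
    response_tensors.append(output.squeeze(0)[query.shape[0]:])

# score_response() is a hypothetical stand-in for your trained reward model.
rewards = [
    torch.tensor(score_response(p, tokenizer.decode(r, skip_special_tokens=True)))
    for p, r in zip(prompts, response_tensors)
]

# One PPO optimization step over this batch of (query, response, reward) triples.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```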
Another technique you can use for RL is rejection sampling. It's one of the two techniques, along with PPO, used by Llama 2 for its RLHF implementation.
In rejection sampling, the policy (i.e., the LLM) generates multiple responses for each prompt, and the reward model selects the best candidate among them. The sample with the highest reward score becomes the new gold standard for that prompt. The model is then finetuned on this set of re-ranked samples, reinforcing the behaviors the reward model scores highly.
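A sketch of the idea in plain Python, where policy_generate() and reward_model_score() are hypothetical stand-ins for your LLM's sampling call and your trained reward model:

```python
def rejection_sample(prompt, policy_generate, reward_model_score, num_candidates=8):
    # Sample several candidate responses from the current policy.
    candidates = [policy_generate(prompt) for _ in range(num_candidates)]
    # Score each candidate with the reward model and keep the highest-scoring one.
    scores = [reward_model_score(prompt, c) for c in candidates]
    best_response = max(zip(scores, candidates), key=lambda pair: pair[0])[1]
    # The (prompt, best_response) pair joins the next fine-tuning set.
    return prompt, best_response
```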
You can implement your RLHF finetuning using these public datasets.
The SHP dataset has these characteristics:
The Anthropic HH-RLHF dataset has these characteristics:
In this section, we review useful frameworks and libraries you can use to implement RLHF.
TRL is a Hugging Face helper library with these out-of-the-box capabilities:
LLaMA Efficient Tuning is a framework for fine-tuning open-source LLMs like Llama 2, LLaMA, Falcon, and more. It provides the following capabilities:
RLHF has been a key advancement in turning LLMs from pure language models into useful tools for business workflows. In this article, you explored how you can apply supervised fine-tuning and RLHF to open-source LLMs to create your own capable agents and chatbots. Contact us if you need such customized chatbots in your customer service or other processes.