Hands-On Expert-Level Contract Summarization Using LLMs
Discover how we use LLMs for robust, high-quality contract summarization and understanding for our legal clients as well as other businesses.
OpenAI's GPT-3 and ChatGPT may have stolen the limelight, but they are by no means the only players in the large language model (LLM) and generative AI ecosystem. You can instead run powerful open-source alternatives like GPT-J more efficiently and reliably on your own servers. They're capable of the same use cases: blog generation, customer service chatbots, internal search, question answering, and more. As we'll see below, these open-source models perform much better than you might think. Let's take a look at what GPT-J is and how GPT-J and GPT-3 fare against each other on language processing tasks.
GPT-J is an open-source, open-access large language model that came out of EleutherAI's experiments in training massive models on the scale of OpenAI's GPT-3. With 6 billion parameters, GPT-J isn't the biggest, but it is above average and larger than EleutherAI's older GPT-Neo models. EleutherAI's newer GPT-NeoX, with 20 billion parameters, and others like BLOOM, with 176 billion parameters, outclass GPT-J in size. For comparison, GPT-3 has about 175 billion parameters.
The main benefit of GPT-J is that its model and code are available to everyone to customize and deploy on consumer hardware or private cloud infrastructure.
GPT-J, like GPT-3 and GPT-2, is an autoregressive model consisting of just the decoder of the standard transformer model. Like GPT-3, it's a causal language model (LM), meaning that its predictions are entirely based on the words and context that came earlier. In contrast, masked language models use context from both sides. GPT-J is autoregressive because, at every step, it uses the previous prediction as input for the next prediction.
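As a concrete illustration, here is a minimal sketch of causal generation with GPT-J via the Hugging Face transformers library. The prompt and generation settings are just examples, and the 6-billion-parameter checkpoint assumes a large GPU (or plenty of RAM) to load:

```python
# Minimal sketch: next-token (causal) generation with GPT-J using Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # half precision to fit on one GPU
).to(device)

prompt = "The court ruled that the contract was"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Each new token is predicted only from the tokens to its left (causal / autoregressive),
# and the growing sequence is fed back in as input for the next prediction.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```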
Unidirectionality is not a weakness. The way causal LMs work is closer to how people produce language, which makes them a better fit for natural-sounding text generation. In contrast, the ability of masked LMs to use context from both sides makes them better suited to document understanding tasks.
A critical factor behind the power of these LLMs is the data they're trained on and how closely the format of the training sequences matches the way users actually interact with the model. One of the reasons ChatGPT (a different model from the GPT-3 discussed here) is so much better at Q&A than smaller models is that it was trained specifically on human-reviewed Q&A pairs in a format very similar to how you use it in the web interface.
GPT-J was trained on an 825-gigabyte collection called the Pile, which consists of 22 high-quality datasets from a diverse set of sources that enable its use in multiple domains and use cases. The sources include Common Crawl, Wikipedia, PubMed Central, arXiv, GitHub, Stack Exchange, FreeLaw, Project Gutenberg, and more.
GPT-3 was also trained on Common Crawl data, along with WebText2, Books1, Books2, and Wikipedia, among others. This wide range of text sources and language formats is a key part of why GPT-J can be applied across so many use cases.
By absorbing the structure and statistics of text in these datasets, these LLMs learn to mimic natural-sounding, semantically consistent text. A visible consequence is their ability to discern abstract patterns in text much like people do, giving them powerful few-shot capabilities on many natural language processing (NLP) tasks.
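Consider a toy prompt like this one (our own illustrative example):

```
Paris is the capital of France.
Berlin is the capital of Germany.
Tokyo is the capital of Japan.
Rome is the capital of
```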
When you read the text above, you immediately notice patterns at multiple levels. You see the obvious patterns in the structure. Then you sense higher-level, more abstract patterns, like the analogy that carries over from each line to the next. By applying these analogies, you can predict the next word. This is, in essence, few-shot learning.
Since they are trained on large volumes of human-produced text, LLMs learn to pick up on these patterns too. We can use those abilities for natural language tasks through two mechanisms: in-prompt guidance (few-shot prompting) and fine-tuning.
The first mechanism, in-prompt guidance, supplies the LLM with examples of what a correct result looks like for your specific use case and how to get there, via patterns and instructions. In natural language generation use cases, where the correctness of the output is more subjective, these few-shot examples are even more critical for helping the model understand the key details of the desired output.
Such inputs with examples and instructions are called "prompts." The patterns and analogies in the examples influence the LLM to predict text that's consistent with them. It's not genuine learning but more like mimicking the abstract relationships in the examples. Surprisingly, that's good enough for many NLP use cases!
Just remember that, by default, a GPT model won't learn or remember anything from a previous prompt. So if you want to classify or summarize many items, you must supply these examples with every inference. We use techniques like dynamic prompt optimization to provide these LLMs with optimized prompts and examples on every run.
Another drawback of in-prompt guidance is the length restriction on prompts. Every LLM's input capacity is limited to a few thousand tokens (and tokens aren't even full words but fragments, often two or three to a word). GPT-J and most GPT-3 models are limited to 2,048 tokens, while the latest GPT-3 Davinci model supports 4,000. But what if your use case is document summarization or blog generation, with far too much text to fit into a length-restricted prompt? Providing more than one or two few-shot examples is impossible there. For those use cases, you'll have to look into fine-tuning or zero-shot prompting with stronger instructions.
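Because these limits are counted in tokens rather than words, it's worth measuring a prompt before sending it. A minimal sketch using GPT-J's tokenizer (GPT-3's base models use essentially the same GPT-2 byte-pair encoding, so the count is a good estimate for them too):

```python
# Count the tokens in a prompt before sending it to a length-restricted model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "Summarize the following contract clause in plain English: ..."
n_tokens = len(tokenizer.encode(prompt))
print(f"{n_tokens} tokens used out of a 2,048-token budget")
```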
The second mechanism is fine-tuning. Like any other neural network, you can fine-tune an LLM's weights toward specific goals by retraining some of its layers. If you want the LLM to sound like a legal or medical expert, you can fine-tune it on relevant case files or medical records. Given enough data, its weights will shift to generate text consistent with the domain.
Both GPT-J and GPT-3 support fine-tuning. Since GPT-J is a regular neural network model, you can fine-tune it like any other network by unfreezing its layers. You have full control over the process, like the layers to unfreeze and optimization algorithms to use.
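As a rough sketch of what that control looks like (the choice of which blocks to unfreeze is illustrative, not a recipe):

```python
# Sketch: freeze everything except the last few transformer blocks of GPT-J before fine-tuning.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Freeze all parameters first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two transformer blocks and the output head (arbitrary choice here).
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

# From here, train with your preferred optimizer or Trainer on domain-specific text.
```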
In contrast, GPT-3 doesn't let you access its models directly. It provides a basic application programming interface (API) that you can use to fine-tune, but you get nowhere near the same control: you can't choose which layers to update or which optimization algorithm to use, and you're limited to the handful of hyperparameters the API exposes.
In the next three sections, you'll see how these approaches and models fare on applied NLP tasks.
Let's see if GPT-J can hold its own against the four GPT-3 models on few-shot classification using in-prompt guidance.
The goal was to test GPT-J against all four GPT-3 models (Ada, Babbage, Curie, and Davinci) on few-shot intent recognition using the banking77 dataset. The dataset contains short online chat queries from bank customers like these:
They're categorized under 77 fine-grained intent categories:
We selected 10 labels at random for this test. Such a large number of possible categories makes this a difficult task even for purpose-trained classifiers. Can the GPTs do well using just few-shot guidance? Also, notice that the labels aren't natural phrases but programming-style strings joined by underscores; we expect the GPTs to predict them verbatim.
The challenge here is to achieve the best classification metrics without crossing the prompt's limit of 2,048 tokens. The unknown parameter is the optimal number of examples per label for this dataset. Too few means our metrics may be sub-optimal. But too many means we may be overfitting and getting unreliable metrics.
So we conducted the experiment over multiple cycles, split-testing different prompts:
The selected samples and labels were combined with delimiters to create the few-shot prompt above. You can immediately spot the structural patterns. Quite impressively, even the GPTs can, without any instructions or headers.
Next, we selected about 30 samples from the validation set, and appended each of them to our few-shot prompt, as shown above, to get 30 prompts for inference.
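Here is a sketch of how such a prompt can be assembled; the "Query:/Intent:" delimiters and the example pairs are our illustrative assumptions, not the exact ones used:

```python
# Sketch: assemble a few-shot intent-classification prompt from labeled examples,
# then append an unlabeled validation query for the model to complete.
few_shot_examples = [
    ("I still have not received my new card.", "card_arrival"),
    ("I think my card was stolen last night.", "lost_or_stolen_card"),
    # ... a few (query, label) pairs for each of the 10 sampled labels ...
]

few_shot_block = "".join(
    f"Query: {text}\nIntent: {label}\n\n" for text, label in few_shot_examples
)

def build_prompt(query: str) -> str:
    """Append one unlabeled query so the model predicts its intent label verbatim."""
    return few_shot_block + f"Query: {query}\nIntent:"

# ~30 held-out queries from the validation split (placeholder shown here).
validation_queries = ["How long does a transfer from the UK usually take?"]
prompts = [build_prompt(q) for q in validation_queries]
```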
We passed the 30 prompts to the completion endpoint of each of the four GPT-3 models using OpenAI's Python library. The parameters below were chosen because we wanted the models to generate just one of the 10 labels and nothing else:
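A sketch of that call with the openai Python library of the time; the exact parameter values here (temperature, token limit, stop sequence) are our illustrative choices in the spirit of "predict only the label":

```python
# Sketch: score the 30 prompts against a GPT-3 model via OpenAI's completion endpoint.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

predictions = []
for prompt in prompts:  # the 30 few-shot prompts built above
    response = openai.Completion.create(
        model="text-curie-001",  # repeat for each of the four GPT-3 models
        prompt=prompt,
        temperature=0,           # deterministic: we want the single most likely label
        max_tokens=10,           # labels are short underscore-joined strings
        stop=["\n"],             # stop at the end of the predicted label line
    )
    predictions.append(response["choices"][0]["text"].strip())
```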
We passed the same 30 prompts, with nearly the same parameters, to a hosted GPT-J API from Forefront, a popular managed NLP platform.
As you've probably heard before, open models have a reputation for not doing well right out of the box. Because they're smaller, many people assume they're less capable of completing any given task and that some tweaking or fine-tuning is always needed to compete with the larger models. We fully expected bad labels here. So we were blown away when GPT-J not only generated nothing but valid labels (underscores and all) but also matched Babbage's metrics and came closer to Curie than Ada did.
Where did GPT-J have a performance gap? The failed samples above are quite ambiguous even for a human reader, and even Curie tripped up on the same sample. All said and done, GPT-J, out of the box, fared surprisingly well in this head-to-head, fine-grained text classification against GPT-3.
The next experiment was document summarization. Few-shot prompt guidance was impractical because the documents and summaries were too long to provide more than one or two examples. So we fine-tuned GPT-J using Forefront’s fine-tuning interface and GPT-3's Curie using OpenAI's tools.
We used the Multi-LexSum legal dataset, which provides three expert-authored summaries for each court case: a long summary, a short one, and a tiny one. The plan was to use the long summaries as our "documents" and the short ones as the ideal summaries for fine-tuning. Then, given an unseen document (i.e., a long summary), the GPTs must generate something close to its short summary, or at least something containing the same information.
GPT-3's fine-tuning guide recommends at least 500 samples, while the GPT-J guidance suggests at least 200. We used the same 500 samples to fine-tune both. The samples must be JSON Lines (JSONL) records with two fields per record, prompt and completion. There are particular rules about delimiters at the end of both fields and a space at the start of the completion.
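For example, a single training record could look like this (the "\n\n###\n\n" prompt separator and the trailing stop word follow the conventions recommended in OpenAI's legacy fine-tuning guide; the bracketed text is a placeholder):

```
{"prompt": "<long court-case summary text> \n\n###\n\n", "completion": " <short expert-written summary> END"}
```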
After uploading the training and validation JSONL files through the OpenAI API, we launched the fine-tuning task using their command-line tool.
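For reference, the equivalent flow with the legacy openai Python library looked roughly like this (the CLI one-liner at the time was along the lines of `openai api fine_tunes.create -t train.jsonl -v valid.jsonl -m curie`; the file names here are ours):

```python
# Sketch: upload the JSONL files and launch a Curie fine-tune with the legacy openai library.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

train_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = openai.File.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

job = openai.FineTune.create(
    training_file=train_file["id"],
    validation_file=valid_file["id"],
    model="curie",
)
print(job["id"], job["status"])  # the job may sit in OpenAI's queue before it starts
```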
Unfortunately, the job then sat in OpenAI's queue for four to five hours. Once it started, though, it finished in about 10 minutes.
To fine-tune GPT-J, we uploaded the same training JSONL file through Forefront's web frontend. In theory we could have used their API, but we found some things not working as expected. Fine-tuning for two epochs, with five checkpoints and evaluation, took about 15 minutes.
In the images above, you can observe how the generated summary improved between the first and last checkpoints. The summaries are short only because the maximum generation length for validation was mistakenly set to just 64 tokens; ideally it should be slightly longer than the longest reference summary.
Let's compare the summaries:
From this qualitative comparison, it's clear that fine-tuning improved GPT-J's ability to summarize and to include most of the same salient details as the human expert. You can even add a layer of advanced entity extraction to this pipeline to improve the model's grasp of key information and push the output toward a more extractive summary. This is a perfect example of why we tell customers that data is more important than the size of the model they choose. Most customers scoff at the idea of using a model other than GPT-3's Davinci simply because they believe a bigger model means more accuracy for their use case. While model size certainly affects a model's ability to learn tasks in a low-data (few-shot) setting, fine-tuning these task-agnostic language models on your specific data and use case completely changes the equation of which model to choose.
Surprisingly, both baseline and fine-tuned Curie generate almost identical summaries. Is baseline Curie that good, or is it possible that Curie had already seen this data during its training?
This last experiment explored GPT-J's ability to sound like a domain expert after being trained on domain-specific data. Businesses can use this to build bots that offer domain-specific autocomplete suggestions, explain complex internal processes to employees, or explain product features to customers. Since it fundamentally modifies the model's text generation behavior, fine-tuning is the way to go here.
We trained it on the same Multi-LexSum dataset as in the summarization experiment. To fine-tune for free-form text generation, we leave the JSONL prompt fields empty and supply only the completion fields; here, the completions were 500 long summaries. The idea was to turn GPT-J into a kind of legal expert. The fine-tuning process itself was the same as for summarization.
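In other words, each record reduces to something like this (we assume the same leading-space and stop-word conventions as before; the bracketed text is a placeholder):

```
{"prompt": "", "completion": " <one long expert-written case summary> END"}
```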
However, as the generated text above shows, both the base model and the fine-tuned model hallucinated unrealistic statements. In hindsight, this dataset simply isn't suitable for open-ended text generation: it's full of case-specific details with little general theory or explanation.
The ideal datasets for this task are things like product knowledge bases or theoretical works on a subject.
GPT-3 is OpenAI's proprietary LLM that you can only access via their API. In contrast, an open-access LLM like GPT-J gives you far more deployment freedom. Some of the choices include running it on your own on-premise GPU servers or consumer hardware, deploying it on private cloud infrastructure, or using a managed inference host like Forefront.
This is one of the key benefits of using open-source models: control over the infrastructure that serves your model gives you far more flexibility to tune inference speed and cost than managed services do.
An LLM may be a very capable text generator, but its knowledge and abilities are limited to what it saw during training. Many use cases, however, require the model to be aware of current real-world events. For example, law firms may need their LLM assistants to stay up to date with that morning's court appointments and rulings, and customer service chatbots may need to know the latest product prices or traffic conditions.
LLMs can be immensely useful to many business use cases if we can enable them to see the world in real-time. We'll cover two approaches to such real-world data augmentation.
LangChain is a spaCy-like library for creating data processing pipelines around LLMs. It integrates tightly with LLM-oriented interfaces like prompts and concepts like agents and statefulness to build powerful LLM-driven pipelines. You can use it with GPT-3, GPT-J, or any other LLM.
For example, it can retrieve data from an external API, inject them into an LLM's prompt or output, or split and recombine long documents for tasks like summarization. But, useful as it is, LangChain doesn't fundamentally change the working of an LLM. The next approach does.
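As a minimal sketch of the idea (assuming an early LangChain release where PromptTemplate, LLMChain, and the OpenAI wrapper are available; fetch_todays_rulings is a hypothetical stand-in for your own data source, and you could swap in a hosted GPT-J endpoint for the LLM):

```python
# Sketch: pull fresh data from an external source and inject it into an LLM prompt with LangChain.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def fetch_todays_rulings() -> str:
    # Placeholder: in practice, call a court-records API or internal database here.
    return "Case 21-cv-1234: motion to dismiss denied. Case 22-cv-5678: settled."

template = PromptTemplate(
    input_variables=["rulings", "question"],
    template=(
        "Here are today's court rulings:\n{rulings}\n\n"
        "Answer the question using only the rulings above.\nQuestion: {question}\nAnswer:"
    ),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=template)
print(chain.run(rulings=fetch_todays_rulings(), question="What happened in case 21-cv-1234?"))
```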
Toolformer is a paradigm shift because it enables an LLM to learn to use data augmentation tools by itself and make context-aware decisions, like when to use or not use a tool. Such built-in context awareness is far more powerful than rule-oriented approaches like LangChain.
Under the hood, it introduces special tokens (somewhat like BERT's CLS token). When the surrounding context activates such a token, the model triggers an action: running a script, querying another neural network, or calling a third-party API. It then decides whether the results it receives are relevant to the context and whether to include them in the generated text.
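The Toolformer paper writes these calls inline in the text using a `[Tool(input) → result]` notation; a made-up sentence in that style (not output from an actual Toolformer model) looks like this:

```
Out of 247 class members, 103 opted in, i.e. [Calculator(103 / 247) → 0.42] 42% of the class.
```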
Because Toolformer directly modifies some model fundamentals, you need an open-source model like GPT-J to use it. Of course, OpenAI may decide to implement something similar in the future and expose it through an API, but are you prepared to wait on that uncertainty? Moreover, you may never be able to point a third-party LLM at your internal APIs. By opting for a self-hosted model like GPT-J, you can integrate your proprietary APIs and data sources immediately to provide unique services to your customers.
Internal knowledge search engines, customer service help, document summarization, question answering, and text classification are just some of the new possibilities that artificial intelligence has opened up via powerful large language models like GPT-J. Contact us for ideas on how to integrate them into your workflows and improve your business efficiency.