Hands-On Expert-Level Contract Summarization Using LLMs
Discover how we use LLMs for robust, high-quality contract summarization and understanding for our legal clients as well as other businesses.
The field of text summarization for input texts of all types and sizes continues to grow in 2022, especially as deep learning pushes forward and expands the range of possible use cases. We’ve moved from summarizing paragraphs and short pages to using large language models to summarize entire books (thanks to OpenAI) or long research papers. These advances in transformers and large language models have driven the game-changing summarization abilities we see right now. Models such as GPT-3 have made it easy for anyone to get started with text summarization at some level.
As we continue to push the bounds of text summarization, it’s easy to see why it’s considered one of the most challenging fields to perfect. It’s not terribly difficult in a playground environment with simple tasks such as single paragraphs and a small dataset. But what happens when you want more say in what counts as a “good” summary, or you go from 1 paragraph to 8? As the data variance changes and you look to exert more control over what your summaries look like, the difficulty grows rapidly. The architecture you need to go from experimenting with summarization to a production system that supports a huge range of text sizes and document types is entirely different.
With GPT-3 specifically, you have a number of variables to take into account that set it apart from other summarization architectures. Parameters such as temperature, prompt size, token limit, top_p, and the goal task are just a few of the things to manage with a GPT-3 text summarizer. These variables allow you to create an incredibly powerful and customizable text summarization system, but they also make it extremely challenging to tune.
We’ll take a look at a wide variety of systems we’ve built with GPT-3 to tackle the summarization problem for different document lengths and types. We’ll start with simpler models that help us get the basics down, then move to much more powerful architectures for the various ways we can summarize.
We should discuss the two main types of summarization before diving into actual GPT-3 summarization. Extractive and abstractive summarization are two different methods of creating a summary of a given input text. An output summary can be a blend of both or one or the other, but they have key features that distinguish them. Summarization systems that are much more structured than GPT-3 will often be categorized as one or the other.
Extractive summarization focuses on selecting near-exact sentences from the input text to form the summary. The architecture resembles a binary classification problem for each sentence in the text: the goal is to give a yes or no to extracting that sentence into the summary. We then use portions of those sentences, or the sentences in full, in the extracted summary.
The most popular architecture for extractive summarization is BERT as a baseline with some level of domain-specific fine-tuning or retraining on top. Deciding which sentences matter is somewhat relative and domain specific, which is why you’ll see variations for different use cases. SBERT, a library we leverage all the time, lets you create a summary while specifying how many sentences you want.
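As a rough illustration, here’s a minimal sketch of a centroid-style extractive summarizer built with SentenceTransformers. The model name, the NLTK sentence splitter, and the scoring rule are illustrative choices rather than a fixed recipe.

```python
# A minimal sketch: rank sentences by similarity to the whole document
# and keep the top few in their original order. Model name and sentence
# splitter are illustrative choices, not a production recommendation.
import nltk
from sentence_transformers import SentenceTransformer, util

nltk.download("punkt", quiet=True)
model = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_summary(text: str, num_sentences: int = 3) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= num_sentences:
        return text
    # Embed each sentence and the full document.
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    doc_embedding = model.encode(text, convert_to_tensor=True)
    # Score every sentence against the document as a whole.
    scores = util.cos_sim(doc_embedding, sentence_embeddings)[0]
    top_idx = sorted(scores.argsort(descending=True)[:num_sentences].tolist())
    # Return the highest-scoring sentences in their original order.
    return " ".join(sentences[i] for i in top_idx)
```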
Abstractive summarization generates a paraphrased summary that takes all of the information in the input text into account. This differs from extractive summarization in that it does not pull exact sentences from the text but instead produces a concise summary of everything.
The common architectures follow the transformer-focused approach we see in extractive summarization, with fine-tuning on top for specific domains such as books, research papers, financial interviews, and legal documents. These models usually support fairly large input sizes, and we’ve used larger models handling up to 4096 tokens with Longformers. It should be noted that the new GPT-3 update allows for 4000 tokens as well with the completions API.
Most GPT-3 summarizers you see are built as abstractive summarizers, given the amount of data required to fine-tune the transformer-based architectures for new use cases outside of the available datasets.
Zero-shot text summarization refers to using GPT-3 to summarize a given input text without providing any examples in the prompt. We simply give GPT-3 instructions for what we want it to do, followed by the text. In the playground example above, that means a top line stating the task, then the text to summarize; the generated summary can be seen below.
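Outside the playground, the same zero-shot setup can be recreated through the completions API with nothing more than an instruction line and the text. The sketch below is illustrative only: the engine name, prompt wording, and parameter values are assumptions rather than tuned recommendations.

```python
# A minimal zero-shot summarization call to the GPT-3 completions API.
# Engine name and parameter values are illustrative, not tuned settings.
import openai

openai.api_key = "YOUR_API_KEY"  # assumed to be set for your account

def zero_shot_summary(text: str) -> str:
    prompt = (
        "Summarize the following text for a second-grade student:\n\n"
        f"{text}\n\nSummary:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        temperature=0.7,
        max_tokens=120,
        top_p=1.0,
    )
    return response["choices"][0]["text"].strip()
```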
The GPT-3 playground provides another example of summarization that simply adds a “tl;dr” to the end of the text passage. They consider this a “no instruction” example, as no initial task is specified and it relies entirely on the underlying language model’s understanding of what “tl;dr” means.
As you can see, neither of these default playground examples uses any prompt examples above the input text we want to summarize. It’s also worth noting the rather high temperature value and the shorter max tokens setting. You’ll notice that the summary you get for either input changes if you swap in the other one’s instruction set:
As I talked about before, GPT-3 pushes the bounds of direct extractive vs abstractive summarization in a way we’re not used to seeing. This summary is certainly more extractive, as we see direct information being pulled, but we can also see how phrases are being combined.
Zero-shot GPT-3 prompts allow you to utilize the underlying model’s understanding of your task, conveyed through instructions or headers in the prompt, without the influence of examples. The model is steered only by those headers and instructions, which it uses to build its sense of what you consider correct. At the end of the day, most GPT-3 tasks are somewhat relative: what you consider a correct response, what the model considers correct, and what a client considers correct can all differ. In summarization, that can mean an emphasis on specific keywords, topics, or phrases; a specific length; or the inclusion of specific proper nouns.
The reason the two GPT-3 playground examples for summarization come out differently when we swap the input text is that GPT-3 has a different idea of what a summary for a 2nd grader looks like versus what a “tl;dr” summary looks like. Although both are summaries, we’ve asked GPT-3 for two very different responses in the model’s understanding.
Because you haven’t included any real information through examples or a more specific prompt, you’re at the mercy of the model’s underlying understanding. This makes it more difficult to steer the model towards what we might consider a good summary. This is the key flaw in zero-shot summarization: it becomes much harder to fit our prompt to a test dataset as the variation grows, and language changes to just the header and instruction make less and less difference.
This makes zero shot summarizers relatively unstable and very hard to use in production without a full natural language processing pipeline. There are some use cases where it does make sense.
Here’s another zero-shot summarizer that focuses on 5-word summaries. Although you might consider the output correct for this use case (I don’t), consider how difficult it would be to make it focus on another part of the text without steering the model through a ton of prompt language testing or prompt examples.
There are a number of use cases we’ve worked with where you don’t really have a choice, based on input document size constraints or cost considerations. Before the recent GPT-3 update, the size limitation for any prompt was 2048 tokens, or about 1250 words. Many of our use cases, such as interview summarization or legal document summarization, have input sizes that are far too big for a prompt. You either have to build out a large natural language processing pipeline (we’ll explain later) or create document chunks that are small enough to include any prompt examples as well. Raw chunking can cause a number of issues we’ll discuss.
You don’t have any prompt examples to use. This is probably the most popular reason people start out with zero shot. You need to build up examples with high data variance to leverage the production architectures. If your product is in its early stages, it’s more important to gather user feedback on the GPT-3 outputs than it is to have a perfectly refined prompt.
Another way you can use zero shot is with a fine-tuned model. Fine-tuned tokens do not cost money per run, as you pay for them up front when you create the new model, which reduces your per-run cost. You could argue this isn’t really zero shot anymore, given that you need prompt examples for training data. The steering effect a fine-tuned model has on a given task is not nearly at the level of in-prompt examples, but you could use public datasets and fit them to your use case.
The key benefit of zero-shot summarization is the ease with which you can get the system off the ground. If you understand building prompts at a high level, you can reach fairly good accuracy on a ton of different use cases. At the end of the day, the level of system you’ll need fully depends on what your task is and how much data variance you want to cover. If you’re just looking for an internal tool for simple tasks, or for a small piece of a larger pipeline, you can build and test prompt variations quickly.
For large documents, you can make your way through the text in fewer runs.
The underlying GPT-3 model actually has a pretty good grasp of both extractive and abstractive summarization at some level. This means that if you’re just looking to use baseline versions of these, without any variations for your use case, you can simply let the model do its job.
Summarization models for news articles, blog posts, legal documents, books, etc., where the input text is much larger than the max token limit, require an entirely different framework. Here are a few things to think about as you look to handle huge documents with zero shot. It’s assumed you’ve already decided you can’t use any prompt examples for this task.
You’ll need to split the input text into smaller chunks to pass through GPT-3. Should you just try to fill up as much of the prompt as you can with each run to minimize the total runs? One problem with this is that the model’s ability to understand each sentence and its importance to the overall chunk goes down as the size grows. This doesn’t affect extractive summarization as much, since you can simply increase the sentence count, but abstractive summarization takes a hit: it becomes harder to decide what information is valuable enough to fit into a summary as the “possible information pool” grows. You also limit the size of the summary that can be generated.
Smaller chunks allow for more understanding per chunk but increase the risk of splitting contextual information. Let’s say you split a dialog or topic in half when chunking. If the contextual information from that dialog or topic is small or hard to decipher per chunk, the model might not include it at all in the summary for either chunk. You’ve now taken an important part of the overall text and split its contextual information in half, reducing the model’s likelihood of considering it important. On the other hand, you might produce two chunk summaries both dominated by that dialog or topic.
I’ve now created all my summarized chunks from the long-form document. How do I condense them down into a single summary? (A sketch of this flow follows below.)
For news articles or blog posts, should you split the document based on headers or pages? With a lower amount of data and less data variance, this seems like a route that works. You do need to account for instances where a given section or page is longer than the token limit and the extra text either needs to run through as a very small chunk (a huge change in data variance) or be appended to the next chunk. This can become extremely difficult to protect against in this specific use case, as new header-based sections often take a turn in the topic discussed and can be difficult to predict.
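To make the chunk-then-combine flow behind these questions concrete, here’s a rough two-pass sketch. It reuses the zero_shot_summary helper from the earlier sketch, and the word-count chunking is a deliberate simplification; a real pipeline would count model tokens and respect sentence or topic boundaries.

```python
# A rough sketch of the chunk -> summarize -> combine flow. Chunk sizes are
# approximated by whitespace-separated words for simplicity; a production
# chunking algorithm would count model tokens and avoid splitting topics.
from typing import List

def split_into_chunks(text: str, max_words: int = 700) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_document(text: str) -> str:
    # First pass: summarize each chunk independently
    # (zero_shot_summary is the helper sketched earlier in this post).
    chunk_summaries = [zero_shot_summary(chunk) for chunk in split_into_chunks(text)]
    # Second pass: condense the chunk summaries into a single summary.
    return zero_shot_summary("\n".join(chunk_summaries))
```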
Blog post summarization on our Chatbot article using our chunking algorithm. The OpenAI engine picks up on two separate topics in the same chunk and summarizes both. On top of that, it recognizes a proper noun and is able to correlate information to it correctly.
The first important idea is to fully understand what type of summarization you’re looking for. The adjustments made to take care of the above issues depend on what your expected output is.
Answer some of the data variance questions related to what your expected result looks like. How long do you want your summary to be? How many sentences do you want to extract? Are there key features (proper nouns, key topics, specific people) that you want in your summary?
How much “adventuring” are you comfortable with the model doing when generating summaries? You’ll need an understanding of how to use temperature.
We’ve built a robust chunking-to-summary zero-shot algorithm that leverages a number of GPT-3 models in a pipeline to reach the final summary. This works for both extractive and abstractive summarization and only requires prompt instruction tuning to move across use cases. Most of the time we build a “model zoo” of prompt frameworks for different summarization use cases such as news articles, blogs, interviews, and legal documents. This framework has been tested and used in production for 30+ page interviews and 20+ page legal documents.
While the pipeline combines chunking and summarizing, the two pieces can easily be used separately: summarizing on its own handles smaller documents, and chunking on its own supports other use cases. We’ll talk more about the full implementation in the few-shot section.
Few-shot learning with GPT-3 refers to taking the underlying task-agnostic large language model and showing it, in the prompt, actual examples of how to complete the task. The model combines its trained understanding of how to predict the next token in a language sequence with the “pattern” it picks up from the prompt examples to produce a much higher accuracy result. Accuracy is an odd idea here, as the model really just follows the examples and tries to fit its quick learning to the new input. As you can imagine, if your examples are incorrect (think mislabeled sentiment analysis examples) or don’t contain the kind of output you want, you’ll get a result you don’t want.
Few-shot learning with relevant examples has been shown to boost the accuracy of GPT-3 by up to 30% for some tasks, and the boost for summarization is no different. Relevant prompt examples help guide our GPT-3 summarizer to include specific language and topics in our summary without needing overly structured prompt instructions that can lead to overfitting.
One key difference is our new ability to steer the model towards results that exactly fit what we want without needing multiple steps. We can use our examples to shape both extractive and abstractive summarizations to account for what we want. In the example below, I’ve added a few examples of space-related text and created an extractive summarizer. I’ve defined in the examples what text I care about, and the model applies that to the input text. One key way to see the effect of examples is how rigid our output becomes in terms of exact sentences. When running this same prompt and instruction zero shot, the sentences would be changed just a bit in the output: the model would use words such as “They” or “it” as it tries to make the sentences flow a bit better. Through the examples, I show the model that I want exact language pulled out, without even telling it so in the instructions.
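As a hedged illustration of how a few-shot prompt like this gets assembled, here’s a minimal sketch. The example pair, separators, engine, and parameter values are placeholders rather than our production prompt, and a real prompt would include several topically relevant examples.

```python
# A minimal sketch of a few-shot extractive summarization prompt.
# The example content, separators, engine, and parameters are placeholders.
import openai  # assumes openai.api_key is already configured

EXAMPLES = [
    {
        "text": "The James Webb Space Telescope launched in December 2021. "
                "It observes the universe in the infrared. Its mirror spans 6.5 meters.",
        "summary": "The James Webb Space Telescope launched in December 2021. "
                   "Its mirror spans 6.5 meters.",
    },
    # ...more space-related example pairs would go here...
]

def build_few_shot_prompt(input_text: str) -> str:
    header = "Extract the most important sentences from the text.\n\n"
    shots = "".join(f"Text: {ex['text']}\nSummary: {ex['summary']}\n\n" for ex in EXAMPLES)
    return header + shots + f"Text: {input_text}\nSummary:"

def few_shot_summary(input_text: str) -> str:
    response = openai.Completion.create(
        engine="davinci",
        prompt=build_few_shot_prompt(input_text),
        temperature=0.2,
        max_tokens=120,
        stop=["Text:"],
    )
    return response["choices"][0]["text"].strip()
```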
Each of the examples in this prompt has two sentences in the output, and the model picks up on that and uses two for the new result. When I add an extra sentence to each of the prompt examples, the output for our input text gains one as well.
This can be seen as both good and bad. While it gives us more control and steering, it makes it harder to let the model decide whether we should have 2 or 3 sentences. This trade-off creates new parameters to think about.
We now have new biases to account for when building prompts and/or testing our results. While these biases, such as beginning-of-text and bottom-of-text bias, existed before with our zero-shot prompts, they become much more prevalent as we start using few-shot learning. We’ll need to keep them in mind as we define our prompt example structure.
One of the main differences between the two systems is how we set up the architecture for producing summaries. It’s no secret that GPT-3 performs better when the examples provided are relevant to the input text. Examples that discuss the same topics, come from the same news article, or are otherwise closely related help GPT-3 better understand how to map input language to output language for our specific task.
Most of our full-scale production summarization architectures are few-shot learning based, as we’ve seen them produce the most flexibility and the highest “accuracy” towards our goal.
Let’s take a look at what we use in our production GPT-3 summarization pipeline. While this architecture is mostly focused on larger document summarization, it works to some extent for short documents as well. The key changes are on the chunking algorithm side, plus some prompt language tweaks. Smaller documents generally don’t need as large an architecture out of the gate (assuming the summarization use case isn’t extreme). Let’s break down the architecture.
We already talked about how prompt examples that are relevant to the input text produce much better generated results. The question is: how do we make sure we have relevant examples in our prompt for a wide range of inputs? If we can only fit so many summarization examples in our prompt, how do we cover the nearly unlimited number of topics or sources for our input text?
Prompt optimization is the task of putting relevant examples in front of our input text dynamically at run time via an algorithm. The algorithm looks at a database of prompt examples and chooses the ones most relevant to the input text based on some correlation measure. This allows us to build a completely different prompt for each input text and know it’s optimized exactly for that text.
All of our prompt optimization algorithms start with some level of semantic similarity comparison. Semantic similarity is a great way to get a high-level understanding of the shared language between two strings of any length. We’re big fans of using the SBERT SentenceTransformers framework to generate embedding vectors for comparing strings. At a high level, the architecture will look something like this.
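In code, the example-selection step of that architecture can be sketched roughly as below; the model name, the example bank contents, and the top_k value are illustrative assumptions.

```python
# A high-level sketch of prompt optimization: pick the stored prompt examples
# most semantically similar to the incoming text. Model name, example bank,
# and top_k are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A database of candidate prompt examples, each with an input text and summary.
EXAMPLE_BANK = [
    {"text": "example input 1 ...", "summary": "example summary 1 ..."},
    {"text": "example input 2 ...", "summary": "example summary 2 ..."},
    # ...ideally many more, covering a wide range of topics and sources...
]
bank_embeddings = model.encode([ex["text"] for ex in EXAMPLE_BANK], convert_to_tensor=True)

def select_relevant_examples(input_text: str, top_k: int = 3):
    # Embed the incoming text and score it against every stored example.
    query_embedding = model.encode(input_text, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, bank_embeddings, top_k=top_k)[0]
    # Return the most similar examples to place in the prompt for this run.
    return [EXAMPLE_BANK[hit["corpus_id"]] for hit in hits]
```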
Fine-tuning is another important part of steering our production GPT-3 engine towards higher quality summarizations. Fine-tuning allows us to create a custom model in GPT-3 with Davinci as the baseline. The massive benefit is that there is no token limit for fine-tuned models like there is for the prompt, which means we can have far more examples influencing our few-shot learning abilities. While I don’t think fine-tuned examples have the same level of “steering strength” as in-prompt examples, they certainly move the needle far more than the baseline Davinci model alone.
Fine-tuning is especially beneficial in our summarization use case given the length of the input text for many of these examples. Even if our input text is shorter (450 tokens), we can only fit a few examples in the prompt, whereas the fine-tuned model can hold many more. For large document use cases, the benefit of fine-tuning grows even more.
The only current downside to fine-tuning is that the instruct Davinci models are not supported. The instruct models are trained on language that fits the “instruction” format of GPT-3 a bit better than the free-flowing text the original models are trained on. This means you have to use the old-school Davinci model in your architecture if you want to leverage fine-tuning.
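For reference, here’s a minimal sketch of preparing fine-tuning data in the prompt/completion JSONL format the fine-tuning endpoint expects. The field contents are placeholders, and the exact formatting conventions (separators, stop sequences) should be checked against OpenAI’s fine-tuning documentation.

```python
# A sketch of building a prompt/completion JSONL file for GPT-3 fine-tuning.
# The prompt and completion text here are placeholders, not real training data.
import json

training_examples = [
    {
        "prompt": "Summarize the following contract clause:\n\n<clause text>\n\nSummary:",
        "completion": " <reference summary written by a domain expert>",
    },
    # ...many more prompt/completion pairs...
]

with open("summarization_finetune.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# The file can then be uploaded and used to create a fine-tuned Davinci model,
# for example via the OpenAI CLI:
#   openai api fine_tunes.create -t summarization_finetune.jsonl -m davinci
```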
Almost every large-text-supporting GPT-3 architecture we build uses a chunking algorithm at some level, depending on the use case. At a high level, a chunking algorithm simply breaks a large text into pieces that can be processed by a GPT-3 task, with an output algorithm then combining the results back together. These chunking algorithms can be as simple as splitting the document into smaller sizes, or as involved as using multiple NLP models to decide where to split a document.
To us, the chunking algorithm is the most important part of the entire pipeline. It might seem like a relatively simple or unimportant part of the equation, but its location in the pipeline as well as its ability to be the variance control or “ground truth” of inputs makes it incredibly important to the accuracy of your summarization.
The first reason a high-quality chunking algorithm is so vital is that it’s the first process in your pipeline, so changes to its parameters and abilities affect all the downstream models. Changing the look or structure of its output usually means we have to retrain or adjust our GPT-3 models, because the downstream models were refined for a specific input structure or data size which has now changed, lowering accuracy. An example would look like this:
Our original chunking algorithm just splits the document into simple 500-token chunks, with any extra text at the end either folded into the last chunk (up to 650 tokens) or kept as a smaller chunk. All of our downstream summarization models are refined and tested with this algorithm. Through some split testing, we see that one of our summarization models works better when a topic summary is added to the front of the input text. If we change the chunking algorithm to include that topic summary, we need to refine any downstream models that use its output.
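A simple version of that original 500/650-token rule might be sketched like this; using the GPT-2 tokenizer to approximate GPT-3’s token counts is an assumption made for illustration.

```python
# A sketch of the simple chunking rule described above: fixed 500-token chunks,
# with a short remainder either merged into the previous chunk (up to 650
# tokens) or kept as its own smaller chunk. The GPT-2 tokenizer is used here
# as an approximation of GPT-3's tokenization.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def chunk_document(text: str, chunk_size: int = 500, max_size: int = 650):
    tokens = tokenizer.encode(text)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # If the trailing chunk is small enough, fold it into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) + len(chunks[-2]) <= max_size:
        chunks[-2].extend(chunks.pop())
    return [tokenizer.decode(chunk) for chunk in chunks]
```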
The second key reason is that even changes to something as simple as the chunk size require refinement downstream. The GPT-3 models are used to a certain level of variance through processes such as prompt optimization and fine-tuning. Changes to the input text size could cause the accuracy to go down. An example could look like this:
The average number of topics discussed in an interview transcript chunk was 1.57 with our original chunking algorithm. We change the chunk size to reduce the number of runs needed to produce a summary. The new average is 2.43 topics per chunk, and the GPT-3 models have not built a strong enough understanding of how to summarize chunks this diverse.
Another reason chunking algorithms are so important is that different models can show different accuracy under different chunking algorithms. Some tasks in a pipeline react much better to certain chunking algorithms than others, which makes it important to fully grasp how flexible your downstream models are to changes.
The key GPT-3 parameter is temperature. Temperature controls how much the model is allowed to “adventure,” or take less common routes, while generating tokens. At a deeper level, this means how often GPT-3 chooses a less favorable (lower probability) token when generating the next one in a sequence. A temperature of 0 means it never chooses anything but the highest-probability token, while 1 means it does so most often. This leads to less common routes: once a lower-probability token is chosen, the following tokens have to fit that token, which normally produces more diverse responses.
In summarization, the temperature parameter often controls how far the model sways from the exact language used in the input text. It can try to formulate its own understanding of the input text, combining multiple key points to get there. The temperature you use fully depends on what your goal summary looks like and how much variation in the summary you’re okay with.
Occasionally we’ll use the “best of” parameter, as it’s useful to generate multiple responses and evaluate them. This does come at an increase in cost, as the prompt will run multiple times.
To clean up the generated output, or to decide if we need to rerun based on some heuristic metric, we’ve developed an output processing module. This is commonly built using popular NLP libraries such as spaCy or BERT. We’ve also used it to evaluate different split-tested fine-tuned models.
We’ve built a confidence metrics module that lets you put a confidence score on each run that comes through. This can be used as an accuracy metric shown to users, a trigger for backend processes, or simply a metric for analyzing production runs. We usually build these custom for the summarization use case, as the required NLP models can vary between extractive and abstractive.
This is a module you don’t see built often, and it is not available anywhere from OpenAI. Confidence in the world of autoregressive models can be a bit ambiguous. Considering that a lot of GPT-3 tasks have a “relative” accuracy, meaning what I consider right, what you consider right, and what GPT-3 considers right can all be different, it can be hard to establish criteria to measure accuracy. Summarization falls right into this category, as it’s not easy to establish a method for recognizing a good summary other than just saying yes or no.
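As one hedged example of what such a heuristic could look like, the sketch below blends semantic similarity with a proper-noun overlap check; the components and weights are illustrative assumptions, not our production metric, and it assumes the spaCy model has been downloaded.

```python
# A simplified sketch of a summarization confidence score: semantic similarity
# between summary and source, blended with a check that named entities in the
# summary actually appear in the source. Weights and components are illustrative.
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def confidence_score(source_text: str, summary: str) -> float:
    # Semantic similarity between the generated summary and the source text.
    source_emb = embedder.encode(source_text, convert_to_tensor=True)
    summary_emb = embedder.encode(summary, convert_to_tensor=True)
    similarity = float(util.cos_sim(source_emb, summary_emb))
    # Fraction of named entities in the summary that also appear in the source,
    # as a rough guard against hallucinated proper nouns.
    summary_ents = {ent.text.lower() for ent in nlp(summary).ents}
    source_lower = source_text.lower()
    entity_overlap = (
        sum(ent in source_lower for ent in summary_ents) / len(summary_ents)
        if summary_ents else 1.0
    )
    # Weighted blend; in practice the weights would be tuned per use case.
    return 0.6 * similarity + 0.4 * entity_overlap
```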
Here are a few zero shot and few shot GPT-3 based summarization outputs that we’ve used.
The model creates a variation of extractive summarization that condenses blog post content into a few bullet points. The architecture is very lightweight and relies on strong prompt language that requires only a few examples. The data variance coverage is pretty high, but it does lose accuracy for sites with comment sections attached to the end.
These are a few summarization bullet points generated from an article we wrote on different NLP techniques for information extraction. The summary does a great job of covering the various techniques discussed throughout the article and fully understanding that they relate to NLP. It also includes some of the benefits that are discussed, and even parts where we described system enhancements via machine learning and rule-based systems.
We combined a complex chunking algorithm with a multistep summarization pipeline to create an extractive type summarization of the key topics discussed throughout long financial interviews. Many of these interviews are over 25 pages long and have multiple turns and dialog switches throughout, making it that much harder to build a chunking algorithm.
The format of an interview also makes it tricky to establish consistency as the back and forth Q&A format can make it difficult for an NLP model to understand the contextual similarity between multiple questions and answers.
There's been a ton of work done recently dealing with long form clinical diagnostic interviews and being able to extract key questions and answers to form an opinion of the current state of the patient. This has been blended with knowledge-infused abstractive summarization to create summaries of the conversation that help medical professionals perform a much better follow-up with patients. The key approach outlined in this research paper takes domain specific knowledge and incorporates it with a linear programming framework used to optimize informativeness. The research team uses three main approaches:
Extractive summarization with the SumBasic algorithms
Abstractive summarization using integer linear programming without the infusion of knowledge
Abstraction over extractive summarization to evaluate the performance of the algorithm.
The knowledge-infused abstractive summarization algorithm produces a 7-sentence summary that focuses on including the questions and responses exchanged during the conversation.
OpenAI released a book summarization method that uses a fine-tuned GPT-3 model to generate quality summaries of entire books, with state-of-the-art accuracy on the BookSum dataset. The model matches the quality of a summary written by someone who read the book 5% of the time, and comes in slightly below that 15% of the time.
The summarization method combines reinforcement learning from human feedback with recursive task decomposition, which allows human preferences about what information matters in a summary to be combined with breaking a difficult high-level task into smaller, easier ones. They describe it as breaking up the summarization of a long piece of text into the summarization of several shorter pieces (sounds like something we’ve talked about!). When dealing with larger input texts such as books, this method even has benefits during testing and evaluation. If the model focused on the entire book at once, the human evaluating the summary would have to read the entire book to judge whether the summary is relevant enough. With smaller chunks, the human evaluator just has to understand and provide feedback on the smaller input section. Recursive task decomposition is described as having these benefits as well:
It is easier to trace and understand the text-to-summary process. Mapping how the input text becomes the reference summaries, in terms of key information, becomes much easier and more efficient when dealing with smaller summaries.
OpenAI claims there are no limits to the size of the book they can summarize. The only constraint in an equation like this is the context length of the transformer, which task decomposition removes to some extent.
We build custom NLP and GPT-3 based summarization architectures across different domains and use cases. We’d love to learn about what you’re trying to summarize and how our knowledge and models can help move you toward your goals. Reach out today to talk about how you can use text summarization.