Hands-On Expert-Level Dialogue Summarization Using LLMs
Discover how we use LLMs for robust, high-quality dialogue summarization and understanding for our clients across industries.
Day-to-day sales and meeting conversations between your employees and your clients, or your customers and your support teams, are all rich with information that's potentially useful to your business. You can get new ideas to improve your quality of service, bring new products to market, or offer new services that are in demand.
But extracting insights from such information is not easy by any means. Most business conversations are mixed with tangential conversations, personal banter, filler words, and similar noise that trips up regular language processing tools. In this article on dialogue summarization with GPT-4, we explore whether OpenAI's GPT-4 large language model is capable of looking past such noise and giving you the insights you need for relevant tasks.
There are three approaches to dialogue summarization with GPT-4: zero-shot, few-shot, and fine-tuned. In this article, we focus on the first two approaches.
In zero-shot summarization, the dialogue to be analyzed is fed directly to GPT-4 with an appropriate prompt. You rely entirely on GPT-4's ability to follow all the instructions in the prompt. For simple goals, the instructions in the prompt are often sufficient. But for more complex goals, even detailed instructions may not be enough. That's when you need the few-shot approach.
In the few-shot approach, the prompt contains both instructions and input-output examples that demonstrate what you want GPT-4 to do. GPT-4 infers what you want from the patterns in those input-output pairs and reproduces those same patterns for your target dialogue.
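To make the distinction concrete, here is a minimal sketch of both setups using the OpenAI Python client. The helper name, system message, and prompt wording are illustrative assumptions, not our production prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def summarize(dialogue: str, examples: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot when `examples` is None; few-shot when input-output pairs are given."""
    messages = [{"role": "system", "content": "You summarize business dialogues."}]
    for example_dialogue, example_summary in examples or []:
        # Few-shot: each pair demonstrates the pattern GPT-4 should reproduce.
        messages.append({"role": "user", "content": example_dialogue})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": dialogue + "\n\nSummarize the dialogue above."})
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0
    )
    return response.choices[0].message.content
```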
In the sections below, we explore zero-shot and few-shot summarization on various use cases.
For our first use case, we select the transcript of a client meeting. The meeting is about implementing summarization for a client who specializes in sports commentaries.
The meeting audio was transcribed to text by a speech-to-text model. The full transcript is shown below:
Knowing the key topics in this transcript helps us evaluate the generated summaries later. Some key topics are:
By default, GPT-4 tends to rephrase conversation sections when asked to summarize. GPT-4 can do strictly extractive summarization if told to select sentences rather than write or generate anything.
This prompt helps curb GPT-4's tendency to rephrase and makes it simply select key sentences:
"From the meeting transcript above select the key sentences that convey the gist of the meeting and output them verbatim without any changes, paraphrasing, or rephrasing."
The summary generated by this prompt for our meeting transcript is evaluated next.
The extracted summary is shown below:
As instructed, it has retained sentences from the transcript verbatim in the summary. It does a good job of covering what the service provider does and who they work with, what the customer is looking for, and what the solution would look like at a high level. It even pulls in some key information from the service provider about specific details of what goes into the solution.
In terms of the number of key topics covered, this summary scores about five out of seven.
One topic it missed is data confidentiality. In the transcript, the client talks for a considerable time about data confidentiality and ownership worries. In this summary, however, those concerns are alluded to only indirectly through a reference to potential hurdles with the client's legal team. It's a critical topic that's been ignored.
The other topic it partially missed is integrating the solution with the client's systems. It extracts one sentence about the solution living in the client's system but doesn't go any further. This could be tuned through aspect-based summarization prompts that let us focus a bit more on the specific industry. Let's look at how we can quickly improve this workflow.
We can improve the model's zero-shot understanding of what information is important in our summary through a new prompt workflow. We first ask GPT-4 to provide us with a list of the key topics discussed in the meeting. You could make this more domain-specific if you want, but I kept it generic here to show how well the model understands the topics on its own.
We have life! Our topic extraction picked up the missed data privacy and ownership concerns as a key topic. It also captures the topics we were already able to pull, meaning we won't lose any context in our next step.
Now we run our same extractive summarization with a small tweak to account for the key topics. We provide the transcript as well as the list of topics. The full production pipeline runs these two steps back to back.
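A minimal sketch of that two-step pipeline is below. The `ask_gpt4` helper and the exact prompt wording are assumptions for illustration; our production prompts are more elaborate.

```python
def ask_gpt4(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def topic_guided_extractive_summary(transcript: str) -> str:
    # Step 1: pull a generic list of key topics from the meeting.
    topics = ask_gpt4(f"{transcript}\n\nList the key topics discussed in the meeting above.")
    # Step 2: rerun the extractive prompt, now anchored to those topics.
    return ask_gpt4(
        f"{transcript}\n\nKey topics discussed in the meeting:\n{topics}\n\n"
        "From the meeting transcript above select the key sentences that convey the gist "
        "of the meeting, covering each key topic listed, and output them verbatim."
    )
```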
We can see the extracted sentences now include information about AWS and data privacy. I'd still like to replace a few of these sentences with more valuable ones, but for zero-shot with a very simple prompt, it's pretty good. This prompt could still be improved with an aspect-based approach (we'll see that below) to hone in on the specific industry and use case. Here's the output summary from an aspect-based prompt we use for our specific transcripts focused on customer discovery.
Next, we'll see how abstractive summarization fares.
For abstractive summarization, we use this simple prompt:
"Write a summary of the conversation above that focuses on the key information discussed."
This is a simple soft prompt that could be run through our prompt optimization framework to create a much more dataset-specific prompt.
The abstractive summary is shown below:
This is a good summary with six out of seven topics covered. It also focuses on the data confidentiality worries that took up considerable time and includes a summary of the offered solution.
The flow between sentences is also excellent; every sentence is logically related to the previous one. In a zero-shot environment, abstractive summaries generally outperform extractive summaries, since they get to blend various sentences together and rewrite key ideas. Abstractive summaries let the model “ingest” the data in a way it prefers, whereas extractive summaries must reproduce sentences as-is, which can read a bit awkwardly with no examples, even if the model fully understands the key topics. The only change I would consider is that it doesn't go into great detail on the client's own peculiar challenges in integrating the solution with their existing systems.
The transcripts used above did not require chunking with GPT-4, since they were only about 30 minutes long. But longer dialogues, such as webinars or extended meetings, don't fit in GPT-4's context window. For these, we use chunking to split the text up and summarize it in chunks, then use a model at the end of the pipeline to combine the smaller summaries into one. Here's the architecture diagram I built for this back in 2021.
One of the questions clients ask me constantly is what to do about document chunking. How large should my chunks be? Should I use a static chunk size or a sliding window based on content? How do I make each chunk contain as much valuable content as possible while limiting repeated information across chunks, or avoid losing an entire key topic because it was split between multiple chunks? Chunking is a critical part of the equation and really drives how well the summarization performs downstream, especially when we combine summaries based on the key context in smaller chunks.
Let’s look at a chunking framework that works well for dialogue examples. The goal of this framework is:
Note: This approach works best with chunks that are a larger percentage of the entire dialogue. I always suggest going with larger chunks over smaller ones, given what is discussed above.
We split the dialogue into chunks of a set size based on a blend of word count, keywords, and aspects, as sketched below.
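As a rough illustration, here is a greedy word-count splitter that never breaks a speaker turn. It's a simplified stand-in for our blended word-count/keyword/aspect splitter, not the production version:

```python
def split_dialogue(turns: list[str], max_words: int = 1500) -> list[str]:
    """Greedy chunking by word count that keeps speaker turns intact."""
    chunks, current, count = [], [], 0
    for turn in turns:
        words = len(turn.split())
        if current and count + words > max_words:
            # Current chunk is full; close it before starting the next turn.
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```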
We then extract the number one key topic from the chunks above and below the current one. This prompt differs from the one used earlier: we want to extract the information in a way that tells the model there is other relevant information outside this one key topic, and that this key topic either correlates highly with the current dialogue chunk or doesn't. Another option is to extract the number one key topic from every chunk above and below and combine them into a single “explainer”-style topic that reads a bit more like an abstract summary. The goal of this step should make sense: we can now add a bit of information about what is going on in the rest of the dialogue without bogging down or “overpowering” the current dialogue chunk's context.
The blending prompt blends the number one key topic into the current dialogue chunk. We use GPT-4 to generate new language telling the model whether this information has already been discussed, based on what the current dialogue chunk covers. Again, this helps the model understand which information has already been deemed highly valuable and whether we should extract it from the current chunk when doing extractive summarization.
The outcome is a new current dialogue chunk enriched with a bit more information about the topics discussed above and below it. You'll see that, when done correctly, this greatly affects which key topics are chosen for the specific chunk. It also works much better than the commonly seen approach of a sliding window over all generated key topics from the chunks above plus the current chunk: we now have context for what comes below the chunk, and we aren't bogged down by a pile of semantically similar key topics from the chunks above. A simplified sketch of this blending step follows.
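Both prompts below are paraphrases of the idea rather than our exact production prompts, and `ask_gpt4` is the helper sketched earlier:

```python
def blend_neighbor_context(chunks: list[str], index: int) -> str:
    """Prepend context about the #1 topics of the surrounding chunks to the current chunk."""
    neighbor_topics = []
    for neighbor in (index - 1, index + 1):
        if 0 <= neighbor < len(chunks):
            neighbor_topics.append(ask_gpt4(
                f"{chunks[neighbor]}\n\nOther topics are discussed elsewhere in this "
                "dialogue, but state the single most important topic of the section "
                "above in one sentence."
            ))
    # Have GPT-4 phrase the neighbor topics as context, noting whether each one
    # has already come up in the current chunk.
    preamble = ask_gpt4(
        "Topics discussed elsewhere in this dialogue:\n" + "\n".join(neighbor_topics)
        + f"\n\nCurrent dialogue section:\n{chunks[index]}\n\n"
        "In one or two sentences, state whether each topic above has already been "
        "discussed in the current section or not."
    )
    return f"{preamble}\n\n{chunks[index]}"
```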
Next, we explore the summarization of another kind of dialogue: chats with customer support channels or customer chatbots.
The ability to extract key information from customer chats can help companies improve both the quality of their products or services and their customer service.
For our experiments, we run zero-shot extractive and abstractive summarization on conversations between customers and various customer support channels on Twitter.
The TweetSumm dataset from Hugging Face contains about 1,000 customer support conversations with the Twitter support channels of different companies. Each conversation has about three extractive and three abstractive summaries created by human annotators.
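If you want to follow along, the dataset can be pulled with the Hugging Face `datasets` library. Several community mirrors of TweetSumm exist on the Hub, so the dataset id below is a placeholder to verify before use:

```python
from datasets import load_dataset

# Placeholder id: search the Hub for "tweetsumm" and pick the mirror you trust.
tweetsumm = load_dataset("tweetsumm")  # hypothetical dataset id

example = tweetsumm["train"][0]
print(example)  # a conversation plus its human extractive and abstractive summaries
```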
Here’s an example conversation:
Let's find out how GPT-4's extractive and abstractive summarization fare on this.
The reference, human-annotated extractive summaries for the above conversation read like this:
We use the following prompt to tell GPT-4 to select the key details:
"Select the two exact key sentences from the customer support chat above that best describe key custom support issue discussed."
Normally my goal-state output definition would be a bit less granular, so it covers a wider variance of conversation-log use cases, but I do think this one could be reused across a number of support issues or channels.
The extracted summary by GPT-4 looks like this:
I asked the model to explain its result as a way to further understand the model's selection. This can be used in testing and prompt optimization to iterate the prompt toward a goal-state output.
The output is very good. These are clearly the key sentences from the conversation that best describe the issue. I do wish the model had used a sentence from both the customer and the support agent, but if we ask for three sentences, it does exactly that.
Let’s take a look at the results when focusing on an abstractive summary.
For the abstractive summary, we use the following prompt:
"For the customer chat above, write a summary with a maximum of 2 sentences on the customer's problem, focusing on the key information. On a new line, write a single-sentence summary of the solution focusing on the key information."
The human-annotated, reference, abstractive summaries for this conversation are shown below:
Here’s the abstractive summary generated by GPT-4:
It describes the core problem accurately and matches the reference summaries as well.
It has also followed the formatting instructions we gave to write the solution on a separate line.
Aspect-based summarization, when done correctly, is a great way to extract different summaries from the same text with zero changes to the prompt structure beyond which aspect to focus on. This lets you handle different summarization use cases without building an entirely new system for each one, while the resulting summaries remain highly differentiated across the chosen aspects.
Here we outline a prompt in a format that lets us easily swap the key aspect to focus on without rewriting the prompt language or making the prompt read awkwardly.
The prompt structure looks like this:
We use a variable {aspect} as our adjustable aspect to extract. Beyond zero-shot environments, when the data is available, aspect-based summarization is also a good use case for few-shot prompts: a few examples help the model fully understand how to correlate exact sentences with more difficult or “abstract” aspects. We could use a dynamic prompt if we had a database of example summaries that make sense. A sketch of the template is below.
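Here's a minimal sketch of that template, reusing the `ask_gpt4` helper from earlier; the exact prompt wording is an illustrative assumption:

```python
ASPECT_PROMPT = (
    "Select the two exact key sentences from the customer support chat above "
    "that are most relevant to the following aspect: {aspect}."
)

def aspect_summary(chat: str, aspect: str) -> str:
    # Only the aspect changes between use cases; the prompt structure stays fixed.
    return ask_gpt4(f"{chat}\n\n" + ASPECT_PROMPT.format(aspect=aspect))

# Swapping the aspect requires no change to the prompt structure:
# aspect_summary(chat, "console sleep mode")
# aspect_summary(chat, "various console modes")
```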
Let’s use an interesting aspect that is specific to this use case. What’s the point of showing examples of generic summarization prompts now that we can quickly substitute in various focus points? Here we focus on console sleep mode.
This is a pretty awesome zero-shot example. The model is smart enough to know we don’t need to match the exact keyword. It even understands that the console turning off should be correlated with sleep mode! These are the only two sentences related specifically to sleep mode; asking for four sentences returns sentences that aren’t nearly as correlated with sleep mode.
When I widen the data variance coverage of the aspect to “various console modes,” we can see it extracts sentences related to other console modes and operations to perform.
One thing I constantly see from clients is accuracy issues directly correlated with simple grammar problems or poorly communicated instructions in the prompt. As good as GPT-4 is at understanding a wide range of English language “levels,” it is still a language model that prefers clear instructions and correct grammar. If I remove the phrase “exact key” from the prompt, it produces a completely different list of sentences, ones not nearly as correlated with the aspect as the example above. Little things like that go a long way.
A big part of production-level prompting is just being clear about what the goal-state output should be, and iterating on it when accuracy doesn’t reach the level we’re looking for. Making these parts of our prompt as concrete as possible even lets us use aspects that are very abstract and don’t name a single identifiable aspect, forcing the model to decide that as well. Here’s an example:
This same aspect-based workflow can be used for key topic extraction: instead of asking for key sentences, we ask for key topics. This is a great way to extract high-level, short descriptions of the entire customer discussion. You’ll notice these key topics follow the flow of the conversation, so you get a grasp of the topics throughout.
I got tricky with this one and decided I wanted any key topics except those related to the Kinect, a difficult task when you consider the entire conversation revolves around it.
While I kept this negative prompting framework private, it follows the concepts from one of my favorite prompting articles by Allen Roush.
Want to integrate these workflows into your product?
Contact us for expert help on streamlining your business workflows with GPT-4 and custom models.