Hands-On Expert-Level Meeting Summarization Using LLMs
Discover how we use LLMs for robust, high-quality meeting summarization and understanding for sales teams and other businesses.
Every day, your sales organization conducts countless meetings, each one brimming with valuable insights and action items that help move your deals forward. These insights can help save time and resources spent in the sales cycle and improve conversion rates. However, distilling these insights from the sea of conversation is no easy task. In this article, we delve into how OpenAI's GPT-4 can be utilized to summarize meeting transcripts, cutting through the noise to deliver the insights you need.
There are three primary methods to utilize GPT-4 for meeting summarization: zero-shot, few-shot, and fine-tuned approaches. In this article, we will focus on the first two methods.
In zero-shot summarization, the meeting transcript is fed directly to GPT-4 with a task-focused prompt, and the model is relied upon to follow the prompt's instructions describing how to reach our goal output. For simple tasks, prompt instructions are often sufficient. However, for more complex tasks, even detailed instructions may not be enough, necessitating the few-shot approach or prompt variables. We’ve even started using standard operating procedures (SOPs) to help the model better understand how a manual operator would perform the task.
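To make this concrete, here is a minimal sketch of a zero-shot call using the OpenAI Python client. The prompt wording and function name are illustrative, not our production prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_summary(transcript: str) -> str:
    """Single-pass summarization: instructions only, no examples."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "You summarize sales meeting transcripts."},
            {"role": "user", "content": (
                "Summarize the key insights, decisions, and action items in the "
                f"meeting transcript below.\n\nTranscript:\n{transcript}"
            )},
        ],
    )
    return response.choices[0].message.content
```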
In the few-shot approach, the prompt includes both instructions and input-output examples to guide GPT-4 towards our goal state output. The model can infer the desired output based on the patterns in these examples and apply these patterns to the target input meeting transcript. These task completion examples help steer the model towards what we consider a successful output for our input data.
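A few-shot variant of the same call might look like the sketch below; the example pair is a placeholder standing in for a real input-output example, not actual client data.

```python
def few_shot_summary(transcript: str) -> str:
    """Few-shot summarization: an input-output example pair steers the model
    toward the desired output pattern before it sees the target transcript."""
    response = client.chat.completions.create(  # `client` from the sketch above
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "You summarize sales meeting transcripts."},
            # One worked example; production prompts may include several.
            {"role": "user", "content": "Transcript:\n<short example transcript>"},
            {"role": "assistant", "content": "<example summary in the desired format>"},
            # The target input, formatted identically to the example.
            {"role": "user", "content": f"Transcript:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content
```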
In the following sections, we will explore zero-shot and aspect-based summarization across a variety of meeting scenarios.
For our first use case, we take the transcript of a client meeting. The meeting is about developing a summarization solution for a client in the sports industry.
The meeting audio was transcribed to text using the Whisper API. The full transcript is shown below with fake names to protect the actual identities and businesses.
Identifying the key topics manually helps us set a baseline for what we would consider good topics to cover in a summary. Since we worked with this customer and put together a successful proposal, we’ve got a pretty good idea of what information was important.
The key topics include:
Naturally, GPT-4 has a tendency to rephrase segments of the conversation when tasked with summarization. However, GPT-4 can also perform a more direct form of summarization if it is instructed to pick out sentences instead of creating new ones.
We guide GPT-4 in this process by using the following prompt:
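The exact production prompt isn’t reproduced in this text, so the sketch below is an illustrative stand-in for an extractive prompt of this kind; the sentence count and wording are assumptions.

```python
# Illustrative extractive prompt: the key constraint is that the model must
# copy sentences verbatim rather than paraphrase them.
extractive_prompt = (
    "You are given a meeting transcript. Select the 10 sentences that best "
    "summarize the meeting. Copy each sentence exactly as it appears in the "
    "transcript; do not rewrite, merge, or create new sentences.\n\n"
    "Transcript:\n{transcript}"
)
```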
We then evaluate the summary that this prompt generates from our meeting transcript.
The output zero-shot summary is shown below:
As per the instructions, the summary has maintained the original sentences from the transcript. It effectively highlights the service provider's role and clientele, the customer's requirements, and a high-level overview of the proposed solution. It also includes some specific details provided by the service provider about the solution's components.
When it comes to covering the main topics, this summary manages to touch upon five out of the seven key points.
However, the summary falls short on the topic of data confidentiality. Despite the client dedicating a significant portion of the meeting to discussing concerns about data confidentiality and ownership, the summary only indirectly references this through a mention of potential legal obstacles. This important issue is insufficiently covered.
Another topic that isn't fully covered is the integration of the proposed solution with the client's existing systems. The summary includes one sentence about the solution being incorporated into the client’s system, but it doesn't expand on this. To improve this, we could employ aspect-based summarization prompts that allow us to concentrate more on the specifics of the industry. Let's delve into how we can refine this process.
We can enhance GPT-4’s understanding of what information is important in our summary through a new multi-step prompt workflow. We first ask GPT-4 to provide a list of the key topics discussed in the meeting. This step could be iterated on with an aspect-based or industry-specific prompt if you want to constrain the data variance available at the input.
Our topic identification process successfully highlighted the overlooked concerns around data confidentiality and ownership as a critical subject. It also efficiently captured the other topics we had already drawn out, ensuring we maintain comprehensive context moving forward. This list can now be used as a new prompt variable for our extraction model, carrying pieces of information that the extractive summarization step did not previously consider valuable. I love workflows like this: because we didn’t widen the input data variance, they create very few new edge cases while improving the accuracy of our good results. In other words, results that covered 1 or 2 of the 7 topics rarely get worse, while results at 5 or 6 of 7 go up.
Next, we execute our standard direct summarization process with a minor modification to consider the key topics: we supply both the transcript and the list of topics to GPT-4. In our full production workflow, these two steps are chained in succession so they behave as a single workflow.
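A minimal sketch of that two-step chain, reusing the `client` from the first sketch, could look like this; the prompt wording is illustrative.

```python
def topic_aware_summary(transcript: str) -> str:
    """Two-step workflow: extract key topics, then pass them back in as a
    prompt variable for extractive summarization."""
    # Step 1: topic identification.
    topics = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": (
            f"List the key topics discussed in this meeting transcript:\n{transcript}"
        )}],
    ).choices[0].message.content

    # Step 2: extractive summarization, conditioned on the topic list.
    return client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": (
            "Select the sentences from the transcript that best summarize the "
            "meeting, making sure each of these key topics is covered:\n"
            f"{topics}\n\nCopy sentences verbatim.\n\nTranscript:\n{transcript}"
        )}],
    ).choices[0].message.content
```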
The summarization now includes key sentences pertaining to AWS IAM Roles and data privacy concerns. While this is pretty good, there's still room for refinement. Some of these sentences could be replaced with others that provide more valuable information. However, for a zero-shot approach with a straightforward prompt, the results are really good. The prompt can be further enhanced with an aspect-based approach which we will discuss later. This approach would allow us to focus more on the specific industry and the use case at hand. Here's an example of a summary output from an aspect-based prompt that we've used specifically for transcripts revolving around business strategy development.
Next, we’ll take a look at how abstractive summarization looks.
For abstractive summarization of the same meeting transcript, we use a simple prompt that drops the exact-sentence requirement while keeping the same goal state output definition.
In client-specific products, this soft prompt could be run through our prompt optimization framework to create a much more dataset-specific prompt. Using “write a summary” instead of “generate a summary” also produces a much better output, one that reads like an article abstract.
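An illustrative stand-in for this kind of abstractive prompt, incorporating the “write a summary” phrasing, might be:

```python
# Illustrative abstractive prompt: no verbatim-sentence constraint, and
# "Write a summary" rather than "Generate a summary", per the note above.
abstractive_prompt = (
    "Write a summary of the meeting transcript below. It should read like an "
    "article abstract, covering the key topics, decisions, and next steps.\n\n"
    "Transcript:\n{transcript}"
)
```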
The abstract summary output looks like this:
This summary does a commendable job covering six out of the seven main topics. It successfully highlights the lengthy discussion on data confidentiality concerns and provides a snapshot of the proposed solution.
The coherence between sentences is impressive. Each sentence logically flows into the next, providing a clear and concise overview. In a zero-shot setting, abstractive summaries often have the upper hand over extractive summaries because they can weave together various sentences and reformulate key points. Abstractive summaries can process the data in a way the model finds more natural, while extractive summaries must present source sentences verbatim. This can sometimes result in a somewhat disjointed read, even if the model grasps the key topics fully. The only area for improvement would be a more in-depth exploration of the client's specific challenges in incorporating the solution into their existing systems.
Next, we explore the summarization of longer dialogues such as webinars or extended meetings.
While the meeting transcripts we've covered so far, at 32 to 40 minutes each, didn't necessitate breaking up the text for GPT-4, lengthier meetings and interviews present a different challenge. These longer dialogues exceed GPT-4's context window, necessitating a different approach.
In such cases, we use a technique called chunking to break the text into manageable pieces. Each segment is then processed and summarized individually by GPT-4. However, having multiple summaries for a single meeting based on each chunk isn't practical, so we employ an additional model at the end of our pipeline that focuses on creating a single combinator summary. This model's function is to intelligently merge the smaller summaries into one coherent summary that accurately represents the entire meeting.
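A simplified sketch of this pipeline is below; the naive word-count chunker stands in for the blended-score chunking framework described later, and the merge prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def summarize_long_transcript(transcript: str, chunk_words: int = 6000) -> str:
    """Chunk the transcript, summarize each chunk, then merge the partial
    summaries with a final combinator pass."""
    words = transcript.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]

    partial_summaries = []
    for chunk in chunks:
        partial_summaries.append(client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content":
                       f"Summarize this segment of a meeting:\n{chunk}"}],
        ).choices[0].message.content)

    # Combinator model: merge the per-chunk summaries into one coherent summary.
    return client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": (
            "Merge these partial summaries of a single meeting into one coherent "
            "summary, removing repetition:\n\n" + "\n\n".join(partial_summaries)
        )}],
    ).choices[0].message.content
```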
To illustrate this workflow, I created an architecture diagram back in 2021 that provides a visual representation of the entire process.
A common question I’m asked by clients revolves around the process and goals of meeting chunking. How large should text chunks be relative to the model and the meeting? Should they be a static size, or a sliding window based on content? How can each segment encapsulate as much relevant context as possible while avoiding repeated information across chunks or the loss of a key topic split between multiple chunks? Chunking is a crucial component of the process and significantly influences downstream summarization performance. This is particularly true when merging summaries that are based on the key context found in smaller chunks or phrases.
Let’s look at a chunking framework that works well for meeting transcripts. The goal of this framework is:
Note: This chunking approach works best with chunks that are a larger proportion of the entire meeting. For pretty much any document or transcript chunking use case, I recommend using larger chunks. While extremely long chunks can run into comprehension issues due to generalization, this is mostly seen in chunks that approach the context limits. This work is outlined in this amazing research paper.
First, we segment the meeting transcript into chunks sized by a blended score of word count, relevant keywords or topics, and use-case-specific aspects (optional).
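As a toy example of what such a blended score might look like (the formula and weights here are assumptions, not our production scoring):

```python
def blended_chunk_score(segment: str, keywords: set[str],
                        target_words: int = 800) -> float:
    """Toy blended score: balance closeness to a target word count against
    keyword/topic density. Higher is better; weights are illustrative."""
    words = segment.lower().split()
    size_score = 1.0 - abs(len(words) - target_words) / target_words
    keyword_score = sum(w in keywords for w in words) / max(len(words), 1)
    return 0.7 * size_score + 0.3 * keyword_score
```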
Next, we pinpoint the most relevant topic from the chunks immediately preceding and following the current one. This differs from the previous prompt: we want to communicate to the model that there is more to the meeting than the current chunk's primary topic, and that this neighboring topic either strongly aligns with the current meeting chunk or it does not. An alternative method is to identify the most crucial topic from all chunks above and below and fuse them into a single “descriptive” style topic that resembles a more condensed summary. Either way, we can inject a bit of context about the overall meeting into the chunk without overshadowing or dominating the key context of the current meeting chunk.
We utilize the blending prompt to infuse the key topic from the rest of the meeting transcript into the current meeting chunk. The GPT model generates new, contextually relevant language that indicates whether the information has already been discussed, based on the content of the current chunk. This step helps the model discern which information is already deemed significant and whether it should be included in the current chunk during extractive summarization.
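An illustrative stand-in for a blending prompt of this kind:

```python
# Illustrative blending prompt: injects the most relevant neighboring topic
# into the current chunk without overwriting the chunk's own content.
blending_prompt = (
    "Below is one chunk of a longer meeting transcript, along with the most "
    "relevant topic from the chunks before and after it. Add one or two "
    "sentences of context to the start of the chunk noting whether that topic "
    "has already been discussed elsewhere in the meeting. Do not change the "
    "chunk's own content.\n\n"
    "Neighboring topic: {neighbor_topic}\n\nCurrent chunk:\n{chunk}"
)
```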
The result is an updated current meeting transcript chunk that provides more insight into the topics discussed in the preceding and following chunks. When executed correctly, this process significantly influences the selection of key topics for this particular chunk. This method proves to be more effective than the traditional sliding window approach of generating key topics from all preceding chunks plus the current chunk. This is because we now have context about what's discussed in the following chunk and we're not overwhelmed with a multitude of semantically similar topics from preceding chunks of text.
When deployed effectively, aspect-based summarization is a powerful tool for drawing out distinct summaries from the same meeting transcript without altering the prompt structure. The only change required is the aspect of focus. This approach allows you to cater to a variety of summarization use cases or domain requirements without the need to construct a completely new system and prompt for each one. The result can be a set of summaries, each offering a unique perspective based on the chosen aspects.
In this context, we present a prompt design that facilitates the easy substitution of the key aspect in the instructions. This is done without the need to awkwardly rewrite the language of the prompt.
The structure of the prompt is as follows:
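The original template isn't reproduced in this text; an illustrative stand-in with a swappable aspect variable might look like:

```python
# Illustrative aspect-based template: only {aspect} changes between use cases.
# The request for per-sentence reasoning matches the reasoning output shown below.
aspect_prompt = (
    "You are given a meeting transcript. Select the {n_sentences} sentences "
    "most relevant to the following aspect: {aspect}. Copy each sentence "
    "verbatim and briefly explain why it relates to the aspect.\n\n"
    "Transcript:\n{transcript}"
)
```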
Here’s a result that looks pretty good when provided a vague aspect.
Three of them are clearly related to next steps and are centered around action items to take. The second point is a bit confusing as to why it's related to next steps; I'd expect that sentence to come back if my chosen aspect were more focused on key benefits to the customer. It's the sentence that comes directly after the first bullet point, which is probably why it was chosen. The sentence “I can also do the same on my end, but I feel like you'll probably better at the elevator pitch and then pricing information and anything you can think of in terms of what it would take to train in the timeline.” should have been used instead. Here was the model's reasoning for each of them:
The reasoning for number two is pretty interesting and provides a reason that takes into account the entire conversation up to that point.
Here I chose a much more granular aspect that is only briefly discussed in the transcript. This should be much harder to extract, as these are really the only two sentences related to exactly what the chunking algorithm focuses on trying to do. The model does a great job: it doesn't choose sentences about what we focus on generally, what summarization focuses on, or even what the results of chunking are. The two sentences it selects are directly related to what the chunking algorithm focuses on improving in our entire pipeline.
Here’s an interesting result that comes back when we let GPT-4 decide how many sentences to return instead of setting that value ourselves. This is commonly how summarization systems are used: we don't decide ahead of time how many sentences are relevant, but ask the model to decide for us.
You’ll notice that the results that come back aren’t really correct. The first and third sentences provide context about the customer's product and the length of the sports commentaries, but do not specifically address their current summarization process.
By asking GPT-4 to “check its work”, we can actually improve the results when we open up the domain of possible outputs. This is sometimes called self-reflection, and it lets the model review its results one more time before returning them. We can see that this fixes our issue on the first attempt.
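A sketch of this self-reflection step, with illustrative prompts:

```python
from openai import OpenAI

client = OpenAI()

def summarize_with_reflection(transcript: str, aspect: str) -> str:
    """Generate an aspect-focused extractive summary, then ask the model to
    check its own work before returning."""
    draft = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Select the sentences from this transcript most relevant to: "
            f"{aspect}. Copy them verbatim.\n\nTranscript:\n{transcript}"
        )}],
    ).choices[0].message.content

    # Self-reflection pass: review the draft before returning it.
    return client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Check your work: does every sentence below specifically address "
            f"'{aspect}'? Remove any that do not and return the corrected "
            f"list.\n\nSentences:\n{draft}\n\nTranscript:\n{transcript}"
        )}],
    ).choices[0].message.content
```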
Interested in integrating these workflows into your product? Let’s chat about how we can build this tool right into Zoom and Google Meet via their APIs.