Hands-On Expert-Level Contract Summarization Using LLMs
Discover how we use LLMs for robust high-quality contract summarization and understanding for our legal clients as well as other businesses.
Documents can often have low signal-to-noise ratios, forcing employees to spend a lot of time reading entire documents just to extract the information that's actually valuable to the business. That's why the summarization of large volumes of text has become one of the most appreciated natural language processing (NLP) capabilities of large language models (LLMs).
In this article, we explore the summarization capabilities of OpenAI's latest generative pre-trained transformer (GPT) models. Specifically, we examine how GPT-4's zero-shot summarization compares against the older GPT-3 as well as state-of-the-art, self-hosted language models.
In this article, the term "GPT-3" covers both the GPT-3.5 models, like gpt-3.5-turbo and text-davinci-003 (GPT-3 davinci "instruct"), and the older GPT-3 models. All GPT-3 evaluations here were done using the gpt-3.5-turbo model, which is considered the best model of its generation.
Compared to it, GPT-4 offers a larger context window and follows complex instructions more faithfully.
In the following sections, we compare GPT-4 against GPT-3 and state-of-the-art, self-hosted models on three zero-shot summarization use cases: news article summarization, long-form document summarization, and aspect-based summarization of customer reviews.
All the experiments use zero-shot summarization, which means the models are provided with only an instruction and the text to summarize. Unlike few-shot learning, no demonstrative examples are provided to guide the GPT models. We don't use any fine-tuned GPT models either, just the stock ones available through the application programming interface (API) endpoints. However, some of the self-hosted models may have been fine-tuned for summarization.
In 2022, Goyal et al. evaluated GPT-3 on news summarization using the GPT-3.5 text-davinci-002 model. They compared its results with a state-of-the-art, self-hosted model called BRIO that's fine-tuned for summarization.
Here, we find out how GPT-4 stacks up against both GPT-3.5 and BRIO on the same datasets and benchmarks.
Like the paper, we summarize articles from the CNN/Daily Mail news dataset. The dataset provides human-annotated reference summaries that are highly extractive but not fully so (for example, select phrases may make it to the summaries but not their parent sentences).
Since these articles are quite short and don't exceed the token limits of any of the models, all the details have a chance to make it to the summaries without getting cut out.
We instruct both GPT-4 and GPT-3 to summarize via the chat completion API. The temperature parameter, which influences the creativity of the generated text, is set to 0.75 for all experiments.
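As a rough sketch, such a call might look like this (the prompt wording and helper function are illustrative, not the exact ones from our experiments):

```python
# Illustrative zero-shot summarization via the chat completion API
# (openai Python package, v0.x-style interface). The prompt is a stand-in
# for the ones described in this article.
import openai

def summarize(article_text: str, model: str = "gpt-4") -> str:
    prompt = f"Summarize the article below in 3 sentences:\n\n{article_text}"
    response = openai.ChatCompletion.create(
        model=model,       # "gpt-4" or "gpt-3.5-turbo"
        temperature=0.75,  # allows some creative rephrasing
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```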
For summarization using BRIO, we use Hugging Face's pipeline interface.
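A minimal sketch of that pipeline call is shown below; it assumes the publicly available BRIO checkpoint fine-tuned on CNN/Daily Mail:

```python
# Summarization with a BRIO checkpoint through the transformers pipeline
# interface. "Yale-LILY/brio-cnndm-uncased" is assumed to be the public
# CNN/Daily Mail checkpoint; swap in another BRIO model if needed.
from transformers import pipeline

article_text = "(CNN) -- ..."  # placeholder for a CNN/Daily Mail article

summarizer = pipeline("summarization", model="Yale-LILY/brio-cnndm-uncased")
summary = summarizer(article_text)[0]["summary_text"]
print(summary)
```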
We then evaluate the generated summaries using the reference-based and reference-free metrics mentioned in the paper, like ROUGE, BERTScore, and SUPERT.
We also have them evaluated qualitatively by human evaluators, who are asked to select their most and least preferred summaries and give subjective reasons for their choices.
Two different instruction prompts are run with both GPT models. The first prompt is the same simple one used by Goyal et al., with the number of output sentences set to 3:
The second prompt is more complex, designed to test GPT-4's ability to follow complex instructions:
One test article is a medical story of a kidney transplant that ended up saving six lives thanks to a computer program that finds groups of genetically compatible donors and recipients who can exchange kidneys safely. An excerpt from the article and the gold reference summary are shown here:
GPT-4 and GPT-3 generate these summaries with the simple prompt:
GPT-4 ignores the donor's story and focuses almost entirely on the computer program and its creator. In contrast, GPT-3 focuses equally on both the donor and the enabling program.
The complex prompt results in a more balanced focus by GPT-4. However, the concept of a chain of matches is better explained by GPT-3:
Unlike all the above summaries, BRIO focuses entirely on the donor and attributes the success to big data instead of the specific program and its creator:
ROUGE metrics measure text overlaps. If a generated summary has many of the same words and n-grams as its reference summary, its ROUGE F1-scores will be higher. They're good metrics for extractive summaries but not ideal for abstractive summaries.
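For reference, ROUGE F1-scores can be computed with the rouge-score package; the two summary strings below are placeholders:

```python
# Computing ROUGE-1/2/L F1-scores with the rouge-score package
# (pip install rouge-score). The two strings are stand-in summaries.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="A computer program helped one donated kidney start a chain of six transplants.",
    prediction="Six lives were saved after software matched compatible kidney donors and recipients.",
)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))  # F1-score per ROUGE variant
```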
Across five random news articles, GPT-3 with the complex prompt scores better than all the other GPT variants, while GPT-3 with the simple prompt lags far behind. Which instruction in the complex prompt causes this contrasting behavior remains unclear; a prompt ablation study or a chain-of-thought deconstruction may shed more light.
BRIO outperforms all the GPT models on these metrics, suggesting that BRIO's summaries are somewhat more extractive and less abstractive. The paper's authors observed the same pattern:
BERTScore is a much more semantic evaluation of the summary because it uses a language model to evaluate how close the generated summary is to the reference summary. The table below shows the mean F1-scores of each model's BERTScore:
All the models perform comparably well, but zero-shot GPT-4 is slightly better than zero-shot GPT-3 and on par with the fine-tuned, state-of-the-art model.
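As a quick sketch, BERTScore can be computed with the bert-score package; the candidate and reference strings below are placeholders:

```python
# Computing BERTScore with the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["Six lives were saved after software matched compatible kidney donors."]
references = ["A computer program helped one donated kidney start a chain of six transplants."]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {f1.mean().item():.3f}")
```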
Summarization of long-form documents like articles, white papers, legal agreements, medical reports, or research papers can help many industries improve employee productivity by optimizing time spent on reading.
Unfortunately, all LLMs have token limits that hinder their use for this task. The future looks promising, though: a GPT-4 variant with a 32,768-token context window (roughly 25,000 words) is already available to some users, and alternative models like Anthropic's Claude, with similarly long contexts, are on the horizon.
But for now, most GPT-4 users are limited to 8,192 tokens, or about 6,000 words. To overcome this limit, we must use strategies like chunking. In this section, we evaluate GPT-4's summaries of long-form documents, using long blog posts as examples.
We run these summarization experiments on the longest posts available in the wikiHow knowledge base, which is available as a summarization dataset with reference summaries.
As of March 2023, the state-of-the-art summarization model for this dataset, introduced by Savelieva et al., combines a BERT encoder with separate decoders for abstractive and extractive summaries.
We explore how GPT-4 and GPT-3 zero-shot summarization compare with it. Both models are asked to generate abstractive and extractive summaries using suitable prompts, and the generated summaries are evaluated on ROUGE metrics.
We deliberately select posts that are longer than the token limits of these GPT models. This allows us to evaluate recursive summarization through chunking. We use the LangChain map-reduce pipeline and set the chunk size to 3,000 tokens and chunk overlap to 100 tokens to preserve some local context between chunks.
For the abstractive prompts, the temperature parameter is set to 0.75 so that the GPTs are free to rephrase the posts. For the extractive prompts, it's set to zero to prevent any rephrasing and force the models to simply select sentences from the posts.
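A rough sketch of this chunked map-reduce setup is shown below. It assumes a 2023-era LangChain release; the file name and model settings are illustrative:

```python
# Chunked map-reduce summarization with a 2023-era LangChain release.
# The file name is a placeholder for one of the long wikiHow posts.
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document

post_text = open("wikihow_post.txt").read()

# Split the post into ~3,000-token chunks with a 100-token overlap so that
# some local context is preserved between neighboring chunks.
splitter = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)
docs = [Document(page_content=chunk) for chunk in splitter.split_text(post_text)]

# Temperature 0.75 for abstractive runs; set it to 0 for extractive runs.
llm = ChatOpenAI(model_name="gpt-4", temperature=0.75)
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))
```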
Out of the 215,364 posts in the dataset, only nine are longer than 8,192 tokens (GPT-4's token limit) and only four of those nine are general posts rather than highly technical posts. These four are used for the evaluation.
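Token counts like these can be checked with OpenAI's tiktoken library; the sketch below assumes the wikiHow post texts are already loaded as a list of strings:

```python
# Counting tokens with OpenAI's tiktoken library to find posts that exceed
# GPT-4's 8,192-token limit. The posts list is a placeholder for the loaded
# wikiHow dataset.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

def num_tokens(text: str) -> int:
    return len(encoding.encode(text))

posts = ["..."]  # placeholder: the 215,364 wikiHow post texts
long_posts = [p for p in posts if num_tokens(p) > 8192]
print(f"{len(long_posts)} posts exceed the 8,192-token limit")
```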
For abstractive summarization, we didn't provide any custom prompt and just relied on LangChain's prompt:
For extractive summarization, we provided the following custom prompt:
Since GPT-3, in particular, showed a tendency to rephrase sentences despite being told to "select 3 sentences," we asked it to "just pick 3 sentences" instead and explicitly instructed it not to rephrase any sentence. This prompt iteration worked much better than "select."
An example post and its reference summary are shown below:
The abstractive summaries generated by GPT-4 and GPT-3 are shown below:
Specific instructions have been generalized into more abstract guidelines in these summaries.
The extractive summaries are shown below:
Since our prompt limited the models to just three key sentences, each has picked three that convey the main ideas of the article. Arguably, GPT-3 has done a slightly better job here, with more emphasis on the do's and don'ts than GPT-4.
The mean ROUGE F1-scores for GPT-4 and GPT-3 are shown below:
We see that GPT-4 abstractive scores lower than GPT-3 abstractive, indicating that GPT-4 is much more abstractive for the same prompt.
At the same time, GPT-4 extractive scores higher than GPT-3 extractive, indicating that GPT-4 follows strict instructions better.
Overall, the mean F1-scores are low because the reference summaries are produced by summarizing each paragraph, while our prompts zero-shot summarize the entire long-form document rather than going paragraph by paragraph.
The metrics from the BERTSum model in the paper are shown below. Since it summarizes paragraph by paragraph, it scores higher:
Many businesses find themselves overwhelmed by the deluge of reviews and feedback their products and services receive. Often, these reviews contain nuggets of useful information that are valuable to the business. But the volume of data makes it difficult to isolate the useful information from all the noise.
In this context, opinion and aspect summarization can help businesses extract useful information from product reviews by isolating reviews along multiple aspect axes. This approach was explored by Bhaskar et al. in their paper, Zero-shot opinion summarization with GPT-3. In this section, we extend their exploration to GPT-4.
Our approach uses one of the summarization pipelines used by the paper on the hotel reviews from the summaries of popular and aspect-specific customer experiences (SPACE) dataset. This dataset contains about 100 reviews each for some 50 hotels, with each review spanning about 10-15 sentences. In addition, it provides reference summaries along different aspects of a hotel, like its service or its cleanliness.
The pipeline replicates the topic-wise clustering with chunked GPT-4 summarization (TCG) approach. The topics here are the six aspects on which the reviews are evaluated — rooms, building, cleanliness, location, service, and food.
Its summaries are evaluated using ROUGE and BERTScore metrics and compared with GPT-3's metrics from the paper.
This pipeline first labels the review sentences with aspects to yield six aspect clusters. Each cluster of sentences focuses on a single aspect. We then summarize each aspect cluster using GPT-4.
For the aspect labeling, the paper uses GloVe vectors. However, we experimented with two alternatives: labeling sentences with GPT-4 chat prompts, and labeling them with OpenAI's embeddings API.
We found that the first approach of labeling sentences using GPT-4 chat prompts was sometimes unreliable and often unacceptably slow. The first prompt we tried was based on counts and looked like this:
However, for this prompt, GPT-4's unreliability manifested in many ways:
To overcome these problems, a second prompt was tried that didn't rely on explicit counts but enforced it implicitly by forcing GPT-4 to fill in a structured JSON result:
This worked reliably. The problem was that it was unacceptably slow: classifying 50 sentences took about 30 minutes, and we had hundreds of sentences per aspect across 10 hotels. Lesson learned: this approach isn't viable in any production scenario.
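For illustration only, a JSON-constrained labeling call along these lines might look like the following sketch; this is a hypothetical reconstruction, not the exact prompt we used:

```python
# Hypothetical reconstruction of a JSON-constrained labeling call. Giving
# every sentence its own JSON slot keeps GPT-4 from skipping or merging
# sentences without stating an explicit count.
import json
import openai

ASPECTS = ["rooms", "building", "cleanliness", "location", "service", "food"]

def label_sentences(sentences):
    skeleton = json.dumps({str(i): "<aspect>" for i in range(len(sentences))}, indent=2)
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    prompt = (
        f"Classify each hotel review sentence into one of these aspects: {', '.join(ASPECTS)}.\n"
        f"Sentences:\n{numbered}\n\n"
        f"Reply with only this JSON, with every <aspect> filled in:\n{skeleton}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp["choices"][0]["message"]["content"])
```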
The alternative was to use embeddings.
The embeddings API enabled us to label thousands of sentences within a few seconds.
To enrich the semantic meanings of the aspect labels, we converted them to these seven aspect phrases:
OpenAI's latest "text-embedding-ada-002" engine generated 1,536-dimensional embeddings for these seven aspect phrases.
We then generated embeddings for all the review sentences using the same engine. Since these are unit-normalized embeddings, finding semantic similarity was as simple as running a matrix dot product between the sentence embeddings matrix and the aspect embeddings matrix.
Each row in the resulting matrix corresponded to a review sentence and contained cosine similarities with the seven aspects. We picked the highest two values in each row as the best label and the next best label. We picked two labels instead of just one because many sentences were actually relevant to multiple aspects.
All the review sentences of each hotel were then clustered under their best and next-best aspect labels. We ended up with seven aspect clusters per hotel.
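The sketch below illustrates this embedding-and-dot-product labeling; the aspect phrases and review sentences shown are stand-ins for the ones we actually used:

```python
# Embeddings-based aspect labeling using the 2023-era openai Embedding API.
# The aspect phrases and review sentences below are illustrative stand-ins.
import numpy as np
import openai

aspect_phrases = [
    "hotel rooms", "hotel building", "cleanliness of the hotel",
    "location of the hotel", "service at the hotel", "food at the hotel",
    "overall hotel experience",
]
sentences = [
    "The staff were friendly and checked us in quickly.",
    "Our room was spotless and the bed was very comfortable.",
]

def embed(texts):
    # text-embedding-ada-002 returns unit-normalized, 1,536-dimensional vectors.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

sentence_vecs = embed(sentences)     # shape: (num_sentences, 1536)
aspect_vecs = embed(aspect_phrases)  # shape: (7, 1536)

# Dot products of unit vectors are cosine similarities; each row holds one
# sentence's similarity to the seven aspects.
similarities = sentence_vecs @ aspect_vecs.T
top_two = np.argsort(similarities, axis=1)[:, -2:]  # [next-best, best]

for sent, (next_best, best) in zip(sentences, top_two):
    print(f"{sent} -> {aspect_phrases[best]}, {aspect_phrases[next_best]}")
```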
Like the paper, we grouped each aspect cluster into chunks of 30 sentences, summarized each chunk separately, and then applied recursive summarization.
We used the same prompts as the paper:
We got a final summary of about 30 sentences for each aspect and for each hotel.
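In simplified form, that recursion looks roughly like the sketch below; the prompt inside summarize_chunk is a hypothetical stand-in for the paper's prompt:

```python
# Simplified sketch of recursive, chunked summarization of one aspect cluster.
# The prompt inside summarize_chunk() is a hypothetical stand-in for the
# paper's prompt.
import openai

def summarize_chunk(sentences):
    prompt = (
        "Summarize the key opinions in these hotel review sentences:\n"
        + "\n".join(sentences)
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Treat each line of the reply as one summary sentence.
    return resp["choices"][0]["message"]["content"].splitlines()

def summarize_cluster(sentences, chunk_size=30, target_len=30):
    # Base case: the cluster already fits the final summary budget.
    if len(sentences) <= target_len:
        return sentences
    summaries = []
    for i in range(0, len(sentences), chunk_size):
        summaries.extend(summarize_chunk(sentences[i : i + chunk_size]))
    # Recurse on the chunk summaries until they are short enough.
    return summarize_cluster(summaries, chunk_size, target_len)
```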
The image below shows examples of sentences grouped under two aspects of hotel reviews — rooms and service — using the embeddings API:
Some of the sentences are clearly relevant to other aspects, like location, too. That's why we labeled each sentence with not just the most relevant label but also the second most relevant one.
The image below shows some of the aspect-wise summaries generated by GPT-4 through zero-shot summarization:
The summaries contain valuable insights extracted from around 100 reviews, covering both positive and negative aspects, and they surface many actionable areas of improvement. Extracting that kind of useful information would have required a lot of staff time and manual effort in the past.
The ROUGE and BERTScore metrics from the paper for GPT-3 pipelines are shown below:
We focus on the results of the TCG pipeline because that's the one we adapted for GPT-4 too. The scores for the same pipeline but implemented using OpenAI embeddings API and GPT-4 recursive summarization are shown below:
Compared to the GPT-3 pipeline, our GPT-4 pipeline doesn't score as high on ROUGE but scores higher on BERTScore.
In this article, we explored how GPT-4 fares against GPT-3 and state-of-the-art self-hosted summarization models. GPT-4's zero-shot summarization advancements enable businesses to extract high-quality information and insights just by using suitable prompts without having to spend time fine-tuning custom language models. Talk to us to find out how you can use it in your specific workflows.