Hands-On Expert-Level Contract Summarization Using LLMs
Discover how we use LLMs for robust, high-quality contract summarization and understanding for our legal clients as well as other businesses.
Automatic text summarization of legal text like contracts, agreements, or court judgments using artificial intelligence techniques comes with difficult problems: omitting critical sentences, missing topics due to chunking issues, covering sections only partially, and including irrelevant sentences and topics. Overcoming these problems requires careful analysis and the use of pipelines tailored to the use case.
In this article, we review a state-of-the-art pipeline called Deep Clustering Enhanced Summarization (DCESumm) for improving legal document summarization using deep clustering techniques.
Legal documents are often long and divided into multiple distinct sections. They can also contain layout elements like boxes and non-linear ordering that make it difficult to evaluate a sentence's informativeness relative to the overall document, which is a prerequisite for creating a good summary.
The example legal documents below demonstrate some of these problems.
Agreements and contracts are typically divided into distinct sections as shown below:
If the user wants every section represented in the summary, then section-level summaries must be generated. Some sections can be quite lengthy, and every sentence's relevance and informativeness must be evaluated in the scope of its parent section.
Most legal contracts and agreements are written in very sophisticated language, with important details buried in obscure terms and convoluted phrasing that can be hard to read and understand even for humans. When generating summaries for such contracts, the model might omit these small but relevant details. To fix this, individual summaries should be generated for every section in the contract; if a section is too long, it should be broken down into smaller chunks and processed accordingly. Once all the chunks are summarized, they can be combined with a well-written prompt that retains the valuable information from all the chunks.
This approach ensures that no important information is lost, even when the document is very long or the content to summarize is very sophisticated. Because the contract is broken into chunks, every part of it is processed in a more focused manner.
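As a minimal sketch, the chunk-and-combine workflow might look like this in Python, assuming the OpenAI client library and naive fixed-size chunking (a production pipeline would split on section and sentence boundaries instead, and tune the prompts):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_chunk(text: str) -> str:
    """Ask the model to summarize one chunk while keeping concrete details."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You summarize legal text. Preserve parties, dates, "
                        "amounts, obligations, and defined terms."},
            {"role": "user", "content": f"Summarize this excerpt:\n\n{text}"},
        ],
    )
    return response.choices[0].message.content

def summarize_section(section: str, max_chars: int = 8000) -> str:
    """Chunk an over-long section, summarize each chunk, then combine."""
    chunks = [section[i:i + max_chars]
              for i in range(0, len(section), max_chars)]
    partials = [summarize_chunk(c) for c in chunks]
    if len(partials) == 1:
        return partials[0]
    return summarize_chunk("Merge these partial summaries into one, keeping "
                           "every concrete detail:\n\n" + "\n\n".join(partials))
```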
A typical judgment is shown below:
Some judgments have sections and others don't. Additionally, different topics or themes (e.g., injuries, negligence, or compensation in the above example) may be dispersed throughout the judgment and referred to multiple times. This can cause issues when identifying the key topics of the document and their relevance to the final summary, especially when writing extractive summaries.
For such documents, a good summary involves identifying these themes throughout the document, evaluating their related sentences and the informativeness of each sentence in the entire document, and including only the most informative sentences in the summary.
How relevant is a sentence to the overall document? How much does it contribute to the overall informativeness of the document? Is this sentence strongly correlated to the key topics or entities? These questions decide whether a sentence belongs in a reasonable summary. DCESumm answers them through two key innovations.
Most summarization approaches tend to evaluate sentence relevance using simple metrics like cosine similarity or Euclidean distance.
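For instance, a simple distance-based baseline (not DCESumm's approach) might score each sentence by its cosine similarity to the mean document embedding:

```python
import numpy as np

def cosine_relevance(sentence_embeddings: np.ndarray) -> np.ndarray:
    """Score each sentence by cosine similarity to the document centroid."""
    centroid = sentence_embeddings.mean(axis=0)
    norms = np.linalg.norm(sentence_embeddings, axis=1) * np.linalg.norm(centroid)
    return sentence_embeddings @ centroid / norms
```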
However, an alternative is to use supervised machine learning models trained on labeled summarization datasets. This enables more complex modeling of the relevance scoring. DCESumm does so with a deep neural network.
A trained relevance scoring model may misscore some sentences because it ignores their global informativeness. So, the target document's unique structure and global context should be used to correct the sentence scores.
For that, DCESumm looks for clusters of high and low relevance, and uses them to boost or lower the individual sentence scores. A low-scoring sentence in a highly relevant cluster is probably more informative than its score says and should have its score boosted. Similarly, a high-scoring sentence in an irrelevant cluster is probably a mistake and should have its score reduced.
In addition, DCESumm implements the clustering itself in a better way for improved relevance modeling. Instead of a traditional distance-based algorithm like K-means alone, it combines a supervised approach with unsupervised clustering to learn a cluster topology that's customized for the legal domain.
The DCESumm pipeline is shown below:
Its components are explained in the following sections.
The first pipeline component takes the target legal document, extracts all its sentences, and converts each sentence to a contextual vector embedding.
For embeddings, practitioners typically use a Bidirectional Encoder Representations from Transformers (BERT) model, one of its derivatives, or even embeddings from large language models like GPT. DCESumm simply reuses a pre-trained LEGAL-BERT model for this.
LEGAL-BERT is trained on legal datasets to learn the terms, concepts, phrase patterns, and sentence structures prevalent in legal documents. So, its embeddings work better on legal documents compared to the standard BERT that's trained on general text corpora.
Since LEGAL-BERT produces an embedding for every token in a sentence, a sentence-level embedding is generated by averaging all the token embeddings. Conceptually, this approach is similar to Sentence-Transformers, and an equivalent model from the latter can be used instead of LEGAL-BERT.
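A sketch of this embedding step with the Hugging Face transformers library, using the publicly available nlpaueb/legal-bert-base-uncased checkpoint (the paper's exact checkpoint may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

def sentence_embeddings(sentences: list[str]) -> torch.Tensor:
    """Mean-pool LEGAL-BERT token embeddings into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**batch).last_hidden_state  # (B, T, 768)
    # Mask out padding tokens before averaging.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```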
The next component is a trained multi-layer perceptron (MLP) network that calculates a relevance score for every sentence from its embedding vector.
Architecture-wise, it's just four dense layers with a dropout layer between each pair to improve generalization. The input to this model is a single sentence embedding. Its output sigmoid neuron produces the sentence's relevance score as a probability.
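In PyTorch, such a scorer might look like this; the hidden layer widths and dropout rate are illustrative assumptions, not the paper's hyperparameters:

```python
import torch.nn as nn

class RelevanceScorer(nn.Module):
    """Four dense layers with dropout in between; sigmoid output in [0, 1]."""
    def __init__(self, embedding_dim: int = 768, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 16), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, embedding):
        return self.net(embedding)  # relevance score as a probability
```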
For training this network, the training and test sets are derived from a summarization dataset like BillSum, recast as a binary classification dataset:
This gives us a training set where each row has a sentence embedding and a zero or one label to indicate its absence or presence in the summary.
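Because BillSum's reference summaries are written text rather than marked sentence selections, the binary labels must first be derived. One common approach, sketched below with Google's rouge-score package, is greedy ROUGE matching; the paper's exact labeling procedure may differ:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def label_sentences(sentences: list[str], reference: str) -> list[int]:
    """Greedily label a sentence 1 if adding it improves ROUGE-1 F1."""
    labels, selected, best = [0] * len(sentences), [], 0.0
    for i, sent in enumerate(sentences):
        candidate = " ".join(selected + [sent])
        score = scorer.score(reference, candidate)["rouge1"].fmeasure
        if score > best:
            best, labels[i] = score, 1
            selected.append(sent)
    return labels
```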
The neural network is trained on this dataset. Its learned weights form a model that can predict whether a sentence, represented by its embedding, is likely to be in the document's summary, and with what probability. This probability is the sentence's relevance score. What's nice about this workflow over something like GPT-4 summarization is that you get a concrete value for how valuable a specific sentence is. When summarizing with GPT models, you don't have that insight available to iterate on and improve models, and clear evaluation metrics toward the goal output are critical for iterating on models in production.
DCESumm's clustering is based on the deep embedded clustering approach. It consists of multiple deep neural networks as explained below.
The first component is a stacked autoencoder network that reduces a high-dimensional embedding to a low-dimensional feature vector that K-means clustering can handle.
This network consists of the following layers:
It's trained on the sentence embeddings, and since it's an autoencoder, the input and output are the same embeddings. Its sole purpose is to produce a high-quality five-dimensional representation at the fifth layer for a given high-dimensional sentence embedding. Once trained, only the encoder stack up to the fifth layer is used.
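A sketch of such a stacked autoencoder in PyTorch, trained with a mean-squared-error reconstruction loss; the intermediate layer widths here are assumptions (borrowed from the original deep embedded clustering paper), and only the five-dimensional bottleneck is fixed by the pipeline:

```python
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Compress a sentence embedding to 5 dimensions and reconstruct it."""
    def __init__(self, embedding_dim: int = 768):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(embedding_dim, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 2000), nn.ReLU(),
            nn.Linear(2000, 5),
        )
        self.decoder = nn.Sequential(
            nn.Linear(5, 2000), nn.ReLU(),
            nn.Linear(2000, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, embedding_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # 5-dimensional feature used downstream
        return self.decoder(z), z    # reconstruction is trained against x
```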
From the previous phases, the pipeline already has the sentence embeddings of a document. Each embedding is reduced to a five-dimensional feature vector using the autoencoder. These five-dimensional features are run through standard K-means to obtain an initial set of sentence clusters and cluster centroids. The initial number of clusters is set to 35% of the number of sentences in the document.
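With scikit-learn, this initialization is a few lines:

```python
from sklearn.cluster import KMeans

def initial_clusters(features):
    """K-means over 5-D features; cluster count is 35% of sentence count."""
    n_clusters = max(1, int(0.35 * len(features)))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return kmeans.labels_, kmeans.cluster_centers_
```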
Normally, K-means optimizes clusters by repositioning their centroids such that L2 distances to their member points are minimized. The implicit assumption is that the feature space topology is Euclidean.
But since it may not actually be Euclidean, a more general approach is to use gradient-based optimization like stochastic gradient descent (SGD) to incrementally optimize the cluster assignments. This has the added advantage of simultaneously improving the sentence representations too. The technique is explained in detail next.
For each sentence, the similarity between its low-dimensional embedding and each cluster centroid, measured with a Student's t-distribution kernel, forms a probability distribution over the clusters. These are the soft cluster assignments, to be updated in every iteration.
From them, a second probability distribution is derived by squaring and renormalizing the soft assignments. This sharpened distribution is called the auxiliary target distribution and is treated as the hard cluster assignments.
The commonly used difference metric between any two probability distributions is the Kullback-Leibler divergence (KLD). In every iteration of the SGD, the KLD is minimized by nudging the soft cluster assignments towards the hard cluster assignments. Then the representations are updated and cluster centroids are recalculated.
This is repeated until the soft and hard cluster labels agree within a threshold. The cluster assignments are then frozen and used for the next step.
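A sketch of this refinement loop in PyTorch, where encoder is the trained encoder stack from the autoencoder and centroids is a learnable parameter initialized from the K-means centroids (both are assumed to be registered with the optimizer):

```python
import torch

def soft_assignments(z, centroids, alpha=1.0):
    """Student's t kernel between features and centroids (soft assignments q)."""
    dist_sq = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened auxiliary targets p: square q and renormalize."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def refine_step(encoder, centroids, embeddings, optimizer):
    """One SGD step minimizing KL(p || q) w.r.t. encoder and centroids."""
    z = encoder(embeddings)
    q = soft_assignments(z, centroids)
    p = target_distribution(q).detach()  # targets are held fixed per step
    loss = torch.nn.functional.kl_div(q.log(), p, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```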
At this point, we have a set of optimized clusters, each consisting of a set of low-dimensional sentence embeddings and a cluster centroid in that feature space.
First, the sentence relevance scoring model from before is reused on the updated sentence representations to calculate new sentence relevance scores. The overall cluster score is the median of these new relevance scores.
Finally, the new sentence relevance scores are weighted by the cluster scores, as shown in this formula:
If a cluster's relevance score is high, it'll boost the scores of all its constituent sentences. But if it's low, it'll bring down all its sentence scores too.
For each document, the sentences are sorted by their enhanced relevance scores, and the top N sentences are selected as the extractive summary. N here depends on the average summary length of the selected dataset.
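As a sketch of these last two steps, assuming a simple multiplicative weighting between sentence and cluster scores (the paper's exact formula is the one shown above):

```python
import numpy as np

def extractive_summary(sentences, sentence_scores, cluster_ids, top_n):
    """Weight each sentence score by its cluster's median score; keep top N."""
    scores = np.asarray(sentence_scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    enhanced = scores.copy()
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        enhanced[members] = scores[members] * np.median(scores[members])
    top = sorted(np.argsort(enhanced)[-top_n:])  # keep document order
    return [sentences[i] for i in top]
```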
Generated summaries are evaluated using either reference-based metrics or reference-free metrics.
Reference-based metrics compare some characteristics of a generated summary against its reference summary. For example, ROUGE and BLEU scores measure n-gram overlaps between them while ignoring the semantic meanings they convey, which is fine for strictly extractive summaries but not for loosely extractive or abstractive summaries.
Other reference-based metrics like BERTScore and MoverScore do consider the semantic similarity between the summaries by evaluating them using trained language models. However, when evaluating domain-specific vocabulary like that in legal summaries, these models may score incorrectly because they were trained on generic datasets.
In general, reference-based metrics don't seem to match the subjective evaluations of summaries by people. So, more modern approaches use reference-free metrics like SUPERT or SummaC to evaluate the linguistic aspects of summaries like their factuality, faithfulness, semantics, and genericity using characteristics that people use for subjective evaluations.
Since DCESumm produces strictly extractive summaries, the paper evaluates them against the reference summaries using simple ROUGE scores.
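Those ROUGE scores can be computed with Google's rouge-score package:

```python
from rouge_score import rouge_scorer

def rouge_report(reference: str, generated: str) -> None:
    """Print ROUGE-1/2/L precision, recall, and F1 for one summary pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    for name, s in scorer.score(reference, generated).items():
        print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} "
              f"F1={s.fmeasure:.3f}")
```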
DCESumm scores better on the test datasets than its alternative baseline methods:
It also performs better than other state-of-the-art deep learning models on extractive reference summaries:
An example generated summary against its reference summary from BillSum is shown below:
In this article, we explored a state-of-the-art summarization system that uses innovative techniques to evaluate the relevance and informativeness of sentences.
At Width, we have extensive experience in implementing legal document summarization techniques as well as other natural language processing (NLP) pipelines to improve legal document understanding, including using the latest large language models like GPT-4.
Contact us to learn how your law practice can improve its productivity using such techniques.