Hands-On Expert-Level Patient Record Summarization Using LLMs
Discover how we use LLMs for robust, high-quality patient record summarization and understanding for our healthcare and legal clients as well as other businesses.
In this article, we explore our Width.ai patient record summarization pipeline that can extract key clinical information from patient records in a wide variety of formats, including scans and handwriting, and consolidate it into a single summary timeline.
In this task, we're interested in extracting five essential pieces of information from patient reports: names, dates, ICD codes, medications, and diagnoses.
This information may be spread across many pages of a report. We want to extract it from every page and consolidate everything we find into a single timeline.
The patient reports can be PDF or image files. PDFs can be of two types: digitally generated PDFs with embedded text, and scanned PDFs that are just page images.
Five sample patient records are analyzed in the sections below.
Although it looks like a simple layout, these types of documents can cause issues for basic OCR implementations or Amazon Textract. They have column-based key-value pairs, left-to-right key-value pairs, and sections that use a bit of both. The red box is one of the best ways to show this.
One of the most difficult parts of extracting information from these documents is very long tables. These systems have to correlate field values to the column titles using positional information and NLP understanding of the titles. As a value moves farther away from its column title, it becomes harder for the positional information to correlate the two and not confuse the field with values that sit outside the table. At some point, the model has to wonder whether the field text that is far away from the column title is even a field at all!
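To make the positional half of that problem concrete, here's a minimal sketch (not our production model) that assigns table cells to column headers purely by horizontal bounding-box overlap, assuming OCR has already produced boxes for headers and cells. The `max_gap` threshold is an invented stand-in for the confidence decay that makes long tables hard.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    x0: float
    x1: float
    y0: float
    y1: float

def horizontal_overlap(a: Box, b: Box) -> float:
    # Width of the horizontal intersection between two boxes.
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))

def assign_to_columns(headers: list[Box], cells: list[Box], max_gap: float = 2000.0) -> dict[str, list[str]]:
    """Assign each table cell to the column header it overlaps most with.

    Purely positional: the farther a cell drifts below its header, the less
    trustworthy this assignment becomes, which is exactly why long tables are
    hard for layout models.
    """
    columns: dict[str, list[str]] = {h.text: [] for h in headers}
    for cell in cells:
        best = max(headers, key=lambda h: horizontal_overlap(h, cell))
        if horizontal_overlap(best, cell) > 0 and (cell.y0 - best.y1) < max_gap:
            columns[best.text].append(cell.text)
    return columns

headers = [Box("Medication", 0, 120, 0, 20), Box("Dose", 130, 200, 0, 20)]
cells = [Box("Lisinopril", 0, 110, 25, 45), Box("10 mg", 132, 190, 25, 45)]
print(assign_to_columns(headers, cells))
# {'Medication': ['Lisinopril'], 'Dose': ['10 mg']}
```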
This is one of the most interesting patient records to process into a structured format. This record format contains deeply nested key-value pairs with multiple layers. Headers come in multiple levels of boldness and sizes, which makes them even more difficult to recognize and extract.
Extracting the text in the schema that the document follows is critical for being able to use this information downstream. If the text is extracted in a way that doesn't preserve how pieces of information relate to each other, we lose the meaning those values carry. Let me show you what I mean and how it affects us.
Standard OCR extracts the text left to right, with line breaks where necessary. We can see that extracting tables left to right doesn't work and scrambles the reading order of the information. If we were using an LLM downstream to extract information from this document, it would seriously struggle.
AWS Layout Extraction is supposed to be the solution for this. It extracts the text in a format that follows the reading order of the document and labels each element with its type (text, header, table).
If you check the above document, you'll see that it misses a ton of text and produces much worse OCR than the raw text extraction above. But the key issue is that the reading schema isn't entirely correct. The text isn't correlated to any specific headers, and subheaders are not correlated to headers. These documents read more like nested structures where the headers above are relevant to the smaller, more precise values below them. The key-value pairs and the header-to-subheader relationships don't exist in a vacuum. The format that AWS Layout Extraction pulls this data into isn't really useful. It should recognize that "Admission Information" is not at the same level as "Visit Information" and that "Admission Information" is actually a subsection of "Visit Information". This once again causes a ton of issues for downstream LLMs trying to understand how the data correlates and is supposed to be read.
This is how the data should look: a nested schema that clearly defines which element is a parent of which, which fields are children, and which text is a key versus a value. This is an extraction from a similar document run through our medical document processing pipeline. You can see each section gets its own key-value set under each header.
This patient record has an unusual two-column layout. These types of documents can be challenging because they represent only a small share of the data variance seen in this use case, which means document processing systems don't have much training data to learn the layout, or simply don't handle this mapping well. For us, it's the exact same use case as our resume parser, which achieved over 98% accuracy on two-column resume formats.
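To make the target shape concrete, here's a small, hypothetical example of the kind of nested output we're after, using the "Visit Information" / "Admission Information" relationship mentioned above. All field names and values below are invented purely for illustration:

```json
{
  "Visit Information": {
    "Admission Information": {
      "Admit Date": "01/12/2021",
      "Admitting Physician": "Dr. A. Smith"
    },
    "Discharge Information": {
      "Discharge Date": "01/15/2021"
    }
  },
  "Patient Information": {
    "Name": "Jane Doe",
    "DOB": "03/04/1962"
  }
}
```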
The scanned report above has an additional block of handwritten entries at the bottom. We’ll talk more about it further down.
The technical challenges of this task are the same ones that most real-world document digitization and information extraction projects run into: complex and highly varied layouts (nested key-value pairs, long tables, multi-column pages), documents that run to thousands of pages, poor-quality scans, and handwritten text mixed in with printed text.
The illustration below shows our patient report summarization pipeline based on an LLM with a prompting framework and other deep-learning models.
The pipeline consists of these components: record splitting and page classification, image preprocessing, handwritten text detection and recognition, text extraction with layout understanding, text pre-processing, LLM-based information extraction and summarization, and post-processing into a consolidated timeline.
We explore each step in detail in the sections below.
Each PDF is processed page by page. Sections that span multiple pages are recombined later with help from the page classification and combinator module. This is the first place where pipelines start to fall apart. These medical documents can contain thousands of pages and need to be broken down, not only for length and size reasons but also to keep relevant context tightly together. LLMs become more generic as context windows grow (sorry, 100K-token models), and we need to avoid that for accurate summaries. Splitting this information up lets us control these factors much more easily without running into edge cases that ruin our pipeline.
Understanding where to split pages is a problem in itself. If you split a document in the wrong spot, there's no way for the downstream models to recover the missing context and fix the issue. This means that part of the accuracy of the downstream models is directly tied to our system here. The record splitting algorithm is baked directly into our framework.
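As a rough illustration of the grouping side of this (the actual record splitting algorithm is more involved and proprietary), here's a minimal sketch. The `classify_page` function is a hypothetical stand-in for the page classification model; the grouping step keeps consecutive pages of the same section together so a multi-page section ends up in one chunk:

```python
from itertools import groupby

def classify_page(page_text: str) -> str:
    """Hypothetical stand-in for the page classification model.

    The real pipeline uses a learned classifier; this keys off a few section
    headers purely for illustration.
    """
    for label in ("Visit Information", "Lab Results", "Medications"):
        if label.lower() in page_text.lower():
            return label
    return "Other"

def split_into_sections(pages: list[str]) -> list[tuple[str, list[str]]]:
    # Group consecutive pages that belong to the same section so a section
    # spanning multiple pages stays in a single chunk.
    labeled = [(classify_page(page), page) for page in pages]
    return [(label, [page for _, page in group])
            for label, group in groupby(labeled, key=lambda pair: pair[0])]

pages = [
    "Visit Information\nAdmit Date: 01/12/2021 ...",
    "Visit Information (continued)\nAdmitting Physician: ...",
    "Lab Results\nWBC 7.2 ...",
]
for label, chunk in split_into_sections(pages):
    print(label, "-", len(chunk), "page(s)")
# Visit Information - 2 page(s)
# Lab Results - 1 page(s)
```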
As listed earlier, some of the scanned PDFs and images may be of poor quality. Our text extraction pipeline uses a custom vision-language model that does text extraction and OCR simultaneously to reduce spelling mistakes and other misidentification errors common in OCR-only approaches. Ensuring that the input images to this pipeline are of high quality greatly improves its accuracy.
We'll explain some of the common preprocessing techniques we use to improve text extraction.
A number of image preprocessing operations are applied to the scanned patient record pages to improve the accuracy of character recognition; a rough sketch of typical operations is shown below.
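This sketch assumes OpenCV; the exact operations and parameters aren't from the original article and depend on the quality of the incoming scans:

```python
import cv2
import numpy as np

def preprocess_page(image_path: str) -> np.ndarray:
    """Common OCR-oriented cleanup: grayscale, denoise, rough deskew, binarize.

    Illustrative only; thresholds and parameters would be tuned per document
    source in a real pipeline.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Remove scanner noise while preserving character edges.
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Rough deskew: estimate the skew angle from the minimum-area rectangle
    # around the ink pixels (angle conventions vary across OpenCV versions).
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = denoised.shape
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(denoised, rotation, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # Adaptive thresholding handles uneven lighting across the page.
    return cv2.adaptiveThreshold(deskewed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, blockSize=31, C=15)
```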
Since there's a chance that a page may contain handwritten notes, we use a fine-tuned R-CNN object detection model to localize blocks of handwritten text. Luckily, since handwritten text patterns are often drastically different from printed text, detection accuracy tends to be high.
The primary challenge arises when single handwritten words or short phrases are inserted above a printed line. Such insertions may change the meaning of the text in medically consequential ways and must be merged with the printed text based on their locations.
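Here's a minimal sketch of that merge step, assuming the handwriting detector and the layout model both return bounding boxes. The pixel thresholds are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    x0: float
    y0: float  # top edge
    x1: float
    y1: float  # bottom edge

def merge_insertions(printed_lines: list[Span], handwritten: list[Span]) -> list[str]:
    """Attach each handwritten insertion to the printed line directly below it.

    Inserted words usually sit just above the line they modify, so we look for
    handwritten spans that end slightly above a printed line and overlap it
    horizontally.
    """
    merged = []
    for line in printed_lines:
        notes = [h.text for h in handwritten
                 if h.y1 <= line.y0 + 5                   # sits above the line
                 and line.y0 - h.y1 < 30                  # ...but close to it vertically
                 and h.x0 < line.x1 and h.x1 > line.x0]   # and overlaps it horizontally
        if notes:
            merged.append(f"{line.text} [inserted: {' '.join(notes)}]")
        else:
            merged.append(line.text)
    return merged

printed = [Span("Patient denies chest pain.", 0, 100, 400, 120)]
inserted = [Span("occasional", 90, 70, 180, 95)]
print(merge_insertions(printed, inserted))
# ['Patient denies chest pain. [inserted: occasional]']
```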
After the handwritten regions are identified, special image preprocessing techniques are applied to improve the accuracy of handwriting recognition. These techniques are based on continuous offline handwriting recognition using deep learning models. The techniques include:
The results of handwriting recognition with an off-the-shelf simple text recognition library are shown below:
Notice that its handwriting results are not accurate.
In contrast, these are far better results from our fine-tuned, multi-modal, language-image model:
Text extraction consists of the following steps:
Character identification based only on image pixels can be incorrect if the scanned image contains imaging artifacts like missing pixels, noisy text, and blurring, or if the handwriting is poor (a widespread problem in the medical field).
To avoid such problems, state-of-the-art text extraction uses multi-modal language-vision models. They don't identify characters based on just the image pixels. Instead, they use a lot of additional context, such as the surrounding words and language patterns, the position and shape of text blocks on the page, and visual features of the page image.
This additional context drastically improves the accuracy of the text extraction from both printed and handwritten sections.
Below, we explain our state-of-the-art text extraction approaches.
Text extraction and layout understanding are done by our vision-language model that's trained for document layout recognition based on OCR images. Given the image of a page, it identifies layout elements like sections, field-value pairs, or tables and produces their bounding boxes and text.
During training and fine-tuning, it's supplied with word embeddings, image patches, position embeddings, coordinate embeddings, and shape embeddings generated for the training documents. The multi-modal transformer model learns to associate spatial and word patterns in the document text with their appropriate layout elements.
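As a simplified illustration of how these inputs can be combined (in the spirit of LayoutLM-style models, not a description of our production architecture), word, sequence-position, coordinate, and shape embeddings can simply be summed into one input representation per token. The dimensions and coordinate buckets below are arbitrary:

```python
import torch
import torch.nn as nn

class LayoutInputEmbeddings(nn.Module):
    """Sum word, sequence-position, 2D coordinate, and shape embeddings per token.

    Illustrative only: production layout models also fuse image-patch features
    from the page image.
    """
    def __init__(self, vocab_size=30522, hidden=256, max_pos=512, coord_buckets=1024):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_pos, hidden)
        self.x_coord = nn.Embedding(coord_buckets, hidden)   # shared by x0 and x1
        self.y_coord = nn.Embedding(coord_buckets, hidden)   # shared by y0 and y1
        self.width = nn.Embedding(coord_buckets, hidden)
        self.height = nn.Embedding(coord_buckets, hidden)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) as x0, y0, x1, y1 in [0, 1023]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.word(token_ids)
                + self.position(positions)
                + self.x_coord(x0) + self.x_coord(x1)
                + self.y_coord(y0) + self.y_coord(y1)
                + self.width(torch.clamp(x1 - x0, 0, 1023))
                + self.height(torch.clamp(y1 - y0, 0, 1023)))

embeddings = LayoutInputEmbeddings()
tokens = torch.randint(0, 30522, (1, 8))
boxes = torch.randint(0, 1024, (1, 8, 4))
print(embeddings(tokens, boxes).shape)  # torch.Size([1, 8, 256])
```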
A sample run of our model on a patient record is shown below:
In the example above, the model correctly identified all the free-floating name-value fields at the top of the page, including fields like age, which don't have any values.
The model also identified distinct logical sections in the text based on their layout and formatting patterns. These sections are extracted as name-value pairs where the names are section headers and values are the text under the section headers. Sometimes, a paragraph under the previous section is identified as a separate section. However, such mistakes are easily corrected in the post-processing stage.
The text extraction must not only understand the regular text but also special marks like check marks and circles that may be medically relevant as shown below:
Such unusual data outside the purview of regular text extraction is the reason we fine-tune image-language models for our document understanding pipelines. These recognitions can be trained into the image-to-text architectures outlined above to produce a single pipeline.
So far, we have identified the text in the images, identified the layout elements, and extracted the text within each element's boundaries. However, the extracted text fragments may not be directly suitable for patient record use cases. For example, the text in signatures, page numbers, seals, or logos is just noisy, irrelevant information that must be discarded.
Text pre-processing uses deep learning models to ignore superfluous text. We do that by borrowing the approach of language model pre-training over clinical notes for our text extraction model. Basically, we teach it to recognize the text patterns unique to health records and ignore the rest.
We adapt its sequential, hierarchical, and pre-training approaches to our multi-modal text extraction model. It involves fine-tuning the layout understanding model with additional embeddings for medically relevant information like section labels and ICD codes.
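The filtering idea itself is simple. Here's a minimal sketch that assumes each extracted layout element carries a predicted label (the label names are hypothetical); elements recognized as signatures, page numbers, seals, or logos are dropped before anything reaches the summarization stage:

```python
# Hypothetical label names; the point is simply that non-clinical elements
# are filtered out before summarization.
NOISE_LABELS = {"signature", "page_number", "seal", "logo"}

def filter_elements(elements: list[dict]) -> list[dict]:
    """Keep only layout elements that carry medically relevant text."""
    return [el for el in elements if el.get("label") not in NOISE_LABELS]

elements = [
    {"label": "field_value", "name": "Admit Date", "value": "01/12/2021"},
    {"label": "page_number", "value": "Page 3 of 141"},
    {"label": "section", "name": "Medications", "value": "Lisinopril 10 mg daily"},
]
print(filter_elements(elements))
# Only the field_value and section elements remain.
```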
The process laid out above produces extracted information that's structured as JSON data like this:
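The actual output isn't reproduced here, but it has roughly this shape; every name and value below is invented for illustration:

```json
{
  "page": 3,
  "elements": [
    {
      "type": "field_value",
      "name": "Patient Name",
      "value": "Jane Doe",
      "bbox": [112, 88, 340, 110]
    },
    {
      "type": "section",
      "header": "Visit Information",
      "bbox": [80, 140, 1160, 620],
      "children": [
        {
          "type": "section",
          "header": "Admission Information",
          "fields": {
            "Admit Date": "01/12/2021",
            "Admitting Physician": "Dr. A. Smith"
          }
        }
      ]
    }
  ]
}
```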
In this stage, GPT-4 prompts and prompting techniques are used to summarize the extracted information from each relevant section. Since each of the fields we're interested in (names, ICD codes, diagnoses, medications, and dates) has distinct text patterns, we simply prompt GPT-4 to identify which is which and add each entry to the summary timeline.
For example, a patient record with relevant information in different locations is shown below:
The raw, unstructured extracted text without any location hints is shown below:
Despite the lack of location hints, GPT-4 correctly infers the details:
Abstractive summary prompts are useful for isolating the useful information from the rest of the text in a chunk. We use prompts like: "In the medical record fragment below, summarize only the information about names, dates, ICD codes, medications, and diagnoses, and discard all the other text."
Once the abstractive summary is available, extractive prompts enable us to be more clinical. We use prompts like: "Select the ICD codes in this medical record without modifying them."
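Here's a minimal sketch of how those two prompts can be chained, assuming the OpenAI Python client; our production pipeline wraps calls like these in a larger prompting framework:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ABSTRACTIVE_PROMPT = (
    "In the medical record fragment below, summarize only the information about "
    "names, dates, ICD codes, medications, and diagnoses, and discard all the other text.\n\n{chunk}"
)
EXTRACTIVE_PROMPT = (
    "Select the ICD codes in this medical record without modifying them. "
    "Return one code per line.\n\n{summary}"
)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep extraction deterministic
    )
    return response.choices[0].message.content

def summarize_chunk(chunk: str) -> dict:
    # Step 1: the abstractive pass isolates the relevant facts from the noise.
    summary = ask(ABSTRACTIVE_PROMPT.format(chunk=chunk))
    # Step 2: the extractive pass pulls exact ICD codes out of the focused summary.
    icd_codes = ask(EXTRACTIVE_PROMPT.format(summary=summary)).splitlines()
    return {"summary": summary, "icd_codes": [c.strip() for c in icd_codes if c.strip()]}
```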
Prompting techniques like tree-of-thought prompting, chain-of-thought prompting, and planning and executable actions for reasoning over long documents (PEARL) are useful for isolating useful information from unnecessary information in the text.
We use prompting frameworks like these when creating summaries of events that span multiple documents. Many patient record summarization use cases involve building timelines and understanding single events across multiple documents. These plan-and-execute frameworks help us manage the process of creating one goal-state summary based on all the provided documents, whose different dates, times, and locations correlate to a single event.
These frameworks also let us build summarization systems that are dynamic enough to be adjusted to specific focus information. The system takes in specific areas of focus, such as a body region, ICD codes, or specific patient information, and writes summaries that cover just those topics. This lets you write summaries of an entire patient history that focus on specific topics.
Since patient records can be lengthy, running into hundreds of pages, chunking strategies are necessary to break them up. Chunking on section boundaries, summarizing each chunk, and finally recursively summarizing the summaries enables us to create detailed summaries from lengthy patient records.
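A minimal sketch of that recursive summarize-then-summarize-the-summaries pattern is below, reusing the hypothetical ask() helper from the previous sketch and adding the optional focus topics described above. In production, chunk boundaries follow the section boundaries produced by the layout model:

```python
def summarize_sections(section_texts: list[str], focus: str = "") -> str:
    """Map-reduce summarization: summarize each section, then summarize the summaries.

    `focus` optionally narrows the summary to specific topics (e.g. a body
    region or a set of ICD codes), mirroring the focus-driven summaries above.
    """
    focus_clause = f" Focus only on: {focus}." if focus else ""
    partials = [
        ask(f"Summarize the names, dates, ICD codes, medications, and diagnoses "
            f"in this medical record section.{focus_clause}\n\n{text}")
        for text in section_texts
    ]
    combined = "\n\n".join(partials)
    return ask(f"Combine these section summaries into a single chronological "
               f"summary timeline, merging duplicate events.{focus_clause}\n\n{combined}")
```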
In the post-processing stage, we consolidate the summary timelines from all sections and arrange them to get a comprehensive set of health care events related to the patient.
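A minimal sketch of that consolidation step is below, assuming the per-section summaries have already been reduced to dated events; real records need far more robust date parsing and fuzzy de-duplication:

```python
from datetime import datetime

def consolidate_timeline(section_events: list[list[dict]]) -> list[dict]:
    """Merge per-section event lists into one chronological, de-duplicated timeline.

    Each event is assumed to look like {"date": "MM/DD/YYYY", "event": "..."}.
    """
    seen = set()
    merged = []
    for events in section_events:
        for event in events:
            key = (event["date"], event["event"].lower())
            if key not in seen:
                seen.add(key)
                merged.append(event)
    return sorted(merged, key=lambda e: datetime.strptime(e["date"], "%m/%d/%Y"))

timeline = consolidate_timeline([
    [{"date": "01/12/2021", "event": "Admitted; ICD-10 I10 (essential hypertension)"}],
    [{"date": "01/15/2021", "event": "Discharged on Lisinopril 10 mg daily"},
     {"date": "01/12/2021", "event": "Admitted; ICD-10 I10 (essential hypertension)"}],
])
for entry in timeline:
    print(entry["date"], "-", entry["event"])
```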
For this patient record:
The generated summary timeline looks like this:
In this article, you saw our state-of-the-art artificial intelligence approach to creating a consolidated summary timeline of all the health care events recorded in a patient report. Such a summary, spanning years and possibly decades, is extremely helpful for medical professionals, health care administrators, health insurance companies, legal firms, arbitration companies, and courts.
Contact us to find out how we use modern AI techniques to help you incorporate such information summarization workflows in your hospital, lab, or medical practice.