Automate Health Care Information Processing With EMR Data Extraction - Our Workflow
We dive deep into the challenges we face in EMR data extraction and explain the pipelines, techniques, and models we use to solve them.
Document classification is a common task in business as every document has to undergo some business workflow, and sending it the wrong way can be expensive, especially in regulated industries.
In this article, we explore accurate classification algorithms using the latest innovations in deep learning, computer vision, natural language processing (NLP), and machine learning models.
Document classification is a machine learning task to identify the class or type of document. For example, given a large set of scanned documents, your business may need to sort them into invoices, receipts, contracts, pay slips, and expenditure reports. The types of documents are often domain- and task-specific.
A class is determined based on a document’s text content, visual features, or both. Other aspects like metadata, location, or file name may also determine its class.
Two common concepts related to document categorization are:
In the following sections we’ll dive into various automatic document classification techniques.
Since we’ll refer to transformers often in this article, here’s a short overview of them.
Transformers are a family of deep neural networks designed for sequence-to-sequence tasks (e.g., language translation or question-answering). Using techniques like self-attention and multi-head attention blocks, transformers understand the long-range context in a text (or other data) and scale well during training.
A transformer consists of an encoder network or a decoder network or both. You can train and use them separately or as a single end-to-end network. An encoder generates rich representations, called embeddings, from input sequences. A decoder combines an encoder’s embeddings with output from previous steps to generate the next output sequences.
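To make the encoder/embedding idea concrete, here's a minimal sketch of pulling embeddings out of a pre-trained transformer encoder. It assumes the Hugging Face transformers and PyTorch packages; the model name and example sentence are just illustrations.

```python
# Minimal sketch: producing embeddings with a pre-trained transformer encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

text = "Patient discharged with follow-up instructions."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = encoder(**inputs)

# One embedding vector per input token; the vector at the [CLS] position is
# often used as a representation of the whole sequence.
token_embeddings = outputs.last_hidden_state      # shape: (1, seq_len, 768)
sequence_embedding = token_embeddings[:, 0, :]    # the [CLS] position
```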
GPT-3 provides a pre-trained model that lets us combine a task-agnostic, baseline understanding of language (useful for unsupervised document classification, sentiment analysis, summarization, etc.) with a guided understanding of classifying text via few-shot learning and fine-tuning. GPT-3 gives us a few key benefits for long document classification over other architectures.
We’ll use this model as a part of a pipeline for classifying long documents (20+ pages) leveraging our proven long document GPT-3 pipeline. We can also use this few-shot environment to dramatically speed up the manual document classification process required to create training data.
We’ve designed an architecture that allows us to scale our long document classification to longer documents without needing to adjust input size for large variance between different input documents.
The text extraction module is used to extract document text from PDFs, images, and Word documents. While the most common use case is extracting text from unstructured documents where the text simply flows in a natural left to right direction, we can fine-tune this module to extract text in a more structured format based on the type of document. Documents such as legal documents, invoices, and other documents with tabular formats have special positional structures that should be accounted for and the text should be structured.
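Below is a minimal sketch of what such a text extraction module can look like. The library choices (pdfplumber, pytesseract, python-docx) are ours for illustration, not part of the pipeline itself, and the structured extraction for tabular documents mentioned above is not shown here.

```python
# Minimal text extraction sketch for PDFs, images, and Word documents.
import pdfplumber
import pytesseract
from PIL import Image
import docx

def extract_text(path: str) -> str:
    path_lower = path.lower()
    if path_lower.endswith(".pdf"):
        # Digital PDFs: read the embedded text layer page by page.
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if path_lower.endswith((".png", ".jpg", ".jpeg", ".tiff")):
        # Scanned images: run OCR.
        return pytesseract.image_to_string(Image.open(path))
    if path_lower.endswith(".docx"):
        # Word documents: concatenate paragraph text.
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {path}")
```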
The goal of this module is to reduce our document text input size by removing information that is not relevant to classification. There are a few reasons we do this:
This module will be a fine-tuned deep learning algorithm focused on understanding what information in our input text generally has the lowest correlation to the correct class. The more information we remove from the input text, the less future data variance we can cover. The tradeoff is that our downstream classification model can use more few-shot examples and understand our current evaluation dataset better.
Imagine we are reducing 1,600-word chunks to a single sentence theme. It will be harder for future inputs with new data variance to reach the same accuracy considering how much information we had to remove to reach a single sentence in most cases. The tradeoff here is we can include more examples in our classification model and can fit this exact dataset well. I recommend starting much wider in the amount of information we keep in this step when the dataset is small and data variance coverage is low.
Depending on the size of the document after preprocessing, you still might need to chunk it. At a high level, the chunking algorithm breaks the large document text apart into parts that are classified separately, then combined back together using an output algorithm that produces a single combined classification. Chunking algorithms can be as simple as splitting the document into smaller pieces based on a set number of tokens, or as involved as using multiple NLP models to decide where to split the document text based on keywords or context.
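Here's a sketch of the simple, fixed-size variant of chunking. We use tiktoken for token counting as an assumption; the 1,600-token default loosely mirrors the chunk size mentioned earlier and is configurable.

```python
# Simple token-count-based chunking of long document text.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 1600) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        # Decode each token window back to text so it can be classified on its own.
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```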
Prompt optimization allows us to dynamically build our GPT-3 prompts based on the given input text. The idea is to adjust the prompt language and prompt examples used in our prompt based on the input text. The prompt optimization algorithm chooses these based on a trained understanding of what specific prompt examples from a database lead to us having the highest probability of a successful output from GPT-3. This can be based on a keyword match, semantic match, or similar length.
The benefit of this algorithm is that we can provide GPT-3 with information that is relevant to our input text that better shows GPT-3 how to reach a correct classification for similar text. This method has been shown to improve the results of GPT-3 models up to 30% in some tasks! It makes sense that GPT-3 would be able to better classify a basketball box score if the few-shot examples provided are also box scores instead of a static prompt that has a bunch of different text examples.
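One way to implement the semantic-match flavor of prompt optimization is sketched below: embed the input text and the example database, then pick the closest examples. The model name, example database format, and k are illustrative assumptions; a production selector would also weigh keyword overlap and length as described above.

```python
# Sketch: choose few-shot prompt examples by semantic similarity to the input.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(input_text: str, example_db: list[dict], k: int = 3) -> list[dict]:
    """example_db items look like {"text": "...", "label": "..."} (illustrative schema)."""
    input_emb = model.encode(input_text, convert_to_tensor=True)
    example_embs = model.encode([e["text"] for e in example_db], convert_to_tensor=True)
    scores = util.cos_sim(input_emb, example_embs)[0]
    top_idx = scores.argsort(descending=True)[:k]
    # The selected examples get inserted into the prompt ahead of the input text.
    return [example_db[int(i)] for i in top_idx]
```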
Fine-tuning the davinci model is a great way to steer the task-agnostic GPT-3 model toward our document classification task. Fine-tuning does not have a prompt token limit, which means we can provide GPT-3 with a large number of training examples showing how to classify our long documents. Each training example is prompt-sized, which lets us fit much more of our preprocessed document text into each example than we can fit into the runtime prompt alongside everything we've seen so far. It's best practice to make your fine-tuned examples similar to what the model will see at runtime, especially in terms of length, and especially if you're going to forgo the runtime few-shot examples and rely on the fine-tuned model alone.
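For reference, here's a small sketch of preparing fine-tuning data in the legacy prompt/completion JSONL format used for GPT-3 fine-tuning. The separator, label set, and file name are our own illustrative conventions.

```python
# Sketch: build a JSONL file of prompt/completion pairs for GPT-3 fine-tuning.
import json

training_rows = [
    {"text": "…preprocessed document text…", "label": "invoice"},
    {"text": "…preprocessed document text…", "label": "contract"},
]

with open("finetune_data.jsonl", "w") as f:
    for row in training_rows:
        f.write(json.dumps({
            "prompt": row["text"] + "\n\nClass:",
            "completion": " " + row["label"],  # leading space commonly recommended for completions
        }) + "\n")
```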
Fine-tuning won’t provide as much value as prompt optimization until we reach a certain number of training examples per document class. With fine-tuning, no single training example is leveraged heavily by the model when classifying an input document whereas few-shot learning examples in the prompt are the focus of how the model understands completing our task correctly. This is why it's better to use few-shot learning while the dataset has low data variance coverage and doesn’t give deep context for each document class.
We have two main tasks that help us clean up and create our results for classifying these long documents.
Custom-built confidence metrics allow us to put a model confidence value on the outputs of our GPT-3 model. This is an incredibly useful tool for gaining insight into how the model classified the document, and it allows you to perform real-time tasks such as regenerating poor results, incorporating user feedback, and understanding how well your document classification model is performing in production.
Once we've generated classifications for each chunk of the long-form document, we need to combine them into a single document classification. There are a number of ways to do this, and similar to prompt optimization, there isn't one best fit for every use case. The most common approach is to simply choose the class that was chosen for the most chunks and return a confidence value based on how many of the chunks chose it. A more complex version is to weigh each chunk's confidence score and its size relative to the entire document, so that larger chunks count for more in the final result.
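Here's a minimal sketch of that weighted combination step. The field names are illustrative; it assumes each chunk result carries a label, a confidence, and a token count.

```python
# Combine per-chunk classifications into one document-level classification,
# weighting each chunk's vote by its confidence and its share of the document.
from collections import defaultdict

def combine_chunk_results(chunk_results: list[dict]) -> tuple[str, float]:
    """chunk_results items look like
    {"label": "contract", "confidence": 0.91, "num_tokens": 1450}."""
    total_tokens = sum(c["num_tokens"] for c in chunk_results)
    scores = defaultdict(float)
    for c in chunk_results:
        weight = c["confidence"] * (c["num_tokens"] / total_tokens)
        scores[c["label"]] += weight
    best_label = max(scores, key=scores.get)
    # Normalize so the returned value behaves like a document-level confidence.
    return best_label, scores[best_label] / sum(scores.values())
```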
In this section, we explore bidirectional encoder representations from transformers (BERT) for document classification tasks.
BERT is best thought of as an approach for training transformer encoders for language tasks. It proposes special training-time techniques like masked language modeling and next-sentence prediction.
These techniques turn an encoder into a versatile pre-trained language representation model that you can quickly fine-tune for specific language tasks.
The BERT research paper also provided two transformer encoders that were trained using this methodology, called BERT-large and BERT-base. Depending on the context, BERT may refer to either the approach or the pre-trained models.
BERT does not propose any new network architecture and just reuses the original transformer encoder architecture. Its capabilities stem from its training strategies. The two pre-trained BERT models use the same building blocks but differ in scale: BERT-base has 12 encoder layers, 768-dimensional hidden states, and about 110 million parameters, while BERT-large has 24 layers, 1,024-dimensional hidden states, and about 340 million parameters.
The DocBERT model fine-tunes both BERT models for document classification. It does this by attaching a simple, fully connected, softmax layer that reports the probabilities of classes for an input embedding.
The input to the softmax is the final hidden state corresponding to the [CLS] input token that marks the start of a sequence. This hidden state acts as a latent representation of the input sequence, making it useful for classification tasks.
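A minimal sketch of that setup is shown below: a pre-trained BERT encoder with a fully connected softmax layer over the final [CLS] hidden state. The class count, model name, and example input are illustrative assumptions, not the DocBERT authors' exact configuration.

```python
# DocBERT-style classifier sketch: BERT encoder + softmax layer on [CLS].
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DocClassifier(nn.Module):
    def __init__(self, num_classes: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = hidden.last_hidden_state[:, 0, :]   # hidden state at the [CLS] token
        return torch.softmax(self.classifier(cls_state), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DocClassifier(num_classes=4)
batch = tokenizer(["Invoice #123 for services rendered"], return_tensors="pt",
                  truncation=True, padding=True)
probs = model(batch["input_ids"], batch["attention_mask"])  # class probabilities
```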
This classification model (transformer encoder + softmax) is fine-tuned end-to-end on training datasets like Reuters-21578, the arXiv Academic Paper dataset (AAPD), IMDB movie reviews, and Yelp 2014 reviews.
Although both BERT models achieved the best F1-scores on all four datasets, their enormous sizes made them expensive and slow for both fine-tuning and inference. The next best model achieved comparable F1-scores (usually within 3-4% of both BERTs) and inferred 40x faster with less than 4 million parameters. The inefficiency of the BERT models is unacceptable for many use cases.
Can BERT’s awesome capability be transferred to a lighter model to achieve both high accuracy and performance? The DocBERT paper explores algorithms like knowledge distillation (KD) to transfer DocBERT-large’s capability to a lightweight bidirectional long short-term memory (BiLSTM) network.
The BiLSTM is first trained normally on a labeled dataset. DocBERT-large is also fine-tuned on the same dataset. Since the latter’s F1-scores are higher, it’s designated as the teacher and the BiLSTM as the student.
Next, a transfer dataset is created, and the class probabilities inferred by DocBERT on it are set as soft targets for the student. The student aims to fine-tune its trained weights on this transfer dataset so that it matches the teacher’s class probabilities with the least error.
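The training objective behind this can be sketched as below: the student's predictions are pulled toward the teacher's soft targets, optionally mixed with the usual hard-label loss. Using MSE on logits is one common choice for this kind of distillation; KL divergence on softened probabilities is another. The weighting factor alpha is an illustrative knob, not a value from the paper.

```python
# Sketch of a knowledge distillation loss for the BiLSTM student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None, alpha=0.5):
    # Soft-target term: match the teacher's outputs on the transfer set.
    soft_loss = F.mse_loss(student_logits, teacher_logits)
    if labels is None:
        return soft_loss
    # Optional hard-label term when ground-truth labels are available.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```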
Using this technique, the KD-BiLSTM improved on its own baseline scores and got close to DocBERT-base’s scores while being 25x smaller and 40x faster than it!
The discussion so far gives the impression that BERT is only for language tasks. But that’s not true, and BERT has been used for tasks that combine computer vision and NLP. One such application is visual document understanding that replicates human understanding of complex documents like invoices, contracts, or court records.
Document classification using both visual and linguistic information is often needed. For example, process automation may have to classify documents to send them to different business workflows.
LayoutLM is a visual document understanding model that combines BERT pre-training with visual aspects of text blocks. Both aspects are combined as embeddings to the BERT encoder. The classification layer attached to it learns to identify the document using both visual and textual aspects, just like people do.
In this section, we explore techniques to overcome the limitations of pre-trained BERT models when processing long documents.
A drawback of all transformer models is that the cost of self-attention grows quadratically with sequence length. That's why the pre-trained BERT models cap their input sequence length at 512 tokens and truncate everything beyond it: longer documents would require quadratically more computation.
Another drawback is with the positional encoding scheme that blends position information into the input embeddings. It’s trained only for sequences under 512 items. For longer documents, it has to be retrained. So, in practice, 512 has become a hard limit of the BERT models.
Long documents like legal agreements or business plans have multiple sections. Reviewers may need high-level labels like “warning” or “safe” to help them focus on the critical sections.
Many text classification tasks like sentiment analysis may apply different labels to different sections in the same document. But this isn’t possible using BERT.
Hierarchical transformers solve this with a simple algorithm: split the long document into segments that fit within BERT's input limit, encode each segment with BERT to get a segment embedding, and then feed the sequence of segment embeddings into a second, smaller model that produces the document-level prediction.
For their experiments, the authors use the smaller BERT-base model for efficiency. When the second-level model is a small LSTM that produces 100-dimensional document embeddings, the setup is called RoBERT, for recurrence over BERT. When it's a small transformer with just two transformer blocks, it's called ToBERT, for transformer over BERT.
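A rough RoBERT-style sketch is shown below: encode each chunk with BERT, then run an LSTM over the sequence of chunk embeddings to produce one document prediction. The 100-dimensional document embedding follows the description above; everything else (model name, pooling choice) is an illustrative assumption.

```python
# Hierarchical "recurrence over BERT" sketch for long document classification.
import torch
import torch.nn as nn
from transformers import AutoModel

class RecurrenceOverBert(nn.Module):
    def __init__(self, num_classes: int, doc_dim: int = 100):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, doc_dim, batch_first=True)
        self.classifier = nn.Linear(doc_dim, num_classes)

    def forward(self, chunk_input_ids, chunk_attention_mask):
        # chunk_input_ids: (num_chunks, seq_len) for a single document.
        out = self.encoder(input_ids=chunk_input_ids, attention_mask=chunk_attention_mask)
        chunk_embs = out.last_hidden_state[:, 0, :]        # one [CLS] embedding per chunk
        _, (h_n, _) = self.lstm(chunk_embs.unsqueeze(0))   # treat chunks as a sequence
        return self.classifier(h_n[-1])                    # (1, num_classes) document logits
```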
Both RoBERT and ToBERT are fine-tuned on three text classification datasets:
Arranging sequential networks in a hierarchy allows them to overcome their sequence length limits. The BERT-based models scored higher accuracies on some datasets than support vector machine (SVM) and convolutional models.
Service level and other legal agreements can run into dozens of pages filled with dense legalese. To help reviewers save time, you can summarize their contents.
For higher confidence that nothing critical is being missed out, you can also run topic identification and sentiment analysis on each section and show them as section labels using hierarchical transformers. They help reviewers focus on the most critical portions of such documents.
Real-world document processing can be quite messy. Using a mortgage industry case study, we’ll see the type of problems that crop up in paperwork-intensive industries and explore how document classification solves them.
A loan audit involves reviewing a set of documents called the loan document package. A typical package can have hundreds of scanned pages like land titles, identity documents, income documents, signed declarations, and more. The pages are supposed to be arranged in a particular order to make them easy to process.
But in reality, they are often haphazardly grouped. Identifying and grouping loan documents is a major bottleneck for banks and mortgage companies. So they rely on business process outsourcing to automate some of it and complete the rest manually. But because the documents can be complex, mistakes aren’t uncommon. This raises the costs and time required for processing.
Some semi-automated techniques are in use out there but are largely unsatisfactory. Using document templates for parsing the information can be faulty and laborious. Custom rule-based pipelines fail when they run into edge cases. Once a pipeline makes mistakes, people stop trusting the entire pipeline and revert to manual verification.
The industry needs automated solutions that can robustly and reliably process most documents with little human involvement.
A clever solution to the problem is identifying just the typical starting and ending pages of each document type. They often have very unique layouts that are easily identified. Each document type will have two classes — “[type]-start” and “[type]-end.” If a page isn’t one of these start or end classes, then you just classify it as “other.”
Each scanned page is processed by an optical character recognition (OCR) engine to extract its text. Any unwanted text is discarded.
The page text is then processed by a doc2vec machine learning algorithm to produce a dense feature vector that represents all the text and its patterns on that page.
Using the feature vector, a logistic regression machine learning model infers the class of that page along with confidence scores and other metrics.
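Here's a compact sketch of those two steps together, doc2vec features feeding a logistic regression classifier. The training pages, vector size, and class names are placeholders; a real system would train on thousands of labeled pages.

```python
# Sketch: doc2vec page embeddings + logistic regression page classifier.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

pages = ["deed of trust this indenture made ...", "promissory note ..."]  # OCR'd page text
labels = ["title-start", "note-start"]

corpus = [TaggedDocument(words=p.split(), tags=[i]) for i, p in enumerate(pages)]
d2v = Doc2Vec(corpus, vector_size=100, epochs=20, min_count=1)

X = [d2v.infer_vector(p.split()) for p in pages]
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Classify a new scanned page and report a confidence score.
new_page = "this note is secured by a deed of trust ..."
probs = clf.predict_proba([d2v.infer_vector(new_page.split())])[0]
predicted_class = clf.classes_[probs.argmax()]
confidence = probs.max()
```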
Logical rules are applied to the model’s output sequence to check if all pages of a document type are together. A pipeline like this reduces human effort considerably and the remaining edge cases can be easily managed by the outsourced staff.
As we've seen above, the two main constraints in document classification are a high number of classes (especially relative to the number of examples we have per class) and a shortage of training samples for learning the differences between classes. It's no surprise that these two constraints combine to cause issues in production-level document classification systems.
We're going to look at a document classification pipeline that allows us to classify documents into a large number of classes in a constrained, zero-shot data environment.
Data labeling for documents can be an extremely expensive and time-consuming task. Industries that require specialized expertise to review documents, such as legal or financial services, are even more expensive given the cost of those experts. Labeling documents into a single class out of a large set of classes (a high-class-count environment) is extremely time-consuming, as many classes in this environment are very similar to each other, and small details in the documents are what create the variation.
Zero-shot classification is the perfect solution for working around these constraints. True zero-shot classification requires no fine-tuning and no prompt examples to guide the model to a correct output. We provide the available classes, the input document we want to classify, and the prompt instructions to perform the task. Since we are not providing prompt examples, the instructions we use are very important, as they are the only information that helps the task-agnostic GPT-3 model understand how to correctly complete our task.
Zero-shot learning methods are also a great way to speed up the data labeling process for a future fine-tuned or few-shot model. Those models will almost certainly be more accurate than the zero-shot solution in the long term. If we know that our zero-shot model is 80% accurate, we can use it to put a label on all the documents we want to use in the future and have a manual reviewer quickly check them before training. This assisted review is much more efficient than a full manual review.
Let's look at an example of an NLP pipeline that leverages a GPT-3 model to perform zero-shot classification.
This pipeline focuses on extracting information from documents in a format that provides us the most contextual information relative to the classes we have available. From there the prompt language and instructions are critical to be able to form relationships between document text and classes with little prior understanding of the relationship.
The key step in this pipeline is extracting our text in a format that gives GPT-3 an idea of what information is important. It doesn't make much sense to extract headers, body text, abstracts, and other common document fields as the same unstructured text, given that different text clearly has varying value depending on how it's used in the document. Downstream, we can tag important information in ways that tell GPT-3 that this text was more valuable in the document. This idea is the same as what you can do with tags such as <h1> when using marketing copy as a variable in a GPT-3 prompt (as seen here).
There are a ton of pre-trained architectures you can leverage to extract text from documents in a better format, which helps zero-shot GPT-3 understand what information is valuable in the provided document. Resources like the Kleister-NDA dataset and its baseline models let you extract key entities from legal documents and start putting tags around key information without needing to fine-tune a model yourself.
If you're willing to fine-tune this text extraction module for better accuracy, architectures like LayoutLMv2 are perfect for this document understanding task. LayoutLMv2 integrates a spatial-aware self-attention mechanism into the Transformer architecture, which allows the model to understand relative positional relationships among blocks of text.
The goal of this module is to prepare our extracted document text in a format that helps GPT-3 classify it, by reducing the amount of irrelevant information in the document text and applying tags based on the relationships learned in the previous step. The preprocessing tools range from simple ones, such as removing stop words and fixing grammar, to complex summarization algorithms that create large extractive summaries retaining most of the document.
A good zero-shot GPT-3 prompt has a few key features that allow us to turn this task-agnostic model into a document classification model.
Prompt instructions are used to give GPT-3 clear guidance on how to complete the task. This allows us to steer the task-agnostic GPT-3 model toward our classification task and provide key information that helps differentiate classes, such as which variables to focus on, what information in the document should be deemed valuable, and any other rules we believe are important.
Prompt language is used to provide GPT-3 context around what text is being used in the prompt. This can be variables, rules, or even tags that structure the information a bit more than you would otherwise have. We write Python code in this step that creates this prompt language and automatically adds it to our prompt when building the layout.
During the development process it’s best practice to split test a number of prompt language combinations with varying levels of granularity. Granularity means how specific you are when explaining what your input is. The risk with more granular prompt language is that it might not be completely correct across our entire data variance.
In the example above, I say “various text sources,” which is less granular than saying something like “from blog posts,” which in turn is less granular than “from blog post titles and abstracts.” But if our dataset contains text from blogs, research articles, and reports, the more granular wording would not line up well with the language differences across those sources. GPT-3 would then try to apply the same rules across different types of text because our prompt language said it was all the same.
The prompt language also includes the classes we want to use for classifying this document text. We can set up a prompt variable that simply lists the available classes in the prompt. I recommend providing a bit of context about what each class entails, considering we're using a zero-shot environment and it's already difficult to correlate the input documents to classes. This can be as simple as a short description of the class alongside the keyword. We've seen that this extra information can go a long way in classification, and we've used it for use cases like classifying products to the Google Product Taxonomy by leaving the upstream categories in when laying out what classes are available. It's much easier to correctly categorize “Apparel & Accessories > Costumes & Accessories > Masks” than just “Masks.”
The text that was extracted and preprocessed from the previous steps is added to our prompt. In some use cases this can be from multiple sources.
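Putting the pieces together, here's a sketch of how a zero-shot prompt can be assembled from instructions, a class list with short descriptions, and the preprocessed document text. The class names, descriptions, and wording are illustrative, not our production prompt.

```python
# Sketch: assemble a zero-shot document classification prompt for GPT-3.
CLASSES = {
    "Master Service Agreement": "governs the overall relationship between two companies",
    "Statement of Work": "describes deliverables and timelines for a specific project",
    "Non-Disclosure Agreement": "restricts sharing of confidential information",
}

def build_prompt(document_text: str) -> str:
    class_lines = "\n".join(f"- {name}: {desc}" for name, desc in CLASSES.items())
    return (
        "Classify the legal document below into exactly one of the available classes. "
        "Focus on the document's purpose and key obligations, not its length or formatting.\n\n"
        f"Available classes:\n{class_lines}\n\n"
        f"Document text:\n{document_text}\n\n"
        "Class:"
    )
```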
Now that we've created our zero-shot prompt for high-class-count document classification, we can run it through GPT-3. If we have a fine-tuned model, we can leverage that instead of the base model. It might sound odd to talk about fine-tuned models when we're constrained to zero-shot, considering we normally rely on zero-shot precisely because we don't have enough data for few-shot learning or fine-tuning. But there are a number of ways we can still leverage fine-tuning to increase our accuracy.
Using an existing document classification dataset to create a fine-tuned model can actually increase our accuracy in a different document classification use case with different classes. If the input documents are relevant and can be fed to GPT-3 in the same format we can leverage them as a way to show GPT-3 how to accomplish a similar task. This is a great way to get your zero-shot prompt off the ground and give the task agnostic model more of an understanding of your specific task in a transfer learning type setup.
A new method proposed by Google focuses on fine-tuning language models on various tasks phrased as instructions and then evaluating them on unseen tasks. The fine-tuning uses a number of different setups (zero-shot, few-shot, CoT), which allows for better generalization to these unseen tasks. This is a great way to give GPT-3 a better understanding of how task-specific instructions map to outputs, and the fine-tuning dataset includes a large number of classification use cases.
The fine-tuning collection spans 1,836 different tasks.
In the post processing stage, we can generate confidence metrics focused on understanding how confident GPT-3 is in the class that was chosen. We leverage the logprobs that are generated for each token and a custom algorithm that understands the correlation between logprobs and the model’s confidence in the output.
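One simple way to turn per-token logprobs for the generated class into a confidence score is the geometric mean of the token probabilities, sketched below. Our production metric is a custom algorithm; this only illustrates the basic idea.

```python
# Turn the logprobs of the generated class tokens into a rough confidence score.
import math

def class_confidence(token_logprobs: list[float]) -> float:
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)   # geometric mean of token probabilities, in [0, 1]

# Example: logprobs returned for the tokens of the chosen class label.
print(class_confidence([-0.05, -0.20, -0.10]))   # ≈ 0.89
```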
In this article, you explored some advanced document classification techniques invented to solve the real-world problems most industries face. Width.ai builds custom document processing solutions for use cases (just like these!) that you can leverage internally or as a part of your product. Schedule a call today and let's talk about whether document processing software is right for you. Contact us!