Blended Summarization & Dynamic NER for Master Service Agreements with GPT-3
How we built a multi-task GPT-3 pipeline that turns long master service agreements into a blended summary plus dynamically extracted entities.
Most document processing tasks that use information extraction to automate a previously human-focused process are not black and white. Successful information extraction from these documents isn't strictly an abstractive or extractive summary, named entities, or one-word topics. While research and pretrained models follow those clean categories, production systems typically want a blended mix that best fits the use case and the exact input document.
We're going to look at how we built a state-of-the-art NLP pipeline for blended summarization and NER to process master service agreements (MSAs), varying the outputs based on the input document and what is deemed important information. This is a mix of the popular NER and long-form summarization work we've done in the legal document domain, and it leverages our key knowledge of production-level GPT-3 pipelines. You can read about our legal document cover sheet processing and long text summarization.
The goal of the pipeline is to take in a master service agreement document and extract both summaries of the information and any specific points worth noting about the two parties. The readable output makes reviewing the document much quicker through the generated summaries, and it provides specific extracted named entities to store in a database for reference. These are the two most common ways of processing long-form legal documents to solve a downstream use case. Let's quickly look at the key reasons why the blended nature of the summarization and dynamic NER helps solve these problems.
Most summarization work today focuses on solving very black-and-white versions of abstractive and extractive summarization for any input use case. Real-world use cases that are actually beneficial are some blended version of the two. The problem most companies run into is that pretrained models and datasets are set up as purely abstractive or purely extractive, which can lead to less-than-stellar results in the customer's eyes when they want just a small tweak toward the other side.
GPT-3's ability to leverage small amounts of data to learn a relationship on the fly lets us get blended summarization off the ground quicker, make adjustments to the blend more easily, and leverage techniques like prompt optimization to create the flexibility this task requires.
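To make that concrete, here's a minimal sketch of what a few-shot prompt for blended summarization can look like. The clause text, instruction wording, and function name are illustrative, not the production prompt.

```python
# A hypothetical few-shot example: the summary paraphrases general context
# (abstractive) but copies exact figures verbatim (extractive).
FEW_SHOT_EXAMPLE = """Clause text:
The Service Provider shall invoice the Client monthly. Payment is due within
30 days of receipt. Late payments accrue interest at 1.5% per month.

Summary:
Invoicing is monthly with payment due within 30 days; late payments accrue
interest at 1.5% per month."""


def build_blended_prompt(document_chunk: str) -> str:
    """Prepend a worked example so GPT-3 learns the desired blend on the fly."""
    return (
        "Summarize the clause below. Paraphrase general context, but copy "
        "exact figures, dates, and party names verbatim.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\nClause text:\n{document_chunk}\n\nSummary:\n"
    )
```

Shifting the blend toward more extractive or more abstractive output is then an edit to the instruction line or the worked example rather than a retraining run.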
Standard named entity recognition focuses on defining a set of entities we want to extract from documents, such as counterparty, address, service provider, and price, and then finding them in the documents provided. We must define all entities we wish to extract up front and label them in each document. If we want the model to find a new entity, we have to go back, label examples of it, and retrain the entire model.
What happens if the set of entities we care about is flexible, depending on the information available in a given example? What if the information we care about surrounding a specific topic or entity changes between documents? With standard NER we'd have to outline everything up front and account for any additional information we could ever want to extract. If a new entity isn't very common, or is very different from the other entities, it can be very hard to get the NER model to pick up on this new field.
Our multi-task model learns what an entity is instead of what information leads to a given entity. These entities are then recognized and extracted just like with a NER model and become either part of the summary or their own separate output, depending on the use case and preference.
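The contrast is easiest to see side by side. Below is a rough sketch, with spaCy's off-the-shelf model standing in for the fixed-schema approach; the dynamic instruction text is illustrative, not our production prompt.

```python
import spacy

# Fixed-schema NER: the label set is baked into the trained model, so adding
# a field means relabeling data and retraining.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp will pay Widget LLC $45,000 by March 1, 2023.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Acme Corp', 'ORG'), ('Widget LLC', 'ORG'), ('$45,000', 'MONEY'), ...]

# Dynamic approach: describe what an entity *is* and let the model decide
# which fields are worth extracting for this particular document.
DYNAMIC_INSTRUCTION = (
    "List every specific, material fact in the clause below as 'Field: Value' "
    "pairs, inventing field names as needed."
)
```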
Let’s take a look at the essential parts of the pipeline and evaluate how they play a role in blended summarization with a multi-task NLP model.
Here we leverage a piece of our state-of-the-art document pipeline to extract the text and understand the positioning of specific text. We use this same pipeline for document processing tasks such as invoices and resumes, where we want to extract exact fields that are often split by headers.
This pipeline already contains high-powered OCR built for tough image scans of documents, along with a deep learning framework for understanding the positioning of key text, which in other use cases helps us map fields. Here we use it to recognize clause headers that help us keep specific context together. Considering the level of input variance and noise this pipeline is built to handle, these scanned MSA PDFs are a breeze.
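As a simplified stand-in for that step, here's a sketch of OCR plus clause-header detection. It assumes pytesseract and pdf2image are available, and uses a regex heuristic in place of the learned layout model described above.

```python
import re

import pytesseract
from pdf2image import convert_from_path

# Heuristic stand-in for the layout model: numbered, title-cased lines such
# as "7. Termination" or "7.2 Effect of Termination" are treated as headers.
HEADER_RE = re.compile(r"^\s*\d+(\.\d+)*\.?\s+[A-Z][A-Za-z &]+$")


def extract_lines(pdf_path: str):
    """OCR each page and yield (page_number, line_text, is_clause_header)."""
    for page_no, image in enumerate(convert_from_path(pdf_path), start=1):
        for line in pytesseract.image_to_string(image).splitlines():
            if line.strip():
                yield page_no, line.strip(), bool(HEADER_RE.match(line))
```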
As I've written about countless times in the other GPT-3 articles on Width.ai, the chunking algorithm is one of the two most important parts of a production-level GPT-3 system for long-form input tasks. This algorithm focuses on splitting the long MSA documents into smaller chunks while doing a few important things:
1. Making sure context isn't split between chunks. Split context often leads to the summary containing too much information about a single topic, or to the topic being dropped from the summary completely. This is even more important now, as lost context can also affect the dynamic NER part of our model.
2. Providing some standardization of the size of the input chunks sent to GPT-3.
3. Keeping costs down on longer documents.
4. Dynamically adjusting chunk size based on the number of pages in the document.
5. Using the previously recognized clause headers as guidance for part of the chunking algorithm.
We built a custom chunking algorithm for this task while leveraging models we've used for large summarization tasks, such as topic extraction from two-hour financial interviews.
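The clause-aware packing step can be sketched as follows, assuming the document has already been segmented at the recognized clause headers. The token budget and tiktoken encoding are illustrative choices, not the production values.

```python
import tiktoken

enc = tiktoken.get_encoding("p50k_base")  # a GPT-3-era encoding, chosen for illustration


def chunk_by_clause(clauses: list[str], max_tokens: int = 1500) -> list[str]:
    """Pack whole clauses into chunks so no clause's context is split.

    In production the budget would also scale with the document's page
    count; a fixed default is used here for simplicity.
    """
    chunks, current, current_len = [], [], 0
    for clause in clauses:
        n = len(enc.encode(clause))
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(clause)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```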
We've built a custom multi-task GPT-3 pipeline that goes from the input master service agreement text to a generated output containing a blended summary and any entities the model deems valuable. The summary varies in length based on the amount of key information and the overall length of the input, and it mixes abstractive and extractive summarization based on the key information in the input document. The number of entities varies as well, based on the amount of exact information worth noting.
The key to this GPT-3 pipeline's success is the work done on prompt optimization and fine-tuning. Prompt optimization is a custom framework we build for all of our GPT-3 products that dynamically constructs the prompt at runtime based on the input. This allows us to dynamically put relevant information in front of our input and help guide GPT-3 to a solid generation. This algorithm has been shown to increase GPT-3's accuracy by over 30% on many tasks and is a key component in increasing data variance coverage in production.
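One common way to implement this kind of runtime prompt construction is embedding-based example selection: retrieve the labeled examples most similar to the incoming chunk and prepend them. Here's a minimal sketch using the pre-1.0 OpenAI SDK of the GPT-3 era; the example-bank structure and model name are assumptions.

```python
import numpy as np
import openai  # pre-1.0 SDK; assumes openai.api_key is set


def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])


def build_dynamic_prompt(chunk: str, example_bank: list[dict], k: int = 3) -> str:
    """Prepend the k labeled examples most similar to the incoming chunk.

    Each example_bank entry is assumed to hold precomputed 'embedding',
    'input', and 'output' fields.
    """
    q = embed(chunk)

    def cosine(ex: dict) -> float:
        e = ex["embedding"]
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))

    shots = sorted(example_bank, key=cosine, reverse=True)[:k]
    demo = "\n\n".join(f"{ex['input']}\n{ex['output']}" for ex in shots)
    return f"{demo}\n\n{chunk}\n"
```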
Fine-tuning allows us to provide initial "steering" to GPT-3 for what a successful output looks like. We've fine-tuned the baseline davinci model on a number of examples to give our model baseline context for the task at hand. This becomes especially important when you consider we are asking GPT-3 to perform two tasks at once.
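For reference, legacy GPT-3 fine-tunes took JSONL prompt/completion pairs. The record below is illustrative: the separator, stop token, and output format are assumptions about how a blended summary plus entities could be encoded as one completion.

```python
import json

# One hypothetical training record: the completion carries both tasks in a
# single generation, which downstream post-processing later splits apart.
record = {
    "prompt": "<MSA chunk text>\n\n###\n\n",
    "completion": (
        " Summary: <blended summary text>\n"
        "Entities:\n"
        "Contract Value: $45,000\n"
        "Payment Terms: Net 30\n"
        "\nEND"
    ),
}

with open("msa_finetune.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

# Legacy CLI: openai api fine_tunes.create -t msa_finetune.jsonl -m davinci
```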
The goal of this summary combination module is to combine our smaller chunk summaries into a new summary that condenses the information. This process not only produces a more concise summary but also removes duplicate information. It is a trained model that provides much higher-quality output than simply concatenating the summaries together, and it allows the end user to tweak the size of the final summary.
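The production combiner is a trained model, but a single-prompt sketch conveys the idea; text-davinci-003 and the word-target parameter are illustrative stand-ins.

```python
import openai  # pre-1.0 SDK; assumes openai.api_key is set


def combine_summaries(chunk_summaries: list[str], target_words: int = 150) -> str:
    """Condense chunk-level summaries into one, merging duplicate points."""
    prompt = (
        "Combine the partial summaries below into one summary of about "
        f"{target_words} words. Merge duplicate points and keep all exact "
        "figures and named entities.\n\n"
        + "\n\n".join(chunk_summaries)
        + "\n\nCombined summary:\n"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=400, temperature=0.2
    )
    return resp["choices"][0]["text"].strip()
```

In a setup like this, the target length becomes a single parameter the end user can tweak.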
The final piece of our multi-task NLP pipeline is a bit of output cleanup and product improvement. Considering the output from GPT-3 includes both our blended summary and named entities as a single generation, we built simple post-processing (sketched after the list) to do the following:
1. Split the two tasks apart
2. Convert the string NER fields into key-value pairs
3. Clean up any bad characters
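Here's a minimal sketch of those three steps, assuming the generation uses the "Entities:" marker and "Field: Value" lines from the hypothetical fine-tuning format above.

```python
import re


def postprocess(generation: str) -> tuple[str, dict]:
    """Split one GPT-3 generation into (blended summary, entity dict)."""
    # 1. Split the two tasks apart at the assumed marker.
    summary_part, _, entity_part = generation.partition("Entities:")
    # 2. Convert 'Field: Value' lines into key-value pairs.
    entities = {}
    for line in entity_part.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            entities[key.strip()] = value.strip()
    # 3. Bad-character cleanup: drop control/non-printable characters.
    summary = re.sub(r"[^\x20-\x7E\n]", "", summary_part).strip()
    return summary, entities
```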
Confidence metrics let us put a model-confidence value on the outputs of the GPT-3 models in real time. This is incredibly useful: it lets you regenerate poor results, incorporate user feedback, analyze and optimize models, and understand exactly how well your model is performing in production. It's a custom model that evaluates the output relative to high-quality token sequences for the same use case.
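The production metric is a custom model, but a simple, widely used baseline signal is the average token log-probability of the generation, which the legacy Completion API exposes via logprobs. A sketch:

```python
import math

import openai  # pre-1.0 SDK; assumes openai.api_key is set


def complete_with_confidence(prompt: str) -> tuple[str, float]:
    """Return the generation and a rough confidence in (0, 1]: the
    geometric mean of the generated tokens' probabilities."""
    resp = openai.Completion.create(
        model="text-davinci-003",  # stand-in for the fine-tuned model
        prompt=prompt,
        max_tokens=400,
        temperature=0.2,
        logprobs=1,
    )
    choice = resp["choices"][0]
    lps = choice["logprobs"]["token_logprobs"]
    confidence = math.exp(sum(lps) / len(lps))
    return choice["text"].strip(), confidence
```

Low-confidence outputs can then be flagged for regeneration or human review.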
We ended up with a document processing pipeline that processes master service agreements into a blended information summary and dynamic named entities. Most of the training master service agreements are between 20 and 30 pages, but we've processed much longer agreements into longer summaries and more entities.
There are a few immediate results that we achieved with this pipeline that are worth noting:
1. Reduction in service agreement review time. The goal of the blended summary and entities is to provide a lawyer or reviewer with all the important information they would want to see when reviewing a master service agreement. This means a broader and more abstractive summary of information that is just worth mentioning, plus the ability to include exact information for key points. The blended summary above includes both of those and two named entities we would want to have when reviewing any service agreement. A top-level review of a master service agreement might take an hour, whereas generating this blended summary takes less than 30 seconds. With the hourly rate of a lawyer in the US being around $300, this pipeline is an awesome time and cost saver.
2. Quickly update an internal client database. These named entities can be extracted and stored in a contract review database for easy reference. Fields that appear in any MSA such as company name, contract value, payment terms, and others can be automatically stored away while the blended summary information is manually reviewed for risk.
3. Pipeline allows for easy summary optimization. As mentioned before, production-level summarization is almost never black and white between extractive and abstractive. A "goal" summary that provides valuable information to a customer is normally some mix of the two, based on the exact document that is fed in. This makes it very difficult to evaluate the multi-task output using machine-based evaluation metrics. The accuracy of an output is relative to the reader: which information they feel should be included in the summary and how it should be presented. This evaluation is critical to iterating on the pipeline and making improvements in the right direction for what the customer believes is a better summary. This pipeline allows for easy human-based evaluation and iteration through prompt optimization for the GPT-3 models. We use human feedback on outputs both during development and in production so the models are constantly improving.
4. Base pipeline can be retrained for different document types. One of the long-term results we achieved is a pipeline developed in a way that can easily be retrained for document types other than MSAs. As long as the inputs and outputs stay the same, the middle models can easily be fine-tuned on the new document set.
A beautiful summary and NER of a few termination agreement clauses.
Schedule a call today to take a look at how we can process long-form documents into summaries, entities, or both! We've built these same pipelines for input documents up to 100 pages long and for a number of legal use cases. You can get started quickly by leveraging one of these pipelines to automate legal document processing or any other long-form input.