A resume writing company was looking to build a resume parser that lets customers quickly upload their existing resume so the system can ingest any existing user data for creating new resumes or feeding other downstream AI systems, such as the content generation tool we also built. The system needs to handle thousands of resumes a day and process each resume in real time given the workflow the tool is used in.
Resumes come in a huge variety of formats. Each one can have a different order, a different number of columns, a variety of fields, and even multiple pages. Format and reading order matter because the text doesn’t always read left to right, and the extracted word order can make little sense. This means the simple approach of extracting the text with OCR and passing it to an LLM or NER model isn’t a great idea, since the text reads awkwardly once extracted. Sure, we could train an NLP model on the extracted text in all the different formats and try to find a relationship in the awkward word order, but this scales poorly given the column-based layout of resumes.
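To make the reading-order problem concrete, here is a minimal sketch of column-aware ordering over positional OCR output. The word-box format and the column bounds are assumptions for illustration, not our production pipeline.

```python
# Minimal sketch: reorder positional OCR output column-by-column instead of
# strictly left-to-right. Word boxes are (text, x0, y0, x1, y1) tuples; the
# column boundaries are assumed to be known or detected upstream.

def reorder_words(words, column_bounds):
    """Group word boxes into columns, then read each column top-to-bottom."""
    columns = [[] for _ in column_bounds]
    for word in words:
        text, x0, y0, x1, y1 = word
        center_x = (x0 + x1) / 2
        # Assign the word to the first column whose x-range contains its center.
        for i, (left, right) in enumerate(column_bounds):
            if left <= center_x <= right:
                columns[i].append(word)
                break
    ordered = []
    for col in columns:
        # Within a column, sort by vertical position, then horizontal.
        col.sort(key=lambda w: (w[2], w[1]))
        ordered.extend(col)
    return " ".join(w[0] for w in ordered)


# Two-column toy example: naive left-to-right reading would interleave the
# "SKILLS" and "EXPERIENCE" lines; column-aware ordering keeps them separate.
words = [
    ("SKILLS", 10, 10, 80, 25), ("EXPERIENCE", 310, 10, 420, 25),
    ("Python", 10, 40, 70, 55), ("Acme", 310, 40, 360, 55),
]
print(reorder_words(words, column_bounds=[(0, 300), (300, 600)]))
# -> "SKILLS Python EXPERIENCE Acme"
```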
Another key source of data variance with resumes is the file type. Many resumes arrive as images or non-machine-readable PDFs. This means we need to generalize the input processing sequence to handle these the same way as machine-readable PDFs with a defined schema. This is common when building document processing systems, since you don’t have control over the data schema unless you integrate directly with APIs, which would defeat the purpose of needing a document processing system.
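As a rough illustration of that normalization step, the sketch below converts any upload into a list of page images before OCR. It assumes pdf2image (with poppler) and Pillow are available; the actual preprocessing in our pipeline differs in the details.

```python
# Rough sketch: normalize every upload (image, scanned PDF, machine-readable
# PDF) into a list of page images so the downstream positional OCR step sees
# a single input type.
from pathlib import Path

from pdf2image import convert_from_path  # requires poppler
from PIL import Image

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def to_page_images(path: str, dpi: int = 300) -> list:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Rasterize each PDF page, whether or not it has an embedded text layer.
        return convert_from_path(path, dpi=dpi)
    if suffix in IMAGE_SUFFIXES:
        return [Image.open(path).convert("RGB")]
    raise ValueError(f"Unsupported file type: {suffix}")
```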
As we’ll see below, we overcame this and reached SOTA accuracy in a challenging domain on a dataset with huge data variance.
The resume extraction tool focuses on 20 unique fields and entities. These fields can be split into three main categories based on what the extraction process looks like and how the data is recognized.
A great example of the last category is the “About me” section. We recognize the entire section as the field instead of specific entities. This allows us to grab the entire text and build relationships between columns and text based on position, content, and other fields. The same idea applies to the education and work experience sections, which have even more text to correlate, such as company name, job title, job length, and the long-form description.
As with most resume parsers, we want to use this pipeline in a real-time workflow. Users upload their resumes and the data is ingested to perform operations. This affects the model options, the infrastructure, and the managed services we can use. Many resume parsers nowadays try to leverage OpenAI, which is slow and parallelizes poorly.
Our pipeline uses custom-trained and custom-deployed models, which allows us to parallelize the API calls and use smaller models. Our runtime gets down to 1.7s on just 4 cores. Scaling to 8 or more cores, and adding GPUs to the equation, lets us easily get under a second.
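The sketch below shows the general pattern of fanning out calls to several small, self-hosted models in parallel, so total latency is roughly the slowest single call rather than the sum of all of them. The endpoint names and payload shape are placeholders, not our actual services.

```python
# Illustrative only: fan out the per-document model calls in parallel instead
# of one slow monolithic LLM request. Endpoints below are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import requests

MODEL_ENDPOINTS = {
    "layout": "http://models.internal/layout",      # hypothetical
    "entities": "http://models.internal/entities",  # hypothetical
    "sections": "http://models.internal/sections",  # hypothetical
}


def call_model(name, url, page_payload, timeout=5):
    resp = requests.post(url, json=page_payload, timeout=timeout)
    resp.raise_for_status()
    return name, resp.json()


def run_parallel(page_payload):
    # Each smaller model runs concurrently on its own thread.
    with ThreadPoolExecutor(max_workers=len(MODEL_ENDPOINTS)) as pool:
        futures = [pool.submit(call_model, name, url, page_payload)
                   for name, url in MODEL_ENDPOINTS.items()]
        return dict(f.result() for f in futures)
```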
One thing I often see during development of these document processing systems is customers choosing the wrong models and architecture during the POC and MVP stages, favoring accuracy through managed services. The goal should be to pick models that scale well and that you have full control over if you ever want to reach a product that can actually be used in real-time production settings. Many of these managed services can never get there due to constraints on runtime speed or flexibility. We then have to start over and replace these systems with new models that allow us to scale both speed and accuracy. Never choose models during POCs and MVPs that can’t be iterated on or don’t allow you to control deployment.
As always, the requirement was to make this system as accurate as possible. This resume parser will be used on inputs with wide data variance, including PDFs, images, scans, and other input types, and the extracted fields will be used in downstream operations for generating new resumes.
We also wanted to handle very large resumes with multiple pages without running into context or length issues. Poorly built GPT-based systems can hit both of these as documents grow in length and spill onto additional pages.
If this resume ran onto another page, the publications section would be split from its header, which would not repeat on the second page. Models that require chunking struggle with this very common case for resumes generated in Microsoft Word.
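One simple way to handle that page break, shown below as a hedged sketch, is to carry the last seen section header forward onto unlabeled lines at the top of the next page. The line/label structures are assumptions for illustration, not the production data model.

```python
# Hedged sketch: when a section (e.g. "Publications") runs onto the next page
# without repeating its header, attach the page's leading unlabeled lines to
# the last header seen on the previous page.

def assign_sections(pages):
    """pages: list of pages, each a list of (line_text, is_header) tuples."""
    sections = {}
    current = None
    for page in pages:
        for text, is_header in page:
            if is_header:
                current = text
                sections.setdefault(current, [])
            elif current is not None:
                # Lines before any header on page 2+ still belong to the
                # section that was open when the previous page ended.
                sections[current].append(text)
    return sections


pages = [
    [("Publications", True), ("Paper A, 2021", False)],
    [("Paper B, 2022", False), ("Skills", True), ("Python", False)],
]
print(assign_sections(pages))
# {'Publications': ['Paper A, 2021', 'Paper B, 2022'], 'Skills': ['Python']}
```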
Resumes have a ton of variance in how they are formatted. They are not like many legal or medical documents that come from a single source provider. Resumes come from many different sources, and many people simply design them in Word docs. This variance makes it harder to parse resumes and to understand the schema of the data we are interested in relative to the overall document layout.
We group our resumes based on the number of columns of data on the page. As the number of columns grows, it becomes harder to extract the information since there are more “groupings” of data to account for. Resumes with more columns are also less common and generally come from older resume formats; resumes with 4 or 5 columns account for less than 5% of the total.
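For illustration, a column count can be estimated from positional OCR output with a simple gap heuristic over word left edges, as in the sketch below. The gap threshold is an illustrative assumption, not a tuned production value.

```python
# Illustrative sketch: estimate how many columns a page has by scanning the
# sorted left edges of its word boxes for large horizontal gaps.

def estimate_column_count(word_boxes, gap_threshold=80):
    """word_boxes: iterable of (text, x0, y0, x1, y1); returns a column count."""
    left_edges = sorted(x0 for _, x0, _, _, _ in word_boxes)
    if not left_edges:
        return 0
    columns = 1
    for prev, curr in zip(left_edges, left_edges[1:]):
        # A large jump between consecutive left edges suggests a new column.
        if curr - prev > gap_threshold:
            columns += 1
    return columns
```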
Here’s an example of a two-column resume. Data flows downward, there is plenty of space between the columns, and it is easy to tell the sections apart. As you can imagine, as the number of columns increases, the space between them shrinks and it becomes harder to correlate specific titles with data. This is the same concept as tables of data in documents.
Given that resumes have an effectively infinite number of formats and schemas based on how someone wants to outline their work, the system needs an incredibly detailed understanding of how to correlate text based on position and semantics, so it can generalize enough to cover difficult formats.
Here’s an example of a 3-column resume. Systems like Textract + GPT struggle with this document because the OCR reads the text left to right and blends the columns together before it is passed to the LLM.
We built a custom document processing pipeline on top of our SOTA document processing framework, which we use as the base architecture for all our document processing use cases.
On our training set of 9,000 resumes, some provided by the client and some generated through our document augmentation pipeline, we achieved 94% accuracy across all fields and all resume formats. Our accuracy on the most common 1 and 2 column resumes reached 98% across all fields. We even reached 81% accuracy on the incredibly complex 3 and 4 column resumes that even GPT struggles with, since left-to-right text systems can’t be used when text from different sections bleeds together.
The accuracy on the most common formats was ridiculous: 98% across all fields on a very large sample size. By the end of fine-tuning, popular fields such as phone number, skills sections, and email reached 99%+. The overall accuracy across all 1 and 2 column resume formats was still over 94%.
The contact information fields in these noisy resumes had an accuracy of 93.2%. The 81% figure is dragged down by the two noisiest and least common fields of the 20, both of which are text sections rather than simple fields and require not only recognizing the section but also classifying the text inside it.
Real results on a text column in the resume that contains all the types of fields we want to identify. Our architecture recognizes the text section in yellow and correlates it to the columns in green. Inside each text section we can extract specific entities or keep it as free text. The contact text section contains all three types of text, and our solution can parse out the specific entities we care about whether they’re simple entities or entire sections of text. The positional OCR understands the difference between text in different sections and correctly correlates it to the various titles and surrounding text.
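The sketch below illustrates the two-stage idea in simplified form: recognize a section span first, then either keep it as free text or pull specific entities out of it. The section labels, regex patterns, and field configuration are hypothetical stand-ins for the trained models in the actual pipeline.

```python
# Simplified sketch of the two-stage idea: a recognized section span is either
# kept whole as free text (e.g. "About me") or mined for specific entities
# (e.g. email, phone). The patterns and labels below are illustrative only.
import re

FREE_TEXT_SECTIONS = {"about_me", "summary"}
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def parse_section(section_label, section_text):
    if section_label in FREE_TEXT_SECTIONS:
        # Keep the whole recognized span as a single free-text field.
        return {section_label: section_text.strip()}
    # Otherwise pull specific entities out of the recognized span.
    extracted = {}
    for field, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(section_text)
        if match:
            extracted[field] = match.group()
    return extracted


print(parse_section("contact", "Jane Doe\njane@example.com\n+1 555 010 2030"))
# {'email': 'jane@example.com', 'phone': '+1 555 010 2030'}
```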
Our custom document processing framework allows us to fine-tune the architecture for a wide range of use cases and documents and ship production-level systems in less time. This proven architecture has reached 90%+ accuracy on legal, resume, medical, and financial documents. Schedule a time today to chat about how you can get started with your own document processing system.