Automated Extraction and Classification of Unstructured College Course Data with GPT-3
How we helped EAS, an education SaaS customer of Pumice.ai, turn raw college course information into structured database fields and searchable keywords with over 91% extraction accuracy.
EAS is an education industry SaaS company that needed a way to automatically process raw, unstructured college course information into structured database fields used throughout their tool.
The format of college course names, school codes, course levels, and other fields varies heavily across universities in the United States, and the raw data arrives with no standardized order or relationships that rule-based parsing could rely on. On top of that, we wanted to extract key topics discussed in each class from the raw descriptions, so that search engines and other tools could be built on those keywords.
The extremely high variance in university course information was the most difficult challenge to address when building our extraction tool. The other hard problem we solved was producing contextually similar keywords without raising the temperature parameter of the GPT-3 model. Because the exact keywords are extracted by the same model, we had to balance the temperature so that our near-argmax extraction stays faithful to the source text while still giving the model enough freedom to generate the higher-order keywords.
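As a rough illustration of this trade-off, the sketch below shows what a single low-temperature completion call could look like using the legacy (pre-1.0) OpenAI completions endpoint. The model name, prompt layout, field names, and temperature value are assumptions for illustration, not the exact production settings or fine-tuned model.

```python
# Minimal sketch only: the prompt, model name, and temperature are illustrative
# assumptions, not the client's production configuration.
import openai  # legacy (pre-1.0) openai Python package

openai.api_key = "YOUR_API_KEY"  # placeholder

PROMPT_TEMPLATE = """Extract the following from the raw course record.
Return JSON with keys: course_name, school_code, course_level, keywords.

Raw record:
{raw_record}

JSON:"""

def extract_course_fields(raw_record: str) -> str:
    """Single completion that must copy exact entities verbatim while still
    generating higher-order keywords, so temperature is low but nonzero."""
    response = openai.Completion.create(
        model="text-davinci-002",  # assumed base model, not the client's fine-tune
        prompt=PROMPT_TEMPLATE.format(raw_record=raw_record),
        temperature=0.3,  # low enough for faithful extraction, high enough for keyword variety
        max_tokens=256,
    )
    return response["choices"][0]["text"]
```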
We built a custom NLP software pipeline that takes in these raw inputs, passes them through our custom GPT-3 model, and writes each data point into the correct database column with the correct classification.
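A minimal sketch of how such a pipeline could be wired together is shown below, reusing the hypothetical extract_course_fields helper from the previous snippet; the JSON contract, column names, and SQLite table are stand-ins for the client's actual schema.

```python
# Hypothetical pipeline sketch: field names, table layout, and the
# extract_course_fields() helper from the previous snippet are assumptions.
import json
import sqlite3

def run_pipeline(raw_record: str, conn: sqlite3.Connection) -> None:
    """Raw course record -> GPT-3 extraction -> typed database columns."""
    raw_output = extract_course_fields(raw_record)  # GPT-3 completion from earlier sketch
    fields = json.loads(raw_output)                 # model is prompted to return JSON

    conn.execute(
        """INSERT INTO courses (course_name, school_code, course_level, keywords)
           VALUES (?, ?, ?, ?)""",
        (
            fields.get("course_name"),
            fields.get("school_code"),
            fields.get("course_level"),
            ", ".join(fields.get("keywords", [])),  # keywords stored as a flat string here
        ),
    )
    conn.commit()
```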
This NLP pipeline significantly reduces the workload needed to process the incoming data, and the extracted keywords let the SaaS product power search engines and data clustering on top of it. An extremely time-consuming and important process is now completely automated for the client.
Our GPT-3 model achieves over 91% accuracy when extracting entities from the raw text. On keyword extraction, it is competitive with the accuracy of the small spaCy model, which is impressive given that the GPT-3 tool relies on few-shot learning. The entire pipeline runs in under 4 seconds.
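For context, a simple keyword baseline built on spaCy's small English model (en_core_web_sm) might look like the following sketch; it illustrates the kind of baseline being compared against rather than the evaluation code used in this project.

```python
# Illustrative baseline only: a noun-chunk keyword extractor on spaCy's small
# English model; not the evaluation harness used in the project.
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_keywords(description: str, top_n: int = 10) -> list[str]:
    """Return up to top_n candidate keywords from a course description."""
    doc = nlp(description)
    seen, keywords = set(), []
    for chunk in doc.noun_chunks:
        text = chunk.text.lower().strip()
        if text not in seen and not chunk.root.is_stop:
            seen.add(text)
            keywords.append(text)
    return keywords[:top_n]
```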