An asset management company with offices around the globe wanted an NLP-based software solution to extract the key business topics discussed in financial interviews about a company's growth, health, size, and more. The topics are generated from the general ideas discussed and the few most important business points made during the one-on-one dialog. Interview transcripts are passed into our software, and the models produce a list of 4-7 key business topics that a reader would be interested in learning about.
These interviews can be extremely long, with some running 20+ pages, which makes it difficult to condense long sections of dialog into a single key topic. The interview format adds its own difficulty: in back-and-forth Q&A, a single topic is often spread across several questions and answers, and an NLP model can struggle to recognize that those exchanges are contextually related. In addition, the main keywords in any given chunk of these interviews are highly similar (same companies, same topics, same financial points), so a keyword-based model cannot tell one part of the text from another well enough to summarize it.
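One common way to handle transcripts this long is to split them into smaller chunks along speaker turns, so that no question or answer is cut in half. The sketch below illustrates that idea only; the function name `chunk_transcript`, the word-based budget, and the default of 600 words are assumptions made for illustration, not details of the pipeline described here.

```python
from typing import List


def chunk_transcript(turns: List[str], max_words: int = 600) -> List[str]:
    """Group consecutive dialog turns into chunks of roughly max_words words.

    A turn (one question or one answer) is never split. A single turn that
    is longer than max_words simply becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for turn in turns:
        n = len(turn.split())
        # Close the current chunk if adding this turn would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(turn)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Splitting on turns rather than on a fixed character count keeps each question next to its answer wherever the budget allows, which matters for the Q&A format described above.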
One of the known weaknesses of language generation models is handling proper nouns and linking them to other entities mentioned in the input. Using contextual information to work out what a proper noun actually refers to is also tricky, because models tend to lean on generalizations when deciding. For instance, COVID will often be treated as a software product or company if the keyword appears in dialog discussing a product or company.
We chose GPT-3 for this task and custom-built and optimized our input prompts based on output results from a large, high-variance dataset. We tuned the model's hyperparameters so that it generates key topics with an understanding of what was discussed in smaller parts of the dialog, and we set the temperature parameter so that the model has room to compose its own sentences rather than act as a pure extraction tool.
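As a rough illustration of how a prompt and a temperature setting fit together, the sketch below builds a few-shot prompt from example (excerpt, topic) pairs and passes it to a completion function. Everything in it is hypothetical: the instruction wording, the names `GenerationSettings`, `build_prompt`, and `topic_for_chunk`, and the default values are not the prompts or settings used in this project. The `complete` callable stands in for whatever GPT-3 client call is used, so no specific API signature is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GenerationSettings:
    # Hypothetical defaults. A higher temperature lets the model phrase
    # topics in its own words; a value near 0 pushes it toward copying.
    temperature: float = 0.7
    max_tokens: int = 64


def build_prompt(examples: List[Tuple[str, str]], chunk: str) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, new excerpt."""
    parts = ["Identify the key business topic discussed in each interview excerpt.", ""]
    for excerpt, topic in examples:
        parts += [f"Excerpt:\n{excerpt}", f"Topic: {topic}", ""]
    # The prompt ends with an open "Topic:" line for the model to complete.
    parts += [f"Excerpt:\n{chunk}", "Topic:"]
    return "\n".join(parts)


def topic_for_chunk(
    chunk: str,
    examples: List[Tuple[str, str]],
    complete: Callable[[str, GenerationSettings], str],
    settings: Optional[GenerationSettings] = None,
) -> str:
    """Generate one topic for a chunk using an injected completion function."""
    settings = settings or GenerationSettings()
    return complete(build_prompt(examples, chunk), settings).strip()
```

Injecting `complete` keeps the prompt logic testable without network access, and makes it easy to compare outputs across temperature values on the same chunk.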
Our software pipeline takes advantage of state-of-the-art GPT-3 APIs and resources, combined with our knowledge of recent breakthroughs in optimizing and tuning generative models for specific use cases. Although all of the interviews come from a similar domain, they cover a wide range of topics, so our training examples must allow for some level of generalization. Our understanding of GPT-3 prompt optimization let us account for a high level of variance in the training set and build a system that can handle even more as the training set grows with new examples.
The end result is a powerful NLP-based software pipeline that produces 4-7 key topics from a long-form interview dialog.
Interested in seeing how GPT-3 and other NLP tools can be used in your industry? Let’s talk - Contact Us