Matching product data by UPC code lets retailers, warehouses, and ecommerce companies automate the comparison of two data entries for similarity. A UPC matching system can automate the processing of product orders, vendor agreements, barcode scanning, and inventory management with very little input data.
Product UPC codes come in a number of different formats depending on industry, country, and source, which makes it difficult to build data matching products that cover every use case with high accuracy. On top of that, many of these UPC codes come from sources such as web scraping or OCR, which can introduce a high level of noise or data loss.
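To make the format problem concrete, here is a minimal sketch of the standard GS1 check-digit calculation that UPC-A (12 digits) and EAN-13 (13 digits) codes share. The function names are our own for illustration and are not part of the Width.ai system; the sketch only shows how a clean code can be validated and padded to a common width before any comparison.

```python
def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """True if code is 8, 12, 13, or 14 ASCII digits with a correct check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


def to_gtin14(code: str) -> str:
    """Left-pad with zeros so UPC-A and EAN-13 codes share one width."""
    return code.zfill(14)


print(is_valid_gtin("036000291452"))   # UPC-A example
print(is_valid_gtin("4006381333931"))  # EAN-13 example
print(to_gtin14("036000291452"))
```

A check like this only works when every digit survives extraction. It rejects codes with dropped or extra characters outright, which is exactly the situation that noisy sources create.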
Being able to compare UPC codes between products across sources is a huge advantage that gives a new angle to product matching. Product matching often requires several different data points to line up, as well as relatively low noise from each source. When we have to use OCR or similar methods, the share of data lost during text extraction is high enough to affect most text matching tools. Those tools then need constant updates and revisions to account for new noise and variations from sources.
Width.ai built an NLP-based solution that learns the deep relationship between different formats of UPC codes with varying levels of noise. The architecture supports both exact and partial UPC codes and allows for matching between the two. Poorly or incorrectly formatted UPC codes are supported as well and are still matched with high accuracy. UPC codes from multiple industries and sources were used during training, which makes the accuracy far more resilient over the long term.
Not only does the model tell you whether two UPC codes are a match, it also tells you why they are or are not a match. The model generates its reasoning for the criteria that led to the decision. While the model explains itself fairly well, there are instances where its reasoning does not fit the use case or explains the wrong type of match. We consider this an added benefit that is separate from our confidence metrics, although it has been shown to give nearly an 8% boost in accuracy.
The model understands what information is different between the two UPC codes and if removing it leads to the same UPC code. The ability to reason is shown in full effect below where the relationship between partial and full UPC codes is understood.
It goes without saying, but UPC codes that are different come back as a no match, and the model adds reasoning that makes sense.
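For contrast with the learned model described above, the match, partial match, and no match outcomes can be approximated by a small hand-written baseline. This is not the Width.ai architecture; it is a rule-based sketch under our own assumptions (separators can be stripped, leading zeros are padding, and a hypothetical minimum of six shared digits is enough for a partial match). It returns a decision together with a short reason string.

```python
import re

# Hypothetical threshold: shortest fragment we accept as a partial match.
MIN_PARTIAL_DIGITS = 6


def digits_only(raw: str) -> str:
    """Drop spaces, hyphens, and any other non-digit characters."""
    return re.sub(r"\D", "", raw)


def compare_upcs(a: str, b: str) -> tuple[bool, str]:
    """Return (is_match, reason) for two raw UPC strings."""
    da, db = digits_only(a), digits_only(b)
    if not da or not db:
        return False, "at least one input contains no digits"
    if da == db:
        return True, "identical digits after removing separators"
    # UPC-A vs EAN-13 style differences often come down to zero padding.
    ta, tb = da.lstrip("0"), db.lstrip("0")
    if ta == tb:
        return True, "same digits apart from leading-zero padding"
    short, long_ = sorted((ta, tb), key=len)
    if len(short) >= MIN_PARTIAL_DIGITS and short in long_:
        return True, "shorter code is a contiguous part of the longer code"
    return False, "digits differ beyond padding or truncation"


print(compare_upcs("0 36000 29145 2", "036000291452"))
print(compare_upcs("3600029145", "036000291452"))
print(compare_upcs("036000291452", "4006381333931"))
```

Rules like these are brittle: a single misread digit inside the code turns a true match into a no match, which is the gap a learned model is meant to close.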
The model is already pre-trained on a number of different use cases, which means very little fine-tuning is required as new users onboard and start using the NLP architecture in production. Even with multiple large language models being used, the runtime per comparison on a single instance is less than
As mentioned before, common methods of extracting UPC codes such as OCR and web scraping often lead to fields running into each other or to lines on a product label being skipped. This can produce more noise than fuzzy matchers can overcome. Handling this noise is by far the most valuable part of our deep learning based solution. Here’s an example where the product title field was accidentally added to the UPC field because of an OCR error. This would be a nightmare for other data matching solutions.
No issue for our NLP architecture. The learned relationship between UPC codes is too strong.
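To illustrate what a non-learned approach has to do with a merged field like this, here is a minimal sketch that pulls candidate codes out of noisy text by scanning digit runs and keeping 12 or 13 digit windows whose GS1 check digit is valid. The function names and the sample strings are hypothetical, and this is a heuristic stand-in, not the method the NLP architecture uses.

```python
import re


def check_digit_ok(code: str) -> bool:
    """Validate the GS1 check digit (last digit) of an all-digit string."""
    body, check = code[:-1], int(code[-1])
    total = sum(int(ch) * (3 if i % 2 == 0 else 1)
                for i, ch in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


def upc_candidates(field: str, lengths=(12, 13)) -> list[str]:
    """Return digit windows in a noisy field that look like valid codes."""
    found = []
    # Digit runs, allowing a single space or hyphen between digit groups.
    for run in re.findall(r"\d(?:[ \-]?\d)+", field):
        digits = re.sub(r"\D", "", run)
        for n in lengths:
            for start in range(len(digits) - n + 1):
                window = digits[start:start + n]
                if check_digit_ok(window) and window not in found:
                    found.append(window)
    return found


# Hypothetical OCR output where a product title ran into the UPC field.
print(upc_candidates("Brand X Cola 12 Pack 0 36000 29145 2"))
print(upc_candidates("Cola036000291452"))
```

Because a random digit window passes the check digit about one time in ten, a heuristic like this can return false candidates on long digit runs, and it fails completely when OCR drops or alters a digit.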
Want to learn more about product matching or other data matching use cases? We’ve built a number of product matching solutions and offer all levels of customization to fit your use case. Take a look at these solutions:
https://www.width.ai/post/data-matching-software
https://www.width.ai/post/product-matching-in-ecommerce
https://www.width.ai/post/product-recognition
https://www.width.ai/post/convert-unstructured-text-to-structured-data
Interested in seeing how GPT-3 and other NLP tools can be used in your industry? Let’s talk - Contact Us