92% Accuracy on 4-Level-Deep Product Categorization with a Multilingual Dataset for a Wholesale Marketplace
How we reached 97% accuracy on top-level categories and 92% on bottom-level categories with a multilingual dataset for a customer of Pumice.ai.
Rokt, a digital marketing and ecommerce technology company, was looking for an automated way to categorize the products used in their customers’ ads and in their purchase optimization platform. Categorization lets them provide richer analytics around the ads, such as understanding performance at a category level instead of only at a specific product level. They were using a legacy categorization system based on keywords and fuzzy matching, and could not improve its accuracy past a certain threshold.
Rokt knew they needed a more accurate system and, given their scale, could not rely on manual categorization. They were able to leverage our product categorization API service, fine-tune it on their specific product data, and completely automate the process. We also went through a number of data cleanup stages to get the data into a better format for training. We’ll talk about the dataset issues in the solution outline.
When ad placement or digital marketing companies run ads for customers, they collect analytics on how each specific ad and product performed. Upsell ads are also placed on pages for other products (see image). While this data is extremely helpful for product-level performance, it can be even more valuable to see category-level performance for a single vendor or across all vendors. Category-level data also helps you choose future ads, because you can see the full set of comparable options within a category. Some vendors have a taxonomy mapping on their site, but it isn’t standardized, so mapping products to categories across storefronts requires manual work.
If you’re relying on a keyword-based or fuzzy matching system, factors like language differences, off-brand names, missing information, and vendor variations can severely disrupt accurate categorization. Additionally, these systems aren’t built on predictive models, so they cannot produce model-based confidence scores, which means you can’t flag products with close matches for manual review.
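To make the confidence-score point concrete, here is a minimal sketch of how a predictive model's per-category scores could be used to flag close matches for manual review. The function name `flag_for_review`, the thresholds, and the example scores are all illustrative assumptions, not part of the Pumice.ai API.

```python
def flag_for_review(scores, min_confidence=0.80, min_margin=0.15):
    """Pick the top category and decide whether a human should check it.

    `scores` maps category name -> model probability (hypothetical input).
    The two thresholds are illustrative defaults, not tuned values.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    # Flag when the model is unsure overall, or when two categories are close.
    needs_review = (
        best_score < min_confidence
        or (best_score - runner_up_score) < min_margin
    )
    return best_label, needs_review


close_call = flag_for_review({"Saucepans": 0.52, "Frying Pans": 0.45, "Bakeware": 0.03})
clear_call = flag_for_review({"Saucepans": 0.95, "Frying Pans": 0.04, "Bakeware": 0.01})
```

A keyword or fuzzy matching system has no equivalent of `scores`, which is why this kind of routing is only possible once a predictive model is in place.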
The customer needed a better solution to support their growing business, with new vendors and new product categories they wanted to expand into. Their current keyword-based solution wasn’t working well and was in no way set up to support new, larger vendors from Germany. The complexity of products and product attributes in the manufacturing and industrial space makes these products even harder to categorize. Vendors also expect catalogs to be up and online quickly, which rules out manual approaches.
The customer also needed higher-accuracy results and a way to standardize these products to a single taxonomy that covers a huge range of categories. For these types of use cases, 10 million products per month is a fairly standard volume, so manually reviewing products is a non-starter. The data variance is also generally high, which helps explain why keyword-based and fuzzy matching systems struggle.
A few key specifics:
As we’ll see below, the number of data records actually used was much smaller.
Our AI-powered PIM automation platform, Pumice.ai, provides a taxonomy-based categorization API driven by a state-of-the-art categorization framework. We begin with a foundational categorization model, which is then fine-tuned using customer-specific taxonomies and data, enabling the model to adapt to each customer’s unique categorization requirements. Since taxonomies and product data formats vary across organizations, we always perform this fine-tuning on customer data and deploy a specialized model for each use case. This tailored approach guarantees at least 90% accuracy on the customized dataset.
The customer's dataset was more than large enough for us to fine-tune our model, and it was already categorized, so the labels could be used for training as provided. When we dove into the data, however, we found some issues that needed to be addressed. Although the dataset had 3.3 million product records, only 1.3 million were actually unique. The duplicates were not the same product with a different name or SKU; they were the exact same record. We removed them from our training set, which reduced our workable set to 1.3 million records. The other issue was that the dataset was incredibly imbalanced, with a large majority of products falling into only 25 categories. This kind of imbalance can cause inadequate feature representation, poor minority-class recall, and misleading accuracy metrics.
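The two checks described above, removing exact duplicate records and measuring how concentrated the data is in a few categories, can be sketched as follows. This is an illustration under assumed field names (`title`, `description`, `category`) and toy records, not the cleanup pipeline actually used.

```python
from collections import Counter


def drop_exact_duplicates(records):
    """Keep the first copy of each record whose fields all match exactly.

    Records are dicts with hashable (e.g. string) values. Two products that
    differ in any field, such as name or SKU, are both kept.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def category_share(records, top_n=25):
    """Fraction of records that fall in the top_n most frequent categories."""
    counts = Counter(record["category"] for record in records)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(records) if records else 0.0


# Toy records with hypothetical field names.
shirt = {"title": "Blue Shirt", "description": "Cotton",
         "category": "Apparel & Accessories > Clothing"}
records = [
    shirt,
    dict(shirt),  # exact duplicate: removed
    {**shirt, "description": "Linen"},  # differs in one field: kept
    {"title": "Dog Bowl", "description": "Steel",
     "category": "Animals & Pet Supplies"},
]
unique = drop_exact_duplicates(records)
share_top_1 = category_share(unique, top_n=1)
```

A `category_share` value close to 1.0 for a small `top_n` is a quick signal of the kind of imbalance described here.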
With every fine-tuning cycle, we create an evaluation sheet that provides a breakdown of accuracy metrics at the category level. This serves to keep the customer informed on our progress, allowing them to follow along as we fine-tune and understand our position, while also enabling us to make smarter, data-driven improvements. These evaluation sheets reveal valuable insights about where the model encounters issues, helping us make targeted adjustments in the next iteration to resolve these challenges. Commonly, one such challenge is a lack of sufficient data for certain categories, which leads us to collaborate with the customer to develop a strategy for handling these gaps. Additionally, these sheets allow us to assess the impact of specific fields on categorization performance. Sometimes, customers supply extra fields from vendors—such as vendor categories or IDs—that they believe might enhance the model, though these can sometimes impair its accuracy. Our approach is to gather as much data from the customer as possible, using evaluation metrics to inform our decisions. This process was especially relevant for this customer, who had the added complexity of many products with identical titles and descriptions.
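As a rough illustration of what one tab of such an evaluation sheet might contain, the sketch below computes a per-category breakdown from true and predicted labels. The function name `evaluation_sheet` and the row layout are assumptions for this example; the real sheets may include more metrics.

```python
from collections import defaultdict


def evaluation_sheet(true_labels, predicted_labels):
    """Return rows of (category, product_count, accuracy), worst category first.

    Accuracy here is the share of products whose true category is `category`
    and that the model labeled correctly (per-category recall).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    rows = [(cat, totals[cat], correct[cat] / totals[cat]) for cat in totals]
    # Sort by accuracy, then name, so problem categories surface at the top.
    return sorted(rows, key=lambda row: (row[2], row[0]))


truth = ["Media", "Media", "Arts & Entertainment", "Arts & Entertainment"]
predicted = ["Media", "Media", "Arts & Entertainment", "Media"]
sheet = evaluation_sheet(truth, predicted)
```

Keeping the product count next to each accuracy figure makes it easier to tell a low-data category apart from one that is hard to classify.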
We can see that even some top-level and second-level categories have very few products. This is completely different from the Apparel & Accessories > Clothing category, which alone has 713k products.
We also review semantic metrics to understand, at a word level, how strongly specific words in the product data correlate with the given categories. This helps us greatly when reviewing larger categories that still have lower accuracy, as we can see if there is an imbalance of language inside the category. For instance, if 80% of the products in a category are extremely similar, and the other 20% are very different or have high data variance within that subset, we can understand why the accuracy struggles.
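One simple way to approximate this kind of word-level signal is to measure, for each word, what share of the products containing it belong to a given category. The sketch below is a simplified stand-in for illustration only; the function name, the whitespace tokenization, and the toy titles are assumptions, and the metrics we actually review may differ.

```python
from collections import Counter


def word_category_share(records, category):
    """For each word seen in `category`, the share of products containing
    that word that belong to `category`.

    `records` is a list of (title, category) pairs. Words are lowercased
    and split on whitespace, which is a deliberately naive tokenizer.
    """
    in_category = Counter()
    overall = Counter()
    for title, label in records:
        for word in set(title.lower().split()):
            overall[word] += 1
            if label == category:
                in_category[word] += 1
    return {word: in_category[word] / overall[word] for word in in_category}


records = [
    ("Steel Saucepan", "Saucepans"),
    ("Steel Frying Pan", "Frying Pans"),
    ("Copper Saucepan", "Saucepans"),
]
shares = word_category_share(records, "Saucepans")
```

Words with a share near 1.0 are strong indicators of the category, while words shared evenly with sibling categories (like "steel" here) carry little signal.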
We were able to achieve exactly 97% accuracy on both the top level and the bottom-most level of categories on our test dataset. We split the dataset 80/20 and limited some classes to fewer products than were provided to reduce the class imbalance. As usual, classes with very low product counts are generally lower in accuracy. This happens specifically when the product count is low and the data variance within that small set is high. Think 10 products in a category, all with different SKUs. Here’s a subset of the test set we looked at.
We can see that “Media” and “Cameras & Optics” both end up with 100% accuracy, most likely because the training data for these categories has zero data variance, whereas “Arts & Entertainment” ends up at 83% while having a similar product count. “Animals & Pet Supplies” is an interesting one, with a relatively low number of products but extremely high accuracy, likely because our model has already been trained on a large number of pet supplies products. As the customer rolls out the Pumice categorization API to production, we will collect data and further fine-tune the model.
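The split described above can be sketched as a per-class cap followed by an 80/20 split within each class. The cap value, the per-class (stratified) splitting, and the order of capping before splitting are assumptions made for this example; the exact procedure and limits used are not specified here.

```python
import random
from collections import defaultdict


def cap_and_split(records, max_per_class, test_fraction=0.2, seed=0):
    """Cap each category at max_per_class records, then split each category
    into train and test so both sets keep the same class proportions.

    `records` is a list of dicts with a "category" key (hypothetical schema).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record["category"]].append(record)

    train, test = [], []
    for category in sorted(by_class):
        items = by_class[category]
        rng.shuffle(items)               # random sample within the class
        items = items[:max_per_class]    # limit over-represented classes
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy data: one dominant class and one small class.
records = (
    [{"title": f"shirt {i}", "category": "Clothing"} for i in range(100)]
    + [{"title": f"lens {i}", "category": "Cameras & Optics"} for i in range(10)]
)
train, test = cap_and_split(records, max_per_class=50)
```

With the cap at 50, the dominant class goes from 10x the small class to 5x, and both classes still contribute 20% of their records to the test set.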
When we look at categories that are more closely related, we can really see the model's ability. As categories get more similar to each other, we can expect their products to get more semantically similar, for example “Saucepans” vs “Frying Pans”. It becomes harder to differentiate categories as we go down the tree, so our ability to hold the same accuracy number as we move down shows a strong fit between data and model. One key note: accessories categories are some of the hardest categories to get high accuracy in. This is mostly because they have the highest data variance (lots of different accessories) and are similar to the mainline category. Our accessories categories did very well and were not a drag on the overall accuracy.
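To show how accuracy can be tracked as we move down the tree, the sketch below scores predictions at each taxonomy depth, counting a prediction as correct at depth d only if its first d path segments match the true path. The ` > ` separator follows the category notation used earlier; the function name and the example paths are hypothetical.

```python
def accuracy_by_level(true_paths, predicted_paths, separator=" > "):
    """Return {depth: accuracy} for every depth present in the true paths.

    Products whose true category is shallower than a given depth are left
    out of that depth's total. Expects a non-empty list of true paths.
    """
    max_depth = max(len(path.split(separator)) for path in true_paths)
    results = {}
    for depth in range(1, max_depth + 1):
        total = correct = 0
        for truth, prediction in zip(true_paths, predicted_paths):
            truth_parts = truth.split(separator)
            if len(truth_parts) < depth:
                continue  # true category does not reach this depth
            total += 1
            if prediction.split(separator)[:depth] == truth_parts[:depth]:
                correct += 1
        results[depth] = correct / total
    return results


# Hypothetical 4-level and 2-level paths for illustration.
truth = [
    "Home & Garden > Kitchen > Cookware > Saucepans",
    "Home & Garden > Kitchen > Cookware > Frying Pans",
    "Animals & Pet Supplies > Pet Supplies",
]
predicted = [
    "Home & Garden > Kitchen > Cookware > Saucepans",
    "Home & Garden > Kitchen > Cookware > Saucepans",
    "Animals & Pet Supplies > Pet Supplies",
]
levels = accuracy_by_level(truth, predicted)
```

In this toy example the model is perfect down to level 3 and only loses accuracy at the leaf, where “Saucepans” and “Frying Pans” have to be told apart.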
The Pumice.ai architecture supports up to 50 million product records per day, with built-in storage of records for fine-tuning and future optimization. This made it easy for us to further fine-tune on the error set and on the categories with little product data, as well as to incorporate new fields.
While this customer didn’t have product images, our categorization architecture supports images as a data field for categorization. We strongly recommend providing images, as they can be a key field in many use cases. Not all use cases improve with images, so when images are provided we split test the results with text only vs text + image.
Our product allows you to get started with a production version of the model in weeks, not months. It supports any taxonomy, any amount of existing product data, and any number of product fields. Hundreds of ecommerce companies, marketplaces, and analytics platforms already use Pumice for their categorization. Reach out via the “Get In Touch” page for more information, and include “AGPT” in your message for free implementation.