Width.ai created a new retail product image classification model that outperforms the SOTA results from CLIP and Fashion CLIP on the most popular dataset in the domain. These models are commonly used in product matching use cases where photos are taken with lower resolution and zero control for noise and image angles.
If you are unfamiliar with product matching on retail shelves, I highly recommend starting with this product matching architecture blog post to understand the entire pipeline and why we use a similarity-based approach for SKU image matching instead of a class-based system.
I also recommend taking a look at the write-up of our previous version of this product SKU image classification model, which does a great job of outlining the process for reaching these benchmarks. That workflow is where we went from CLIP to Fashion CLIP, with customizations to the architecture to fit this domain.
Let’s start with a bit of an introduction to the key pieces of the equation.
Fine-tuning in machine learning means slightly adjusting the parameters of a pre-trained model so that it performs well on a new, related task.
The CLIP model, known for its capacity to comprehend and link text and images, is trained on a vast internet corpus. However, its generalized training may not fully equip it to handle specific or specialized content. To maximize CLIP's potential for a particular task or domain, fine-tuning is essential.
Fine-tuning is a pretty standardized part of building product matching systems. Most use cases have unique variables such as setting, camera quality, and product count that create variations of how to accomplish the task. The goal of fine-tuning in this specific part of product matching is to better match the cropped products in the environment to product records in the database. This database has the classes of SKUs we know exist.
The dataset we focused on optimizing for is RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification. It contains more than 500,000 images of retail products on shelves belonging to 2000 different products.
This dataset is a great fit for product matching use cases because its images come from a high-noise environment and are cropped directly from the shelf. Most of the datasets companies use for training and testing their product matching systems come from Open Food Facts, whose product images are stale and have very little noise. While these are pictures of the products you want to recognize, they do not look like products photographed on the shelf in terms of angle, size, and brightness. You might see very high accuracy on an Open Food Facts dataset, then move to a real product use case and watch your accuracy evaporate.
You can see there is a ton of variation in color, angle, camera quality, blur, and other features that will show up in real retail shelf recognition. Logos are much easier to see in simple handheld product photos or ecommerce product images. We used to add this kind of noise to clean training images through augmentation, so that the training data looked more like the images that are actually compared for similarity. Here’s an example of an image used on Open Food Facts, and it’s clearly cleaner than what would be seen on a shelf.
Evaluating our results on the RP2K dataset gives a much better representation of how product matching performs in the real world.
Before we get into the details of our new model's results, let's look at the results of the most common models used for product similarity in the product matching use case. We want to see how well CLIP and Fashion CLIP find matching SKUs in the database. Take a look at the chart below to see how they perform on this task.
We can clearly see that the kind of images found in the RP2K dataset, and in real product matching use cases, are very challenging for both the baseline CLIP model and the fine-tuned Fashion CLIP model. In our evaluation, CLIP reached 41% Top-1 accuracy while Fashion CLIP reached 50%. This means that when we provide a cropped product image as input, the correct SKU is returned as the top result 41% and 50% of the time, respectively. As you can imagine, these numbers go up as we widen the criterion to the correct result appearing in the top 3 or top 5 returned results. That said, it's concerning that the correct result appears in the top 3 results under 65% of the time with either model, and CLIP never moves past 60%.
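1. Creating Embeddings for Training Images: the model produces an embedding for every image in the training set, which serves as the reference database.
2. Computing Embeddings for Test Images: the same model produces an embedding for every image in the test set.
3. Calculating Cosine Distances: each test embedding is compared against the training embeddings using cosine distance.
4. Finding the Nearest Neighbor: for each test image, the training image with the smallest cosine distance is selected.
5. Comparing Class Labels: the class label of that nearest training image is compared to the true label of the test image.
6. Evaluating Correctness: a prediction counts as correct when the two labels match, and accuracy is the share of test images predicted correctly.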
In summary, this evaluation method assesses the performance of a model by checking how well it can recognize and classify images in a test dataset based on the similarity of their embeddings to those of the training dataset. It measures accuracy by comparing the predicted labels to the true labels of the test images. This approach is often used in tasks like image retrieval or image classification to evaluate the quality of a model's representations and its ability to generalize to new data.
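The six steps above can be sketched in a few lines. This is a simplified illustration with hand-written 2-D vectors standing in for real model embeddings; the function names, labels, and values are assumptions made for the example, and a real pipeline would use a vectorized library rather than pure Python loops.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top1_accuracy(train_embeddings, train_labels, test_embeddings, test_labels):
    """Nearest-neighbor Top-1 accuracy of test embeddings against a training set."""
    correct = 0
    for query, true_label in zip(test_embeddings, test_labels):
        # Step 3: distance from this test image to every training image.
        distances = [cosine_distance(query, ref) for ref in train_embeddings]
        # Step 4: index of the closest training image.
        nearest = min(range(len(distances)), key=distances.__getitem__)
        # Steps 5 and 6: compare labels and count matches.
        if train_labels[nearest] == true_label:
            correct += 1
    return correct / len(test_labels)


# Steps 1 and 2 would normally come from the embedding model; these toy
# vectors are hypothetical stand-ins.
train_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
train_labels = ["sku_a", "sku_b", "sku_c"]
test_embeddings = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2], [0.6, 0.8]]
test_labels = ["sku_a", "sku_b", "sku_c", "sku_c"]

accuracy = top1_accuracy(train_embeddings, train_labels, test_embeddings, test_labels)
```

Here the third test vector sits closest to the `sku_a` reference even though its true label is `sku_c`, so three of the four queries are counted as correct.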
These embedding models can sometimes give the illusion of excellent performance if the same image is used both during the initial embedding calculation and in a subsequent reverse image search. In such cases, there would often be a 100% match, making it challenging to accurately assess the model's true capabilities.
Fortunately, the dataset we selected for evaluation has a balanced distribution of classes, approximately 2,388 classes in both the training and test sets. Moreover, on average, there are five different image samples available for each class. This balanced and diverse dataset helps us avoid the issue of overestimating the model's performance due to exact image matches during testing, allowing for a more reliable evaluation.
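One simple guard against the exact-match problem is to check that no test image is byte-identical to a training image before running the evaluation. The sketch below is an assumption about how such a check could look, not a description of our pipeline; the helper names and the in-memory "files" are hypothetical.

```python
import hashlib


def content_hash(image_bytes):
    """SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaked_images(train_images, test_images):
    """Return names of test images whose bytes also appear in the training set."""
    train_hashes = {content_hash(data) for data in train_images.values()}
    return sorted(
        name
        for name, data in test_images.items()
        if content_hash(data) in train_hashes
    )


# Hypothetical in-memory stand-ins for image files (name -> raw bytes).
train_images = {
    "train/cola_1.jpg": b"\x01\x02\x03",
    "train/soda_1.jpg": b"\x04\x05",
}
test_images = {
    "test/cola_9.jpg": b"\x01\x02\x03",  # same bytes as train/cola_1.jpg
    "test/soda_7.jpg": b"\x09\x09",
}

leaked = find_leaked_images(train_images, test_images)
```

This only catches exact byte-for-byte duplicates. Resized or re-encoded copies of the same photo would need a perceptual hash or an embedding-distance threshold instead.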
Our brand new model reached 89% Top-1 retrieval accuracy, compared to the 41% and 50% above. This comes from breakthroughs in how we utilize the weights of the model and how we set up our hyperparameters. Because we can train this architecture without large amounts of data augmentation and evaluate it on a dataset that fits the real-world use case, it is easier for us to keep improving this accuracy than it is with the other two models.
The new model has a much deeper understanding of products in a retail environment, where the images are not clean and the quality is not always high. This is the point where most product matching systems fall apart if they don’t already have enough data to train on from real-world image collection. This model gives us an elite starting point that only improves from there.
Width.ai builds custom product matching systems used in retail environments to recognize and match SKUs, products, and other items. We’ve reached 90%+ accuracy in a ton of domains and have scaled these systems up to 3.5 million SKUs. We’d love to chat about your product matching or warehouse automation use case!