92.44% Top-1 Product Image Similarity Matching with a Fine-Tuned CLIP Pipeline
How we reached 92.44% Top-1 and 99.3% Top-3 accuracy matching input product images to the correct SKU in a database of 10 million plus products in Pumice.ai.
Image similarity matching for tasks like product matching is popular in ecommerce SKU management, product recognition in retail settings, and seller onboarding. These use cases come up constantly in our PIM automation platform Pumice.ai, where input products are image heavy rather than text heavy. They traditionally grow in difficulty as the number of SKUs increases and the granularity of SKUs tightens, meaning the variations between SKU images become smaller. Anyone can tell the difference between a pair of shoes and a shirt, but what about the same shirt product with 15 SKUs?
We’ve implemented a custom product image similarity architecture that pairs a fine-tuned CLIP model with a custom supporting pipeline to reach 92.44% Top-K=1 accuracy and 99.3% Top-K=3 accuracy on a massive product image dataset. The pipeline focuses on finding the matching SKU for an input product image in a full database of 10 million plus products. We used images from all categories of the Google Product Taxonomy, with a focus on apparel, tech, and home goods.
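The full pipeline is custom, but the core retrieval step (embed every catalog image once, embed the incoming product image, and pull back the nearest SKUs) looks roughly like the sketch below. This is a minimal illustration rather than our production implementation: it assumes precomputed, L2-normalized CLIP image embeddings and uses FAISS for the nearest-neighbor search.

```python
# Minimal sketch: nearest-SKU lookup over precomputed CLIP image embeddings.
# Embeddings are assumed to be float32 and L2-normalized, so the inner product
# returned by FAISS is the cosine similarity.
import numpy as np
import faiss  # pip install faiss-cpu


def build_index(db_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """db_embeddings: (num_products, dim) matrix of catalog image embeddings."""
    index = faiss.IndexFlatIP(db_embeddings.shape[1])
    index.add(db_embeddings)
    return index


def top_k_skus(index: faiss.IndexFlatIP, query_embedding: np.ndarray, k: int = 3):
    """query_embedding: (dim,) embedding of the input product image.
    Returns the cosine similarities and row ids of the k closest catalog images."""
    scores, ids = index.search(query_embedding[None, :].astype(np.float32), k)
    return scores[0], ids[0]
```

At the 10-million-product scale an exact flat index gets expensive, so in practice an approximate index (for example FAISS IVF or HNSW) is the more realistic choice; the flat index just keeps the sketch short.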
Image similarity models are becoming increasingly important across ecommerce applications such as image retrieval, image clustering, and product recommendation systems. Many image similarity models have been proposed in recent years, each with its own strengths and weaknesses, and the right choice depends on the specific application and the data available. Recent advancements in deep learning have produced more sophisticated models, and the one that matters most for this pipeline is CLIP.
CLIP (Contrastive Language-Image Pretraining) is a groundbreaking AI model developed by OpenAI that has significantly impacted the fields of computer vision and natural language processing. The model is designed to understand images and their corresponding textual descriptions, making it a powerful tool for various applications, such as image classification, object detection, and zero-shot learning.
At its core, CLIP is a multi-modal deep learning model that combines the strengths of both computer vision and natural language processing. It leverages a unique training approach called Contrastive Pretraining, which aims to maximize the similarity between image and text embeddings for corresponding pairs while minimizing the similarity for non-corresponding pairs. This technique enables the model to learn a rich and meaningful representation of images and their textual descriptions.
The architecture of CLIP consists of two main components: an Image Encoder and a Text Encoder. The Image Encoder can be a ResNet or a Vision Transformer, responsible for converting images into fixed-size feature vectors. On the other hand, the Text Encoder is a Transformer model with GPT-2-style modifications, responsible for converting textual descriptions into fixed-size feature vectors.
One of the most remarkable aspects of CLIP is its ability to perform zero-shot learning reasonably well. Unlike traditional models that must be trained on a fixed set of classes or labels, CLIP can generalize to unseen labels. During zero-shot classification, the cosine similarity between the image embedding and each candidate text embedding is computed, and the text prompt with the highest similarity is chosen as the prediction, allowing CLIP to recognize classes and objects it has never seen before.
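As a concrete illustration of that zero-shot flow, here is a short sketch using the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers (not our fine-tuned model), with made-up prompts and an illustrative file path:

```python
# Zero-shot classification sketch with the public CLIP checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # illustrative path
prompts = ["a photo of a jacket", "a photo of a sneaker", "a photo of a t-shirt"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# and each text prompt; the highest one is the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(prompts[probs.argmax().item()])
```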
The success of CLIP can be attributed to its massive training dataset, consisting of 400 million image-text pairs, and its efficient use of computational resources. The model's robustness to distribution shift and its ability to handle complex image patterns make it a valuable asset in various AI applications. Training the largest CLIP model, RN50x64, required significant compute power, taking 18 days to train on 592 NVIDIA V100 GPUs. The largest Vision Transformer model, ViT-L/14, took 12 days to train on 256 NVIDIA V100 GPUs. These models were trained using mixed-precision, gradient checkpointing, half-precision Adam statistics, and other techniques to optimize memory usage and accelerate the training process.
Product similarity, or product matching, is an embedding-focused task for measuring how similar products are, either one to one or one to many (against a database). It often combines image and text similarity models to fully capture the relationship between products, which lets you compare products with very different names and attributes that should still be labeled as the same underlying SKU.
These two jackets offered by different marketplaces would be nearly impossible to compare for similarity with old school methods such as keyword extraction, NLP rules, or semantic similarity. Even though they’re the same product!
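For a pair of listings like those jackets, one simple (and deliberately simplified) way to score the match is to embed both the images and the titles with CLIP and blend the two cosine similarities. The sketch below uses the public CLIP checkpoint and an arbitrary 60/40 weighting; neither is what runs in our production pipeline.

```python
# Sketch: blend image and text similarity into a single product-match score.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def listing_match_score(image_a, title_a, image_b, title_b, image_weight=0.6):
    """image_a / image_b are PIL images, title_a / title_b are listing titles."""
    inputs = processor(text=[title_a, title_b], images=[image_a, image_b],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
        txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                  attention_mask=inputs["attention_mask"]), dim=-1)
    image_sim = float(img[0] @ img[1])  # cosine similarity between the two photos
    text_sim = float(txt[0] @ txt[1])   # cosine similarity between the two titles
    return image_weight * image_sim + (1.0 - image_weight) * text_sim
```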
We built a custom SOTA image similarity pipeline that outperforms the base CLIP model on the task of matching an input product image to the same SKU image in a database of millions of products. We combined a fine-tuned CLIP model with an in-house custom feature understanding model to reach new highs in this domain.
To evaluate the model we used the standard top-1, top-3, and top-5 accuracy metrics, i.e., how often the correct match was the single best match, in the top 3, or in the top 5. Additionally, we used a delta similarity metric to gauge the distribution of similarities between database images and product images; the chart above shows how it progresses as the model trains. To compute the delta similarity, we first calculated the similarity between each input image and its corresponding product image. We then randomly selected additional product images and calculated the average similarity between the input image and those product images. Subtracting this average similarity from the similarity between the input image and its corresponding product image gives the delta similarity.
This allowed us to further analyze the distribution of similarities between input images and product images and gain insight into the performance of the model.
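In code, the delta similarity described above boils down to a couple of dot products. The sketch below assumes L2-normalized embeddings and uses illustrative variable names:

```python
# Delta similarity: how much closer the input image is to its true product image
# than to a random sample of other product images.
import numpy as np


def delta_similarity(input_emb: np.ndarray, matched_emb: np.ndarray,
                     random_product_embs: np.ndarray) -> float:
    """input_emb, matched_emb: (dim,) L2-normalized embeddings.
    random_product_embs: (n, dim) embeddings of randomly sampled product images."""
    matched_sim = float(input_emb @ matched_emb)
    avg_random_sim = float((random_product_embs @ input_emb).mean())
    return matched_sim - avg_random_sim
```

A healthy model pushes this number up during training: the true match stays close while random products drift away.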
A t-SNE plot of the embeddings shows that product images are much more varied in nature than input images, and rightly so.
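If you want to reproduce that kind of plot on your own embeddings, a basic t-SNE projection with scikit-learn looks like this (parameter values are just reasonable defaults, not the ones we used):

```python
# 2-D t-SNE projection of input vs. product image embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_embedding_tsne(input_embs: np.ndarray, product_embs: np.ndarray) -> None:
    all_embs = np.vstack([input_embs, product_embs])
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(all_embs)
    n = len(input_embs)
    plt.scatter(coords[:n, 0], coords[:n, 1], s=5, label="input images")
    plt.scatter(coords[n:, 0], coords[n:, 1], s=5, label="product images")
    plt.legend()
    plt.show()
```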
CLIP uses a symmetric cross-entropy loss function as part of its contrastive learning approach. The model is trained to maximize the cosine similarity of the image and text embeddings of the real pairs in a batch while minimizing the cosine similarity of the embeddings of the incorrect pairings. The symmetric cross-entropy loss is computed over the similarity scores of the image and text embeddings.
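Written out, that symmetric loss is just a cross-entropy in each direction over the batch's similarity matrix. Here is a compact sketch matching the description above (not code lifted from our training run):

```python
# CLIP-style symmetric contrastive loss over a batch of matched image/text pairs.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embs: torch.Tensor, text_embs: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """image_embs, text_embs: (batch, dim), L2-normalized; logit_scale: learned temperature."""
    logits = logit_scale * image_embs @ text_embs.t()               # (batch, batch) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)   # diagonal = correct pairs
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2
```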
We chose a different loss function for this specific fine-tuning use case and tuned our hyperparameters to keep the model from overfitting to this dataset. As the per-epoch training curves above show, we did a good job of preventing overfitting once we had a set of hyperparameters we felt found the global minimum of our loss function.
This fine-tuned CLIP model, alongside our custom feature understanding model, quickly learned how to compare products for underlying SKU similarity and left us with plenty of room in the tank to improve accuracy further in future iterations. This pipeline doesn’t even leverage our custom text similarity architecture, which currently reaches over 90% accuracy for product similarity use cases in Pumice.ai.
Width.ai builds custom NLP & computer vision software for businesses to leverage in use cases like the ones talked about above. Want to schedule a time to talk to us about how we can build something like this for you? Schedule a time to chat on our Contact Us page.