In recent months, large language models (LLMs) like GPT-4, LLaMA, GPT-J, and others have increasingly been adopted by different industries. While initially limited to language tasks, they have since been integrated with software agents that interface with other systems to carry out tasks.
So LLMs now act as artificial brains that take natural language instructions from users, break them down into steps, and instruct agents, again in natural language, to carry out tasks in the real world.
In this ecosystem, agents that can accept natural language instructions to carry out different tasks are crucial. One such task is image segmentation. As an example of its use, a retail employee can instruct an LLM to "Count all the shampoo bottles on this shelf." The LLM in turn instructs a segmentation agent to look for "shampoo bottles" in a camera feed and isolate them.
In this article, we look at a segmentation model called CLIPSeg that is built for such text-guided tasks and integration with LLMs as an agent. Specifically, we'll understand how CLIPSeg uses OpenAI's contrastive language image pretraining model, or CLIP, for text-guided zero-shot and one-shot segmentation.
CLIPSeg, proposed by Lüddecke and Ecker, is a language-image model for semantic segmentation that segments images using the text prompts or prototype images you provide.
Under the hood, CLIPSeg leverages OpenAI's powerful CLIP image-text model that enables seamless use of text-prompting for computer vision tasks.
CLIP equips CLIPSeg for two image segmentation tasks: zero-shot, text-guided segmentation driven by a text prompt, and one-shot, image-guided segmentation driven by a prototype image.
Its text and visual prompting capabilities enable CLIPSeg to work as an agent with large language models like GPT-4 to facilitate many business use cases.
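Before digging into the architecture, here's roughly what running CLIPSeg looks like in practice. This is a minimal sketch using the publicly available Hugging Face port of the model (CIDAS/clipseg-rd64-refined); the image path, prompt, and threshold are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Load the pre-trained CLIPSeg weights published by the paper's authors.
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("shelf.jpg")  # hypothetical retail-shelf photo
prompts = ["shampoo bottles"]

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid turns the raw logits into a per-pixel confidence heatmap;
# thresholding the heatmap gives a binary segmentation mask.
heatmap = torch.sigmoid(outputs.logits)
mask = heatmap > 0.5
```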
CLIPSeg has a transformer-based, encoder-decoder architecture. Its encoder is a pre-trained CLIP vision-language model based on the vision transformer (ViT-B/16) network. It generates an embedding and attention activations for a target image.
For the decoder, CLIPSeg stacks three standard transformer blocks that combine the target image embedding, its activations, and the conditioning prompt to output a binary segmentation mask.
We explore all these architectural aspects in more detail in the next sections.
A single embedding, generated by the encoder for a target image, is insufficient to identify precise segmentation masks. A lot of useful information required by the decoder is not available in the embedding. However, it is available in the attention activations of the encoder's transformer blocks.
So CLIPSeg opts for a U-Net-like architecture where its three decoder blocks are connected to three of the vision transformer encoder's blocks. Their attention information enables the decoder blocks to more accurately infer both nearby and distant spatial and semantic relationships in the image.
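A minimal sketch of what pulling those intermediate activations can look like, using a stand-alone CLIP vision encoder from Hugging Face and its per-layer hidden states as a stand-in for the attention activations described above; the layer indices follow the description later in this article, and the model name and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

pixel_values = image_processor(images=Image.open("shelf.jpg"),
                               return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = vision_encoder(pixel_values, output_hidden_states=True)

# Keep the activations of the 3rd, 6th, and 9th transformer blocks;
# these are what the decoder blocks consume through skip connections.
skip_activations = [outputs.hidden_states[i] for i in (3, 6, 9)]
```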
In this section, we explore CLIPSeg's technique to make text or image prompts influence the decoder's segmentation results.
Text or image prompts are additional multimodal inputs that the model must combine with a target image in some way. Influencing a model's results in this fashion, with additional input, is called "conditioning." For example, a popular conditioning technique for natural language processing (NLP) combines token position embeddings with input text embeddings. But that approach just influences the first layer.
Conditioning techniques that influence not just the first but every layer of the model give better results. CLIPSeg uses one such technique called feature-wise linear modulation (FiLM), proposed by Perez et al.
FiLM works as follows: two small linear layers map the conditioning vector (here, the prompt embedding) to a scale vector γ and a shift vector β, and the network's intermediate activations x are then modulated through the affine transform γ ⊙ x + β. Because γ and β depend on the prompt, the conditioned layer's features are steered toward the concept the prompt describes.
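A minimal sketch of a FiLM layer in PyTorch, assuming one conditioning vector per sample and token-shaped transformer activations; the dimensions are illustrative.

```python
import torch
from torch import nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: gamma(cond) * x + beta(cond)."""
    def __init__(self, cond_dim: int, feature_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feature_dim)
        self.to_beta = nn.Linear(cond_dim, feature_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, feature_dim), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(1)   # broadcast over tokens
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * x + beta

# Example: condition a batch of patch tokens (width 768) on a 512-d prompt embedding.
film = FiLM(cond_dim=512, feature_dim=768)
tokens = torch.randn(1, 485, 768)
prompt_embedding = torch.randn(1, 512)
conditioned = film(tokens, prompt_embedding)
```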
Having understood FiLM, we can now explore how CLIPSeg works with different prompts.
Zero-shot text-guided segmentation is when the text prompt you supply isn't part of the CLIP dataset or training. It's a novel prompt that CLIPSeg is seeing for the first time. Despite that, it infers visual concepts that are semantically related to that prompt, identifies them in the query image, and segments them with the following steps.
In one-shot segmentation, you supply three inputs — the target image, a prototype image of an object you're interested in, and a mask to isolate that object in the prototype image. The latter two comprise the visual prompt. One-shot segmentation looks for objects in the target image that resemble the object and segments them using the following steps.
In a zero-shot setting, the supplied text prompt is run through CLIP's text encoder to obtain an embedding vector. One-shot is a bit different: the visual prompt, consisting of the example object image and its object mask, is run through CLIP's visual encoder. Note that this is not a simple image processing operation of isolating the object with the mask and then embedding the isolated object. Instead, the object mask conditions the multi-head attention weights themselves, so that attention is restricted to the image patches in the unmasked areas and ignores the masked areas.
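A short sketch of the text-prompt path using the stand-alone CLIP model from Hugging Face (the masked-attention visual path is internal to CLIPSeg and not shown); the model name and prompt are illustrative.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Encode the text prompt into CLIP's joint text-image embedding space.
tokens = clip_processor(text=["orange bottles"], return_tensors="pt", padding=True)
with torch.no_grad():
    prompt_embedding = clip.get_text_features(**tokens)   # shape: (1, 512)
```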
Because CLIP's joint text-image embedding space is built with contrastive pair training, related text-image pairs, along with their semantically nearby texts and images, cluster close to one another.
So the embedding for the text prompt lies near its semantically related image embeddings. Essentially, CLIPSeg converts the text prompt into its equivalent visual concept in the form of an embedding.
The target image to segment is run through CLIP's visual encoder network, a vision transformer (ViT-B/16) model by default. However, the goal here is not to obtain an embedding for the image.
Instead, it's interested in the attention activations of the third, sixth, and ninth encoder blocks to act as sources of skip connections to the decoder blocks. To keep the decoder model light, just these three blocks are selected, but more can be added for improved accuracy.
Next, CLIPSeg conditions the last selected encoder block's activations on the prompt embedding (from the first step) using FiLM. The prompt embedding is supplied to the two FiLM linear layers to obtain the γ and β vectors, which are combined with the encoder block's activations through an affine transform.
By default, the CLIPSeg decoder has three transformer blocks.
These three transformer blocks output the image patch tokens that constitute the segmentation mask for the given prompt.
The final step converts these mask tokens into a binary segmentation mask the same size as the target image. It does this using one or more transposed convolution (deconvolution) layers, which upsample the decoder's mask tokens to the full image resolution.
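As a rough sketch of that last step: with a 352x352 input and 16x16 patches, the decoder emits a 22x22 grid of mask tokens, which a pair of transposed convolutions can upsample back to 352x352. The channel widths and kernel/stride choices below are illustrative, not CLIPSeg's exact configuration.

```python
import torch
from torch import nn

# 22x22 mask tokens with a reduced embedding width of 64 (illustrative).
mask_tokens = torch.randn(1, 22 * 22, 64)
feature_map = mask_tokens.transpose(1, 2).reshape(1, 64, 22, 22)

# Two transposed convolutions upsample 4x each: 22 -> 88 -> 352.
upsampler = nn.Sequential(
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4),
    nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size=4, stride=4),
)

logits = upsampler(feature_map)            # (1, 1, 352, 352)
binary_mask = torch.sigmoid(logits) > 0.5  # per-pixel binary segmentation mask
```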
Let’s start by taking a look at a few product shelf images with simple prompts. These are the first prompts that would come to mind when using a model like this to extract specific products from shelves. The original research paper outlines a few ways to optimize image prompts in a one-shot environment, but zero-shot is a much more difficult task and fits more use cases.
For the prompt "orange bottles," notice that CLIPSeg segmented mostly just the orange bottles or bottles with orange wrappers on them:
These are pretty good results considering the zero-shot setting, but they lack the accuracy needed at a production level and miss a few key bottles we would expect the model to segment. The heatmap shows that the yellowish bottles right above the deep-red, high-confidence bottles score nearly as high as the larger Sunkist bottles in the top right. It also grabs a few red bottles as part of a larger segmentation chunk that we definitely don’t want. If we’re going to extract products in a pipeline using a confidence threshold plus a size score, we’d like to reduce both the segmented area and the (currently green, mid-level) confidence of the yellowish bottles, and increase both for the clearly orange bottles. Both adjustments would increase accuracy over a given dataset.
One thing to note: we’ll assume the range of what we consider orange is wide, from light to dark orange. If we wanted to focus on one end of that range, the adjustment would have to be made here, in our “high level” prompt.
The first thing we do is identify the largest “chunks” of recognition in the image and evaluate how likely each one is to be a product. We do this with a custom model we trained to evaluate confidence scores and segmentation size, but it can be done reasonably well with a much simpler approach, since the later steps clean up any wrongly classified segmentation chunks. This step is very similar to other tasks like price tag recognition, where we care about the surrounding area and any products touching the defined area. We then crop these smaller areas out of the image with padding around each one.
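Here's a simplified version of that chunking step, standing in for our custom scoring model: threshold the CLIPSeg heatmap, keep the large connected regions, and crop each one out with padding. The threshold, minimum area, and padding values are illustrative, and the heatmap is assumed to have been resized to the image's resolution.

```python
import numpy as np
from scipy import ndimage

def extract_chunks(heatmap: np.ndarray, image: np.ndarray,
                   threshold: float = 0.5, min_area: int = 2000, pad: int = 32):
    """Return padded crops of the image around large high-confidence regions."""
    labeled, num_regions = ndimage.label(heatmap > threshold)
    crops = []
    for region_id in range(1, num_regions + 1):
        ys, xs = np.where(labeled == region_id)
        if ys.size < min_area:          # drop small, likely-noise regions
            continue
        y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, image.shape[0])
        x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, image.shape[1])
        crops.append(image[y0:y1, x0:x1])
    return crops
```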
Now when we run the same simple prompt we see a much tighter segmentation with very high confidence values around the correct products. This works well here already with a pretty simple chunk but as we’ll see below we want to keep the optimizations coming!
The approach up to this point where we just crop and rerun the same text prompt works well on some chunks, but does not solve all of our issues. For the middle segmentation chunk we have a problem when using the original “orange bottles” prompt.
Although the confidence score isn’t as high on the red bottles as it is on the orange ones, it’s pretty high compared to some of the ones we saw in the original image and even has a few light red sections.
We trained a GPT model to optimize the provided high-level prompt into a more granular version that takes advantage of the model’s understanding of specific keywords and phrases. The concepts behind how it’s trained and evaluated stem from the work done here on hard prompt optimization. I would consider these prompts to still be soft prompts, as they’re still very readable, but they do incorporate many of the ideas. You can also set this model up in a few-shot environment to get it off the ground and have it work very well. We provide our high-level prompt and the outputs of an image-to-text model to generate this prompt: 🍊🍾orange bottles🍊🍾. You can see the outputs are much better for this specific chunk. While in theory you could do this manually, the automated pipeline is what we’re after.
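As a hypothetical stand-in for our trained prompt optimizer, here's roughly what the same idea looks like with a plain GPT-4 call through the OpenAI API: feed it the high-level prompt plus an image-to-text caption and ask for a more granular CLIPSeg prompt. The system prompt, function name, and caption are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def optimize_prompt(high_level_prompt: str, image_caption: str) -> str:
    """Hypothetical stand-in for our trained prompt-optimizer model."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's high-level segmentation prompt into a "
                        "more granular prompt for a CLIP-based segmentation model, "
                        "using the image caption for context."},
            {"role": "user",
             "content": f"High-level prompt: {high_level_prompt}\n"
                        f"Image caption: {image_caption}"},
        ],
    )
    return response.choices[0].message.content

refined = optimize_prompt("orange bottles",
                          "a supermarket shelf stocked with soda bottles")
```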
Here’s the same simple prompt on the lower-left chunk we saw before. While the recognition looks the same at first glance, if you look closely at the heatmap you’ll notice the red, high-confidence regions got darker and the heatmap piece covering the other bottles became lighter, with less red.
CLIPSeg does a great job with this very high-level prompt, despite the little granularity it gives about the goal output.
The paper also explores whether the visual prompt for one-shot segmentation — consisting of a prototype image of the object of interest and a mask to isolate that object — can be pre-processed in some way to improve the probability of detecting that object in the target image.
After trying out various combinations, the researchers recommend the following pre-processing steps on the prototype image:
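As a rough sketch of that kind of visual prompt engineering, assuming it amounts to cropping around the masked object and blurring and darkening the background (the exact combination and parameters the paper settles on may differ):

```python
import numpy as np
import cv2

def preprocess_prototype(image: np.ndarray, mask: np.ndarray,
                         pad: int = 32, blur_ksize: int = 21,
                         bg_brightness: float = 0.5) -> np.ndarray:
    """Crop around the masked object, then blur and darken everything outside the mask."""
    ys, xs = np.where(mask > 0)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, image.shape[1])

    crop = image[y0:y1, x0:x1]
    crop_mask = (mask[y0:y1, x0:x1] > 0)[..., None]

    background = cv2.GaussianBlur(crop, (blur_ksize, blur_ksize), 0) * bg_brightness
    return np.where(crop_mask, crop, background.astype(image.dtype))
```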
The segmentation probabilities with and without pre-processing are depicted below:
In this one-shot segmentation demo, we want to find all cereal boxes with a certain color or of a certain brand on this retail shelf:
Instead of just optimizing our text prompts and building new pipelines, we can supply an image example of our specific box we’re interested in to create granularity.
CLIPSeg generates the following segmentation mask to match the prototype image:
This is a great way to add granularity to our pipeline for specific SKUs or objects without requiring difficult prompt optimization.
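A minimal sketch of that one-shot query, assuming the Hugging Face port's conditional_pixel_values argument is used to pass the image prompt in place of a text prompt; file names are illustrative and the prototype image is assumed to be pre-processed as described earlier.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

shelf = Image.open("cereal_shelf.jpg")          # target image (illustrative path)
prototype = Image.open("cereal_prototype.jpg")  # pre-processed prototype image

target_inputs = processor(images=shelf, return_tensors="pt")
prompt_inputs = processor(images=prototype, return_tensors="pt")

with torch.no_grad():
    outputs = model(pixel_values=target_inputs.pixel_values,
                    conditional_pixel_values=prompt_inputs.pixel_values)

mask = torch.sigmoid(outputs.logits) > 0.5   # regions matching the prototype box
```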
The goal of CLIPSeg training is to train its decoder stack consisting of three transformer blocks, the FiLM layers, and the transpose convolution layers. It does not retrain or fine-tune the CLIP encoder stack.
The model can be trained on different datasets, like PhraseCut or COCO. Training minimizes a simple binary cross-entropy loss between the predicted binary mask and the ground-truth mask.
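A rough sketch of such a training loop in PyTorch, with the CLIP encoder frozen and only the decoder-side parameters updated; the model interface, attribute names, and data loader are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed setup: `model` is a CLIPSeg-style network with a `clip` encoder attribute
# and a trainable decoder (transformer blocks, FiLM layers, transposed convolutions);
# `loader` yields (images, prompts, ground-truth masks).
for param in model.clip.parameters():
    param.requires_grad = False        # freeze the CLIP encoder stack

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

for images, prompts, gt_masks in loader:
    pred_logits = model(images, prompts)              # predicted mask logits
    loss = F.binary_cross_entropy_with_logits(pred_logits, gt_masks.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```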
Let’s take a look at how CLIPSeg compares to other SOTA segmentation models.
On zero-shot segmentation, CLIPSeg achieves high mean intersection-over-union (mIoU) scores on unseen classes (which is the very objective of zero-shot segmentation):
It outperforms all the other networks on unseen classes by a huge margin. This shows the power of the CLIP model to find semantically related prompts.
But on the seen classes, its mIoU benchmarks aren't as impressive relative to other models. This is because the other state-of-the-art models are trained on thousands of images spanning just a dozen classes from the Pascal dataset, while CLIPSeg, though trained on a much larger number of classes, has to learn from far fewer images per class. Plus, CLIPSeg's segmentation masks are not as precise compared to the other models below that are optimized on large image datasets with few classes.
The Grounded Segment-Anything model does a great job as well of extracting just the orange bottles in a zero shot environment. This model combines the works of the Segment-Anything model and Grounding DINO.
The same can’t be said for the Segment-Anything with CLIP model, which does not perform as well for this use case as our above work.
The idea behind this pipeline is actually very similar to what we outlined with our custom pipeline using GPT-4. It works like this: the Segment-Anything model proposes masks for everything in the image, each masked region is cropped out, and CLIP then scores each crop against the text prompt to decide which segments to keep.
On one-shot segmentation, CLIPSeg's metrics are on par with other segmentation models.
It's interesting to see how CLIP-based visual backbones generally score lower than architectures with the older ResNet50 and ResNet101 backbones.
CLIPSeg's simplicity enables you to enhance and fine-tune it to your specific business needs. Any approach that can enhance or fine-tune CLIP will also benefit CLIPSeg.
For example, you may want to fine-tune the model to your specific retail inventory so that during store cycle counting, your employees can segment products by brand and model, or even the stock keeping unit number (SKU). For that, you can fine-tune CLIP using a custom loss function to optimize for SKU similarity.
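As a starting point, here's a minimal sketch of fine-tuning CLIP on SKU-level image-text pairs using its standard contrastive objective; a custom SKU-similarity term could be added on top of this. The dataset iterator and batch format are assumptions for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
optimizer = torch.optim.AdamW(clip.parameters(), lr=1e-5)

# `sku_batches` is an assumed iterator of (images, texts) pairs where each text is a
# SKU-level description, e.g. "Brand X citrus soda, 20 oz bottle, SKU 12345".
for images, texts in sku_batches:
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    outputs = clip(**inputs, return_loss=True)   # built-in contrastive loss over the batch
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
```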
The architectural choices themselves provide opportunities to improve CLIPSeg's accuracy, such as connecting more of the encoder's blocks to additional decoder blocks through extra skip connections, at the cost of a heavier decoder.
CLIPSeg's simplicity and its ability to run on smartphones open up the following applications in different industries.
In retail environments, employees can use CLIPSeg for time-consuming tasks like cycle counting of shelves and detecting empty shelves or gaps in shelves. Employees just need to type or speak product names, and the model will locate and segment matching products on the shelf.
Similarly, customers in supermarkets can enter, or speak, their desired product names in kiosks or on their smartphones. A CLIPSeg model then looks for matching products in the shop's camera feeds, identifies the exact shelves where they're located, and helps the customer get there.
Segmentation methods are extensively used in medical image diagnosis because, in addition to locations, the shapes and extents of objects are also crucial to diagnosis.
When using a text-guided model like CLIPSeg, medical technicians and professionals can just type, or speak, their objects of interest in a medical image like an X-ray or a CT scan or MRI that shows soft tissues. A CLIPSeg model that's fine-tuned on medical datasets can then automatically segment those objects in the images.
Drone and satellite imagery is another use case where contours and extents are as important as locations. The ability to quickly scan through massive images (which can be as large as 10,000x10,000 pixels) for objects of interest helps reduce time, effort, and costs.
Image and video editing are other applications where segmentation is extensively used to isolate objects. Text-guided segmentation promises to improve the productivity of such time-consuming tasks.
In this article, you got an in-depth understanding of how CLIPSeg works and a glimpse of its business applications. You can use it along with large language models in your business, too, to accelerate your business workflows and improve productivity in a variety of tasks like product identification, document understanding, image editing, and more.
Contact us to get started with integrating large language models for NLP with computer vision tasks like segmentation and object detection for your business workflows.