Large language models (LLMs) like GPT-4 are increasingly being used to automate complex business workflows.
Complex computer vision tasks are becoming increasingly accessible to laypersons, thanks to text-guided models that combine image processing with large language models. In this article, we explore text-guided open-vocabulary segmentation, a necessary first step to creating an AI agent capable of visual segmentation.
In an earlier article, we explained a list of segmentation concepts like semantic, instance, panoptic, referring, zero-shot, one-shot, and few-shot segmentation. But what is text-guided, open-vocabulary segmentation and where does it fit within that list?
To understand that, you must first observe the biggest trend in AI right now — the advent of LLMs like GPT-4, ChatGPT, and LLaMa. Their ability to understand complex natural language instructions and automatically work out suitable action plans by coordinating multiple AI agents is improving by the day. Natural language is increasingly becoming a sort of universal programming language to instruct AI agents and automate any kind of workflow in any domain.
Text guidance for image processing tasks like segmentation fits into this AI workflow. In the past, you'd need to know how to use Photoshop or hire a developer to write image processing code. But now, a non-coding artist or video editor can just talk to an image processing agent, give it some editing instructions using layperson's language, and it'll carry out the segmentation or other tasks. This is called text-guided segmentation.
Additionally, these editing instructions need not be limited to the word and concept annotations that the segmentation model is trained on. By leveraging the impressive linguistic knowledge of LLMs, the model can understand instructions it hasn't seen before during training and map them to the visual concepts it finds in new unseen images. This is called open-vocabulary segmentation and is another name for zero-shot semantic segmentation when text is used.
Below, we explore two state-of-the-art pre-trained vision-language models for text-guided segmentation.
Inspired by LLMs that can handle any language task with suitable prompts, the segment anything model (SAM) by Kirillov et al. from Meta is a general-purpose approach for promptable segmentation that accepts multimodal prompts, including text prompts, and returns meaningful pixel-level segmentation masks for images.
Further, we can use prompt optimization frameworks to iteratively generate optimized hard prompts that improve the accuracy of the segmentation masks.
SAM's versatility in prompting equips it to support a wide variety of segmentation tasks. It supports these tasks categorized by the nature of results:
For each of the above tasks, SAM is capable of these subtasks categorized by the nature of inputs:
In addition, once an image's embedding has been computed, SAM's prompt encoder and mask decoder produce a mask in roughly 50 milliseconds, even running on a CPU in a web browser. It's also ambiguity-aware, meaning that if a prompt is ambiguous, SAM returns multiple candidate masks with confidence scores.
Together, these capabilities make it suitable for any real-time segmentation task with user interaction on any device.
The image below shows zero-shot, text-guided, open-vocabulary segmentation using SAM:
Even when given an uncommon text prompt like "beaver tooth grille," SAM is still able to map it conceptually to the image and isolate the object with a segmentation mask. It doesn't do so well on some prompts like "a wiper." But supplying an additional point hint by the user helps it select the object. It's even capable of segmenting according to singular or plural forms, like "a wiper" versus "wipers."
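To make this concrete, here's a minimal sketch using Meta's open-source segment-anything package with a point prompt. The publicly released SAM code doesn't expose the text-prompt path, so point and box prompts are the practical way to drive it today; the image path and checkpoint filename below are placeholders for whatever you've downloaded.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read the image and compute its embedding once; prompts can then be varied
# interactively without re-running the heavy image encoder.
image = cv2.cvtColor(cv2.imread("car.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point on the object of interest (label 1 = foreground).
point_coords = np.array([[450, 220]])
point_labels = np.array([1])

# multimask_output=True makes SAM return several candidate masks with
# confidence scores when the prompt is ambiguous.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
print(masks.shape, scores)  # e.g. (3, H, W) plus one confidence score per mask
```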
SAM advances state-of-the-art image segmentation with three contributions: a new promptable segmentation task, the SAM model itself, and the SA-1B dataset of more than a billion masks over 11 million images.
In the next section, we explore the architecture and internals of these pre-trained models.
SAM is a standard encoder-decoder transformer. Its magic lies in its unique mask decoder design and its prompt encoder component that influences the decoder's activations through user-supplied prompts in the form of text, points, boxes, or masks. We explore the internals of each of these components.
Given an input image, the image encoder's job is to derive an embedding (tensor representation) that encodes all its visual characteristics like shapes, contours, colors, edges, textures, and more in a multi-dimensional vector.
In principle, you can use any neural network as an image encoder. SAM chooses to reuse the pre-trained, vision transformer-based, masked autoencoder model from the 2021 paper, Masked Autoencoders Are Scalable Vision Learners. It expects a 1024x1024 RGB image, splits it into 16x16-pixel patches (yielding a 64x64 grid), and encodes their visual aspects into a 256x64x64 embedding tensor per image.
SAM defaults to the ViT-Huge (ViT-H) model with 636 million parameters that generates 1,280-dimensional embeddings. However, as this graph from their ablation study shows, the ViT-Large (ViT-L) model, which weighs about half as much as ViT-H with 308 million parameters and produces 1,024-dimensional embeddings, performs nearly as well. Both show considerable improvements over the small, 91-million-parameter, 768-dimensional ViT-Base (ViT-B).
For smartphone or web browser deployments, we recommend the ViT-L model as the encoder.
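As a quick check of the encoder's output, the sketch below loads the lighter ViT-L backbone and inspects the image embedding SAM produces; the checkpoint filename and image path are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamPredictor

# ViT-L trades a small accuracy drop for roughly half the parameters of ViT-H.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # resizes/pads to 1024x1024 and runs the encoder

embedding = predictor.get_image_embedding()
print(embedding.shape)  # torch.Size([1, 256, 64, 64])
```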
The prompt encoder's job is to encode the prompts, in whichever modality the user has chosen, to prompt embeddings. This is achieved in different ways depending on the modality as explained in the following sections, but all of them generate several prompt tokens with a 256-dimensional embedding for each token.
SAM uses OpenAI's popular contrastive language-image pre-training model (CLIP) as a text encoder to encode free-form text prompts as text embeddings. Specifically, it uses the largest publicly-available CLIP model, ViT-L/14@336px.
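For illustration, here's how a free-form phrase can be turned into a CLIP text embedding using OpenAI's clip package. Note that the publicly released SAM checkpoints don't expose this text path, so this is only a sketch of the idea.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14@336px", device=device)

# Tokenize and encode a free-form prompt into a single text embedding.
tokens = clip.tokenize(["a beaver tooth grille"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)
print(text_embedding.shape)  # torch.Size([1, 768])
```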
A point prompt is encoded as a sum of its positional encoding on the image with a pre-learned embedding that represents a point as being in the mask foreground or background.
Similarly, a box prompt is encoded as a pair of positional encodings, one for its top-left corner and another for its bottom-right corner.
Mask prompts are encoded differently, however, because they're dense. A mask prompt is passed through a convolutional block to generate an embedding.
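The sparse-prompt encodings can be sketched conceptually as follows. This is a simplified illustration of the idea, not SAM's actual implementation; a toy sinusoidal positional encoding stands in for SAM's learned random Fourier features, and the dense mask path is only noted in a comment.

```python
import torch
import torch.nn as nn

EMBED_DIM = 256

# Learned type embeddings: foreground/background points and box corners.
point_type = nn.Embedding(2, EMBED_DIM)   # 0 = background, 1 = foreground
corner_type = nn.Embedding(2, EMBED_DIM)  # 0 = top-left, 1 = bottom-right

def positional_encoding(xy: torch.Tensor) -> torch.Tensor:
    """Toy sinusoidal encoding of normalized (x, y) coordinates in [0, 1]."""
    freqs = 10000 ** (-torch.arange(EMBED_DIM // 4, dtype=torch.float32) / (EMBED_DIM // 4))
    angles = xy.unsqueeze(-1) * freqs                     # (N, 2, D/4)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, 2, D/2)
    return enc.flatten(start_dim=-2)                       # (N, D)

def encode_point(xy: torch.Tensor, is_foreground: bool) -> torch.Tensor:
    # Point token = positional encoding + learned foreground/background embedding.
    return positional_encoding(xy) + point_type(torch.tensor([int(is_foreground)]))

def encode_box(top_left: torch.Tensor, bottom_right: torch.Tensor) -> torch.Tensor:
    # Box = two tokens, one per corner, each tagged with its corner type.
    tl = positional_encoding(top_left) + corner_type(torch.tensor([0]))
    br = positional_encoding(bottom_right) + corner_type(torch.tensor([1]))
    return torch.cat([tl, br], dim=0)  # (2, D)

# Dense mask prompts instead pass through a small convolutional stack (not shown).
```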
All the actual segmentation happens in the mask decoder. Its inputs are the image embedding from the image encoder, the prompt tokens from the prompt encoder, and a small set of learned output tokens (several mask tokens plus an IoU token). We explore the decoder's components next.
The mask decoder has two stages of two-way attention blocks. Their job is to semantically understand and associate the objects in the image and the concepts in the prompts. For this, given a set of image and prompt tokens, they derive mask tokens and mask embeddings and update all the mask and image embeddings based on cross-attention.
The inputs are processed by the four layers in each block.
The first layer is a multi-head attention layer that computes self-attention over the sparse inputs: the prompt tokens and the output tokens from the previous stage. The higher a pair of tokens scores, the more semantically associated they are compared to other pairs.
The second layer is a multi-head attention layer that determines cross-attention between the sparse prompt tokens and the dense input image tokens. Some of the semantic associations between the concepts in the prompts and the objects in the image are determined here.
The third layer is a multi-layer perceptron (MLP) that updates each token, expanding its dimensionality before projecting it back down.
The fourth layer is a multi-head attention layer that computes cross-attention in the opposite direction, from the image embedding to the prompt tokens. This is why they're called two-way attention blocks.
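A minimal sketch of one two-way attention block is shown below, written to mirror the four layers just described. Layer norms, dropout, and the positional encodings SAM re-adds at each layer are omitted for brevity, so treat it as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class TwoWayAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, mlp_dim: int = 2048):
        super().__init__()
        # 1. Self-attention over the sparse tokens (prompt + output tokens).
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 2. Cross-attention: tokens attend to the image embedding.
        self.tokens_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 3. MLP that expands and re-projects each token.
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(), nn.Linear(mlp_dim, dim))
        # 4. Cross-attention in the opposite direction: image attends to tokens.
        self.image_to_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, image: torch.Tensor):
        # tokens: (B, N_tokens, dim); image: (B, H*W, dim) flattened image embedding.
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.tokens_to_image(tokens, image, image)[0]
        tokens = tokens + self.mlp(tokens)
        image = image + self.image_to_tokens(image, tokens, tokens)[0]
        return tokens, image

# The mask decoder stacks two such blocks before its output heads.
block = TwoWayAttentionBlock()
tokens, image = block(torch.randn(1, 7, 256), torch.randn(1, 64 * 64, 256))
```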
The updated embeddings must be upscaled to match the input image size. This is done by two transposed convolution blocks.
The mask tokens are passed through a three-layer MLP, and its outputs are combined with the upscaled image embedding to produce per-pixel mask probabilities. The resulting masks are resized to the 1024x1024 input size and isolate the objects corresponding to the user prompt.
A smaller regression MLP estimates an IoU score for each candidate mask from a dedicated IoU token, and these scores are used to rank the masks.
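The upscaling and output heads can be sketched like this, again as a simplified illustration with assumed layer sizes rather than SAM's exact code.

```python
import torch
import torch.nn as nn

dim, num_masks = 256, 3

# Two transposed convolutions upscale the 64x64 image embedding 4x to 256x256.
upscaler = nn.Sequential(
    nn.ConvTranspose2d(dim, dim // 4, kernel_size=2, stride=2), nn.GELU(),
    nn.ConvTranspose2d(dim // 4, dim // 8, kernel_size=2, stride=2),
)
# A three-layer MLP maps each mask token to weights that are combined with
# the upscaled embedding to produce per-pixel mask logits.
mask_mlp = nn.Sequential(
    nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim // 8)
)
# A small regression MLP predicts an IoU (quality) score per candidate mask.
iou_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_masks))

image_embedding = torch.randn(1, dim, 64, 64)
mask_tokens = torch.randn(1, num_masks, dim)
iou_token = torch.randn(1, dim)

upscaled = upscaler(image_embedding)                        # (1, 32, 256, 256)
weights = mask_mlp(mask_tokens)                             # (1, 3, 32)
masks = torch.einsum("bnc,bchw->bnhw", weights, upscaled)   # (1, 3, 256, 256) mask logits
iou_scores = iou_head(iou_token)                            # (1, 3) predicted mask quality
```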
As SAM is already pre-trained on the massive SA-1B dataset, it already knows how to segment most real-world images.
To improve the IoU metrics on your specific data, you can freeze some of the mask decoder's layers and fine-tune the weights of the rest using your custom images.
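Below is a minimal sketch of one common fine-tuning setup, assuming the segment-anything package and a hypothetical dataloader of preprocessed images, box prompts, and ground-truth masks (batch size 1 for simplicity). The heavy image encoder and the prompt encoder are frozen, and only the mask decoder is trained; you can freeze additional decoder layers the same way.

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
sam.train()

# Freeze the heavy image encoder and the prompt encoder; train the mask decoder.
for module in (sam.image_encoder, sam.prompt_encoder):
    for param in module.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

for images, boxes, gt_masks in train_loader:  # hypothetical dataloader
    # images: (1, 3, 1024, 1024) preprocessed; boxes: (1, 4); gt_masks: (1, 1, 256, 256)
    with torch.no_grad():
        image_embeddings = sam.image_encoder(images)
        sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=boxes, masks=None)

    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_embeddings,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )
    loss = loss_fn(low_res_masks, gt_masks.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```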
For instance segmentation, the recommended approach is a pipeline that combines object localization and SAM. The object detector acts as a prompt generator by detecting object instances and generating bounding boxes. These boxes are passed to SAM's prompt encoder. SAM generates a segmentation mask for each detected object.
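A sketch of that pipeline, assuming you already have an object detector that returns XYXY pixel boxes (the detector call here is a placeholder):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)  # RGB numpy array loaded earlier

# `detector` is a placeholder for any object detector (YOLO, DETR, etc.)
# that returns bounding boxes in XYXY pixel coordinates.
boxes = detector(image)

instance_masks = []
for box in boxes:
    masks, scores, _ = predictor.predict(
        box=np.array(box),       # the box prompt replaces the point prompt
        multimask_output=False,  # one mask per detected instance
    )
    instance_masks.append(masks[0])
```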
SAM is quite versatile. However, as the paper itself shows, it gets some prompts wrong. This points to some weaknesses in its semantic capabilities.
In the next section, we explore an enhanced model that is designed to overcome some of SAM's weaknesses.
Proposed by Zou et al. just days after SAM was released, the "segment everything everywhere all at once" model (SEEM) is also a promptable segmenter but promises to be even more versatile than SAM. In this section, we explore its capabilities and internals.
As the illustration above shows, SEEM's prompt support is much more versatile and allows for much richer user interaction for segmentation tasks. Its strengths include versatility (it accepts points, boxes, scribbles, masks, text, and even a referred region of another image), compositionality (prompts can be freely combined), interactivity (memory prompts support multi-round segmentation), and semantic awareness (it can attach open-vocabulary labels to its masks).
The images below show SEEM's open-vocabulary abilities on different kinds of images like photos, drawings, and cartoons:
These examples show the model's understanding of the relative sizes of various objects when compared against one another.
SEEM text-guided open-vocabulary segmentation examples (Source: Zou et al.)
The key to SEEM's prompt versatility is that it first embeds all prompts in a joint image-text embedding space, much like CLIP. This is in contrast to SAM, which uses a different embedding space for its text, geometry, and mask prompts.
By embedding all visual prompts and text prompts in the same joint visual representation space, SEEM can support composite prompts that enable users to express their intent unambiguously by combining text and visual prompts.
Another benefit of using a joint embedding space is that semantically-related text embeddings can be derived for visual prompts, making it highly semantically aware at all times. This is useful for incremental segmentation using memory prompts where a mask from a previous request is easily turned into a semantic query to provide the boundaries for a new request.
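Conceptually, and this is only an illustration of the idea rather than SEEM's actual code, working in one joint space means every prompt, whether text, a scribble, or a previous mask, reduces to a vector that can be matched against candidate mask embeddings:

```python
import torch
import torch.nn.functional as F

def match_prompts_to_masks(prompt_embeddings: torch.Tensor,
                           mask_embeddings: torch.Tensor) -> torch.Tensor:
    """Score each candidate mask against the combined prompt.

    prompt_embeddings: (P, D) text/visual/memory prompts already projected
                       into the joint image-text space.
    mask_embeddings:   (M, D) embeddings of the candidate masks.
    Returns an (M,) similarity score used to pick the intended mask.
    """
    # Composite prompt: average the individual prompt vectors into one query.
    query = F.normalize(prompt_embeddings.mean(dim=0, keepdim=True), dim=-1)
    masks = F.normalize(mask_embeddings, dim=-1)
    return (masks @ query.T).squeeze(-1)  # cosine similarity per mask

# A mask from a previous round can simply be embedded and appended to
# prompt_embeddings, which is the essence of SEEM's memory prompts.
```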
SEEM's higher levels of semantic awareness and better interactivity vis-a-vis SAM are depicted below:
SEEM adapts the X-Decoder architecture shown below:
For the vision backbone, SEEM uses a dual attention visual transformer (DaViT). For the language encoder, it uses either the Unified Contrastive Learning (UniCL) model or Florence.
For embedding visual prompts like boxes or scribbles, it uses the visual backbone to extract features from the vicinity of the prompt and embeds them in the joint text-image embedding space.
Unlike SAM's massive SA-1B, SEEM is trained on modest volumes of data.
SEEM is compared with other modern segmentation architectures below. The (T) and (B) refer to tiny and baseline models depending on which visual and language backbones are used. We can see that its metrics compare well against other models:
Though it's trained only on the COCO dataset, SEEM holds its own. If trained on the SA-1B dataset, it might even outperform everything shown here, including SAM.
We've seen that both of these models can struggle with images where the objects we want to segment sit against varied or overlapping backgrounds. The models' reliance on understanding the entire scene relative to their trained vocabulary makes images like this difficult to segment. It was surprising to see that such visually simple images can trip up models with this much training behind them, and I made a few jokes about it on LinkedIn.
As we can see, this sheep causes issues. Here's another prompt that's even more granular and still isn't right.
These are pretty simple images that we would expect the model to understand. While I can't think of many use cases where you'd have something exactly like this, it's a good demonstration of the issue. Here's another one where the results from SAM are baffling.
This pipeline, outlined by IDEA-Research, combines Grounding DINO and Segment Anything to detect and segment objects from text inputs. The goal is to combine the strengths of both tools to solve more complex problems, including text-guided segmentation; a code sketch of the full recipe follows the component walkthrough below.
Grounding DINO is an innovative deep learning framework designed to tackle object detection and referring expression comprehension (REC) tasks. It takes advantage of a dual-encoder-single-decoder architecture to effectively identify and extract relevant information from input images and text. In this section, we provide a brief overview of Grounding DINO's main components and how they come together to achieve state-of-the-art performance in both object detection and REC tasks.
1. **Feature Extraction and Enhancer**: Grounding DINO uses an image backbone (e.g., Swin Transformer) to extract multi-scale image features and a text backbone (e.g., BERT) to obtain text features. Once the vanilla image and text features are extracted, they are fed into a feature enhancer module for cross-modality feature fusion. This module helps align features of different modalities, enhancing their representation.
2. **Language-Guided Query Selection**: To effectively leverage the input text to guide object detection, Grounding DINO employs a language-guided query selection module. This module selects features from the image that are most relevant to the input text as decoder queries, initializing them for further processing.
3. **Cross-Modality Decoder**: The cross-modality decoder combines image and text modality features, allowing the model to better align information from both sources. Each cross-modality query goes through a series of self-attention, image cross-attention, text cross-attention, and feed-forward neural network layers in the decoder. This structure ensures that text information is effectively injected into the queries for improved modality alignment.
4. **Sub-Sentence Level Text Feature**: Grounding DINO introduces a "sub-sentence" level representation that eliminates unwanted word interactions while maintaining fine-grained understanding. Attention masks are used to block attentions among unrelated category names, preventing unnecessary dependencies while retaining per-word features.
5. **Loss Function**: The model uses a combination of L1 loss, GIOU loss for bounding box regressions, and contrastive loss with focal loss for classification. These losses are used in conjunction with bipartite matching and auxiliary loss in a similar fashion to existing DETR-like models.
You can see it gets all of our animals!
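Below is a hedged sketch of that Grounded-SAM recipe: Grounding DINO turns a text prompt into boxes, and those boxes prompt SAM. The function names follow the public groundingdino and segment_anything packages, but the config path, checkpoint filenames, image path, and caption are placeholders that will differ in your setup.

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# 1. Text-prompted detection with Grounding DINO (paths are placeholders).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("animals.jpg")
boxes, logits, phrases = predict(
    model=dino, image=image,
    caption="sheep. dog. horse.",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                         in_fmt="cxcywh", out_fmt="xyxy").numpy()

# 2. Prompt SAM with each detected box to get one mask per instance.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

masks = []
for box in boxes_xyxy:
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])
print(f"Segmented {len(masks)} instances: {phrases}")
```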
The following examples demonstrate uses of text-guided segmentation in different businesses.
In retail settings, inventory tasks like cycle counting and quarterly counting are time-consuming but necessary chores. With text-guided segmentation, these tedious tasks can become automated. A store employee equipped with a smartphone can point the camera at a shelf and speak or type the products or produce they're interested in counting.
In the illustration below, the system is asked to segment out "oranges":
Similarly, other time-consuming inventory tasks like cataloging in bookstores or libraries can be streamlined using text-guided segmentation. In the example below, we segment out all the "blue books" on a bookshelf:
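Counting then becomes trivial once the pipeline returns one mask per instance. Here's a tiny sketch, assuming a hypothetical segment_by_text helper that wraps the Grounded-SAM style pipeline shown earlier:

```python
def count_items(image_path: str, item: str) -> int:
    """Count how many instances of `item` appear in the image.

    `segment_by_text` is a hypothetical wrapper around the text-guided
    segmentation pipeline sketched earlier: it returns one segmentation
    mask per detected instance of the text prompt.
    """
    masks = segment_by_text(image_path, prompt=item)
    return len(masks)

print(count_items("shelf.jpg", "oranges"))  # number of oranges detected on the shelf
```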
Large language models, natural language prompts, and AI agents that can understand them and automate workflows are the trends of the future in the workplace. With them, you can automate your business workflows like retail inventory management, invoice matching, text extraction from scanned and handwritten documents, document understanding, information extraction from documents, signature verification, and more.
Contact us for guidance on automating complex image and video processing workflows using large language models like GPT-4 and ChatGPT to streamline your business operations.