Human language translation with artificial intelligence was a task often reserved for those who either worked on Google Translate or had a ton of translation data across languages to train a large language model. Before the rise of transformer based neural network models, you were stuck building LSTM based machine learning models that were much slower and less accurate. Transformer based models have grown in popularity for a number of natural language processing tasks, including language translation, and if you have the data required to train architectures such as T5 you can make steady progress over time.
GPT-3 (Generative Pre-trained Transformer 3) gives you the ability to create language translation software with any amount of training data and get extremely good results right out of the box. The underlying GPT-3 model is trained on billions of words from sources such as Wikipedia and learns to predict the next token in a sequence with its 175 billion parameters (no longer the largest language model). Through prompt-based programming you can turn this task-agnostic autoregressive model into a language translation model that picks up on the task remarkably well.
In this article, we’ll take a look at how we build production-level language translation models that stay flexible enough to support any number of languages and any amount of training data. After all, the key benefit of GPT-3 is its ability to pick up on a task from just a few examples of what to do, instead of requiring you to retrain huge language models. Here are the steps to take to reach a production-level architecture.
Building an understanding of a few high-level ideas related to your specific use case is incredibly important for scoping the requirements you need to build against. Having even a rough answer to these questions makes it much smoother to move from simple GPT-3 prompts to a production-grade product. Most of these questions come up for almost any generative pre-trained transformer based architecture.
Understanding how many languages you want to support out of the gate is vital. Not only do your input and output systems need to handle those languages, but from a data variance standpoint we will make architectural decisions based on the answer. The more languages you want to support, the more data variance your model must cover. Data variance is a high-level way of describing how many different variations of the problem could show up as input to the model. This includes the number of languages, the length of the input text, the number of translations to perform at one time, and so on.
The pathway we take below when actually building our GPT-3 prompt for language translation will to some extent be determined by how much training data we already have. While GPT-3 can get started with any number of translation examples, we usually cover more data variance with more examples that show the model how to complete the task. As we can see in the graphic below, GPT-3 becomes much more accurate as the number of relevant examples in our prompt increases. This is because we have now shown GPT-3 a variety of different inputs and how to complete the task correctly for each. The more data we have to work with, the easier it is to support different languages and input sources at production accuracy.
Your data variance can grow exponentially as you increase the number of data sources that will run through your GPT-3 model. As you can imagine, there is a different level of understanding required from GPT-3 to translate a single question versus an entire research paper. While you can build an architecture that supports both, it’s good to understand that up front so you don’t build a model that is tightly fit to its training data.
GPT-3 offers a number of different underlying engines that can perform a task like this at different levels of quality and cost. The general idea is that as the models get more expensive per run, their ability to understand tasks improves.
Davinci is by far the most popular engine used today for almost any GPT-3 task. It is the most capable model and has shown the ability to perform tasks with higher accuracy and less instruction. This means it needs fewer examples on average and does not require instructions to be worded as precisely to understand the task. Davinci also offers an instruct version that has been fine-tuned on instruction-style text, giving it an even better understanding of prompt language. This was once the biggest language model at 175 billion parameters.
Curie is the next engine in line and is about 1/10 the cost of Davinci. While it is generally less powerful than Davinci on most tasks, Curie is still strong at tasks such as language translation. This is mostly because translation is an easier task than things like complex intent, cause and effect, or key topic extraction. The model is very fast compared to Davinci and is worth considering when balancing cost against accuracy.
We almost always recommend using the Davinci model if you want to reach production-level accuracy and data variance coverage. As your product lifecycle moves along and you’ve gathered more translation examples, you can look at Curie if cost becomes an issue. Given that both models can be fine-tuned to reduce per-run costs, there isn’t as much long-term benefit to choosing Curie.
GPT-3 uses a text prompt that lets you instruct the model on the task you’re trying to accomplish. The production framework for this includes the prompt language itself, prompt optimization, and fine-tuning, each covered below.
The key component of this step is building prompt language that gives GPT-3 the best chance of understanding our task across a high level of data variance. In situations where we have a ton of translation examples covering many different languages or data sources, the underlying GPT-3 model can lean on the prompt examples without needing to understand the instructions as much. For the most part, it’s very difficult to cover a huge range of high-accuracy language translations at scale without well-refined prompt language that gives GPT-3 a great shot at producing a quality result in tough situations.
As the data variance you want to support grows, your prompt will be strained and tested much more often, which can result in users receiving poor results. Unlike other GPT-3 tasks such as marketing copy generation, you never want to return poor results, because it’s difficult for the end user to even know the results are bad. How are they supposed to know the translation doesn’t make any sense? We spend a ton of time optimizing the prompt variables to reduce the likelihood of a poor translation, whether the input text is 20 words or 5,000.
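To make this concrete, here’s a minimal sketch of what a few-shot translation prompt can look like in code. It assumes the legacy openai Python client (pre-1.0), a Davinci-class engine name, and a couple of hand-written example pairs; in a real pipeline the examples and parameters would come from your own data and tuning.

```python
# Minimal few-shot translation prompt, assuming the legacy openai client (<1.0)
# and an OPENAI_API_KEY environment variable. Example pairs are illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical few-shot examples; in production these are selected per request.
FEW_SHOT_EXAMPLES = [
    ("English", "French", "Where is the train station?", "Où est la gare ?"),
    ("English", "French", "The meeting was moved to Tuesday.", "La réunion a été déplacée à mardi."),
]

def build_prompt(source_lang, target_lang, text):
    """Assemble task instructions plus example translations into one prompt."""
    lines = [f"Translate the text from {source_lang} to {target_lang}."]
    for src_lang, tgt_lang, src, tgt in FEW_SHOT_EXAMPLES:
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{source_lang}: {text}\n{target_lang}:")
    return "\n\n".join(lines)

def translate(source_lang, target_lang, text):
    response = openai.Completion.create(
        engine="text-davinci-002",      # Davinci-class engine
        prompt=build_prompt(source_lang, target_lang, text),
        max_tokens=256,
        temperature=0.0,                # deterministic output for translation
        stop=["\n\n"],                  # stop before the model invents a new example
    )
    return response.choices[0].text.strip()

print(translate("English", "French", "How much does shipping cost?"))
```

The zero temperature and stop sequence keep the model from drifting past the requested translation, which matters when the output is returned directly to an end user.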
Playground examples of language translation are nice, but do they really work in production? If GPT-3 has a token limit on how many translation examples you can include in a prompt, how do we support thousands of languages or massive translations? It’s no secret that prompt examples relevant to the input text (examples of French to Chinese translation when our input task is French to Chinese translation) dramatically improve accuracy and GPT-3’s ability to understand the task. It makes sense that prompt examples which show GPT-3 how to fulfill the exact task have been shown to boost accuracy by up to 30% in some use cases. So if there is a token limit, how do you include enough examples to cover all of these languages?
Prompt optimization algorithms put perfectly relevant examples for our exact language translation into the prompt at runtime to optimize the results. These algorithms focus on dynamically creating a prompt that is tailored to the input text to maximize accuracy. This allows you to cover an enormous amount of data variance in the language translation task and feel much more comfortable with the output translations. We build these prompt optimization algorithms for almost all of our GPT-3 tasks, and they are the single most important step in moving to a production system.
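A simplified version of that idea is sketched below: embed the incoming text, rank a bank of stored translation examples by similarity, and build the prompt from only the most relevant ones. The example bank, embedding model name, and number of examples kept are illustrative assumptions rather than an exact production setup.

```python
# Sketch of dynamic prompt optimization: choose the few-shot examples most
# similar to the incoming text so the prompt stays under the token limit while
# staying relevant. Assumes the legacy openai client; names are illustrative.
import openai
import numpy as np

# Hypothetical example bank: (source_text, target_text, source_lang, target_lang)
EXAMPLE_BANK = [
    ("Où est la gare ?", "Where is the train station?", "French", "English"),
    ("El envío llega mañana.", "The shipment arrives tomorrow.", "Spanish", "English"),
    # ...hundreds more examples in a real system
]

def embed(text):
    """Return an embedding vector for a piece of text."""
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def select_examples(input_text, k=4):
    """Rank the example bank by cosine similarity to the input and keep the top k."""
    query = embed(input_text)
    scored = []
    for src, tgt, src_lang, tgt_lang in EXAMPLE_BANK:
        vec = embed(src)  # in production these vectors are precomputed and cached
        similarity = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((similarity, src, tgt, src_lang, tgt_lang))
    scored.sort(reverse=True)
    return scored[:k]

def build_optimized_prompt(input_text, source_lang, target_lang):
    """Build a prompt whose examples are tailored to this specific request."""
    lines = [f"Translate the text from {source_lang} to {target_lang}."]
    for _, src, tgt, src_lang, tgt_lang in select_examples(input_text):
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{source_lang}: {input_text}\n{target_lang}:")
    return "\n\n".join(lines)
```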
Fine-tuning is another important part of the process when it comes to moving our accuracy and data variance coverage to production levels. Fine-tuning allows you to generate a custom GPT-3 model with one of the engines as a baseline. The key benefit of this approach is that the examples added during training do not count against the prompt token limit at runtime. This means our GPT-3 model can have a baseline understanding of the task at hand before seeing any prompt examples. While fine-tuning does not provide the same per-task “steering strength” as added prompt examples, it certainly moves us closer.
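As a rough illustration, fine-tuning data for the original GPT-3 fine-tuning API is a JSONL file of prompt/completion pairs, which can then be passed to the legacy OpenAI CLI with a base engine. The file name, separators, and example pairs below are assumptions for the sketch.

```python
# Sketch of preparing fine-tuning data in the original prompt/completion
# JSONL format. File name, formatting, and examples are illustrative.
import json

translation_pairs = [
    {"source": "Where is the train station?", "target": "Où est la gare ?"},
    {"source": "The shipment arrives tomorrow.", "target": "L'envoi arrive demain."},
]

with open("translation_finetune.jsonl", "w", encoding="utf-8") as f:
    for pair in translation_pairs:
        record = {
            "prompt": f"English: {pair['source']}\nFrench:",
            "completion": f" {pair['target']}\n",  # leading space and trailing stop token
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Then start the fine-tune with the legacy CLI, using davinci as the base engine:
#   openai api fine_tunes.create -t translation_finetune.jsonl -m davinci
```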
The ability to fully understand how well your pipeline is performing in production, along with a quantifiable way to see improvements in the accuracy of your product, is a requirement. These two modules give you deep insight into the capabilities and quality of your language translation application.
Confidence metrics are custom-built scoring algorithms that give you an idea of the probability that a given translation run is good. We mentioned above that “accuracy” with GPT-3 can be somewhat relative given the generative nature of the outputs. In many use cases, different parties can have completely different ideas of what is correct, which makes it difficult to quantify what a good result is and to measure confidence in it. For this reason and others, we build our confidence metrics custom to the task at hand using various NLP scoring methods. This gives you some level of confidence in the real-time results of your language translation application.
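Production confidence metrics are task-specific, but one simple ingredient worth showing is the model’s own token log probabilities, which the legacy Completion API can return. The sketch below turns them into a rough 0-to-1 signal; treat it as an illustration, not the full scoring algorithm described above.

```python
# One possible ingredient for a confidence metric: the average log probability
# the model assigned to its own translation tokens. Assumes the legacy openai
# Completion API with logprobs enabled; engine name is illustrative.
import math
import openai

def translate_with_confidence(prompt):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,
        logprobs=1,        # return the log probability of each generated token
        stop=["\n\n"],
    )
    choice = response.choices[0]
    token_logprobs = [lp for lp in choice.logprobs.token_logprobs if lp is not None]
    # Geometric-mean token probability as a rough 0-1 confidence signal.
    confidence = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    return choice.text.strip(), confidence
```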
Testing suites are used to quantify improvements during all stages of development and across the pipeline's lifecycle. They help you split test different language variations in your GPT-3 prompt, different prompt optimization frameworks, or variations in the number of models you use in a pipeline. Understanding which changes to your GPT-3 prompt actually lead to better translations is a huge part of improving your data variance coverage and the number of languages you can support at an acceptable accuracy level.
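At its simplest, a testing suite is a harness that runs each prompt variant over the same held-out translation set and compares an average score. Everything named in this sketch (the variant functions, test set, and score function) is illustrative; the scoring methods themselves are covered next.

```python
# Minimal split-testing harness: compare two prompt variants on the same
# held-out set using any pluggable scoring function (BLEU, WER, etc.).
def evaluate_variant(translate_fn, test_set, score_fn):
    """Average score of one variant over a list of (source, reference) pairs."""
    scores = [score_fn(translate_fn(source), reference) for source, reference in test_set]
    return sum(scores) / len(scores)

def split_test(variant_a_fn, variant_b_fn, test_set, score_fn):
    """Run both prompt variants and return their average scores side by side."""
    return {
        "variant_a": evaluate_variant(variant_a_fn, test_set, score_fn),
        "variant_b": evaluate_variant(variant_b_fn, test_set, score_fn),
    }
```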
BLEU score is one of the most popular algorithms for evaluating natural language processing pipelines such as language translation, chatbots, and abstractive summarization. The key idea behind this algorithm is to compare a generated sentence to a human-created reference sentence and score how similar they are to each other. BLEU leverages n-gram precision to create a final score between 0 and 1 for how close the generated sentence is to the human sentence. A brevity penalty is added to penalize sentences that are too short and keep the 0 to 1 score comparable across sentence lengths. BLEU has both strengths and weaknesses that should be understood.
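For reference, here’s how a sentence-level BLEU score can be computed with NLTK; the smoothing function is an optional choice that helps on short sentences where higher-order n-grams have no matches.

```python
# Sentence-level BLEU with NLTK; sentences here are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the shipment arrives tomorrow morning".split()
candidate = "the shipment is arriving tomorrow morning".split()

# The references argument is a list because BLEU supports multiple human
# reference translations per sentence.
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # a value between 0 and 1
```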
Word Error Rate (WER) is another popular algorithm used in language translation and speech-to-text applications. It compares the generated sentence to a target sentence word by word, counting insertions, deletions, and substitutions and dividing by the total number of words in the reference to quantify the difference between the two. WER is based on a string comparison algorithm known as Levenshtein distance.
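WER is simple enough to compute directly; the sketch below uses a word-level Levenshtein distance and divides by the number of reference words.

```python
# Word Error Rate via word-level Levenshtein distance:
# (substitutions + deletions + insertions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming table for edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the shipment arrives tomorrow", "the shipment arrive tomorrow"))  # 0.25
```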
Language translation with GPT-3 is a popular task for people who want to get started with the OpenAI engine. The high-level idea is offered as a playground example where you can get your feet wet and see the baseline power of GPT-3. That being said, the process of going from nice playground examples to a full-scale production pipeline is very different and includes a number of steps most users don’t even know exist. The architecture required to go from a proof of concept to the production system we’ve laid out today requires a real understanding of data variance, prompt-based programming, and evaluating how large language models produce human-like text.
Width.ai focuses on building deep learning pipelines just like this one for any domain or business use case. We’ve built this exact architecture to take basic GPT-3 ideas and turn them into full-scale products that clients can use. We’ve written a number of in-depth guides to using GPT-3 for various tasks and are leading experts in developing applications with it.