Training a large language model (LLM) from scratch is not a trivial undertaking. The volume of data and the computing resources required involve all kinds of inherent and emergent complexities.
In this article, find out how to train an open source LLM on MosaicML, a platform that specializes in training LLMs and managing their training complexities.
MosaicML is a platform and software stack for efficiently training large language models and deploying them for inference. It provides all the logic and tooling you need to set up massively distributed training runs on very large datasets.
Training LLMs involves several hard problems: provisioning and orchestrating GPU infrastructure, streaming and sharding massive datasets across nodes, distributing the training itself, and checkpointing and monitoring long-running jobs. MosaicML addresses all these problems with purpose-built features, which we'll walk through below.
MosaicML is pretty awesome and really does make it easy to train a model. One of the most difficult things we work with clients on is the actual deployment of these models. Everyone is eager to train their own LLM for their specific use case, but nobody wants to talk about deployment! Deploying models and managing the required infrastructure on your own is a challenging task. MosaicML lets you do it at costs very similar to AWS.
To understand MosaicML's capabilities, you should study the MPT-30B family of foundation models that were trained from scratch using MosaicML.
MPT-30B is a decoder-only, transformer-based, autoregressive, causal language model. Its features include an 8,192-token context window, ALiBi positional encoding for extrapolating to longer contexts, FlashAttention for faster training and inference, and a license that permits commercial use.
MosaicML claims MPT-30B outperforms OpenAI's original GPT-3 on a variety of natural language generation benchmarks.
We qualitatively assessed MPT-30B's responses for common use cases like summarization, customer service chat, creative writing, and code generation. It did well on the first three tasks but fell short on the last.
The abstract summaries it generates are excellent; in one test, it succinctly summarized the abstract of a research paper.
It also summarized an entire play, thanks to its 8,192-token context length.
Its conversational ability, a necessity for customer service chatbots, is also good; it handled questions about banking products well.
Its creativity impressed us with a poignant poem.
However, the programming code it generated was generally unimpressive: while usually syntactically correct, it was often functionally incorrect.
We get an idea of the scale of the training from the datasets MPT-30B was trained on: about 1 trillion tokens drawn from several very large public text and code corpora.
In this section, we explain how you can train a massive LLM like the MPT-30B from scratch on the MosaicML platform. We'll explain MosaicML's foundational components like Composer as well as convenience helpers like llm-foundry.
MosaicML manages all infrastructure using Kubernetes (K8s) container orchestration. Since K8s can be deployed on any public cloud or on-prem infrastructure, MosaicML is effectively cloud-agnostic.
You can create any number of K8s clusters with as many GPUs and nodes as your company needs. MosaicML then takes care of right-sizing its infrastructure requests based on the volume of training data.
As a reference point, consider the training infrastructure of the original MPT-30B model, whose training runs each used GPU counts in the hundreds. These GPU numbers are mind-blowing! It reportedly took two months to finish the three training runs. Even though training wasn't round-the-clock but done in three intermittent sessions, it's still an incredible number of GPUs.
While MosaicML provides infrastructure for model deployment, it doesn't provide it for training. You'll have to provision GPUs yourself on public clouds like AWS or Azure, with the attendant bureaucracy of quota-increase requests. Given the global GPU shortage, though, the chances of getting dozens of GPUs there are quite low; a better strategy is to use a specialized GPU cloud like CoreWeave.
Given dataset scales like those above, you'll need anywhere from a few hundred gigabytes to several terabytes of training data for such LLMs.
MosaicML can integrate with popular cloud storage services like Amazon S3, Google Cloud Storage, and Azure Blob Storage. Transfer your training data to any of these storage providers.
The next step is to configure sharding of your massive training dataset for multi-node distributed training across the cluster. For this, MosaicML provides the StreamingDataset abstraction layer, which efficiently shards your data from any supported storage backend and streams it to each cluster node.
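To make this concrete, here's a minimal sketch of the workflow using MosaicML's open source streaming library; the bucket path and column schema are placeholders, not values from a real project:

```python
from torch.utils.data import DataLoader
from streaming import MDSWriter, StreamingDataset

# One-time step: convert raw samples into MDS shards on cloud storage.
# The output path and column schema here are hypothetical.
columns = {"text": "str"}
with MDSWriter(out="s3://my-bucket/mds/train", columns=columns) as writer:
    for sample in [{"text": "an example training document"}]:
        writer.write(sample)

# At training time, each node streams just the shards it needs,
# caching them in a local directory.
dataset = StreamingDataset(
    remote="s3://my-bucket/mds/train",
    local="/tmp/streaming-cache",
    shuffle=True,
    batch_size=8,
)
loader = DataLoader(dataset, batch_size=8)
```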
A training YAML file provides essential details like the run name, the repositories and dependencies to install, the GPU type and count to allocate, and the commands that execute the training run.
For example, here's what a configuration file for an MPT-1B (1-billion-parameter) model might look like; this is a representative sketch modeled on llm-foundry's mcli examples, and the exact fields vary by version:
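```yaml
# Representative mcli training config, modeled on llm-foundry's examples;
# exact fields vary by mcli version.
name: mpt-1b-train
compute:
  gpus: 8
  gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry
    pip_install: -e .[gpu]
command: |
  cd llm-foundry/scripts
  composer train/train.py train/yamls/pretrain/mpt-1b.yaml
```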
You can see that it mentions the preferred type of GPUs and the number of GPUs to allocate for this training run. It also describes the commands to execute the training run.
Similarly, the llm-foundry project provides training configurations for MPT-30B and many other models.
MosaicML provides a library of techniques to improve or speed up your training, which you can simply enable in your training YAML. Notable techniques include ALiBi, low-precision LayerNorm, and gated linear units; the sketch after this paragraph shows how such techniques plug into Composer's Trainer.
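Here's a hedged illustration of Composer's Python API, not an official recipe: algorithm objects are passed to the Trainer, and GPT-2 stands in for an MPT-style model so the snippet stays small and runnable:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer import Trainer
from composer.algorithms import Alibi
from composer.models import HuggingFaceModel

# GPT-2 as a small stand-in for an MPT-style model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = HuggingFaceModel(AutoModelForCausalLM.from_pretrained("gpt2"),
                         tokenizer=tokenizer)

# A toy batch of training text; a real run would use StreamingDataset.
encoded = tokenizer(["MosaicML speeds up LLM training."] * 8,
                    return_tensors="pt", padding="max_length", max_length=64)
encoded["labels"] = encoded["input_ids"].clone()
samples = [{k: v[i] for k, v in encoded.items()} for i in range(8)]
loader = DataLoader(samples, batch_size=4)

trainer = Trainer(
    model=model,
    train_dataloader=loader,
    max_duration="1ep",
    # Speed-up methods are passed as algorithm objects. ALiBi swaps
    # position embeddings for attention biases, helping the model
    # extrapolate to longer sequences.
    algorithms=[Alibi(max_sequence_length=64)],
)
trainer.fit()
```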
Use the mcli command to start the training run based on a given training configuration:
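```bash
# Launch the run defined in the YAML above (the file name is illustrative):
mcli run -f mpt-1b.yaml
```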
The mcli tool initiates the training and prints its progress to the console.
You can scale up to use more GPUs on the same node, if available: either modify the "gpus" parameter in the configuration or override it from the command line.
MosaicML scales out automatically if the requested number of GPUs exceeds the cluster's per-node maximum. For example, if you request 24 GPUs and each node has a maximum of eight, MosaicML automatically provisions three nodes and distributes the training across them.
Use mcli to monitor your runs:
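```bash
# List your runs and their current statuses:
mcli get runs
```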
You can also monitor training metrics by enabling TensorBoard logging.
Using early stopping, you can configure thresholds so that if there are no improvements in key metrics between epochs, the training session is automatically stopped.
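Here's a hedged sketch of both TensorBoard logging and early stopping using Composer's documented logger and callback classes, reusing the toy model and dataloader from the earlier sketch:

```python
from composer import Trainer
from composer.callbacks import EarlyStopper
from composer.loggers import TensorboardLogger

trainer = Trainer(
    model=model,              # from the earlier sketch
    train_dataloader=loader,  # from the earlier sketch
    max_duration="5ep",
    # Write metrics for TensorBoard; view with `tensorboard --logdir ./tb_logs`.
    loggers=[TensorboardLogger(log_dir="./tb_logs")],
    # Stop early if the training loss fails to improve for 2 epochs.
    callbacks=[EarlyStopper(monitor="loss", dataloader_label="train", patience=2)],
)
trainer.fit()
```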
As the training is going on, MosaicML creates checkpoints as specified in your configuration. The checkpointing section of a config looks roughly like this (field names follow Composer's checkpointing options; values are illustrative):
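```yaml
# Checkpointing options inside a training config (illustrative values):
save_folder: s3://my-bucket/checkpoints   # local path, cloud storage, or HF Hub
save_interval: 1000ba                     # checkpoint every 1,000 batches
save_num_checkpoints_to_keep: 3           # prune older checkpoints
```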
MosaicML stores the checkpoints in your local filesystem, a cloud storage endpoint, or the Hugging Face Hub, whichever you've configured. You can then use these checkpoints for inference.
Fine-tuning an existing model follows the same training steps above. The only difference is that in your configuration file, you specify the "pretrained" flags so that training starts from existing weights rather than from scratch. Here's a sketch of the relevant section, modeled on llm-foundry's fine-tuning YAMLs:
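```yaml
# Model section of a fine-tuning config (modeled on llm-foundry's
# finetune examples; values are illustrative):
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-30b
```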
You can see an example of fine-tuning MPT-30B for instruction-following in the mpt-30b-instruct configuration.
After training or fine-tuning, MosaicML enables you to deploy your model either on its infrastructure or your preferred cloud infrastructure, depending on your data privacy, data security, and other compliance requirements.
MosaicML's starter edition enables you to deploy a model on MosaicML's infrastructure and obtain an API endpoint, but only if you're using one of the supported base models, such as MPT-7B or MPT-30B, without any fine-tuning.
For your own fine-tuned models, you need the MosaicML enterprise edition.
Use these steps to deploy your model: write a deployment configuration, deploy it with the command-line utility, and send requests to the resulting endpoint.
The deployment configuration file tells MosaicML details like the deployment's name, the model checkpoint to serve, and the GPU resources to allocate.
A representative deployment configuration is sketched below; the field names are illustrative, so consult MosaicML's inference documentation for the exact schema:
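```yaml
# Illustrative inference deployment config; not an official schema.
name: mpt-30b-chat
compute:
  gpus: 4
  gpu_type: a100_40gb
model:
  checkpoint_path: s3://my-bucket/checkpoints/latest-rank0.pt
```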
Use the command-line utility to deploy the model. MosaicML does all the provisioning needed to publish it:
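```bash
# Deploy the model described in the deployment YAML (command shape follows
# the MosaicML inference tooling; the file name is illustrative):
mcli deploy -f mpt-30b-chat.yaml
```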
The model is automatically deployed and made available at an endpoint.
Each deployed model is given a unique name by MosaicML. You need that to send requests to the model. Run the utility to list all the deployments:
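```bash
# Show all deployments and the name assigned to each:
mcli get deployments
```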
You'll see your deployments listed, along with the unique name assigned to each.
You can submit prompts to the deployed model from your applications, identifying it by its MosaicML name. The following sketch shows one way to do that over HTTPS; the endpoint URL and payload shape are illustrative, so check the inference documentation for your deployment's exact format:
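```python
import requests

# Illustrative endpoint URL and payload shape; check the MosaicML
# inference docs for your deployment's exact request format.
ENDPOINT = "https://models.hosted-on.mosaicml.hosting/mpt-30b-chat/v1/predict"
API_KEY = "your-mosaicml-api-key"  # placeholder

response = requests.post(
    ENDPOINT,
    headers={"Authorization": API_KEY},
    json={"inputs": ["Suggest a friendly greeting for a banking chatbot."]},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```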
In this article, you learned how an LLM can be either trained from scratch or fine-tuned on MosaicML. You also saw how to deploy these models on cloud or on-prem infrastructure.
We use platforms like MosaicML to help our clients go to market with their advanced services that use LLMs to improve their customer service or optimize their business workflows. Contact us to find out how we can do the same for your business!