The large language model space has been heating up like never before with new entrants, models, and amazing capabilities announced every month. In this comparison of AI21 vs. GPT-3, we explore one such new entrant, AI21 Labs, and how their Jurassic-1 models fare against OpenAI's GPT-3 models.
AI21 Labs is a company that has built large language models, Jurassic-1 and Jurassic-2, on the same scale as OpenAI's GPT-3. Its Jurassic-1 Jumbo model has 178 billion parameters, comparable to OpenAI's DaVinci model. It also offers a Jurassic-1 Grande model that has been trained to follow instructions along the same lines as InstructGPT.
We'll start with some essential differences between these two model families before we compare them on language tasks.
AI21's main product is called AI21 Studio. It's a web application that features various language tasks like summarization, paraphrasing, ad copy generation, and more. Even non-technical users will find it easy to use for common text generation and processing tasks. AI21 Studio also exposes application programming interfaces (APIs) for programmatic access from your custom software.
Here are some key differences in features and functionality between AI21 Studio and GPT-3:
AI21's models are called Jurassic-1 and are based on the same autoregressive transformer architecture as GPT-3. Some brief details about AI21's models:
AI21 is more expensive than GPT-4 or GPT-3:
Is AI21 worth it? Let's find out by running the two model families against each other on multiple language tasks.
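A side-by-side comparison like this can be driven by a small harness that sends the same prompt to every model with the temperature fixed at zero. The sketch below is only an illustration under assumptions: the `ModelFn` signature, `run_comparison`, and the stub models are hypothetical names, and real wrappers would call each vendor's API instead of the stubs.

```python
from typing import Callable, Dict

# Hypothetical wrapper signature: (prompt, temperature) -> generated text.
# Real wrappers would call the vendor APIs; the stubs below do not.
ModelFn = Callable[[str, float], str]


def run_comparison(models: Dict[str, ModelFn], prompt: str,
                   temperature: float = 0.0) -> Dict[str, str]:
    """Send the same prompt to every model and collect the outputs by name."""
    return {name: generate(prompt, temperature).strip()
            for name, generate in models.items()}


# Stub models so the sketch runs without network access or API keys.
def stub_a(prompt: str, temperature: float) -> str:
    return f"A saw {len(prompt.split())} words at T={temperature}"


def stub_b(prompt: str, temperature: float) -> str:
    return " B echoes: " + prompt.upper() + " "


outputs = run_comparison({"model-a": stub_a, "model-b": stub_b},
                         "Summarize this passage.")
for name, text in outputs.items():
    print(f"{name}: {text}")
```

Keeping the model call behind a plain function makes it easy to swap in a different provider without touching the comparison logic.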
We evaluate the models on four basic natural language processing and text generation tasks with practical uses for businesses and education:
Zero-shot strict summarization is the ability of a language model to do extractive summarization according to a given set of rules. This is useful in medical and legal fields where extraction is preferred because abstractive summarization may inadvertently change the meaning.
Zero-shot summarization of technical articles is challenging because they contain domain-specific concepts and external knowledge familiar to a specialized audience. Any summary must sound correct to such an audience.
For this experiment, we asked the models to summarize this section from our technical article on the spaCy text processing framework:
The gist here is that spaCy provides an advanced text processing pipeline that's superior to the more common approach of using multiple frameworks with glue code.
The temperature is set to zero for all models to remove any randomness in the predictions. We qualitatively evaluate the summaries as follows:
The main prompt used was "write a strictly extractive summary with at least one proper noun. Do not use more than 10 words." But the GPT-3 models seemed unable to understand "strictly extractive summary" and generated abstractive summaries.
Only an alternative prompt that explained how to do extractive summarization without using that term — "select one sentence from the passage that summarizes it" — came somewhat close. But even then, it wasn't strictly extractive.
Surprisingly, ChatGPT fared badly on the main prompt and became verbose on the alternative prompt.
The j1-grande-instruct model fared quite well on this task. It understood the main prompt and produced a nearly verbatim sentence from the passage. It also understood the alternative prompt perfectly.
However, the non-instruct models produced mostly junk output.
AI21's instruct model does better at extractive summarization than all the GPT-3 models if the text has a single main theme.
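The three rules in the main prompt (strictly extractive, at most 10 words, at least one proper noun) can also be checked mechanically instead of by eye. The following is a rough sketch under assumptions: `check_summary` is a hypothetical helper, the proper nouns are supplied by the caller rather than detected, and the passage is a made-up stand-in, not the section used in the experiment.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks don't affect comparisons."""
    return re.sub(r"\s+", " ", text).strip()


def check_summary(summary: str, passage: str, proper_nouns: List[str],
                  max_words: int = 10) -> Dict[str, bool]:
    """Check a summary against the three rules of the main prompt."""
    s = normalize(summary)
    p = normalize(passage)
    core = s.rstrip(".")  # tolerate a trailing period added by the model
    return {
        # Extractive here means the summary appears verbatim in the passage.
        "extractive": bool(core) and core in p,
        "within_limit": len(s.split()) <= max_words,
        # Assumption: the caller lists the proper nouns that count.
        "has_proper_noun": any(noun in s for noun in proper_nouns),
    }


# Illustrative stand-in passage (not the article's actual text).
passage = ("spaCy provides an advanced text processing pipeline. "
           "Other approaches rely on glue code between frameworks.")

print(check_summary("spaCy provides an advanced text processing pipeline.",
                    passage, ["spaCy"]))
print(check_summary("The library offers a better pipeline than glue code.",
                    passage, ["spaCy"]))
```

A verbatim-substring test is the strictest reading of "extractive"; a looser check, such as token overlap, would accept the nearly verbatim outputs described above.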
Summarizing important sections and extracting information from documents is useful for automated document understanding and verification of contracts, court cases, and other legal documents. For example, we have implemented summarization on many legal use cases like understanding master service agreements and improving legal contracts.
In this experiment, we evaluate whether large language models can generate such information without any few-shot examples. These sections from a contract are supplied to the models:
We instructed the DaVinci model to generate a summary and extract important information. However, it appears that it's only able to do one of them at a time:
Splitting up the instructions into two sentences works better. The summary is abstractive but retains all important information. Interestingly, more information is extracted in response to the standalone prompt (above) and it's different from that for the mixed prompt (below).
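To make the difference between a mixed instruction and split instructions concrete, here is a minimal sketch of how the two prompt variants could be assembled. The helper `build_prompt`, the instruction wording, and the placeholder contract text are all assumptions for illustration, not the prompts used in the experiment.

```python
from typing import List


def build_prompt(document: str, instructions: List[str]) -> str:
    """Append each instruction as its own sentence after the document text."""
    sentences = [i.strip().rstrip(".") + "." for i in instructions if i.strip()]
    return document.strip() + "\n\n" + " ".join(sentences)


# Placeholder standing in for the contract sections.
contract = "<contract sections go here>"

# Mixed: both tasks packed into a single sentence.
mixed = build_prompt(
    contract,
    ["Summarize the sections above and extract the important information"],
)

# Split: one sentence per task.
split = build_prompt(
    contract,
    ["Summarize the sections above",
     "Extract the important information from them"],
)

print(mixed)
print("---")
print(split)
```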
ChatGPT generates a fairly extractive summary, but its named entities are not quite what we want.
Unfortunately, the Jurassic-1 models didn't do well on this task, both in summary generation and information extraction.
GPT-3 is capable of processing legal documents out of the box. However, AI21 models show the following problems:
You can use language models as a copywriter to automatically generate new ads for your products based on their descriptions.
For this task, we start with the following product information as input:
The GPT-3 models did quite well, but Curie and ChatGPT didn't always include the brand name. When we added a new field called "brand" and explicitly instructed the models to mention it, even Curie and ChatGPT started including it:
The AI21 models generated fairly good headlines and followed other instructions like word limits:
For this experiment, we selected this random product from the cross-market recommendations dataset and asked the models to generate an ad body for it:
Good ads are eye-catching and exciting. When we asked GPT-3 to produce such ads with specific language to use, it did well:
We gave the same instructions to the j1-grande-instruct model. While it did OK, it didn't really come up with the kind of excited copy GPT-3 generated, and it didn't follow some instructions either.
Both models did quite well on this task. GPT-3 was effortlessly creative when the prompts asked for it while AI21 remained quite boring throughout.
We also noticed that the AI21 model sometimes duplicated all the results when asked for a particular number of results. If you're planning to automate ad generation, perhaps to distribute ads automatically to Google or Facebook, make sure you run some post-generation checks.
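As a rough idea of what such post-generation checks could look like, the sketch below flags a wrong result count, duplicated ads, ads over a word limit, and ads that omit the brand name. The function `check_ads`, its parameters, and the sample brand "Acme" are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def check_ads(ads: List[str], expected_count: int, max_words: int,
              brand: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(ads) != expected_count:
        problems.append(f"expected {expected_count} ads, got {len(ads)}")
    seen = set()
    for i, ad in enumerate(ads, start=1):
        # Compare case-insensitively and ignore whitespace differences.
        key = " ".join(ad.lower().split())
        if key in seen:
            problems.append(f"ad {i} duplicates an earlier ad")
        seen.add(key)
        if len(ad.split()) > max_words:
            problems.append(f"ad {i} exceeds {max_words} words")
        if brand and brand.lower() not in ad.lower():
            problems.append(f"ad {i} does not mention {brand}")
    return problems


# Made-up outputs for a made-up brand.
ads = [
    "Acme boots keep your feet dry",
    "acme boots keep your  feet dry",
    "Waterproof boots for every trail",
]
for problem in check_ads(ads, expected_count=3, max_words=8, brand="Acme"):
    print(problem)
```

Returning a list of problems instead of a single pass/fail flag lets an automated pipeline decide whether to regenerate, drop individual ads, or escalate to a human reviewer.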
Large language models are capable of generating proofs for simple math theorems. In this experiment, we examine how AI21 and GPT-3 perform on proof by cases using few-shot prompting.
First, we provided an example proof from number theory. The proof uses slightly complex concepts and symbols, such as modulo, equivalence, and powers. The models are then asked to prove that the absolute value of any number outside (-1, 1) is at least one.
Despite just a single few-shot example, the DaVinci model is able to follow the familiar pattern of proof by cases and prove the theorem without any complications.
The j1-grande-instruct model demonstrates some quirks in its proof:
While its method produced an acceptable deduction, it's not exactly what was asked.
Can AI21 perhaps avoid its quirks with more few-shot examples? To test that, we gave it some few-shot examples of proof by cases:
Unfortunately, the proofs it generated every time were all garbage, regardless of the temperature value:
The AI21 model isn't as good as the GPT-3 model on math proof generation. More few-shot examples may help with the output format but it doesn't look like its core math reasoning is capable enough as of now.
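For readers who want to try adding more few-shot examples, here is a minimal sketch of how a few-shot proof prompt could be assembled. The helper `few_shot_prompt`, the "Theorem:" and "Proof:" labels, and the example theorem are assumptions for illustration. The example is not the number theory proof used in the experiment, and the target statement is one reading of the theorem described above.

```python
from typing import List, Tuple


def few_shot_prompt(examples: List[Tuple[str, str]], theorem: str) -> str:
    """Lay out solved (theorem, proof) pairs, then the unsolved theorem."""
    parts = [f"Theorem: {statement}\nProof: {proof}"
             for statement, proof in examples]
    # End with an open "Proof:" so the model continues from there.
    parts.append(f"Theorem: {theorem}\nProof:")
    return "\n\n".join(parts)


# Illustrative proof-by-cases example (an assumption, not from the article).
examples = [(
    "For every integer n, n^2 + n is even.",
    "Case 1: n is even. Then n = 2k, so n^2 + n = 2(2k^2 + k), which is even. "
    "Case 2: n is odd. Then n = 2k + 1, so n^2 + n = (2k + 1)(2k + 2) "
    "= 2(2k + 1)(k + 1), which is even. In both cases n^2 + n is even.",
)]

prompt = few_shot_prompt(
    examples,
    "If x is a real number with x <= -1 or x >= 1, then |x| >= 1.",
)
print(prompt)
```

Using the same labels for every example gives the model a consistent output format to imitate, which is the part that extra few-shot examples are most likely to improve.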
AI21's web interface is definitely more feature-rich, making it suitable for non-technical users. It did quite well on general summarization tasks.
But when more creativity was required (as for ad copy) or more domain-specific behavior was asked for (as with legal documents), it didn't do so well out of the box. So we can conclude that GPT-3 is overall more versatile than AI21 as of now.
Large language models are getting more capable all the time. At the time of this writing, a new 540 billion parameter model capable of using language and robotic sensors together promises new breakthroughs. This is an incredible time for businesses to come on board this ecosystem and use these versatile models for their workflows. Contact us to find out how you can start using these models to make your work easier.