Bard is Google's alternative to OpenAI's GPT-4 and ChatGPT services. But how well does it work for business use cases? Can you use it for your applications? Does it provide application programming interfaces (APIs)? And how does it fare against GPT-4 on language tasks?
Find out all that and more as we pit Bard vs. GPT-4 in this analysis.
Bard is Google's general-purpose chatbot powered by large language models (LLMs) and meant for all users. Like OpenAI's ChatGPT, it's a web application where users can enter complex prompts or questions on any subject and get back helpful results.
PaLM 2 is the LLM powering Bard. We explore some of their capabilities and differences below.
Both Bard and PaLM can search for images and interpret images you upload. This is a key difference from ChatGPT and GPT-4, which do not currently support images as inputs or outputs.
Bard correctly identified most of the food items and ingredients in this photo we uploaded:
Compare Bard's results to the photo's description: "Top down view of a delicious grilled chicken fajita family mexican food dinner with peppers, onions, grated cheese, flour tortillas, limes and salsa on a table."
You can't upload photos to the PaLM API, but you can include links to images in the prompts. However, PaLM's results are not as accurate or specific as Bard's, and are sometimes outright hallucinatory (it's probably just using the URL's text rather than the image at the URL):
Bard's and PaLM's results sometimes differ significantly. The differences suggest that the PaLM instance behind Bard is being actively updated with recent information while the PaLM instance available to you through the APIs is frozen with information from 2021 or 2022.
For example, asking Bard for the latest news from the EU (in July 2023) returns accurate current affairs:
In contrast, the PaLM API (accessed via Google Cloud's generative AI studio) responds with accurate but old news from two years ago:
A potential problem you should be aware of is hallucination. Neither Bard nor PaLM is immune to hallucinations, which may surprise some users since Bard is often used like a search engine and sometimes even provides source links. In the example below, both were asked to list the differences between two phones, one real and the other fake. We created this very cool doc that outlines some prompting frameworks that help reduce hallucinations with LLMs (check it out!).
Both invented features for the latter instead of warning that the phone didn't exist or they didn't know anything about it:
For the rest of the article, we'll use Bard and PaLM interchangeably. Let's see how Bard performs on common language tasks.
In the forthcoming sections, we evaluate PaLM against GPT-4 on common production language use cases like:
We first evaluate select examples manually and qualitatively.
After that, we run both through an extensive LLM benchmarking suite that evaluates the results using another LLM. Combined, these should give you a pretty good idea of what to expect from these two LLMs in production.
In this test, we asked PaLM and GPT-4 to accurately create extractive summaries for this snippet from a legal contract with the following prompt:
"Pick two key sentences from each section below:"
Bard came up with this:
PaLM API produced this:
And GPT-4 generated this summary:
Both LLMs created strictly extractive summaries in response to this prompt. In our experience, prompts like "pick sentences" or "select sentences" have worked better than instructions like "create an extractive summary."
Quality-wise, Bard and PaLM did a better job here. Their selection of key sentences, as well as their presentation in sections, is much better than GPT-4's. The reasoning section Bard provided is especially valuable, as that information is critical for evaluating outputs and iterating on the model's behavior. Bard and PaLM also chose sentences that fit the key ideas a bit better than GPT-4 did. While we could have provided a topic-based extractive summarization prompt (like the one we used in dialogue summarization), it's interesting to evaluate the outputs of a loosely specified prompt to see each model's ability to produce granular results on its own with little guidance.
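For reference, here is a minimal sketch of how a prompt like this can be sent programmatically. The actual chat-completion call is shown in comments because it requires an API key; `build_extractive_prompt` is our own helper for illustration, not part of any SDK.

```python
# A minimal sketch of sending the extractive-summarization prompt to an LLM
# API. The `build_extractive_prompt` helper is our own illustration.

def build_extractive_prompt(contract_text: str) -> str:
    """Prepend the instruction wording that worked best in our tests."""
    return "Pick two key sentences from each section below:\n\n" + contract_text

# With the openai Python package (v0.x, mid-2023) the call would look like:
#   import openai
#   response = openai.ChatCompletion.create(
#       model="gpt-4",
#       messages=[{"role": "user",
#                  "content": build_extractive_prompt(contract_text)}],
#   )
#   summary = response["choices"][0]["message"]["content"]
```

The same prompt string can be sent to the PaLM API unchanged, which is what makes side-by-side comparisons like the one above straightforward.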
In this section, we try out both LLMs for ad copy generation and similar use cases.
In the example below, we instructed both LLMs with this prompt:
"Help me construct a catchy, yet scientifically accurate, headline for an article on the latest discovery in renewable bio-energy, while carefully handling the ethical dilemmas surrounding bio-energy sources. Propose 4 options."
Although this specific example asks for headlines, the same prompts are ideal for generating ad copy too.
The LLMs generated the results shown below:
We observe that:
We asked both LLMs to generate appealing ad titles for a product with the following product information:
The prompt given was:
"Write five jaw-dropping ad headlines based on the above product information that makes you want to buy this product."
Bard generated the following titles:
GPT-4 came up with these:
We notice that:
GPT-4's headlines are much better. They show a deeper understanding of the product details and the product's value to users. Phrases like "salon-worthy hair" show the model understands how to portray the product's relevance to customers' interests and draw them in.
For production use cases, qualitative evaluation of ad hoc examples is not the best approach. It's neither scalable nor repeatable.
A better system is to maintain a suite of test prompts and contexts that are relevant to your business use case. Run them regularly against your LLMs under test. Then have another LLM compare the results from your LLM against those from a baseline LLM, or against manually curated results, as shown below. This makes the evaluation automated, repeatable, and scalable.
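The evaluation loop described above can be sketched in a few lines. This is a simplified illustration, assuming a generic `complete(model, prompt)` wrapper around whichever API you use; the judging-prompt wording is our own, not MT-Bench's.

```python
# Sketch of an automated LLM-as-judge evaluation loop. `complete` is an
# assumed wrapper: complete(model_name, prompt) -> response text.

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers below
to the same prompt and reply with exactly one of: A, B, or TIE.

[Prompt]
{prompt}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def evaluate_suite(test_prompts, model_under_test, baseline_model,
                   judge_model, complete):
    """Run every test prompt through both models and tally the judge's verdicts."""
    scores = {"A": 0, "B": 0, "TIE": 0}
    for prompt in test_prompts:
        answer_a = complete(model_under_test, prompt)
        answer_b = complete(baseline_model, prompt)
        verdict = complete(judge_model, JUDGE_TEMPLATE.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)).strip()
        scores[verdict if verdict in scores else "TIE"] += 1
    return scores
```

In practice you would also randomize which model's answer appears as A versus B, since LLM judges are known to show position bias.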
The FastChat MT-Bench web application is built for that very purpose. It comes with a suite of 80 prompts under eight categories, like these examples:
You can also add your own custom prompts and contexts.
It submits each prompt to the selected LLMs and gets their results:
The results are then evaluated by another LLM (in our case, GPT-4 itself, but it could also be a third LLM like Anthropic's Claude). It uses a set of judging prompts like these:
The evaluation results for a prompt are shown below:
The final results, as evaluated by GPT-4, are as follows:
In the next section, we bring out some technical differences between PaLM and GPT-4 that you must keep in mind if you decide to use them in production.
Both PaLM and GPT-4 are causal LLMs with decoder-only transformer architectures. We go into some of their details below.
Some of the model details that are publicly available are shown below:
Surprisingly, according to some sources, PaLM 2 manages to outperform its predecessor PaLM with only 340 billion parameters against the latter's 540 billion.
There's no direct API for Bard. However, the PaLM 2 LLM is accessible through two routes:
The PaLM API is part of the Google Generative AI suite. It's currently available as a beta preview on requesting access but isn't production-ready yet. The suite also includes a simple web application called MakerSuite to manage your prompts and few-shot examples.
Vertex AI is a paid service that's part of Google Cloud and is available to you if you have a Google Cloud subscription. The PaLM LLM is available behind a Vertex AI API that is production-ready. Note that though the underlying LLM is the same, this API does not have the same structure and semantics as the PaLM API. Vertex AI also provides a Generative AI Studio playground where you can test your prompts and responses.
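To make the first route concrete, here is a sketch of calling PaLM 2 through the `google.generativeai` package (the beta PaLM API). The model name and parameters reflect the beta documentation at the time of writing; verify them against the current docs, and note that the actual call is commented out because it requires approved access and an API key. `build_palm_request` is our own helper.

```python
# Sketch of a PaLM API request via the google.generativeai package.
# `build_palm_request` is our own helper that just collects keyword arguments.

def build_palm_request(prompt: str, temperature: float = 0.2) -> dict:
    """Collect the keyword arguments for palm.generate_text()."""
    return {
        "model": "models/text-bison-001",  # PaLM 2 text model in the beta API
        "prompt": prompt,
        "temperature": temperature,
        "max_output_tokens": 256,
    }

# With access granted and an API key configured, the call looks like:
#   import google.generativeai as palm
#   palm.configure(api_key="YOUR_API_KEY")
#   response = palm.generate_text(**build_palm_request(
#       "Pick two key sentences from each section below:\n..."))
#   print(response.result)
```

The Vertex AI route exposes the same underlying model but through Google Cloud's own client libraries and authentication, so code written against one route does not carry over to the other unchanged.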
As of August 2023, neither GPT-4 nor GPT-3.5 are fine-tunable. You can only fine-tune the legacy GPT-3 base models or GPT-3 Instruct models. GPT-3 fine-tuning is a simple workflow. You just upload a training file with pairs of prompts and completions, and call the fine-tuning API.
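The GPT-3 workflow above amounts to writing a JSONL file of prompt–completion pairs and submitting it. A sketch, with placeholder training pairs (the CLI command in the comment is the legacy fine-tuning invocation, shown for orientation):

```python
# Sketch of preparing a GPT-3 fine-tuning file: one JSON object per line with
# "prompt" and "completion" keys, the format the legacy fine-tuning API
# expects. The training pairs here are placeholders.

import json

def write_finetune_file(pairs, path="train.jsonl"):
    """Write (prompt, completion) pairs as the JSONL the API consumes."""
    with open(path, "w") as f:
        for prompt, completion in pairs:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

write_finetune_file([
    ("Pick two key sentences from each section below:\n<contract text>",
     "<two key sentences per section>"),
])

# Then upload the file and start the job, e.g. with the OpenAI CLI:
#   openai api fine_tunes.create -t train.jsonl -m davinci
```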
PaLM fine-tuning isn't available via PaLM API, only via the Vertex AI fine-tuning API for enterprise users. But it's more complicated in terms of storage and authentication:
Overall, GPT-3 seems easier to fine-tune than PaLM.
Fortunately, both providers offer fine-tunable models through their APIs. Contact us to help you integrate LLMs that are carefully fine-tuned for your specific business needs.