Bard is Google's alternative to OpenAI's GPT-4 and ChatGPT services. But how well does it work for business use cases? Can you use it for your applications? Does it provide application programming interfaces (APIs)? And how does it fare against GPT-4 on language tasks?
Find out all that and more as we pit Bard vs. GPT-4 in this analysis.
Bard is Google's general-purpose chatbot powered by large language models (LLMs) and meant for all users. Like OpenAI's ChatGPT, it's a web application where users can enter complex prompts or questions on any subject and get back helpful results.
PaLM 2 is the LLM powering Bard. We explore some of their capabilities and differences below.
Both Bard and PaLM can search for images and interpret images you upload. This is a key difference from ChatGPT and GPT-4, which do not currently support images as inputs or outputs.
Bard correctly identified most of the food items and ingredients in this photo we uploaded:
Compare Bard's results to the photo's description: "Top down view of a delicious grilled chicken fajita family mexican food dinner with peppers, onions, grated cheese, flour tortillas, limes and salsa on a table."
You can't upload photos to the PaLM API, but you can include links to images in the prompts. However, PaLM's results are not as accurate or specific as Bard's, and are sometimes outright hallucinatory (it's probably just using the URL's text rather than the image at the URL):
Bard's and PaLM's results sometimes differ significantly. The differences suggest that the PaLM instance behind Bard is being actively updated with recent information while the PaLM instance available to you through the APIs is frozen with information from 2021 or 2022.
For example, asking Bard for the latest news from the EU (in July 2023) returns accurate current affairs:
In contrast, the PaLM API (accessed via Google Cloud's generative AI studio) responds with accurate but old news from two years ago:
A potential problem you should be aware of is hallucination. Neither Bard nor PaLM is immune to hallucinations, which may surprise some users since Bard is often used like a search engine and sometimes even provides source links. In the example below, both were asked to list the differences between two phones, one real and the other fake. We created this very cool doc that outlines some prompting frameworks that help reduce hallucinations with LLMs (check it out!).
Both invented features for the latter instead of warning that the phone didn't exist or they didn't know anything about it:
For the rest of the article, we'll use Bard and PaLM interchangeably. Let's see how Bard performs on common language tasks.
In the forthcoming sections, we evaluate PaLM against GPT-4 on common production language use cases like:
We first evaluate select examples manually and qualitatively.
After that, we run both through an extensive LLM benchmarking suite that evaluates the results using another LLM. Combined, these should give you a pretty good idea of what to expect from these two LLMs in production.
In this test, we asked PaLM and GPT-4 to accurately create extractive summaries for this snippet from a legal contract with the following prompt:
"Pick two key sentences from each section below:"
Bard came up with this:
PaLM API produced this:
And GPT-4 generated this summary:
Both LLMs created strictly extractive summaries in response to this prompt. In our experience, prompts like "pick sentences" or "select sentences" have worked better than instructions like "create an extractive summary."
Quality-wise, Bard and PaLM did a better job here. Their selection of key sentences, as well as their presentation in sections, is much better than GPT-4's. The reasoning section Bard provided is especially valuable, as that information is critical for evaluating outputs and iterating on the model's behavior. Bard and PaLM also chose sentences that fit the key ideas a bit better than GPT-4 did. While we could have provided a topic-based extractive summarization prompt (like the one we used in dialogue summarization), it's interesting to evaluate the outputs of a loosely specified prompt to see each model's ability to produce granular results on its own with little guidance.
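For reference, here is a minimal sketch of how a prompt like this can be sent programmatically. The actual chat-completion call is shown in comments because it requires an API key; `build_extractive_prompt` is our own helper for illustration, not part of any SDK.

```python
# A minimal sketch of sending the extractive-summarization prompt to an LLM
# API. The `build_extractive_prompt` helper is our own illustration.

def build_extractive_prompt(contract_text: str) -> str:
    """Prepend the instruction wording that worked best in our tests."""
    return "Pick two key sentences from each section below:\n\n" + contract_text

# With the openai Python package (v0.x, mid-2023) the call would look like:
#   import openai
#   response = openai.ChatCompletion.create(
#       model="gpt-4",
#       messages=[{"role": "user",
#                  "content": build_extractive_prompt(contract_text)}],
#   )
#   summary = response["choices"][0]["message"]["content"]
```

The same prompt string can be sent to the PaLM API unchanged, which is what makes side-by-side comparisons like the one above straightforward.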
In this section, we try out both LLMs for ad copy generation and similar use cases.
In the example below, we instructed both LLMs with this prompt:
"Help me construct a catchy, yet scientifically accurate, headline for an article on the latest discovery in renewable bio-energy, while carefully handling the ethical dilemmas surrounding bio-energy sources. Propose 4 options."
Although this specific example asks for headlines, the same prompts are ideal for generating ad copy too.
The LLMs generated the results shown below:
We observe that:
We asked both LLMs to generate appealing ad titles for a product with the following product information:
The prompt given was:
"Write five jaw-dropping ad headlines based on the above product information that makes you want to buy this product."
Bard generated the following titles:
GPT-4 came up with these:
We notice that:
GPT-4's headlines are much better. They show a deeper understanding of the product details and the product's value to users. Phrases like "salon-worthy hair" show the model understands how to portray the product's relevance to customers' interests and draw them in.
For production use cases, qualitative evaluation of ad hoc examples is not the best approach. It's neither scalable nor repeatable.
A better system is to maintain a suite of test prompts and contexts that are relevant to your business use case. Run them regularly against your LLMs under test. Then have another LLM compare the results from your LLM against those from a baseline LLM, or against manually curated results, as shown below. This makes the evaluation automated, repeatable, and scalable.
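The evaluation loop described above can be sketched in a few lines. This is a simplified illustration, assuming a generic `complete(model, prompt)` wrapper around whichever API you use; the judging-prompt wording is our own, not MT-Bench's.

```python
# Sketch of an automated LLM-as-judge evaluation loop. `complete` is an
# assumed wrapper: complete(model_name, prompt) -> response text.

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers below
to the same prompt and reply with exactly one of: A, B, or TIE.

[Prompt]
{prompt}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def evaluate_suite(test_prompts, model_under_test, baseline_model,
                   judge_model, complete):
    """Run every test prompt through both models and tally the judge's verdicts."""
    scores = {"A": 0, "B": 0, "TIE": 0}
    for prompt in test_prompts:
        answer_a = complete(model_under_test, prompt)
        answer_b = complete(baseline_model, prompt)
        verdict = complete(judge_model, JUDGE_TEMPLATE.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)).strip()
        scores[verdict if verdict in scores else "TIE"] += 1
    return scores
```

In practice you would also randomize which model's answer appears as A versus B, since LLM judges are known to show position bias.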
The FastChat MT-Bench web application is built for that very purpose. It comes with a suite of 80 prompts under eight categories, like these examples:
You can also add your own custom prompts and contexts.
It submits each prompt to the selected LLMs and gets their results:
The results are then evaluated by another LLM (in our case, GPT-4 itself, but it could also be a third LLM like Anthropic's Claude). It uses a set of judging prompts like these:
The evaluation results for a prompt are shown below:
The final results, as evaluated by GPT-4, are as follows:
In the next section, we bring out some technical differences between PaLM and GPT-4 that you must keep in mind if you decide to use them in production.
Both PaLM and GPT-4 are causal LLMs with decoder-only transformer architectures. We go into some of their details below.
Some of the model details that are publicly available are shown below:
Surprisingly, according to some sources, PaLM 2 manages to outperform its predecessor PaLM with only 340 billion parameters against the latter's 540 billion.
There's no direct API for Bard. However, the PaLM 2 LLM is accessible through two routes:
The PaLM API is part of the Google Generative AI suite. It's currently available as a beta preview on requesting access but isn't production-ready yet. The suite also includes a simple web application called MakerSuite to manage your prompts and few-shot examples.
Vertex AI is a paid service that's part of Google Cloud and is available to you if you have a Google Cloud subscription. The PaLM LLM is available behind a Vertex AI API that is production-ready. Note that though the underlying LLM is the same, this API does not have the same structure and semantics as the PaLM API. Vertex AI also provides a Generative AI Studio playground where you can test your prompts and responses.
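To make the first route concrete, here is a sketch of calling PaLM 2 through the `google.generativeai` package (the beta PaLM API). The model name and parameters reflect the beta documentation at the time of writing; verify them against the current docs, and note that the actual call is commented out because it requires approved access and an API key. `build_palm_request` is our own helper.

```python
# Sketch of a PaLM API request via the google.generativeai package.
# `build_palm_request` is our own helper that just collects keyword arguments.

def build_palm_request(prompt: str, temperature: float = 0.2) -> dict:
    """Collect the keyword arguments for palm.generate_text()."""
    return {
        "model": "models/text-bison-001",  # PaLM 2 text model in the beta API
        "prompt": prompt,
        "temperature": temperature,
        "max_output_tokens": 256,
    }

# With access granted and an API key configured, the call looks like:
#   import google.generativeai as palm
#   palm.configure(api_key="YOUR_API_KEY")
#   response = palm.generate_text(**build_palm_request(
#       "Pick two key sentences from each section below:\n..."))
#   print(response.result)
```

The Vertex AI route exposes the same underlying model but through Google Cloud's own client libraries and authentication, so code written against one route does not carry over to the other unchanged.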
As of August 2023, neither GPT-4 nor GPT-3.5 are fine-tunable. You can only fine-tune the legacy GPT-3 base models or GPT-3 Instruct models. GPT-3 fine-tuning is a simple workflow. You just upload a training file with pairs of prompts and completions, and call the fine-tuning API.
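The GPT-3 workflow above amounts to writing a JSONL file of prompt–completion pairs and submitting it. A sketch, with placeholder training pairs (the CLI command in the comment is the legacy fine-tuning invocation, shown for orientation):

```python
# Sketch of preparing a GPT-3 fine-tuning file: one JSON object per line with
# "prompt" and "completion" keys, the format the legacy fine-tuning API
# expects. The training pairs here are placeholders.

import json

def write_finetune_file(pairs, path="train.jsonl"):
    """Write (prompt, completion) pairs as the JSONL the API consumes."""
    with open(path, "w") as f:
        for prompt, completion in pairs:
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

write_finetune_file([
    ("Pick two key sentences from each section below:\n<contract text>",
     "<two key sentences per section>"),
])

# Then upload the file and start the job, e.g. with the OpenAI CLI:
#   openai api fine_tunes.create -t train.jsonl -m davinci
```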
PaLM fine-tuning isn't available via PaLM API, only via the Vertex AI fine-tuning API for enterprise users. But it's more complicated in terms of storage and authentication:
Overall, GPT-3 seems easier to fine-tune than PaLM.
Fortunately, both providers offer fine-tunable models through their APIs. Contact us to help you integrate LLMs that are carefully fine-tuned for your specific business needs.