Large language models (LLMs) sometimes have trouble processing complex questions or elaborate instructions that involve multiple conditional sentences, knotty logic, intricate associations between named entities, or math calculations.
You can drastically improve the accuracy of such tasks using a technique called chain-of-thought (CoT) prompting.
Chain-of-thought prompting is a prompt engineering technique to make LLMs answer complex questions or follow elaborate instructions by first generating a sequence of intermediate reasoning steps in natural language.
Researchers have shown that using a few special prompts, and without any additional fine-tuning, you can make LLMs accurately follow elaborate instructions and work out answers to complicated questions by simply asking them to reason step by step toward the right answers.
In the following sections, we survey some fundamental research in CoT reasoning.
Few-shot CoT was proposed first and zero-shot CoT was demonstrated a few months later. But because zero-shot CoT is easier and you can use it for any task in any domain, we explore it first.
In their 2022 paper, Large Language Models are Zero-Shot Reasoners, Kojima et al. explored a zero-shot CoT approach using appropriate prompts.
To force CoT reasoning, they proposed a simple, two-step prompting sequence:
1. Reasoning extraction: Append a trigger phrase like "Let's think step by step." to the question and let the LLM generate its intermediate reasoning.
2. Answer extraction: Append the generated reasoning and a second trigger like "Therefore, the answer is" so the LLM emits only the final answer.
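To make the two steps concrete, here's a minimal sketch in Python; the generate function is a placeholder for whatever LLM completion call you use, not something from the paper.

```python
# Minimal sketch of the two-step zero-shot CoT sequence.
# `generate(prompt)` stands in for your LLM completion call.

def zero_shot_cot(question: str, generate) -> str:
    # Step 1: reasoning extraction. The trigger phrase nudges the LLM to
    # write out its intermediate reasoning steps.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Step 2: answer extraction. Feed the reasoning back with a second
    # trigger so the LLM emits only the final answer.
    answer_prompt = f"{reasoning_prompt}\n{reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)
```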
The two-step sequence doesn't always yield usable answers. Sometimes, the LLM hallucinates multiple results. When asked to select an answer from multiple choices, it may select more than one.
To overcome such problems, the paper employs an additional answer-cleansing step of applying task- or format-specific manual rules to the generated answers.
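What such answer cleansing can look like is sketched below for arithmetic and multiple-choice answers; these rules are illustrative assumptions, not the paper's exact code.

```python
import re

# Illustrative task-specific cleansing rules (assumed, not from the paper).
def cleanse_arithmetic_answer(answer_text: str):
    # Keep only the first number that appears in the generated answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text.replace(",", ""))
    return numbers[0] if numbers else None

def cleanse_choice_answer(answer_text: str):
    # For multiple-choice questions, keep only the first option letter.
    match = re.search(r"\b([A-E])\b", answer_text)
    return match.group(1) if match else None
```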
In the section on improving CoT prompting, we explore how you can wield more control over malformed answers.
Math problems can be quite confusing for LLMs.
Here, standard GPT-3 gets an arithmetic problem wrong:
With zero-shot CoT reasoning, the LLM reasons out the steps correctly:
The example below shows a common-sense question regarding the real world:
While the answer isn't technically wrong, it's not the best among the available options.
With zero-shot CoT, the LLM gets it right:
Zero-shot CoT outperforms the baseline LLM on several tasks, especially arithmetic and symbolic reasoning. Note, however, that it's not a panacea: on some tasks, performance may actually drop.
CoT prompting was first proposed by Wei et al. in their paper, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Their approach used few-shot prompt engineering for in-context learning. In their prompts, they included several pairs of questions and answers with reasoning chains as examples for the LLM. The examples guided the LLM to use reasoning in its generated answers.
In the sections below, we explore few-shot CoT in action.
For few-shot CoT on math problems, the researchers supplied examples like the ones below:
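To make the structure concrete, here's a sketch of how such a few-shot CoT prompt can be assembled. The tennis-ball demonstration is the widely cited exemplar from the paper; the helper function is our own illustration.

```python
# Each demonstration pairs a question with a worked-out reasoning chain
# that ends in the final answer.
MATH_DEMOS = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.",
    ),
]

def few_shot_cot_prompt(question: str, demos=MATH_DEMOS) -> str:
    # Concatenate the demonstrations, then append the new question.
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```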
The test problems with CoT reasoning are shown below:
For logical and common-sense reasoning, few-shot examples like these were provided in the prompt:
The paper demonstrated that few-shot CoT reasoning outperformed baseline performance on several math reasoning datasets:
The results below show improved performance of few-shot CoT on logical and common-sense questions:
Are LLMs really reasoning? Does CoT prove that LLMs are on par with sentient intelligence? In this section, we try to develop an intuition for why CoT works. Doing so helps develop insights into adapting these methods for custom workflows.
It's useful to start with an understanding of how LLMs work. You can think of an LLM as a black box that produces one text token at each step based on a certain probability distribution.
The factors that influence the token probabilities at each step include:
- the patterns the model learned from its training data, encoded in its weights
- all the tokens in the prompt
- all the tokens the model has generated so far in its response
At each step, all of these combine to yield an output probability for every token in the vocabulary. The LLM then picks one token, either greedily taking the most probable one or sampling from the distribution, and emits it. The whole process then repeats for the next token, with the just-emitted token appended to the context.
In-context learning happens because each step's probabilities are conditioned on all the tokens that came before it, including everything in the prompt.
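The toy decoding loop below illustrates this mechanism; model_logits stands in for a real LLM's forward pass, and everything else is ordinary greedy decoding.

```python
import math

# Toy autoregressive decoding: the context (prompt plus tokens generated so
# far) determines a probability distribution over the vocabulary, one token
# is chosen, appended, and the loop repeats.

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(model_logits, prompt_tokens, vocab, max_new_tokens=20):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(model_logits(context))
        # Greedy decoding picks the most probable token; sampling instead
        # draws from `probs`. Either way the choice is conditioned on the
        # whole context, which is what enables in-context learning.
        next_token = vocab[max(range(len(probs)), key=probs.__getitem__)]
        context.append(next_token)
    return context
```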
Does the training provided for instruction-following, using reinforcement learning from human feedback (RLHF) or supervised fine-tuning, result in CoT capabilities?
We can rule this out straight away: both zero-shot and few-shot CoT reasoning were demonstrated on standard LLMs like GPT-3, not just their instruction-following variants.
However, it's certainly possible that instruction-following improves the quality of CoT reasoning.
Another intriguing phenomenon, reported by both Wei et al. and Kojima et al., is that good CoT reasoning apparently emerges only in large LLMs above roughly 100 billion parameters, and CoT prompting actually hurts the performance of smaller LLMs.
Though Wei et al. leave the question open, they do suggest that pretraining data may be one of the contributing factors.
Combining that with how LLMs generate tokens, one possibility for the emergence of reasoning only in large LLMs is that scale itself has nothing to do with it. It may simply be that those larger LLMs also happened to be trained on web or textbook datasets that include step-by-step reasoning.
So LLMs may not necessarily be reasoning at all. When they receive a step-by-step trigger instruction, token probabilities get steered toward reproducing the kind of reasoning they saw during training. They may simply be imitating reasoning by generating reasoning-like token sequences learned from those datasets.
We can test this hypothesis by training open-source LLMs like LLaMA, DollyV2, or GPT-J on reasoning datasets and testing them with CoT prompts. That's one approach for deliberately adding CoT capabilities to custom self-hosted LLMs for businesses.
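A rough sketch of that experiment, assuming the Hugging Face transformers and datasets libraries, the GSM8K dataset, and a GPT-J checkpoint, might look like this; it's an outline under those assumptions, not a tuned training recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# GSM8K answers already contain worked-out, step-by-step solutions, which is
# exactly the kind of reasoning text we want the model to see.
model_name = "EleutherAI/gpt-j-6b"  # assumed checkpoint; swap in any causal LM you host
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

gsm8k = load_dataset("gsm8k", "main", split="train")

def tokenize(example):
    text = (f"Q: {example['question']}\n"
            f"A: Let's think step by step.\n{example['answer']}")
    return tokenizer(text, truncation=True, max_length=512)

train_data = gsm8k.map(tokenize, remove_columns=gsm8k.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cot-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # afterward, probe the model with zero-shot CoT prompts
```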
We saw earlier that the key step in zero-shot CoT is the trigger prompt, "Let's think step by step." The problem with it is that it's hand-crafted. Kojima et al. tested a few hand-crafted prompts, observed their metrics on different datasets, and selected that particular prompt because it gave the best results.
Intuition tells us that better CoT trigger prompts are probably out there: the space of possible prompts is massive, and hand-crafting explores only a tiny corner of it. Automatic Prompt Engineer (APE) by Zhou et al. is designed to explore that space systematically and discover better CoT prompts automatically. For this, it again uses the power of LLMs, as explained below.
APE's automated discovery of CoT trigger prompts goes like this:
For the dataset, they selected example questions and generated reasoning chains using regular zero-shot CoT.
To generate the trigger prompts, they used the following prompt template for the LLM:
Each candidate trigger prompt was scored by the likelihood of the LLM generating the target reasoning chain when that candidate was appended to the question, and the highest-scoring candidate was selected.
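In code, the search loop reduces to something like the sketch below. Both propose_prompts and log_likelihood are placeholder LLM calls we assume, not APE's actual implementation: the first asks an LLM to suggest candidate trigger instructions, the second scores how likely the target reasoning chain is when a candidate is appended to the question.

```python
def ape_search(questions, reasoning_chains, propose_prompts, log_likelihood,
               n_candidates=20):
    # Ask an LLM to propose candidate trigger prompts from a few examples.
    candidates = propose_prompts(questions, reasoning_chains, n_candidates)

    def score(trigger):
        # Sum the log-likelihood of each reference reasoning chain when the
        # candidate trigger is placed after the question.
        return sum(
            log_likelihood(prompt=f"Q: {q}\nA: {trigger}", target=chain)
            for q, chain in zip(questions, reasoning_chains))

    # Keep the candidate that makes the reference chains most likely.
    return max(candidates, key=score)
```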
APE discovered that the best scoring prompt was: "Let’s work this out in a step by step way to be sure we have the right answer."
APE achieves on-par or better accuracy on multiple instruction induction datasets:
The table below shows APE's discovered trigger prompts for various tasks:
A major inconvenience and time drain in few-shot CoT is its need for task-specific example pairs of questions and reasoned answers. To streamline that, Zhang et al. proposed a technique to autogenerate the examples in their paper, Automatic Chain of Thought Prompting in Large Language Models.
Their approach combines zero-shot CoT and clustering as explained next.
First, collect a dataset of problems or questions suitable for the task. For example, use the GSM8K for math problems.
A naive approach to selecting examples is to start with a test question and use cosine similarity to retrieve other questions similar to it, then generate reasoning chains for those retrieved questions using zero-shot CoT.
The problem is that if zero-shot CoT yields a faulty reasoning chain for the test question, it's likely to do the same for most of the similar retrieved questions. Such wrong demonstrations of reasoning mislead the LLM into faulty reasoning.
To avoid that, we must ensure reasonable diversity in the selected example questions. A simple way to do this is:
- Embed all the questions (for example, with Sentence-BERT) and partition them into k clusters.
- From each cluster, pick a representative question that satisfies the selection criteria below and generate its reasoning chain with zero-shot CoT.
- Use the resulting k question-and-reasoning pairs as the few-shot demonstrations.
The selection criteria for the questions are simple rules, like:
- The question should not be too long (the paper caps it at 60 tokens).
- Its zero-shot CoT reasoning chain should not have too many reasoning steps (the paper caps it at five).
This process yields a diverse set of example questions along with their reasoning chains and final answers.
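Here's a rough sketch of that pipeline, assuming the sentence-transformers and scikit-learn libraries; generate_reasoning stands in for a zero-shot CoT call like the one sketched earlier, and the length rule is a simplification of the paper's criteria.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def build_demonstrations(questions, generate_reasoning, k=8,
                         max_question_words=60):
    # Embed the questions and group them into k clusters for diversity.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(questions)
    clustering = KMeans(n_clusters=k, n_init=10).fit(embeddings)

    demos = []
    for cluster_id in range(k):
        idxs = np.where(clustering.labels_ == cluster_id)[0]
        centroid = clustering.cluster_centers_[cluster_id]
        # Walk the cluster's questions from nearest the centroid outward and
        # keep the first one that passes the simple length rule.
        for i in sorted(idxs, key=lambda i: np.linalg.norm(embeddings[i] - centroid)):
            if len(questions[i].split()) <= max_question_words:
                demos.append((questions[i], generate_reasoning(questions[i])))
                break
    return demos
```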
The results below show that accuracy with autogenerated CoT examples closely matches or exceeds those of hand-crafted CoT examples but with significant time savings:
The illustrations below show some autogenerated examples for math problems:
Here are some autogenerated examples for common-sense questions:
Let's see CoT reasoning in practice.
We demonstrate some CoT reasoning tasks in common business activities.
We ask an LLM to generate code to split text into sentences. This is non-trivial because sentences can have decimal numbers, periods inside direct quotations, parentheses, and so on.
A basic prompt generates the following code:
By asking the LLM to reason about it step by step, the generated code is improved:
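For illustration, an improved splitter along those lines might look like the sketch below: it avoids breaking on decimal numbers and common abbreviations and keeps closing quotation marks attached to their sentence. Treat it as a representative sample, not the exact code the LLM generated.

```python
import re

# Abbreviations whose trailing period should not end a sentence.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "st", "vs", "etc"}

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+", text):
        end = match.end()
        next_char = text[end:end + 1]
        words = text[start:end].rstrip(".!?").split()
        last_word = words[-1].lower() if words else ""
        if next_char.isdigit() or last_word in ABBREVIATIONS:
            continue  # decimal point or abbreviation, not a sentence end
        if next_char in {'"', "'", ")"}:
            end += 1  # keep a closing quote or parenthesis with its sentence
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('Pi is roughly 3.14. Dr. Smith said "Stop." Then she left.'))
# ['Pi is roughly 3.14.', 'Dr. Smith said "Stop."', 'Then she left.']
```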
Look at this paragraph from a legal judgment:
Legal judgments and contracts are often full of complex sentence constructions, multiple clauses, and legal concepts.
They are difficult to understand for laypersons and time-consuming for legal professionals. Plus, the complex language may result in mistaken interpretations, legal risks, compliance risks, and penalties for businesses.
Using CoT reasoning, GPT-4 can explain such complex legal paragraphs step by step:
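A minimal prompting sketch for this use case, with generate again standing in for whatever GPT-4 completion call you use, looks like this:

```python
def explain_legal_paragraph(paragraph: str, generate) -> str:
    # Ask the LLM to reason through the paragraph clause by clause before
    # summarizing its overall effect.
    prompt = (
        "Explain the following paragraph from a legal judgment in plain "
        "English. Let's think step by step: identify each clause, state "
        "what it obligates or permits and for whom, and then summarize "
        "the overall effect.\n\n" + paragraph
    )
    return generate(prompt)
```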
Medical records and reports often contain complex, critical information that medical professionals need answers from. LLM-based chatbots can save them time by answering questions based on the information in patients' health and medical records. Confidence in LLM-based chatbots will be higher if they can demonstrate their reasoning while answering questions.
Below is a rather confusing medical investigation report:
We ask an LLM to interpret it using CoT reasoning and answer follow-up questions:
Some follow-up questions based on the reasoning:
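One way to support such follow-ups, sketched below under the same placeholder-generate assumption, is to keep the report and the model's earlier reasoning in the context of every subsequent question.

```python
def answer_follow_up(report: str, reasoning_so_far: str, question: str,
                     generate) -> str:
    # Keeping the report and the earlier chain of reasoning in the prompt
    # lets the LLM ground each follow-up answer in both.
    prompt = (
        f"Medical report:\n{report}\n\n"
        f"Step-by-step interpretation so far:\n{reasoning_so_far}\n\n"
        f"Follow-up question: {question}\n"
        "Let's think step by step, then give a concise answer."
    )
    return generate(prompt)
```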
In customer service or other business activities involving dialogue, customers or business partners may use sarcasm, jokes, or similar expressions. Automated systems may misinterpret such dialogues during sentiment classification or summarization.
In this example, a reviewer posts a sarcastic comment but the system misunderstands it:
But using advanced CoT reasoning, you can make it correctly understand the customer's intention:
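A sketch of such a CoT sentiment prompt is shown below; generate is again a placeholder for your LLM call, and the label format is an assumption.

```python
def classify_review_sentiment(review: str, generate) -> str:
    # The step-by-step instruction pushes the LLM to unpack sarcasm before
    # committing to a label.
    prompt = (
        "Classify the sentiment of the customer review below as positive, "
        "negative, or neutral. The review may contain sarcasm or jokes.\n\n"
        f'Review: "{review}"\n\n'
        "Let's think step by step about what the customer really means, "
        "then end with a line of the form 'Sentiment: <label>'."
    )
    return generate(prompt)
```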
Many businesses hesitate to automate their workflows with LLMs due to a lack of confidence in their accuracy, reliability, and repeatability. Just one wrong or inconsistent answer can force costly manual quality checks into a business workflow.
By using chain-of-thought prompting and step-by-step reasoning, you can confidently start using LLMs in your critical business processes. We have the insights you need to use LLMs without losing sleep. Contact us to learn more about accurate CoT prompting and reasoning in your business workflows.