How SAP/ERP AI Chatbots Can Boost Your Sales and Customer Satisfaction
Using ERP AI chatbots, empower your employees to boost your sales, enhance customer satisfaction, and make smarter decisions in real time.
Large language models (LLMs) like ChatGPT excel at language tasks like creative writing and summarization because of the volume of text data they're trained on.
Now, researchers are discovering that the deep understanding of language that LLMs have, combined with knowledge about the real world present in their training data, makes them impressively capable at language-adjacent problems like reasoning, task planning, decision-making, and action selection — the very capabilities needed for fully autonomous agents.
In this article, we explore various techniques you can use to build LLM-powered autonomous agents. Through case studies, we also study the design and operations of some actual agents.
LLM-powered autonomous agents are software components whose logic and state are controlled by an LLM and demonstrate these traits of autonomy:
In the following sections, we’ll take a look at how we implement SOTA frameworks and techniques to create LLM systems that complete tasks on their own.
Autonomous agents must be capable of decomposing a high-level instruction or goal into a logical set of tasks in the right order. We explore some simple as well as complex techniques to do that.
Chain-of-thought (CoT) prompting is one of the simplest task decomposition techniques because it neither requires any special training of the LLM nor involves any additional software.
All it involves is a simple instruction like "think step by step" or "let's think step by step" added to the user prompt or system prompt. Most LLMs respond to it by breaking down the goal into a logical sequence of steps.
Despite its simplicity, CoT turns out to be surprisingly effective for simple goals.
In this example, we provide a system prompt to GPT-4 explaining the tools available and ask it to use CoT for task decomposition:
The tools are deliberately listed in a random order to prevent the LLM from being influenced by their sequence in the system prompt.
As shown below, CoT forces the LLM to break down the instruction into a logical sequence of tasks using the tools we have described:
The reasoning and sequence of steps are correct. The only missing aspect here is more structured invocations of the tools that a script can process programmatically. It's easy to add that too, but for now, we're only interested in checking how well CoT works.
If the "think step by step" CoT instruction is removed, the steps are not always as logical:
CoT prompting is a great framework for adding a conversational frontend to other generative ai tasks to make a more “agent like” workflow. The planning nature of the output allows us to walk through multi-step chatbot inputs to ensure we perform each task and cover our grounds. The rather simple prompt structure of CoT makes it easy for LLMs to understand how we’ve outlined our task and follow few-shot examples.
Chatbots that require code generation are one of the best use cases for chain-of-thought. We’ve used this framework to generate code queries, run the code that is generated step by step, and evaluate the results to ensure a quality output relative to the provided query. This breaking up of the code required into multiple steps makes it easier to follow the logic and allows you to use simpler queries that are easier for the model to generate. You can even work <THINK> bubbles into the plan generation to add a bit of language understanding to why you are generating the specific code and how it relates to the original query. Here’s an example of what the format looks like for generating a plan with thought sections that describe why we perform this step.
While Chain-of-thought is the most simple way to set up LLM agents, its a really valuable addition to these chatbots that allow you to autotomize the workflow to understand what to do. Adding error messaging or other return results make it so code can just check the results that come back and perform any required operations on its own.
Tree of thoughts (ToT) reasoning extends CoT to achieve better reasoning toward goals where exploration, strategic lookahead, or criticality of initial decisions matter. On some tasks, ToT achieves a success rate of 74% compared to CoT's 4%.
ToT improves decision-making by:
ToT perceives any instruction or goal as a tree of possibilities and searches for the best possible path through that tree. Essentially, it does a tree search using common algorithms like depth-first or breadth-first search.
ToT starts by decomposing the instruction into several intermediate thought states. Each thought state is a node in the tree that represents a thought and was the result of traversing a particular path in the tree. Next, it generates multiple thoughts from each thought state. It then evaluates each thought using a classifier or voting.
An example of ToT applied to a creative writing task is shown here:
Tree of Thoughts is best known as a prompting framework for creative writing and complex math related tasks. The combination of prompt based planning and the look ahead mechanisms solves some of the common issues seen in replicating human level long form tasks that require a bit of understanding of what has already been done and what will be done in the future for any given task.
We’ve used this in our long form content generation agents that focus on generating long form content (blog posts), refining and editing them, adding required images and alt text, and publishing them out. One of the key issues customers see with blog post generation is the models desire to repeat information already said in the sections above. This reads poorly as it looks like someone who didn’t write the sections above is now writing this one. It commonly comes out as very basic information that the reader doesn’t need to read again, and sometimes adds a “summary” paragraph to the end of each generation.
Tree of Thoughts cleans this up quite a bit by providing the entire plan for the blog post to the model as it generates the sections. Some prompting to connect these two concepts together makes sure the model understands that specific topics and ideas will be covered later in the blog based on the plan and the outline the plan generates. The plan also helps to make sure the blog post covers all required topics and doesn’t miss anything. This greatly expands the length of our blog posts and has allowed us to generate blogs that reach 2500+ words without repeating information unless required in a conclusion style section.
For better reasoning and task decomposition, the LLM should be able to evaluate its reasoning and backtrack or refine its steps if necessary. In this section, we review techniques to help do that.
This technique aims to improve upon CoT using a metric of self-consistency.
As we know, LLMs can generate multiple results for a prompt. In regular CoT, we normally just pick the first result as the answer. Instead, we can pick multiple results using top-k or similar sampling techniques.
CoT-SC intuitively recognizes that there must be some ground truth answer for each prompt. If we rerun the LLM multiple times and sample from its top-k results rather than just the first result, a significant number of results will be similar to one another and semantically close to the ground truth answer. This concept of similarity between results is called self-consistency.
CoT-SC samples multiple reasoning paths following a "think step by step" instruction. Then, instead of only taking the first result, it searches for the most consistent answer by marginalizing out the sampled reasoning paths.
The final answer is then decided by majority voting.
Graph of thoughts (GoT) goes beyond ToT's tree model by interpreting each output from an LLM as an arbitrarily interconnected graph with complex operations (branches) like aggregating multiple thoughts into a new one or looping over a thought to refine it.
GoT can be used for use cases like document merging. Besta et al. give the example of creating a new legal agreement by combining legal clauses from other existing agreement documents. If you use GoT to implement an autonomous agent for agreement creation, the existing suitable agreements and relevant clauses are first fetched by suitable data fetching tools after which the merging is done using GoT.
Chain of hindsight is an LLM fine-tuning technique that combines supervised fine-tuning and reinforcement learning with human feedback but without RL.
Instead, the human feedback for a multi-turn conversation is converted to natural language text and directly used as additional training sequences for supervised fine-tuning of the LLM. The hindsight in the name refers to all the feedback provided during the conversation.
Autonomous agents must have the ability to run other tools and services to either directly achieve the given goals or obtain additional information necessary to achieve them. In this section, we review techniques to add tool use and actions to LLMs.
ReAct combines reasoning and action as interleaving sets of thoughts, actions, and observations.
ReAct works by providing few-shot examples that demonstrate reasoning, action selection, and observations following the action as shown in the example above.
The action descriptions are similar to the CoT example above but more structured so that we can process the output.
If you're using GPT-4, the action specification and detection can be made more structured and reliable using the OpenAI function-calling APIs.
Using the API, specify actions and their information in detail as shown below:
The LLM uses this information — the function descriptions, parameter descriptions, and so on — to decide when and how to invoke your actions. The LLM's response contains structured details on how your script must invoke a selected action.
I wrote an entire guide to using ReAct for common use cases like chatbots and document summarization as the framework's ability to use various tools inline with the result generation is perfect for these use cases. The evaluation step at the end helps make sure the results actually make sense given the provided task and query. ReAct is my favorite framework for autonomous agents as building workflow examples for training and prompting are a breeze and allow you to refine the agents ability and understanding of the task down to the very line. I highly recommend this framework in chatbots that leverage external APIs and indexed knowledge bases in the same tool as the management of multiple external data sources relative to the input query can be challenging. This same workflow can be used to attach this chatbot framework to other systems to create a full autonomous agent.
Simply provide the prompting framework with access to different tools with a description of what the tool should be used for and the action step will manage accessing and leveraging the tools. Here’s a look at our banking chatbot architecture that accesses multiple external API services such as an interest rate calculator, a customer profile database, and Mortgage value calculator.
Reflexion achieves self-introspection and refinement by applying reinforcement learning (RL) to LLMs. But instead of updating model weights like regular RL, it provides textual feedback that can be included in subsequent prompt contexts to improve the actions and results.
ToolFormer finetunes an LLM to decide if a tool must be used, select a suitable tool, and invoke it with the correct syntax, semantics, and arguments.
An example conversation using ToolFormer is shown below:
Finetuning can be more reliable than in-context learning but also less flexible. If your agent addresses a narrow problem for which it needs only a limited number of predetermined tools, use ToolFormer. However, if you need the LLM to learn a new tool, you must fine-tune it again.
On the other hand, if your LLM needs access to a wide and easily extensible set of tools in the future, use an in-context technique like ReAct.
One of the reasons we use this framework is the inline nature of the tool usage when generating the response. Most of these systems use the tool, generate a result, then generate the full language result by combining the information. ToolFormer uses the tools right inline with the natural language result. This makes it easier to tweak the conversational tone, length, and style.
Context determines the planning and action decisions of the LLM.
The first type of context consists of the LLM's own weights. They form a type of contextual memory that is non-volatile and usually read-only. They represent all the text sequences and surrounding contexts that the LLM saw during its training. Fine-tuning and reinforcement learning are typically used to modify this memory.
A second type of context consists of all the information present in the user and system prompts. This is more volatile memory (analogous to stack memory in programming) that's only used for the duration of a single request to the LLM and then forgotten by the LLM. The technique of providing relevant information for a single request is called in-context learning.
A third type of context involves storing useful information in an external database and supplying it to the LLM on demand (analogous to disk storage). When the LLM receives a new prompt, it looks for relevant information in the database, retrieves it, and injects it into the prompt. This technique is called retrieval-augmented generation (RAG).
The typical RAG implementation is:
There are helper frameworks that already implement most of the moving parts explained in previous sections. You just have to customize a few bits for your specific autonomous agent. In this section, we review some of these existing frameworks.
Implementing any of the above techniques typically involves stringing together logic that interacts with diverse AI libraries like PyTorch and Transformers, client libraries for external APIs like Wikipedia, and different database technologies like FAISS or MySQL.
One problem is that such logic can be brittle — fine for a proof of concept but unreliable for a production system. Another problem with it is that it's usually not fully reusable across multiple agents; you're forced to reinvent the wheel to some extent for every agent you need.
A better approach is to treat this entire paradigm as a specialized domain, the domain of LLM programming, and come up with a domain-specific language (DSL). In such a DSL, all these techniques like RAG or self-reflection become first-class citizens that can be expressed directly by name rather than indirectly in programming code.
DSPy is a framework that provides an improved approach to programming LLMs using a domain-specific language with all the common LLM techniques as first-class citizens.
A DSPy example for RAG-based question-answering is shown below:
AutoGPT is both a general-purpose autonomous agent you can run standalone as well as a framework for implementing new tools. Its configurable settings make it very powerful for both purposes. With simple configuration settings, you can enable an entire ecosystem of tools for:
This demo shows how AutoGPT can answer complex questions by combining reasoning and tools like web search.
We gave it the following complex task involving legal and medical compliance: "Summarize the key provisions of the California Consumer Privacy Act relevant to a medical assistance app."
In response, AutoGPT first set a role for itself as a legal expert and decomposed the task into this set of logical goals:
Next, it thought and reasoned about the goals to create a plan of action:
It subjected its reasoning and plan to a round of self-introspection:
As its first action, it searched the web to get relevant knowledge (using its web search implementation based on the DuckDuckGo search engine):
However, after fetching this information, it repeated the same search a couple of times, apparently stuck in a loop. We intervened with an instruction to stop searching and proceed with the information already fetched:
AutoGPT completed the task by summarizing complex information (in this case, the provisions of a legal statute) relevant to a specific scenario (a medical assistance app):
BabyAGI is a reusable autonomous agent implementation that's capable of task planning and mainly useful for retrieval-augmented generation use cases using GPT LLMs. It has built-in support for various vector databases.
In this article, we explored state-of-the-art techniques and technologies being used to implement autonomous agents. LLMs were initially meant for natural language processing but they have turned out to be surprisingly good at general problem solving, reasoning, and task planning. You can automate even your complex business workflows to a great extent using just LLMs and natural language instructions.
If you have a novel business idea that you want to test with a proof of concept, you should seriously consider prototyping it using an autonomous agent with an LLM brain. Contact us for help.