Welcome back to our series on Large Language Models (LLMs). In the first part, we delved into how LLMs are made and explored their potential in language analysis. In this part, we will explore another fascinating application of LLMs: data analysis. We will discuss the concept of autonomous LLM agents and how they can aid data analysis tasks, specifically exploratory data analysis. So if you’ve ever wondered how AI can help make sense of data, join us and read on!
Autonomous LLM agents
The traditional turn-by-turn conversational interface of LLMs involves a user interacting with the model through a series of individual prompts, like the demo in our earlier post. This back-and-forth interaction is useful for one-off queries, but it has some limitations.
For example, in a conversational exchange the LLM cannot perform actions like browsing the internet, uploading a CSV file, or running a chunk of code. If a user asks a language model for the latest weather data, a pretrained turn-by-turn LLM can provide information that is current only up to its knowledge cutoff date, a proxy for the freshness of its training data. (The most recent ChatGPT model, for example, was trained on a corpus of web data from 2021 and early 2022.) Similarly, if a user requests the execution of a specific piece of code or the upload of a dataset, the turn-by-turn LLM cannot perform those actions directly. Autonomous LLM agents, on the other hand, have the capacity to use tools: they can fetch current information, access live data, and perform dynamic computations.
Autonomous agents can browse the internet and run code because they use LLMs to define goals and tasks, identify the right actions to take, and then generate and execute commands, all with the help of external tools (like search engines or APIs). They rely on contextual embeddings: numerical representations of words or objects that capture meaning and interrelationships. Contextual embeddings help an LLM understand the user's intent and context by incorporating the entire conversation history. Additionally, scraped web pages, uploaded CSV files, and other data can be embedded, allowing the autonomous LLM agent to respond based on the collective knowledge gained throughout its interaction with a user. This results in more personalized and context-aware interactions.
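To make the embedding idea concrete, here is a minimal sketch of embedding-based context retrieval. The vectors and labels below are hand-made stand-ins for illustration; a real agent would use model-generated embeddings with hundreds of dimensions, but the retrieval step (rank stored items by cosine similarity to the query) works the same way.

```python
import numpy as np

# Toy context store: each past message or data column is represented by an
# embedding vector (hand-made 3-d vectors here; real systems use
# model-generated embeddings with hundreds of dimensions).
context_items = {
    "cereal calories column": np.array([0.9, 0.1, 0.0]),
    "cereal sugar column":    np.array([0.8, 0.3, 0.1]),
    "shipping address":       np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Similarity of two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, items, k=1):
    """Return the k stored context items most similar to the query embedding."""
    ranked = sorted(items,
                    key=lambda name: cosine_similarity(query_vec, items[name]),
                    reverse=True)
    return ranked[:k]

# A query about calories should land near the calories column, not the address.
query = np.array([0.85, 0.15, 0.05])
print(retrieve(query, context_items))
```

The agent prepends the retrieved items to the LLM prompt, which is how knowledge accumulated over the interaction stays available to later turns.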
Use-case: exploratory CSV analysis
Exploratory data analysis plays a pivotal role in uncovering hidden patterns, trends, and insights from vast amounts of data. When combined with the power of autonomous LLM agents, it opens up some exciting possibilities for efficient and insightful analysis. These agents have the ability to process, understand, and interpret the contents of CSV files, providing a seamless and intelligent interface for exploring data, identifying trends, and generating meaningful visualizations. Let's explore a use case that is compelling for many non-profits: using an autonomous LLM agent to embed a CSV file, generate Python code, and visualize data in plots.
In this example, we will use the ChatGPT Code Retrieval Plugin, an internal service that leverages the GPT-4 LLM and allows the upload and embedding of CSV files. The plugin also generates Python code (using the Pandas library), runs that code, and plots any visualizations the code creates.
First, we upload a toy dataset containing information on different cereals:
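Behind the scenes, the schema-cataloging step the agent performs first might look something like the sketch below. The CSV contents and column names here are assumptions standing in for the uploaded cereal dataset shown in the screenshot.

```python
import io
import pandas as pd

# Stand-in for the uploaded cereal CSV (columns and values are assumptions).
csv_text = """name,manufacturer,calories,sugar_g,fiber_g
Corn Flakes,Kellogg's,100,2,1
Granola Crunch,Nature Valley,220,14,3
Bran Bits,Post,90,5,7
"""
df = pd.read_csv(io.StringIO(csv_text))

# Cataloging the schema: column names, inferred dtypes, and sample rows,
# which is roughly what the agent summarizes back to the user.
print(df.dtypes)
print(df.head(2))
```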
We see that the assistant starts off by cataloging the schema of the CSV file. By inspecting the rows, it even infers what the columns mean in detail. We also have the opportunity to correct its understanding of the dataset schema at this point. Next, we ask it to generate a histogram of cereals by calories per serving.
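The code the agent generates for this request would look roughly like the following sketch. The DataFrame is a small stand-in for the uploaded file, and the column names are assumptions; the plotting calls are standard pandas/Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the plugin returns the image to the chat
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the uploaded cereal data (names and values are assumptions).
df = pd.DataFrame({
    "name": ["Corn Flakes", "Granola Crunch", "Bran Bits", "Honey Oats"],
    "calories": [100, 220, 90, 160],
})

# Generated analysis code: a histogram of calories per serving.
ax = df["calories"].plot.hist(bins=5, edgecolor="black")
ax.set_xlabel("Calories per serving")
ax.set_title("Distribution of cereal calories")
plt.savefig("calorie_histogram.png")
```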
After generating the visualization, the assistant is able to summarize any observations or key takeaways. We then ask it to generate a more involved visualization: a heatmap of the data.
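A common way to render a heatmap over tabular data is to plot the pairwise correlations between numeric columns; the agent's generated code might resemble this sketch (again, the DataFrame and column names are illustrative assumptions).

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the numeric columns of the cereal dataset (assumed values).
df = pd.DataFrame({
    "calories": [100, 220, 90, 160],
    "sugar_g":  [2, 14, 5, 9],
    "fiber_g":  [1, 3, 7, 2],
})

corr = df.corr()  # pairwise correlations between numeric columns

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig("correlation_heatmap.png", bbox_inches="tight")
```

The correlation heatmap is a natural choice here because it surfaces relationships (e.g., between calories and sugar) that are hard to see in raw rows.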
As you can see in the screenshots, the autonomous assistant appears to transparently report its process. In its first codegen attempt, an import error occurs because of a missing library. The agent is able to understand the error, communicate the mishap to the user, and re-attempt the codegen. The second attempt fails with a different error, and the agent follows the same procedure. It gets it right on the third try, and outputs a heatmap visualization alongside useful observations about the result. All of this occurs without user intervention.
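The generate-execute-retry loop described above can be sketched as follows. The `generate` callable and `fake_llm` below are hypothetical stand-ins for the LLM call; the point is the control flow, where each error is fed back into the next generation attempt.

```python
import traceback

def run_generated_code(generate, max_attempts=3):
    """Execute LLM-generated code, feeding errors back for regeneration.

    `generate(error)` is a stand-in for the LLM call: it takes the previous
    error traceback (None on the first attempt) and returns a code string.
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate(error)
        try:
            namespace = {}
            exec(code, namespace)  # real agents sandbox this step
            return namespace.get("result"), attempt
        except Exception:
            error = traceback.format_exc()  # report back to the LLM, then retry
    raise RuntimeError(f"Codegen failed after {max_attempts} attempts:\n{error}")

# Toy "LLM" that only fixes its code after seeing an error message.
def fake_llm(error):
    if error is None:
        return "result = undefined_name + 1"  # first attempt: NameError
    return "result = 41 + 1"                  # corrected on retry

value, attempts = run_generated_code(fake_llm)
print(value, attempts)  # 42, succeeding on the second attempt
```

Capping `max_attempts` matters in practice: without it, an agent stuck on an unfixable error would loop indefinitely.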
This example showcases a very powerful data analysis use case for autonomous agents. By harnessing the reasoning and sequencing capabilities of LLMs, organizations can delve into the world of exploratory CSV analysis with a new level of speed, intelligence, and interactivity – all with limited data science expertise. That said, it is important to note that the outputs here are also subject to the “hallucinations” of LLMs, which we discussed in our earlier post. While this is mitigated by our ability to inspect the code generated by the model, it is still possible for the LLM’s interpretation of codegen results to be incorrect or misleading.
Just as we have learned to use calculators to support mathematical calculations, we can use LLMs to support data analysis. Yet just as with calculators, we cannot rely on them without applying our own mathematical reasoning to the problem at hand. Language models can be valuable tools; they can amplify our linguistic capabilities. But they cannot replace our own human intellect and reasoning. We can actively seek ways to leverage LLMs to make our communities better. But we cannot abandon human values, inspection, and oversight.
In this era of LLMs, we are faced with both massive potential for good and a universe of ethical dilemmas. It's worth noting that not all builders or users of these systems will disclose their reliance on AI models or protected data. This raises concerns about transparency, safety, and equity. By openly acknowledging the involvement of LLMs in the creation of new information or tools, we uphold the importance of human agency and empower individuals to make informed choices.