In this guide, we will build an LLM-powered chatbot named "Frosty" that performs data exploration and answers questions by writing and executing SQL queries on Snowflake data.
The application uses Streamlit and Snowflake and can be plugged into your LLM of choice, alongside data from Snowflake Marketplace. By the end of the session, you will have an interactive web application chatbot that can converse and answer questions based on a financial dataset.
Streamlit is an open-source Python library that enables developers to quickly create, deploy, and share web apps from Python scripts. Learn more about Streamlit.
A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Some examples of popular LLMs are GPT-4, GPT-3, BERT, LLaMA, and LaMDA.
OpenAI is the AI research and deployment company behind ChatGPT, GPT-4 (and its predecessors), DALL-E, and other notable offerings. Learn more about OpenAI. We use OpenAI in this guide, but you are welcome to use the large language model of your choice in its place.
The Snowflake Marketplace provides users with access to a wide range of datasets from third-party data stewards, expanding the data available for transforming business processes and making decisions. Data providers can publish datasets and offer data analytics services to Snowflake customers. Customers can securely access shared datasets directly from their Snowflake accounts and receive automatic real-time updates.
In building the app, you'll use st.experimental_connection to connect your Streamlit app to Snowflake, and session state to store your chatbot's message history.

To set up your environment, complete the following steps on your local machine (or an equivalent dev environment):
1. Create a folder for the project called llm-chatbot.
2. If you're using a machine with an Apple M1 chip, create the virtual environment by running conda create --name snowpark-llm-chatbot --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.10 numpy pandas
3. Activate the environment created in the previous step by running conda activate snowpark-llm-chatbot and proceed to step 6 below. If you're not using a machine with an Apple M1 chip, continue to step 4.
4. Create the virtual environment by running conda create --name snowpark-llm-chatbot -c https://repo.anaconda.com/pkgs/snowflake python=3.10
5. Activate the environment by running conda activate snowpark-llm-chatbot
6. Install Snowpark and the OpenAI Python package by running conda install -c https://repo.anaconda.com/pkgs/snowflake snowflake-snowpark-python openai, and install Streamlit by running pip install streamlit
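As an optional sanity check (a minimal sketch, not part of the official setup steps), you can confirm that the packages installed correctly by running a short script inside the activated environment:

# sanity_check.py – hypothetical helper; delete after use
import openai
import snowflake.snowpark
import streamlit

print("openai, snowflake-snowpark-python, and streamlit are all importable")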
To avoid pyarrow-related issues: if you do not have pyarrow installed, you do not need to install it yourself; installing Snowpark automatically installs the appropriate version. Do not reinstall a different version of pyarrow after installing Snowpark.

If you prefer to run through the tutorial in a remote environment instead of setting up a Python environment locally, you can use GitHub Codespaces.
If you do, you will still need to create a .streamlit/secrets.toml file with configuration for connecting to Snowflake and an OpenAI API key, as described in "Setting up Streamlit environment" below.

Snowflake Marketplace provides visibility into a wide variety of datasets from third-party data stewards, broadening access to the data points used to transform business processes. Snowflake Marketplace also removes the need to integrate and model data by providing secure access to datasets that are fully maintained by the data provider.
If you don't have a Snowflake account, sign up for a 30-day free trial here.
You can also access Snowsight from the Classic Console.
Before building our app, we need to run a set of SQL statements in Snowflake to create two views. The first view is FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ATTRIBUTES_LIMITED, and the second is FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ANNUAL_TIME_SERIES. You can copy the SQL statements from this file and run them in the worksheet created for your sample queries.
Now that we've configured the dataset we'll be using for our application, we can get started with Streamlit.
1. Open a terminal window and navigate to your llm-chatbot folder.
2. Test that Streamlit was installed correctly by running streamlit hello. Since our application will connect to Snowflake and OpenAI, we need a way to securely store our credentials. Luckily, Streamlit's secrets management feature allows us to store secrets securely and access them in our Streamlit app as environment variables.
3. Add a folder within your llm-chatbot folder called .streamlit. Using the command line, you can do this by entering mkdir .streamlit.
4. Within the .streamlit folder, add a file called secrets.toml. Using the command line, you can do this by first navigating to the .streamlit folder via cd .streamlit and then entering touch secrets.toml.
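As a quick illustration (a sketch, not part of the app we'll build), any value placed in .streamlit/secrets.toml becomes available to your app through st.secrets, with both dictionary-style and attribute-style access:

import streamlit as st

# Assumes secrets.toml contains the entries described in the next two sections
api_key = st.secrets["OPENAI_API_KEY"]       # dictionary-style access
user = st.secrets.connections.snowpark.user  # attribute-style access to a nested section

st.write("Loaded credentials for user:", user)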
Add Snowflake credentials to secrets.toml
We need to add the Snowflake user, password, warehouse, role, and account to our secrets file. Copy the following format, replacing the placeholder credentials with your actual credentials.

If you prefer to use browser-based SSO to authenticate, replace the password entry with authenticator = "EXTERNALBROWSER".
# .streamlit/secrets.toml
[connections.snowpark]
user = "<jdoe>"
password = "<my_trial_pass>"
warehouse = "COMPUTE_WH"
role = "ACCOUNTADMIN"
account = "<account-id>"
Add the OpenAI API key to secrets.toml
We also need to add our OpenAI API key to our secrets file. Copy the following format, replacing the placeholder API key with your actual API key.
# .streamlit/secrets.toml
OPENAI_API_KEY = "sk-2v...X"
In total, your secrets.toml file should look like this:

# .streamlit/secrets.toml
OPENAI_API_KEY = "sk-2v...X"
[connections.snowpark]
user = "<username>"
password = "<password>"
warehouse = "COMPUTE_WH"
role = "ACCOUNTADMIN"
account = "<account-id>"
Let's validate that our Snowflake and OpenAI credentials are working as expected.
First, we'll validate our OpenAI credentials by asking GPT-3.5 a simple question: "What is Streamlit?"

1. Create a Python file called validate_credentials.py at the root of your llm-chatbot folder.
2. Paste the below snippet into validate_credentials.py. This snippet reads your OpenAI API key from the secrets file, sends the question to GPT-3.5, and displays the response via st.write.
import streamlit as st
import openai
# Read the OpenAI API key from .streamlit/secrets.toml
openai.api_key = st.secrets["OPENAI_API_KEY"]

# Send a test question to GPT-3.5
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "What is Streamlit?"}
    ]
)

# Display the model's answer in the app
st.write(completion.choices[0].message.content)
3. Run the Streamlit app by entering streamlit run validate_credentials.py in the command line.

Next, let's validate that our Snowflake credentials are working as expected.
1. Replace the contents of validate_credentials.py with the below code. This snippet connects to Snowflake via st.experimental_connection (the "snowpark" name maps to the [connections.snowpark] section of secrets.toml), runs a query to retrieve the current warehouse, and writes the result to the UI.
import streamlit as st
conn = st.experimental_connection("snowpark")
df = conn.query("select current_warehouse()")
st.write(df)
2. Run the Streamlit app by entering streamlit run validate_credentials.py in the command line.

We're ready to start building our app! First we'll build a simple version of the chatbot that passes user-submitted messages to GPT-3.5 and displays its responses. We'll build on the app's complexity in subsequent sections.
We'll break down the Python file snippet-by-snippet so that you understand the functionality of each section, but if you'd like to skip ahead and download the full file, you can do so here.
1. Create a new Python file called simple_chatbot.py. Add import statements, give your app a title, and initialize the chat message history in session state with a starter message from the assistant:
import openai
import streamlit as st
st.title("☃️ Frosty")
# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "How can I help?"}
    ]
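If session state is new to you, here is a tiny standalone illustration (separate from our chatbot) of the key idea: values stored in st.session_state persist across the script reruns that Streamlit triggers on each user interaction.

import streamlit as st

# Without session state, this counter would reset to 0 on every rerun
if "counter" not in st.session_state:
    st.session_state.counter = 0

if st.button("Increment"):
    st.session_state.counter += 1

st.write("Button clicks this session:", st.session_state.counter)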
2. Accept input from the user via Streamlit's st.chat_input() feature. If the user has entered a message, add that message to the chat history by storing it in session state (the walrus operator := assigns the input to prompt and tests that it is non-empty in a single expression).
# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
3. Loop through the chat history stored in session state and display each message.
# display the prior chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
4. If the last message in the history is not from the assistant, generate a new response by calling GPT-3.5 via the openai Python package. Display a spinner while the app retrieves the response via Streamlit's st.spinner feature, use st.write to display the chatbot's response in the UI, and append the response to the chat history stored in session state.
# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Call LLM
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            r = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            )
            response = r.choices[0].message.content
            st.write(response)
            message = {"role": "assistant", "content": response}
            st.session_state.messages.append(message)
5. Run the app via streamlit run simple_chatbot.py. Give it a whirl – ask Frosty a question!

The full contents of the Python file for this simple chatbot app are below, or you can download the file from GitHub.
import openai
import streamlit as st

st.title("☃️ Frosty")

# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "How can I help?"}
    ]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Call LLM
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            r = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            )
            response = r.choices[0].message.content
            st.write(response)
            message = {"role": "assistant", "content": response}
            st.session_state.messages.append(message)
Now that we've built a simple version of the chatbot app, let's expand the functionality to enable Frosty to translate our requests into SQL statements and execute those statements using the Cybersyn dataset stored in our Snowflake database.
We're also going to create a prompt Python file before building out the main file of our chatbot app. The primary purpose of this file is to create the function get_system_prompt()
, which will be called in our main Python file and will do a few things: query Snowflake for the table's column names, types, and variable definitions; embed that context, along with rules for Frosty's behavior, into a system prompt; and instruct the LLM to introduce itself and offer example questions.
This file should be placed in the root of your llm-chatbot
folder. You can download the file from here or create an empty Python file and paste the following code:
import streamlit as st
QUALIFIED_TABLE_NAME = "FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ANNUAL_TIME_SERIES"
METADATA_QUERY = "SELECT VARIABLE_NAME, DEFINITION FROM FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ATTRIBUTES_LIMITED;"
TABLE_DESCRIPTION = """
This table has various metrics for financial entities (also referred to as banks) since 1983.
The user may describe the entities interchangeably as banks, financial institutions, or financial entities.
"""
GEN_SQL = """
You will be acting as an AI Snowflake SQL expert named Frosty.
Your goal is to give correct, executable SQL queries to users.
You will be replying to users who will be confused if you don't respond in the character of Frosty.
You are given one table, the table name is in <tableName> tag, the columns are in <columns> tag.
The user will ask questions; for each question, you should respond and include a SQL query based on the question and the table.
{context}
Here are 6 critical rules for the interaction you must abide:
<rules>
1. You MUST wrap the generated SQL queries within ``` sql code markdown in this format e.g
```sql
(select 1) union (select 2)
```
2. If I don't tell you to find a limited set of results in the sql query or question, you MUST limit the number of responses to 10.
3. Text / string where clauses must be fuzzy match e.g ilike %keyword%
4. Make sure to generate a single Snowflake SQL code snippet, not multiple.
5. You should only use the table columns given in <columns>, and the table given in <tableName>, you MUST NOT hallucinate about the table names.
6. DO NOT put numerical at the very front of SQL variable.
</rules>
Don't forget to use "ilike %keyword%" for fuzzy match queries (especially for variable_name column)
and wrap the generated sql code with ``` sql code markdown in this format e.g:
```sql
(select 1) union (select 2)
```
For each question from the user, make sure to include a query in your response.
Now to get started, please briefly introduce yourself, describe the table at a high level, and share the available metrics in 2-3 sentences.
Then provide 3 example questions using bullet points.
"""
# Cache the assembled table context so the column and metadata queries run only once per session
@st.cache_data(show_spinner=False)
def get_table_context(table_name: str, table_description: str, metadata_query: str = None):
    table = table_name.split(".")
    conn = st.experimental_connection("snowpark")
    columns = conn.query(f"""
        SELECT COLUMN_NAME, DATA_TYPE FROM {table[0].upper()}.INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = '{table[1].upper()}' AND TABLE_NAME = '{table[2].upper()}'
        """,
    )
    columns = "\n".join(
        [
            f"- **{columns['COLUMN_NAME'][i]}**: {columns['DATA_TYPE'][i]}"
            for i in range(len(columns["COLUMN_NAME"]))
        ]
    )
    context = f"""
Here is the table name <tableName> {'.'.join(table)} </tableName>
<tableDescription>{table_description}</tableDescription>
Here are the columns of the {'.'.join(table)}
<columns>\n\n{columns}\n\n</columns>
"""
    if metadata_query:
        metadata = conn.query(metadata_query)
        metadata = "\n".join(
            [
                f"- **{metadata['VARIABLE_NAME'][i]}**: {metadata['DEFINITION'][i]}"
                for i in range(len(metadata["VARIABLE_NAME"]))
            ]
        )
        context = context + f"\n\nAvailable variables by VARIABLE_NAME:\n\n{metadata}"
    return context

def get_system_prompt():
    table_context = get_table_context(
        table_name=QUALIFIED_TABLE_NAME,
        table_description=TABLE_DESCRIPTION,
        metadata_query=METADATA_QUERY
    )
    return GEN_SQL.format(context=table_context)

# do `streamlit run prompts.py` to view the initial system prompt in a Streamlit app
if __name__ == "__main__":
    st.header("System prompt for Frosty")
    st.markdown(get_system_prompt())
Finally, you can run this file as a Streamlit app to verify that the output is working correctly. Run the prompt generation via streamlit run prompts.py. Make sure the table information shows up as expected in the rendered prompt; this is what gets passed to the chatbot in the next section.
We'll break down the Python file snippet-by-snippet so that you understand the functionality of each section, but if you'd like to skip ahead and download the full file, you can do so here.
1. Create a new file called frosty_app.py and add the below code snippet, which does the following: initializes the chat history with a system prompt containing the table information and rules (get_system_prompt() retrieves this information), accepts user input via st.chat_input(), and displays the existing messages, hiding the system prompt from the user and rendering any attached query results as dataframes.
import openai
import re
import streamlit as st
from prompts import get_system_prompt

st.title("☃️ Frosty")

# Initialize the chat messages history
openai.api_key = st.secrets.OPENAI_API_KEY
if "messages" not in st.session_state:
    # system prompt includes table information, rules, and prompts the LLM to produce
    # a welcome message to the user.
    st.session_state.messages = [{"role": "system", "content": get_system_prompt()}]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    # Skip the system prompt so it stays hidden from the user
    if message["role"] == "system":
        continue
    with st.chat_message(message["role"]):
        st.write(message["content"])
        if "results" in message:
            st.dataframe(message["results"])
2. If the last message is not from the assistant, call GPT-3.5. Use the stream parameter to signify that GPT-3.5's response should be sent incrementally in chunks via an event stream, and display the chunks as they're received.
# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        response = ""
        resp_container = st.empty()
        for delta in openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            stream=True,
        ):
            # Some chunks (e.g. the initial role marker) carry no content
            response += delta.choices[0].delta.get("content", "")
            resp_container.markdown(response)
3. Parse GPT-3.5's response for a SQL query; if one is present, use st.experimental_connection to execute it against the database we created in Snowflake. Write the result to the app using st.dataframe, and append the result to the associated message in the message history.
        message = {"role": "assistant", "content": response}
        # Parse the response for a SQL query and execute if available
        sql_match = re.search(r"```sql\n(.*)\n```", response, re.DOTALL)
        if sql_match:
            sql = sql_match.group(1)
            conn = st.experimental_connection("snowpark")
            message["results"] = conn.query(sql)
            st.dataframe(message["results"])
        st.session_state.messages.append(message)
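If it helps to see the parsing step in isolation, here is a tiny standalone sketch of what that regular expression captures from a typical response:

import re

# A canned response of the kind Frosty's rules tell GPT-3.5 to produce
response = "Sure! Here is a query:\n```sql\nselect 1\n```"
sql_match = re.search(r"```sql\n(.*)\n```", response, re.DOTALL)
if sql_match:
    print(sql_match.group(1))  # prints: select 1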
4. Run the app via streamlit run frosty_app.py.

The full contents of the Python file for this app are below, or you can download the file from GitHub.
import openai
import re
import streamlit as st
from prompts import get_system_prompt

st.title("☃️ Frosty")

# Initialize the chat messages history
openai.api_key = st.secrets.OPENAI_API_KEY
if "messages" not in st.session_state:
    # system prompt includes table information, rules, and prompts the LLM to produce
    # a welcome message to the user.
    st.session_state.messages = [{"role": "system", "content": get_system_prompt()}]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    if message["role"] == "system":
        continue
    with st.chat_message(message["role"]):
        st.write(message["content"])
        if "results" in message:
            st.dataframe(message["results"])

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        response = ""
        resp_container = st.empty()
        for delta in openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            stream=True,
        ):
            response += delta.choices[0].delta.get("content", "")
            resp_container.markdown(response)
        message = {"role": "assistant", "content": response}
        # Parse the response for a SQL query and execute if available
        sql_match = re.search(r"```sql\n(.*)\n```", response, re.DOTALL)
        if sql_match:
            sql = sql_match.group(1)
            conn = st.experimental_connection("snowpark")
            message["results"] = conn.query(sql)
            st.dataframe(message["results"])
        st.session_state.messages.append(message)
Finally, it's time to explore the Cybersyn Financial & Economic Essentials dataset using natural language. Try asking Frosty a few questions about the financial entities and metrics in the data.
Congratulations – you've just built an LLM-powered chatbot capable of translating natural language to SQL queries and running those queries on data stored in Snowflake!
This tutorial is just a starting point for exploring the possibilities of LLM-powered chat interfaces for data exploration and question-answering using Snowflake and Streamlit. A few things to try next:

- Swap in a different LLM or point Frosty at your own dataset. The system prompt and table configuration are isolated in prompts.py, so it should be easy to swap and start playing around!
- Check out the Frosty session (ML103) from Snowflake Summit 2023 for more ideas and what's coming soon from Snowflake!
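For example, pointing Frosty at a different table is mostly a matter of editing the constants at the top of prompts.py; the values below are illustrative placeholders, not a real dataset:

# prompts.py – hypothetical values for your own dataset
QUALIFIED_TABLE_NAME = "MY_DATABASE.MY_SCHEMA.MY_TABLE"
TABLE_DESCRIPTION = """
Describe what the table contains so Frosty can reason about the data.
"""
# Optional: a query returning VARIABLE_NAME and DEFINITION rows, or None
METADATA_QUERY = "SELECT VARIABLE_NAME, DEFINITION FROM MY_DATABASE.MY_SCHEMA.MY_ATTRIBUTES;"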
Want to learn more about the tools and technologies used by your app? Check out the following resources: