Preview of final app

In this guide, we will build an LLM-powered chatbot named "Frosty" that performs data exploration and answers questions by writing and executing SQL queries on Snowflake data.

The application uses Streamlit and Snowflake and can be plugged into your LLM of choice, alongside data from the Snowflake Marketplace. By the end of this guide, you will have an interactive web-app chatbot that can converse and answer questions based on a financial dataset.

Key features & technology

What is Streamlit?

Streamlit is an open-source Python library that enables developers to quickly create, deploy, and share web apps from Python scripts. Learn more about Streamlit.
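
To make this concrete, here is a minimal sketch of a Streamlit app (the file name hello_app.py is just an example, not part of this guide). Save it and launch it with streamlit run hello_app.py:

import streamlit as st

# A tiny Streamlit app: a title, a text input, and a dynamic response.
st.title("Hello, Streamlit!")
name = st.text_input("Your name")
if name:
    st.write(f"Nice to meet you, {name}!")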

What is a large language model (LLM)?

A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Some examples of popular LLMs are GPT-4, GPT-3, BERT, LLaMA, and LaMDA.

What is OpenAI?

OpenAI is the AI research and deployment company behind ChatGPT, GPT-4 (and its predecessors), DALL-E, and other notable offerings. Learn more about OpenAI. We use OpenAI in this guide, but you are welcome to use the large language model of your choice in its place.
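
Because many providers expose OpenAI-compatible APIs, swapping in another LLM can often be as simple as pointing the openai Python client (v1.x) at a different endpoint. The URL and key below are hypothetical placeholders, not a specific provider's values:

from openai import OpenAI

# Hypothetical example: direct the OpenAI client at another
# OpenAI-compatible endpoint instead of the default api.openai.com.
client = OpenAI(
    base_url="https://llm.example.com/v1",  # placeholder endpoint
    api_key="<your-provider-api-key>",      # placeholder key
)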

What is the Snowflake Marketplace?

The Snowflake Marketplace provides users with access to a wide range of datasets from third-party data stewards, expanding the data available for transforming business processes and making decisions. Data providers can publish datasets and offer data analytics services to Snowflake customers. Customers can securely access shared datasets directly from their Snowflake accounts and receive automatic real-time updates.

Prerequisites

Complete the following steps on your local machine (or an equivalent dev environment):

  1. Install [Anaconda Distribution](https://www.anaconda.com/download) to manage a separate environment by selecting the appropriate installer link for your operating system and Python version.
  2. Open the terminal or command prompt and create a folder for your project. Let's call it llm-chatbot.
  3. Make sure you are running the latest version of conda by running the following command:
    conda update -n base conda
    
  4. Run the following command to create a Python 3.11 conda virtual environment:
    conda create --name snowpark-llm-chatbot python=3.11
    
  5. Activate the conda environment by running the following command:
    conda activate snowpark-llm-chatbot
    
  6. Install Snowpark for Python, Streamlit, and OpenAI by running the following commands (an optional sanity check follows these steps):
    conda install snowflake-snowpark-python "openai>=1.0.0"
    conda install conda-forge::"streamlit>=1.28.2"
    
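To confirm the environment is set up correctly, you can print the installed version of each package. This is an optional check, a minimal sketch to run inside the activated snowpark-llm-chatbot environment:

from importlib.metadata import version

# Print the installed version of each package used in this guide.
for pkg in ("snowflake-snowpark-python", "streamlit", "openai"):
    print(pkg, version(pkg))
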

Running in GitHub Codespaces

If you prefer to run through the tutorial in a remote environment instead of setting up a Python environment locally, you can use GitHub Codespaces.

The Snowflake Marketplace provides visibility into a wide variety of datasets from third-party data stewards, broadening the data points available for transforming business processes. It also removes the need to integrate and model data by providing secure access to datasets that are fully maintained by the data provider.

Log into Snowsight

If you don't have a Snowflake account, sign up for a 30-day free trial here.

  1. In a supported web browser, navigate to https://app.snowflake.com.
  2. Provide your account name or account URL. If you've previously signed in to Snowsight, you might see an account name that you can select.
  3. Sign in using your Snowflake account credentials.

You can also access Snowsight from the Classic Console:

  1. Sign in to the Classic Console.
  2. In the navigation menu, select Snowsight.
  3. Snowsight opens in a new tab.

Obtain dataset from Snowflake Marketplace

  1. At the top left corner, make sure you are logged in as ACCOUNTADMIN (switch role to ACCOUNTADMIN if not).
  2. Navigate to the Cybersyn Financial & Economic Essentials listing in the Snowflake Marketplace by clicking here.
  3. Select "Get."
  4. Select the appropriate roles to access the database being created and accept the Snowflake consumer terms and Cybersyn's terms of use.
  5. Select "Query Data," which will open a worksheet with example queries.

Example queries for the Cybersyn Financial & Economic Essentials dataset from the Snowflake Marketplace

Prep database

Before building our app, we need to run a set of SQL statements in Snowflake to create two views. The first view is FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ATTRIBUTES_LIMITED, which contains a limited subset of entity attributes: the available metrics (VARIABLE_NAME) and their definitions (DEFINITION).

The second view is FROSTY_SAMPLE.CYBERSYN_FINANCIAL.FINANCIAL_ENTITY_ANNUAL_TIME_SERIES, which contains annual metric values for financial entities (banks) since 1983.

You can copy the SQL statements from this file and run them in the worksheet created for your sample queries.

GIF showing the SQL statements being run in Snowflake

Now that we've configured the dataset we'll be using for our application, we can get started with Streamlit.

Run an example Streamlit app

  1. Head back over to the command line and navigate to your llm-chatbot folder.
  2. Run an example Streamlit app by entering streamlit hello.

Configure secrets file

Since our application will connect to Snowflake and OpenAI, we need a way to securely store our credentials. Luckily, Streamlit's secrets management feature allows us to store secrets securely and access them in our Streamlit app as environment variables.

  1. Add a folder within your llm-chatbot folder called .streamlit. Using the command line, you can do this by entering mkdir .streamlit.
  2. Within the .streamlit folder, add a file called secrets.toml. Using the command line, you can do this by first navigating to the .streamlit folder via cd .streamlit and then entering touch secrets.toml.
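
Once secrets.toml is populated (as shown in the next two steps), any value in it can be read in your app via st.secrets. A minimal sketch, assuming the keys defined later in this section:

import streamlit as st

# Top-level keys are available by name...
openai_key = st.secrets["OPENAI_API_KEY"]

# ...and TOML sections like [connections.snowflake] are available as nested mappings.
snowflake_user = st.secrets["connections"]["snowflake"]["user"]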

Add OpenAI credentials to secrets.toml

We need to add our OpenAI API key to our secrets file. Add your OpenAI key to the secrets file with the following format (replace the placeholder API key with your actual API key).

# .streamlit/secrets.toml

OPENAI_API_KEY = "sk-2v...X"

Add Snowflake credentials to secrets.toml

We also need to add the Snowflake user, password, warehouse, role, and account to our secrets file. Copy the following format, replacing the placeholder credentials with your actual credentials. account should be your Snowflake account identifier, which you can locate by following the instructions outlined here.

If you prefer to use browser-based SSO to authenticate, replace the password line with authenticator = "externalbrowser" (see the example after the following block).

# .streamlit/secrets.toml

[connections.snowflake]
user = "<jdoe>"
password = "<my_trial_pass>"
warehouse = "COMPUTE_WH"
role = "ACCOUNTADMIN"
account = "<account-id>"
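
For example, with browser-based SSO the connection section might look like this sketch (it assumes your Snowflake account supports externalbrowser authentication):

# .streamlit/secrets.toml

[connections.snowflake]
user = "<jdoe>"
authenticator = "externalbrowser"
warehouse = "COMPUTE_WH"
role = "ACCOUNTADMIN"
account = "<account-id>"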

Full contents of secrets.toml

# .streamlit/secrets.toml

OPENAI_API_KEY = "sk-2v...X"

[connections.snowflake]
user = "<username>"
password = "<password>"
warehouse = "COMPUTE_WH"
role = "ACCOUNTADMIN"
account = "<account-id>"

Validate credentials

Let's validate that our Snowflake and OpenAI credentials are working as expected.

OpenAI credentials

First, we'll validate our OpenAI credentials by asking GPT-3.5 a simple question: "What is Streamlit?"

  1. Add a file called validate_credentials.py at the root of your llm-chatbot folder.
  2. Add the below code to validate_credentials.py. This snippet does the following:
    • Imports the Streamlit and OpenAI Python packages
    • Retrieves our OpenAI API key from the secrets file
    • Sends GPT-3.5 the question "What is Streamlit?"
    • Prints GPT-3.5's response to the UI using st.write
import streamlit as st
from openai import OpenAI

client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": "What is Streamlit?"}
  ]
)

st.write(completion.choices[0].message.content)
  3. Run your Streamlit app by entering streamlit run validate_credentials.py in the command line.

Snowflake credentials

Next, let's validate that our Snowflake credentials are working as expected.

  1. Append the following to validate_credentials.py. This snippet does the following:
    • Creates a Snowpark connection
    • Executes a query to pull the current warehouse and writes the result to the UI
conn = st.connection("snowflake")
df = conn.query("select current_warehouse()")
st.write(df)
  2. Run your Streamlit app by entering streamlit run validate_credentials.py in the command line.

Build a simple chatbot

We're ready to start building our app! First, we'll build a simple version of the chatbot that passes user-submitted messages to GPT-3.5 and displays its responses. We'll build on the app's complexity in subsequent sections.

We'll break down the Python file snippet-by-snippet so that you understand the functionality of each section, but if you'd like to skip ahead and download the full file, you can do so here.

  1. Create a file called simple_chatbot.py. Add import statements and give your app a title.
from openai import OpenAI
import streamlit as st

st.title("☃️ Frosty")
  2. Initialize the chatbot's message history by adding the first message that we want the chatbot to display, "How can I help?", to session state.
# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How can I help?"}]
  3. Prompt the user to enter chat input by using Streamlit's st.chat_input() feature. If the user has entered a message, add that message to the chat history by storing it in session state.
# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
  4. Display the chatbot's message history by iterating through the values stored in session state associated with the key "messages" and printing each value.
# display the existing chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
  5. If the last message is not from the assistant, send the message history to GPT-3.5 via the openai Python package. Display a spinner while the app retrieves GPT-3.5's response via Streamlit's st.spinner feature, use st.write to display the chatbot's response in the UI, and append the response to the chat history stored in session state.
# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Call LLM
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            r = OpenAI().chat.completions.create(
                messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
                model="gpt-3.5-turbo",
            )
            response = r.choices[0].message.content
            st.write(response)

    message = {"role": "assistant", "content": response}
    st.session_state.messages.append(message)
  6. Run the Streamlit app via streamlit run simple_chatbot.py. Give it a whirl – ask Frosty a question!

GIF demonstrating the simple chatbot app

The full contents of the Python file for this simple chatbot app are below, or you can download the file from GitHub.

from openai import OpenAI
import streamlit as st

st.title("☃️ Frosty")

# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How can I help?"}]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Call LLM
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            r = OpenAI().chat.completions.create(
                messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
                model="gpt-3.5-turbo",
            )
            response = r.choices[0].message.content
            st.write(response)

    message = {"role": "assistant", "content": response}
    st.session_state.messages.append(message)

Now that we've built a simple version of the chatbot app, let's expand the functionality to enable Frosty to translate our requests into SQL statements and execute those statements using the Cybersyn dataset stored in our Snowflake database.

Create a prompt file

Before building out the main file of our chatbot app, we'll create a prompt Python file. Its primary purpose is to define the function get_system_prompt(), which will be called in our main Python file to assemble the system prompt: it queries Snowflake for the table's columns (and, optionally, variable metadata), combines that context with a table description and a set of rules for SQL generation, and returns the result as a single prompt string.

Save this file as prompts.py in the root of your llm-chatbot folder. You can download the file from here or create an empty Python file and paste the following code:

import streamlit as st

SCHEMA_PATH = st.secrets.get("SCHEMA_PATH", "FROSTY_SAMPLE.CYBERSYN_FINANCIAL")
QUALIFIED_TABLE_NAME = f"{SCHEMA_PATH}.FINANCIAL_ENTITY_ANNUAL_TIME_SERIES"
TABLE_DESCRIPTION = """
This table has various metrics for financial entities (also referred to as banks) since 1983.
The user may describe the entities interchangeably as banks, financial institutions, or financial entities.
"""
# This query is optional if running Frosty on your own table, especially a wide table.
# Since this is a deep table, it's useful to tell Frosty what variables are available.
# Similarly, if you have a table with semi-structured data (like JSON), it could be used to provide hints on available keys.
# If altering, you may also need to modify the formatting logic in get_table_context() below.
METADATA_QUERY = f"SELECT VARIABLE_NAME, DEFINITION FROM {SCHEMA_PATH}.FINANCIAL_ENTITY_ATTRIBUTES_LIMITED;"

GEN_SQL = """
You will be acting as an AI Snowflake SQL Expert named Frosty.
Your goal is to give correct, executable sql query to users.
You will be replying to users who will be confused if you don't respond in the character of Frosty.
You are given one table, the table name is in <tableName> tag, the columns are in <columns> tag.
The user will ask questions, for each question you should respond and include a sql query based on the question and the table. 

{context}

Here are 6 critical rules for the interaction you must abide:
<rules>
1. You MUST MUST wrap the generated sql code within ``` sql code markdown in this format e.g
```sql
(select 1) union (select 2)
```
2. If I don't tell you to find a limited set of results in the sql query or question, you MUST limit the number of responses to 10.
3. Text / string where clauses must be fuzzy match e.g ilike %keyword%
4. Make sure to generate a single snowflake sql code, not multiple. 
5. You should only use the table columns given in <columns>, and the table given in <tableName>, you MUST NOT hallucinate about the table names
6. DO NOT put numerical at the very front of sql variable.
</rules>

Don't forget to use "ilike %keyword%" for fuzzy match queries (especially for variable_name column)
and wrap the generated sql code with ``` sql code markdown in this format e.g:
```sql
(select 1) union (select 2)
```

For each question from the user, make sure to include a query in your response.

Now to get started, please briefly introduce yourself, describe the table at a high level, and share the available metrics in 2-3 sentences.
Then provide 3 example questions using bullet points.
"""

@st.cache_data(show_spinner="Loading Frosty's context...")
def get_table_context(table_name: str, table_description: str, metadata_query: str = None):
    table = table_name.split(".")
    conn = st.connection("snowflake")
    columns = conn.query(f"""
        SELECT COLUMN_NAME, DATA_TYPE FROM {table[0].upper()}.INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = '{table[1].upper()}' AND TABLE_NAME = '{table[2].upper()}'
        """, show_spinner=False,
    )
    columns = "\n".join(
        [
            f"- **{columns['COLUMN_NAME'][i]}**: {columns['DATA_TYPE'][i]}"
            for i in range(len(columns["COLUMN_NAME"]))
        ]
    )
    context = f"""
Here is the table name <tableName> {'.'.join(table)} </tableName>

<tableDescription>{table_description}</tableDescription>

Here are the columns of the {'.'.join(table)}

<columns>\n\n{columns}\n\n</columns>
    """
    if metadata_query:
        metadata = conn.query(metadata_query, show_spinner=False)
        metadata = "\n".join(
            [
                f"- **{metadata['VARIABLE_NAME'][i]}**: {metadata['DEFINITION'][i]}"
                for i in range(len(metadata["VARIABLE_NAME"]))
            ]
        )
        context = context + f"\n\nAvailable variables by VARIABLE_NAME:\n\n{metadata}"
    return context

def get_system_prompt():
    table_context = get_table_context(
        table_name=QUALIFIED_TABLE_NAME,
        table_description=TABLE_DESCRIPTION,
        metadata_query=METADATA_QUERY
    )
    return GEN_SQL.format(context=table_context)

# do `streamlit run prompts.py` to view the initial system prompt in a Streamlit app
if __name__ == "__main__":
    st.header("System prompt for Frosty")
    st.markdown(get_system_prompt())

Finally, run this file as a Streamlit app to verify the output: streamlit run prompts.py. Make sure the table information shows up as expected in the rendered prompt - this prompt will be passed to the chatbot in the next section.

Build the chatbot

We'll break down the Python file snippet-by-snippet so that you understand the functionality of each section, but if you'd like to skip ahead and download the full file, you can do so here.

  1. Create a file called frosty_app.py and add the below code snippet, which does the following:
    • Adds import statements and a title
    • Retrieves our OpenAI API key from the secrets file
    • Initializes the message history using session state
      • This time, the first assistant message from the chatbot will display information about the current table in the database this app is using. get_system_prompt() retrieves this information.
    • Prompts the user to enter a message and upon receiving a message, adds that message to the chat history
    • Iterates through the message history and displays each message in the app
from openai import OpenAI
import re
import streamlit as st
from prompts import get_system_prompt

st.title("☃️ Frosty")

# Initialize the chat messages history
client = OpenAI(api_key=st.secrets.OPENAI_API_KEY)
if "messages" not in st.session_state:
    # system prompt includes table information, rules, and prompts the LLM to produce
    # a welcome message to the user.
    st.session_state.messages = [{"role": "system", "content": get_system_prompt()}]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    if message["role"] == "system":
        continue
    with st.chat_message(message["role"]):
        st.write(message["content"])
        if "results" in message:
            st.dataframe(message["results"])
  2. Check the last entry in the chat history to see if it was sent by the user or the chatbot. If it was sent by the user, use GPT-3.5 to generate a response. Instead of displaying the entire response at once, use OpenAI's stream parameter to signify that GPT-3.5's response should be sent incrementally in chunks via an event stream, and display the chunks as they're received.
# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        response = ""
        resp_container = st.empty()
        for delta in client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            stream=True,
        ):
            response += (delta.choices[0].delta.content or "")
            resp_container.markdown(response)
  3. Use a regular expression to search the newly generated response for the SQL markdown syntax that we instructed GPT-3.5 to wrap any SQL queries in. If a match is found, use st.connection to execute the SQL query against the database we created in Snowflake. Write the result to the app using st.dataframe, and append the result to the associated message in the message history.
        message = {"role": "assistant", "content": response}
        # Parse the response for a SQL query and execute if available
        sql_match = re.search(r"```sql\n(.*)\n```", response, re.DOTALL)
        if sql_match:
            sql = sql_match.group(1)
            conn = st.connection("snowflake")
            message["results"] = conn.query(sql)
            st.dataframe(message["results"])
        st.session_state.messages.append(message)
  4. Run the Streamlit app via streamlit run frosty_app.py.

Preview of final app

The full contents of the Python file for this app are below, or you can download the file from GitHub.

from openai import OpenAI
import re
import streamlit as st
from prompts import get_system_prompt

st.title("☃️ Frosty")

# Initialize the chat messages history
client = OpenAI(api_key=st.secrets.OPENAI_API_KEY)
if "messages" not in st.session_state:
    # system prompt includes table information, rules, and prompts the LLM to produce
    # a welcome message to the user.
    st.session_state.messages = [{"role": "system", "content": get_system_prompt()}]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the existing chat messages
for message in st.session_state.messages:
    if message["role"] == "system":
        continue
    with st.chat_message(message["role"]):
        st.write(message["content"])
        if "results" in message:
            st.dataframe(message["results"])

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        response = ""
        resp_container = st.empty()
        for delta in client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": m["role"], "content": m["content"]} for m in st.session_state.messages],
            stream=True,
        ):
            response += (delta.choices[0].delta.content or "")
            resp_container.markdown(response)

        message = {"role": "assistant", "content": response}
        # Parse the response for a SQL query and execute if available
        sql_match = re.search(r"```sql\n(.*)\n```", response, re.DOTALL)
        if sql_match:
            sql = sql_match.group(1)
            conn = st.connection("snowflake")
            message["results"] = conn.query(sql)
            st.dataframe(message["results"])
        st.session_state.messages.append(message)

Finally, it's time to explore the Cybersyn Financial & Economic Essentials dataset using natural language. Try asking Frosty any of the following questions:

  1. Which financial institution had the highest total assets in the year 2020?
  2. Which financial institutions in California had the highest total assets value between 2010 to 2015?
  3. What was the highest % insured (estimated) value for all financial institutions in the state of New Jersey?
  4. What is the lowest value of total securities for all financial institutions in Texas?
  5. What was the % change in all real estate loans for banks headquartered in California between 2015 and 2020?
  6. What was the average total securities value for banks in the state of Wisconsin between 2015 and 2020?
  7. How have the total securities value changed over time for financial institutions in New York City?
  8. What was the maximum % insured (estimated) value for a single financial entity in Illinois between 2010 and 2020?
  9. What was the value of all real estate loans for banks located in Massachusetts in 2020?
  10. How many banks headquartered in New Hampshire experienced more than 50% growth in their total assets between 2015 and 2020?

Congratulations – you've just built an LLM-powered chatbot capable of translating natural language to SQL queries and running those queries on data stored in Snowflake!

Where to go from here

This tutorial is just a starting point for exploring the possibilities of LLM-powered chat interfaces for data exploration and question-answering using Snowflake and Streamlit. Here are a few things to try next:

Check out the Frosty session (ML103) from Snowflake Summit 2023 for more ideas and what's coming soon from Snowflake!

Additional resources

Want to learn more about the tools and technologies used by your app? Check out the following resources: