In this tutorial, you'll learn how to process and analyze customer review data using an LLM-powered data processing workflow with Snowflake Cortex followed by building a dashboard. Briefly, you'll work with the Avalanche dataset, which contains customer reviews for a hypothetical winter sports gear company, and transform unstructured text data into actionable insights through sentiment analysis and data visualization.
By the end of this guide, you'll have built a complete analytics pipeline that extracts customer review data from DOCX files, processes it using Snowflake's Cortex functions for language translation and sentiment analysis, and visualizes the results using Streamlit.
A customer review analytics dashboard that processes unstructured text data and visualizes sentiment trends across products and time periods.
Firstly, to follow along with this quickstart, you can click on Avalanche-Customer-Review-Analytics.ipynb to download the Notebook from GitHub.
Snowflake Notebooks come pre-installed with common Python libraries for data science and machine learning, such as numpy
, pandas
, matplotlib
, and more! If you are looking to use other packages, click on the Packages dropdown on the top right to add additional packages to your notebook.
Secondly, let's create the Avalanche database and schema by running the following in a SQL worksheet:
CREATE DATABASE IF NOT EXISTS avalanche_db;
CREATE SCHEMA IF NOT EXISTS avalanche_schema;
Thirdly, create the stage:
CREATE STAGE IF NOT EXISTS avalanche_db.avalanche_schema.customer_reviews
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
DIRECTORY = (ENABLE = true);
You should see a confirmation message after a stage has been created succesfully:
Fourthly, download the Avalanche customer review data, which is comprised of customer reviews for a hypothetical winter sports gear company. The customer review data is in DOCX format and can is available as customer_reviews_docx.zip.
And upload the 100 DOCX files to the stage. Particularly, from Snowsight click on the "Data" icon from the menu bar located in the left panel. Next, click on "Databases" and in the next panel navigate through the database hierarchy as follows:
AVALANCHE_DB > AVALANCHE_SCHEMA > Stages > CUSTOMER_REVIEWS
Then, in the top-right hand corner click on the blue "+ Files" button. Followed by browsing to and selecting the 100 DOCX files of the customer review data that we had unzipped from the `customer_reviews_docx.zip file.
You'll see a modal pop-up where you can upload the 100 DOCX files:
Afterwards, you'll see the files listed in the main panel. Also make note of the previously mentioned hierarchical list of the database, schema and stage that we navigated through:
Finally, we can verify that the files have been uploaded successfully to the stage by opening up a SQL worksheet, from the left menu sidebar go to Projects > Worksheets > + (Create SQL Worksheet)
Next, enter the following in a SQL worksheet:
`ls @avalanche_db.avalanche_schema.customer_reviews`
This should yield the following results output:
Our first step is to query and parse the content from DOCX files stored in the Snowflake stage. We'll use the SNOWFLAKE.CORTEX.PARSE_DOCUMENT
function to extract the content from these files.
-- Parse content from DOCX files
WITH files AS (
SELECT
REPLACE(REGEXP_SUBSTR(file_url, '[^/]+$'), '%2e', '.') as filename
FROM DIRECTORY('@avalanche_db.avalanche_schema.customer_reviews')
WHERE filename LIKE '%.docx'
)
SELECT
filename,
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
@avalanche_db.avalanche_schema.customer_reviews,
filename,
{'mode': 'layout'}
):content AS layout
FROM files;
This query does the following:
files
that extracts the filenames from the file URLsPARSE_DOCUMENT
function to extract the content from each file in "layout" modeThe PARSE_DOCUMENT
function is particularly useful for extracting text from various document formats, including DOCX, PDF, and others. The "layout" mode preserves the document's structure, making it easier to extract specific sections.
Running the above query should yield the following table output:
Now that we have the raw content from the DOCX files, we need to reshape it into a more structured format. We'll use regular expressions to extract specific information such as product name, date, and the customer review text.
-- Extract PRODUCT name, DATE, and CUSTOMER_REVIEW from the LAYOUT column
SELECT
filename,
REGEXP_SUBSTR(layout, 'Product: (.*?) Date:', 1, 1, 'e') as product,
REGEXP_SUBSTR(layout, 'Date: (202[0-9]-[0-9]{2}-[0-9]{2})', 1, 1, 'e') as date,
REGEXP_SUBSTR(layout, '## Customer Review\n([\\s\\S]*?)$', 1, 1, 'es') as customer_review
FROM previous_query_result;
This query uses REGEXP_SUBSTR
to extract:
The regular expression patterns are designed to match the specific structure of the documents, extracting only the relevant information. This transformation converts unstructured text into a structured format that's easier to analyze.
The result table should look like the following:
With our data now in a structured format, we can apply Snowflake's Cortex LLM functions to perform advanced text analysis. We'll use three key functions:
TRANSLATE
: To convert any non-English reviews to EnglishSUMMARIZE
: To create concise summaries of the reviewsSENTIMENT
: To analyze the sentiment of each review-- Perform translation, summarization and sentiment analysis on customer review
SELECT
product,
date,
SNOWFLAKE.CORTEX.TRANSLATE(customer_review, '', 'en') as translated_review,
SNOWFLAKE.CORTEX.SUMMARIZE(translated_review) as summary,
SNOWFLAKE.CORTEX.SENTIMENT(translated_review) as sentiment_score
FROM previous_query_result
ORDER BY date;
This query:
The sentiment score is a numerical value that indicates the overall sentiment of the review, with positive values indicating positive sentiment and negative values indicating negative sentiment.
Results from the query is shown below:
To prepare for visualization, we'll convert the SQL results to a Pandas DataFrame:
df = sql_result.to_pandas()
This conversion allows us to use Python's data visualization libraries to create interactive charts and graphs.
Now that we have our processed data in a Pandas DataFrame, we can create visualizations to gain insights into the sentiment of customer reviews. We'll use Streamlit for the interactive dashboard and Altair for the charts.
First, let's create a bar chart showing sentiment scores by date:
import streamlit as st
import altair as alt
import pandas as pd
# Ensure SENTIMENT_SCORE is numeric
df['SENTIMENT_SCORE'] = pd.to_numeric(df['SENTIMENT_SCORE'])
# Create the base chart with bars
chart = alt.Chart(df).mark_bar(size=15).encode(
x=alt.X('DATE:T',
axis=alt.Axis(
format='%Y-%m-%d', # YYYY-MM-DD format
labelAngle=90) # Rotate labels 90 degrees
),
y=alt.Y('SENTIMENT_SCORE:Q'),
color=alt.condition(
alt.datum.SENTIMENT_SCORE >= 0,
alt.value('#2ecc71'), # green for positive
alt.value('#e74c3c') # red for negative
),
tooltip=['PRODUCT:N', 'DATE:T'] # Add tooltip
).properties(
height=500
)
# Display the chart
st.altair_chart(chart, use_container_width=True)
This chart:
Next, let's create a bar chart showing average sentiment scores by product:
import streamlit as st
import altair as alt
import pandas as pd
# Create the base chart with aggregation by PRODUCT
bars = alt.Chart(df).mark_bar(size=15).encode(
y=alt.Y('PRODUCT:N',
axis=alt.Axis(
labelAngle=0, # Horizontal labels
labelOverlap=False, # Prevent label overlap
labelPadding=10 # Add some padding
)
),
x=alt.X('mean(SENTIMENT_SCORE):Q', # Aggregate mean sentiment score
title='MEAN SENTIMENT_SCORE'),
color=alt.condition(
alt.datum.mean_SENTIMENT_SCORE >= 0,
alt.value('#2ecc71'), # green for positive
alt.value('#e74c3c') # red for negative
),
tooltip=['PRODUCT:N', 'mean(SENTIMENT_SCORE):Q']
).properties(
height=400
)
# Display the chart
st.altair_chart(bars, use_container_width=True)
This chart:
Finally, let's add a download button so users can export the processed data:
# Download button for the CSV file
st.subheader('Processed Customer Reviews Data')
st.download_button(
label="Download CSV",
data=df[['PRODUCT', 'DATE', 'SUMMARY', 'SENTIMENT_SCORE']].to_csv(index=False).encode('utf-8'),
mime="text/csv"
)
This button allows users to download a CSV file containing the product name, date, summary, and sentiment score for each review.
Congratulations! You've successfully built an end-to-end workflow for creation of a customer review analytics dashboard that processes unstructured text data and finally visualizes sentiment trends. This pipeline demonstrates how to leverage Snowflake's data processing capabilities and Cortex LLM functions to process and extract valuable insights from customer feedback.
By analyzing sentiment across products and time periods, businesses can identify trends, address customer concerns, and improve their products and services. The interactive visualizations make it easy to spot patterns and outliers, enabling data-driven decision-making.
Documentation:
Happy coding!