This Quickstart demonstrates how to perform Natural Language Processing (NLP) and machine learning (ML) within Snowflake using Snowpark for Python and Streamlit. We'll use these tools for sentiment analysis with Snowpark, covering feature engineering, training, and prediction.
You will build an end-to-end Data Science workflow leveraging Snowpark for Python and Streamlit around the Sentiment Analysis use case.
This section covers cloning of the GitHub repository and creating a Python 3.8 environment.
First, clone the source code for this repo to your local environment:
git clone https://github.com/Snowflake-Labs/snowpark-python-demos.git
cd snowpark-python-demos/snowpark_nlp_ml_demo/
Create a conda environment. Let's name the environment nlp_ml_sentiment_analysis.
conda update conda
conda update python
conda env create -f ./snowpark-env/conda-env_nlp_ml_sentiment_analysis.yml --force
Update the Snowflake connection file, connection.json, with your account details:
{
"account": "",
"user": "",
"password": "",
"role": "ACCOUNTADMIN",
"database": "IMDB",
"schema": "PUBLIC",
"warehouse": "DEMO_WH"
}
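Before launching the app, it can help to confirm that connection.json has been filled in. The following is a minimal sketch and not part of the repository: the helper names load_connection_params and missing_fields are hypothetical, and the list of required keys is simply the set of keys shown in the file above.

```python
import json
from pathlib import Path

# Keys taken from the connection.json template shown above.
REQUIRED_KEYS = ("account", "user", "password", "role", "database", "schema", "warehouse")


def load_connection_params(path):
    """Read the connection file and return its contents as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def missing_fields(params):
    """Return the required keys that are absent or left as empty strings."""
    return [key for key in REQUIRED_KEYS if not str(params.get(key) or "").strip()]


if __name__ == "__main__":
    config_path = Path("connection.json")
    if config_path.exists():
        params = load_connection_params(config_path)
        print("Missing fields:", missing_fields(params) or "none")
    else:
        print("connection.json not found in the current directory")
```

Once no fields are missing, the same dict can be handed to Snowpark with Session.builder.configs(params).create().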
conda activate nlp_ml_sentiment_analysis
cd streamlit
streamlit run Sentiment_Analysis_APP.py
The full code for the use case is also available in the notebook Sentiment_Analysis_NLP_with_Snowpark_ML.ipynb. Once the setup is complete (the Snowflake objects are created and the data is loaded), you can run the entire notebook.
cd notebook
jupyter notebook
Use the Streamlit app to set up the Snowflake objects.
Make sure you have this result:
You can check directly with Snowsight that the data are available in Snowflake.
First, log into your Snowflake Account (Snowsight Web UI) using your credentials.
Then, run the following SQL commands to create the DATABASE:
USE ROLE ACCOUNTADMIN;
CREATE DATABASE IF NOT EXISTS IMDB;
Run the following SQL commands to create the TABLES:
USE DATABASE IMDB;
USE SCHEMA PUBLIC;
CREATE TABLE IF NOT EXISTS TRAIN_DATASET (
    REVIEW STRING,
    SENTIMENT STRING
);
CREATE TABLE IF NOT EXISTS TEST_DATASET (
    REVIEW STRING,
    SENTIMENT STRING
);
Run the following SQL commands to create the WAREHOUSE:
CREATE WAREHOUSE IF NOT EXISTS DEMO_WH WAREHOUSE_SIZE=MEDIUM INITIALLY_SUSPENDED=TRUE AUTO_SUSPEND=120;
Run the following SQL commands to create the STAGE:
CREATE STAGE IF NOT EXISTS MODELS;
USE IMDB.PUBLIC;
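If you prefer to script the setup instead of pasting SQL into Snowsight, the same statements can be sent through a Snowpark session. This is a sketch under assumptions: run_setup is a hypothetical helper (the excerpt does not show how the Streamlit app creates the objects), and session is assumed to be an open Snowpark Session, of which only the sql(...).collect() calls are used.

```python
# Setup statements copied from the SQL shown above, in the same order.
SETUP_STATEMENTS = [
    "USE ROLE ACCOUNTADMIN",
    "CREATE DATABASE IF NOT EXISTS IMDB",
    "USE DATABASE IMDB",
    "USE SCHEMA PUBLIC",
    "CREATE TABLE IF NOT EXISTS TRAIN_DATASET (REVIEW STRING, SENTIMENT STRING)",
    "CREATE TABLE IF NOT EXISTS TEST_DATASET (REVIEW STRING, SENTIMENT STRING)",
    "CREATE WAREHOUSE IF NOT EXISTS DEMO_WH WAREHOUSE_SIZE=MEDIUM "
    "INITIALLY_SUSPENDED=TRUE AUTO_SUSPEND=120",
    "CREATE STAGE IF NOT EXISTS MODELS",
]


def run_setup(session, statements=SETUP_STATEMENTS):
    """Run each statement with session.sql(...).collect() and return how many ran."""
    executed = 0
    for statement in statements:
        # sql() only builds a lazy DataFrame; collect() forces execution.
        session.sql(statement).collect()
        executed += 1
    return executed
```

Because every statement uses IF NOT EXISTS, running the helper a second time should leave existing objects untouched.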
We used Python code to load the data into Snowflake. To simplify execution, you can click the button on the right to start loading the data.
Use the Load Data section:
Here is the display that we expect after the execution.
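# z is assumed to be an open zipfile.ZipFile that contains the CSV files
with z.open("TRAIN_DATASET.csv") as f:
    pandas_df = pd.read_csv(f)
# Load the DataFrame into the existing TRAIN_DATASET table
session.write_pandas(pandas_df, "TRAIN_DATASET", auto_create_table=False, overwrite=True)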
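Since the tables already exist and auto_create_table is False, the CSV columns should line up with the table columns (REVIEW, SENTIMENT). As a sketch only, the header can be checked before loading. The helper names below are hypothetical, and the code assumes the CSV has a header row, which the excerpt does not confirm.

```python
import csv
import io
import zipfile

# Columns of the TRAIN_DATASET and TEST_DATASET tables created during setup.
EXPECTED_COLUMNS = ["REVIEW", "SENTIMENT"]


def csv_header_from_zip(zip_path, member):
    """Return the first row (assumed to be the header) of a CSV inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return next(csv.reader(text), [])


def header_matches(header, expected=EXPECTED_COLUMNS):
    """Compare column names case-insensitively, ignoring surrounding whitespace."""
    return [name.strip().upper() for name in header] == expected
```

A call such as header_matches(csv_header_from_zip(path_to_archive, "TRAIN_DATASET.csv")) returns False when the columns are missing, renamed, or in a different order, which is cheaper to find out locally than after an upload.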
Use the Analyze section to explore and analyze the datasets and see some metrics.
Choose the dataset that you want to analyze:
Here are some statistics related to the dataset:
You can see a sample of the data:
Here is a description of your dataset:
table_to_print = "TRAIN_DATASET"
df_table = session.table(table_to_print)
df_table.count()
pd_table = df_table.limit(10).to_pandas()
pd_describe = df_table.describe().to_pandas()
Use the Train Model section:
Choose the training dataset to build the model:
Select a Virtual Warehouse:
To run the model training, click on the button below:
We created a function called train_model_review_pipline():
def train_model_review_pipline(session: Session, train_dataset_name: str) -> Variant:
    ...
that will do the following steps:
Then we registered the function as a Stored Procedure:
session.sproc.register(func=train_model_review_pipline, name="train_model_review_pipline", replace=True)
We then use this Python code to call the Stored Procedure, which executes the training inside Snowflake on a Snowflake Virtual Warehouse:
session.call("train_model_review_pipline", "TRAIN_DATASET")
You can also execute the training from Snowsight directly with SQL code:
CALL train_model_review_pipline('TRAIN_DATASET');
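The app lets you select a Virtual Warehouse before training, but the excerpt does not show how that choice is applied. One possible way, sketched below, is to switch the session's warehouse and then call the stored procedure. train_on_warehouse is a hypothetical helper, and session is assumed to be an open Snowpark Session (only use_warehouse and call are used).

```python
def train_on_warehouse(session, warehouse, train_dataset_name):
    """Switch the session to `warehouse`, then call the training stored procedure."""
    session.use_warehouse(warehouse)
    # The procedure name matches the one registered with session.sproc.register above.
    return session.call("train_model_review_pipline", train_dataset_name)


# Example call, assuming the objects from the setup section exist:
# result = train_on_warehouse(session, "DEMO_WH", "TRAIN_DATASET")
```

Keeping the warehouse as a parameter makes it easy to compare training time across warehouse sizes without touching the procedure itself.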
@udf(name='predict_review', session=session, is_permanent=False, stage_location='@MODELS', replace=True)
def predict_review(args: list) -> float:
    import sys
    import pandas as pd
    from joblib import load

    # load_file is a helper that is not shown in this excerpt
    model = load_file("model_review.joblib")
    vec = load_file("vect_review.joblib")

    # Rebuild a one-row DataFrame from the positional arguments
    features = ["REVIEW", "SENTIMENT_FLAG"]
    row = pd.DataFrame([args], columns=features)

    # Vectorize the review text, then predict its sentiment
    bowTest = vec.transform(row.REVIEW.values)
    return model.predict(bowTest)
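The variable name bowTest suggests that vec is a bag-of-words vectorizer, although the excerpt does not say which implementation is stored in vect_review.joblib. As a conceptual illustration only, here is what a bag-of-words transform does, written in pure Python. The functions tokenize, fit_vocabulary, and transform are hypothetical stand-ins, not the fitted object used by the UDF.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocabulary(reviews):
    """Map every distinct token in the training reviews to a column index."""
    tokens = sorted({token for review in reviews for token in tokenize(review)})
    return {token: index for index, token in enumerate(tokens)}


def transform(review, vocabulary):
    """Return the count vector for one review; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokenize(review)).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


vocabulary = fit_vocabulary(["A great movie", "A boring movie"])
print(vocabulary)                                     # {'a': 0, 'boring': 1, 'great': 2, 'movie': 3}
print(transform("great great acting", vocabulary))    # [0, 0, 2, 0]
```

The word "acting" never appeared during fitting, so it contributes nothing to the vector. This is why the vectorizer fitted at training time has to be saved and reloaded at prediction time rather than refitted.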
Use the Model Monitoring section. You can also use Snowsight (the Snowflake UI) to get more details and view the Query Details and Query Profile.
Use the Model Catalog section. Here you can see the models that you deployed and saved in Snowflake (Stage):
Use the Inference section to analyze the Test Dataset and see the accuracy of your model after inference.
Analyze Test Dataset: Click on the Test Dataset sub-section to explore the dataset.
Accuracy: Click on the Accuracy sub-section to see the details.
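Accuracy here is the share of reviews whose predicted sentiment matches the labeled one. The sketch below shows the calculation on made-up labels. The accuracy helper is hypothetical, and the sample values are for illustration only, not results from the IMDB data.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Illustrative labels only.
actual = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
print(accuracy(actual, predicted))  # 0.75
```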
Select the new dataset that you want to predict and the Inference will run automatically.
Use the Clean Up section to remove all the Snowflake objects and the data that you loaded:
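The statements that the app runs during cleanup are not shown here. If you want to clean up by hand, a sketch such as the following builds DROP statements for the objects created during setup. cleanup_statements is a hypothetical helper, and the default names are the ones used earlier in this guide.

```python
def cleanup_statements(database="IMDB", warehouse="DEMO_WH"):
    """Build DROP statements for the objects created during setup."""
    return [
        # Dropping the database also removes its tables and the MODELS stage.
        f"DROP DATABASE IF EXISTS {database}",
        f"DROP WAREHOUSE IF EXISTS {warehouse}",
    ]


for statement in cleanup_statements():
    print(statement)
```

Each statement can be pasted into Snowsight or run with session.sql(statement).collect(). These commands are destructive, so check the names before running them.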
Congratulations! You've successfully performed the Sentiment Analysis use case and built an end-to-end Data Science workflow leveraging Snowpark for Python and Streamlit.
In this quickstart we demonstrated how Snowpark Python enables rapid, end-to-end machine learning workload development, deployment, and orchestration. We were also able to experience how Snowpark for Python enables you to use familiar syntax and constructs to process data where it lives with Snowflake's elastic, scalable and secure engine, accelerating the path to production for data pipelines and ML workflows.