This quickstart introduces ML Sidekick, a no-code app built with Streamlit in Snowflake for building and deploying machine learning models in Snowflake. It simplifies the machine learning process for seasoned data scientists and makes it accessible to business users with no coding experience.

Prerequisites

What You'll Learn

What You'll Build

Overview

You will use Snowsight, the Snowflake web interface, to:

Download datasets

Creating Objects

USE ROLE sysadmin;

-- create the database and schema that will hold the tables
-- create the warehouse used for data ingestion

-- create database
CREATE OR REPLACE DATABASE ML_SIDEKICK;

-- create schema
CREATE OR REPLACE SCHEMA ML_SIDEKICK.TEST_DATA;

-- create ml_sidekick_load_wh
CREATE OR REPLACE WAREHOUSE ml_sidekick_load_wh
	WAREHOUSE_SIZE = 'XSMALL'
	WAREHOUSE_TYPE = 'standard'
	AUTO_SUSPEND = 60
	AUTO_RESUME = TRUE
	INITIALLY_SUSPENDED = TRUE
	COMMENT = 'ml-sidekick standard warehouse for data loading';

USE WAREHOUSE ml_sidekick_load_wh;

-- create tables
-- The files are assumed to be in CSV format. If they are not, see this documentation to create the correct file format: https://docs.snowflake.com/en/sql-reference/sql/create-file-format#syntax


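-- create abalone table
-- The measurement columns hold fractional values, so they are declared FLOAT.
-- A bare NUMBER defaults to NUMBER(38,0) and would round them to whole numbers on load.
CREATE OR REPLACE TABLE ML_SIDEKICK.TEST_DATA.ABALONE
(
SEX VARCHAR,
LENGTH FLOAT,
DIAMETER FLOAT,
HEIGHT FLOAT,
WHOLE_WEIGHT FLOAT,
SHUCKED_WEIGHT FLOAT,
VISCERA_WEIGHT FLOAT,
SHELL_WEIGHT FLOAT,
RINGS INTEGER
);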

-- create diabetes table
CREATE OR REPLACE TABLE ML_SIDEKICK.TEST_DATA.DIABETES
(
Diabetes_binary INTEGER,
HighBP INTEGER,
HighChol INTEGER,
CholCheck INTEGER,
BMI INTEGER,
Smoker INTEGER,
Stroke INTEGER,
HeartDiseaseorAttack INTEGER,
PhysActivity INTEGER,
Fruits INTEGER,
Veggies INTEGER,
HvyAlcoholConsump INTEGER,
AnyHealthcare INTEGER,
NoDocbcCost INTEGER,
GenHlth INTEGER,
MentHlth INTEGER,
PhysHlth INTEGER,
DiffWalk INTEGER,
Sex INTEGER,
Age INTEGER,
Education INTEGER,
Income INTEGER
);
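
The script above creates the tables but no file format. If you would rather define a named file format than rely on the options in the load dialog, a minimal sketch might look like the following. The name CSV_FORMAT and the option values are assumptions; adjust them to match your files (for example, drop SKIP_HEADER if the files have no header row).

```sql
-- Hypothetical named file format for comma-separated files with one header row.
CREATE OR REPLACE FILE FORMAT ML_SIDEKICK.TEST_DATA.CSV_FORMAT
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    EMPTY_FIELD_AS_NULL = TRUE;

-- Confirm that the objects exist before loading any data.
SHOW TABLES IN SCHEMA ML_SIDEKICK.TEST_DATA;
SHOW FILE FORMATS IN SCHEMA ML_SIDEKICK.TEST_DATA;
```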

Loading Data through Snowflake UI

Overview

Creating the ML_Sidekick application

Overview

In this section, we will navigate the deployed app to select the dataset we would like to work with.

Steps for selecting dataset

  1. Launch the app and click on "Create Project" to start the flow.
  2. Select "ML Model". This takes you to the data selection section of the app.
  3. Click on "Data Selection". This lists the datasets available in your Snowflake account.
  4. Select the appropriate database, schema and table as shown below. In this quickstart, we use the Abalone dataset that we loaded earlier, but the same workflow applies to the Diabetes dataset. A snapshot of the selected data appears on the right. Once we are satisfied with the selection, we click on "Next" to begin pre-processing the data.
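
The snapshot shown in the app can also be reproduced from a worksheet, which is a quick way to confirm that the load worked before going further. This sketch assumes the tables were created and loaded under the names used earlier in this quickstart.

```sql
-- Row counts for both tables; a count of zero suggests the load did not run.
SELECT 'ABALONE' AS TABLE_NAME, COUNT(*) AS ROW_COUNT FROM ML_SIDEKICK.TEST_DATA.ABALONE
UNION ALL
SELECT 'DIABETES', COUNT(*) FROM ML_SIDEKICK.TEST_DATA.DIABETES;

-- A small sample of the Abalone data, similar to the preview in the app.
SELECT *
FROM ML_SIDEKICK.TEST_DATA.ABALONE
LIMIT 10;
```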

Overview

After dataset selection comes data preprocessing, a critical step for model performance. The app streamlines this step with built-in options for imputing missing values, encoding categorical columns, and scaling numeric columns, all within the Snowflake environment. The result is clean, consistent data that ML algorithms can learn from. With a few clicks, the data goes from raw to ML-ready, setting the stage for model training.
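
To make two of these operations concrete, here is a rough SQL sketch on the Abalone table: counting missing values and standardizing a numeric column. It illustrates the idea only; it is not the code the app runs, and the choice of columns is arbitrary.

```sql
-- Count missing values per column.
-- COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the number of NULLs.
SELECT
    COUNT(*) - COUNT(SEX)    AS SEX_MISSING,
    COUNT(*) - COUNT(LENGTH) AS LENGTH_MISSING,
    COUNT(*) - COUNT(RINGS)  AS RINGS_MISSING
FROM ML_SIDEKICK.TEST_DATA.ABALONE;

-- Standard scaling: subtract the column mean and divide by the
-- population standard deviation.
-- NULLIF guards against division by zero when a column is constant.
SELECT
    LENGTH,
    (LENGTH - AVG(LENGTH) OVER ())
        / NULLIF(STDDEV_POP(LENGTH) OVER (), 0) AS LENGTH_SCALED
FROM ML_SIDEKICK.TEST_DATA.ABALONE;
```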

Steps for pre-processing selected dataset

  1. Optional - Click on the inspect icon as shown below. This brings up an exploratory data analysis pop-up.
  2. Optional - The pop-up shows descriptive statistics for every column in the selected dataset. Click on the cell next to any column to see a visualization of the value distribution in that column.
  3. After reviewing the data analysis, select the features and target for the machine learning model and click on "Add Step".
  4. "Add Step" lets us pre-process any selected column: encode categorical columns, scale numeric values, or impute missing values. Below, we select the one-hot encoder for the "SEX" column, which is categorical.
  5. Optional - Click on "Generate Preview" to see how the column was encoded.
  6. With the dataset pre-processed, click on the "Next" button to move to the next part of the flow, training the machine learning model.
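
As a rough SQL equivalent of the steps above, the first two queries compute statistics like those in the analysis pop-up, and the third one-hot encodes SEX. The category values 'M', 'F', and 'I' are an assumption based on the public Abalone dataset; the second query shows which values your table actually contains. This is a sketch of the technique, not the code the app generates.

```sql
-- Descriptive statistics for the target column.
SELECT
    COUNT(RINGS)  AS NON_NULL_COUNT,
    MIN(RINGS)    AS MIN_RINGS,
    MAX(RINGS)    AS MAX_RINGS,
    AVG(RINGS)    AS MEAN_RINGS,
    MEDIAN(RINGS) AS MEDIAN_RINGS,
    STDDEV(RINGS) AS STDDEV_RINGS
FROM ML_SIDEKICK.TEST_DATA.ABALONE;

-- Value distribution of the categorical column.
SELECT SEX, COUNT(*) AS ROW_COUNT
FROM ML_SIDEKICK.TEST_DATA.ABALONE
GROUP BY SEX
ORDER BY ROW_COUNT DESC;

-- One-hot encoding: one 0/1 indicator column per assumed category.
SELECT
    IFF(SEX = 'M', 1, 0) AS SEX_M,
    IFF(SEX = 'F', 1, 0) AS SEX_F,
    IFF(SEX = 'I', 1, 0) AS SEX_I,
    LENGTH,
    DIAMETER,
    HEIGHT,
    WHOLE_WEIGHT,
    SHUCKED_WEIGHT,
    VISCERA_WEIGHT,
    SHELL_WEIGHT,
    RINGS
FROM ML_SIDEKICK.TEST_DATA.ABALONE;
```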

Overview

After prepping our data, we move on to training our machine learning model. In this section of the quickstart, we will train a regression model on the prepped Abalone data with Rings as the target. Alternatively, you can train a classification model on the Diabetes data. We will see how easily the app lets us train a machine learning model and register it to the Snowflake registry.

Steps to train machine learning model in Snowflake

  1. Select "Regression" as the model type.
  2. Next, select "XGBRegressor" as the model to train. As shown below, a few other regression models are also available. For classification, the available models are XGBClassifier and Logistic Regression.
  3. Once the model type and model are selected, click on the "Fit Model and Run Prediction(s)" button. This trains the model, makes predictions, and calculates performance metrics, because "Retrieve Model Metrics" is turned on.
  4. Scroll down to view the performance metrics and feature importance.
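
Regression metrics can also be recomputed in SQL once predictions are stored in a table. The sketch below assumes a hypothetical table ABALONE_PREDICTIONS with the actual value in RINGS and the model output in PREDICTED_RINGS; neither name comes from the app, and the app may report a different set of metrics.

```sql
-- ABALONE_PREDICTIONS and PREDICTED_RINGS are hypothetical names.
WITH scored AS (
    SELECT
        RINGS,
        PREDICTED_RINGS,
        AVG(RINGS) OVER () AS MEAN_RINGS  -- mean of the actual values, needed for R2
    FROM ML_SIDEKICK.TEST_DATA.ABALONE_PREDICTIONS
)
SELECT
    -- Mean absolute error
    AVG(ABS(RINGS - PREDICTED_RINGS)) AS MAE,
    -- Root mean squared error
    SQRT(AVG(POWER(RINGS - PREDICTED_RINGS, 2))) AS RMSE,
    -- Coefficient of determination: 1 - SS_residual / SS_total
    1 - SUM(POWER(RINGS - PREDICTED_RINGS, 2))
        / NULLIF(SUM(POWER(RINGS - MEAN_RINGS, 2)), 0) AS R2
FROM scored;
```

Lower MAE and RMSE and an R2 closer to 1 indicate a better fit, which gives a baseline for comparing models later.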

Steps to register model to Snowflake registry

  1. Once the model is trained, we can generate a Jupyter notebook or Snowflake notebook and register the model to the Snowflake Registry.
  2. Click on "Save Register" and provide a name under which to register the trained model.
  3. Once the model is registered, a confirmation message appears.
  4. Optional - Navigate back to the homepage of the app.
  5. Optional - Under Model Registry, the newly registered model should appear along with all its metadata.
  6. Optional - We can train another version of the model and register it as a new version by providing the same name when registering, as seen below.
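
Registered models are schema-level objects, so they can also be inspected from a worksheet. In this sketch the schema and the model name ABALONE_RINGS_MODEL are assumptions; substitute the location the app registers to and the name you entered.

```sql
-- List the models registered in a schema.
SHOW MODELS IN SCHEMA ML_SIDEKICK.TEST_DATA;

-- List every version of one model (hypothetical model name).
SHOW VERSIONS IN MODEL ML_SIDEKICK.TEST_DATA.ABALONE_RINGS_MODEL;
```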

Overview

The ML Sidekick app automatically generates a Jupyter notebook or Snowflake notebook containing the underlying Python code, which makes it easy to explore and customize the machine learning pipeline built so far. The notebook makes the pipeline transparent, serves as a learning aid, and lets us fine-tune models or adapt the code for future projects, all within the familiar Snowflake environment.

Steps to generate notebook for pipeline

  1. Click "Download Notebook" once the model is trained. Fill out the project name, and the database and schema where the Snowflake notebook will be stored.
  2. Click on the "Download" button to download a Jupyter notebook, which provides runnable code that reproduces the pipeline so far.
  3. Optional - To run the code in Snowflake Notebooks instead, click on "Create Snowflake Notebook" to create one from the app.
  4. Optional - To find your Snowflake Notebook, navigate to the Projects tab on the left and click on "Notebooks"; the notebook for the model you just created is listed there.
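
If you prefer SQL to the navigation menu, the notebook can also be located from a worksheet. The database and schema below are placeholders for whatever you entered in step 1.

```sql
-- Lists the notebooks you have access to in the given schema.
SHOW NOTEBOOKS IN SCHEMA ML_SIDEKICK.TEST_DATA;
```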

Overview

The ML Sidekick app offers another powerful feature that simplifies model management by facilitating automatic version control and streamlined model comparison. This means we can easily track different iterations of the models and evaluate their performance without any manual effort.

Steps to explore & compare two different registered models

  1. Select two models to compare as shown below. This requires a second registered model; if you have only one, first create another by repeating the previous sections.
  2. Once we have selected the models, we can explore the performance metrics for both along with their default version and available functions.

Optional - Steps to explore & compare two different versions of same model

  1. Select a model with multiple versions registered.
  2. Select the versions to compare.
  3. Switch to the test tab to test both versions on sample data.
  4. Select the source for the test data and click "Start Test".
  5. Once testing is complete for both versions, the performance metrics and prediction results are displayed.
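
Outside the app, a specific version of a registered model can also be called from SQL. The sketch below is an assumption-heavy illustration: the model name ABALONE_RINGS_MODEL, the version name V1, the PREDICT method, and the list of input columns are all hypothetical. SHOW VERSIONS IN MODEL lists the real version names and available functions, and the expected inputs depend on how the model was trained and pre-processed.

```sql
-- Bind an alias to one version of the (hypothetical) model, then call its method.
WITH MV AS MODEL ML_SIDEKICK.TEST_DATA.ABALONE_RINGS_MODEL VERSION V1
SELECT
    RINGS,
    MV!PREDICT(
        LENGTH, DIAMETER, HEIGHT,
        WHOLE_WEIGHT, SHUCKED_WEIGHT, VISCERA_WEIGHT, SHELL_WEIGHT
    ) AS PREDICTION
FROM ML_SIDEKICK.TEST_DATA.ABALONE
LIMIT 10;
```

Running the same query with a second version name gives a side-by-side view of how the two versions score the same rows.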

Congratulations! You have successfully deployed and used the ML_SIDEKICK application to:

And you did all of it within the secure walls of Snowflake!

What You Learned

Related Resources