Through this quickstart guide, you will get an introduction to Snowflake for Machine Learning. You will set up your Snowflake and Python environments and build an end-to-end ML workflow, from feature engineering to model training and batch inference, with Snowflake ML, all from a set of unified Python APIs in the Snowpark ML library.

What is Snowflake ML?

Snowflake ML is the integrated set of capabilities for end-to-end machine learning in a single platform on top of your governed data. Data scientists and ML engineers can easily and securely develop and productionize scalable features and models without any data movement, silos or governance tradeoffs.

Capabilities for custom ML include:

snowflake_ml_overview

To get started with Snowflake ML, developers can use the Python APIs from the Snowpark ML library, either directly from Snowflake Notebooks (public preview) or by downloading and installing the library into any IDE of choice, including Jupyter or Hex.

Feature Engineering and Preprocessing: Improve performance and scalability with distributed execution for common scikit-learn preprocessing functions.

Model Training: Accelerate model training for scikit-learn, XGBoost and LightGBM models without the need to manually create stored procedures or user-defined functions (UDFs), and leverage distributed hyperparameter optimization.

snowpark_ml_modeling_overview

Model Management, Batch Inference, and Model Explainability: Manage several types of ML models created both within and outside Snowflake, execute batch inference, and understand which features the model considers most impactful when generating predictions.

snowflake_model_registry

Snowflake ML provides the following advantages:

The first batch of algorithms provided in Snowpark ML Modeling is based on scikit-learn preprocessing transformations from sklearn.preprocessing, as well as estimators that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries.
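For example, the Snowpark ML version of a familiar transformer keeps the scikit-learn look and feel but runs against a Snowpark DataFrame inside Snowflake. The snippet below is a minimal sketch, assuming the diamonds data has already been landed in a table named DIAMONDS (that table is created later in this quickstart); aside from the input_cols/output_cols arguments, the API mirrors its scikit-learn counterpart.

from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.preprocessing import OrdinalEncoder

session = get_active_session()  # inside a Snowflake Notebook, a session is already available
diamonds_df = session.table("DIAMONDS")  # hypothetical table name; created in the ingestion notebook

# Same concept as sklearn.preprocessing.OrdinalEncoder, but fit/transform execute in Snowflake
encoder = OrdinalEncoder(
    input_cols=["CUT"],       # column(s) to encode
    output_cols=["CUT_OE"],   # column(s) that receive the encoded values
)
encoded_df = encoder.fit(diamonds_df).transform(diamonds_df)
encoded_df.show()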

Learn more about Snowpark ML Modeling and Snowflake Model Registry.

What you will learn

This quickstart will focus on building a custom ML workflow using the following features:

Prerequisites

What You'll Build

E2E ML Workflow with Snowflake ML

To get started using Snowflake Notebooks, first log in to Snowsight and run the following setup.sql in a SQL worksheet (we need to create the database, warehouse, schema, etc. that we will use for our ML project).

USE ROLE SYSADMIN;
CREATE OR REPLACE WAREHOUSE ML_HOL_WH; --by default, this creates an XS Standard Warehouse
CREATE OR REPLACE DATABASE ML_HOL_DB;
CREATE OR REPLACE SCHEMA ML_HOL_SCHEMA;
CREATE OR REPLACE STAGE ML_HOL_ASSETS; --to store model assets

-- create csv format
CREATE FILE FORMAT IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    SKIP_HEADER = 1 
    TYPE = 'CSV';

-- create external stage with the csv format to stage the diamonds dataset
CREATE STAGE IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.DIAMONDS_ASSETS 
    FILE_FORMAT = ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    URL = 's3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv';
    -- https://sfquickstarts.s3.us-west-1.amazonaws.com/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv

LS @DIAMONDS_ASSETS;

Now, let's create our first Notebook by importing a .ipynb file. First, make sure your current role is SYSADMIN. Then, select the Notebooks tab under the Projects dropdown in the left sidebar:

Notebook Dropdown

Next, click the gray upload/import .ipynb button, and select 0_start_here.ipynb from your local filesystem:

Notebook Upload

Leave the populated notebook name as-is (or change it if you'd like!), and make sure that the location is set to ML_HOL_DB and ML_HOL_SCHEMA. Lastly, make sure the Notebook warehouse is ML_HOL_WH, and click Create:

Notebook Config

This will create and open the notebook you just uploaded.

Now, upload the provided environment.yml file.

Environment File Upload

Then, click Start and run the Notebook from start to finish!

Repeat this process with all the other Notebooks to see how easy it is to write Python and SQL code in a single, familiar Notebook interface directly in Snowsight!

Open the following notebook in Snowflake Notebooks and run each of the cells: 0_start_here.ipynb

Within this notebook, we will clean the diamonds dataset and ingest it from an external stage into a Snowflake table. The diamonds dataset has been widely used in data science and machine learning, and we will use it to demonstrate Snowflake's native data science transformers throughout this quickstart.

The overall goal of this ML project is to predict the price of diamonds given different qualitative and quantitative attributes.
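At a high level, the ingestion step reads the staged CSV into a Snowpark DataFrame and persists it as a table. The sketch below is illustrative rather than a copy of the notebook: the column order follows the public diamonds dataset, and the target table name DIAMONDS is an assumption.

from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

session = get_active_session()

# Columns of the public diamonds dataset, in file order
diamonds_schema = StructType([
    StructField("CARAT", DoubleType()),
    StructField("CUT", StringType()),
    StructField("COLOR", StringType()),
    StructField("CLARITY", StringType()),
    StructField("DEPTH", DoubleType()),
    StructField("TABLE_PCT", DoubleType()),
    StructField("PRICE", DoubleType()),
    StructField("X", DoubleType()),
    StructField("Y", DoubleType()),
    StructField("Z", DoubleType()),
])

# Read the CSV staged in DIAMONDS_ASSETS (see setup.sql) and save it as a table
diamonds_df = (
    session.read
    .options({"SKIP_HEADER": 1, "FIELD_OPTIONALLY_ENCLOSED_BY": '"'})
    .schema(diamonds_schema)
    .csv("@DIAMONDS_ASSETS")
)
diamonds_df.write.mode("overwrite").save_as_table("DIAMONDS")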

Open the following notebook in Snowflake Notebooks and run each of the cells: 1_sf_nb_snowflake_ml_feature_transformations.ipynb

In this notebook, we will walk through a few transformations on the diamonds dataset that are included in the Snowpark ML Modeling API. We will also build a preprocessing pipeline to be used in the ML modeling notebook.
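As a rough sketch of what such a pipeline can look like (the column lists below are illustrative names from the diamonds table; the notebook defines the exact lists and also persists the fitted pipeline for reuse):

from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import OrdinalEncoder, MinMaxScaler

session = get_active_session()
diamonds_df = session.table("DIAMONDS")  # hypothetical table created by the ingestion notebook

CATEGORICAL_COLUMNS = ["CUT", "COLOR", "CLARITY"]
CATEGORICAL_COLUMNS_OE = ["CUT_OE", "COLOR_OE", "CLARITY_OE"]
NUMERICAL_COLUMNS = ["CARAT", "DEPTH", "TABLE_PCT", "X", "Y", "Z"]

# Encode the categorical columns and scale the numeric ones; both steps run inside Snowflake
preprocessing_pipeline = Pipeline(steps=[
    ("OE", OrdinalEncoder(input_cols=CATEGORICAL_COLUMNS, output_cols=CATEGORICAL_COLUMNS_OE)),
    ("MMS", MinMaxScaler(input_cols=NUMERICAL_COLUMNS, output_cols=NUMERICAL_COLUMNS)),
])

transformed_df = preprocessing_pipeline.fit(diamonds_df).transform(diamonds_df)
transformed_df.show()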

Open the following notebook in Snowflake Notebooks and run each of the cells: 2_sf_nb_snowflake_ml_model_training_inference.ipynb

In this notebook, we will illustrate how to train an XGBoost model with the diamonds dataset using the Snowpark ML Modeling API. We will also show how to execute batch inference and model explainability through the Snowflake Model Registry.
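The sketch below shows the general shape of that flow, continuing from the preprocessing sketch above. The table, model, and column names are illustrative, and depending on your snowflake-ml-python version, explainability may need to be enabled via log_model options.

from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.registry import Registry

session = get_active_session()
transformed_df = session.table("DIAMONDS_TRANSFORMED")  # hypothetical table holding the preprocessed data

CATEGORICAL_COLUMNS_OE = ["CUT_OE", "COLOR_OE", "CLARITY_OE"]
NUMERICAL_COLUMNS = ["CARAT", "DEPTH", "TABLE_PCT", "X", "Y", "Z"]

train_df, test_df = transformed_df.random_split(weights=[0.9, 0.1], seed=0)

# Train an XGBoost regressor; no stored procedure or UDF to write by hand
regressor = XGBRegressor(
    input_cols=CATEGORICAL_COLUMNS_OE + NUMERICAL_COLUMNS,
    label_cols=["PRICE"],
    output_cols=["PREDICTED_PRICE"],
)
regressor.fit(train_df)

# Log the fitted model to the Snowflake Model Registry
registry = Registry(session=session, database_name="ML_HOL_DB", schema_name="ML_HOL_SCHEMA")
model_version = registry.log_model(
    regressor,
    model_name="DIAMONDS_PRICE_PREDICTION",  # illustrative name
    version_name="V1",
    comment="XGBoost regressor trained on the diamonds dataset",
)

# Batch inference (and, if enabled when logging, explainability) run against Snowpark DataFrames
predictions_df = model_version.run(test_df, function_name="predict")
explanations_df = model_version.run(test_df, function_name="explain")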

Open the following notebook in Snowflake Notebooks and run each of the cells: 3_sf_nb_snowpark_ml_adv_mlops.ipynb

In this notebook, we will show you how to manage Machine Learning models from experimentation to production using existing (Snowpark ML Modeling & Model Registry) and new Snowflake MLOps features:

We will also go into more detail on using the Model Registry API.
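For instance, a few of the Model Registry calls covered there look roughly like this (the model and version names assume the model logged in the previous notebook; the metric value is made up):

from snowflake.snowpark.context import get_active_session
from snowflake.ml.registry import Registry

session = get_active_session()
registry = Registry(session=session, database_name="ML_HOL_DB", schema_name="ML_HOL_SCHEMA")

# Browse what has been registered so far
print(registry.show_models())

model = registry.get_model("DIAMONDS_PRICE_PREDICTION")
print(model.show_versions())

# Attach an evaluation metric to a version and make it the default version
mv = model.version("V1")
mv.set_metric("mean_abs_pct_err", 0.026)  # illustrative value
model.default = "V1"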

Congratulations, you have successfully completed this quickstart! This quickstart showcased Snowflake for Machine Learning by introducing Snowpark ML, the Python library and underlying infrastructure for data science and machine learning tasks. You can now run data preprocessing, feature engineering, model training, and batch inference in a few lines of code without having to define and deploy stored procedures that package scikit-learn, xgboost, or lightgbm code. You can also manage your models from iteration to production and trace your ML lineage to better understand how machine learning artifacts relate to each other.

For more information, check out the resources below:

Related Resources