Through this quickstart guide, you will explore what's new in Snowpark for Machine Learning. You will set up your Snowflake and Python environments and build an end-to-end ML workflow, from feature engineering to model training and deployment, using Snowpark ML.

What is Snowpark?

Snowpark is the set of libraries and runtimes that securely enable developers to deploy and process Python code in Snowflake.

Familiar Client-Side Libraries - Snowpark brings deeply integrated, DataFrame-style programming and OSS-compatible APIs to the languages data practitioners like to use. It also includes a set of Snowpark ML APIs for more efficient ML modeling (public preview) and ML operations (private preview).

Flexible Runtime Constructs - Snowpark provides flexible runtime constructs that allow users to bring in and run custom logic. Developers can seamlessly build data pipelines, ML models, and data applications with User-Defined Functions and Stored Procedures.
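
As a flavor of those runtime constructs, here is a minimal, illustrative sketch of registering custom Python logic as a UDF. It assumes an existing Snowpark Session named session and a DIAMONDS table; the function and column names are hypothetical, not part of this quickstart's code:

from snowflake.snowpark.functions import udf, col

# Illustrative only: registers a temporary Python UDF in Snowflake.
# Assumes an active Snowpark Session `session` (connection setup is
# covered later in this guide).
@udf(name="price_per_carat", replace=True)
def price_per_carat(price: float, carat: float) -> float:
    # This body executes inside Snowflake's Python runtime, next to the data.
    return price / carat

df = session.table("DIAMONDS")
df.select(price_per_carat(col("PRICE"), col("CARAT"))).show()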

Learn more about Snowpark.

[Image: Snowpark architecture diagram]

What is Snowpark ML?

Snowpark ML is a new library for faster and more intuitive end-to-end ML development in Snowflake. Snowpark ML has two APIs: Snowpark ML Modeling (public preview) for model development and Snowpark ML Operations (private preview) for model deployment.

This quickstart will focus on the Snowpark ML Modeling API, which scales out feature engineering and simplifies ML training execution.

Preprocessing: Improve performance and scalability with distributed, multi-node execution for common feature engineering functions

Model Training: Execute training for popular scikit-learn and xgboost models without manual creation of Stored Procedures or UDFs
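
To make the second point concrete, training looks like familiar scikit-learn/xgboost code, but fit() executes inside Snowflake. The following is a hedged sketch, not the notebook's exact code; it assumes a Snowpark DataFrame train_df with the listed (illustrative) columns:

from snowflake.ml.modeling.xgboost import XGBRegressor

# Sketch only: column names are illustrative. fit() runs training
# inside Snowflake, so no Stored Procedure or UDF is written by hand.
regressor = XGBRegressor(
    input_cols=["CARAT", "DEPTH_NORM"],   # feature columns
    label_cols=["PRICE"],                 # target column
    output_cols=["PREDICTED_PRICE"],      # column added by predict()
)
regressor.fit(train_df)                   # train_df is a Snowpark DataFrame
predictions = regressor.predict(train_df)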

[Image: Snowpark ML overview diagram]

By letting you perform these tasks in a Snowflake Python application, Snowpark ML keeps feature engineering and model training next to your data, so you don't have to export it to a separate environment.

The first batch of algorithms provided in Snowpark ML is based on scikit-learn preprocessing transformations from sklearn.preprocessing, as well as estimators that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries.

Learn more about the Snowpark ML Modeling API.

What You Will Learn

Prerequisites

What You'll Build

[Image: end-to-end ML workflow]

Run the following SQL commands in a SQL worksheet to create the warehouse, database, schema, stages, and file format.

USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE WAREHOUSE ML_HOL_WH; --by default, this creates an XS Standard Warehouse
CREATE OR REPLACE DATABASE ML_HOL_DB;
CREATE OR REPLACE SCHEMA ML_HOL_SCHEMA;
CREATE OR REPLACE STAGE ML_HOL_ASSETS; --to store model assets

-- create csv format
CREATE FILE FORMAT IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    SKIP_HEADER = 1 
    TYPE = 'CSV';

-- create external stage with the csv format to stage the diamonds dataset
CREATE STAGE IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.DIAMONDS_ASSETS 
    FILE_FORMAT = ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    URL = 's3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv';
    -- https://sfquickstarts.s3.us-west-1.amazonaws.com/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv

LS @DIAMONDS_ASSETS;

These can also be found in the setup.sql file.

Snowpark for Python and Snowpark ML

Update the connection.json file with your Snowflake account identifier and credentials:

{
  "account"   : "<your_account_identifier_goes_here>",
  "user"      : "<your_username_goes_here>",
  "password"  : "<your_password_goes_here>",
  "role"      : "ACCOUNTADMIN",
  "warehouse" : "ML_HOL_WH",
  "database"  : "ML_HOL_DB",
  "schema"    : "ML_HOL_SCHEMA"
}
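
With the configuration in place, the notebooks open a Snowpark Session from it. A minimal sketch, assuming the JSON above is saved as connection.json next to your notebooks:

import json
from snowflake.snowpark import Session

# Load the connection parameters defined above and open a session.
with open("connection.json") as f:
    connection_parameters = json.load(f)

session = Session.builder.configs(connection_parameters).create()
print(session.get_current_database(), session.get_current_schema())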

Open the following Jupyter notebook and run each of the cells: 1_snowpark_ml_data_ingest.ipynb

Within this notebook, we will clean and ingest the diamonds dataset into a Snowflake table from an external stage. The diamonds dataset has been widely used in data science and machine learning, and we will use it to demonstrate Snowflake's native data science transformers throughout this quickstart.

The overall goal of this ML project is to predict the price of diamonds given different qualitative and quantitative attributes.
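
The core of the ingestion step looks roughly like the sketch below; the schema and column names are illustrative simplifications, not the notebook's exact code, and it assumes the session and DIAMONDS_ASSETS stage created earlier:

from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

# Illustrative schema for the diamonds CSV; the notebook's version may differ.
diamonds_schema = StructType([
    StructField("CARAT", DoubleType()),
    StructField("CUT", StringType()),
    StructField("COLOR", StringType()),
    StructField("CLARITY", StringType()),
    StructField("DEPTH", DoubleType()),
    StructField("TABLE_PCT", DoubleType()),
    StructField("PRICE", DoubleType()),
    StructField("X", DoubleType()),
    StructField("Y", DoubleType()),
    StructField("Z", DoubleType()),
])

# Read the staged CSV into a Snowpark DataFrame and persist it as a table.
diamonds_df = session.read.schema(diamonds_schema).option("SKIP_HEADER", 1).csv("@DIAMONDS_ASSETS")
diamonds_df.write.mode("overwrite").save_as_table("DIAMONDS")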

Open the following Jupyter notebook and run each of the cells: 2_snowpark_ml_feature_transformations.ipynb

In this notebook, we will walk through a few transformations on the diamonds dataset that are included in the Snowpark ML Preprocessing API. We will also build a preprocessing pipeline to be used in the ML modeling notebook.
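
As a flavor of what the notebook builds, here is a hedged sketch of a Snowpark ML preprocessing pipeline; the columns and steps are illustrative, not the notebook's exact code:

from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OrdinalEncoder

# Each step pushes its work down to Snowflake; no data leaves the platform.
preprocessing_pipeline = Pipeline(steps=[
    ("encode_cut", OrdinalEncoder(input_cols=["CUT"], output_cols=["CUT_OE"])),
    ("scale_carat", MinMaxScaler(input_cols=["CARAT"], output_cols=["CARAT_NORM"])),
])

transformed_df = preprocessing_pipeline.fit(diamonds_df).transform(diamonds_df)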

Open the following Jupyter notebook and run each of the cells: 3_snowpark_ml_model_training_deployment.ipynb

In this notebook, we will illustrate how to train an XGBoost model with the diamonds dataset using the Snowpark ML Modeling API. We will also show how to run inference and deploy the model as a UDF.
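
One manual deployment path is sketched below, under the assumption that the fitted Snowpark ML estimator (like the regressor from the earlier sketch) exposes its underlying XGBoost model via to_xgboost(); the UDF name and columns are illustrative:

import pandas as pd
from snowflake.snowpark.functions import udf

# Extract the fitted xgboost model from the Snowpark ML estimator.
xgb_model = regressor.to_xgboost()

@udf(name="predict_diamond_price", replace=True, packages=["pandas", "xgboost"])
def predict_diamond_price(carat: float, depth_norm: float) -> float:
    # The pickled xgb_model ships with the UDF and runs inside Snowflake.
    features = pd.DataFrame([[carat, depth_norm]], columns=["CARAT", "DEPTH_NORM"])
    return float(xgb_model.predict(features)[0])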

Note: Once Snowpark ML's native model registry is available, it will be the more streamlined way to deploy your model.

Congratulations, you have successfully completed this quickstart! Through this quickstart, we showcased what's new in Snowpark for Machine Learning through the introduction of Snowpark ML, a collection of Python APIs for data science and machine learning tasks. You can now run data preprocessing and model training in a few lines of code, without having to define and deploy Stored Procedures that package scikit-learn, xgboost, or lightgbm code.

For more information, check out the resources below:

Related Resources