In this quickstart guide, you will explore what's new in Snowflake for machine learning. You will set up your Snowflake and Python environments and build an end-to-end ML workflow, from feature engineering to model training and batch inference, with Snowflake ML, all from a set of unified Python APIs in the Snowpark ML library.

What is Snowpark?

Snowpark is the set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake.

Client Side Libraries - Snowpark libraries can be installed and downloaded from any client-side notebook or IDE and are used for code development and deployment. Libraries include the Snowpark ML API, which provides Python APIs for machine learning workflows in Snowflake.

Code Execution Environments - Snowpark provides elastic compute environments for secure execution of your code in Snowflake. Runtime options include Python, Java, and Scala in warehouses; container runtimes for out-of-the-box distributed processing with CPUs or GPUs using any Python framework; or custom runtimes brought in from Snowpark Container Services to execute any language of choice with CPU or GPU compute.

Learn more about Snowpark.

What is Snowflake ML?

Snowflake ML is the integrated set of capabilities for end-to-end machine learning in a single platform on top of your governed data. Snowflake ML can be used for fully custom and out-of-the-box workflows. For ready-to-use ML, analysts can use ML Functions to shorten development time or democratize ML across your organization with SQL from Studio, our no-code user interface. For custom ML, data scientists and ML engineers can easily and securely develop and productionize scalable features and models without any data movement, silos or governance tradeoffs.

Capabilities for custom ML include:

[Figure: Snowflake ML overview]

To get started with Snowflake ML, developers can use the Python APIs from the Snowpark ML library, directly from Snowflake Notebooks (public preview) or downloaded and installed into any IDE of choice, including Jupyter or Hex.

Feature Engineering and Preprocessing: Improve performance and scalability with distributed execution for common scikit-learn preprocessing functions.

Model Training: Accelerate model training for scikit-learn, XGBoost and LightGBM models without the need to manually create stored procedures or user-defined functions (UDFs), and leverage distributed hyperparameter optimization.

[Figure: Snowpark ML Modeling overview]

Model Management and Batch Inference: Manage several types of ML models created both within and outside Snowflake and execute batch inference.

[Figure: Snowflake Model Registry]

Snowflake ML provides the following advantages:

The first batch of algorithms provided in Snowpark ML Modeling is based on scikit-learn preprocessing transformations from sklearn.preprocessing, as well as estimators that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries.

Learn more about Snowpark ML Modeling and Snowflake Model Registry.

What you will learn

This quickstart will focus on building a custom ML workflow using the following features:

Prerequisites

What You'll Build

To get started using Snowflake Notebooks, first log in to Snowsight and run the following setup.sql in a SQL worksheet. This creates the database, warehouse, schema, and stages that we will use for our ML project.

USE ROLE SYSADMIN;
CREATE OR REPLACE WAREHOUSE ML_HOL_WH; --by default, this creates an XS Standard Warehouse
CREATE OR REPLACE DATABASE ML_HOL_DB;
CREATE OR REPLACE SCHEMA ML_HOL_SCHEMA;
CREATE OR REPLACE STAGE ML_HOL_ASSETS; --to store model assets

-- create csv format
CREATE FILE FORMAT IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    SKIP_HEADER = 1 
    TYPE = 'CSV';

-- create external stage with the csv format to stage the diamonds dataset
CREATE STAGE IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.DIAMONDS_ASSETS 
    FILE_FORMAT = ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT 
    URL = 's3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv';
    -- https://sfquickstarts.s3.us-west-1.amazonaws.com/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv

LS @DIAMONDS_ASSETS;

Now, let's create our first Notebook by importing a .ipynb file. First, make sure your current role is SYSADMIN. Then, select the Notebooks tab under the Projects drop-down in the left sidebar:

[Screenshot: Notebooks tab in the Projects drop-down]

Next, click the gray upload/import .ipynb button, and select 0_start_here.ipynb from your local filesystem:

[Screenshot: Notebook upload]

Leave the populated notebook name as-is (or change it if you'd like!), and make sure that the location is set to ML_HOL_DB and ML_HOL_SCHEMA. Lastly, make sure the Notebook warehouse is ML_HOL_WH, and click Create:

[Screenshot: Notebook configuration]

This will create and open the notebook you just uploaded. Follow the instructions at the top of the Notebook to select the packages it needs via the Packages drop-down:

[Screenshot: Packages drop-down]

Then, click Start and run the Notebook from start to finish! Repeat this process with all three Notebooks to see how easy it is to write Python and SQL code in a single, familiar Notebook interface directly in Snowsight!

Open the following notebook in Snowflake Notebooks and run each of the cells: 0_start_here.ipynb

Within this notebook, we will clean and ingest the diamonds dataset into a Snowflake table from an external stage. The diamonds dataset has been widely used in data science and machine learning, and we will use it to demonstrate Snowflake's native data science transformers throughout this quickstart.
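As a rough illustration of the kind of cleaning this step involves, the sketch below uses only the Python standard library on two made-up rows. The column names, sample values, and the `clean_rows` helper are assumptions for illustration, not the notebook's code; the notebook itself reads the staged file through Snowflake.

```python
import csv
import io

# Hypothetical sample in the shape of a diamonds CSV (values are made up).
RAW_CSV = '''"carat","cut","color","clarity","price"
0.23,"Ideal","E","SI2",326
0.21,"Very Good","E","SI1",326
'''

def clean_rows(text):
    """Upper-case column names and normalize categorical strings."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for row in reader:
        out = {}
        for name, value in row.items():
            key = name.strip().upper()
            if key == "CARAT":
                out[key] = float(value)
            elif key == "PRICE":
                out[key] = int(value)
            else:
                # Upper-case and replace spaces so category values are uniform.
                out[key] = value.strip().upper().replace(" ", "_")
        cleaned.append(out)
    return cleaned

rows = clean_rows(RAW_CSV)
```

Normalizing names and category values up front keeps later steps, such as encoding categorical columns, from having to handle mixed case or embedded spaces.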

The overall goal of this ML project is to predict the price of diamonds given different qualitative and quantitative attributes.

Open the following notebook in Snowflake Notebooks and run each of the cells: 1_sf_nb_snowflake_ml_feature_transformations.ipynb

In this notebook, we will walk through a few transformations on the diamonds dataset that are included in Snowpark ML Modeling. We will also build a preprocessing pipeline to be used in the ML modeling notebook.
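To make the idea of a preprocessing pipeline concrete, here is a minimal pure-Python sketch of two scikit-learn-style transformations, ordinal encoding and min-max scaling, applied together. It only shows the arithmetic on in-memory rows. The category order, the sample values, and the helper names are assumptions for illustration; the notebook uses the Snowpark ML Modeling transformers instead.

```python
# Assumed ordering of the CUT categories, worst to best (illustrative only).
CUT_ORDER = ["FAIR", "GOOD", "VERY_GOOD", "PREMIUM", "IDEAL"]

def fit_min_max(values):
    """Learn the minimum and maximum used for scaling."""
    return min(values), max(values)

def min_max_scale(value, lo, hi):
    """Map value into [0, 1]; a constant column maps to 0.0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def ordinal_encode(value, categories):
    """Replace a category with its position in the ordered list."""
    return float(categories.index(value))

def preprocess(rows):
    """Fit on rows and return transformed copies (inputs are not mutated)."""
    lo, hi = fit_min_max([r["CARAT"] for r in rows])
    return [
        {**r,
         "CARAT": min_max_scale(r["CARAT"], lo, hi),
         "CUT": ordinal_encode(r["CUT"], CUT_ORDER)}
        for r in rows
    ]

sample = [
    {"CARAT": 0.5, "CUT": "IDEAL"},
    {"CARAT": 1.0, "CUT": "GOOD"},
    {"CARAT": 1.5, "CUT": "FAIR"},
]
features = preprocess(sample)
```

Separating the fit step (learning the minimum and maximum) from the transform step is what lets the same fitted pipeline be reused later on new data at inference time.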

Open the following notebook in Snowflake Notebooks and run each of the cells: 2_sf_nb_snowflake_ml_model_training_inference.ipynb

In this notebook, we will illustrate how to train an XGBoost model on the diamonds dataset using Snowpark ML Modeling. We also show how to execute batch inference through the Snowflake Model Registry.
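For readers new to the term, batch inference means scoring a whole table of rows with an already-trained model rather than one request at a time. The sketch below shows that pattern in plain Python. The `StandInPriceModel` class, its fixed pricing rule, and the batch size are hypothetical stand-ins; they are not an XGBoost model or the Model Registry API.

```python
from itertools import islice

class StandInPriceModel:
    """Hypothetical stand-in for a trained regressor: price = 4000 * carat."""
    def predict(self, batch):
        return [4000.0 * row["CARAT"] for row in batch]

def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def batch_predict(model, rows, size=2):
    """Score every row, one batch at a time, and attach the prediction."""
    scored = []
    for chunk in batches(rows, size):
        for row, price in zip(chunk, model.predict(chunk)):
            scored.append({**row, "PREDICTED_PRICE": price})
    return scored

table = [{"CARAT": 0.25}, {"CARAT": 0.5}, {"CARAT": 1.0}]
scored = batch_predict(StandInPriceModel(), table)
```

In the notebook, the equivalent work runs inside Snowflake against a table, so the data does not have to be pulled into client memory as it is here.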

Congratulations, you have successfully completed this quickstart! Along the way, we showcased Snowflake ML, the integrated set of capabilities for end-to-end ML workflows. Now, you can run data preprocessing, feature engineering, model training, and batch inference in a few lines of code without having to define and deploy stored procedures that package scikit-learn, xgboost, or lightgbm code.

For more information, check out the resources below:

Related Resources