Through this quickstart guide, you will explore what's new in Snowpark for Machine Learning. You will set up your Snowflake and Python environments and build an end-to-end ML workflow, from feature engineering to model training and deployment, using Snowpark ML.
Snowpark is the set of libraries and runtimes that securely enable developers to deploy and process Python code in Snowflake.
Familiar Client-Side Libraries - Snowpark brings deeply integrated, DataFrame-style programming and OSS-compatible APIs to the languages data practitioners like to use. It also includes a set of Snowpark ML APIs for more efficient ML modeling and ML operations (public preview soon).
Flexible Runtime Constructs - Snowpark provides flexible runtime constructs that allow users to bring in and run custom logic. Developers can seamlessly build data pipelines, ML models, and data applications with User-Defined Functions and Stored Procedures.
Learn more about Snowpark.
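To give a feel for the DataFrame-style programming Snowpark provides, here is a minimal sketch in Python. It assumes the snowflake-snowpark-python package is installed; the connection values, table, and column names are placeholders for illustration only.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg

# Placeholder credentials; in this quickstart they come from connection.json (see below).
connection_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>",
}
session = Session.builder.configs(connection_parameters).create()

# Transformations are lazily evaluated and pushed down to run inside Snowflake.
orders_df = (
    session.table("MY_DB.MY_SCHEMA.ORDERS")          # hypothetical table
    .filter(col("STATUS") == "SHIPPED")
    .group_by("REGION")
    .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)
orders_df.show()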
Snowpark ML includes the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake. Snowpark ML has two APIs: Snowpark ML Modeling for model development, and Snowpark ML Operations, including the Snowpark Model Registry (public preview soon), for model deployment and management.
This quickstart will focus on the following features, which let you build and operationalize a complete ML workflow while taking advantage of Snowflake's scale and security:
Preprocessing: Improve performance and scalability with distributed, multi-node execution for common feature engineering functions
Model Training: Execute training for popular scikit-learn, xgboost, and lightgbm models without manual creation of Stored Procedures or UDFs and accelerate model training with distributed hyperparameter tuning
Model management and deployment: Manage and deploy models and their metadata easily into Snowflake warehouses as UDFs, or Snowpark Container Services as Service endpoints for batch inference
By letting you perform these tasks within Snowflake, Snowpark ML keeps your data and your entire ML workflow inside Snowflake, so you retain its scale and security benefits end to end.
The first batch of algorithms provided in Snowpark ML is based on scikit-learn preprocessing transformations from sklearn.preprocessing, as well as estimators that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries.
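As an illustration of that scikit-learn-compatible interface, here is a minimal sketch assuming the snowflake-ml-python package and an existing Snowpark session; the table and column names are placeholders.

from snowflake.ml.modeling.preprocessing import StandardScaler

# A Snowpark DataFrame over a table in Snowflake (session created as in the sketch above).
diamonds_df = session.table("DIAMONDS")

# Mirrors sklearn.preprocessing.StandardScaler, but fit/transform execute inside Snowflake.
scaler = StandardScaler(input_cols=["CARAT"], output_cols=["CARAT_SCALED"])
scaled_df = scaler.fit(diamonds_df).transform(diamonds_df)
scaled_df.show()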
Learn more about the Snowpark ML Modeling API and the Snowpark Model Registry.
MAKE SURE YOU'VE DOWNLOADED THE GIT REPO HERE: https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python
Run the following SQL commands in a SQL worksheet to create the warehouse, database, and schema, along with the stages and file format used to load the diamonds dataset.
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE WAREHOUSE ML_HOL_WH; --by default, this creates an XS Standard Warehouse
CREATE OR REPLACE DATABASE ML_HOL_DB;
CREATE OR REPLACE SCHEMA ML_HOL_SCHEMA;
CREATE OR REPLACE STAGE ML_HOL_ASSETS; --to store model assets
-- create csv format
CREATE FILE FORMAT IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT
SKIP_HEADER = 1
TYPE = 'CSV';
-- create external stage with the csv format to stage the diamonds dataset
CREATE STAGE IF NOT EXISTS ML_HOL_DB.ML_HOL_SCHEMA.DIAMONDS_ASSETS
FILE_FORMAT = ML_HOL_DB.ML_HOL_SCHEMA.CSVFORMAT
URL = 's3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv';
-- https://sfquickstarts.s3.us-west-1.amazonaws.com/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv
LS @DIAMONDS_ASSETS;
These can also be found in the setup.sql file.
MAKE SURE YOU'VE DOWNLOADED THE GIT REPO HERE: https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python
conda env create -f conda_env.yml
conda activate snowpark-ml-hol
Optionally, start the notebook server:
jupyter notebook &> /tmp/notebook.log &
Fill in your Snowflake connection details in the connection.json file included in the repo:
{
"account" : "<your_account_identifier_goes_here>",
"user" : "<your_username_goes_here>",
"password" : "<your_password_goes_here>",
"role" : "ACCOUNTADMIN",
"warehouse" : "ML_HOL_WH",
"database" : "ML_HOL_DB",
"schema" : "ML_HOL_SCHEMA"
}
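Each notebook then uses this file to create a Snowpark session. A minimal sketch of that step, assuming the snowflake-snowpark-python package:

import json
from snowflake.snowpark import Session

# Read the credentials saved above and open a session against your account.
connection_parameters = json.load(open("connection.json"))
session = Session.builder.configs(connection_parameters).create()

# Quick sanity check that the session points at the objects created in setup.sql.
print(session.get_current_warehouse(), session.get_current_database(), session.get_current_schema())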
Open the following Jupyter notebook and run each of the cells: 1_snowpark_ml_data_ingest.ipynb
Within this notebook, we will clean and ingest the diamonds dataset into a Snowflake table from an external stage. The diamonds dataset has been widely used in data science and machine learning, and we will use it to demonstrate Snowflake's native data science transformers throughout this quickstart.
The overall goal of this ML project is to predict the price of diamonds given different qualitative and quantitative attributes.
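For orientation, the ingestion step boils down to something like the sketch below. It assumes the session created from connection.json and the standard diamonds column layout; the actual notebook also tidies column names and data types.

from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType, IntegerType

# Expected columns of the staged diamonds.csv (illustrative; adjust to the actual file).
diamonds_schema = StructType([
    StructField("CARAT", DoubleType()),
    StructField("CUT", StringType()),
    StructField("COLOR", StringType()),
    StructField("CLARITY", StringType()),
    StructField("DEPTH", DoubleType()),
    StructField("TABLE_PCT", DoubleType()),
    StructField("PRICE", IntegerType()),
    StructField("X", DoubleType()),
    StructField("Y", DoubleType()),
    StructField("Z", DoubleType()),
])

# Read the CSV from the external stage and persist it as a table.
diamonds_df = (
    session.read
    .schema(diamonds_schema)
    .option("field_optionally_enclosed_by", '"')
    .option("skip_header", 1)
    .csv("@DIAMONDS_ASSETS")
)
diamonds_df.write.mode("overwrite").save_as_table("DIAMONDS")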
Open the following Jupyter notebook and run each of the cells: 2_snowpark_ml_feature_transformations.ipynb
In this notebook, we will walk through a few transformations on the diamonds dataset that are included in the Snowpark ML Preprocessing API. We will also build a preprocessing pipeline to be used in the ML modeling notebook.
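The kind of pipeline built there looks roughly like the sketch below, assuming the snowflake-ml-python package; the column names and steps are illustrative rather than the notebook's exact pipeline.

from snowflake.ml.modeling.preprocessing import OrdinalEncoder, MinMaxScaler
from snowflake.ml.modeling.pipeline import Pipeline

# Encode a categorical column and normalize a numeric one, all executed in Snowflake.
preprocessing_pipeline = Pipeline(steps=[
    ("encode_cut", OrdinalEncoder(input_cols=["CUT"], output_cols=["CUT_OE"])),
    ("scale_carat", MinMaxScaler(input_cols=["CARAT"], output_cols=["CARAT_NORM"])),
])

diamonds_df = session.table("DIAMONDS")
transformed_df = preprocessing_pipeline.fit(diamonds_df).transform(diamonds_df)
transformed_df.show()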
Open the following Jupyter notebook and run each of the cells: 3_snowpark_ml_model_training_deployment.ipynb
In this notebook, we will illustrate how to train an XGBoost model on the diamonds dataset using the Snowpark ML Modeling API. We will also show how to run inference and deploy the model to a Snowflake warehouse through the Snowpark Model Registry.
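Training and registration reduce to a few calls like the sketch below. It assumes the snowflake-ml-python package; the feature columns are illustrative, and because the Snowpark Model Registry is still in preview its interface may differ across releases, so treat the Registry calls as a sketch rather than the definitive API.

from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.registry import Registry

diamonds_df = session.table("DIAMONDS")
train_df, test_df = diamonds_df.random_split(weights=[0.9, 0.1], seed=42)

# Train on numeric feature columns; no stored procedure or UDF needs to be written by hand.
model = XGBRegressor(
    input_cols=["CARAT", "DEPTH", "X", "Y", "Z"],
    label_cols=["PRICE"],
    output_cols=["PREDICTED_PRICE"],
)
model.fit(train_df)

# Log the fitted model so it can be managed and used for batch inference from Snowflake.
registry = Registry(session=session)
model_version = registry.log_model(model, model_name="DIAMONDS_PRICE_PREDICTION", version_name="V1")
predictions = model_version.run(test_df, function_name="predict")
predictions.show()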
Congratulations, you have successfully completed this quickstart! Through it, we showcased Snowpark for Machine Learning by introducing Snowpark ML, the Python library and underlying infrastructure for data science and machine learning tasks. You can now run data preprocessing, feature engineering, model training, and integrated deployment in a few lines of code, without having to define and deploy stored procedures that package scikit-learn, xgboost, or lightgbm code.
For more information, check out the resources below: