By completing this guide, you will go from raw data to building a machine learning model that can help predict house prices.

The What You'll Learn section below summarizes what you will learn in each step of this quickstart.

In case you are new to some of the technologies used in this guide, here's a quick summary with links to documentation.

What is Snowpark?

It allows developers to query data and write data applications in languages other than SQL using a set of APIs and DataFrame-style programming constructs in Python, Java, and Scala. These applications run on and take advantage of the same distributed computation on Snowflake's elastic engine as your SQL workloads. Learn more about Snowpark.
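To make the DataFrame-style programming model concrete, here is a minimal sketch of a Snowpark Python query. The connection parameters and the HOUSING table and column names are placeholders for illustration only, not part of this guide's setup.

# Minimal sketch of Snowpark DataFrame-style programming (placeholder names).
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "role": "<your_role>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}
session = Session.builder.configs(connection_parameters).create()

# The query is built lazily and executed on Snowflake's engine when an
# action such as show() or to_pandas() is called.
df = session.table("HOUSING")
df.group_by(col("OCEAN_PROXIMITY")).agg(
    avg(col("MEDIAN_HOUSE_VALUE")).alias("AVG_VALUE")
).show()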


What is scikit-learn?

It is one of the most popular open-source machine learning libraries for Python. It is pre-installed and available for developers to use in Snowpark for Python via the Snowflake Anaconda channel, which means you can use it in Snowpark for Python User-Defined Functions and Stored Procedures without having to manually install it or manage its dependencies.
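As a hedged illustration of that point, the sketch below registers a trivial Snowpark Python UDF that imports scikit-learn simply by declaring it in the packages list, so Snowflake resolves it from the Anaconda channel. The function name and logic are placeholders, and an existing Snowpark session is assumed.

# Illustrative sketch only -- the UDF name and logic are placeholders.
# Assumes `session` is an existing Snowpark Session object.
from snowflake.snowpark.functions import udf

@udf(name="scale_value",
     packages=["scikit-learn", "numpy"],  # resolved from the Anaconda channel
     replace=True,
     session=session)
def scale_value(x: float) -> float:
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    # Trivial server-side use of scikit-learn, just to show the import works.
    scaled = MinMaxScaler().fit_transform(np.array([[0.0], [x], [10.0]]))
    return float(scaled[1][0])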

What You'll Learn

1. How to ingest data into Snowflake

2. How to perform data exploration and understanding with pandas and visualization

3. How to encode the data for algorithms to use

4. How to normalize the data (a minimal scikit-learn sketch of steps 3 and 4 follows this list)

5. How to train models with scikit-learn and Snowpark (including using a Snowpark-optimized warehouse)

6. How to evaluate models for accuracy

7. How to deploy models on Snowflake
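The sketch below illustrates the encoding and normalization ideas from steps 3 and 4 with plain scikit-learn. The column names are assumptions based on the housing dataset; the notebooks contain the actual preprocessing code.

# Minimal scikit-learn sketch of encoding a categorical column and
# normalizing a numeric one (column names are assumptions).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "OCEAN_PROXIMITY": ["INLAND", "NEAR BAY", "INLAND"],
    "MEDIAN_INCOME": [2.5, 8.3, 4.1],
})

preprocessor = ColumnTransformer([
    # Encode the categorical column so algorithms can consume it
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["OCEAN_PROXIMITY"]),
    # Normalize the numeric column to the [0, 1] range
    ("scale", MinMaxScaler(), ["MEDIAN_INCOME"]),
])
print(preprocessor.fit_transform(df))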

Prerequisites

This section covers cloning of the GitHub repository and creating a Python 3.8 environment.

  1. Clone the GitHub repository
  2. Download the miniconda installer from https://conda.io/miniconda.html. (Or, you may use any other Python environment with Python 3.8.)
  3. Open environment.yml and paste in the following config:
name: snowpark_scikit_learn
channels:
  - https://repo.anaconda.com/pkgs/snowflake/
  - nodefaults
dependencies:
  - python=3.8
  - pip
  - snowflake-snowpark-python
  - ipykernel
  - pyarrow
  - numpy
  - scikit-learn
  - pandas
  - joblib
  - cachetools
  - matplotlib
  - seaborn
  4. From the root folder, create the conda environment by running the command below.
conda env create -f environment.yml
conda activate snowpark_scikit_learn
  5. Download and install VS Code, or use Jupyter Notebook or any other IDE of your choice.
  6. Update config.py with your Snowflake account details and credentials (a hypothetical layout is sketched below).
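config.py itself is not reproduced in this guide; the layout below is a hypothetical sketch with illustrative field names, so you know roughly what to fill in. Match whatever keys the notebooks in the repository actually expect.

# Hypothetical config.py layout -- field names are illustrative only.
connection_parameters = {
    "account": "<your_snowflake_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>",
    "role": "<your_role>",            # e.g. ACCOUNTADMIN or a custom role
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}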

Troubleshooting pyarrow-related issues

The Notebook linked below covers the following data ingestion tasks (a rough sketch of the core calls follows the list).

  1. Download the data file to be used in the lab
  2. Read the downloaded data as a pandas DataFrame
  3. Connect to Snowflake using the session object
  4. Create a database, schema, and warehouse
  5. Load the pandas DataFrame into a Snowflake table
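If you want a feel for the core calls before opening the notebook, here is a rough sketch of the flow. The file name, database, schema, warehouse, and table names are assumptions; follow the notebook for the exact code.

# Rough outline of the data ingestion flow (object names are assumptions).
import pandas as pd
from snowflake.snowpark import Session
import config  # your config.py with connection details

# Read the downloaded data as a pandas DataFrame
housing_df = pd.read_csv("housing.csv")

# Connect to Snowflake using a session object
session = Session.builder.configs(config.connection_parameters).create()

# Create database, schema, and warehouse (names are illustrative)
session.sql("CREATE DATABASE IF NOT EXISTS HOUSING_DB").collect()
session.sql("CREATE SCHEMA IF NOT EXISTS HOUSING_DB.HOUSING_SCHEMA").collect()
session.sql("CREATE WAREHOUSE IF NOT EXISTS SNOWPARK_DEMO_WH").collect()
session.use_schema("HOUSING_DB.HOUSING_SCHEMA")

# Load the pandas DataFrame into a Snowflake table
session.write_pandas(housing_df, "HOUSING_DATA",
                     auto_create_table=True, overwrite=True)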

Data Ingest Notebook in Jupyter or Visual Studio Code

To get started, follow these steps:

  1. In a terminal window, browse to this folder and run jupyter notebook at the command line. (You may also use other tools and IDEs such as Visual Studio Code.)
  2. Open and run through the cells in 1_snowpark_housing_data_ingest.ipynb

The Notebook linked below covers the following data exploration tasks (a rough sketch of these steps follows the list).

  1. Establish a secure connection from Snowpark Python to Snowflake
  2. Compare a Snowpark DataFrame to a pandas DataFrame
  3. Use the describe function to understand the data
  4. Build some visualizations using seaborn and pyplot
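Here is a rough sketch of those steps; the table and column names are assumptions, and the notebook remains the source of truth.

# Rough sketch of the exploration steps (table/column names are assumptions).
import matplotlib.pyplot as plt
import seaborn as sns
from snowflake.snowpark import Session
import config

session = Session.builder.configs(config.connection_parameters).create()

# Snowpark DataFrame: lazily evaluated and executed on Snowflake
snow_df = session.table("HOUSING_DATA")
snow_df.describe().show()  # summary statistics computed server-side

# Pull the data into pandas for client-side plotting
pdf = snow_df.to_pandas()
sns.histplot(data=pdf, x="MEDIAN_HOUSE_VALUE", bins=30)
plt.show()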

Data Exploration Notebook in Jupyter or Visual Studio Code

To get started, follow these steps:

  1. If not done already, in a terminal window, browse to this folder and run jupyter notebook at the command line. (You may also use other tools and IDEs such as Visual Studio Code.)
  2. Open and run through the cells in 2_data_exploration_transformation.ipynb

The Notebook linked below covers the following machine learning tasks (a rough sketch of the stored procedure and UDF patterns follows the list).

  1. Establish a secure connection from Snowpark Python to Snowflake
  2. Get features and target from a Snowflake table into a Snowpark DataFrame
  3. Create a Snowflake stage to save the ML model and UDFs
  4. Prepare features using scikit-learn for model training
  5. Create a Python Stored Procedure to deploy model training code on Snowflake
  6. Optionally use a Snowpark-optimized warehouse for model training
  7. Create vectorized (aka batch) Python User-Defined Functions (UDFs) for inference on new data points
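To outline the two deployment patterns named above, here is a heavily simplified sketch of a training stored procedure and a vectorized UDF for batch inference. The stage, table, feature, and function names, as well as the linear model, are assumptions; the notebook contains the actual implementation.

# Simplified sketch of the training sproc + vectorized UDF pattern.
# All object names and the model choice are assumptions, not the notebook's code.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import pandas_udf, sproc
from snowflake.snowpark.types import PandasSeries
import config

session = Session.builder.configs(config.connection_parameters).create()
session.sql("CREATE STAGE IF NOT EXISTS ML_MODELS").collect()
session.add_packages("scikit-learn", "pandas", "joblib")

@sproc(name="train_house_model", is_permanent=True,
       stage_location="@ML_MODELS", replace=True, session=session)
def train_house_model(session: Session) -> float:
    import joblib
    from sklearn.linear_model import LinearRegression
    df = session.table("HOUSING_DATA").to_pandas()
    X, y = df[["MEDIAN_INCOME"]], df["MEDIAN_HOUSE_VALUE"]
    model = LinearRegression().fit(X, y)
    # Persist the trained model to the stage for the inference UDF to import
    joblib.dump(model, "/tmp/model.joblib")
    session.file.put("/tmp/model.joblib", "@ML_MODELS",
                     auto_compress=False, overwrite=True)
    return model.score(X, y)  # R^2 on the training data

# Train first so the model file exists on the stage before the UDF is registered
print(session.call("train_house_model"))

@pandas_udf(name="predict_house_value", is_permanent=True,
            stage_location="@ML_MODELS", replace=True, session=session,
            imports=["@ML_MODELS/model.joblib"])
def predict_house_value(median_income: PandasSeries[float]) -> PandasSeries[float]:
    import os, sys
    import joblib
    import pandas as pd
    # Load the model file staged by the training stored procedure
    import_dir = sys._xoptions["snowflake_import_directory"]
    model = joblib.load(os.path.join(import_dir, "model.joblib"))
    return pd.Series(model.predict(pd.DataFrame({"MEDIAN_INCOME": median_income})))

# The UDF can then be invoked from SQL, e.g.
# SELECT predict_house_value(MEDIAN_INCOME) FROM HOUSING_DATA;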


Machine Learning Notebook in Jupyter or Visual Studio Code

To get started, follow these steps:

  1. If not done already, in a terminal window, browse to this folder and run jupyter notebook at the command line. (You may also use other tools and IDEs such as Visual Studio Code.)
  2. Open and run through the cells in 3_snowpark_end_to_end_ml.ipynb

Congratulations! You've successfully completed the lab using Snowpark for Python and scikit-learn.

What You Learned

Related Resources