By completing this guide, you will be able to go from raw data to a machine learning model that helps predict house prices.
Here is a summary of what you will learn in each step of this quickstart:
Setup Environment: Use write_pandas and tables to ingest raw data from the local file system into Snowflake.
Data Engineering: Leverage Snowpark for Python DataFrames to perform data transformations such as group by, aggregate, pivot, and join to prep the data for downstream applications.
Machine Learning using scikit-learn: Prepare data and run ML training in Snowflake using scikit-learn, then deploy the model as a Snowpark User-Defined Function (UDF) using the integrated Anaconda package repository.
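To make the Data Engineering step more concrete, the following is a minimal plain-Python sketch of what a group by, an aggregate, and a join compute. It is illustrative only: the rows and the column names (`neighborhood`, `price`, `median_income`) are hypothetical and not taken from the quickstart data, and the quickstart itself expresses these operations with Snowpark DataFrames so that they run inside Snowflake rather than in local Python.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw rows; column names are illustrative, not from the quickstart data.
sales = [
    {"neighborhood": "north", "price": 300_000},
    {"neighborhood": "north", "price": 340_000},
    {"neighborhood": "south", "price": 210_000},
]
incomes = [
    {"neighborhood": "north", "median_income": 85_000},
    {"neighborhood": "south", "median_income": 60_000},
]


def average_price_by_neighborhood(rows):
    """Group rows by neighborhood and aggregate price with a mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["neighborhood"]].append(row["price"])
    return {name: mean(prices) for name, prices in groups.items()}


def join_on_neighborhood(avg_prices, income_rows):
    """Inner-join the aggregated prices with the income rows on neighborhood."""
    joined = []
    for row in income_rows:
        name = row["neighborhood"]
        if name in avg_prices:
            joined.append({**row, "avg_price": avg_prices[name]})
    return joined


avg_prices = average_price_by_neighborhood(sales)
features = join_on_neighborhood(avg_prices, incomes)
```

The result is one row per neighborhood that combines an aggregated value with a column from a second table, which is the general shape of a feature table prepared for downstream modeling.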
In case you are new to some of the technologies mentioned above, here's a quick summary with links to documentation.
What is Snowpark?
It allows developers to query data and write data applications in languages other than SQL using a set of APIs and DataFrame-style programming constructs in Python, Java, and Scala. These applications run on and take advantage of the same distributed computation on Snowflake's elastic engine as your SQL workloads. Learn more about Snowpark.
What is scikit-learn?
It is one of the most popular open-source machine learning libraries for Python, and it is pre-installed and available for developers to use in Snowpark for Python via the Snowflake Anaconda channel. This means that you can use it in Snowpark for Python User-Defined Functions and Stored Procedures without having to manually install it and manage all of its dependencies.
What You'll Learn
1. How to ingest data in Snowflake
2. How to explore and understand the data with pandas and visualization
3. How to encode the data for algorithms to use
4. How to normalize the data
5. How to train models with scikit-learn and Snowpark (including using a Snowpark-optimized warehouse)
6. How to evaluate models for accuracy
7. How to deploy models on Snowflake
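The encoding, normalization, and evaluation items in the list above can be sketched in plain Python to show what each step does to the data. This is a minimal illustration under stated assumptions, not the quickstart's implementation: the guide does not say which encoder, scaler, or metric it uses, so one-hot encoding, min-max scaling, and mean absolute percentage error are choices made here for illustration, and all function names and sample values are hypothetical. scikit-learn ships comparable utilities (`OneHotEncoder`, `MinMaxScaler`, and `mean_absolute_percentage_error`), which would normally be used instead of hand-written helpers.

```python
def one_hot(values):
    """Encode categorical values as 0/1 columns, one per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def min_max_scale(values):
    """Rescale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|; lower is better."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical sample values, chosen only to exercise the helpers.
encoded, categories = one_hot(["inland", "coast", "inland"])
scaled = min_max_scale([1000, 1500, 3000])
mape = mean_absolute_percentage_error([200_000, 400_000], [220_000, 380_000])
```

Encoding turns text categories into numeric columns that an algorithm can consume, scaling puts features with very different ranges on a common footing, and the error metric summarizes how far predicted prices are from actual prices.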
You will need to acknowledge the Snowflake Third Party Terms by following the Anaconda link in the previous step.
A Snowflake account login with the ACCOUNTADMIN role. If you have this role in your environment, you may choose to use it. If not, you will need to 1) register for a free trial, 2) use a different role that has the ability to create databases, schemas, tables, stages, tasks, user-defined functions, and stored procedures, or 3) use an existing database and schema in which you are able to create the mentioned objects.
This section covers cloning of the GitHub repository and creating a Python 3.8 environment.