Python is the language of choice for Data Science and Machine Learning workloads. Snowflake has long supported Python via the Python Connector, allowing data scientists to interact with data stored in Snowflake from their preferred Python environment. This did, however, require them to write verbose SQL queries. To provide a friendlier, more expressive, and extensible interface to Snowflake, we built Snowpark Python, a native Python experience with a pandas- and PySpark-like client-side API for data manipulation that removes the need to write verbose SQL. Python UDF and Stored Procedure support also adds more general capabilities for compute pushdown.
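To give a sense of the DataFrame API before we dive in, here is a minimal sketch of the kind of query you would otherwise express as SQL. It assumes a Session object named session has already been created (covered in the setup section below), and the table and column names are purely illustrative:

from snowflake.snowpark.functions import col, avg

# lazily reference a Snowflake table; nothing is pulled to the client yet
df = session.table("CUSTOMER_BILLING")
# the filter and aggregation are pushed down to Snowflake as SQL
result = (df.filter(col("TOTALCHARGES") > 100)
            .group_by(col("CONTRACT"))
            .agg(avg(col("MONTHLYCHARGES")).alias("AVG_MONTHLY")))
result.show()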

Snowpark includes client-side APIs and server-side runtimes that extend Snowflake to popular programming languages including Scala, Java, and Python. Ultimately, this offering provides a richer set of tools for Snowflake users (e.g. Python's extensibility and expressiveness) while still leveraging all of Snowflake's core features and the underlying power of SQL, and it provides a clear path to production for machine learning products and workflows.

A key component of Snowpark for Python is that you can "Bring Your Own IDE": anywhere you can run a Python kernel, you can run client-side Snowpark Python. You can use it in your code development exactly as you would any other Python library or module. In this quickstart, we will be using Jupyter Notebooks, but you could easily replace Jupyter with any IDE of your choosing.
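As a small illustration of that point, connecting from any Python environment is just an import and a Session; the connection parameters below are placeholders you would replace with your own account details:

from snowflake.snowpark import Session

# placeholder values -- replace with your own account details
connection_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>",
    "role": "<your_role>",
    "warehouse": "<your_warehouse>",
}
session = Session.builder.configs(connection_parameters).create()
print(session.sql("select current_version()").collect())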

Throughout this quickstart, we will specifically explore the power of the Snowpark Python DataFrame API, as well as server-side Python runtime capabilities, and how Snowpark Python can enable and accelerate end-to-end Machine Learning workflows.

The source code for this quickstart is available on GitHub.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

You will build an end-to-end data science workflow leveraging Snowpark for Python to load, clean, and prepare data, and then deploy the trained model in Snowflake using a Python UDF for inference.

You are part of a team of data engineers and data scientists at a telecom company that has been tasked with reducing customer churn using a machine learning based solution.

To build this, you have access to customer demographic and billing data. Using Snowpark, we will ingest, analyze and transform this data to train a model that will then be deployed inside Snowflake to score new data.

With Snowflake, it is easy to make all relevant data instantly accessible to your machine learning models, whether for training or inference. For this guide, we are going to do all of our data and feature engineering with Snowpark for Python, but users can choose to work with SQL or any of the other Snowpark-supported languages, including Java and Scala, without the need for separate environments.
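For example, a feature-engineering step expressed with the DataFrame API runs entirely inside Snowflake. The sketch below assumes the session from the setup section, and the table, column, and feature names are hypothetical:

from snowflake.snowpark.functions import col, when

demographics = session.table("CUSTOMER_DEMOGRAPHICS")
billing = session.table("CUSTOMER_BILLING")

# derive a simple churn-related feature; only the result is materialized
features = (demographics.join(billing, "CUSTOMERID")
            .with_column("IS_LONG_TENURE",
                         when(col("TENUREMONTHS") >= 24, 1).otherwise(0)))
features.write.mode("overwrite").save_as_table("CHURN_FEATURES")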

To streamline your path to production, we will learn how to bring trained models (whether trained inside Snowflake or in an external environment) to run directly inside Snowflake as UDFs, bringing the models to where the data and data pipelines live.
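In outline, registering a model as a permanent Python UDF looks like the sketch below. The stage name, package list, and scoring logic are placeholders; the quickstart notebook loads the actual trained model rather than the trivial rule used here:

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

# make third-party packages available to the server-side Python runtime
session.add_packages("scikit-learn", "pandas")

@udf(name="predict_churn", is_permanent=True, replace=True,
     stage_location="@ml_models",  # hypothetical stage for the UDF and its model file
     return_type=FloatType(), input_types=[FloatType(), FloatType()])
def predict_churn(tenure_months, monthly_charges):
    # placeholder scoring rule standing in for model.predict_proba(...)
    return 1.0 if monthly_charges / max(tenure_months, 1.0) > 10 else 0.0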

Let's set up the Python environment necessary to run this quickstart:

First, clone the source code for this repo to your local environment (you can also download the code here):

git clone git@github.com:Snowflake-Labs/sfguide-getting-started-snowpark-python.git
cd getting-started-snowpark-python/customer-churn-prediction

Snowpark Python via Anaconda

If you are using Anaconda on your local machine, create a conda env for this quickstart:

conda env create -f jupyter_env.yml
conda activate getting_started_snowpark_python

Conda will automatically install snowflake-snowpark-python==0.7.0 and all other dependencies for you.

Once Snowpark is installed, create a kernel for Jupyter:

python -m ipykernel install --user --name=getting_started_snowpark_python

Now, launch Jupyter Notebook on your local machine:

jupyter notebook

Open up the config.py file in Jupyter and modify it with your account, username, and password information.
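The exact variable names in config.py are defined by the repo, but conceptually it just holds your connection details, along these lines:

# config.py -- illustrative contents; keep the variable names the notebooks expect
snowflake_account = "<your_account_identifier>"
snowflake_user = "<your_username>"
snowflake_password = "<your_password>"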

Now, you are ready to get started with the notebooks. Make sure that you select the getting_started_snowpark_python kernel for each notebook: after launching a notebook, navigate to Kernel -> Change Kernel and select getting_started_snowpark_python.

Persona: DBA/Platform Administrator/Data Engineer

What You'll Do:

Open up the 01-Load-Data-with-Snowpark Jupyter notebook and run each of the cells to explore loading and transforming data with Snowpark Python.
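To give a flavor of what the notebook covers, loading a local file into Snowflake with Snowpark can be as simple as the sketch below; the file and table names are illustrative, and the notebook itself defines the actual loading steps:

import pandas as pd

# read a local CSV with pandas, then push it into a Snowflake table
raw = pd.read_csv("raw_telco_data.csv")
session.create_dataframe(raw).write.mode("overwrite").save_as_table("RAW_TELCO_DATA")
session.table("RAW_TELCO_DATA").show(5)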

Persona: Data Scientist

What You'll Do:

Open up the 02-Data-Analysis Jupyter notebook and run each of the cells to explore data analysis using Snowpark Python.
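For a sense of the analysis pattern, aggregations are pushed down to Snowflake and only the small summarized result is pulled back for plotting. The table and column names below are illustrative, and plotting assumes matplotlib is available in the environment:

from snowflake.snowpark.functions import col, count, avg

churn_summary = (session.table("RAW_TELCO_DATA")
                 .group_by(col("CONTRACT"))
                 .agg(count(col("CUSTOMERID")).alias("CUSTOMERS"),
                      avg(col("MONTHLYCHARGES")).alias("AVG_MONTHLY_CHARGES")))

# only the aggregated rows leave Snowflake
pdf = churn_summary.to_pandas()
pdf.plot.bar(x="CONTRACT", y="AVG_MONTHLY_CHARGES")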

Persona: Data Scientist/ML Engineer

What You'll Do:

Open up the 03-Snowpark-UDF-Deployment Jupyter notebook and run each of the cells to train a model and deploy it for in-Snowflake inference using Snowpark Python UDFs.
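Once the UDF is deployed, scoring new data from Snowpark is a one-liner; this sketch reuses the hypothetical names from the UDF example earlier in this guide:

from snowflake.snowpark.functions import call_udf, col

# score new customers with the deployed UDF
scored = (session.table("CHURN_FEATURES")
          .with_column("CHURN_SCORE",
                       call_udf("predict_churn",
                                col("TENUREMONTHS"), col("MONTHLYCHARGES"))))
scored.show(5)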

Through this quickstart, you experienced how Snowpark for Python enables you to use familiar syntax and constructs to process data where it lives with Snowflake's elastic, scalable, and secure engine, accelerating the path to production for data pipelines and ML workflows. Here's what you were able to complete:

Alright, so what's next? Well, that's the end of this guide, but it's not the end of Snowpark's potential! You could automate the steps using Tasks or external tools such as Apache Airflow. Check out our advanced [Machine Learning with Snowpark Python]() guide to go deeper into Snowpark Python, orchestration, MLOps, and more.

For more information on Snowpark Python, and Machine Learning in Snowflake, check out the following resources: