Through this quickstart guide, you will learn how to use the Snowpark pandas API to create a customer profile based on the Snowflake Sample TPC-H dataset, save it to a Snowflake table, and create a serverless task to schedule the feature engineering.

What Is Snowpark?

Snowpark is the set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake.

Learn more about Snowpark.

What Is The Snowpark Pandas API?

The Snowpark pandas API is a module in the Snowpark library that lets you run your pandas code directly on your data in Snowflake. Just by changing the import statement and a few lines of code, you can get the same pandas-native experience you know and love with the scalability and security benefits of Snowflake. With this API, you can work with much larger datasets so you can avoid the time and expense of porting your pandas pipelines to other big data frameworks or using larger and more expensive machines. It runs workloads natively in Snowflake through translation to SQL, enabling it to take advantage of parallelization and the data governance and security benefits of Snowflake.
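As a minimal sketch (assuming a configured default Snowflake connection, for example in connections.toml), switching an existing pandas workload over is mostly a change of imports:

import modin.pandas as pd               # instead of: import pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowflake backend for modin
from snowflake.snowpark import Session

# Create a session from your default connection configuration
session = Session.builder.create()

# pandas-style operations now run inside Snowflake, translated to SQL
df = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
print(df.shape)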

Benefits Of Using The Snowpark Pandas API

Learn more about the Snowpark pandas API.

What You'll Learn

Prerequisites

What You'll Build

A customer profile table built using the Snowpark pandas API, and a serverless task that will run the feature engineering on a schedule.

Overview

This section covers cloning the GitHub repository and creating the needed Snowflake objects (i.e., role, warehouse, database, schema, etc.).

Clone The Git Repository

The very first step is to clone the GitHub repository. This repository contains all the code you will need to successfully complete this quickstart guide.

Using HTTPS:

git clone https://github.com/Snowflake-Labs/sfguide-data-engineering-pipelines-with-snowpark-pandas.git

OR, using SSH:

git clone git@github.com:Snowflake-Labs/sfguide-data-engineering-pipelines-with-snowpark-pandas.git

You can also use the Git integration feature of Snowflake Notebooks; to do that, you need to fork the GitHub repository so that you are allowed to commit changes. For instructions on how to set up Git integration for your Snowflake account, see here, and for using it with Snowflake Notebooks, see this page.

During this step you will verify that the Snowflake Sample TPC-H dataset is available in your account and, if it is not, add the share.

Verify That The Snowflake Sample TPC-H Dataset Is Available

  1. Log into Snowsight for your account
  2. Navigate to Databases
  3. Verify that you can see the SNOWFLAKE_SAMPLE_DATA database; if it is missing, you can add it by following the instructions at https://docs.snowflake.com/en/user-guide/sample-data-using (or programmatically, as sketched below)
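If you prefer to add the share programmatically, the statements from the linked documentation page can also be run from Snowpark Python. A sketch, assuming the ACCOUNTADMIN role and a configured default connection:

from snowflake.snowpark import Session

session = Session.builder.create()

# Create the sample database from the Snowflake-provided share
session.sql("CREATE DATABASE SNOWFLAKE_SAMPLE_DATA FROM SHARE SFC_SAMPLES.SAMPLE_DATA").collect()

# Let all roles in the account query the sample data
session.sql("GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE_SAMPLE_DATA TO ROLE PUBLIC").collect()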

Create Database, Schema And Warehouse To Be Used

USE ROLE ACCOUNTADMIN;

CREATE DATABASE SNOW_PANDAS_DE_QS;
CREATE SCHEMA SNOW_PANDAS_DE_QS.NOTEBOOKS;
CREATE SCHEMA SNOW_PANDAS_DE_QS.DATA;

CREATE WAREHOUSE SNOW_PANDAS_DE_QS_WH;

Create Snowflake Notebook

Navigate To Snowflake Notebooks

  1. Navigate to the Notebooks section by clicking Projects and then Notebooks
    Navigate to Notebooks
  2. Click on the down arrow next to + Notebook
    New notebook drop down
  3. If you have set up Git integration, choose Create from repository; if not, choose Import .ipynb file.
    New notebook from menu

Import .ipynb File

  1. Navigate to where you have cloned the GitHub repository, select Customer Profile Creation Pipeline.ipynb, and click Open
    Select Notebook File
  2. Keep the name, select SNOW_PANDAS_DE_QS and NOTEBOOKS for Notebook location and SNOW_PANDAS_DE_QS_WH for Notebook warehouse, and click Create
    Select Notebook File

Create From Repository

If you have forked the GitHub repository and created a Git integration to it in Snowflake, you can open the notebook directly from the Git repository.

  1. In the Create Notebook from Repository dialog, click on Select .ipynb file
    Create Notebook from Repository Dialog
  2. Click on the repository integration you are using, select Customer Profile Creation Pipeline.ipynb, and click Select File. If you do not see the file, click Fetch to refresh with the latest changes from the repository
    Select Notebook File from Repository
  3. Name it Customer Profile Creation Pipeline, select SNOW_PANDAS_DE_QS and NOTEBOOKS for Notebook location and SNOW_PANDAS_DE_QS_WH for Notebook warehouse, and click Create
    Create Notebook from Repository Dialog

Add Required Python Libraries

Before you run the notebook you need to add the following Python libraries:

  1. In the Notebook click on Packages
  2. Search for modin and select modin in the list
    Modin search result
  3. Do the same for snowflake, matplotlib and seaborn. When done, you should have the same packages as in the list below (the versions might differ); you can confirm the imports with the sketch after this list
    Added packages
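Optionally, you can confirm in a notebook cell that the added packages import cleanly. A quick sketch:

# Run in a notebook cell to confirm the added packages are available
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print(matplotlib.__version__, sns.__version__)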

During this step you will learn how to use the Snowpark pandas API to create DataFrames, join them, create new features, save the result to a Snowflake table, and create a serverless task to schedule the feature engineering.

Follow along and run each of the cells in the Notebook.

Within this Notebook, we will use the Snowpark pandas API to create DataFrames, join them, create new features, and create a serverless task to schedule the feature engineering.
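The notebook walks through this step by step; below is a minimal sketch of the overall flow (assuming a configured default connection; the joined columns, the feature names NUMBER_OF_ORDERS and TOTAL_SPEND, and the stored procedure CREATE_CUSTOMER_PROFILE called by the task are illustrative, not the notebook's exact code):

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

session = Session.builder.create()
session.use_database("SNOW_PANDAS_DE_QS")
session.use_schema("DATA")

# Read the TPC-H sample tables as Snowpark pandas DataFrames (evaluated in Snowflake)
customer = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER")
orders = pd.read_snowflake("SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS")

# Join customers to their orders and derive simple per-customer features
joined = customer.merge(orders, left_on="C_CUSTKEY", right_on="O_CUSTKEY")
profile = (
    joined.groupby("C_CUSTKEY")
    .agg({"O_ORDERKEY": "count", "O_TOTALPRICE": "sum"})
    .rename(columns={"O_ORDERKEY": "NUMBER_OF_ORDERS", "O_TOTALPRICE": "TOTAL_SPEND"})
    .reset_index()
)

# Save the customer profile to a Snowflake table
profile.to_snowflake("CUSTOMER_PROFILE", if_exists="replace", index=False)

# Schedule the pipeline as a serverless task: omitting a WAREHOUSE clause makes the
# task serverless. CREATE_CUSTOMER_PROFILE is a hypothetical stored procedure
# wrapping the steps above.
session.sql("""
CREATE OR REPLACE TASK SNOW_PANDAS_DE_QS.DATA.CUSTOMER_PROFILE_TASK
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
  AS CALL CREATE_CUSTOMER_PROFILE()
""").collect()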

Congratulations, you have successfully completed this quickstart! Through this quickstart, we showcased how you can use the Snowpark pandas API to create DataFrames, join them, create new features, save the result to a Snowflake table, and create a serverless task to schedule the data transformation pipeline.

What You Learned

Related Resources