Getting Started with pandas on Snowflake

Through this quickstart guide, you will explore how to get started with pandas on Snowflake for scalable data processing, analysis, and transformation using familiar pandas API and semantics.

What is pandas on Snowflake?

pandas on Snowflake lets you run your pandas code in a distributed manner scalably and securely directly on your data in Snowflake. Just by changing the import statement and a few lines of code, you can get the same pandas-native experience you know and love with the scalability and security benefits of Snowflake. With pandas on Snowflake this API, you can work with much larger datasets and avoid the time and expense of porting your pandas pipelines to other big data frameworks or provisioning large and expensive machines. It runs workloads natively in Snowflake through transpilation to SQL, enabling it to take advantage of parallelization and the data governance and security benefits of Snowflake.

Why use pandas on Snowflake?

pandas is the go-to data processing library for millions worldwide, including countless Snowflake users. However, pandas was never built to handle data at the scale organizations are operating today. Running pandas code requires transferring and loading all of the data into a single in-memory process. It becomes unwieldy on moderate-to-large data sets and breaks down completely on data sets that grow beyond what a single node can handle. With pandas on Snowflake, you can run the same pandas code, but with all the pandas processing pushed down to run in a distributed fashion in Snowflake. Your data never leaves Snowflake, and your pandas workflows can process much more efficiently using the Snowflake elastic engine. This brings the power of Snowflake to pandas developers everywhere.

pandas_benefits

pandas on Snowflake is delivered through the Snowpark pandas API, which you will learn how to use as part of this quickstart. This quickstart will focus on getting started with Snowpark pandas API, and enable you to perform common pandas operations on large volumes of data using the power of Snowflake.

What you will learn

How to install and configure the Snowpark pandas library
How to use Snowpark pandas to transform and analyze large datasets using the power of Snowflake

Prerequisites

We will be using Snowflake Notebooks as part of this tutorial. You will need:

A Snowflake account with Anaconda Packages enabled by ORGADMIN. If you do not have a Snowflake account, you can register for a free trial account.
A Snowflake account login with a role that has the ability to create database, schema, tables, stages, user-defined functions, and stored procedures. If not, you will need to register for a free trial or use a different role.

What You'll Build

A notebook leveraging pandas on Snowflake:
- to load and clean up data
- to perform common pandas operations and transformations at scale
- to visualize data

Download the Git repo here.
Install the Cybersyn Finance and Economics dataset from Snowflake Marketplace.
Run the following SQL commands in a SQL worksheet to create the warehouse, database and schema.

USE ROLE ACCOUNTADMIN;

-- Roles
SET MY_USER = CURRENT_USER();
CREATE OR REPLACE ROLE PANDAS_ROLE;
GRANT ROLE PANDAS_ROLE TO ROLE SYSADMIN;
GRANT ROLE PANDAS_ROLE TO USER IDENTIFIER($MY_USER);

GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE PANDAS_ROLE;

-- Databases
CREATE OR REPLACE DATABASE PANDAS_DB;
GRANT OWNERSHIP ON DATABASE PANDAS_DB TO ROLE PANDAS_ROLE;

-- Warehouses
CREATE OR REPLACE WAREHOUSE PANDAS_WH WAREHOUSE_SIZE = XSMALL, AUTO_SUSPEND = 300, AUTO_RESUME= TRUE;
GRANT OWNERSHIP ON WAREHOUSE PANDAS_WH TO ROLE PANDAS_ROLE;


-- ----------------------------------------------------------------------------
-- Step #3: Create the database level objects
-- ----------------------------------------------------------------------------
USE ROLE PANDAS_ROLE;
USE WAREHOUSE PANDAS_WH;
USE DATABASE PANDAS_DB;

-- Schemas
CREATE OR REPLACE SCHEMA PUBLIC;

-- Create a table from the secure shared view
CREATE OR REPLACE TABLE STOCK_PRICE_TIMESERIES AS SELECT * FROM FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.STOCK_PRICE_TIMESERIES;

These can also be found in the setup.sql file.

Navigate To Snowflake Notebooks

Navigate to the Notebooks section by clicking Projects and then Notebooks
Click on the *down arrow next to + Notebook
Select Import .ipynb file.

Import .ipynb File

Download this ipynb file to your machine.
Navigate to where you have downloaded the ipynb file and select 0_start_here.ipynb and click Open
Give the notebook a name.
Select the database PANDAS_DB and schema PUBLIC to store the notebook
Select the warehouse PANDAS_WH for the notebook to run on.

Once you have created a notebook based on the ipynb file

Add Required Python Libraries

pandas on Snowflake is available as part of the Snowpark Python package (version 1.17 and above). Snowpark Python comes pre-installed with the Snowflake Notebooks environment.

Additionally, you will need to add the modin package in the Packages dropdown.

In the Notebook click on Packages
Search for modin and select modin in the list

Run code in Notebook

Within this notebook, we will import the Snowpark pandas API, connect to Snowflake, and perform common pandas operations on a financial time series dataset.

Congratulations, you have successfully completed this quickstart! Through this quickstart, we were able to showcase how pandas on Snowflake allows pandas developers to easily get started processing and analyzing data at tremendous scale using familiar programming constructs and APIs.

What you learned

How to install and configure the Snowpark pandas library
How to use pandas on Snowflake to transform and analyze large datasets using the power of Snowflake

Related Resources

For more information, check out the resources below: