This Quickstart guide contains key tips for optimal performance when using Snowpark Python. The guide is broken up into multiple labs, each covering a key concept that can improve performance and/or efficiency while running workloads in Snowpark.

Prerequisites

What You'll Learn

In this quickstart, you will learn how to make optimized decisions when using Snowpark Python. These choices will be compared with others to show performance improvements. Each concept is broken up into a lab, listed below:

What You'll Need

Python Environment Setup

Let's set up the Python environment necessary to run this quickstart:

First, clone the source code for this repo to your local environment:

git clone https://github.com/Snowflake-Labs/sfguide-snowpark-python-top-three-tips-for-optimal-performance
cd sfguide-snowpark-python-top-three-tips-for-optimal-performance

Snowpark Python via Anaconda

If you are using Anaconda on your local machine, create a conda env for this quickstart:

conda env create -f conda_env.yml
conda activate pysnowpark

Conda will automatically install snowflake-snowpark-python and all other dependencies for you.

Now, launch Jupyter Notebook on your local machine:

jupyter notebook

Troubleshooting pyarrow-related issues

Snowflake Environment Setup

Create a fresh database in your Snowflake account for the following labs. Log in to Snowsight and issue the following command:

CREATE DATABASE SNOWPARK_BEST_PRACTICES_LABS;

We'll also standardize the warehouse we will use for the following labs. Run the following command:

CREATE OR REPLACE WAREHOUSE compute_wh WAREHOUSE_SIZE=SMALL INITIALLY_SUSPENDED=TRUE;

Edit the credentials.json file

Throughout all the following labs, we will need to connect your notebook IDE to Snowflake. This is done via a credentials file. Edit the credentials.json file found in the cloned repo with your Snowflake account information.
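
For reference, below is a minimal sketch of how a notebook can build a Snowpark session from that file. The key names shown are an assumption; match them to the credentials.json in the repo.

import json
from snowflake.snowpark import Session

# Assumes credentials.json holds Snowpark connection parameters such as
# "account", "user", "password", "role", "warehouse", and "database".
with open("credentials.json") as f:
    connection_parameters = json.load(f)

session = Session.builder.configs(connection_parameters).create()
print(session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE()").collect())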

Snowpark is essentially a wrapper that converts Spark- or pandas-style DataFrame operations into SQL at execution time, so data engineers don't need to spend excessive time figuring out what happens behind the scenes. This notebook walks through why you should use the Snowpark DataFrame and avoid the pandas DataFrame.

Though pandas offers many useful APIs for day-to-day data engineering work, the challenge is that it does not scale linearly with data volume because it processes everything on a single machine. For example, pandas fails with an out-of-memory error if it attempts to read a dataset larger than the memory available on that machine.

Now we will see how pandas works with Snowpark and how Snowflake has tried to parallelize the processing of the pandas API. Let's use a simple example: removing duplicates from a 6M-row sample table.
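
As a rough sketch of what the notebook does (the table names here are hypothetical), the Snowpark version pushes the deduplication down to the warehouse, while the pandas version first pulls every row to the client:

# Snowpark DataFrame: drop_duplicates() is translated to SQL and runs entirely
# inside the warehouse, so no rows are pulled to the client.
df = session.table("SAMPLE_6M_ROW_TABLE")  # hypothetical table name
df.drop_duplicates().write.save_as_table("SAMPLE_6M_ROW_TABLE_DEDUP", mode="overwrite")

# pandas DataFrame: to_pandas() first downloads all ~6M rows into local memory,
# which is where the out-of-memory risk and the slowdown come from.
pdf = df.to_pandas().drop_duplicates()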

Prerequisites

What You'll Do

Run the Lab

Open up the Jupyter notebook titled lab1_avoid_pandas_df and run each of the cells.

Conclusion

If you are interested, you can go to the Snowflake UI to check the query history of the runs. I have summarized the performance differences between the Snowpark DataFrame and the pandas DataFrame, both using the same small warehouse. As you can see below, the Snowpark DataFrame is 8X faster than the pandas DataFrame, so it's always a best practice to leverage the Snowpark DataFrame as much as you can!

Snowpark DataFrame vs. pandas DataFrame comparison

Lab Summary

The Snowpark API provides methods that help you create a User Defined Function (UDF), either from a lambda or a regular Python function. When you create a UDF, the Snowpark library uploads the code to an internal stage. When you call the UDF, the Snowpark library executes your function on the server, where the data is. As a result, the data doesn't need to be transferred to the client in order for the function to process it.
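
For illustration, here is a minimal, hypothetical sketch of registering a UDF from a lambda (it assumes an active Snowpark session, and the table and column names are made up):

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType

# Registering the UDF uploads the lambda to an internal stage; calling it in a
# query executes it server-side, next to the data.
add_one = udf(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])

df = session.table("SOME_TABLE")       # hypothetical table
df.select(add_one(df["ID"])).show()    # hypothetical column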

Naturally, you're wondering then, what is a Vectorised UDF?

The Python UDF batch API enables defining Python functions that receive batches of input rows (also known as chunked rows) as pandas DataFrames and return batches of results as pandas arrays or Series. A column in the Snowpark DataFrame is vectorized as a pandas Series inside the UDF.
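
As a minimal sketch (not the lab's exact code, and assuming an active Snowpark session), the same logic written as a normal UDF and as a Vectorised UDF looks like this:

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import PandasSeries

# Normal UDF: invoked once per row, with each value passed as a Python float.
@udf
def add_tax(price: float) -> float:
    return price * 1.1

# Vectorised UDF: invoked once per batch; the whole column arrives as a pandas
# Series, so the arithmetic is applied to many rows in a single call.
@udf
def add_tax_vectorised(price: PandasSeries[float]) -> PandasSeries[float]:
    return price * 1.1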

The goal of this lab is to compare the performance of a UDF Operation with and without using the Python UDF Batch API (or Vectorised UDFs).

In this notebook, we will do the following:

Prerequisites

What You'll Do

This lab consists of the following major steps:

  1. Importing libraries and connecting to Snowflake
  2. Creating the dataframes of test data from the TPCDS dataset
  3. Setting up the virtual warehouses prior to testing
  4. Running four test suites on typical UDF operations
    • Numeric Computation (~3 minutes)
    • String Manipulation (~5 minutes)
    • Batch Sizing (~5 minutes)
    • Timestamp Manipulation (~60 minutes)
  5. Analyzing the results, forming conclusions, and cleanup

Run the lab

Open up the Jupyter notebook titled lab2_vectorized_udfs and run each of the cells.

Analysis

Below is an analysis of the tests run in the notebook. Please compare these findings with the results in your notebook.

Recap

Before we look into the results, let's recap what we have done so far:

Numeric Computation Use Cases

Numeric Results Table

For numerical computation, keeping the same dataset and warehouse size, Vectorised UDFs outperform normal UDFs.

Non-Numeric Computation Use Cases

NonNumeric Results Table

As expected, it makes sense not to use Vectorised UDFs for non-numeric operations.

Batch Sizes

Additionally, when using Vectorised UDFs, you may want to experiment with batch sizes.

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import PandasSeries

@udf(max_batch_size=1000)  # limit each batch passed to the UDF to at most 1,000 rows
def vect_udf_bc_100(inp: PandasSeries[float]) -> PandasSeries[float]:
    return (inp - df_customer_100_mean + df_customer_100_stddev) * 10000.0

It is important to note:

Conclusion

Your Python code must operate efficiently on batches of rows

Cleanup

Lab Summary

When using Snowpark, it is common for data engineers and data scientists to create pickle files, upload them to an internal/external stage, and use them with Snowflake UDFs and Stored Procedures (SPs). This lab shows how the Python library Cachetools can be used to speed up UDF or SP performance by ensuring the logic is cached in memory for repeated reads. For simplicity, we will demonstrate this scenario using a pickle file that has a dictionary object serialized.

Background Information - Cachetools and pickle files

Cachetools

Cachetools is a Python library that provides a collection of caching algorithms. It offers simple and efficient caching methods, such as LRU (Least Recently Used) and FIFO (First In First Out), to store a limited number of items for a specified duration. The library is useful in applications where temporarily storing data in memory improves performance. It's easy to use and can be integrated into an application with just a few lines of code.
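
As a minimal, hypothetical illustration of the pattern used later in this lab, a Cachetools decorator can wrap an expensive load so repeated calls within the same process reuse the in-memory result:

import pickle
import cachetools

# The decorated function reads and unpickles the file at most once per cached
# path; subsequent calls return the object already held in memory.
@cachetools.cached(cache=cachetools.LRUCache(maxsize=8))
def load_reference_data(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)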

The Cachetools library with a Snowpark UDF can be used for:

Pickle File

A pickle file is a file format used to store Python objects as binary data. It is used in data engineering to store and retrieve data objects that can be used across different applications or scripts. For example, you might use pickle to store data objects such as pandas DataFrames, NumPy arrays, or even custom Python classes and functions. By using pickle, you can save these objects to disk and load them later without having to recreate the data from scratch.
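
For instance, a small dictionary can be pickled to disk and loaded back later without rebuilding it (the file name here is illustrative):

import pickle

weekdays = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
            5: "Friday", 6: "Saturday", 7: "Sunday"}

# Serialize the dictionary to a binary file...
with open("weekdays.pkl", "wb") as f:
    pickle.dump(weekdays, f)

# ...and load it back later, possibly from a different script or application.
with open("weekdays.pkl", "rb") as f:
    loaded = pickle.load(f)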

Scenario

There is a fictitious system that generates transactional data along with day-of-the-week information for each row. To access the names of the weekdays, we are required to use a Python pickle file shared by the upstream data teams. This file contains the names of the weekdays, which may change in different scenarios. To ensure compatibility and security, the upstream data teams only provide access to this information through the pickle file.

To analyze this data in Snowflake with Snowpark Python, we have created a Python User Defined Function (UDF) to efficiently retrieve the corresponding day name based on the day number passed as input. This UDF provides a convenient and reliable way to access the day names from the pickle file, allowing us to perform the necessary DataFrame transformations for our analysis.

Lately, as the number of transactions grows, we are seeing performance degradation in our select queries that use the UDF to fetch the required information from the pickle file. We need to find ways to improve the performance of the Python UDF we have created.
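
The idea the lab demonstrates is sketched below. This is a hypothetical version, assuming the pickle file has been uploaded to a stage and attached with session.add_import(), and that cachetools has been added via session.add_packages(); the Cachetools decorator makes each Python worker unpickle the file once, instead of once per row:

import sys
import pickle
import cachetools
from snowflake.snowpark.functions import udf

@cachetools.cached(cache={})
def load_weekdays() -> dict:
    # Files attached with session.add_import() are exposed to the UDF through
    # the snowflake_import_directory option.
    import_dir = sys._xoptions.get("snowflake_import_directory")
    with open(import_dir + "weekdays.pkl", "rb") as f:
        return pickle.load(f)

@udf
def day_name(day_number: int) -> str:
    return load_weekdays()[day_number]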

What You'll Do

This lab consists of the following major steps:

  1. Importing libraries and connecting to Snowflake
  2. Creating a pickle file
  3. Running a UDF without Cachetools
  4. Running a UDF with Cachetools
  5. Conclusion and Cleanup

Run the lab

Open up the Jupyter notebook titled lab3_cachetools_library and run each of the cells.

Conclusion

Based on our testing, we identified that the total query duration when using the Cachetools decorator for this use case yields a significant performance improvement for Python UDF execution.

Congratulations! You've successfully performed all the labs illustrating the best practices for getting optimal performance when using Snowpark Python. You're now ready to implement these concepts in your own projects.

What You Learned

Related Resources