This Quickstart guide contains key tips for optimal performance when using Snowpark Python. The guide is broken up into multiple labs, each covering a key concept that can improve performance and/or efficiency while running workloads in Snowpark.
In this quickstart, you will learn how to make optimized decisions when using Snowpark Python. These choices will be compared with others to show performance improvements. Each concept is broken up into a lab, listed below:
Let's set up the Python environment necessary to run this quickstart:
First, clone the source code for this repo to your local environment:
git clone https://github.com/Snowflake-Labs/sfguide-snowpark-python-top-three-tips-for-optimal-performance
cd sfguide-snowpark-python-top-three-tips-for-optimal-performance
If you are using Anaconda on your local machine, create a conda env for this quickstart:
conda env create -f conda_env.yml
conda activate pysnowpark
Conda will automatically install snowflake-snowpark-python
and all other dependencies for you.
Now, launch Jupyter Notebook on your local machine:
jupyter notebook
Note on `pyarrow`-related issues: if you have the `pyarrow` library already installed, uninstall it before installing Snowpark. If you do not have `pyarrow` installed, you do not need to install it yourself; installing Snowpark automatically installs the appropriate version. Do not reinstall a different version of `pyarrow` after installing Snowpark.

Create a fresh database in your Snowflake account for the following labs. Log in to Snowsight and issue the following command:
CREATE DATABASE SNOWPARK_BEST_PRACTICES_LABS;
We'll also standardize the warehouse we will use for the following labs. Run the following command:
CREATE OR REPLACE WAREHOUSE compute_wh WAREHOUSE_SIZE=SMALL INITIALLY_SUSPENDED=TRUE;
Throughout all the following labs, we will need to connect your notebook IDE to Snowflake. This is done via a credentials file. Edit the credentials.json
file found in the cloned repo with your Snowflake account information.
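As a rough sketch, the file usually looks something like the following; the exact field names here are illustrative, so match the keys already present in the repo's `credentials.json`:

```json
{
  "account": "<your_account_identifier>",
  "user": "<your_username>",
  "password": "<your_password>",
  "role": "<your_role>"
}
```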
Set the warehouse to `COMPUTE_WH` as created above, the default database to `SNOWPARK_BEST_PRACTICES_LABS`, and the default schema to `PUBLIC`.

Snowpark is essentially a wrapper that converts Spark- or Pandas-style API calls into SQL at execution time, so data engineers don't need to spend excessive time figuring out what happens behind the scenes. This notebook walks through why you should use the Snowpark DataFrame and avoid the Pandas DataFrame.
Though Pandas offers many excellent APIs for day-to-day data engineering work, the challenge is that it does not scale linearly with your data volume, because all processing happens on a single machine. For example, Pandas fails with an out-of-memory error if it attempts to read a dataset larger than the memory available on that machine.
Now we will see how Pandas works with Snowpark and how Snowflake has tried to parallelize the processing of the Pandas API. Let's use a simple example: removing duplicates from a 6M-row sample table.
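To make the contrast concrete, here is a minimal sketch of the two approaches. The pandas portion runs locally; the Snowpark portion is shown as comments because it requires a live session, and the table names in it are illustrative, not from the lab:

```python
import pandas as pd

# Local pandas dedup: the entire dataset must fit in this machine's memory.
pdf = pd.DataFrame({"id": [1, 1, 2, 3, 3], "val": ["a", "a", "b", "c", "c"]})
deduped = pdf.drop_duplicates()
print(len(deduped))  # 3

# Snowpark equivalent (executed as SQL inside the warehouse; no rows are
# pulled down to the client). Table names are hypothetical:
#   df = session.table("SAMPLE_TABLE")
#   df.drop_duplicates().write.save_as_table("SAMPLE_TABLE_DEDUPED", mode="overwrite")
```

The key design difference: the Snowpark `drop_duplicates` is translated to SQL and scales with the warehouse, while the pandas version scales only with local memory.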
Open up the Jupyter notebook titled `lab1_avoid_pandas_df` and run each of the cells.
If you are interested, you can go to the Snowflake UI to check the query history of the runs. I have summarized the performance differences between the Snowpark DataFrame and the Pandas DataFrame, both using the same Small warehouse. As you can see, the Snowpark DataFrame is 8X faster than the Pandas DataFrame, and it's always best practice to leverage the Snowpark DataFrame as much as you can!
The Snowpark API provides methods that help to create a User Defined Function. This can be done either using a Lambda or a typical Function in Python. When you create these UDFs, the Snowpark library uploads this code to an internal stage. When you call these UDFs, the Snowpark library executes your function on the server, where the data is. As a result, the data doesn't need to be transferred to the client in order for the function to process the data.
Naturally, you're wondering then, what is a Vectorised UDF?
The Python UDF batch API enables defining Python functions that receive batches of input rows (aka chunked rows) as Pandas DataFrames and return batches of results as Pandas arrays or Series. The column in the Snowpark dataframe will be vectorized as a Pandas Series inside the UDF.
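To make the contrast concrete, here is a minimal local sketch of the two shapes of UDF body. In Snowpark you would register these with the `@udf` decorator (using `PandasSeries` type hints for the vectorised version); only the batch-level logic is shown here so it can run anywhere:

```python
import pandas as pd

# Scalar UDF body: invoked once per row.
def add_one_scalar(x: float) -> float:
    return x + 1.0

# Vectorised UDF body: invoked once per batch, receiving a whole pandas
# Series and returning a Series of the same length.
def add_one_vectorised(batch: pd.Series) -> pd.Series:
    return batch + 1.0

batch = pd.Series([1.0, 2.0, 3.0])
print(add_one_vectorised(batch).tolist())  # [2.0, 3.0, 4.0]
```

The vectorised form amortizes per-call overhead and lets pandas/NumPy apply the arithmetic to the whole batch at once, which is where the speedup for numeric workloads comes from.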
The goal of this lab is to compare the performance of a UDF Operation with and without using the Python UDF Batch API (or Vectorised UDFs).
In this notebook, we will do the following:
This lab consists of the following major steps:
Open up the Jupyter notebook titled `lab2_vectorized_udfs` and run each of the cells.
Below is an analysis of the tests run in the notebook. Please compare these findings with the results in your notebook.
Before we look into the results, let's recap what we have done so far:
For numerical computation, keeping the same dataset and warehouse size, Vectorised UDFs outperform Normal UDFs. You can use the `GET_QUERY_OPERATOR_STATS()` table function for a more detailed and thorough analysis.

As expected, it makes sense not to use Vectorised UDFs for non-numeric operations. Again, you can use the `GET_QUERY_OPERATOR_STATS()` table function for a more detailed and thorough analysis.

Additionally, when using Vectorised UDFs, you may want to experiment with batch sizes:
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import PandasSeries

# df_customer_100_mean and df_customer_100_stddev are computed earlier in the notebook
@udf(max_batch_size=1000)
def vect_udf_bc_100(inp: PandasSeries[float]) -> PandasSeries[float]:
    return (inp - df_customer_100_mean + df_customer_100_stddev) * 10000.0
It is important to note:

- Setting `max_batch_size = 50000` does not mean 50k records are sent per batch; it only sets an upper bound on the batch size.
- Your Python code must operate efficiently on a batch of rows.
- Use the `GET_QUERY_OPERATOR_STATS()` table function to further optimize.

When using Snowpark, it is common for data engineers and data scientists to create pickle files, upload them to an internal/external stage, and use them with Snowflake UDFs and Stored Procedures (SPs). This lab will show how the Python library Cachetools can be used to speed up UDF or SP performance by ensuring the logic is cached in memory in cases of repeated reads. For simplicity, we will demonstrate this scenario using a pickle file that has a dictionary object serialized.
Cachetools is a Python library that provides a collection of caching algorithms. It offers simple and efficient caching methods, such as LRU (Least Recently Used) and FIFO (First In First Out), to store a limited number of items for a specified duration. The library is useful in applications where temporarily storing data in memory improves performance. It's easy to use and can be integrated into an application with just a few lines of code.
The Cachetools library can be used with Snowpark UDFs for:
A pickle file is a file format used to store Python objects as binary data. It is used in data engineering to store and retrieve data objects that can be used across different applications or scripts. For example, you might use pickle to store data objects such as pandas dataframes, numpy arrays, or even custom Python classes and functions. By using pickle, you can save these objects to disk and load them later without having to recreate the data from scratch.
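A minimal round-trip looks like this (the dictionary here is a stand-in for the weekday mapping used later in the lab; an in-memory buffer is used instead of a file so the sketch runs anywhere):

```python
import io
import pickle

# Serialize a small dictionary to binary pickle data
data = {1: "Monday", 2: "Tuesday"}
buf = io.BytesIO()
pickle.dump(data, buf)

# Deserialize it back without recreating it from scratch
buf.seek(0)
restored = pickle.load(buf)
print(restored == data)  # True
```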
There is a fictitious system that generates transactional data along with day-of-the-week information for each row. To access the names of the weekdays, we are required to use a Python pickle file shared by the upstream data teams. This file contains the names of the weekdays, which may change in different scenarios. To ensure compatibility and security, the upstream data teams only provide access to this information through the pickle file.
To analyze this data in Snowflake Snowpark Python, we have created a Python User Defined Function (UDF) to efficiently retrieve the corresponding day name based on the day number which comes as an input. This UDF provides a convenient and reliable way to access the day names from the pickle file, allowing us to perform the necessary DataFrame transformations for our analysis.
Lately, as the number of transactions grows, we are seeing performance degradation in our select queries, which use the UDF to fetch the required information from the pickle file. We need to find ways to improve the performance of this Python UDF.
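The pattern the lab demonstrates is to cache the deserialized object so it is loaded once per process rather than once per call. Here is a runnable sketch of that idea; it uses the stdlib's `functools.lru_cache` as an analog of `cachetools.cached` so it runs without extra dependencies, and the weekday pickle file it creates is hypothetical:

```python
import os
import pickle
import tempfile
from functools import lru_cache  # stdlib analog of cachetools.cached

# Create a sample pickle of weekday names (stands in for the upstream team's file)
day_names = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}
pkl_path = os.path.join(tempfile.gettempdir(), "day_names.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(day_names, f)

@lru_cache(maxsize=1)
def read_day_names(path: str) -> dict:
    # Without caching, every UDF invocation would re-read and deserialize the file
    with open(path, "rb") as f:
        return pickle.load(f)

# The UDF body: looks up the day name, hitting the cache after the first call
def day_name(day_num: int) -> str:
    return read_day_names(pkl_path)[day_num]

print(day_name(5))  # Friday
```

Inside a Snowpark UDF the same shape applies: decorate the file-reading helper with `cachetools.cached` so repeated rows reuse the in-memory dictionary instead of re-deserializing the staged pickle file.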
This lab consists of the following major steps:
Open up the Jupyter notebook titled `lab3_cachetools_library` and run each of the cells.
Based on our testing, we have identified that using the Cachetools decorator for this use case yields a significant reduction in total query duration for the Python UDF execution.
Congratulations! You've successfully performed all the labs illustrating the best practices for getting optimal performance when using Snowpark Python. You're now ready to implement these concepts in your own projects.