Through this quickstart, you will learn how to get started with Snowpark Connect for Apache Spark™. Using Snowpark Connect for Apache Spark, you can run Spark workloads directly on Snowflake.
By the end of this quickstart, you will know how to connect your existing Spark workloads to Snowflake and run PySpark DataFrame code on a Snowflake warehouse.
Snowpark is the set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake. Snowpark can be used to build data pipelines, ML models, apps, and other data processing workloads.
With Snowpark Connect for Apache Spark, you can connect your existing Spark workloads directly to Snowflake and run them on the Snowflake compute engine. Snowpark Connect for Spark supports using the Spark DataFrame API on Snowflake, and all workloads run on a Snowflake warehouse. As a result, you can run your PySpark DataFrame code with all the benefits of the Snowflake engine.
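For illustration, here is a minimal sketch of the kind of PySpark DataFrame code that can run unchanged on a Snowflake warehouse. The session bootstrap shown (start_session / get_session from the snowpark-connect package) is an assumption; consult the package documentation for the exact call. The DataFrame operations themselves are standard PySpark.

```python
# Minimal sketch: standard PySpark DataFrame code running on Snowflake.
# The session bootstrap below assumes the snowpark-connect package exposes
# start_session()/get_session(); check the package docs for the exact API.
from snowflake import snowpark_connect
from pyspark.sql.functions import col, avg

snowpark_connect.start_session()        # assumed helper: routes Spark work to Snowflake
spark = snowpark_connect.get_session()  # assumed helper: returns a Spark Connect SparkSession

df = spark.createDataFrame(
    [("US", 100.0), ("US", 150.0), ("DE", 80.0)],
    schema=["country", "amount"],
)

# Familiar DataFrame transformations; execution happens on a Snowflake warehouse.
result = (
    df.filter(col("amount") > 50)
      .groupBy("country")
      .agg(avg("amount").alias("avg_amount"))
)
result.show()
```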
In Apache Spark™ version 3.4, the Apache Spark community introduced Spark Connect. Its decoupled client-server architecture separates the user's code from the Spark cluster where the work is done. This new architecture makes it possible for Snowflake to power Spark jobs.
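To make the decoupling concrete, the snippet below shows how a plain Spark Connect client attaches to a remote server in open-source Spark 3.4+. The sc:// endpoint is purely illustrative; with Snowpark Connect for Spark, the snowpark-connect package establishes the connection to Snowflake for you, so you normally do not construct this URL yourself.

```python
# Open-source Spark Connect: the client builds a lightweight SparkSession that
# sends unresolved query plans to a remote server over gRPC.
# The "sc://..." endpoint below is illustrative only; with Snowpark Connect for
# Spark, the connection to Snowflake is handled by the snowpark-connect package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(5).count())  # the plan executes server-side; results stream back to the client
```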
During this step, you will set up a Snowflake notebook and run PySpark code on Snowflake:
1. Sign up for a Snowflake Free Trial account and log in to the Snowflake home page.
2. Download the snowparkconnect_demo.ipynb notebook from this git repository.
3. Navigate to Projects and click on Notebooks.
4. Click + Notebook and select Import ipynb file.
5. Choose the snowparkconnect_demo.ipynb file you downloaded earlier.
6. Select the snowflake_learning_db database and the public schema.
7. For the Run on warehouse option, select compute_wh as the query warehouse and click Create.

Now you have successfully imported the notebook that contains PySpark code.
Next, open the Packages drop-down at the top right of the notebook, search for the snowpark-connect package, and install it using the package picker.
After the installation is complete, start or restart the notebook session.
Follow along and run each of the cells in the Notebook.
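As a rough illustration of what the notebook cells do, a typical cell reads a Snowflake table into a Spark DataFrame, transforms it with the DataFrame API, and writes the result back to Snowflake. The table and column names below (CUSTOMERS, REGION, CUSTOMER_SUMMARY) are hypothetical placeholders, not the notebook's actual data, and the get_session call is an assumed helper from the snowpark-connect package.

```python
# Hypothetical example of the kind of cell you will run; table and column names
# are placeholders, not the notebook's actual data.
from snowflake import snowpark_connect
from pyspark.sql.functions import col, count

spark = snowpark_connect.get_session()  # assumed helper from the snowpark-connect package

# Read an existing Snowflake table as a Spark DataFrame.
customers = spark.sql("SELECT * FROM snowflake_learning_db.public.CUSTOMERS")

# Transform with the Spark DataFrame API; the work runs on the compute_wh warehouse.
summary = (
    customers.filter(col("LIFETIME_VALUE") > 0)
             .groupBy("REGION")
             .agg(count("*").alias("NUM_CUSTOMERS"))
)

# Write the result back to Snowflake as a table.
summary.write.mode("overwrite").saveAsTable("snowflake_learning_db.public.CUSTOMER_SUMMARY")
```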
Congratulations, you have successfully completed this quickstart!