This Quickstart will cover the basics of data engineering with Snowpark for Python. By completing this guide, you will be able to build a data pipeline to process data from different sources, and periodically run the pipeline to update your data tables in Snowflake.

Here is a summary of what you will be able to learn in each step by following this quickstart:

In case you are new to some of the technologies mentioned above, here is a quick summary with the links to documentation.

What is Snowpark?

Snowpark is the set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake. Snowpark can be used to build data pipelines, ML models, apps, and other data processing tasks.

Client Side Libraries - Snowpark libraries can be downloaded and installed into any client-side notebook or IDE and are used for code development and deployment. Libraries include the Snowpark API for data pipelines and apps and the Snowpark ML API for end-to-end machine learning.

Code Execution Environments - Snowpark provides elastic compute environments for secure execution of your code in Snowflake. These server-side capabilities allow users to bring in and run custom logic in Python with user-defined functions and stored procedures.
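To make that client/server split concrete, here is a minimal, hedged Snowpark Python sketch (not part of this Quickstart's scripts; the UDF and sample data are purely illustrative). The DataFrame code and the UDF are authored on the client, but both execute inside Snowflake:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import IntegerType

# Assumes connection details are available (for example via ~/.snowflake/connections.toml)
session = Session.builder.getOrCreate()

# This Python function is registered as a UDF and runs server-side in Snowflake
double_it = udf(lambda x: x * 2, return_type=IntegerType(), input_types=[IntegerType()])

df = session.create_dataframe([[1], [2], [3]], schema=["N"])
df.select(col("N"), double_it(col("N")).alias("DOUBLED")).show()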

Learn more about Snowpark.

Snowpark

Working with Snowflake Marketplace

Snowflake Marketplace provides visibility to a wide variety of datasets from third-party data stewards which broaden access to data points used to transform business processes. Snowflake Marketplace also removes the need to integrate and model data by providing secure access to data sets fully maintained by the data provider.

What you will learn

Prerequisites

What You'll Need

You will need the following things before beginning:

Fork the GitHub Repository

The very first step is to fork the GitHub repository associated with this Quickstart, Intro to Data Engineering with Snowpark Python. This repository contains all the code you need to successfully complete this Quickstart guide. Click on the "Fork" button near the top right, complete any required fields, and click "Create Fork".

Create a GitHub Personal Access Token

In order for Snowflake to authenticate to your GitHub repository, you will need to generate a personal access token. To create a personal access token, follow the instructions located here.

Make sure to note down the token; in step 3 of this Quickstart we will store it securely within a Snowflake secret object.

Create GitHub Codespace

For this Quickstart we will be using GitHub Codespaces for our development environment. Codespaces provide a hosted, web-based VS Code development environment. GitHub currently offers 60 hours free each month on a 2-core machine, which should be plenty to work through this lab.

To create a GitHub Codespace, click on the green <> Code button from the GitHub repository homepage. In the Code popup, click on the Codespaces tab and then click the green Create codespace on main button.

This will open a new tab and begin setting up your codespace. It will take a few minutes as it sets up the entire environment for this Quickstart. Here is what is being done for you:

Once the codespace has been created and started you should see a hosted web-based version of VS Code with your forked repository set up! Just a couple more things and we're ready to start.

Configure Snowflake Credentials

We will not be directly using the SnowCLI command line client for this Quickstart, but we will be storing our Snowflake connection details in the SnowCLI connections file located at ~/.snowflake/connections.toml. A default connection file was created for you during the codespace setup.

The easiest way to edit the default ~/.snowflake/connections.toml file is directly from VS Code in your codespace. Press Command-P (Ctrl-P on Windows/Linux), type (or paste) ~/.snowflake/connections.toml, and hit return. The SnowCLI config file should now be open. Edit the file, replacing the accountname, username, and password with your values, then save and close the file.

Note: The SnowCLI tool (and by extension this Quickstart) currently does not work with Key Pair authentication. It simply grabs your username and password details from the shared SnowCLI config file.

Verify Your Anaconda Environment is Activated

During the codespace setup we created an Anaconda environment named snowflake-demo. And when VS Code started up it should have automatically activated the environment in your terminal. You should see something like this in the terminal, and in particular you should see (snowflake-demo) before your bash prompt.

Snowflake Extensions for VS Code

You can run SQL queries and Python APIs against Snowflake in many different ways through the Snowsight UI, SnowCLI, etc. But for this Quickstart we'll be using the Snowflake extension for VS Code. For a brief overview of Snowflake's native extension for VS Code, please check out our VS Code Marketplace Snowflake extension page.

To put this in context, we are on step #3 in our data flow overview:

Quickstart Pipeline Overview

Create Git Integration, Roles, Databases, Tables, Schema and Stages

You can log into Snowsight or use VS Code to create all the Snowflake objects needed to work through this guide.

For the purpose of this Quickstart, we will initially use VS Code to create the integration to our Git repository. We will then create the remaining objects using a SQL script that Snowflake executes directly from the Git repository. This ensures we are always running the most recent version of the code, and avoids introducing issues by copying and pasting code.

Open '03_git_config.sql' in VS Code.

Execute in VS Code

Let's run through the commands individually and understand what each command does.

Creating Account Level Objects

USE ROLE SECURITYADMIN;
SET MY_USER = CURRENT_USER();

-- Create the GIT_ADMIN role and grant it to SYSADMIN and to your user
CREATE ROLE IF NOT EXISTS GIT_ADMIN;
GRANT ROLE GIT_ADMIN TO ROLE SYSADMIN;
GRANT ROLE GIT_ADMIN TO USER IDENTIFIER($MY_USER);

-- Create a database to hold the Git objects and transfer ownership to GIT_ADMIN
USE ROLE SYSADMIN;
CREATE OR REPLACE DATABASE GIT_REPO;
USE SCHEMA PUBLIC;
GRANT OWNERSHIP ON DATABASE GIT_REPO TO ROLE GIT_ADMIN;
USE DATABASE GIT_REPO;
GRANT OWNERSHIP ON SCHEMA PUBLIC TO ROLE GIT_ADMIN;

-- Store the GitHub personal access token in a Snowflake secret
USE ROLE GIT_ADMIN;
USE DATABASE GIT_REPO;
USE SCHEMA PUBLIC;
CREATE OR REPLACE SECRET GIT_SECRET
    TYPE = PASSWORD
    USERNAME = '<your_git_user>'
    PASSWORD = '<your_personal_access_token>';

--Create an API integration for interacting with the repository API
USE ROLE ACCOUNTADMIN; 
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE GIT_ADMIN;
USE ROLE GIT_ADMIN;

CREATE OR REPLACE API INTEGRATION GIT_API_INTEGRATION 
    API_PROVIDER = GIT_HTTPS_API 
    API_ALLOWED_PREFIXES = ('https://github.com/<your_git_user>') 
    ALLOWED_AUTHENTICATION_SECRETS = (GIT_SECRET) 
    ENABLED = TRUE;
    
CREATE OR REPLACE GIT REPOSITORY DE_QUICKSTART 
    API_INTEGRATION = GIT_API_INTEGRATION 
    GIT_CREDENTIALS = GIT_SECRET 
    ORIGIN = '<your git repo URL ending in .git>';

Verify Git Repo and Execute Script to Create Lab Specific Objects

Once the Git repository object is created, you can view the branches in the repo and list its files, similar to how you interact with stages. Run each line separately to understand how its output can be used.

SHOW GIT BRANCHES IN DE_QUICKSTART;
ls @DE_QUICKSTART/branches/main;

Now that we know our repository is accessible, we will create the objects and permissions necessary to run the lab. This uses Jinja2 templating, which lets you substitute variable values at run time and also supports looping and other capabilities. In today's example we simply pass in the current user so that the role used for the lab is granted to the current user, but you could also use variables to define the environment the code is deployed to and more. More information on using EXECUTE IMMEDIATE FROM and Jinja2 templating is available in the documentation here.

Execute the following lines together.

USE ROLE ACCOUNTADMIN;
SET MY_USER = CURRENT_USER();
EXECUTE IMMEDIATE
    FROM @GIT_REPO.PUBLIC.DE_QUICKSTART/branches/main/steps/03_setup_snowflake.sql
    USING (MY_USER=>$MY_USER);

While it is running, let's review what this script is doing.

Creating Account Level Objects

--!jinja
USE ROLE ACCOUNTADMIN;

-- Roles
CREATE OR REPLACE ROLE HOL_ROLE;
GRANT ROLE HOL_ROLE TO ROLE SYSADMIN;
GRANT ROLE HOL_ROLE TO USER {{MY_USER}};

GRANT EXECUTE TASK ON ACCOUNT TO ROLE HOL_ROLE;
GRANT EXECUTE MANAGED TASK ON ACCOUNT TO ROLE HOL_ROLE;
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE HOL_ROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE HOL_ROLE;
GRANT USAGE ON DATABASE GIT_REPO TO ROLE HOL_ROLE;
GRANT USAGE ON SCHEMA GIT_REPO.PUBLIC TO ROLE HOL_ROLE;
GRANT USAGE ON SECRET GIT_REPO.PUBLIC.GIT_SECRET TO ROLE HOL_ROLE;
GRANT READ ON GIT REPOSITORY GIT_REPO.PUBLIC.DE_QUICKSTART TO ROLE HOL_ROLE;
CREATE OR REPLACE DATABASE HOL_DB;
GRANT OWNERSHIP ON DATABASE HOL_DB TO ROLE HOL_ROLE;
CREATE OR REPLACE WAREHOUSE HOL_WH WAREHOUSE_SIZE = XSMALL, AUTO_SUSPEND = 300, AUTO_RESUME = TRUE;
GRANT OWNERSHIP ON WAREHOUSE HOL_WH TO ROLE HOL_ROLE;

Creating Database Level Objects

CREATE OR REPLACE SCHEMA HOL_SCHEMA;

USE SCHEMA HOL_SCHEMA;
CREATE OR REPLACE STAGE FROSTBYTE_RAW_STAGE
    URL = 's3://sfquickstarts/data-engineering-with-snowpark-python/';

During this step we will be "loading" the raw weather data to Snowflake. But "loading" is really the wrong word here. Because we're using Snowflake's unique data sharing capability we don't actually need to copy the data to our Snowflake account with a custom ETL process. Instead we can directly access the weather data shared by Weather Source in the Snowflake Marketplace.

To put this in context, we are on step #4 in our data flow overview:

Quickstart Pipeline Overview

Snowflake Marketplace

Snowflake Marketplace provides visibility to a wide variety of datasets from third-party data stewards which broaden access to data points used to transform business processes. Snowflake Marketplace also removes the need to integrate and model data by providing secure access to data sets fully maintained by the data provider.

But what about data that needs constant updating - like the WEATHER data? We would need to build a pipeline process to constantly update that data to keep it fresh. Perhaps a better way to get this external data would be to source it from a trusted data supplier. Let them manage the data, keeping it accurate and up to date.

Weather data from Snowflake Marketplace

Weather Source is a leading provider of global weather and climate data and their OnPoint Product Suite provides businesses with the necessary weather and climate data to quickly generate meaningful and actionable insights for a wide range of use cases across industries. Let's connect to the Weather Source LLC: frostbyte feed from Weather Source in the Snowflake Marketplace by following these steps:

Snowflake Marketplace

That's it... we don't have to do anything from here to keep this data updated. The provider will do that for us and data sharing means we are always seeing whatever they have published. How amazing is that? Just think of all the things you didn't have to do here to get access to an always up-to-date, third-party dataset!

Query the Weather data

Open the steps/04_load_weather.sql script in VS Code from the file Explorer in the left navigation bar, and select Execute All from the top right corner. Notice how easy it is to query data shared through the Snowflake Marketplace! You access it just like any other table or view in Snowflake:

SELECT * FROM FROSTBYTE_WEATHERSOURCE.ONPOINT_ID.POSTAL_CODES LIMIT 100;

You can also browse the shared table FROSTBYTE_WEATHERSOURCE.ONPOINT_ID.POSTAL_CODES by navigating to the Snowsight UI -> Data -> Databases.
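The same query can also be issued through the Snowpark Python API. Here is a small, hedged sketch (assuming the same connection setup used elsewhere in this lab):

# Query the shared Marketplace data with Snowpark, just like any other table
from snowflake.snowpark import Session

with Session.builder.getOrCreate() as session:
    postal_codes = session.table("FROSTBYTE_WEATHERSOURCE.ONPOINT_ID.POSTAL_CODES")
    postal_codes.limit(100).show()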

During this step we will be loading the raw Tasty Bytes POS and Customer loyalty data from raw Parquet files in s3://sfquickstarts/data-engineering-with-snowpark-python/ to our RAW_POS and RAW_CUSTOMER schemas in Snowflake. And you are going to be orchestrating this process from your laptop in Python using the Snowpark Python API. To put this in context, we are on step #5 in our data flow overview:

Run the Script

To load the raw data, execute the app/05_load_raw_data.py script. This can be done a number of ways, either from a terminal or directly from VS Code. For this demo you will need to execute the Python scripts from the terminal. So go back to the terminal in VS Code, make sure that your snowflake-demo conda environment is active, then run the following command (which assumes that your terminal is at the root of your repository):

python app/05_load_raw_data.py

While that is running, please open the script in VS Code and continue on this page to understand what is happening.

Running Snowpark Python Locally

In this step you will be running the Snowpark Python code locally from your laptop. At the bottom of the script is a block of code that is used for local debugging (under the if __name__ == "__main__": block):

# For local debugging
if __name__ == "__main__":
    # Create a local Snowpark session
    with Session.builder.getOrCreate() as session:
        load_all_raw_tables(session)

A few things to point out here. First, the local Snowpark session is created by Session.builder.getOrCreate(). This method either reuses an existing session or creates a new one based on the ~/.snowflake/connections.toml file.

Then after getting the Snowpark session it calls the load_all_raw_tables(session) method which does the heavy lifting. The next few sections will point out the key parts.

Finally, almost all of the Python scripts in this Quickstart include a local debugging block. Later on we will create Snowpark Python stored procedures and UDFs and those Python scripts will have a similar block. So this pattern is important to understand.

Viewing What Happened in Snowflake

The Query History in Snowflake is a very powerful feature that logs every query run against your Snowflake account, no matter which tool or process initiated it. This is especially helpful when working with client tools and APIs.

The Python script you just ran did a small amount of work locally, basically just orchestrating the process by looping through each table and issuing the command to Snowflake to load the data. But all of the heavy lifting ran inside Snowflake! This push-down is a hallmark of the Snowpark API and allows you to leverage the scalability and compute power of Snowflake!

Log in to your Snowflake account and take a quick look at the SQL that was generated by the Snowpark API. This will help you better understand what the API is doing and will help you debug any issues you may run into.
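If you also want to see the generated SQL from the client side, Snowpark provides a query_history() context manager on the session. A small sketch, assuming the session and load_all_raw_tables() from the script above:

# Record the queries Snowpark issues while the block runs (optional, for debugging)
with session.query_history() as query_history:
    load_all_raw_tables(session)

for record in query_history.queries:
    print(record.query_id, record.sql_text[:80])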

Schema Inference

One very helpful feature in Snowflake is the ability to infer the schema of files in a stage that you wish to work with. This is accomplished in SQL with the INFER_SCHEMA() function. The Snowpark Python API does this for you automatically when you read the staged Parquet files with session.read. Here is the code snippet:

# We can infer the schema using the Parquet read option
df = session.read.option("compression", "snappy").parquet(location)
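If you want to confirm what was inferred before loading, you can inspect the DataFrame's schema. A quick sketch:

# Print the schema Snowflake inferred from the staged Parquet files
for field in df.schema.fields:
    print(field.name, field.datatype)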

Data Ingestion with COPY

In order to load the data into a Snowflake table we will use the copy_into_table() method on a DataFrame. This method will create the target table in Snowflake using the inferred schema (if it doesn't exist), and then call the highly optimized Snowflake COPY INTO <table> Command. Here is the code snippet:

df.copy_into_table("{}".format(tname))

Snowflake's Table Format

One of the major advantages of Snowflake is being able to eliminate the need to manage a file-based data lake. And Snowflake was designed for this purpose from the beginning. In this step we are loading the raw data into a structured, Snowflake-managed table. But Snowflake tables can natively support structured and semi-structured data, and are stored in Snowflake's mature cloud table format (which predates Hudi, Delta, or Iceberg).

Once loaded into Snowflake the data will be securely stored and managed, without the need to worry about securing and managing raw files. Additionally the data, whether raw or structured, can be transformed and queried in Snowflake using SQL or the language of your choice, without needing to manage separate compute services like Spark.

This is a huge advantage for Snowflake customers.

Warehouse Elasticity (Dynamic Scaling)

With Snowflake there is only one type of user-defined compute cluster, the Virtual Warehouse, regardless of the language you use to process your data (SQL, Python, Java, Scala, JavaScript, etc.). This makes working with data much simpler in Snowflake. And governance of the data is completely separated from the compute cluster; in other words, there is no way to get around Snowflake governance regardless of the warehouse settings or language being used.

And these virtual warehouses can be dynamically scaled, in under a second for most warehouse sizes! This means that in your code you can dynamically resize the compute environment to increase capacity and run a section of code in a fraction of the time, and then dynamically resize it again to reduce capacity. And because of our per-second billing (with a sixty-second minimum) you won't pay any extra to run that section of code in a fraction of the time!

Let's see how easily that is done. Here is the code snippet:

_ = session.sql("ALTER WAREHOUSE HOL_WH SET WAREHOUSE_SIZE = XLARGE WAIT_FOR_COMPLETION = TRUE").collect()

# Some data processing code

_ = session.sql("ALTER WAREHOUSE HOL_WH SET WAREHOUSE_SIZE = XSMALL").collect()

Please also note that we included the WAIT_FOR_COMPLETION parameter in the first ALTER WAREHOUSE statement. Setting this parameter to TRUE will block the return of the ALTER WAREHOUSE command until the resize has finished provisioning all its compute resources. This way we make sure that the full cluster is available before processing any data with it.

We will use this pattern a few more times during this Quickstart, so it's important to understand.
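Because we repeat this scale-up/scale-down pattern, one way to avoid leaving a warehouse over-sized when an error occurs is to wrap it in a small helper. This is a hedged sketch, not code from the lab repository:

from contextlib import contextmanager

@contextmanager
def scaled_warehouse(session, warehouse="HOL_WH", size="XLARGE", reset_size="XSMALL"):
    """Temporarily resize a warehouse, then shrink it back even if the work fails."""
    session.sql(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = {size} WAIT_FOR_COMPLETION = TRUE").collect()
    try:
        yield
    finally:
        session.sql(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = {reset_size}").collect()

# Usage:
# with scaled_warehouse(session):
#     ...  # the heavy data processing code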

During this step we will be transforming the raw data into an aggregated metrics table using Snowpark. This sproc will join the ORDER_DETAIL table with the LOCATION table and HISTORY_DAY table to create a final, aggregated table for analysis named DAILY_CITY_METRICS.

To run this step, we will follow the same process as before. From your terminal, run the following command:

python 

To put this in context, we are on step #6 in our data flow overview:

Quickstart Pipeline Overview

Creating Sproc to Calculate Daily City Metrics

Below is a summary of the code which will ultimately become a stored procedure:

More on the Snowpark API

In this step we're starting to really use the Snowpark DataFrame API for data transformations. The Snowpark API provides the same functionality as the Spark SQL API. To begin with, you need to create a Snowpark session object. Like PySpark, this is accomplished with the Session.builder.configs(...).create() method.

When building a Snowpark Python sproc the contract is that the first argument to the entry point (or handler) function is a Snowpark session.
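As an illustration of that contract, here is a hedged sketch with a hypothetical handler (not the lab's actual procedure):

from snowflake.snowpark import Session

def main(session: Session, table_name: str) -> str:
    # Snowflake passes the session in as the first argument when the sproc runs
    count = session.table(table_name).count()
    return f"{table_name} has {count} rows"

# One way to register it from the client (names and options are illustrative):
# session.sproc.register(
#     func=main,
#     name="ROW_COUNT_SP",
#     packages=["snowflake-snowpark-python"],
#     replace=True,
# )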

The first thing you'll notice in the script is that we have some functions which use SQL to create objects in Snowflake and to check object status. To issue a SQL statement to Snowflake with the Snowpark API you use the session.sql() function, like you'd expect. Here's one example:

def table_exists(session, schema='', name=''):
    exists = session.sql("SELECT EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '{}' AND TABLE_NAME = '{}') AS TABLE_EXISTS".format(schema, name)).collect()[0]['TABLE_EXISTS']
    return exists
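A hypothetical usage of that helper, sketching how the script can create the target table only when it is missing (create_daily_city_metrics_table is a placeholder name, not necessarily what the script calls it):

# Only create the target table when it does not already exist
if not table_exists(session, schema='HOL_SCHEMA', name='DAILY_CITY_METRICS'):
    create_daily_city_metrics_table(session)  # hypothetical helper that issues the CREATE TABLE DDL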

The second thing to point out is how we're using DataFrames to join the data from the different data sources into a single order_detail DataFrame using the join() API.

order_detail = order_detail.join(location, order_detail['LOCATION_ID'] == location['LOCATION_ID'])
order_detail = order_detail.join(history_day,
    (F.builtin("DATE")(order_detail['ORDER_TS']) == history_day['DATE_VALID_STD'])
    & (location['ISO_COUNTRY_CODE'] == history_day['COUNTRY'])
    & (location['CITY'] == history_day['CITY_NAME']))

The last thing to point out is how we use the Snowpark Python DataFrame API to aggregate the joined DataFrame using methods such as agg(), group_by(), and select().

final_agg = order_detail.group_by(F.col('DATE_VALID_STD'), F.col('CITY_NAME'), F.col('ISO_COUNTRY_CODE')) \
                        .agg( \
                            F.sum('PRICE').alias('DAILY_SALES_SUM'), \
                            F.avg('AVG_TEMPERATURE_AIR_2M_F').alias("AVG_TEMPERATURE_F"), \
                            F.avg("TOT_PRECIPITATION_IN").alias("AVG_PRECIPITATION_IN"), \
                        ) \
                        .select(F.col("DATE_VALID_STD").alias("DATE"), F.col("CITY_NAME"), F.col("ISO_COUNTRY_CODE").alias("COUNTRY_DESC"), \
                            F.builtin("ZEROIFNULL")(F.col("DAILY_SALES_SUM")).alias("DAILY_SALES"), \
                            F.round(F.col("AVG_TEMPERATURE_F"), 2).alias("AVG_TEMPERATURE_FAHRENHEIT"), \
                            F.round(F.col("AVG_PRECIPITATION_IN"), 2).alias("AVG_PRECIPITATION_INCHES"), \
                        )
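The aggregated DataFrame is then written out to the DAILY_CITY_METRICS table. A hedged sketch of what that save could look like with the DataFrameWriter API (the lab's sproc may use an incremental merge instead of a full overwrite):

# Persist the aggregated results to the target table
final_agg.write.mode("overwrite").save_as_table("DAILY_CITY_METRICS")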

Again, for more details about the Snowpark Python DataFrame API, please check out our Working with DataFrames in Snowpark Python page.

During this step we will be orchestrating our new Snowpark pipelines with Snowflake's native orchestration feature named Tasks. You can create and deploy Snowflake Task objects using SQL as well as the Python Task API. For the scope of this Quickstart, we will use the Snowflake Python Task API to create and run the tasks.

In this step, we will create two tasks, one for each stored procedure, and chain them together. We will then deploy or run the tasks to operationalize the data pipeline.

In this step, we will run through the commands in the Python file steps/07_deploy_task_dag.py from VS Code. To execute the file, you can open the terminal in VS Code and run cd steps && python 07_deploy_task_dag.py.

To put this in context, we are on step #7 in our data flow overview:

Quickstart Pipeline Overview

Exploring the terminology associated with Snowflake Tasks

Creating Tasks and DAGs

Let's create a DAG and configure the schedule and transformations we want to run as part of it.

In this example, we have two tasks, dag_task1 and dag_task2, as part of HOL_DAG. The DAG is scheduled to run daily.

Here is the code snippet for DAG and task creation:

dag_name = "HOL_DAG"
    dag = DAG(dag_name, schedule=timedelta(days=1), warehouse=warehouse_name)
    with dag:
        dag_task1 = DAGTask("LOAD_ORDER_DETAIL_TASK", definition="CALL LOAD_EXCEL_WORKSHEET_TO_TABLE_SP(BUILD_SCOPED_FILE_URL(@FROSTBYTE_RAW_STAGE, 'intro/order_detail.xlsx'), 'order_detail', 'ORDER_DETAIL')", warehouse=warehouse_name)
        dag_task2 = DAGTask("LOAD_DAILY_CITY_METRICS_TASK", definition="CALL LOAD_DAILY_CITY_METRICS_SP()", warehouse=warehouse_name)
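The snippets in this step are excerpts from the script, so here is a hedged sketch of the setup they assume (object names are taken from this lab; the exact code in the repository may differ). It also creates the dag_op object used by the deploy() and run() calls below:

from datetime import timedelta

from snowflake.core import Root
from snowflake.core.task.dagv1 import DAG, DAGTask, DAGOperation
from snowflake.snowpark import Session

# Connect and point the Task API at the lab's database and schema (assumed names)
session = Session.builder.getOrCreate()
root = Root(session)
schema = root.databases["HOL_DB"].schemas["HOL_SCHEMA"]
warehouse_name = "HOL_WH"

# dag_op is what deploy() and run() are issued against
dag_op = DAGOperation(schema)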

Great, we have the tasks and the DAG created. But how do we define the task dependencies? Which task should run first?

If you are familiar with Apache Airflow, you might know that the >> operator is used to define task dependencies. If not, let's walk through the flow.

Here is how we define the order of execution of the tasks in our DAG:

dag_task1 >> dag_task2

The definition dag_task1 >> dag_task2 means that dag_task2 depends on dag_task1: dag_task1 (which loads the order detail data) runs first, and dag_task2 (which builds the daily city metrics) runs only after it completes.

Deploying the DAG

So far, we have only created the DAG and defined the order of execution of its tasks. To enable the DAG to run on the defined schedule, we have to deploy() it first.

dag_op.deploy(dag, mode="orreplace")

After we deploy the DAG, its tasks will run periodically per the schedule interval in the DAG definition.

Running a DAG

In addition to running the DAG periodically, if you need to run it on demand for debugging or troubleshooting, you can use the run() function.

dag_op.run(dag)

For more details on Tasks, see Introduction to Tasks.

Task Metadata

Snowflake keeps metadata for almost everything you do, and makes that metadata available for you to query (and to build any type of process around). Tasks are no different. Snowflake maintains rich metadata to help you monitor your task runs. Here are a few sample SQL queries you can use to monitor your task runs:

-- Get a list of tasks
SHOW TASKS;

-- Task execution history in the past day
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    SCHEDULED_TIME_RANGE_START=>DATEADD('DAY',-1,CURRENT_TIMESTAMP()),
    RESULT_LIMIT => 100))
ORDER BY SCHEDULED_TIME DESC
;

-- Scheduled task runs
SELECT
    TIMESTAMPDIFF(SECOND, CURRENT_TIMESTAMP, SCHEDULED_TIME) NEXT_RUN,
    SCHEDULED_TIME,
    NAME,
    STATE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE STATE = 'SCHEDULED'
ORDER BY COMPLETED_TIME DESC;

Monitoring Tasks

Now that your DAG is deployed, all of its tasks will run per the defined schedule. But as a Snowflake user, how can you check the success or failure of each run, look at task run times, and more?

Snowflake provides some rich task observability features in the Snowsight UI. Try it out for yourself by following these steps:

Tasks

To monitor all the tasks at an account level, you can use the Task History tab as well.

Task Runs

For more details, and to learn about viewing account level task history, please check out our Viewing Task History documentation.

Query History for Tasks

One important thing to understand about tasks is that the queries executed by a task won't show up with the default Query History UI settings. To see the queries that just ran, you need to do the following:

remove filter

You should now see all the queries run by your tasks!

During this step, we will comment out the validation code shown in the snippet below, as it is not needed in production. Open the 05_load_raw_data.py script in VS Code and comment out line 70, near the very bottom of the file, as shown below.

# For local debugging
if __name__ == "__main__":
    # Create a local Snowpark session
    with Session.builder.getOrCreate() as session:
        load_all_raw_tables(session)
        #validate_raw_tables(session)

Save your changes.

Configuring Your Forked GitHub Project

In order for your GitHub Actions workflow to be able to connect to your Snowflake account you will need to store your Snowflake credentials in GitHub. Action Secrets in GitHub are used to securely store values/variables which will be used in your CI/CD pipelines. In this step we will create secrets for each of the parameters used by SnowCLI.

From the repository, click on the Settings tab near the top of the page. From the Settings page, click on the Secrets and variables then Actions tab in the left hand navigation. The Actions secrets should be selected. For each secret listed below click on New repository secret near the top right and enter the name given below along with the appropriate value (adjusting as appropriate).

Secret name            Secret value
SNOWSQL_ACCOUNT        myaccount
SNOWSQL_USER           myusername
SNOWSQL_PWD            mypassword
SNOWSQL_ROLE           HOL_ROLE
SNOWSQL_WAREHOUSE      HOL_WH
SNOWSQL_DATABASE       HOL_DB

When you're finished adding all the secrets, the page should look like this:

Push Changes to Forked Repository

Now that we have our changes ready and tested, and our Snowflake credentials stored in GitHub, let's commit the changes to our local repository and then push them to your forked repository. This can certainly be done from the command line, but in this step we'll do it through VS Code to make it easy.

Start by opening the "Source Control" extension in the left hand nav bar; you should see two files with changes. Click the + (plus) sign to the right of each file name to stage the changes. Then enter a message in the "Message" box and click the blue Commit button to commit the changes locally. Here's what it should look like before you click the button:

At this point those changes are only committed locally and have not yet been pushed to your forked repository in GitHub. To do that, simply click the blue Sync Changes button to push these commits to GitHub. Here's what it should look like before you click the button:

Viewing GitHub Actions Workflow

This repository is already set up with a very simple GitHub Actions CI/CD pipeline. You can review the code for the workflow by opening the .github/workflows/build_and_deploy.yaml file in VS Code.

As soon as you pushed the changes to your GitHub forked repo the workflow kicked off. To view the results go back to the homepage for your GitHub repository and do the following:

Once you're finished with the Quickstart and want to clean things up, you can simply run the steps/09_teardown.sql script. Since this is a SQL script, we will use our native VS Code extension to execute it. Simply open the steps/09_teardown.sql script in VS Code and run the whole thing using the "Execute All Statements" button in the upper right corner of the editor window.

Congratulations! By now you have built a robust data engineering pipeline using Snowpark Python stored procedures. This pipeline processes data incrementally, is orchestrated with Snowflake tasks, and is deployed via a CI/CD pipeline. You also learned how to use Snowflake's developer CLI tool and Visual Studio Code extension!

We would love your feedback on this QuickStart Guide! Please submit your feedback using this Feedback Form.

What You Learned

Here is the overview of what we built:

Quickstart Pipeline Overview

Related Resources