The focus of this workshop will be to demonstrate how we can use both SQL and python together in the same workflow to run both analytics and machine learning models on dbt Cloud.

The code complete repository for this quickstart can be found on GitHub.

What you'll need to setup for the lab

What you will learn

What you need to know

What you'll build

What you'll need

In this lab we'll be transforming raw Formula 1 data into a consumable form for both analytics and machine learning pipelines. To understand how our data are related, we've included an entity relationship diagram (ERD) of the tables we'll be using today.

Our data rarely ever looks the way we need it in its raw form: we need to join, filter, aggregate, etc. dbt is designed to transform your data and keep your pipeline organized and reliable along the way. We can see from our Formula1 ERD that if we have a major table called results with other tables such as drivers, races, and circuits tables that provide meaningful context to the results table.

You might also see that circuits cannot be directly joined to results since there is no key. This is a typical data model structure we see in the wild: we'll need to first join results and races together, then we can join to circuits. By bringing all this information together we'll be able to gain insights about lap time trends through the years.

Formula 1 ERD:
ERD can also be downloaded for interactive view from S3F1_ERD

Here's a visual for the data pipeline that we'll be building using dbt! project_DAG

Setup Snowflake trial and email alias

In this section we're going to sign up for a Snowflake trial account and enable Anaconda-provided Python packages.

  1. Sign up for a Snowflake Trial Account using this form. Ensure that your account is set up using AWS.
  2. After creating your account and verifying it from your sign-up email, Snowflake will direct you back to the UI called Snowsight.
  3. To ensure we are working with a clean slate and create a fresh dbt Cloud instance when launching partner connect we will be using email aliasing. What this will look like is <your_email>+<alias_addition>@<your_domain>.com.
  4. Navigate to your left panel menu and select Profile. snowflake_account_profile_menu
  5. You will see your unaliased email you used to sign up for your snowflake trial (screenshot is redacted for privacy). unaliased_email_profile
  6. Edit the email field to include the alias using the notation <your_email>+dbtsnowpark@<your_domain>.com. Ensure you Save your updated aliased email. alias_email_update_profile_save
  7. This should automatically send a re-verification email. In your email simply click the email link to verify your new aliased email, and then you're good to go. If for any reason you did the email was not generated you can navigate back to your Profile and click the link to manually Resend verification email. manually_resend_verification_email
  8. Your verification email will look like the image below. Select Validate your email. verification_email
  9. After you validate your email your screen should look as follows: verification_confirmation_page
  10. Re-login or refresh your browser window.

To recap, we created this email alias to ensure that later when we launch Partner Connect to spin up a dbt Cloud account that there are no previously existing dbt Cloud accounts and projects that cause issues and complications in setup.

Enable Anaconda Python packages and open new SQL worksheet

  1. Ensure you are still logged in as the ACCOUNTADMIN. new_snowflake_account
  2. Navigate to Admin > Billing & Terms. Click Enable > Acknowledge & Continue to enable Anaconda Python packages to run in Snowflake. enable-anacondaaccept-anaconda-terms
  3. Finally, navigate back to Worksheets to create a new SQL Worksheet by selecting + then SQL Worksheet in the upper right corner.

We need to obtain our data source by copying our Formula 1 data into Snowflake tables from a public S3 bucket that dbt Labs hosts.

  1. Your new Snowflake account has a preconfigured warehouse named COMPUTE_WH. You can check by going under Admin > Warehouses. If for some reason you don't have this warehouse, we can create a warehouse using the following script:
    create or replace warehouse COMPUTE_WH with warehouse_size=XSMALL
  2. Rename the SQL worksheet by clicking the worksheet name (this is automatically set to the current timestamp) using the 3 dots ... option, then click Rename. Rename the file to data setup script since we will be placing code in this worksheet to ingest the Formula 1 data. Set the context of the worksheet by setting your role as the ACCOUNTADMIN and warehouse as COMPUTE_WH. rename-worksheet-and-select-warehouse
  3. Copy the following code into the main body of the Snowflake SQL worksheet. You can also find this setup script under the setup folder in the Git repository. The script is long since it's bringing in all of the data we'll need today! We recommend copying this straight from the github file linked rather than from this workshop UI so you don't miss anything (use Ctrl+A or Cmd+A).

Generally during this lab we'll be explaining and breaking down the queries. We won't be going line by line, but we will point out important information related to our learning objectives!

   This is our setup script to create a new database for the Formula1 data in Snowflake.
   We are copying data from a public s3 bucket into snowflake by defining our csv format and snowflake stage. 
   -- create and define our formula1 database
   create or replace database formula1;
   use database formula1; 
   create or replace schema raw; 
   use schema raw; 

   --define our file format for reading in the csvs 
   create or replace file format csvformat
   type = csv
   field_delimiter =','
   field_optionally_enclosed_by = '"', 

   create or replace stage formula1_stage
   file_format = csvformat 
   url = 's3://formula1-dbt-cloud-python-demo/formula1-kaggle-data/';

   -- load in the 8 tables we need for our demo 
   -- we are first creating the table then copying our data in from s3
   -- think of this as an empty container or shell that we are then filling

   create or replace table formula1.raw.circuits (
       CIRCUIT_ID NUMBER(38,0),
       CIRCUIT_REF VARCHAR(16777216),
       NAME VARCHAR(16777216),
       LOCATION VARCHAR(16777216),
       COUNTRY VARCHAR(16777216),
       LAT FLOAT,
       LNG FLOAT,
       ALT NUMBER(38,0),
       URL VARCHAR(16777216)
   -- copy our data from public s3 bucket into our tables 
   copy into circuits 
   from @formula1_stage/circuits.csv

   create or replace table formula1.raw.constructor_results (
       RACE_ID NUMBER(38,0),
       POINTS NUMBER(38,0),
       STATUS VARCHAR(16777216)
   copy into constructor_results
   from @formula1_stage/constructor_results.csv

   create or replace table formula1.raw.constructor_standings (
       RACE_ID NUMBER(38,0),
       POINTS NUMBER(38,0),
       POSITION_TEXT VARCHAR(16777216),
       WINS NUMBER(38,0)
   copy into constructor_standings
   from @formula1_stage/constructor_standings.csv

   create or replace table formula1.raw.constructors (
       NAME VARCHAR(16777216),
       NATIONALITY VARCHAR(16777216),
       URL VARCHAR(16777216)
   copy into constructors 
   from @formula1_stage/constructors.csv

   create or replace table formula1.raw.driver_standings (
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       POINTS NUMBER(38,0),
       POSITION_TEXT VARCHAR(16777216),
       WINS NUMBER(38,0)

   copy into driver_standings 
   from @formula1_stage/driver_standings.csv

   create or replace table formula1.raw.drivers (
       DRIVER_ID NUMBER(38,0),
       DRIVER_REF VARCHAR(16777216),
       NUMBER VARCHAR(16777216),
       CODE VARCHAR(16777216),
       FORENAME VARCHAR(16777216),
       SURNAME VARCHAR(16777216),
       DOB DATE,
       NATIONALITY VARCHAR(16777216),
       URL VARCHAR(16777216)
   copy into drivers 
   from @formula1_stage/drivers.csv

   create or replace table formula1.raw.lap_times (
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       LAP NUMBER(38,0),
       TIME VARCHAR(16777216),
   copy into lap_times 
   from @formula1_stage/lap_times.csv

   create or replace table formula1.raw.pit_stops (
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       STOP NUMBER(38,0),
       LAP NUMBER(38,0),
       TIME VARCHAR(16777216),
       DURATION VARCHAR(16777216),
   copy into pit_stops 
   from @formula1_stage/pit_stops.csv

   create or replace table formula1.raw.qualifying (
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       NUMBER NUMBER(38,0),
       Q1 VARCHAR(16777216),
       Q2 VARCHAR(16777216),
       Q3 VARCHAR(16777216)
   copy into qualifying 
   from @formula1_stage/qualifying.csv

   create or replace table formula1.raw.races (
       RACE_ID NUMBER(38,0),
       YEAR NUMBER(38,0),
       ROUND NUMBER(38,0),
       CIRCUIT_ID NUMBER(38,0),
       NAME VARCHAR(16777216),
       DATE DATE,
       TIME VARCHAR(16777216),
       URL VARCHAR(16777216),
       FP1_DATE VARCHAR(16777216),
       FP1_TIME VARCHAR(16777216),
       FP2_DATE VARCHAR(16777216),
       FP2_TIME VARCHAR(16777216),
       FP3_DATE VARCHAR(16777216),
       FP3_TIME VARCHAR(16777216),
       QUALI_DATE VARCHAR(16777216),
       QUALI_TIME VARCHAR(16777216),
       SPRINT_DATE VARCHAR(16777216),
       SPRINT_TIME VARCHAR(16777216)
   copy into races 
   from @formula1_stage/races.csv

   create or replace table formula1.raw.results (
       RESULT_ID NUMBER(38,0),
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       NUMBER NUMBER(38,0),
       GRID NUMBER(38,0),
       POSITION_TEXT VARCHAR(16777216),
       POINTS NUMBER(38,0),
       LAPS NUMBER(38,0),
       TIME VARCHAR(16777216),
       FASTEST_LAP NUMBER(38,0),
       RANK NUMBER(38,0),
       FASTEST_LAP_TIME VARCHAR(16777216),
       STATUS_ID NUMBER(38,0)
   copy into results 
   from @formula1_stage/results.csv

   create or replace table formula1.raw.seasons (
       YEAR NUMBER(38,0),
       URL VARCHAR(16777216)
   copy into seasons 
   from @formula1_stage/seasons.csv

   create or replace table formula1.raw.sprint_results (
       RESULT_ID NUMBER(38,0),
       RACE_ID NUMBER(38,0),
       DRIVER_ID NUMBER(38,0),
       NUMBER NUMBER(38,0),
       GRID NUMBER(38,0),
       POSITION_TEXT VARCHAR(16777216),
       POINTS NUMBER(38,0), 
       LAPS NUMBER(38,0),
       TIME VARCHAR(16777216),
       FASTEST_LAP VARCHAR(16777216),
       FASTEST_LAP_TIME VARCHAR(16777216),
       STATUS_ID NUMBER(38,0)
   copy into sprint_results 
   from @formula1_stage/sprint_results.csv

   create or replace table formula1.raw.status (
       STATUS_ID NUMBER(38,0),
       STATUS VARCHAR(16777216)
   copy into status 
   from @formula1_stage/status.csv
  1. Ensure all the commands are selected before running the query — an easy way to do this is to use Ctrl-A to highlight all of the code in the worksheet. Select run (blue triangle icon). Notice how the dot next to your COMPUTE_WH turns from gray to green as you run the query. The status table is the final table of all 14 tables loaded in. load-data-from-s3
  2. Let's unpack that pretty long query we ran into component parts. We ran this query to load in our 14 Formula 1 tables from a public S3 bucket. To do this, we:
  1. Once the script completes, browse to the left navigation menu. Click on ... button to bring up Refresh button. Click Refresh and you will see the newly created FORMULA1 database show up. Expand the database and explore the different tables you just created and loaded data into in the RAW schema. create-new-worksheet-to-query-data
  2. Now let's take a look at some of our cool Formula 1 data we just loaded up!

We're ready to setup our dbt account!

We are going to be using Snowflake Partner Connect to set up a dbt Cloud account. Using this method will allow you to spin up a fully fledged dbt account with your Snowflake connection and environments already established.

  1. Navigate out of your SQL worksheet back by selecting home.
  2. In Snowsight, confirm that you are using the ACCOUNTADMIN role.
  3. Confirm that your email address contains an email alias.
  4. Navigate to the Admin > Partner Connect. Find dbt either by using the search bar or navigating the Data Integration. Select the dbt tile. open-partner-connect
  5. You should now see a new window that says Connect to dbt. Select Optional Grant and add the FORMULA1 database. This will grant access for your new dbt user role to the FORMULA1 database. partner-connect-optional-grant
  6. Ensure the FORMULA1 is present in your optional grant before clicking Connect. This will create a dedicated dbt user, database, warehouse, and role for your dbt Cloud trial. connect-to-dbt

If you forgot to add the optional grant to the Formula1 database in the previous screenshot, please run these commands:

grant usage on database FORMULA1 to role PC_DBT_ROLE;
grant usage on schema FORMULA1.RAW to role PC_DBT_ROLE;
grant select on all tables in schema FORMULA1.RAW to role PC_DBT_ROLE;
  1. When you see the Your partner account has been created window, click Activate. connect-to-dbt
  2. You should be redirected to a dbt Cloud registration page. Fill out the form using whatever account name you'd like. Make sure to save the password somewhere for login in the future. dbt-cloud-sign-up
  3. Select Complete Registration. You should now be redirected to your dbt Cloud account, complete with a connection to your Snowflake account, a deployment and a development environment, and a sample job.

Instead of building an entire version controlled data project from scratch, we'll be forking and connecting to an existing workshop github repository in the next step. dbt Cloud's git integration creates easy to use git guardrails. You won't need to know much Git for this workshop. In the future, if you're developing your own proof of value project from scratch, feel free to use dbt's managed repository that is spun up during partner connect.

To keep the focus on dbt python and deployment today, we only want to build a subset of models that would be in an entire data project. To achieve this we need to fork an existing repository into our personal github, copy our forked repo name into dbt cloud, and add the dbt deploy key to our github account. Viola! There will be some back and forth between dbt cloud and GitHub as part of this process, so keep your tabs open, and let's get the setup out of the way!

  1. Open a new browser tab and navigate to our demo repo by clicking here.
  2. Fork your own copy of the lab repo. fork_exisiting_formula1_repo
  3. Add a description if you'd like such as: "learning about dbt Cloud is cool" and Create fork.
  4. Select the Code button. Choose the SSH option and use the copy button shortcut for our repo. We'll be using this copied path in step 11 in this section. copy_repo_ssh_github
  5. Head back over to your dbt Cloud browser tab so we can connect our new forked repository into our dbt Cloud project.
  6. We'll need to delete the existing connection to the managed repository spun up during Partner Connect before we input our new one. To do this navigate to Settings > Account Settings > Partner Connect Trial.
  7. This will open the Project Details. Navigate to Repository and click the existing managed repository GitHub connection setup during partner connect. select-existing-partner-connect-repo
  8. In the Repository Details select Edit in the lower right corner. The option to Disconnect will appear, select it. repository_details_disconnect
  9. Confirm disconnect. confirm_disconnect_from_managed_repo
  10. Within your Project Details you should have the option to Configure Repository. configure_repository
  11. After deleting our partner connect managed repository, we should see New Repository. Select Git Clone. Input the repository by pasting what you copied from GitHub in step 4 above into the Repository parameter and clicking Import. git_clone_copy_repo_from_github
  12. We can see we successfully made the connection to our forked GitHub repo. update_dbt_cloud_repo_connection_with_forked_repo

If you tried to start developing onto of this repo right now, we'd get permissions errors. So we need to give dbt Cloud write access.

  1. Click on your git cloned repository link. dbt Cloud generated a deploy key to link the development we do in dbt cloud back to our GitHub repo. Copy the deploy key starting with ssh-rsa followed by a long hash key (full key hidden for privacy). copy_deploy_key_from_dbt_cloud
  2. Phew almost there! Navigate back to GitHub again.
  3. Ensure you're in your forked repo. Navigate to your repo Settingsgit_repo_settings
  4. Go to Deploy keys. deploy_keys_github
  5. Select Add deploy key. new_deploy_key_button
  6. Give your deploy key a title such as dbt Cloud python snowpark. Paste the ssh-rsa deploy key we copied from dbt Cloud into the Key box. Be sure to enable Allow write access. Finally, Add key. Your deploy key has been created. We won't have to come back to again GitHub until the end of our workshop. add_new_deploy_keydeploy_key_created
  7. Head back over to dbt cloud. Navigate to Develop. develop_panel_dbt_cloud
  8. Run "dbt deps"run_dep_deps_after_importing_forked_repo
  9. Since we're bringing in an existing project, your root folder should now say dbt-python-hands-on-lab-snowparkfile_tree_of_forked_repo

Alas, now that our setup work is complete, time get a look at our production data pipeline code!

dbt Cloud's IDE will be our development space for this workshop, so let's get familiar with it. Once we've done that we'll run the pipeline we imported from our forked repo.

  1. There are a couple of key features to point out about the IDE before we get to work. It is a text editor, an SQL and Python runner, and a CLI with Git version control all baked into one package! This allows you to focus on editing your SQL and Python files, previewing the results with the SQL runner (it even runs Jinja!), and building models at the command line without having to move between different applications. The Git workflow in dbt Cloud allows both Git beginners and experts alike to be able to easily version control all of their work with a couple clicks. ide-overview
  2. In the file tree, click on the magnifying glass icon next to the File Explorer on the left sidebar and type in Click the Lineage tab. To make it full screen click the viewfinder icon. Play around with the nodes being shown by removing the 2 in front or behind of 2+hold_out_dataset_for_prediction+2and updating the graph. lineage_viewfinder
  3. Explore the DAG for a few minutes to understand everything we've done to our pipeline along the way. This includes: cleaning up and joining our data, machine learning data prep, variable encoding, and splitting the datasets. We'll go more in-depth in next steps about how we brought in raw data and then transformed it, but for now get an overall familiarization. lineage_fullview You can view the code in each node of the DAG by selecting it and navigating out of the full screen. You can read the code on the scratchpad.
  4. Let's run the pipeline we imported from our forked repo. Type dbt build into the command line and select Enter on your keyboard. When the run bar expands you'll be able to see the results of the run, where you should see the run complete successfully. dbt_build_initial_pipeline_ml To understand more about what the dbt build syntax is running check out the documentation.
  5. You can look at the run results of each model to see the code that dbt compiles and sends to Snowflake for execution. Select the arrow beside a model >. Click Details and view the ouput. We can see that dbt automatically generates the DDL statement and is creating our models in our development schema (i.e. dbt_hwatson). model_details_ddl
  6. Now let's switch over to a new browser tab on Snowflake to confirm that the objects were actually created. Click on the three dots ... above your database objects and then Refresh. Expand the PC_DBT_DB database and you should see your development schema. Select the schema, then Tables and Views. Now you should be able to see many models we created from our forked repo. confirm_pipeline_build_in_snowflake

We did a lot upstream in our forked repo and we'll explore it at a high level of how we did that before moving on to machine learning model training and prediction in dbt cloud.

We brought a good chunk of our data pipeline in through our forked repo to lay a foundation for machine learning. In the next couple steps we are taking time to review how this was done. That way when you have your own dbt project you'll be familiar with the setup! We'll start with the dbt_project.yml, sources, and staging.


  1. Select the dbt_project.yml file in the root directory the file explorer to open it. What are we looking at here? Every dbt project requires a dbt_project.yml file — this is how dbt knows a directory is a dbt project. The dbt_project.yml file also contains important information that tells dbt how to operate on your project.
  2. Your code should as follows:
    name: 'snowflake_python_workshop'
    version: '1.5.0'
    require-dbt-version: '>=1.3.0'
    config-version: 2
    # This setting configures which "profile" dbt uses for this project.
    profile: 'default'
    # These configurations specify where dbt should look for different types of files.
    # The `source-paths` config, for example, states that models in this project can be
    # found in the "models/" directory. You probably won't need to change these!
    model-paths: ["models"]
    analysis-paths: ["analyses"]
    test-paths: ["tests"]
    seed-paths: ["seeds"]
    macro-paths: ["macros"]
    snapshot-paths: ["snapshots"]
    target-path: "target"  # directory which will store compiled SQL files
    clean-targets:         # directories to be removed by `dbt clean`
        - "target"
        - "dbt_packages"
            node_color: "CadetBlue"
            +materialized: table
                node_color: "Maroon"
            +tags: "bi"
            +materialized: table
            node_color: "#800080"
            +materialized: table
                node_color: "Indigo"
                node_color: "Black"
  3. The key configurations to point out in the file with relation to the work that we're going to do are in the models section.
    • require-dbt-version — Tells dbt which version of dbt to use for your project. We are requiring 1.3.0 and any newer version to run python models and node colors.
    • materialized — Tells dbt how to materialize models when compiling the code before it pushes it down to Snowflake. All models in the marts folder will be built as tables.
    • tags — Applies tags at a directory level to all models. All models in the aggregates folder will be tagged as bi (abbreviation for business intelligence).
  4. Materializations are strategies for persisting dbt models in a warehouse, with tables and views being the most commonly utilized types. By default, all dbt models are materialized as views and other materialization types can be configured in the dbt_project.yml file or in a model itself. It's very important to note Python models can only be materialized as tables or incremental models. Since all our Python models exist under marts, the following portion of our dbt_project.yml ensures no errors will occur when we run our Python models. Starting with dbt version 1.4, Python files will automatically get materialized as tables even if not explicitly specified.
      +materialized: table

Cool, now that dbt knows we have a dbt project we can view the folder structure and data modeling.

Folder structure

dbt Labs has developed a project structure guide that contains a number of recommendations for how to build the folder structure for your project. These apply to our entire project except the machine learning portion - this is still relatively new use case in dbt without the same established best practices.

Do check out that guide if you want to learn more. Right now we are going to organize our project using the following structure:

Your folder structure should look like (make sure to expand some folders if necessary): folder_structure

Remember you can always reference the entire project in GitHub to view the complete folder and file strucutre.

In any data project we follow the process of starting with raw data, cleaning and transforming it, and then gaining insights. In this step we'll be showing you how to bring raw data into dbt and create staging models. The steps of setting up sources and staging models were completed when we forked our repo, so we'll only need to preview these files (instead of build them).

Sources allow us to create a dependency between our source database object and our staging models which will help us when we look at data-lineage later. Also, if your source changes database or schema, you only have to update it in your f1_sources.yml file rather than updating all of the models it might be used in.

Staging models are the base of our project, where we bring all the individual components we're going to use to build our more complex and useful models into the project. Staging models have a 1:1 relationship with their source table and are for light transformation steps such as renaming columns, type casting, basic computations, and categorizing data.

Since we want to focus on dbt and Python in this workshop, check out our sources and staging docs if you want to learn more (or take our dbt Fundamentals course which covers all of our core functionality).

Creating Sources

  1. Open the file called f1_sources.yml with the following file path: models/staging/formula1/f1_sources.yml.
  2. You should see the following code that creates our 14 source tables in our dbt project from Snowflake: sources_f1
  3. dbt makes it really easy to:
    • declare sources
    • provide testing for data quality and integrity with support for both generic and singular tests
    • create documentation using descriptions where you write code

Now that we are connected into our raw data let's do some light transformations in staging.


  1. Let's view two staging models that we'll be using to understand lap time trends through the years.
  2. Open stg_lap_times.
    lap_times as (select * from {{ source('formula1', 'lap_times') }}),
    renamed as (
            race_id as race_id,
            driver_id as driver_id,
            "POSITION" as driver_position,
            "TIME" as lap_time_formatted,
            {{ convert_laptime("lap_time_formatted") }} as official_laptime,
            milliseconds as lap_time_milliseconds
        from lap_times
        {{ dbt_utils.generate_surrogate_key(["race_id", "driver_id", "lap"]) }}
        as lap_times_id,
    from renamed
  3. Review the SQL code. We see renaming columns using the alias in addition to reformatting using a jinja code in our project referencing a macro. At a high level a macro is a reusable piece of code and jinja is the way we can bring that code into our SQL model. Datetimes column formatting is usually tricky and repetitive. By using a macro we introduce a way to systematic format times and reduce redunant code in our Formula 1 project. Select </> Compile once its finished view the Compiled Code tab. compiled_jinja_lap_times
  4. Now click Preview — look how pretty and human readable our official_laptime column is!
  5. Feel free to view our project macros under the root folder macros and look at the code for our convert_laptime macro in the convert_laptim.sql file.
  6. We can see the reusable logic we have for splitting apart different components of our lap times from hours to nanoseconds. If you want to learn more about leveraging macros within dbt SQL, check out our macros documentation.

You can see for every source table, we have a staging table. Now that we're done staging our data it's time for transformation.

dbt got it's start in being a powerful tool to enhance the way data transformations are done in SQL. Before we jump into python, let's pay homage to SQL.
SQL is so performant at data cleaning and transformation, that many data science projects "use SQL for everything you can, then hand off to python" and that's exactly what we're going to do.

Fact and dimension tables

Dimensional modeling is an important data modeling concept where we break up data into "facts" and "dimensions" to organize and describe data. We won't go into depth here, but think of facts as "skinny and long" transactional tables and dimensions as "wide" referential tables. We'll preview one dimension table and be building one fact table.

  1. Create a new branch so we can build new models (our main branch is protected as read-only in dbt Cloud). Name your branch snowpark-python-workshop. create_branch_dbt_cloudname_branch_dbt_cloud
  2. Navigate in the file tree to models > marts > core > dim_races.
  3. Preview the data. We can see we have the RACE_YEAR in this table. That's important since we want to understand the changes in lap times over years. So we now know dim_races contains the time column we need to make those calculations.
  4. Create a new file within the core directory core > ... > Create file. create_fct_file
  5. Name the file fct_lap_times.sql. fct_lap_times
  6. Copy in the following code and save the file (Save or Ctrl+S):
    with lap_times as (
            {{ dbt_utils.generate_surrogate_key(['race_id', 'driver_id', 'lap']) }} as lap_times_id,
            race_id                                                                 as race_id,
            driver_id                                                               as driver_id,
            lap                                                                     as lap,
            driver_position                                                         as driver_position,
            lap_time_formatted                                                      as lap_time_formatted,
            official_laptime                                                        as official_laptime,
            lap_time_milliseconds                                                   as lap_time_milliseconds
        from {{ ref('stg_lap_times') }}
    select * from lap_times
  7. Our fct_lap_times is very similar to our staging file since this is clean demo data. In your real world data project your data will probably be messier and require extra filtering and aggregation prior to becoming a fact table exposed to your business users for utilizing.
  8. Use the UI Build (buttom with hammer icon) to create the fct_lap_times model. dbt_build_fct_lap_times

Now we have both dim_races and fct_lap_times separately. Next we'll to join these to create lap trend analysis through the years.

Marts tables

Marts tables are where everything comes together to create our business-defined entities that have an identity and purpose. We'll be joining our dim_races and fct_lap_times together.

  1. Create a new file under your marts folder called mrt_lap_times_years.sql.
  2. Copy and Save the following code:
    with lap_times as (
    select * from {{ ref('fct_lap_times') }}
        races as (
        select * from {{ ref('dim_races') }}
        expanded_lap_times_by_year as (
            from lap_times
            left join races
                on lap_times.race_id = races.race_id
            where lap_time_milliseconds is not null 
        select * from expanded_lap_times_by_year
  3. Our dataset contains races going back to 1950, but the measurement of lap times begins in 1996. Here we join our datasets together use our where clause to filter our races prior to 1996, so they have lap times.
  4. Execute the model using Build.
  5. Preview your new model. We have race years and lap times together in one joined table so we are ready to create our trend analysis.
  6. It's a good time to commit the 2 new models we created in our repository. Click Commit and sync and add a commit message. commit_and_sync_fct_mrtcommit_message_fct_mrt

Now that we've joined and denormalized our data we're ready to use it in python development.

This step is optional for this quickstart to give a better feel for working with python directly in Snowflake. To see how to implement this in dbt Cloud, you may skip to the next section.

Now that we've transformed data using SQL let's write our first python code and get insights about lap time trends. Snowflake python worksheets are excellent for developing your python code before bringing it into a dbt python model. Then once we are settled on the code we want, we can drop it into our dbt project.

Python worksheets in Snowflake are a dynamic and interactive environment for executing Python code directly within Snowflake's cloud data platform. They provide a seamless integration between Snowflake's powerful data processing capabilities and the versatility of Python as a programming language. With Python worksheets, users can easily perform data transformations, analytics, and visualization tasks using familiar Python libraries and syntax, all within the Snowflake ecosystem. These worksheets enable data scientists, analysts, and developers to streamline their workflows, explore data in real-time, and derive valuable insights from their Snowflake data.

  1. Head back over to Snowflake.
  2. Open up a Python Worksheet. The boilerplate example code when you first create a Python worksheet is fetching information_schema.packages available, filtering on column language = ‘python', and returning that as dataframe, which is what gets shown in result (next step). create_python_worksheetnew_python_worksheet_boilerplate_example_code
  3. Ensure you are in your development database and schema (i.e. PC_DBT_DB and DBT_HWATSON) and run the Python worksheet (Ctrl+A and Run). The query results represent the many (about 5,400) packages snowpark for python supports that you can leverage! python_worksheet_db_schemaresults_of_new_python_worksheet_boilerplate_example_code
  4. Delete the sample boilerplate code in the new python worksheet. Copy the following code into the python worksheet to get a 5 year moving average of Formula 1 laps:
    # The Snowpark package is required for Python Worksheets. 
    # You can add more packages by selecting them using the Packages control and then importing them.
    import snowflake.snowpark as snowpark
    import pandas as pd 
    def main(session: snowpark.Session): 
        # Your code goes here, inside the "main" handler.
        tableName = 'MRT_LAP_TIMES_YEARS'
        dataframe = session.table(tableName)
        lap_times = dataframe.to_pandas()
        # print table
        # describe the data
        lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000
        lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame()
        lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean()
        lap_time_trends.columns = lap_time_trends.columns.str.upper()
        final_df = session.create_dataframe(lap_time_trends)
        # Return value will appear in the Results tab.
        return final_df

If you have workloads that have large memory requirements such as deep learning models consider using Snowpark dataframes and Snowpark-optimized warehouses that are specifically engineered to handle these types of compute intensive workloads!

  1. Your result should have three columns: race_year, lap_time_seconds, and lap_moving_avg_5_years. chart_5yr_lap_time_avg

We were able to quickly calculate a 5 year moving average using python instead of having to sort our data and worry about lead and lag SQL commands. Clicking on the Chart button next to Results, we can see that lap times seem to be trending down with small fluctuations until 2010 and 2011 which coincides with drastic Formula 1 regulation changes including cost-cutting measures and in-race refueling bans. So we can safely ascertain lap times are not consistently decreasing.

Now that we've created this dataframe and lap time trend insight, what do we do when we want to scale it? In the next section we'll be learning how to do this by leveraging python transformations in dbt Cloud.

Our first dbt python model for lap time trends

Let's get our lap time trends in our data pipeline so we have this data frame to leverage as new data comes in. The syntax of of a dbt python model is a variation of our development code in the python worksheet so we'll be explaining the code and concepts more.

You might be wondering: How does this work?
Or more specifically: How is dbt able to send a python command over to a Snowflake runtime executing python?

At a high level, dbt executes python models as stored procedures in Snowflake, via Snowpark for python.

Snowpark for python and dbt python architecture:architecture_diagram

  1. Open your dbt Cloud browser tab.
  2. Create a new file under the models > marts > aggregates directory called
  3. Copy the following code in and Save the file:
    import pandas as pd
    def model(dbt, session):
        # dbt configuration
        # get upstream data
        lap_times = dbt.ref("mrt_lap_times_years").to_pandas()
        # describe the data
        lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000
        lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame()
        lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean()
        lap_time_trends.columns = lap_time_trends.columns.str.upper()
        return lap_time_trends.round(1)
  4. Let's break down what this code is doing:
  1. Create the model in our warehouse by clicking Build.
  2. We can't preview Python models directly, so let's open a new file using the + button or the Control-N shortcut to create a new scratchpad:
    select * from {{ ref('agg_lap_times_moving_avg') }}
  3. Preview the output. It should look the same as our snowflake python worksheet: preview_agg_lap_times_scratchpad
  4. We can see we have the same results from our python worksheet development as we have in our codified dbt python project.

The dbt model, .source(), .ref() and .config() functions

Let's take a step back before starting machine learning to both review and go more in-depth at the methods that make running dbt python models possible. If you want to know more outside of this lab's explanation read the documentation on Python models here.

  1. Commit and sync so our project contains our model, add a commit message and Commit changes.

Now that we understand how to create python transformations we can use them to prepare train machine learning models and generate predictions!

In upstream parts of our data lineage we had dedicated steps and data models to cleaning, encoding, and splitting out the data into training and testing datasets. We do these steps to ensure:

There are 3 areas to break down as we go since we are working at the intersection all within one model file:

  1. Machine Learning
  2. Snowflake and Snowpark
  3. dbt Python models

Training and saving a machine learning model

  1. Project organization remains key, under the ml folder make a new subfolder called training_and_prediction.
  2. Now create a new file called train_model_to_predict_position.pypreview-create_train_model_file-test-position
  3. Copy and save the following code (make sure copy all the way to the right). You can also copy it from our demo repo by clicking on this link and using Ctrl/Cmd+A.
    import snowflake.snowpark.functions as F
    from sklearn.model_selection import train_test_split
    import pandas as pd
    from sklearn.metrics import confusion_matrix, balanced_accuracy_score
    import io
    from sklearn.linear_model import LogisticRegression
    from joblib import dump, load
    import joblib
    import logging
    import sys
    from joblib import dump, load
    logger = logging.getLogger("mylog")
    def save_file(session, model, path, dest_filename):
        input_stream = io.BytesIO()
        joblib.dump(model, input_stream)
        session._conn.upload_stream(input_stream, path, dest_filename)
        return "successfully created file: " + path
    def model(dbt, session):
            packages = ['numpy','scikit-learn','pandas','numpy','joblib','cachetools'],
            materialized = "table",
            tags = "train"
        # Create a stage in Snowflake to save our model file
        session.sql('create or replace stage MODELSTAGE').collect()
        #session._use_scoped_temp_objects = False
        version = "1.0"'Model training version: ' + version)
        # read in our training and testing upstream dataset
        test_train_df = dbt.ref("training_testing_dataset")
        #  cast snowpark df to pandas df
        test_train_pd_df = test_train_df.to_pandas()
        target_col = "POSITION_LABEL"
        # split out covariate predictors, x, from our target column position_label, y.
        split_X = test_train_pd_df.drop([target_col], axis=1)
        split_y = test_train_pd_df[target_col]
        # Split out our training and test data into proportions
        X_train, X_test, y_train, y_test  = train_test_split(split_X, split_y, train_size=0.7, random_state=42)
        train = [X_train, y_train]
        test = [X_test, y_test]
            # now we are only training our one model to deploy
        # we are keeping the focus on the workflows and not algorithms for this lab!
        model = LogisticRegression()
        # fit the preprocessing pipeline and the model together, y_train)   
        y_pred = model.predict_proba(X_test)[:,1]
        predictions = [round(value) for value in y_pred]
        balanced_accuracy =  balanced_accuracy_score(y_test, predictions)
        # Save the model to a stage
        save_file(session, model, "@MODELSTAGE/driver_position_"+version, "driver_position_"+version+".joblib" )'Model artifact:' + "@MODELSTAGE/driver_position_"+version+".joblib")
        # Take our pandas training and testing dataframes and put them back into snowpark dataframes
        snowpark_train_df = session.write_pandas(pd.concat(train, axis=1, join='inner'), "train_table", auto_create_table=True, create_temp_table=True)
        snowpark_test_df = session.write_pandas(pd.concat(test, axis=1, join='inner'), "test_table", auto_create_table=True, create_temp_table=True)
        # Union our training and testing data together and add a column indicating train vs test rows
        return  snowpark_train_df.with_column("DATASET_TYPE", F.lit("train")).union(snowpark_test_df.with_column("DATASET_TYPE", F.lit("test")))
  4. Use the UI Build our train_model_to_predict_position model.
  5. Breaking down our Python script:
  1. Viewing our output of this model: preview-train-test-position
  2. Let's pop back over to Snowflake. To check that our logistic regression model has been stored in our MODELSTAGE open a SQL Worksheet and use the query below to list objects in your modelstage. Make sure you are in the correct database and development schema to view your stage (this should be PC_DBT_DB and your dev schema - for example dbt_hwatson).
    list @modelstage


  1. To investigate the commands run as part of script, navigate to Snowflake query history to view it Home button > Activity > Query History. We can view the portions of query that we wrote such as create or replace stage MODELSTAGE, but we also see additional queries that Snowflake uses to interpret python code. view-snowflake-query-history

Let's use our new trained model to create predictions!

Predicting on new data

It's time to use that 2020 data we held out to make predictions on!

  1. Create a new file under ml/training_and_prediction called and copy and save the following code (You can also copy it from our demo repo by clicking on this link and using Ctrl/Cmd+A.):
    import logging
    import joblib
    import pandas as pd
    import os
    from snowflake.snowpark import types as T
    version = '1.0'
    # The name of the model file
    model_file_path = 'driver_position_'+version
    model_file_packaged = 'driver_position_'+version+'.joblib'
    # This is a local directory, used for storing the various artifacts locally
    LOCAL_TEMP_DIR = f'/tmp/driver_position'
    DOWNLOAD_DIR = os.path.join(LOCAL_TEMP_DIR, 'download')
    TARGET_MODEL_DIR_PATH = os.path.join(LOCAL_TEMP_DIR, 'ml_model')
    TARGET_LIB_PATH = os.path.join(LOCAL_TEMP_DIR, 'lib')
    # The feature columns that were used during model training
    # and that will be used during prediction
    def register_udf_for_prediction(p_predictor ,p_session ,p_dbt):
        # The prediction udf
        def predict_position(p_df: T.PandasDataFrame[int, int, int, int,
                                            int, int, int, int, int]) -> T.PandasSeries[int]:
            # Snowpark currently does not set the column name in the input dataframe
            # The default col names are like 0,1,2,... Hence we need to reset the column
            # names to the features that we initially used for training.
            p_df.columns = [*FEATURE_COLS]
            # Perform prediction. this returns an array object
            pred_array = p_predictor.predict(p_df)
            # Convert to series
            df_predicted = pd.Series(pred_array)
            return df_predicted
        # The list of packages that will be used by UDF
        udf_packages = p_dbt.config.get('packages')
        predict_position_udf = p_session.udf.register(
            ,packages = udf_packages
        return predict_position_udf
    def download_models_and_libs_from_stage(p_session):
        p_session.file.get(f'@{DB_STAGE}/{model_file_path}/{model_file_packaged}', DOWNLOAD_DIR)
    def load_model(p_session):
        # Load the model and initialize the predictor
        model_fl_path = os.path.join(DOWNLOAD_DIR, model_file_packaged)
        predictor = joblib.load(model_fl_path)
        return predictor
    # -------------------------------
    def model(dbt, session):
            packages = ['snowflake-snowpark-python' ,'scipy','scikit-learn' ,'pandas' ,'numpy'],
            materialized = "table",
            tags = "predict"
        session._use_scoped_temp_objects = False
        predictor = load_model(session)
        predict_position_udf = register_udf_for_prediction(predictor, session ,dbt)
        # Retrieve the data, and perform the prediction
        hold_out_df = (dbt.ref("hold_out_dataset_for_prediction")
        trained_model_file = dbt.ref("train_model_to_predict_position")
        # Perform prediction.
        new_predictions_df = hold_out_df.withColumn("position_predicted"
        return new_predictions_df
  2. Use the UI Build our apply_prediction_to_position model.
  3. Commit and sync our changes to keep saving our work as we go using the commit message logistic regression model training and application before moving on. commit_training_and_predictioncommit_message_training_and_prediction
  4. At a high level in this script, we are:
  1. At a more detailed level:

🧠 Another way to read this script is from the bottom up. This can help us progressively see what is going into our final dbt model and work backwards to see how the other functions are being referenced.

  1. Let's take a look at our predicted position alongside our feature variables. Open a new scratchpad and use the following query. I chose to order by the prediction of who would obtain a podium position:
    select * from {{ ref('apply_prediction_to_position') }} order by position_predicted


We can see that we created predictions in our final dataset for each result.

  1. Run a fresh dbt build in the command bar to ensure our pipeline is working end to end. This will take a few minutes, (3 minutes and 2.4 seconds to be exact) so it's not a bad time to stretch (we know programmers slouch). This runtime is pretty performant since we're using an X-Smalll warehouse. If you want to speed up the pipeline, you can increase the warehouse size (good for SQL) or use a Snowpark-optimized Warehouses (good for Python) fresh_dbt_build_full_pipeline

Committing all development work

Before we jump into deploying our code, let's have a quick primer on environments. Up to this point, all of the work we've done in the dbt Cloud IDE has been in our development environment, with code committed to a feature branch and the models we've built created in our development schema in Snowflake as defined in our Development environment connection. Doing this work on a feature branch, allows us to separate our code from what other coworkers are building and code that is already deemed production ready. Building models in a development schema in Snowflake allows us to separate the database objects we might still be modifying and testing from the database objects running production dashboards or other downstream dependencies. Together, the combination of a Git branch and Snowflake database objects form our environment.

Now that we've completed applying prediction, we're ready to deploy our code from our development environment to our production environment and this involves two steps:

  1. Before getting started, let's make sure that we've committed all of our work to our feature branch. Our working branch,snowpark-python-workshop, should be clean. If for some reason you do still have work to commit, you'll be able to select the Commit and sync, provide a message, and then select Commit changes again.
  2. Once all of your work is committed, the git workflow button will now appear as Create pull request. create_pull_request_dbt_cloud_button
  3. This will bring you to your GitHub repo. This will show the commits that encompass all changes made since the last pull request. Since we only added new files we are able to merge into main without conflicts. Click Create pull request. review_commits_create_pull_request
  4. This goes to a Open a pull request page. Usually, when merging in a pull request (PR) we would create descriptions and motivations for the work being completed, validation our models work (like a fresh dbt build), and note changes to exisiting models (we only created new models and didn't alter existing ones). Then typically your teammates will review, comment, and independently test out the code on your branch. dbt has created a pull request template to make PRs as efficient and scalable to your analytics workflow.

The template is also located in our root directory under .github in the file When a PR is opened, the template will automatically be pulled in for you to fill out. For the workshop we'll do an abbreviated version of this for example. If you'd like you can just add a quick comment followed by Merge pull request since we're doing a workshop in an isolated Snowflake trial account (and won't break anything).

Our abbreviated PR template written markdown: pr_template_writen_markdown

PR preview: pr_template_preview

  1. Our PR is looking good. Let's Merge pull request. merge_pr_github
  2. Then click Confirm merge. confirm_merge_github
  3. It's best practice to keep your repo clean by deleting your working branch once merged into main. You can always restore it later, for now Delete Branch. We're all done in GitHub for today! delete_branch_github
  4. Head back over to your dbt Cloud browser tab. Under Version Control select Pull from "main". If you don't see this, refresh your browser tab and it should appear. pull_from_main_dbt_cloud.png
  5. Select Change branch to your main branch that now appears as (ready-only). change_branch_dbt_cloud.pngchange_to_main.pngcheckout_main_branch.png
  6. Finally, to bring our changes from our main branch in GitHub, select Pull from remotepull_from_remote_dbt_cloud
  7. Now that all of our development work has been merged to the main branch, we can build our deployment job. Given that our production environment and production job were created automatically for us through Partner Connect, all we need to do here is update some default configurations to meet our needs.
  8. In the menu, select Deploy > Environmentsdeploy_environments_ui

Setting your production schema

  1. You should see two environments listed and you'll want to select the Deployment environment then Settings to modify it.
  2. Before making any changes, let's touch on what is defined within this environment. The Snowflake connection shows the credentials that dbt Cloud is using for this environment and in our case they are the same as what was created for us through Partner Connect. Our deployment job will build in our PC_DBT_DB database and use the default Partner Connect role and warehouse to do so. The deployment credentials section also uses the info that was created in our Partner Connect job to create the credential connection. However, it is using the same default schema that we've been using as the schema for our development environment.
  3. Let's update the schema to create a new schema specifically for our production environment. Click Edit to allow you to modify the existing field values. Navigate to Deployment Credentials > schema.
  4. Update the schema name to production. Remember to select Save after you've made the change. name_production_schema
  5. By updating the schema for our production environment to production, it ensures that our deployment job for this environment will build our dbt models in the production schema within the PC_DBT_DB database as defined in the Snowflake Connection section.

Creating multiple jobs

In machine learning you rarely want to retrain your model as often as you want new predictions. Model training is compute intensive and requires person time for development and evaluation, while new predictions can run through an existing model to gain insights about drivers, customers, events, etc. This problem can be tricky, but dbt Cloud makes it easy by: automatically creating dependencies from your code and making setup for environments and jobs simple.

With this in mind we're going to have two jobs:

  1. Let's look at over to our production job created by partner connect. Click on the deploy tab again and then select Jobs. deploy_jobs_ui You should see an existing and preconfigured Partner Connect Trial Job. pc_default_job
  2. Similar to the environment, click on the job, then select Settings to modify it. Let's take a look at the job to understand it before making changes. pc_job_settings
  1. So, what are we changing then? The job name and commands!
  1. Now let's go to run our job. Clicking on the job name in the path at the top of the screen will take you back to the job run history page where you'll be able to click Run to kick off the job. In total we produced 106 entities: 14 view models, 67 tests, 24 table models, 1 incremental model. run_job
  2. Let's go over to Snowflake to confirm that everything built as expected in our production schema. Refresh the database objects in your Snowflake account and you should see the production schema now within our default Partner Connect database. If you click into the schema and everything ran successfully, you should be able to see all of the models we developed. job_run_output
  3. Go back to dbt Cloud and navigate to Deploy > Jobs > Create Job. Edit the following job settings:
  1. Run your job using Run Now. Remember the only difference between our first job and this job is we are excluding model retraining. So we will have one less model in our outputs. We can confirm this in our run steps.
  2. Open the job and go to Run Steps > Invoke. In our job details we can confirm one less entity (105 instead of 106).

That wraps all of our hands on the keyboard time for today!

Fantastic! You've finished the workshop! We hope you feel empowered in using both SQL and Python in your dbt Cloud workflows with Snowflake. Having a reliable pipeline to surface both analytics and machine learning is crucial to creating tangible business value from your data.

To learn more about how to combine Snowpark and dbt Cloud for smarter production, visit our page where you can book a demo to talk to an expert and try our quickstart focusing on dbt basics such as setup, connections, tests, and documentation.

Finally, for more help and information join our dbt community Slack which contains more than 65,000 data practitioners today. We have a dedicated slack channel #db-snowflake to Snowflake related content. Happy dbt'ing!