Modern businesses need modern data strategies, built on platforms that support agility, growth and operational efficiency.
Snowflake is the Data Cloud, a future-proof solution that simplifies data pipelines, so you can focus on data and analytics instead of infrastructure management.
dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines. It transforms data in the warehouse, leveraging cloud data platforms like Snowflake.
In this Quickstart, you will follow a step-by-step guide to using dbt with Snowflake, and see some of the benefits this tandem brings.
Let's get started.
ACCOUNTADMIN access
In this lab, we are going to analyze the historical trading performance of a company that has trading desks spread across different regions. As inputs, we are going to leverage the Knoema Economy Data Atlas datasets available in Snowflake Marketplace, along with a few manual uploads. To do this, we are going to set up the environments, build scalable pipelines in dbt, establish data tests, and promote code to production.
Just to give you a sneak peek, this is the data lineage we will be creating using dbt.
Stay tuned!
We will be using the new UI to get started, but you can also switch over to the Classic Console if you would like. It will not affect your dbt experience, but it may change where the buttons are in the Snowflake UI.
New UI:
Classic UI:
If you ever want to change from the new UI to the classic one, click on the home button and then Classic Console.
Now we need to obtain our raw data. We are going to the Snowflake Marketplace to connect to the Knoema dataset.
So what is happening here? Knoema has granted access to this data from their Snowflake account to yours. You're creating a new database in your account for this data to live in, but the best part is that no data actually moves between accounts! When you query, you're really querying the data that lives in the Knoema account. If they change the data, you'll automatically see those changes. There's no need to define schemas, move data, or create a data pipeline either.
Classic Console
at the bottom of the left-hand sidebar. This will take you back to where you can see the database displayed the way it was before.
The Economy Atlas comes with more than 300 datasets. To make navigation easier, the data provider has kindly supplied a table called DATASETS. Let's go ahead and find the ones related to the stock history and currency exchange rates that we are going to use in the next step.
select *
from "KNOEMA_ECONOMY_DATA_ATLAS"."ECONOMY"."DATASETS"
where "DatasetName" ILIKE 'US Stock%'
OR "DatasetName" ILIKE 'Exchange%Rates%';
select *
from KNOEMA_ECONOMY_DATA_ATLAS.ECONOMY.USINDSSP2020
where "Date" = '2020-02-03';
Congratulations! You have successfully tapped into a live feed of trade and FX rate data with NO ETL involved. Isn't this cool? Now let's start building our pipelines.
Now it's time to set up dbt. We are going to be using Snowflake Partner Connect to set up your dbt Cloud account and project. Using this method will allow you to spin up a fully fledged dbt account with your Snowflake connection, managed repository, environments, and credentials in minutes.
If you're using the classic console, it will be on the top right.
Connect to dbt
. Click on Optional Grant
and add in the KNOEMA_ECONOMY_DATA_ATLAS database. This will grant your new dbt user role access to the database. Then click Connect
. This will create a dedicated dbt user, database, warehouse, and role for your dbt Cloud trial.
To help you version control your dbt project, we have connected it to a managed repository, which means that dbt Labs will host your repository for you. This gives you access to a git workflow without you having to create and host the repository yourself. You will not need to know git for this workshop; dbt Cloud will guide you through the workflow. In the future, when you're developing your own project, feel free to use your own repository. That will let you explore features like Slim CI builds after this workshop.
Develop
. This will spin up your IDE (Integrated Development Environment), where you will be developing your dbt project. Next, click Initialize your project
to set up your dbt project. Once you click it, dbt will generate a starter project with the core files and folders. Then click Commit
. Enter a commit message, and then click Commit
again to commit your work to your master branch.
Commit messages should always be indicative of the work you are saving. This gives you a reference point for the future in case of auditing and debugging. By committing, you are saving to a remote branch in GitHub. This will also be the only time you save straight to your master branch (the main branch); we always want to keep a degree of separation between development work and your production branch.
create new branch
to check out a new git branch to start developing. Submit
. dbt run
on the command line at the bottom, and click Enter
. The command line is where you will enter dbt commands to execute dbt actions.
If you want to see the actual code being executed, you can go into the Details tab next to Summary
and look through the logs. Here you can see that dbt is writing the DDL for you, allowing you to focus on just writing the SQL select statement.
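To give you a flavor of what's in those logs, the wrapper dbt generates for the starter view looks roughly like this (a sketch; <dev_schema> stands for your development schema, and the exact DDL appears in the Details logs):
create or replace view pc_dbt_db.<dev_schema>.my_second_dbt_model as (
    select *
    from pc_dbt_db.<dev_schema>.my_first_dbt_model
    where id = 1
);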
PC_DBT_WH
. Then the development schema with your first initial and last name. And lastly Tables and Views. You should see the table MY_FIRST_DBT_MODEL
and the view MY_SECOND_DBT_MODEL
.
Congratulations! You just ran your first dbt models on Snowflake!
We are going to start by adding a few more things to our dbt project configuration in order to improve maintainability.
You should have a file tree that looks like this after completing this task.
dbt_project.yml
. The first thing we are going to do is update the project name from "my_new_project" to "<your_name>_dbt_workshop" at line 5. This will personalize the project a little and make it yours.
Do not forget to save after you have updated the file! You can either click Save or use Cmd+S (Mac) / Ctrl+S (PC). The IDE will also warn you with the teal button on the file tab.
Copy the code below and paste it into lines 34 to 38. Remember to replace <your_name>.
models:
  <your_name>_dbt_workshop:
    example:
      schema: example
    staging:
      schema: staging
      materialized: view
    marts:
      schema: marts
      materialized: table
generate_schema_name.sql
.
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
dbt run
on the command line. Note how the schema has changed: the custom schema name is now appended to our default schema, per our macro. You can add more conditionals to fit your own needs for environment promotion.
In this workshop, we will show you how different configurations can easily change how and where your dbt project reads and writes, without you needing to change your code during code promotion.
Macros are a way to do this. Macros are written in a pythonic templating language called Jinja. Jinja allows you to do things that normally aren't possible in SQL, such as creating control structures (like if statements and for loops) or abstracting snippets of SQL into reusable functions you can apply throughout your project. Jinja helps you write DRY code in a language that is generally wet.
In this generate_schema_name macro, we are defining which schema your objects should be built into, based on the conditionals specified. Walking through the macro, this part:
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
defines our default schema as the target schema configured for your development environment. The Partner Connect flow sets up our development environment to point to the Partner Connect-created database and a schema made of your first initial and last name. You can verify this in your profile. However, if a custom schema is provided, as we did in our dbt_project.yml file, dbt will build into a schema that is a concatenation of our default schema and the custom schema name, due to this logic:
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
All of this logic is defined in one place and applied across your whole project, which makes it simple to scale the project, especially as new business needs arise.
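To make that concrete: if <dev_schema> is your development target schema, the macro resolves schema names like this:
-- custom schema not set          ->  <dev_schema>
-- custom schema set to 'staging' ->  <dev_schema>_staging
-- custom schema set to 'marts'   ->  <dev_schema>_marts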
Later on, we will learn how to use macros from a package and see how those macros help us write SQL quickly.
query_tag.sql
.
{% macro set_query_tag() -%}
{% set new_query_tag = model.name %} {# always use model name #}
{% if new_query_tag %}
{% set original_query_tag = get_current_query_tag() %}
{{ log("Setting query_tag to '" ~ new_query_tag ~ "'. Will reset to '" ~ original_query_tag ~ "' after materialization.") }}
{% do run_query("alter session set query_tag = '{}'".format(new_query_tag)) %}
{{ return(original_query_tag)}}
{% endif %}
{{ return(none)}}
{% endmacro %}
dbt run
again on the command line.
Update the filter so the user is PC_DBT_USER and the status is Succeeded to see the same view as the screenshot above. You can also remove extra columns by hovering over a column, clicking on the arrow that appears, hovering over "Columns", and unchecking any columns you don't want.
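If you prefer SQL to the History page, a query along these lines against the account usage views should surface the tagged queries (a sketch; account_usage can lag real time by up to about 45 minutes, and the tag value here assumes the starter model name):
select query_text, query_tag, user_name, total_elapsed_time
from snowflake.account_usage.query_history
where query_tag = 'my_first_dbt_model'
order by start_time desc;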
packages.yml
at the same level as your dbt_project.yml
file.
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
Save
.
When you run dbt deps
, you tell dbt to install the packages. With a successful run, you can check out the dbt_modules folder to see which packages you have installed and the code that is now available to you. Next, go back to your dbt_project.yml
file and delete the example folder with the models in it. Then click Commit
to commit your work!
Now we start to get into the fun stuff. In the next few sections, we are going to build our dbt pipelines. This will include transformations that define these areas of interest:
knoema_sources.yml
file in the staging/knoema folder.
version: 2
sources:
  - name: knoema_economy_data_atlas
    database: knoema_economy_data_atlas
    schema: economy
    tables:
      - name: exratescc2018
      - name: usindssp2020
As you might remember, these two objects were listed in the Knoema dataset catalog table: daily exchange rates and daily US trading history, respectively.
Commit
and provide a good commit message like "set up knoema source file". We are going to keep building that muscle of saving your work as you develop.Now we have defined the sources that we will be building our dbt models on top of. By declaring these, we are able to create lineage from our raw objects to our transformed objects and apply documentation and testing.
stg_knoema_fx_rates.sql
.
(To learn more about our recommendations around coding style, check out this repository.)
with source as (
select * from {{source('knoema_economy_data_atlas','exratescc2018') }}
),
renamed as (
select
"Currency" as currency,
"Currency Unit" as currency_unit,
"Frequency" as frequency,
"Date" as exchange_date,
"Value" as exchange_value,
"Indicator" as indicator,
"Indicator Name" as indicator_name,
'Knoema.FX Rates' as data_source_name
from source
)
select * from renamed
Preview
button. You can also click on Compile
to see the compiled code that will be executed.
In the compiled code, you will notice that the source function ({{ source('knoema_economy_data_atlas', 'exratescc2018') }}) compiles to the actual database object.
Let's talk more about this function. Using the source function allows you to define the location of your raw data once, build lineage from those raw objects to your transformed models, and apply documentation and testing to your sources, all without hardcoding database names into your models.
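For instance, the call in our staging model resolves to the fully qualified object defined in knoema_sources.yml, roughly:
-- {{ source('knoema_economy_data_atlas', 'exratescc2018') }} compiles to:
select * from knoema_economy_data_atlas.economy.exratescc2018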
stg_knoema_stock_history.sql
.
with source as (
select * from {{source('knoema_economy_data_atlas','usindssp2020')}}
),
renamed as (
select
"Company" as company,
"Company Name" as company_name,
"Company Ticker" as company_symbol,
"Stock Exchange" as stock_exchange,
"Stock Exchange Name" as stock_exchange_name,
"Indicator" as indicator,
"Indicator Name" as indicator_name,
"Units" as units,
"Scale" as scale,
"Frequency" as frequency,
"Date" as stock_date,
"Value" as stock_value,
'Knoema.Stock History' as data_source_name
from source
)
select * from renamed
dbt run --select staging.*
This allows us to be selective and only run the models in the staging folder. Node selectors are very helpful in refining what we actually want to run and test.
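A few other selector patterns worth knowing (standard dbt node selection syntax, shown here with a model from this project):
dbt run --select stg_knoema_fx_rates
dbt run --select stg_knoema_fx_rates+
dbt run --select +stg_knoema_fx_rates
The first runs just that model, the trailing plus also runs everything downstream of it, and the leading plus also runs everything upstream of it; we will use the leading-plus form a little later in this guide.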
Commit
and provide a good commit message like "set up staging models"select *
from pc_dbt_db.<dev_schema>_staging.stg_knoema_stock_history
where company_symbol ='AAPL'
and stock_date ='2021-03-01'
In this dataset, different measures like Close, Open, High and Low price are represented as different rows. For our use case, this is a bit suboptimal - it would be better to see that data transposed into columns; something more like this:
So let's do that!
To achieve this, we are going to have to create a few more models:
New File
. Remove the prewritten path and paste in the following path: models/marts/core/intermediate/int_knoema_stock_history.sql
This will create the file as well as the intermediate folder in the core folder to organize the intermediate models.
Intermediate models are transformation steps on the way to the end data product, but they won't be queried directly by downstream processes like the BI layer.
with stock_history as (
select * from {{ ref('stg_knoema_stock_history') }}
where indicator_name in ('Close', 'Open','High','Low', 'Volume', 'Change %')
),
pivoted as (
select
company_symbol,
company_name,
stock_exchange_name,
stock_date,
data_source_name,
{{ dbt_utils.pivot(
column = 'indicator_name',
values = dbt_utils.get_column_values(ref('stg_knoema_stock_history'), 'indicator_name'),
then_value = 'stock_value'
) }}
from stock_history
group by company_symbol, company_name, stock_exchange_name, stock_date, data_source_name
)
select * from pivoted
int_knoema_stock_history_alt.sql
with stock_history as (
select * from {{ ref('stg_knoema_stock_history') }}
where indicator_name IN ('Close', 'Open','High','Low', 'Volume', 'Change %')
),
pivoted as (
select * from stock_history
pivot(sum(stock_value) for indicator_name in ('Close', 'Open','High','Low', 'Volume', 'Change %')
) as
p(
company_symbol,
company_name,
stock_exchange_name,
stock_date,
data_source_name,
close,
open,
high,
low,
volume,change
)
)
select * from pivoted
What did you think of the difference? While both models create the same end result, the dbt_utils.pivot macro requires less code and less brainpower: all you had to do was fill in the column and the model to obtain the same end result. It is also more dynamic, automatically updating at compile time if a new indicator value appears.
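Under the hood, the pivot macro expands into one conditional aggregation per indicator value, roughly along these lines (an abbreviated sketch, not the exact compiled output):
sum(case when indicator_name = 'Close' then stock_value else 0 end) as "Close",
sum(case when indicator_name = 'Open' then stock_value else 0 end) as "Open",
-- ...and one expression like the above for each remaining indicator value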
You might have also noticed that in these models we are using a function called ref. The ref function is very similar to the source function, but rather than telling dbt how a model relates to a source declared in a sources .yml file, it tells dbt how a model relates to another model. It's useful for the same reasons as source(): it builds the dependency graph dbt uses to run models in the right order, and it lets dbt resolve the correct database and schema for each environment.
The main takeaway to remember: in a dbt project, you should never have to hardcode a database object. Always use source() and ref().
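In other words, prefer the function over a fully qualified name; for example:
-- avoid hardcoding a fully qualified object:
--   select * from pc_dbt_db.<dev_schema>_staging.stg_knoema_fx_rates
-- prefer letting dbt resolve the database and schema for each environment:
select * from {{ ref('stg_knoema_fx_rates') }}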
dbt run --select +int_knoema_stock_history
You should see that dbt has run all of the nodes upstream of this model (excluding sources), that is, everything to the left of it in the lineage graph.
SELECT *
FROM pc_dbt_db.<dev_schema>_marts.int_knoema_stock_history
WHERE company_symbol = 'AAPL'
AND stock_date = '2021-03-01'
Looks like we have what we wanted!
Now let's take a look at our exchange rates and create some intermediate tables on top of them.
int_fx_rates.sql
in the intermediate folder (models/marts/core/intermediate/int_fx_rates.sql).
{{
config(
materialized='view',
tags=["hourly"]
)
}}
select * from {{ ref('stg_knoema_fx_rates') }}
where indicator_name = 'Close'
and frequency = 'D'
and exchange_date > '2016-01-01'
There's now a model configuration block at the top of our model. dbt offers various materialization options; by default, if a materialization is not declared, the model is created as a view. For the directory this model is nested in (marts), we have configured in our dbt_project.yml
that the default materialization is table
. Here we override that project-level materialization with a model-level configuration, declaring the model to be materialized as a view. We also include a tag. Tags can be used to run parts of your project and are a good way to group your models based on content and intent.
dbt run --select tag:hourly
Being able to run models based on tags can be helpful for, say, an hourly job where you are selectively running specific models.
int_stock_history_major_currency.sql
in the intermediate folder. This model will start bringing FX and Trade history sets together.
with
stock_history as (
select * from {{ ref('int_knoema_stock_history')}}
),
fx_rates as (
select * from {{ ref('int_fx_rates') }}
),
fx_rates_gdp as (
select * from fx_rates
where currency = 'USD/GBP'
),
fx_rates_eur as (
select * from fx_rates
where currency = 'USD/EUR'
),
joined as (
select
stock_history.*,
fx_rates_gdp.exchange_value * stock_history."Open" as gbp_open,
fx_rates_gdp.exchange_value * stock_history."High" as gbp_high,
fx_rates_gdp.exchange_value * stock_history."Low" as gbp_low,
fx_rates_gdp.exchange_value * stock_history."Close" as gbp_close,
fx_rates_eur.exchange_value * stock_history."Open" as eur_open,
fx_rates_eur.exchange_value * stock_history."High" as eur_high,
fx_rates_eur.exchange_value *stock_history."Low" as eur_low,
fx_rates_eur.exchange_value * stock_history."Close" as eur_close
from stock_history
left join fx_rates_gdp on stock_history.stock_date = fx_rates_gdp.exchange_date
left join fx_rates_eur on stock_history.stock_date = fx_rates_eur.exchange_date
)
select * from joined
dbt run --select +int_stock_history_major_currency
Now that we have created a few models, let's talk about dbt documentation. dbt is able to generate a static webpage with a data dictionary by pulling in information from your dbt project as well as your Snowflake information_schema. It also provides an interactive DAG so you can see the full lineage of your models; this is something you might have also noticed in the lineage tab of the IDE. This is a fantastic way to share information with your internal teams, as it contains all the important information about columns, tags, free-form model descriptions, and tests, as well as the source code, and it always stays in line with the code. So regardless of how big your project grows, it is easy to understand what's happening with the help of dbt's documentation.
dbt docs generate
View Docs
on the top left side.
Following our use case story, we are going to manually upload two small datasets representing the trading books of two desks, using dbt seed. As you might notice, the desks were buying and selling AAPL shares, but logging the cash paid/received in different currencies: GBP and EUR.
manual_book1.csv
.
Book,Date,Trader,Instrument,Action,Cost,Currency,Volume,Cost_Per_Share,Stock_exchange_name
B2020SW1,2021-03-03,Jeff A.,AAPL,BUY,-17420,GBP,200,87.1,NASDAQ
B2020SW1,2021-03-03,Jeff A.,AAPL,BUY,-320050,GBP,3700,86.5,NASDAQ
B2020SW1,2021-01-26,Jeff A.,AAPL,SELL,52500,GBP,-500,105,NASDAQ
B2020SW1,2021-01-22,Jeff A.,AAPL,BUY,-100940,GBP,980,103,NASDAQ
B2020SW1,2021-01-22,Nick Z.,AAPL,SELL,5150,GBP,-50,103,NASDAQ
B2020SW1,2019-08-31,Nick Z.,AAPL,BUY,-9800,GBP,100,98,NASDAQ
B2020SW1,2019-08-31,Nick Z.,AAPL,BUY,-1000,GBP,50,103,NASDAQ
manual_book2.csv
in the data folder.
Book,Date,Trader,Instrument,Action,Cost,Currency,Volume,Cost_Per_Share,Stock_exchange_name
B-EM1,2021-03-03,Tina M.,AAPL,BUY,-17420,EUR,200,87.1,NASDAQ
B-EM1,2021-03-03,Tina M.,AAPL,BUY,-320050,EUR,3700,86.5,NASDAQ
B-EM1,2021-01-22,Tina M.,AAPL,BUY,-100940,EUR,980,103,NASDAQ
B-EM1,2021-01-22,Tina M.,AAPL,BUY,-100940,EUR,980,103,NASDAQ
B-EM1,2019-08-31,Tina M.,AAPL,BUY,-9800,EUR,100,98,NASDAQ
dbt seed
command to load the data into Snowflake. It is important to mention that while this approach can comfortably handle files in the low hundreds of thousands of rows, it was not designed for larger datasets; for those, you should use COPY/Snowpipe or other data integration options recommended for Snowflake. Now run
dbt seed
You can see that it loaded the files into the default schema because we didn't declare a schema for them. If we wanted the files loaded into a specific schema, we could go back to the dbt_project.yml and add that schema configuration.
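For example, a seed configuration along these lines in dbt_project.yml (a sketch; the +schema syntax assumes a reasonably recent dbt version) would route the seeds into their own custom schema, which our generate_schema_name macro would turn into <dev_schema>_seeds:
seeds:
  <your_name>_dbt_workshop:
    +schema: seeds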
int_unioned_book.sql
in the intermediate folder.
with
unioned as (
{{ dbt_utils.union_relations(
relations=[ref('manual_book1'), ref('manual_book2')]
) }}
),
renamed as (
select
Book,
Date as book_date,
Trader,
Instrument,
Action as book_action,
Cost,
Currency,
Volume,
Cost_Per_Share,
Stock_exchange_name
from unioned
)
select * from renamed
As you can see, the macro from the dbt_utils package wrote the code for us, aligned the attributes by name and type, and combined the datasets via a UNION ALL. The beauty of this is that we generated roughly 46 lines of compiled SQL by writing only 3. It's also dynamic, so if we added new columns to the referenced models, dbt would update the compiled code with the new fields. You can imagine the amount of time saved.
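Conceptually, the compiled SQL is equivalent to selecting each relation's type-aligned column list and stacking the results (a simplified sketch; the macro also adds a column recording which relation each row came from):
select book, date, trader, instrument, action, cost, currency, volume, cost_per_share, stock_exchange_name
from {{ ref('manual_book1') }}
union all
select book, date, trader, instrument, action, cost, currency, volume, cost_per_share, stock_exchange_name
from {{ ref('manual_book2') }}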
dbt run --select int_unioned_book
int_daily_position.sql
in the intermediate folder.
with
stock_history as (
select * from {{ ref('int_stock_history_major_currency') }}
),
unioned_book as (
select * from {{ ref('int_unioned_book') }}
),
cst_market_days as (
select distinct stock_date
from stock_history
where stock_history.stock_date >= (select min(book_date) as min_dt from unioned_book)
),
joined as (
select
cst_market_days.stock_date,
unioned_book.trader,
unioned_book.stock_exchange_name,
unioned_book.instrument,
unioned_book.book,
unioned_book.currency,
sum(unioned_book.volume) as total_shares
from cst_market_days
inner join unioned_book on unioned_book.book_date = cst_market_days.stock_date
where unioned_book.book_date <= cst_market_days.stock_date
{{ dbt_utils.group_by(6) }}
)
select * from joined
You might have noticed we used another dbt_utils macro. Rather than having to manually list out all of the columns to group by, you can use the dbt_utils group_by macro to write it for you.
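For reference, that macro call simply compiles to a positional group by:
group by 1,2,3,4,5,6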
int_daily_position_with_trades.sql
in the intermediate folder.
with unioned_book as (
select * from {{ ref('int_unioned_book') }}
),
daily_position as (
select * from {{ ref('int_daily_position') }}
),
unioned as (
select
book,
book_date,
trader,
instrument,
book_action,
cost,
currency,
volume,
cost_per_share,
stock_exchange_name,
sum(unioned_book.volume)
over(
partition by
instrument,
stock_exchange_name,
trader
order by
unioned_book.book_date rows unbounded preceding)
as total_shares
from unioned_book
union all
select
book,
stock_date as book_date,
trader,
instrument,
'HOLD' as book_action,
0 as cost,
currency,
0 as volume,
0 as cost_per_share,
stock_exchange_name,
total_shares
from daily_position
where (book_date,trader,instrument,book,stock_exchange_name)
not in
(select book_date,trader,instrument,book,stock_exchange_name
from unioned_book
)
)
select * from unioned
dbt run --select int_unioned_book+
Now that we have created all of these models, we are able to attribute shares by trader.
select *
from pc_dbt_db.<dev_schema>_marts.int_daily_position_with_trades
where trader = 'Tina M.'
order by book_date
Now for the last models in our tale. We have the trading history of our desks and our stock price history.
Let's create a model to show how Market Value and PnL changed over time. You might notice that this is going into our core folder. This is because this is a part of the core logic of our fictional company. This model will be a source of truth, to be used by many in downstream processes like a BI tool.
fct_trading_pnl.sql
in the core folder.
{{
config(
tags = 'core'
)
}}
with
daily_positions as (
select * from {{ ref('int_daily_position_with_trades' )}}
),
stock_history as (
select * from {{ ref('int_stock_history_major_currency') }}
),
joined as (
select
daily_positions.instrument,
daily_positions.stock_exchange_name,
daily_positions.book_date,
daily_positions.trader,
daily_positions.volume,
daily_positions.cost,
daily_positions.cost_per_share,
daily_positions.currency,
sum(cost) over(
partition by
daily_positions.instrument,
daily_positions.stock_exchange_name,
trader
order by
daily_positions.book_date rows unbounded preceding
)
as cash_cumulative,
case when daily_positions.currency = 'GBP' then gbp_close
when daily_positions.currency = 'EUR' then eur_close
else "Close"
end as close_price_matching_ccy,
daily_positions.total_shares * close_price_matching_ccy as market_value,
daily_positions.total_shares * close_price_matching_ccy + cash_cumulative as PnL
from daily_positions
inner join stock_history
on daily_positions.instrument = stock_history.company_symbol
and stock_history.stock_date = daily_positions.book_date
and daily_positions.stock_exchange_name = stock_history.stock_exchange_name
)
select * from joined
dbt run --select fct_trading_pnl
You might have noticed that fct_trading_pnl takes some time to build when you do a dbt run. As we use larger and larger datasets, the run times get longer. We have already materialized our model as a table, meaning we front-load the run time to save on query time when the table is queried. We could always increase our warehouse size, but there's another trick up our sleeve.
We can materialize the model as an incremental model, meaning that it will not be rebuilt from scratch each time; instead, only the latest rows are transformed and added to the existing table. All you have to do is update the model configuration and include the is_incremental macro, which comes into play on incremental runs (and is ignored during the initial run and when the --full-refresh flag is used). You can learn more about the incremental materialization here.
fct_trading_pnl_incremental.sql
in the core folder.
{{
config(
materialized='incremental',
unique_key= 'pk_key',
tags = 'core'
)
}}
with
daily_positions as (
select * from {{ ref('int_daily_position_with_trades' )}}
),
stock_history as (
select * from {{ ref('int_stock_history_major_currency') }}
),
joined as (
select
daily_positions.instrument,
daily_positions.stock_exchange_name,
daily_positions.book_date,
daily_positions.trader,
daily_positions.volume,
daily_positions.cost,
daily_positions.cost_per_share,
daily_positions.currency,
sum(cost) over(
partition by
daily_positions.instrument,
daily_positions.stock_exchange_name,
trader
order by
daily_positions.book_date rows unbounded preceding
)
as cash_cumulative,
case when daily_positions.currency = 'GBP' then gbp_close
when daily_positions.currency = 'EUR' then eur_close
else "Close"
end as close_price_matching_ccy,
daily_positions.total_shares * close_price_matching_ccy as market_value,
daily_positions.total_shares * close_price_matching_ccy + cash_cumulative as PnL
from daily_positions
inner join stock_history
on daily_positions.instrument = stock_history.company_symbol
and stock_history.stock_date = daily_positions.book_date
and daily_positions.stock_exchange_name = stock_history.stock_exchange_name
),
primary_key as (
select
{{ dbt_utils.surrogate_key([
'trader',
'instrument',
'book_date',
'stock_exchange_name',
'PnL',
]) }} as pk_key,
*
from joined
)
select * from primary_key
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where book_date > (select max(book_date) from {{ this }})
{% endif %}
dbt run --select fct_trading_pnl_incremental
twice and compare the logs.
For the first run, you will notice that the where clause does not make it into the SQL statement.
In subsequent runs, dbt includes the where clause to create a temporary table, which is then merged into the existing table. The first run of an incremental model builds the table that new rows will be added to in subsequent runs.
First run:
Second run:
You will not see a true performance improvement in this example because our dataset is not growing between runs, but in production applications you will.
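If you ever need to rebuild the incremental table from scratch (for example, after changing its logic), you can add the full-refresh flag, which ignores the is_incremental filter and recreates the table:
dbt run --full-refresh --select fct_trading_pnl_incremental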
To build trust in data with your organization, it is impossible to overstate the importance of testing and documentation. While there are many ways to organize automated testing and documentation, dbt comes with an accessible data testing and documentation framework.
For tests, dbt comes with a set of 4 pre-defined data tests: unique, not_null, accepted_values, and relationships.
We are going to set up tests on a few models for this workshop. However, we strongly recommend you establish reasonable test coverage across the board. Adding your own tests is easy: as long as the logic can be written as a select statement that returns the failing rows, you can use it as a test. And just like with macros, there are plenty of fantastic packages out there, like the dbt_expectations package, which provides tests similar to the Great Expectations open source project.
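As a sketch of what a custom (singular) test could look like, you might add a file such as tests/assert_books_have_positive_cost_per_share.sql (a hypothetical file name and rule, not part of this lab); dbt treats any rows the query returns as failures:
-- hypothetical singular test: fails if any booked trade has a non-positive cost per share
select *
from {{ ref('int_unioned_book') }}
where cost_per_share <= 0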
For documentation, dbt takes model and column descriptions and adds them to the documentation site, giving you even more information about your models to share with your stakeholders. It also pulls in information from the Snowflake information schema, making the site a one-stop shop for understanding your data project.
Now let's add testing and documentation to this project.
intermediate.yml
in the intermediate folder.
version: 2
models:
  - name: int_fx_rates
    description: "An intermediate model that filters stg_knoema_fx_rates"
    columns:
      - name: currency||exchange_date
        tests:
          - unique
          - not_null
  - name: int_unioned_book
    description: "An intermediate model that unions the manual_book csvs"
    columns:
      - name: instrument
        tests:
          - not_null
          - relationships:
              to: ref('int_knoema_stock_history')
              field: company_symbol
  - name: int_knoema_stock_history
    description: "An intermediate model that pivots the stg_knoema_stock_history model by indicator"
    columns:
      - name: company_symbol||stock_date
        tests:
          - not_null
          - unique
dbt test
Oh no! One of our tests failed! Let's try to understand why. If you click into the details tab, dbt provides you with the SQL query that failed.
Now let's take that query and paste it either into the Snowflake UI or the IDE to debug it.
with cst as
(
select
company_symbol||stock_date as concat
from <dev_schema>_marts.int_knoema_stock_history
where company_symbol||stock_date is not null
group by company_symbol||stock_date
having count(*) > 1
limit 1
)
select * from <dev_schema>_marts.int_knoema_stock_history
where company_symbol||stock_date in (select concat from cst)
Aha! There are shares that are traded on more than one stock exchange. So we need to include the stock_exchange_name attribute in the unique test key.
intermediate.yml
file and update the test configuration for the int_knoema_stock_history model:
version: 2
models:
  - name: int_fx_rates
    description: "An intermediate model that filters stg_knoema_fx_rates"
    columns:
      - name: currency||exchange_date
        tests:
          - unique
          - not_null
  - name: int_unioned_book
    description: "An intermediate model that unions the manual_book csvs"
    columns:
      - name: instrument
        tests:
          - not_null
          - relationships:
              to: ref('int_knoema_stock_history')
              field: company_symbol
  - name: int_knoema_stock_history
    description: "An intermediate model that pivots the stg_knoema_stock_history model by indicator"
    columns:
      - name: company_symbol||stock_date||stock_exchange_name
        tests:
          - not_null
          - unique
dbt test -m int_knoema_stock_history
Look at all that green! Now the last thing to do is to generate the documentation and see how our description fields are populated.
dbt docs generate
and click on View Docs.
You should now see the description field populated.
Okay, it seems like we have everything in place: pipelines have been developed, tested and documented.
The next step is to promote this code up the chain through our environments (which in this lab are simplified to just DEV and PROD). We'll do this by merging the work on our feature branch into the main branch and setting up a job to orchestrate the execution of the models in production. Generally it would be best to open a pull request so that your code is reviewed and tested via Slim CI prior to promotion to production, but in the interest of keeping things concise for this workshop, we are going to merge into production without review.
After merging into master, you should be back on your master branch, ready to start the development cycle over again.
Click on the hamburger menu and click on Environments
.
You should see two different Environments: Development and Deployment.
Deployment
environment and then click on Settings
in the top right. We are going to update the deployment schema to production
. To do this, click on Edit, change the schema name, and then click Save
at the top. In doing this, dbt will default to building into a schema named production
in the declared default database pc_dbt_db. This helps us separate the objects we create in our production environment from our development sandbox. Next, click on Jobs
.Partner Connect Trial Job
.Settings
.Edit
on the top right. Rename the job to Production Job
to signify that this is going to be your production run. Update the number of threads to 8; this allows dbt to run up to 8 models in the DAG concurrently, as long as their dependencies are met, and 8 is the default we recommend for Snowflake. Click Save
. Then click Run now
. And this concludes our workshop! If you're interested in exploring further, there are "extra credit" tasks in the appendix of this workshop, in which you'll cover things like how to visualize what you've created with Snowsight.
Congratulations on completing the lab!
Today, you learned how to use dbt and Snowflake to build data transformation pipelines for analytics. You're now ready to apply these fundamentals to your own data. We encourage you to continue with your free trial by loading your own sample or production data, and by continuing to dive into some of the more advanced functionality of dbt Cloud and Snowflake.
select *
from pc_dbt_db.<dev_schema>_marts.int_daily_position_with_trades
where trader = 'Tina M.'
order by book_date
And that's it! Now you have a worksheet that you can slice and dice, share with your colleagues, or embed in a Snowsight dashboard as a tile. For more details on Snowsight, please refer to the Snowflake documentation.