Welcome to the Build Data Products and Data Mesh with dbt Cloud Quickstart guide! Before you get hands-on with the technology, let's define what a data mesh is and what motivates the topic.
A data mesh is a decentralized data management architecture comprising domain-specific data. Instead of having a single centralized data platform, teams own the processes around their own data.
In a data mesh framework, teams own not only their own data, but also the data pipelines and processes associated with transforming it. A central data engineering team maintains both key data sets and a suite of self-service tools to enable individual data ownership. Domain-specific data teams then exchange data via well-defined and versioned contracts.
Data mesh architecture aims to solve the lingering issues in data systems by applying to data systems the same approach that software engineering teams take to software systems. It does so by enacting the following four principles:

- Domain-oriented ownership of data
- Data as a product
- Self-serve data platform
- Federated computational governance
In the end, an operational data mesh architecture may look something like this:
The following two conditions are indicators that you may benefit from a data mesh architecture.
The architectural choice to use a data monolith has numerous knock-on effects. Monolithic approaches break a data processing pipeline down into several stages: ingestion, processing, and serving.
A single team often handles all of these stages. This approach can work at first but breaks down with scale. As more and more requests come in, the data engineering team finds itself unable to respond to them promptly. This leads to an ever-growing backlog of feature requests and bug fixes. This slows down the pace of innovation and also leads to the system becoming more brittle over time.
Data engineering teams in this model often can't gain the full context behind the underlying data. Since they're responsible for maintaining data sets from multiple disparate teams, they often don't fully understand the business rationale behind the data.
This can lead them to make uninformed – and, sometimes, harmful – decisions that impact business decision-making. For example, a data engineering team may format data in a way that the sales department doesn't expect. This can lead to broken reports or even lost data.
Monolithic systems rarely have clear contracts or boundaries. This means that data formatting changes upstream can break an untold number of downstream consumers. The result: teams avoid making necessary changes for fear of breaking everything, and the monolithic system gradually becomes outdated, brittle, and hard to maintain.
Finally, collaboration also becomes more difficult in a monolithic system. Since no one is familiar with the entire codebase, it takes more people and more time to complete data-related tasks. This affects time to market for new products and features – which impacts the company's bottom line.
A data product is a useful information asset, accessed by a clearly defined interface, with an expectation of timeliness and correctness by the consumer. Teams build and maintain data products using processes and technologies that they are responsible for, though they could be supported by other platform-oriented teams.
The properties of a data product are:

- Discoverable
- Addressable
- Trustworthy
- Self-describing
- Interoperable
- Secure
Throughout this guide, imagine yourself building a data platform with Snowflake and dbt Cloud for your organization, starting with two teams: a central data team and a finance team.
In each step of the guide, you will learn how to use features of both Snowflake and dbt Cloud to build data products.
By the end of this guide, you will have built a governed, discoverable data product in a foundational dbt project and consumed it from a downstream finance project.
Here are the requirements for this Quickstart Guide:

- A Snowflake account with access to the SNOWFLAKE_SAMPLE_DATA database and the TPCH_SF1 schema. In the following step within this Quickstart Guide, you will be given instructions on how to sign up for a Snowflake account.
- A dbt Cloud account. In the following step within this Quickstart Guide, you will be given instructions on how to sign up for a dbt Cloud account using Snowflake Partner Connect.
In this step, you will set up Snowflake and dbt Cloud accounts for use in the rest of the Quickstart guide.
Navigate to signup.snowflake.com and follow the steps. A few things to note:

- An email to activate your account will be sent after you complete the sign-up flow. Click the "Click to Activate" button in the email.
- Next, specify a username and password for your account.
- When prompted to either explore a sample data set or load data into Snowflake, select the "Skip for Now" button.
Your Snowflake account is ready!
- In the Snowflake UI, click on Data Products in the lefthand sidebar, then Partner Connect, which is located within the Data Products section.
- Enter dbt into the Search Partner Connect search bar, then click on the dbt tile.
- A Connect to dbt window will appear that contains all of the associated objects created by Partner Connect. Click on the Optional Grant dropdown menu and add Snowflake_Sample_Data in the text box. This will grant your new dbt user role access to the database. Once that's entered, click Connect. This will create a dedicated dbt user, database, warehouse, and role for your dbt Cloud trial.
- When prompted with Your partner account has been created, click on Activate.
- If this is your first time in dbt Cloud, you'll see the screen below. Add your Account Name and a password for your user.
In this step, you will be setting up Snowflake for two teams: the central data team and the finance team, as shown in the diagram below. You will be using least privileged access principles in order to properly secure the data.
The central data team is well-established in the organization, and the average team member is capable of building data pipelines that power business reporting across various domains: finance, marketing, sales, customer support, and so on. The team uses data management best practices, like organizing data in dimensional models, for maximum re-usability in various BI and AI/ML applications.
The first thing you'll need to do is set up a role specifically for applying these governance practices to the Snowflake environment. The code below will:
- Create a foundational_role role for creating and managing resources in the foundational_db database using the foundational_wh warehouse. It will also be able to administer data governance responsibilities and grant appropriate permissions for masking and tagging.
- Create a foundational_pii_reader_role for users who can access PII data unmasked.

In the Snowflake UI, click on Projects in the lefthand sidebar, then Worksheets, which is located within the Projects section. Then select the blue plus icon in the top-right of the screen to create a new SQL worksheet.
Then copy the code below, paste it into the worksheet, select all of the statements, and run them.
use role accountadmin;
create database if not exists foundational_db;
create schema if not exists foundational_db.prod;
create or replace warehouse foundational_wh with warehouse_size = xsmall;
create role if not exists foundational_role;
create role if not exists foundational_pii_reader_role;
grant role foundational_pii_reader_role to role foundational_role;
grant usage on database foundational_db to role foundational_role;
grant usage on schema foundational_db.prod to role foundational_role;
grant usage on warehouse foundational_wh to role foundational_role;
grant create schema on database foundational_db to role foundational_role;
grant create table on schema foundational_db.prod to role foundational_role;
grant create view on schema foundational_db.prod to role foundational_role;
grant create tag on schema foundational_db.prod to role foundational_role;
grant create masking policy on schema foundational_db.prod to role foundational_role;
grant apply masking policy on account to role foundational_role;
grant apply tag on account to role foundational_role;
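If you'd like a quick sanity check before moving on, you can optionally inspect the new role from the same worksheet. This is just a verification sketch using standard Snowflake commands; the exact output depends on your account.

-- Optional: confirm the role exists and review its privileges
show roles like 'foundational%';
show grants to role foundational_role;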
Meanwhile, the average finance team member is more accustomed to consuming dashboards and building spreadsheets. However, growing demands for data have led the finance team to own and manage more data so it can respond rapidly to changing requirements. The team has therefore upskilled its members and brought on an analytics engineer to use Snowflake and dbt Cloud, creating data pipelines that build off of the foundational project for daily reporting use-cases.
Now create the finance team workspace. The code below will:
- Create a finance_role role for creating and managing resources in the finance_db database using the finance_wh warehouse.

create database if not exists finance_db;
create schema if not exists finance_db.prod;
create or replace warehouse finance_wh with warehouse_size = xsmall;
create role if not exists finance_role;
grant usage on warehouse finance_wh to role finance_role;
grant usage on database finance_db to role finance_role;
grant usage on schema finance_db.prod to role finance_role;
grant select on all tables in schema finance_db.prod to role finance_role;
grant create schema on database finance_db to role finance_role;
grant create table on schema finance_db.prod to role finance_role;
grant create view on schema finance_db.prod to role finance_role;
grant usage on database foundational_db to role finance_role;
grant usage on schema foundational_db.prod to role finance_role;
To get this all working correctly, make sure to assign the relevant roles to your own Snowflake database user. Ensure that you replace <your-snowflake-username> in the script below with your Snowflake username. If you've forgotten your username, you can find it via the icon in the bottom left -> My Profile.
use role accountadmin;
grant role foundational_role to user pc_dbt_user;
grant role foundational_role to user <your-snowflake-username>;
grant role foundational_pii_reader_role to user <your-snowflake-username>;
grant role finance_role to user <your-snowflake-username>;
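As a quick optional check (not part of the original setup script), you can confirm that the roles landed on your user; as before, substitute your actual username for the placeholder.

-- Optional: list the roles granted to your user
show grants to user <your-snowflake-username>;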
With these basic setup steps within Snowflake, you have begun laying the first layer of an interoperable and secure data mesh. In the next step, you will add dbt Cloud to create the tables in the data products.
Here is where you are in the journey towards a data product:
Now you will create two dbt Cloud Projects: one for the central data team, and one for the finance team, as depicted in the diagram below.
You will notice that you need to input the Snowflake credentials and resource information created in the previous step. dbt Cloud uses Snowflake role and warehouse resources in order to build database tables and views. The platform is powerful enough for the central data team and also accessible enough for newcomers on the finance team to use, all the while allowing collaboration between these two teams.
Now you will create the foundational project in dbt Cloud, which is to be exclusively developed by the central data team. It is sometimes referred to as the Upstream Project when other dbt projects build upon it. We can use the project that was already created for us by Partner Connect:
- Rename the project to Foundational Project, and click "Save" in the bottom right. When prompted with "Your change will impact all users", click "Continue".
- Create a deployment environment named Production, with the environment type set to Production.
- For the environment's connection settings, use the role foundational_role, the database foundational_db, and the warehouse foundational_wh.
- Set the deployment credentials schema to prod. Normally, we'd use a service account here to write to the production space.
For further details about this step, you may refer to the dbt documentation on creating a new project in dbt Cloud.
Meanwhile, the finance team will build on these foundations, and add more specific transformations or business logic as required for their purposes. Follow the same steps as above, but fill in the finance team Snowflake information:
- Create a new project named Finance Project and click Continue.
- For the connection, enter the same Snowflake account information used by the Foundational Project (Account Settings > Projects > Foundational Project > Snowflake), with the role finance_role, the database finance_db, and the warehouse finance_wh.
- For your development credentials, use the Username and password auth method with your Snowflake username and password.
- When prompted for a repository, set up a dbt Cloud managed repository named finance_repo.
- Create a deployment environment named Production, with the environment type set to Production (this should already be set appropriately) and the dbt version set to Keep on latest version.
- For the deployment credentials, use the Username and password auth method and set the schema to prod.
When setting up dbt Cloud for production, there are four recommended security options to configure. They are out of scope for this particular Quickstart Guide, but worth mentioning at this point.
dbt Cloud adds a layer of addressability and discoverability to the data platform. In the next step, you will build your first data product.
Here is where you are in the journey towards a data product:
Now it's time for you to add dbt code in the foundational project using the dbt Cloud IDE. Using the sample TPCH dataset provided by Snowflake, the dbt code will create a fct_orders
table representing all of the orders within our organization. The code below has three layers of transformations: raw data sources, staging models, and core business logic. And by using dbt, you automatically have end-to-end data lineage.
Here are the steps:
- Delete the models/example folder.
- Open the dbt_project.yml file and rename the project (line 5) from my_new_project to foundational_project.
- In the dbt_project.yml file, remove lines 39-42 (the my_new_project model reference).
- Create two new directories, staging and marts. If you've done this right, these directories will be inside the models directory.

Now that you've set up the foundational project, let's start building the data assets. Set up the staging layer as follows:
Create a new file called models/staging/sources.yml with the following content:
.sources:
- name: tpch
description: "TPCH data source from Snowflake Sample Data"
database: snowflake_sample_data
schema: tpch_sf1
tables:
- name: orders
description: "One record per order"
- name: customer
description: "One record per customer"
- name: lineitem
description: "One record per line item within a single order (1 -> n)"
Create a models/staging/stg_customers.sql file to select from the customer table in the tpch source:

with source as (
select * from {{ source('tpch', 'customer') }}
),
cleanup as (
select
c_custkey as customer_key,
c_name as name,
c_address as address,
c_nationkey as nation_key,
c_phone as phone_number,
c_acctbal as account_balance,
c_mktsegment as market_segment,
c_comment as comment
from source
)
select * from cleanup
Create a models/staging/stg_orders.sql file to select from the orders table in the tpch source:

with source as (
select * from {{ source('tpch', 'orders') }}
),
renamed as (
select
o_orderkey as order_key,
o_custkey as customer_key,
o_orderstatus as status_code,
o_totalprice as total_price,
o_orderdate as order_date,
o_clerk as clerk_name,
o_orderpriority as priority_code,
o_shippriority as ship_priority,
o_comment as comment
from source
)
select * from renamed
Create a models/staging/stg_line_items.sql file to select from the lineitem table in the tpch source:

with source as (
select * from {{ source('tpch', 'lineitem') }}
),
renamed as (
select
l_orderkey as order_key,
l_partkey as part_key,
l_suppkey as supplier_key,
l_linenumber as line_number,
l_quantity as quantity,
l_extendedprice as gross_item_sales_amount,
l_discount as discount_percentage,
l_tax as tax_rate,
l_returnflag as return_flag,
l_linestatus as status_code,
l_shipdate as ship_date,
l_commitdate as commit_date,
l_receiptdate as receipt_date,
l_shipinstruct as ship_instructions,
l_shipmode as ship_mode,
l_comment as comment,
-- extended_price is actually the line item total,
-- so we back out the extended price per item
(gross_item_sales_amount/nullif(quantity, 0))::decimal(16,4) as base_price,
(base_price * (1 - discount_percentage))::decimal(16,4) as discounted_price,
(gross_item_sales_amount * (1 - discount_percentage))::decimal(16,4) as discounted_item_sales_amount,
-- We model discounts as negative amounts
(-1 * gross_item_sales_amount * discount_percentage)::decimal(16,4) as item_discount_amount,
((gross_item_sales_amount + item_discount_amount) * tax_rate)::decimal(16,4) as item_tax_amount,
(
gross_item_sales_amount +
item_discount_amount +
item_tax_amount
)::decimal(16,4) as net_item_sales_amount
from source
)
select * from renamed
Now set up the marts layer by creating a file models/marts/fct_orders.sql
to build a fact table with order details:
{{
config(
materialized='table',
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
line_items as (
select * from {{ ref('stg_line_items') }}
),
customers as (
select * from {{ ref('stg_customers') }}
),
order_item_summary as (
select
order_key,
sum(gross_item_sales_amount) as gross_item_sales_amount,
sum(item_discount_amount) as item_discount_amount,
sum(item_tax_amount) as item_tax_amount,
sum(net_item_sales_amount) as net_item_sales_amount
from line_items
group by 1
),
final as (
select
orders.order_key,
orders.order_date,
orders.customer_key,
orders.status_code,
orders.priority_code,
orders.ship_priority,
orders.clerk_name,
customers.name,
customers.market_segment,
order_item_summary.gross_item_sales_amount,
order_item_summary.item_discount_amount,
order_item_summary.item_tax_amount,
order_item_summary.net_item_sales_amount
from orders
inner join order_item_summary
on orders.order_key = order_item_summary.order_key
inner join customers
on orders.customer_key = customers.customer_key
)
select *
from final
order by order_date
Navigate to the Command bar at the very bottom of the IDE and execute dbt build
. This will both validate the work you've done thus far and build out the requisite models into your sandbox within Snowflake.
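If you'd like to see the output, you can optionally query the freshly built table from a Snowflake worksheet. Note that during development dbt builds into your sandbox schema rather than prod, so the database and schema below (foundational_db and dbt_your_name) are placeholders; adjust them to match wherever your development environment builds.

-- Replace the database/schema with your own development sandbox location
select order_key, order_date, market_segment, net_item_sales_amount
from foundational_db.dbt_your_name.fct_orders
order by order_date
limit 10;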
At this point, you may also see what you have built by looking at the Lineage tab in the dbt Cloud IDE, and as shown in the graphic below.
To finish this step, go to the Version control section in the IDE and commit and sync your changes.
You now have built the first useful tables in the foundational project, which any user or application can access using SQL. In the next step, you will secure the PII access to privileged users.
Here is where you are in the journey towards a data product:
Congratulations! You built a useful dataset, but it's not yet secure. Organizational data privacy policies require Personally Identifiable Information (PII) to be restricted to only a few select people within the organization. Naturally, this restriction is best enforced within Snowflake itself: since various applications within the organization access the data, restricting it at the source helps protect your customers' privacy.
Use Snowsight to execute the following commands to create object tags and a data masking policy.
use role foundational_role;
use database foundational_db;
-- Create a tag for PII data
create tag if not exists foundational_db.prod.pii_data;
-- Create a masking policy for string data
create or replace masking policy foundational_db.prod.pii_mask_string as (val string) returns string ->
case
when is_role_in_session('foundational_pii_reader_role') then val
else '****'
end;
-- Assign the masking policy to the tag
alter tag foundational_db.prod.pii_data set masking policy foundational_db.prod.pii_mask_string;
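Optionally, you can confirm that the tag and masking policy now exist. These are standard Snowflake metadata commands, included here only as a quick check.

-- List the governance objects created above
show tags in schema foundational_db.prod;
show masking policies in schema foundational_db.prod;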
Now it's time to jump into dbt Cloud to use a model post-hook to apply the masking policy to the name
column of the fct_orders
dbt model immediately after the table is built.
Open up the fct_orders.sql
file and modify the config block at the top to include the post_hook
argument:
{{
config(
materialized='table',
post_hook="alter table {{ this }} modify column name set tag foundational_db.prod.pii_data = 'name'"
)
}}
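After the model is next built with this post-hook (for example, on the next dbt build), you can optionally check in Snowflake that the tag landed on the column. The table path below assumes the production build in foundational_db.prod; adjust it if you're checking a development schema instead.

-- Show any tags attached to the name column of fct_orders
select *
from table(foundational_db.information_schema.tag_references(
  'foundational_db.prod.fct_orders.name', 'COLUMN'));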
You have now secured the PII produced in this data product so it is restricted to privileged users, governed by native Snowflake features. In the next step, you will add model contracts and enforce grants with dbt.
Here is where you are in the journey towards a data product:
Now that you have set up data masking, it's time to set up proper grants on the foundational project's fct_orders table, add a model contract, and make it accessible to other dbt projects.
By using these configurations within your project, you'll be, in effect, creating a stable set of APIs that your downstream projects are able to reliably consume.
To begin, create the file models/marts/core.yml and add the code below:
models:
- name: fct_orders
description: "This model contains order information from the transactional systems of the company."
# Model access setting
access: public
config:
# Model contract setting
contract:
enforced: true
# Grant setting
grants:
select: ['foundational_pii_reader_role', 'finance_role']
columns:
- name: order_key
data_type: int
description: "The unique identifier for each order"
tests:
- unique
- not_null
- name: order_date
data_type: date
description: "The date and time when the order was placed"
- name: customer_key
data_type: int
description: "The identifier for the customer who placed the order"
tests:
- not_null
- name: status_code
data_type: varchar
description: "A code representing the current status of the order"
- name: priority_code
data_type: varchar
description: "A code indicating the priority level of the order"
- name: ship_priority
data_type: int
description: "The priority level for shipping the order"
- name: clerk_name
data_type: varchar
description: "The name of the clerk responsible for processing the order"
- name: name
data_type: varchar
description: "The name associated with the order, such as the product name or order name"
- name: market_segment
data_type: varchar
description: "The segment of the market to which the order is targeted or classified"
- name: gross_item_sales_amount
data_type: numeric(38, 3)
description: "The total sales amount for items in the order before any discounts or taxes"
- name: item_discount_amount
data_type: numeric(38, 3)
description: "The total discount amount applied to items in the order"
- name: item_tax_amount
data_type: numeric(38, 3)
description: "The total tax amount applied to items in the order"
- name: net_item_sales_amount
data_type: numeric(38, 3)
description: "The total sales amount for items in the order after discounts and taxes have been applied"
This YAML file does the following:
- Sets the fct_orders model access level to public, which means other projects in the dbt Cloud account are now able to reference it.
- Enforces a model contract, so the model must be built with the declared column names and data types.
- Grants select on the built table to the foundational_pii_reader_role and finance_role Snowflake roles.
- Documents each column and adds unique and not_null tests to the key columns.

To finish this step, go to the Version control section in the IDE and commit and sync your changes.
What happens when you need to change the fct_orders model once downstream users and applications rely on it and expect its structure to remain the same due to the model contract? That is where dbt model versions come in: they allow multiple versions of a model to co-exist for a limited period of time while downstream queries are migrated.

Adding the persist_docs configuration will commit the dbt documentation directly to Snowflake as comments, for use by other applications.

In this step you added a straightforward YAML file, and now fct_orders
is a tested data model with documentation, will have the proper grants applied when built, and is publicly available for other dbt Cloud projects to use. In the next step, you will build the foundational project in production.
Here is where you are in the journey towards a data product:
Before a downstream team can leverage assets from this foundational project, you will need to run a deployment job successfully.
To run your first deployment, you will need to create a new dbt Cloud job and execute it.
In this step, you ran a dbt Cloud job that deployed a production version of the data to Snowflake and registered the project metadata in dbt Cloud's metadata storage. This was necessary for the next step, which is to reference the Foundational data from within the Finance dbt Cloud project.
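Once the job has run successfully, you can optionally confirm in Snowflake that the production table exists and that the grants declared in core.yml were applied. This is just a verification sketch; exact output will vary by account.

-- Confirm the production table and its grants after the deployment job
show tables like 'fct_orders' in schema foundational_db.prod;
show grants on table foundational_db.prod.fct_orders;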
Here is where you are in the journey towards a data product:
In this penultimate step, you will now set up dbt Cloud to build the finance team's data product by building off the foundational data product, the result of which is an architecture as shown in the diagram below.
- Delete the models/example folder.
- Open the dbt_project.yml file and rename the project (line 5) from my_new_project to finance_project.
- In the dbt_project.yml file, remove lines 39-42 (the my_new_project model reference).
- Create a new file called dependencies.yml, add the upstream platform project, and click Save:

projects:
  - name: foundational_project

Now reference the public fct_orders model created in the separate project by your platform team. In your models directory, create a file models/agg_segment_revenue.sql:
select
market_segment,
sum(gross_item_sales_amount) as total_revenue
from {{ ref('foundational_project', 'fct_orders') }}
group by 1
Notice the cross-project ref, which takes two arguments: 1) the name of the project, as defined within that upstream project and declared in dependencies.yml, and 2) the name of a public model in that project.
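To make the mechanics concrete, here is roughly what dbt compiles that ref into, assuming the foundational project's production environment builds fct_orders into foundational_db.prod (the actual compiled SQL may differ slightly):

-- Approximate compiled form of the cross-project ref
select
    market_segment,
    sum(gross_item_sales_amount) as total_revenue
from foundational_db.prod.fct_orders
group by 1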
In effect, the finance team's model builds directly on the foundational project's public fct_orders table.

To finish this step, go to the Version control section in the IDE and commit and sync your changes.
Now run a production dbt Cloud Job of the finance project, as you did with the foundational project.
With a few lines of code, the finance team is now building directly off of the central data team's work. The two teams stay synchronized, which ensures data consistency, yet each remains autonomous to create the data products that power its own applications. The finance team can also feel confident that the data they are addressing will be reliable, thanks to the model contract in place on the fct_orders model.
Here is where you are in the journey towards a data product:
You have now completed building projects for this Quickstart Guide. Now it is time to try out dbt Explorer to discover and drill into the dbt assets available within your dbt Cloud account.
- In dbt Cloud, open Explore, search for fct_orders, and scroll through the page, noting Lineage, Build Status, and Description.
- Click on the order_key column to see the Column Level Lineage.

With dbt Explorer, data products built with dbt are automatically published and searchable, allowing for easy discovery and re-usability. Furthermore, you can use dbt Explorer to drill into data product details including documentation, build status, and lineage at the column-level, model-level, or project-level.
You have now completed your data product journey!
During this Quickstart Guide, you used native features of Snowflake and dbt Cloud that can be combined to create data products. By using dbt Cloud, you gained additional governance, automation, and interoperability mechanisms beyond what Snowflake provides alone. Overall, these two self-service platforms combined increase the likelihood of success of a data mesh project.
If you are interested in this solution for your organization, contact dbt Labs to get started!
To learn more about data mesh, you may refer to dbt Labs' Guide to Data Mesh e-book.