In this Quickstart, we're going to focus on data engineering with Snowflake. Specifically, we'll build an end-to-end data pipeline, applying the Ingestion-Transformation-Delivery framework, also known as I-T-D.
Ingestion
We'll load data from:
- Snowflake Marketplace (live weather data, accessed via secure data sharing)
- An Amazon S3 bucket (Tasty Bytes sales data in CSV files, loaded with COPY INTO)
Transformation
We'll transform our data using:
- SQL views
- User-defined functions (UDFs)
Delivery
We'll deliver a final data product using:
- Streamlit in Snowflake
To complete this lab, you'll need a Snowflake account. A free Snowflake trial account will work just fine. To open one, sign up for a free trial on the Snowflake website.
Tasty Bytes is a food truck company that operates globally in many countries. You're a data engineer on the Tasty Bytes team, and you've recently learned from data analysts on the team that sales in Hamburg, Germany unexpectedly dropped to 0.00 on several days in February 2022.
As a data engineer, your goal is to figure out why this happened, and to build an end-to-end data pipeline that can keep analysts up to date on the weather in Hamburg.
Here's how we'll do this:
Ingestion: access live weather data from Snowflake Marketplace and load Tasty Bytes sales data from an S3 bucket.
Transformation: use SQL views and UDFs to surface weather and sales insights for Hamburg, Germany.
Delivery: build a Streamlit in Snowflake app that visualizes those insights for the analysts.
Let's get started!
Let's start by "loading" the raw weather data into Snowflake. It turns out that "loading" is really the wrong word here.
We're using Snowflake's unique data sharing capability in Snowflake Marketplace. Because of this, we don't need to copy any data into our Snowflake account or write any ingestion logic. Instead, we can directly access the weather data shared by a trusted provider in Snowflake Marketplace. Let's go ahead and do this.
This is a live dataset! No need to write ingestion logic to bring the data into your account. The data is maintained and kept fresh by the provider.
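Once the listing is available in your account, you can confirm access with a quick query. This is a minimal check, assuming the shared database uses the name referenced later in this lab (Weather_Source_LLC_frostbyte):

```sql
-- Peek at the live weather data shared through Snowflake Marketplace
SELECT *
FROM Weather_Source_LLC_frostbyte.onpoint_id.history_day
LIMIT 10;
```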
Let's now load the Tasty Bytes sales data. This data is currently sitting across many CSV files in an AWS S3 bucket. Let's use Snowflake's COPY INTO command to load the data into your Snowflake account.
CREATE OR REPLACE DATABASE tasty_bytes;
CREATE OR REPLACE SCHEMA tasty_bytes.raw_pos;
CREATE OR REPLACE FILE FORMAT tasty_bytes.public.csv_ff
TYPE = 'csv';
Run this command. Next, we'll create a stage that points to the S3 bucket containing the CSV files. In the url parameter, type the bucket's URL. For the file format parameter, type file_format = tasty_bytes.public.csv_ff;. Run this entire command.
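For reference, the finished stage definition should look roughly like this. The bucket URL below is a placeholder; substitute the URL provided in the lab:

```sql
-- Stage pointing at the S3 bucket that holds the raw CSV files
-- (the url value is a placeholder; use the bucket URL from the lab)
CREATE OR REPLACE STAGE tasty_bytes.public.s3load
url = 's3://<bucket-url-from-the-lab>'
file_format = tasty_bytes.public.csv_ff;
```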
COPY INTO tasty_bytes.raw_pos.country
FROM @tasty_bytes.public.s3load/raw_pos/country/;
Run this command. You should see a success message in the console. Great job!
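If you'd like to double-check the load, a quick count against the new table should return roughly 30 rows:

```sql
-- Confirm that COPY INTO populated the COUNTRY table
SELECT COUNT(*) AS row_count
FROM tasty_bytes.raw_pos.country;
```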
We're not quite done. We loaded about 30 rows of data into the COUNTRY table, but we actually need to load close to 1 GB of sales data. We won't do this one COPY INTO command at a time. Instead, you'll use some SQL that we've written for you in advance.
Observations:
After running the file, you should have all of the data loaded into your account! Confirm the creation of all the tables (and their data) using the object picker to the left.
This completes the Ingestion aspect of our pipeline for this lab.
We now have the necessary data in our Snowflake account. To get the insights we need – weather-related data for the city of Hamburg, Germany – we'll use SQL to apply transformations to the data. Let's begin.
In this worksheet, we use SQL to explore the suspicion that there are weather-related anomalies affecting sales in Hamburg, Germany. Let's start exploring.
In the placeholders below, enter 'GERMANY' as the country and 'HAMBURG' as the city. Re-run the block of SQL.
-- Query to explore sales in the city of Hamburg, Germany
WITH _feb_date_dim AS
(
SELECT DATEADD(DAY, SEQ4(), '2022-02-01') AS date FROM TABLE(GENERATOR(ROWCOUNT => 28))
)
SELECT
fdd.date,
ZEROIFNULL(SUM(o.price)) AS daily_sales
FROM _feb_date_dim fdd
LEFT JOIN analytics.orders_v o
ON fdd.date = DATE(o.order_ts)
AND o.country = -- Add country
AND o.primary_city = -- Add city
WHERE fdd.date BETWEEN '2022-02-01' AND '2022-02-28'
GROUP BY fdd.date
ORDER BY fdd.date ASC;
It looks like the analysts were correct – there are many dates in February where sales totaled 0.00. (The _feb_date_dim CTE uses Snowflake's GENERATOR function to produce one row per day of February 2022, and the LEFT JOIN preserves days with no matching orders, so zero-sales days appear as 0.00 instead of disappearing from the results.) So we're on the right path.
Next, run this block of SQL to create the tasty_bytes.harmonized.daily_weather_v view. We'll use this view later on in our pipeline.
-- Create view that adds weather data for cities where Tasty Bytes operates
CREATE OR REPLACE VIEW tasty_bytes.harmonized.daily_weather_v
COMMENT = 'Weather Source Daily History filtered to Tasty Bytes supported Cities'
AS
SELECT
hd.*,
TO_VARCHAR(hd.date_valid_std, 'YYYY-MM') AS yyyy_mm,
pc.city_name AS city,
c.country AS country_desc
FROM Weather_Source_LLC_frostbyte.onpoint_id.history_day hd
JOIN Weather_Source_LLC_frostbyte.onpoint_id.postal_codes pc
ON pc.postal_code = hd.postal_code
AND pc.country = hd.country
JOIN tasty_bytes.raw_pos.country c
ON c.iso_country = hd.country
AND c.city = hd.city_name;
-- Query the view to explore daily temperatures in Hamburg, Germany for anomalies
SELECT
dw.country_desc,
dw.city_name,
dw.date_valid_std,
AVG(dw.avg_temperature_air_2m_f) AS avg_temperature_air_2m_f
FROM harmonized.daily_weather_v dw
WHERE 1=1
AND dw.country_desc = 'Germany'
AND dw.city_name = 'Hamburg'
AND YEAR(date_valid_std) = 2022
AND MONTH(date_valid_std) = 2 -- February
GROUP BY dw.country_desc, dw.city_name, dw.date_valid_std
ORDER BY dw.date_valid_std DESC;
-- Query the view to explore wind speeds in Hamburg, Germany for anomalies
SELECT
dw.country_desc,
dw.city_name,
dw.date_valid_std,
MAX(dw.max_wind_speed_100m_mph) AS max_wind_speed_100m_mph
FROM tasty_bytes.harmonized.daily_weather_v dw
WHERE 1=1
AND dw.country_desc IN ('Germany')
AND dw.city_name = 'Hamburg'
AND YEAR(date_valid_std) = 2022
AND MONTH(date_valid_std) = 2 -- February
GROUP BY dw.country_desc, dw.city_name, dw.date_valid_std
ORDER BY dw.date_valid_std DESC;
-- Create a view that tracks wind speed for Hamburg, Germany
CREATE OR REPLACE VIEW tasty_bytes.harmonized.--add name of view
AS
SELECT
dw.country_desc,
dw.city_name,
dw.date_valid_std,
MAX(dw.max_wind_speed_100m_mph) AS max_wind_speed_100m_mph
FROM harmonized.daily_weather_v dw
WHERE 1=1
AND dw.country_desc IN ('Germany')
AND dw.city_name = 'Hamburg'
GROUP BY dw.country_desc, dw.city_name, dw.date_valid_std
ORDER BY dw.date_valid_std DESC;
A view in Snowflake allows you to store the result of a query and access it as if it were a table, rather than repeatedly querying multiple tables of raw data to extract insights. The queries that views store can be arbitrarily simple or complex, so you can pull exactly the data you need with much simpler SQL.
Views also help organize exactly which aspects of the data are valuable, and they aid in secure data access control. We can use views to power our data pipeline without sacrificing performance.
We'll use these views in our pipeline later on.
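For example, with daily_weather_v in place, a single straightforward query replaces the three-table join the view encapsulates:

```sql
-- One simple query instead of joining weather history, postal codes, and countries
SELECT
    dw.date_valid_std,
    dw.city_name,
    dw.country_desc,
    dw.avg_temperature_air_2m_f
FROM tasty_bytes.harmonized.daily_weather_v dw
WHERE dw.country_desc = 'Germany'
LIMIT 10;
```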
We're missing some critical data for our pipeline. Our analysts have requested that we track certain weather measurements using the metric system. We are tracking a country in Europe after all.
They've specifically requested that we include temperature measurements in Celsius, and precipitation in millimeters.
To do this, we'll create two user-defined functions, or UDFs, that are able to take in existing data from our tables and perform the necessary conversions. We will then invoke them later on to derive these new values.
Let's get started.
In the SQL below, type FUNCTION after each of the CREATE OR REPLACE statements. Be sure to remove the /* */ placeholders.
CREATE OR REPLACE /* */ tasty_bytes.analytics.fahrenheit_to_celsius(temp_f NUMBER(35,4))
RETURNS NUMBER(35,4)
AS
$$
(temp_f - 32) * (5/9)
$$
;
CREATE OR REPLACE /* */ tasty_bytes.analytics.inch_to_millimeter(inch NUMBER(35,4))
RETURNS NUMBER(35,4)
AS
$$
inch * 25.4
$$
;
Great! We'll use these functions to expand the views we're planning on using in our pipeline.
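Before moving on, you can sanity-check the conversions by calling the UDFs directly with known values:

```sql
-- 32°F is 0°C, and 1 inch is 25.4 millimeters
SELECT
    tasty_bytes.analytics.fahrenheit_to_celsius(32) AS celsius,  -- expect 0.0000
    tasty_bytes.analytics.inch_to_millimeter(1) AS millimeters;  -- expect 25.4000
```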
Let's now use the UDFs to add new columns to our views. These new columns will contain the converted values for temperature and precipitation.
-- Apply UDFs and confirm successful execution
...
-- Code to focus on
ROUND(AVG(analytics.fahrenheit_to_celsius(fd.avg_temperature_air_2m_f)),2) AS avg_temperature_celsius,
ROUND(AVG(fd.tot_precipitation_in),2) AS avg_precipitation_inches,
ROUND(AVG(analytics.inch_to_millimeter(fd.tot_precipitation_in)),2) AS avg_precipitation_millimeters,
...
Name the view weather_hamburg. The entire line should read CREATE OR REPLACE VIEW harmonized.weather_hamburg.
Great job! We've transformed our data using SQL to create views, and we've used UDFs to expand our views (and create new views!). We'll use these transformations later on in our pipeline.
This completes the Transformation aspect of our pipeline for this lab.
We now have the insights we need, and we can deliver them to our data analysts. We have views that track the weather and sales in Hamburg, Germany. So how exactly will we make these insights easily accessible for our data analysts?
We're going to create a Streamlit in Snowflake app that will visualize this data for them.
For some context, Streamlit is a popular open-source Python library for creating data apps – no need for HTML, CSS, or any other front-end frameworks. It's also natively available within Snowflake, which means creating data apps that use the data in our Snowflake environment is both powerful and easy to do.
Let's go ahead and create the app for our analysts.
This code creates an app that visualizes the data within the views we created earlier.
"INSERT NAME OF VIEW HERE"
with tasty_bytes.harmonized.weather_hamburg
and click "Run once more.In just 54 lines of code, we created a Streamlit in Snowflake application with Python that delivers the insights we extracted from our data.
At the top, you could imagine clicking on "Share" and sharing the app with relevant analyst roles in your Snowflake account.
With this application, we've now completed our end-to-end data pipeline. This completes the Delivery aspect of our pipeline for this lab.
Congratulations! You've built an end-to-end data pipeline in Snowflake using the Ingestion-Transformation-Delivery framework, also known as I-T-D. Let's recap what you did.
You built a data pipeline that tracks weather and sales data for Tasty Bytes food trucks in the city of Hamburg, Germany. As part of the I-T-D framework, you:
Ingestion
Loaded data from:
- Snowflake Marketplace (live weather data via secure data sharing)
- An Amazon S3 bucket (Tasty Bytes sales data, loaded with COPY INTO)
Transformation
Transformed data using:
- SQL views
- User-defined functions (UDFs)
Delivery
Delivered a final data product using:
- Streamlit in Snowflake
Congratulations!
For more resources, check out the following: