In this hands-on lab, you will build a weather prediction data application entirely within Snowflake. You'll leverage Weather & Environment data from Snowflake Marketplace, train a simple ML model using sklearn within Snowflake Notebooks, and create an interactive Streamlit application, all without leaving Snowflake.
You will process NOAA weather data with SQL and Snowpark for Python, and develop a simple ML model to predict temperatures. You'll also create a Python User Defined Function (UDF) in Snowflake, then visualize historical weather data and predictions using Streamlit in Snowflake.
To complete this lab, you'll need a Snowflake account. A free Snowflake trial account will work just fine. To open one:
Snowflake Marketplace provides visibility into a wide variety of datasets from third-party data providers, broadening access to the data points used to transform business processes.
Snowflake Marketplace also removes the need to integrate and model data by providing secure access to datasets fully maintained by the data provider.
Because of this, we don't need to copy any data into our Snowflake account or write any ingestion logic. Instead, we can directly access the weather data shared by a trusted provider in Snowflake Marketplace. Let's begin.
This is a live dataset! No need to write ingestion logic to bring the data into your account. The data is maintained and kept fresh by the provider.
Let's prepare our Snowflake environment so that we can begin exploring the weather data. We'll explore the data in a Snowflake Notebook.
Snowflake Notebooks offer an interactive, cell-based programming environment for Python and SQL. With a Snowflake Notebook, you can perform exploratory data analysis, experiment with feature engineering for machine learning, and perform other data science tasks within Snowflake.
Let's begin.
USE ROLE accountadmin;
USE WAREHOUSE compute_wh;
CREATE OR REPLACE DATABASE weather_lab;
CREATE OR REPLACE SCHEMA weather_schema;
USE DATABASE weather_lab;
CREATE OR REPLACE STAGE weather_schema.udf_stage;
The last line of SQL creates a stage that will be used when we deploy our user-defined function. We'll go into more detail on that in a later step.
Next, create a Snowflake Notebook in the weather_lab database and the weather_schema schema. The notebook contains all of the code necessary to complete the rest of the lab. In this step, we'll explore the weather data using SQL and Snowpark for Python.
Recall that the Weather & Environment dataset serves as a central source of global weather, energy, and environmental metrics. A single, unified schema joins together data across numerous sources that track global environmental factors.
It covers global weather observations, energy, and environmental metrics, and much more.
We'll use this data to build an application that lets us predict weather for the 90210 zip code (Beverly Hills) in California. We'll specifically explore and use the WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_METRICS_TIMESERIES and WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_STATION_INDEX datasets.
Let's begin.
USE ROLE accountadmin;
USE WAREHOUSE compute_wh;
USE DATABASE weather_lab;
USE SCHEMA weather_schema;
-- Explore what weather variables are available
SELECT DISTINCT variable_name
FROM WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_METRICS_TIMESERIES
LIMIT 20;
-- Average temperature by state, in Celsius
SELECT
idx.state_name,
AVG(ts.value) as avg_temperature
FROM WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_METRICS_TIMESERIES ts
JOIN WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_STATION_INDEX idx
ON ts.noaa_weather_station_id = idx.noaa_weather_station_id
WHERE ts.variable_name = 'Average Temperature'
AND ts.date >= '2020-01-01'
AND idx.country_geo_id = 'country/USA'
GROUP BY idx.state_name
ORDER BY avg_temperature DESC;
-- Sample temperature data for zip codes for weather stations in California
SELECT
ts.date,
ts.value as temperature,
idx.noaa_weather_station_name,
idx.zip_name,
idx.state_name
FROM WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_METRICS_TIMESERIES ts
JOIN WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_STATION_INDEX idx
ON ts.noaa_weather_station_id = idx.noaa_weather_station_id
WHERE ts.variable_name = 'Average Temperature'
AND idx.state_name = 'California'
AND ts.date >= '2023-01-01'
AND idx.zip_name IS NOT NULL
ORDER BY ts.date DESC
LIMIT 20;
Next, let's run the same kind of query with Snowpark for Python. We use session.table() to create a Snowpark DataFrame that we can easily manipulate. Don't worry about the other package imports at the top; we'll use them later on in the notebook.
from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType, FloatType, StringType
from snowflake.snowpark.functions import avg, sum, col, udf, call_udf, call_builtin, year, month, dayofyear
import pandas as pd
from datetime import date
import numpy as np
# scikit-learn for training ML models, used later in notebook
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
session = Session.builder.getOrCreate()
# Snowpark DataFrame equivalent to previous SQL query
ts = session.table("WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_METRICS_TIMESERIES")
idx = session.table("WEATHER__ENVIRONMENT.CYBERSYN.NOAA_WEATHER_STATION_INDEX")
sample_weather_df = (
ts.join(idx, ts.col("NOAA_WEATHER_STATION_ID") == idx.col("NOAA_WEATHER_STATION_ID"))
.filter(ts.col('VARIABLE_NAME') == 'Average Temperature')
.filter(idx.col('STATE_NAME') == 'California')
.filter(ts.col('DATE') >= '2023-01-01')
.filter(idx.col('ZIP_NAME').isNotNull())
.select(
ts.col('DATE'),
ts.col('VALUE').alias('TEMPERATURE'),
idx.col('NOAA_WEATHER_STATION_NAME'),
idx.col('ZIP_NAME'),
idx.col('STATE_NAME')
)
.orderBy(ts.col('DATE').desc())
.limit(100)
)
sample_weather_df.show(20)
Great job! In just a few simple queries, in both SQL and Python, you were able to explore the dataset and get a sense of the type of data it contains.
In the next steps, we'll dive deeper into the dataset, and use several more columns in the data to get ready to train our ML model.
Before we can train our ML model, we need to prepare the training data. In this step, we'll transform the raw weather measurements into a clean, structured dataset suitable for machine learning.
We'll take the daily temperature readings from thousands of weather stations and consolidate them into consistent monthly averages by location. This process is also known as feature engineering.
Run the cell that begins with the comment # Create Snowpark DataFrame for temperature data. It could take around 30 seconds to execute. Once done, you should see output similar to the following:

Here's what the code does: it joins the weather timeseries with the station index and consolidates the daily temperature readings into monthly averages by location.
All of this is setting up the foundation that will feed into the linear regression models we'll train in the next step.
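For reference, a minimal sketch of this kind of monthly aggregation in Snowpark is shown below. It reuses the ts and idx DataFrames from the exploration step; the DataFrame name monthly_temps_df and the exact set of columns are our own simplification, not necessarily the notebook's exact code.

# Hedged sketch: consolidate daily California readings into monthly averages per zip code.
# Assumes ts, idx, and the imports from the earlier cells; monthly_temps_df is our own name.
monthly_temps_df = (
    ts.join(idx, ts.col("NOAA_WEATHER_STATION_ID") == idx.col("NOAA_WEATHER_STATION_ID"))
    .filter(col("VARIABLE_NAME") == "Average Temperature")
    .filter(col("STATE_NAME") == "California")
    .filter(col("ZIP_NAME").is_not_null())
    .with_column("YEAR", year(col("DATE")))
    .with_column("MONTH", month(col("DATE")))
    .group_by("ZIP_NAME", "YEAR", "MONTH")
    .agg(avg(col("VALUE")).alias("AVG_MONTHLY_TEMPERATURE"))
)
monthly_temps_df.show(10)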
In this step, we'll train our ML model. We're specifically going to train a linear regression model to predict weather for the 90210 (Beverly Hills) zip code.
Run the cell that begins with the comment # Train model for 90210. It could take around 30 seconds to execute. Once done, you should see output similar to the following:

Here's what the code does: it uses scikit-learn's LinearRegression to train a temperature prediction model on the prepared monthly features for the 90210 zip code.
The result is a weather prediction model for the Beverly Hills area in California. Rather than using a generic, California-wide model, the features we use in training help the model account for the area's geographic characteristics.
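To make the idea concrete, here is a hedged sketch of what that training can look like: it pulls the monthly features for 90210 into pandas and fits a scikit-learn LinearRegression. The monthly_temps_df DataFrame and the month/year feature choice are assumptions carried over from the earlier sketch, not necessarily the notebook's exact features.

# Hedged sketch: train a simple linear regression for the 90210 area.
# Assumes monthly_temps_df from the previous sketch; the notebook's actual cell
# may use additional geographic features.
train_pdf = monthly_temps_df.filter(col("ZIP_NAME").like("%90210%")).to_pandas()

X = train_pdf[["MONTH", "YEAR"]]  # simple seasonal + trend features (illustrative)
y = train_pdf["AVG_MONTHLY_TEMPERATURE"]

model = LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
print("R^2:", r2_score(y, predictions))
print("RMSE:", mean_squared_error(y, predictions) ** 0.5)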
In the next step, we'll deploy the model as a User Defined Function (UDF) for making temperature predictions in our application.
Let's now deploy this model as a user-defined function (UDF) in Snowflake. By deploying it as a UDF, we're able to quickly call it throughout our Snowflake environment, whether in a SQL worksheet, or in an application. We're going to call this UDF from our application's front-end, so that we can make weather predictions for specific months.
Run the cell that begins with def predict_temperature. It could take around 15 seconds to execute. Once done, you should see output similar to the following:

Here's what the code does: it creates a UDF named predict_temperature that accepts a zip code and month to make a prediction, and then tests the UDF against a set of sample inputs (test_cases).

With our model now deployed as a UDF, we can easily use it within our Streamlit in Snowflake application.
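For reference, registering a model as a permanent Python UDF from Snowpark generally looks something like the sketch below. The UDF name and the udf_stage stage match what we created earlier; how the features are assembled inside the function is an assumption, so treat this as an outline rather than the notebook's exact code.

# Hedged sketch: deploy the trained model as a permanent Python UDF.
# Assumes `model` is the fitted LinearRegression and that weather_schema.udf_stage exists.
@udf(
    name="predict_temperature",
    is_permanent=True,
    stage_location="@weather_schema.udf_stage",
    replace=True,
    packages=["scikit-learn", "numpy"],
    return_type=FloatType(),
    input_types=[StringType(), IntegerType()],
    session=session,
)
def predict_temperature(zip_code: str, month: int) -> float:
    # Illustrative only: the real cell may use the zip code to pick location-specific
    # features; here we simply reuse the captured model with month and a fixed year.
    import numpy as np
    return float(model.predict(np.array([[month, 2024]]))[0])

# Try the UDF from SQL, the same way the app will call it (values are illustrative).
session.sql("SELECT predict_temperature('90210', 7) AS july_prediction").show()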
Now that we've trained our ML model and created a UDF that allows us to easily use it, we'll build an interactive Streamlit application directly in Snowflake to make predictions about the weather in zip code 90210.
Select weather_lab as the database and weather_schema as the schema. Leave everything else as-is and create the app.

Great job! You successfully created the application.
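If you'd like a feel for what the app's code can look like, here is a minimal hedged sketch of a Streamlit in Snowflake app that calls the predict_temperature UDF. The widget labels and layout are our own choices; the app you build in the lab may differ.

# Minimal sketch of a Streamlit in Snowflake app that calls the UDF.
# Assumes the predict_temperature UDF from the previous step exists in weather_lab.weather_schema.
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()

st.title("90210 Weather Predictor")
month = st.slider("Month to predict", min_value=1, max_value=12, value=7)

if st.button("Predict temperature"):
    row = session.sql(
        f"SELECT predict_temperature('90210', {int(month)}) AS predicted_temp"
    ).collect()[0]
    st.metric("Predicted average temperature", f"{row['PREDICTED_TEMP']:.1f}")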
Congratulations! You've built a data application using data from Snowflake Marketplace. You used the data to train a linear regression model to predict weather in a zip code, and you created a Streamlit in Snowflake app that uses the model to make predictions for different months. Let's recap what you did.
You built a data app that predicts weather for the 90210 zip code in California. It uses a trained linear regression model to make the predictions, and the model was trained using 20+ years of weather data for that area. You also performed feature engineering to account for other geographic variables in the area.
You used the free Weather & Environment data from Snowflake Marketplace, provided by Snowflake.
You explored the data within a Snowflake Notebook, using SQL and Snowpark for Python. You also trained the model within the notebook.
You deployed the trained model as a UDF, so that it can be used throughout the Snowflake environment, like in our app.
You built a Streamlit in Snowflake application that uses the model to make predictions about the weather for different months for the 90210 zip code.
Congratulations!
For more resources, check out the following: