In this guide, we'll walk you through how to build a time series forecasting model using Zepl's data science notebook, with data loaded from Snowflake's Data Marketplace! We will be forecasting stock market close values. Please note: this walkthrough is not to be used for investing purposes. It is intended as a guide that demonstrates an example of how to use Zepl and Snowflake.

Prerequisites

What You'll Learn

What You'll Need

What You'll Build

Create a Snowflake Account:

Sign up for free

Clone data from Snowflake's Data Marketplace

  1. Log in to Snowflake > Open the Data Marketplace
  2. Search for the 'US Stock Market Data for Data Science' listing
  3. Select 'Get Data'. The user performing this step must have ACCOUNTADMIN privileges
  4. Assign a database name and the roles that should have access to this database
  5. Select Create Database > View Database

Troubleshooting

Check if the database was created properly

SELECT * FROM "ZEPL_US_STOCKS_DAILY"."PUBLIC"."STOCK_HISTORY" LIMIT 10

Check if privileges are set correctly: This database needs to have SELECT privileges for a Role that your user has access to. Setting the SELECT privilege for the PUBLIC role will allow all users to read data from this database.
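
For example, assuming the database was named ZEPL_US_STOCKS_DAILY in step 4, a statement along these lines, run by a user with the ACCOUNTADMIN role, grants read access to all users (for a Data Marketplace share, read access is granted via IMPORTED PRIVILEGES):

GRANT IMPORTED PRIVILEGES ON DATABASE ZEPL_US_STOCKS_DAILY TO ROLE PUBLIC;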

Create a Zepl Account:

Sign up for free using Partner Connect

Connect to the US Stock Market Database

Zepl Documentation

  1. Login to Zepl
  2. Select Resources > Data Sources > Snowflake
  3. Enter the required information below > Select Add:

Troubleshooting

Use the Test Connection button to validate the user credentials and ROLE. Below is a table of example outputs:

Error Message: Failed: Invalid credentials provided for the datasource
Action: Your credentials were entered incorrectly

Error Message: Failed: failed to run the test query: 390189 (08004): Role 'PRIVATE' specified in the connect string does not exist or not authorized. Contact your local system administrator, or attempt to login with another role, e.g. PUBLIC.
Action: Enter a new Snowflake ROLE

Creating a blank notebook

Zepl Documentation

  1. From any screen, Select Add New (upper left)
  2. Select + New Notebook
  3. Enter Notebook Details:

Understanding Zepl's Notebook Options

Resource Type: Doc

Zepl Containers are isolated environments that are used to execute code. Paired with an Image, these Containers provide a private runtime for a notebook when it spins up. Each Container has a single CPU core and varies in the amount of memory allocated to the notebook session. Containers can be quickly scaled up or down in the notebook settings for larger or smaller workloads. Each Container has a default timeout period of 30 minutes.

Image: Doc

Zepl Images can help you and your team save significant time by creating reproducible environments for all notebooks to access. A Zepl Image builds all of the libraries required to run a notebook, so your users don't have to wait for libraries to install at the beginning of every notebook session or hope that each notebook environment is configured the same way.

Spaces: Doc

A Zepl Space is a collection of notebooks that can be shared with built-in access controls. The "My Notebooks" space is a private space dedicated to your user only. Typically, new spaces are created for a specific project or working group; it's a place for a group of people who are working together on a set of data science problems.

Importing an existing notebook (optional)

Zepl Documentation

  1. From any screen, Select Add New (upper left)
  2. Select + Import Notebook
  3. Type: Fetch From URL
  4. Paste either link in the text field labeled Link to your notebook:
https://app.zepl.com/api/v1/viewer/fetch?url=bm90ZTovL3pzaGFpbnNreUB6ZXBsLmNvbS8xZjNiMjQwZTFjZjQ0MDk4YTU0Njc1MDM3MTMwNThlZC9ub3RlLmpzb24
https://raw.githubusercontent.com/zshainsky/Getting_Started_on_Zepl/master/Snowflake%20Guide%20Stock%20Time%20Series%20Forecasting.zpln
  5. Apply

Attach Snowflake Data Source

Zepl Documentation

  1. In your notebook, select the Data Source icon on the right hand side of the screen.
  2. Select the plus icon next to the data source labeled 'US_STOCKS_DAILY' which you created in step 3.

In the Zepl Notebook:

First, add this code to the first paragraph

%python
import pandas as pd

# Create a Snowflake Cursor Object
cur = z.getDatasource("US_STOCKS_DAILY")

# Un-comment if warehouse is not specified in the Data Source
# cur.execute("USE WAREHOUSE COMPUTE_WH")

# execute query
cur.execute("""SELECT * FROM ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY LIMIT 100""")

# convert datasource into pandas dataframe
df_100 = cur.fetch_pandas_all()

# print dataframe as table
z.show(df_100)

Second, select "Run Paragraph"

Code Explained

import pandas as pd
Zepl's General Purpose Image comes with the pandas library pre-installed, so all you have to do is import it.

cur = z.getDatasource("US_STOCKS_DAILY")
This function is used to access any data source that is attached to the notebook. It returns a Cursor object, which is assigned to the cur variable. The specific data source is selected by passing the name of your data source as the string argument. Note that the data source must be attached to the notebook before starting the container.

cur.execute("""SELECT * FROM ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY LIMIT 100""")
This will execute a specified query in Snowflake.

df_100 = cur.fetch_pandas_all()
Returns the query results as a pandas DataFrame and assigns it to the df_100 variable. See documentation for more details here

Troubleshooting:

Overview

Zepl provides several options for loading libraries. The two most used are Custom Images and installing libraries at notebook run time. For this guide we will use the Python package manager pip to install all of our required libraries. Learn more about Custom Images and how to activate your account to access this feature.

In the Zepl Notebook:

Add this code to the first paragraph and select "Run Paragraph":

%python
# Install fbprophet Dependencies
!pip install "pystan~=2.14" "Cython>=0.22" "cmdstanpy==0.9.68" "numpy>=1.15.4" "pandas>=1.0.4" "matplotlib>=2.0.0" "LunarCalendar>=0.0.9" "convertdate>=2.1.2" "holidays>=0.10.2" "setuptools-git>=1.2" "python-dateutil>=2.8.0" "tqdm>=4.36.1"

# Install fbprophet Library
!pip install fbprophet==0.7.1

Code Explained

!pip install "pystan~=2.14" "Cython>=0.22" "cmdstanpy==0.9.68" "numpy>=1.15.4" "pandas>=1.0.4" "matplotlib>=2.0.0" "LunarCalendar>=0.0.9" "convertdate>=2.1.2" "holidays>=0.10.2" "setuptools-git>=1.2" "python-dateutil>=2.8.0" "tqdm>=4.36.1"

This statement uses the `!` to access the container CLI and calls the `pip install` Linux command to install the libraries required by the Facebook Prophet library. Each version is pinned to ensure the notebook runs consistently, and each requirement is quoted so the shell does not interpret the >= version specifiers as output redirection.

!pip install fbprophet==0.7.1

This statement installs the fbprophet library with a specific version.

Add new paragraph

Add this code to next open paragraph in your notebook and select run paragraph:

%python
# Import Libraries
from fbprophet import Prophet

Code Explained

We must also import the libraries that were just installed in the previous paragraph.

Full output of code:

Troubleshooting

TL;DR: Predicting the future based on past events requires the past to have patterns that may repeat in the future. Future stock market prices are influenced by many external sources, so relying solely on past events to pick stocks is not the best strategy. In this walkthrough, we are going to ignore that guidance and do it anyway. This guide is not intended to influence your personal trading decisions.

Overview

In this section we will explore the data for a specific stock ticker, build a time series forecast, and visualize the results. All of the code can be referenced here: Notebook Code. We will be using the Facebook Prophet library with Python for our analysis. Zepl also supports R so feel free to use R as well.

We will be looking at stock prices for Johnson & Johnson (JNJ). Using the FBProphet model, we will look at all of the past Adjusted Close values for JNJ and determine if we can make stock trend predictions with a remotely accurate result. The Prophet library uses an additive model composed of three major components: trend, seasonality, and holidays (Ref). Prophet is robust to missing data and shifts in the trend, and typically handles outliers well (Ref).
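
In the notation of the Prophet paper, this is y(t) = g(t) + s(t) + h(t) + ε(t), where g(t) is the trend, s(t) captures seasonality, h(t) captures holiday effects, and ε(t) is the error term.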

Retrieve Data for Specific Stock Ticker

First let's retrieve data from Snowflake with all of the open, high, low, and close values for one specific stock. Add this code to a new paragraph in your notebook:

%python
# Set ticker value
stock_ticker = 'JNJ'

# Get a new DataFrame for a Specific Stock ticker
cur = z.getDatasource("US_STOCKS_DAILY")

# Retrieve data only for the specified stock ticker and order results by date. Distinct will eliminate any duplicate values.
cur = cur.execute("""
            SELECT DISTINCT * FROM 
            ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY 
            WHERE symbol='{}' 
            ORDER BY date
            """.format(stock_ticker))

# Create Pandas DataFrame from query results            
df_stock = cur.fetch_pandas_all()

Code Explained

This is the same code we used above with a different SQL statement. The SQL statement is going to return all rows and columns from the STOCK_HISTORY table WHERE the SYMBOL column is equal to the value of our stock_ticker Python variable. In this case, that is 'JNJ' (Johnson & Johnson).

Create and Train a Prophet Model

Add this code to the next paragraph to build a predictive model using the Prophet library:

%python
# Initialize Prophet object
m = Prophet()

# Drop unused columns
ph_df = df_stock.drop(['OPEN', 'HIGH', 'LOW','VOLUME', 'SYMBOL','CLOSE'], axis=1)

# Rename values so Prophet knows what to predict (y) and when that event happened (ds)
ph_df.rename(columns={'ADJCLOSE': 'y', 'DATE': 'ds'}, inplace=True)

# Fit Model
m.fit(ph_df)

Code Explained

m = Prophet()
First we must create a python object from the Prophet library. This will be used to access all functions for forecasting.

ph_df = df_stock.drop(['OPEN', 'HIGH', 'LOW','VOLUME', 'SYMBOL','CLOSE'], axis=1)
ph_df.rename(columns={'ADJCLOSE': 'y', 'DATE': 'ds'}, inplace=True)
The Prophet library only takes two columns as inputs before training: a date column and a value to be predicted. In our case, the date column in our dataset is labeled 'DATE' and we would like to predict the adjusted close value of the stock, which is labeled 'ADJCLOSE'.

m.fit(ph_df)
Lastly, we train (fit) the model on the two input columns prepared above, ds (DATE) and y (ADJCLOSE).

Make Future Predictions

%python
# Num days to predict
period = 30

# Create a DataFrame that extends the historical dates with rows for future predictions
future_prices = m.make_future_dataframe(periods=period)

# Predict Prices
forecast = m.predict(future_prices)

# Inspect the forecast prices over the next 30 days. The yhat value is the prediction value.
z.show(forecast.tail(period))

Code Explained

future_prices = m.make_future_dataframe(periods=period)
We must create a new DataFrame that can store our future prediction values. future_prices contains all of the historical dates plus 30 additional rows for the 30 days of future predictions. The period value can be changed to generate more or fewer predictions.

forecast = m.predict(future_prices)
Create predictions and display the results. This function will generate a number of new columns. The important values are ds (the date), yhat (the predicted value), and yhat_lower / yhat_upper (the lower and upper bounds of the uncertainty interval).

Visualize Predictions

The code below will generate a chart to display our trend in relation to actual stock close values. The black dots represent the close values from our original dataset (df_stock['ADJCLOSE']) and the blue line represents the trend determined by our model predictions (forecast['yhat']). The light blue shading represents the upper and lower bounds of the trend (forecast['yhat_lower'] and forecast['yhat_upper']).

%python
m.plot(forecast);

Visualize Seasonality

The code below will generate a chart to display the impact of seasonality on the trend. This can be used to identify whether certain times of the day, month, or year affect the final trend.

If we look at the yearly trend (final chart), it appears that stock prices tend to drop in March and rise towards the end of the year. These factors may lead us to re-fit the model based on specific holiday trends, which can be done easily using Prophet. Feel free to play around with adding holiday seasonality as well (see the sketch after the chart code below): Examples

%python
m.plot_components(forecast);
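
As a starting point for that experiment, here is a minimal sketch of adding Prophet's built-in country holidays and re-fitting. The variable names m_holidays and forecast_holidays, and the choice of US holidays, are illustrative assumptions and not part of the original notebook:

%python
# Hypothetical variant: re-fit the model with built-in US holiday effects included
m_holidays = Prophet()
m_holidays.add_country_holidays(country_name='US')  # uses the holidays package installed earlier
m_holidays.fit(ph_df)

# Holidays now show up as an additional component in the components plot
forecast_holidays = m_holidays.predict(m_holidays.make_future_dataframe(periods=period))
m_holidays.plot_components(forecast_holidays);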

Target Stock Price (30 day prediction)

The code below will display the change in stock price between our forecast value and our last known stock value. At the time of creating this guide, the last known stock price was $161.40 on March 26th, 2021. Our predicted value for April 25th, 2021 is $154.88. The difference is a drop in price of $6.52. While that might be interesting, we still need to evaluate the known error in this model and see if tuning our inputs may lead to a more accurate prediction.

%python
# Get the last known adjusted close value and date from our stock dataset
num_rows_stock = df_stock.shape[0]
last_close_date = df_stock.at[num_rows_stock-1, 'DATE']
last_close_value = df_stock.at[num_rows_stock-1, 'ADJCLOSE']

# Get the prediction value and date from our forecasted trend
num_rows_forecast = forecast.shape[0]
last_forecast_date = forecast.at[num_rows_forecast-1, 'ds'].strftime("%Y-%m-%d")
last_forecast_value = forecast.at[num_rows_forecast-1, 'yhat']

# Display those values using HTML and simple inline styling
print('''%html <h4> Last Known Close Price    ({date}): <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(date=last_close_date, value=last_close_value))
print('''%html <h4> Forecasted Target Price   ({date}): <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(date=last_forecast_date, value=last_forecast_value))
print('''%html <h4> Projected Change:                   <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(value=last_forecast_value - last_close_value))

Finally, let's zoom in and visualize the forecasted values 30 days into the future. The price that our model predicted is displayed in blue and the final 180 days of adjusted close values are shown in black.

%python
# Set graph date range to 180 days of history + future prediction values
graph_dates = 180 + period

# Total number of days in forecast DataFrame
len_forecast = len(forecast['ds'])

# Set total number of stock values to display
last_x_days = len_forecast - graph_dates

# Display Graph and set the x limit to show last 180 days + 30 forecast period
fig = m.plot(forecast);
ax = fig.gca()
ax.set_xlim([forecast.tail(graph_dates)['ds'][last_x_days], forecast.tail(graph_dates)['ds'][len_forecast-1]]) 

Code Explained

graph_dates = 180 + period
len_forecast = len(forecast['ds'])
last_x_days = len_forecast - graph_dates
We want to display the last 180 days plus our future 30 day prediction period. These variables are used to determine the number of days to display.

fig = m.plot(forecast);
ax = fig.gca()
Plot the forecast as we have done above and retrieve the matplotlib Axes object (ref). The semicolon at the end of the first line ensures only one graph is displayed.

ax.set_xlim([forecast.tail(graph_dates)['ds'][last_x_days], forecast.tail(graph_dates)['ds'][len_forecast-1]])
Limit the x-axis range of the graph to the last 180 days of history plus the 30-day forecast period.

Overview

Now that we have created a prediction, we should validate the confidence level in our results. We will use cross-validation to make a series of predictions for a given time window (horizon) and compare the predicted value to the actual value. In our case, we will compare predicted adjusted close values to actual adjusted close values.

Cross-Validation

%python
from fbprophet.diagnostics import cross_validation

# Use built in cross_validation function with parallelization across multiple CPU processes
df_cv = cross_validation(m, initial='6570 days', period='90 days', horizon = '30 days', parallel="processes")

# Display last 5 rows of results
z.show(df_cv.tail())

Code Explained

df_cv = cross_validation(m, initial='6570 days', period='90 days', horizon = '30 days', parallel="processes")

The fbprophet library has an easy-to-use cross_validation() function to automate the entire process. This may take several minutes to complete.

Here is what the input parameters mean: initial is the size of the initial training period, period is the spacing between cutoff dates, horizon is the forecast window evaluated after each cutoff, and parallel selects how the cross-validation runs are parallelized.

Cross-Validation Results

%python
from fbprophet.diagnostics import performance_metrics

# Use built in performance_metrics() function to retrieve performance metrics 
df_p = performance_metrics(df_cv)

z.show(df_p.tail())

Code Explained

performance_metrics(df_cv)

We will focus on the Mean Absolute Percentage Error (MAPE). This measurement explains the size of the error in our predictions as a percentage (Ref). Breaking down our result, the lowest MAPE value is around 0.11, which we can translate to say that our predictions differ from the actual value by 11% on average. Depending on your risk level, perhaps that is enough for you. But, we still have a few tools to try to reduce our error.
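
For intuition, MAPE can also be computed by hand from the cross-validation output; here is a minimal sketch using the y (actual) and yhat (predicted) columns that cross_validation() returns in df_cv:

%python
import numpy as np

# MAPE: the mean of |actual - predicted| / |actual| across all cross-validation predictions
manual_mape = np.mean(np.abs((df_cv['y'] - df_cv['yhat']) / df_cv['y']))
print(manual_mape)

Note that this aggregates across the entire horizon, whereas performance_metrics() reports a rolling value for each day of the horizon, so the numbers will not match exactly.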

If you want to retrieve the exact minimum value, use this code below:

%python
import numpy as np

# Get the row with the lowest MAPE value
mape_perf = df_p.iloc[np.argmin(df_p["mape"].values)]
print('''%html <h4> Display Lowest MAPE value: <b><span style="color:red;">{value:.2f}</span></b></h4>'''.format(value=mape_perf['mape']))

Visualize Error Results

We can quickly visualize the MAPE value across the 30 day prediction window (horizon) used during cross-validation. The graph below shows how MAPE oscillates over the 30 day prediction. *If anyone knows why that oscillation occurs feel free to message me for a chat, I'd love to know :) zack@zepl.com

%python
from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')

Troubleshooting:

Overview

In order to improve the accuracy of our forecasted predictions, we will test many different values for how sensitive the model is to changepoints and seasonality. Hyperparameter tuning is a fancy name for a very simple concept: to tune our model, we test lots of different parameters and compare the accuracy of the results. The parameters that produce the most accurate predictions (by MAPE or another error metric) are the winner. We will perform the same cross-validation as in the previous two steps, only this time we will use 4 different values for changepoint_prior_scale and 4 different values for seasonality_prior_scale.

Implement Hyperparameter Tuning and Display Results

%python
# https://facebook.github.io/prophet/docs/diagnostics.html#hyperparameter-tuning

import itertools
import numpy as np
import pandas as pd

# Track original parameter values for comparison later:
original_changepoint_prior_scale = m.changepoint_prior_scale
original_seasonality_prior_scale = m.seasonality_prior_scale

param_grid = {  
    'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
    'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
}

# Use cutoff values from initial cross-validation set
cutoffs = pd.to_datetime(df_cv['cutoff'].unique())

# Generate all combinations of parameters
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
rmses = []  # Store the RMSE for each parameter combination here
mae = []  # Store the MAE for each parameter combination here
mape = []  # Store the MAPE for each parameter combination here

# Use cross validation to evaluate all parameters
for params in all_params:
    m = Prophet(**params).fit(ph_df)  # Fit model with given params
    df_cv = cross_validation(m, cutoffs=cutoffs, horizon='30 days', parallel="processes")
    df_p = performance_metrics(df_cv, rolling_window=1)
    rmses.append(df_p['rmse'].values[0])
    mae.append(df_p['mae'].values[0])
    mape.append(df_p['mape'].values[0])

# Find the best parameters
tuning_results = pd.DataFrame(all_params)
tuning_results['rmse'] = rmses
tuning_results['mae'] = mae
tuning_results['mape'] = mape

z.show(tuning_results)

Lastly, display the parameters that produced the best results and compare them with the original ones:

%python
best_params = all_params[np.argmin(mape)]
print(original_changepoint_prior_scale, original_seasonality_prior_scale, best_params)
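
A natural follow-on, not shown in the original notebook, is to re-fit the model with the winning parameters and regenerate the forecast; here is a minimal sketch (m_best and forecast_best are illustrative names):

%python
# Re-fit using the best parameter combination found by the grid search
m_best = Prophet(**best_params).fit(ph_df)

# Regenerate and plot the 30 day forecast with the tuned model
forecast_best = m_best.predict(m_best.make_future_dataframe(periods=period))
m_best.plot(forecast_best);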

References: