In this guide, we'll walk you through how to build a time series forecasting model using Zepl's data science notebook with data loaded from Snowflake's Data Marketplace! We will be forecasting stock market close values. Please note: this walkthrough is not to be used for investing purposes. It is intended as a guide demonstrating an example of how to use Zepl and Snowflake.
This step requires ACCOUNTADMIN privileges. Check if the database was created properly:
SELECT * FROM "ZEPL_US_STOCKS_DAILY"."PUBLIC"."STOCK_HISTORY" LIMIT 10
Check if privileges are set correctly: this database needs SELECT privileges granted to a role that your user has access to. Granting the SELECT privilege to the PUBLIC role will allow all users to read data from this database.
Sign up for free using Partner Connect
Use the Test Connection button to validate the user credentials and ROLE. Below is a table of example outputs:
| Error Message | Action |
| --- | --- |
|  | Your credentials were entered incorrectly |
|  | Enter a new Snowflake ROLE |
Containers: Doc
Zepl Containers are isolated environments that are used to execute code. Paired with an Image, these Containers provide a private runtime for a notebook when it spins up. Each of these Containers has a single CPU core and varies in the memory allocated to the notebook session. These Containers can be quickly scaled up or down in the notebook settings for larger or smaller workloads. Each Container has a default timeout period of 30 minutes.
Image: Doc
Zepl Images can help you and your team save significant time by creating reproducible environments for all notebooks to access. A Zepl Image builds all of the libraries required to run a notebook, so your users don't have to wait for libraries to install at the beginning of every notebook or hope that each notebook environment is configured the same way.
Spaces: Doc
A Zepl Space is a collection of notebooks that can be shared with built-in access controls. The "My Notebooks" space is a private space dedicated to your user only. Typically, new spaces are created for a specific project or working group; it's a place for a group of people who are working together on a set of data science problems.
https://raw.githubusercontent.com/zshainsky/Getting_Started_on_Zepl/master/Snowflake%20Guide%20Stock%20Time%20Series%20Forecasting.zpln
First, add this code to the first paragraph
%python
import pandas as pd
# Create a Snowflake Cursor Object
cur = z.getDatasource("US_STOCKS_DAILY")
# Un-comment if warehouse is not specified in the Data Source
# cur.execute("USE WAREHOUSE COMPUTE_WH")
# execute query
cur.execute("""SELECT * FROM ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY LIMIT 100""")
# convert datasource into pandas dataframe
df_100 = cur.fetch_pandas_all()
# print dataframe as table
z.show(df_100)
Second, select "Run Paragraph"
import pandas as pd
Zepl's General Purpose Image comes with the pandas library pre-installed, so all you have to do is import it.
cur = z.getDatasource("US_STOCKS_DAILY")
This function is used to access any data source that is attached to the notebook. It returns a Cursor object, which we assign to the cur variable. The specific data source is selected by passing its name as the string argument. Note that the data source must be attached to the notebook before starting the container.
cur.execute("""SELECT * FROM ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY LIMIT 100""")
This will execute a specified query in Snowflake.
df_100 = cur.fetch_pandas_all()
Returns the query results as a pandas DataFrame, assigned to the df_100 variable. See the documentation for more details here.
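fetch_pandas_all() is a convenience method on Snowflake's cursor. If you ever need the same result from a generic DB-API cursor, you can build the DataFrame manually from the cursor's description and fetchall() — a sketch using sqlite3 as a stand-in (the table and values here are made up):

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for the Snowflake connection here; fetch_pandas_all() is
# Snowflake-specific, but any DB-API cursor exposes description + fetchall()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_history (symbol TEXT, adjclose REAL)")
conn.execute("INSERT INTO stock_history VALUES ('JNJ', 161.40)")

cur = conn.execute("SELECT * FROM stock_history")
columns = [col[0] for col in cur.description]  # column names from the cursor
df = pd.DataFrame(cur.fetchall(), columns=columns)
print(df.shape)  # (1, 2)
```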
ProgrammingError: 000606 (57P03): No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.
Execute a USE WAREHOUSE command (see the commented line in the code above) or specify a warehouse in the Data Source settings.
ProgrammingError: 251005: User is empty.
Verify that a username is set in the Data Source configuration.
Zepl provides several options for loading libraries. The two most used are Custom Images and installing during notebook run time. For this guide, we will use the Python package manager pip to install all of our required libraries. Learn more about Custom Images and how to activate this feature for your account.
Add this code to the first paragraph and select "Run Paragraph":
%python
# Install fbprophet Dependencies
!pip install "pystan~=2.14" "Cython>=0.22" "cmdstanpy==0.9.68" "numpy>=1.15.4" "pandas>=1.0.4" "matplotlib>=2.0.0" "LunarCalendar>=0.0.9" "convertdate>=2.1.2" "holidays>=0.10.2" "setuptools-git>=1.2" "python-dateutil>=2.8.0" "tqdm>=4.36.1"
# Install fbprophet Library
!pip install fbprophet==0.7.1
!pip install "pystan~=2.14" "Cython>=0.22" "cmdstanpy==0.9.68" "numpy>=1.15.4" "pandas>=1.0.4" "matplotlib>=2.0.0" "LunarCalendar>=0.0.9" "convertdate>=2.1.2" "holidays>=0.10.2" "setuptools-git>=1.2" "python-dateutil>=2.8.0" "tqdm>=4.36.1"
This statement uses the `!` to access the container CLI and calls the `pip install` Linux command to install the required libraries that support the Facebook Prophet library. Each version is specified to ensure the notebook runs consistently, and each specifier is quoted so the shell does not interpret `>=` as output redirection.
!pip install fbprophet==0.7.1
This statement installs the fbprophet library with a specific version.
Add this code to the next open paragraph in your notebook and select "Run Paragraph":
%python
# Import Libraries
from fbprophet import Prophet
We must also import the libraries that were just installed in the previous paragraph.
TL;DR: Predicting the future based on past events requires the past to have patterns that may repeat in the future. Future stock market prices are influenced by many external sources, so relying solely on past events to pick stocks is not the best strategy. In this walkthrough, we are going to ignore that guidance and do it anyway. This guide is not intended to influence your personal trading decisions.
In this section we will explore the data for a specific stock ticker, build a time series forecast, and visualize the results. All of the code can be referenced here: Notebook Code. We will be using the Facebook Prophet library with Python for our analysis. Zepl also supports R so feel free to use R as well.
We will be looking at stock prices for Johnson & Johnson (JNJ). Using the FBProphet model, we will look at all of the past Adjusted Close values for JNJ and determine whether we can make stock trend predictions with remotely accurate results. The Prophet library uses an additive model composed of three major components: trend, seasonality, and holidays (Ref). Prophet is robust to missing data and shifts in the trend, and typically handles outliers well (Ref).
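Concretely, Prophet's additive decomposition (as described in the Prophet documentation) can be written as:

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

where g(t) is the non-periodic trend, s(t) the periodic seasonality (e.g., weekly and yearly), h(t) the effects of holidays, and \varepsilon_t the error term.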
First, let's retrieve data from Snowflake with all of the open, high, low, and close values for one specific stock. Add this code to a new paragraph in your notebook:
%python
# Set ticker value
stock_ticker = 'JNJ'
# Get a new DataFrame for a Specific Stock ticker
cur = z.getDatasource("US_STOCKS_DAILY")
# Retrieve data only for the specified stock ticker and order results by date. Distinct will eliminate any duplicate values.
cur = cur.execute("""
SELECT DISTINCT * FROM
ZEPL_US_STOCKS_DAILY.PUBLIC.STOCK_HISTORY
WHERE symbol='{}'
ORDER BY date
""".format(stock_ticker))
# Create Pandas DataFrame from query results
df_stock = cur.fetch_pandas_all()
This is the same code we used above with a different SQL statement. The SQL statement returns all rows and columns from the STOCK_HISTORY table WHERE the SYMBOL column is equal to the value of our stock_ticker Python variable. In this case, that is 'JNJ' (Johnson & Johnson).
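String .format() works here, but database cursors also support parameter binding, which avoids quoting problems if a ticker value ever contains unexpected characters. A sketch using sqlite3's `?` placeholders as a stand-in (the Snowflake Python connector uses a different placeholder style, and the table rows below are made up):

```python
import sqlite3

stock_ticker = "JNJ"

# In-memory stand-in for the STOCK_HISTORY table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_history (symbol TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO stock_history VALUES (?, ?)",
    [("JNJ", "2021-03-25"), ("PG", "2021-03-25"), ("JNJ", "2021-03-26")],
)

# The ticker is passed as a bound parameter instead of being pasted into the SQL
rows = conn.execute(
    "SELECT DISTINCT symbol, date FROM stock_history WHERE symbol=? ORDER BY date",
    (stock_ticker,),
).fetchall()
print(rows)  # [('JNJ', '2021-03-25'), ('JNJ', '2021-03-26')]
```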
Add this code to the next paragraph to build a predictive model using the Prophet library:
%python
# Initialize Prophet object
m = Prophet()
# Drop unused columns
ph_df = df_stock.drop(['OPEN', 'HIGH', 'LOW','VOLUME', 'SYMBOL','CLOSE'], axis=1)
# Rename values so Prophet knows what to predict (y) and when that event happened (ds)
ph_df.rename(columns={'ADJCLOSE': 'y', 'DATE': 'ds'}, inplace=True)
# Fit Model
m.fit(ph_df)
m = Prophet()
First we must create a python object from the Prophet library. This will be used to access all functions for forecasting.
ph_df = df_stock.drop(['OPEN', 'HIGH', 'LOW','VOLUME', 'SYMBOL','CLOSE'], axis=1)
ph_df.rename(columns={'ADJCLOSE': 'y', 'DATE': 'ds'}, inplace=True)
The Prophet library takes only two columns as inputs before training: a date column and a value to be predicted. In our case, the date column in our dataset is labeled 'DATE', and we would like to predict the adjusted close value of the stock, labeled 'ADJCLOSE'.
m.fit(ph_df)
Lastly, we train (fit) the model on the two input columns described above: DATE (ds) and ADJCLOSE (y).
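To make the reshaping concrete, here is the same drop/rename step on a tiny made-up DataFrame that mirrors the STOCK_HISTORY column names:

```python
import pandas as pd

# Toy stand-in for df_stock with the same column names as STOCK_HISTORY
df_stock = pd.DataFrame({
    "SYMBOL": ["JNJ", "JNJ", "JNJ"],
    "DATE": ["2021-03-24", "2021-03-25", "2021-03-26"],
    "OPEN": [162.0, 161.5, 160.9],
    "HIGH": [163.0, 162.2, 161.8],
    "LOW": [161.0, 160.8, 160.2],
    "CLOSE": [162.5, 161.1, 161.4],
    "ADJCLOSE": [162.5, 161.1, 161.4],
    "VOLUME": [5_000_000, 6_200_000, 5_800_000],
})

# Keep only the two columns Prophet needs and rename them to ds/y
ph_df = df_stock.drop(["OPEN", "HIGH", "LOW", "VOLUME", "SYMBOL", "CLOSE"], axis=1)
ph_df.rename(columns={"ADJCLOSE": "y", "DATE": "ds"}, inplace=True)

print(list(ph_df.columns))  # ['ds', 'y']
```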
%python
# Num days to predict
period = 30
# Create a DataFrame with future dates to store predictions
future_prices = m.make_future_dataframe(periods=period)
# Predict Prices
forecast = m.predict(future_prices)
# Inspect the forecast prices over the next 30 days. The yhat value is the prediction value.
z.show(forecast.tail(period))
future_prices = m.make_future_dataframe(periods=period)
We must create a new DataFrame that can store our future prediction values. By default, future_prices contains all historical dates plus 30 additional rows for storing 30 days of future predictions. The period value can be changed to generate more or fewer predictions.
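Conceptually, make_future_dataframe just appends `period` additional dates after the last training date (and by default keeps the historical dates too). A rough pandas-only equivalent, using made-up daily dates in place of ph_df["ds"]:

```python
import pandas as pd

period = 30  # days to predict

# Made-up training dates standing in for ph_df["ds"]
history_dates = pd.date_range("2021-01-01", "2021-03-26", freq="D")

# Append `period` future days after the last known date
future_dates = pd.date_range(
    history_dates.max() + pd.Timedelta(days=1), periods=period, freq="D"
)
future_prices = pd.DataFrame({"ds": history_dates.append(future_dates)})

print(len(future_prices))               # 85 historical days + 30 future days = 115
print(future_prices["ds"].max().date()) # 2021-04-25
```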
forecast = m.predict(future_prices)
Create predictions and display the results. This function will generate a number of new columns. The important values are defined below:
yhat: Predicted value
yhat_lower: Lower bound of the prediction's uncertainty interval (80% by default)
yhat_upper: Upper bound of the prediction's uncertainty interval (80% by default)
The code below will generate a chart to display our trend in relation to actual stock close values. The black dots represent the adjusted close values from our original dataset (df_stock['ADJCLOSE']) and the blue line represents the trend determined by our model predictions (forecast['yhat']). The light blue shading represents the lower and upper bounds of the trend (forecast['yhat_lower'] and forecast['yhat_upper']).
%python
m.plot(forecast);
The code below will generate a chart displaying the impact of seasonality on the trend. This can be used to identify whether certain times of the day, month, or year affect the final trend.
If we look at the yearly trend (final chart), it appears that stock prices tend to drop in March and rise towards the end of the year. These factors may lead us to re-fit the model based on specific holiday trends, which can be done easily using Prophet. Feel free to play around with adding holiday seasonality as well: Examples
%python
m.plot_components(forecast);
The code below will display the change in stock price between our forecast value and our last known stock value. At the time of creating this guide, the last known stock price was $161.40 on March 26th 2021. Our predicted value for April 25th 2021 is set at $154.88. The difference is a drop in price by $6.52. While that might be interesting, we still need to evaluate the known error in this model and see if tuning our inputs may lead to a more accurate prediction.
%python
# Get the last known adjusted close value and date from our stock dataset
num_rows_stock = df_stock.shape[0]
last_close_date = df_stock.at[num_rows_stock-1, 'DATE']
last_close_value = df_stock.at[num_rows_stock-1, 'ADJCLOSE']
# Get the prediction value and date from our forecasted trend
num_rows_forecast = forecast.shape[0]
last_forecast_date = forecast.at[num_rows_forecast-1, 'ds'].strftime("%Y-%m-%d")
last_forecast_value = forecast.at[num_rows_forecast-1, 'yhat']
# Display those values using HTML and simple inline styling
print('''%html <h4> Last Known Close Price ({date}): <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(date=last_close_date, value=last_close_value))
print('''%html <h4> Forecasted Target Price ({date}): <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(date=last_forecast_date, value=last_forecast_value))
print('''%html <h4> Projected Change: <b><span style="color:red;">${value:.2f}</span></b></h4>'''.format(value=last_forecast_value - last_close_value))
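The .at lookups above depend on the DataFrame having its default RangeIndex, where the last row's label is shape[0] - 1. A toy version with made-up values:

```python
import pandas as pd

# Made-up stand-in for df_stock
df_stock = pd.DataFrame({
    "DATE": ["2021-03-24", "2021-03-25", "2021-03-26"],
    "ADJCLOSE": [162.5, 161.1, 161.4],
})

# .at[row_label, column] fetches a single scalar; with a default RangeIndex
# the last row's label is shape[0] - 1
num_rows = df_stock.shape[0]
last_close_date = df_stock.at[num_rows - 1, "DATE"]
last_close_value = df_stock.at[num_rows - 1, "ADJCLOSE"]
print(last_close_date, last_close_value)  # 2021-03-26 161.4
```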
Finally, let's zoom in and visualize the forecasted values 30 days into the future. The price our model predicted is displayed in blue, and the final 180 days of adjusted close values are shown in black.
%python
import time
# Set graph date range to 180 days + future prediction values
graph_dates = 180 + period
# Total number of days in forecast DataFrame
len_forecast = len(forecast['ds'])
# Set total number of stock values to display
last_x_days = len_forecast - graph_dates
# Display Graph and set the x limit to show last 180 days + 30 forecast period
fig = m.plot(forecast);
ax = fig.gca()
ax.set_xlim([forecast.tail(graph_dates)['ds'][last_x_days], forecast.tail(graph_dates)['ds'][len_forecast-1]])
graph_dates = 180 + period
len_forecast = len(forecast['ds'])
last_x_days = len_forecast - graph_dates
We want to display the last 180 days plus our future 30 day prediction period. These variables are used to determine the number of days to display.
fig = m.plot(forecast);
ax = fig.gca()
Plot the forecast as we have done above and retrieve the matplotlib Axis value (ref). The semicolon at the end of the first line ensures only one graph is displayed.
ax.set_xlim([forecast.tail(graph_dates)['ds'][last_x_days], forecast.tail(graph_dates)['ds'][len_forecast-1]])
Limit the X-axis range of our graph to the last 180 days of history plus the 30-day forecast period.
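set_xlim works on any matplotlib Axes, not just Prophet's plots. A minimal stand-alone sketch (toy data; the Agg backend is used so no display is required):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(100), range(100))

# Show only the last 30 points by limiting the x-axis range
ax.set_xlim([70, 100])
print(ax.get_xlim())  # (70.0, 100.0)
```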
Now that we have created a prediction, we should validate the confidence level in our results. We will use cross-validation to make a series of predictions for a given time window (horizon) and compare the predicted value to the actual value. In our case, we will compare predicted adjusted close values to actual close values.
%python
from fbprophet.diagnostics import cross_validation
# Use built in cross_validation function with parallelization across multiple CPU processes
df_cv = cross_validation(m, initial='6570 days', period='90 days', horizon = '30 days', parallel="processes")
# Display last 5 rows of results
z.show(df_cv.tail())
df_cv = cross_validation(m, initial='6570 days', horizon = '30 days', period='90 days', parallel="processes")
The fbprophet library has an easy-to-use cross_validation() function to automate the entire process. This may take several minutes to complete.
Here is what the input parameters mean:
initial: The time window to use for training. We are training on 6,570 days' worth of data (18 years).
horizon: The time window to make predictions over. We will predict 30 days' worth of adjusted close values and compare our results to the actual adjusted close values.
period: The time window to skip between prediction windows (horizons). Setting this to 90 days allows us to make one set of predictions every quarter.
parallel: Cross-validation can be CPU intensive, so we use multiple CPU processes to reduce overall run time.
%python
from fbprophet.diagnostics import performance_metrics
# Use built in performance_metrics() function to retrieve performance metrics
df_p = performance_metrics(df_cv)
z.show(df_p.tail())
performance_metrics(df_cv)
We will focus on the Mean Absolute Percentage Error (MAPE). This measurement explains the size of the error in our predictions as a percentage (Ref). Breaking down our result, the lowest MAPE value is around 0.11, which we can translate to say that our predictions differ from the actual value by 11% on average. Depending on your risk level, perhaps that is enough for you. But, we still have a few tools to try to reduce our error.
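MAPE is simple enough to compute by hand, which helps build intuition for what a value like 0.11 means. A sketch on made-up actual/predicted values:

```python
import numpy as np

# Made-up actual vs. predicted adjusted close values
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 440.0])

# Mean Absolute Percentage Error: average of |actual - predicted| / |actual|
mape = np.mean(np.abs((actual - predicted) / actual))
print(round(float(mape), 3))  # 0.1 -> predictions are off by 10% on average
```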
If you want to retrieve the exact minimum value, use this code below:
%python
# Get the lowest MAPE value
min_mape = df_p["mape"].min()
print('''%html <h4> Display Lowest MAPE value: <b><span style="color:red;">{value:.2f}</span></b></h4>'''.format(value=min_mape))
We can quickly visualize the MAPE value across the 30 day prediction window (horizon) used during cross-validation. The graph below shows how MAPE oscillates over the 30 day prediction. *If anyone knows why that oscillation occurs feel free to message me for a chat, I'd love to know :) zack@zepl.com
%python
from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
May occur due to a lack of memory available in the running container. Stop the container and select a larger container size; 16GB or 32GB is typically sufficient. Doc
ImportError: FloatProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
TQDM is not supported in Zepl notebooks; you must use a parallel flag. This issue is tracked here and will be resolved in the next release of fbprophet (Ref).
In order to improve the accuracy of our forecasted predictions, we will test out many different values for how sensitive the model is to changepoints and seasonality. Hyperparameter tuning is a fancy name for a very simple concept: test out lots of different parameter values and compare the accuracy of the results. The parameter combination that produces the most accurate predictions (lowest MAPE or other error value) is the winner. We will perform the same cross-validation as in the previous two steps, only this time we will use 4 different values for changepoint_prior_scale and 4 different values for seasonality_prior_scale.
%python
# https://facebook.github.io/prophet/docs/diagnostics.html#hyperparameter-tuning
import itertools
import numpy as np
import pandas as pd
# Track original values for comparison later:
original_changepoint_prior_scale = m.changepoint_prior_scale
original_seasonality_prior_scale = m.seasonality_prior_scale
param_grid = {
'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
}
# Use cutoff values from initial cross-validation set
cutoffs = pd.to_datetime(df_cv['cutoff'].unique())
# Generate all combinations of parameters
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
rmses = [] # Store the RMSEs for each params here
mae = [] # Store the MAE for each params here
mape = [] # Store the MAPE for each params here
# Use cross validation to evaluate all parameters
for params in all_params:
m = Prophet(**params).fit(ph_df) # Fit model with given params
df_cv = cross_validation(m, cutoffs=cutoffs, horizon='30 days', parallel="processes")
df_p = performance_metrics(df_cv, rolling_window=1)
rmses.append(df_p['rmse'].values[0])
mae.append(df_p['mae'].values[0])
mape.append(df_p['mape'].values[0])
# Find the best parameters
tuning_results = pd.DataFrame(all_params)
tuning_results['rmse'] = rmses
tuning_results['mae'] = mae
tuning_results['mape'] = mape
z.show(tuning_results)
Lastly, display the parameters with the best results and compare the new parameters with the original ones:
%python
best_params = all_params[np.argmin(mape)]
print(original_changepoint_prior_scale, original_seasonality_prior_scale, best_params)
A note on parallel="processes" from the Prophet documentation: "It will achieve the highest performance when the parallel cross validation can be done on a single machine." Ref