Time series forecasting with TensorFlow
- Sumit Dey
- Apr 5, 2022
- 28 min read
Updated: Apr 6, 2022
Time series problems deal with data over time, such as the number of staff members in a company over 15 years, sales of computers for the past 5 years, or electricity usage for the past 50 years.
The timeline can be short (seconds/minutes) or long (years/decades), and the problems you might investigate using time series data can usually be broken down into two categories.
Problem Type | Examples | Output
Classification | Anomaly detection, time series identification (where did this time series come from?) | Discrete (a label)
Forecasting | Predicting stock market prices, forecasting future demand for a product, stocking inventory requirements | Continuous (a number)
In both cases above, a supervised learning approach is often used, meaning you'd have some example data and a label associated with that data.
For example, in forecasting the price of Bitcoin, your data could be the historical price of Bitcoin for the past month and the label could be today's price (the label can't be tomorrow's price because that's what we'd want to predict).
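To make this concrete, here's a minimal sketch of that data/label pairing (the prices below are hypothetical, purely for illustration):
# Hypothetical data/label pair for a forecasting problem (illustrative values only)
past_month_prices = [43200.0, 42800.0, 44100.0, 43950.0]  # historical prices = the data (truncated for brevity)
todays_price = 44500.0  # today's price = the label the model learns to predict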
Get Data
To build a time series forecasting model, the first thing we're going to need is data.
And since we're trying to predict the price of Bitcoin, we'll need Bitcoin data.
You can find the data we're going to use on GitHub.
# Download Bitcoin historical data from GitHub
# Note: you'll need to select "Raw" to download the data in the correct format
!wget https://raw.githubusercontent.com/sumitdeyonline/machinelearning/main/BTC-USD.csv

Importing time series data with pandas
Now we've got some data to work with, let's import it using pandas so we can visualize it.
Because our data is in CSV (comma separated values) format (a very common format for time series data), we'll use the pandas read_csv() function, and because our data has a date component, we'll tell pandas to parse the dates with the parse_dates parameter, passing it the name of our date column ("Date").
# Import with pandas
import pandas as pd
# Parse dates and set date column to index
df = pd.read_csv("/content/BTC-USD.csv",
                 parse_dates=["Date"],  # parse the date column (tell pandas it's a datetime)
                 index_col=["Date"])  # set the date column as the index
df.head()

Let's get some more info.
df.info()

# How many samples do we have?
len(df)

We've collected the historical price of Bitcoin. The frequency at which a time series value is collected is often referred to as seasonality. This is usually measured in the number of samples per year. For example, collecting the price of Bitcoin once per day would result in a time series with a seasonality of 365. Time series data collected with different seasonality values often exhibit seasonal patterns (e.g. electricity demand being higher in summer months for air conditioning than in winter months).
Types of time series
Trend - Time series has a clear long-term increase or decrease (may or may not be linear)
Seasonal - Time series affected by seasonal factors such as time of year (e.g. increased sales towards the end of the year) or day of week
Cyclic - Time series shows rises and falls over an unfixed period; these tend to be longer/more variable than seasonal patterns
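To build some intuition for these patterns, here's a quick sketch that generates synthetic series showing each one (the numbers are arbitrary, chosen only for illustration):
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(3 * 365)  # three years of daily timesteps
trend = 0.05 * t  # clear long-term increase
seasonal = 10 * np.sin(2 * np.pi * t / 365)  # repeats with the time of year
cyclic = 15 * np.sin(2 * np.pi * t / 900)  # rises and falls over a longer, unfixed period
noise = np.random.normal(scale=2.0, size=len(t))

plt.figure(figsize=(10, 7))
plt.plot(t, trend + noise, label="Trend")
plt.plot(t, seasonal + noise, label="Seasonal")
plt.plot(t, cyclic + noise, label="Cyclic")
plt.xlabel("Timestep")
plt.legend();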
Deep learning algorithms usually flourish with lots of data, in the range of thousands to millions of samples. In our case, we've got the daily prices of Bitcoin, a maximum of 365 samples per year. But that doesn't mean we can't try them with our data.
import matplotlib.pyplot as plt
# Create a DataFrame with just the closing price, renamed to "Price"
# (assuming the CSV uses a Yahoo Finance-style "Close" column; adjust the name if your file differs)
bitcoin_prices = pd.DataFrame(df["Close"]).rename(columns={"Close": "Price"})
bitcoin_prices.plot(figsize=(10, 7))
plt.xlabel("Date")
plt.ylabel("BTC Price")
plt.title("Price of Bitcoin from 1 Jan 2015 to 3 March 2022", fontsize=16)
plt.legend(fontsize=14);

Importing time series data with Python's CSV module
If your time series data comes in CSV form you don't necessarily have to use pandas.
You can use Python's in-built csv module. And if you're working with dates, you might also want to use Python's datetime. Let's see how we can replicate the plot we created before except this time using Python's csv and datetime modules.
# Importing and formatting historical Bitcoin data with Python
import csv
from datetime import datetime
timesteps = []
btc_price = []
with open("/content/BTC-USD.csv", "r") as f:
    csv_reader = csv.reader(f, delimiter=",")  # read in the target CSV
    next(csv_reader)  # skip first line (this gets rid of the column titles)
    for line in csv_reader:
        timesteps.append(datetime.strptime(line[1], "%m/%d/%Y"))  # get the dates as dates (not strings), strptime = string parse time
        btc_price.append(float(line[2]))  # get the closing price as a float
# View first 10 of each
timesteps[:10], btc_price[:10]

Format Data Part 1: Creating train and test sets for time series data
Usually, you could create a train and test split using a function like Scikit-Learn's outstanding train_test_split() but as we'll see in a moment, this doesn't really cut it for time series data. In time series problems, you'll either have univariate or multivariate data.
Univariate time series data deals with one variable, for example, using the price of Bitcoin to predict the price of Bitcoin.
Multivariate time series data deals with more than one variable, for example, predicting electricity demand using the day of the week, time of year, and the number of houses in a region.
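As a rough sketch of the difference (with made-up values), a univariate series has one value per timestep while a multivariate series has several:
import numpy as np

# Univariate: one variable per timestep (e.g. price only) - values are illustrative
univariate = np.array([101.5, 102.3, 99.8])
# Multivariate: multiple variables per timestep (e.g. a made-up price plus block reward)
multivariate = np.array([[101.5, 6.25],
                         [102.3, 6.25],
                         [99.8, 6.25]])
univariate.shape, multivariate.shape  # (3,) vs (3, 2)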
Create train & test sets for time series (the wrong way)
We've figured out we're dealing with a univariate time series, so we only have to split one variable (for a multivariate time series, you'd have to split multiple variables).
How about we first see the wrong way of splitting time series data? Let's turn our DataFrame index and column into NumPy arrays.
# Get bitcoin date array
timesteps = bitcoin_prices.index.to_numpy()
prices = bitcoin_prices["Price"].to_numpy()
timesteps[:10], prices[:10]

We'll use the ever faithful train_test_split from Scikit-Learn to create our train and test sets.
# Wrong way to make train/test sets for time series
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(timesteps,  # dates
                                                    prices,  # prices
                                                    test_size=0.2,
                                                    random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Looks like the splits worked well, but let's not trust numbers on a page, let's visualize, visualize, visualize!
# Let's plot wrong train and test splits
plt.figure(figsize=(10, 7))
plt.scatter(X_train, y_train, s=5, label="Train data")
plt.scatter(X_test, y_test, s=5, label="Test data")
plt.xlabel("Date")
plt.ylabel("BTC Price")
plt.legend(fontsize=14)
plt.show();

What's wrong with this plot? We're trying to use the historical price of Bitcoin to predict future prices of Bitcoin. Our test data is scattered all throughout the training data.
This kind of random split is okay for datasets without a time component (such as images or passages of text for classification problems) but for time series, we've got to take the time factor into account. To fix this, we've got to split our data in a way that reflects what we're actually trying to do. We need to split our historical Bitcoin data to have a dataset that reflects the past (train set) and a dataset that reflects the future (test set).
Create train & test sets for time series (the right way)
There's no way we can actually access data from the future.
But we can engineer our test set to be in the future with respect to the training set.
To do this, we can create an arbitrary point in time to split our data.
Everything before the point in time can be considered the training set and everything after the point in time can be considered the test set.
# Create train and test splits the right way for time series data
split_size = int(0.8 * len(prices)) # 80% train, 20% test
# Create train data splits (everything before the split)
X_train, y_train = timesteps[:split_size], prices[:split_size]
# Create test data splits (everything after the split)
X_test, y_test = timesteps[split_size:], prices[split_size:]
len(X_train), len(X_test), len(y_train), len(y_test)

Looks like our custom-made splits are the same lengths as the splits we made with train_test_split. But again, these are numbers on a page. Let's visualize.
# Plot correctly made splits
plt.figure(figsize=(10, 7))
plt.scatter(X_train, y_train, s=5, label="Train data")
plt.scatter(X_test, y_test, s=5, label="Test data")
plt.xlabel("Date")
plt.ylabel("BTC Price")
plt.legend(fontsize=14)
plt.show();

That looks much better! We're going to be using the training set (past) to train a model to try and predict values on the test set (future). Because the test set is an artificial future, we can gauge how our model might perform on actual future data.
Create a plotting function
Rather than retyping matplotlib commands to continuously plot data, let's make a plotting function we can reuse later.
# Create a function to plot time series data
def plot_time_series(timesteps, values, format='.', start=0, end=None, label=None):
    """
    Plots timesteps (a series of points in time) against values (a series of values across timesteps).

    Parameters
    ----------
    timesteps : array of timesteps
    values : array of values across time
    format : style of plot, default "."
    start : where to start the plot (setting a value will index from start of timesteps & values)
    end : where to end the plot (setting a value will index from end of timesteps & values)
    label : label to show on plot of values
    """
    # Plot the series
    plt.plot(timesteps[start:end], values[start:end], format, label=label)
    plt.xlabel("Time")
    plt.ylabel("BTC Price")
    if label:
        plt.legend(fontsize=14)  # make label bigger
    plt.grid(True)

# Try out our plotting function
plt.figure(figsize=(10, 7))
plot_time_series(timesteps=X_train, values=y_train, label="Train data")
plot_time_series(timesteps=X_test, values=y_test, label="Test data")

Looking nice! Time for some modeling experiments.
Modeling Experiments
We can build almost any kind of model for our problem as long as the data inputs and outputs are formatted correctly. However, just because we can build almost any kind of model, doesn't mean it'll perform well/should be used in a production setting.
We'll see what this means as we build and evaluate models throughout.
Before we discuss what modeling experiments we're going to run, there are two terms you should be familiar with, horizon and window.
horizon = number of timesteps to predict into future
window = number of timesteps from past used to predict horizon
For example, if we wanted to predict the price of Bitcoin for tomorrow (1 day in the future) using the previous week's worth of Bitcoin prices (7 days in the past), the horizon would be 1 and the window would be 7.
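As a quick sketch with plain Python (hypothetical prices, for illustration only), a single window/horizon pair might look like:
# One window/horizon pair (hypothetical prices)
prices_example = [100.0, 102.0, 101.0, 105.0, 107.0, 108.0, 110.0, 112.0]
window = prices_example[:7]  # previous 7 days (window = 7)
horizon = prices_example[7:]  # next 1 day to predict (horizon = 1)
print(window, "->", horizon)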
Let's do some model experiments
Naïve forecast (baseline) - Model 0
Let's start with a baseline. One of the most common baseline models for time series forecasting, the naïve model (also called the naïve forecast), requires no training at all.
That's because all the naïve model does is use the previous timestep's value to predict the next timestep's value. The formula looks like this:

ŷ_t = y_{t-1}

In simple English: the prediction at timestep t (y-hat) is equal to the value at timestep t-1 (the previous timestep).
# Create a naïve forecast
naive_forecast = y_test[:-1]  # naïve forecast equals every value excluding the last value
naive_forecast[:10], naive_forecast[-10:]  # view first 10 and last 10

# Plot naive forecast
plt.figure(figsize=(10, 7))
plot_time_series(timesteps=X_train, values=y_train, label="Train data")
plot_time_series(timesteps=X_test, values=y_test, label="Test data")
plot_time_series(timesteps=X_test[1:], values=naive_forecast, format="-", label="Naive forecast");

Let's zoom in to take a better look. We can do so by creating an offset value and passing it to the start parameter of our plot_time_series() function.
plt.figure(figsize=(10, 7))
offset = 300 # offset the values by 300 timesteps
plot_time_series(timesteps=X_test, values=y_test, start=offset, label="Test data")
plot_time_series(timesteps=X_test[1:], values=naive_forecast, format="-", start=offset, label="Naive forecast");

When we zoom in we see the naïve forecast comes slightly after the test data. This makes sense because the naive forecast uses the previous timestep value to predict the next timestep value. Forecast made. Time to evaluate it.
Evaluating a time series model
Time series forecasting often involves predicting a number (in our case, the price of Bitcoin).
And what kind of problem is predicting a number? Ten points if you said regression.
With this known, we can use regression evaluation metrics to evaluate our time-series forecasts. The main thing we will be evaluating is: how do our model's predictions (y_pred) compare against the actual values (y_true or ground truth values)?
For all of the following metrics, lower is better (for example an MAE of 0 is better than an MAE 100).
Scale-dependent errors
These are metrics that can be used to compare time series values and forecasts that are on the same scale. For example, Bitcoin historical prices in USD versus Bitcoin forecast values in USD.
MAE (mean absolute error) - Easy to interpret (a forecast is X amount different from the actual amount). Forecast methods that minimize the MAE will lead to forecasts of the median.
code - tf.keras.metrics.mean_absolute_error()
RMSE (root mean square error) - Forecasts which minimize the RMSE lead to forecasts of the mean.
code - tf.sqrt(tf.keras.metrics.mean_squared_error())
Percentage errors
Percentage errors do not have units, which means they can be used to compare forecasts across different datasets.
MAPE (mean absolute percentage error) - Most commonly used percentage error. May explode (not work) if y=0.
sMAPE (symmetric mean absolute percentage error) - Recommended not to be used by Forecasting: Principles and Practice, though it is used in forecasting competitions.
code - Custom implementation (see the sketch after this list)
Scaled errors
MASE (mean absolute scaled error) - MASE equals one for the naive forecast (or very close to one). A forecast which performs better than the naïve should get <1 MASE.
code - See sktime's mase_loss()
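Since TensorFlow doesn't ship an sMAPE metric, a custom implementation might look like the following minimal sketch (one common formulation; assuming y_true and y_pred are float tensors of the same shape):
import tensorflow as tf

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    """
    sMAPE sketch: mean of |error| divided by the average magnitude
    of the true and predicted values, expressed as a percentage.
    """
    numerator = tf.abs(y_true - y_pred)
    denominator = (tf.abs(y_true) + tf.abs(y_pred)) / 2
    return tf.reduce_mean(numerator / denominator) * 100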
Since we're going to be evaluating a lot of models, let's write a function to help us calculate evaluation metrics on their forecasts.
And since TensorFlow doesn't have a ready-made version of MASE (mean absolute scaled error), how about we create our own? We'll take inspiration from sktime's (Scikit-Learn for time series) MeanAbsoluteScaledError class which calculates the MASE.
# MASE implemented courtesy of sktime - https://github.com/alan-turing-institute/sktime/blob/ee7a06843a44f4aaec7582d847e36073a9ab0566/sktime/performance_metrics/forecasting/_functions.py#L16
def mean_absolute_scaled_error(y_true, y_pred):
    """
    Implement MASE (assuming no seasonality of data).
    """
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    # Find MAE of naive forecast (no seasonality)
    mae_naive_no_season = tf.reduce_mean(tf.abs(y_true[1:] - y_true[:-1]))  # our seasonality is 1 day (hence the shift of 1 timestep)
    return mae / mae_naive_no_season
You'll notice the version of MASE above doesn't take in the training values like sktime's mase_loss(). In our case, we're comparing the MAE of our predictions on the test set to the MAE of the naïve forecast on the test set. In practice, if we've created the function correctly, the naïve model should achieve an MASE of 1 (or very close to 1). Any model worse than the naïve forecast will achieve an MASE of >1 and any model better than the naïve forecast will achieve an MASE of <1.
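As a quick sanity check (a sketch using the variables we've already created), evaluating the naïve forecast with our function should return a value very close to 1:
# Sanity check: the naïve forecast should score an MASE of ~1
mean_absolute_scaled_error(y_true=tf.cast(y_test[1:], tf.float32),
                           y_pred=tf.cast(naive_forecast, tf.float32)).numpy()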
Let's put each of our different evaluation metrics together into a function.
def evaluate_preds(y_true, y_pred):
    # Make sure float32 (for metric calculations)
    y_true = tf.cast(y_true, dtype=tf.float32)
    y_pred = tf.cast(y_pred, dtype=tf.float32)
    # Calculate various metrics
    mae = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
    mse = tf.keras.metrics.mean_squared_error(y_true, y_pred)  # puts an emphasis on outliers (all errors get squared)
    rmse = tf.sqrt(mse)
    mape = tf.keras.metrics.mean_absolute_percentage_error(y_true, y_pred)
    mase = mean_absolute_scaled_error(y_true, y_pred)
    return {"mae": mae.numpy(),
            "mse": mse.numpy(),
            "rmse": rmse.numpy(),
            "mape": mape.numpy(),
            "mase": mase.numpy()}
Looking good! How about we test our function on the naive forecast?
naive_results = evaluate_preds(y_true=y_test[1:],
                               y_pred=naive_forecast)
naive_results

Alright, looks like we've got some baselines to beat. Taking a look at the naïve forecast's MAE, it seems on average each forecast is ~$1,218 different from the actual Bitcoin price.
How does this compare to the average price of Bitcoin in the test dataset?
# Find average price of Bitcoin in test dataset
tf.reduce_mean(y_test).numpy()

Looking at these two values starts to give us an idea of how our model is performing:
The average price of Bitcoin in the test dataset is $42,177 (note: the average may not be the best measure here, since the test set contains prices far above and far below this value)
Each prediction in the naive forecast is, on average, off by ~$1,218
How acceptable that is comes down to your own interpretation. Personally, I'd prefer a model which was closer to the mark.
Format Data Part 2: Windowing dataset
You'd think we'd be ready to start building models by now, right? Only one more step (well, really two) to go.
We've got to window our time series. Why do we window? Windowing is a method to turn a time series dataset into a supervised learning problem. In other words, we want to use windows of the past to predict the future. For example, for a univariate time series, windowing for one week (window=7) to predict the next single value (horizon=1) might look like:
Window for one week (univariate time series)
[0, 1, 2, 3, 4, 5, 6] -> [7]
[1, 2, 3, 4, 5, 6, 7] -> [8]
[2, 3, 4, 5, 6, 7, 8] -> [9]
Or for the price of Bitcoin, it'd look like this:
Window for one week with the target of predicting the next day (Bitcoin prices)
[123.654, 125.455, 108.584, 118.674, 121.338, 120.655, 121.795] -> [123.033]
[125.455, 108.584, 118.674, 121.338, 120.655, 121.795, 123.033] -> [124.049]
[108.584, 118.674, 121.338, 120.655, 121.795, 123.033, 124.049] -> [125.961]
Let's build some functions which take in a univariate time series and turn it into windows and horizons of specified sizes. We'll start with the default horizon size of 1 and a window size of 7 (these aren't necessarily the best values to use, I've just picked them).
HORIZON = 1 # predict 1 step at a time
WINDOW_SIZE = 7 # use a week worth of timesteps to predict the horizon
Now we'll write a function to take in an array and turn it into a window and horizon.
# Create function to label windowed data
def get_labelled_windows(x, horizon=1):
    """
    Creates labels for windowed dataset.
    E.g. if horizon=1 (default)
    Input: [1, 2, 3, 4, 5, 6] -> Output: ([1, 2, 3, 4, 5], [6])
    """
    return x[:, :-horizon], x[:, -horizon:]

# Test out the window labelling function
test_window, test_label = get_labelled_windows(tf.expand_dims(tf.range(8)+1, axis=0), horizon=HORIZON)
print(f"Window: {tf.squeeze(test_window).numpy()} -> Label: {tf.squeeze(test_label).numpy()}")

Now we need a way to make windows for an entire time series.
We could do this with Python for loops, however, for large time series, that'd be quite slow.
To speed things up, we'll leverage NumPy's array indexing.
Let's write a function which:
Creates a window step of a specific window size, for example: [[0, 1, 2, 3, 4, 5, 6, 7]]
Uses NumPy indexing to create a 2D array of multiple window steps, for example: [[0, 1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8], [2, 3, 4, 5, 6, 7, 8, 9]]
Uses the 2D array of multiple window steps to index on a target series
Uses the get_labelled_windows() function we created above to turn the window steps into windows with a specified horizon
# Create function to view NumPy arrays as windows
import numpy as np

def make_windows(x, window_size=7, horizon=1):
    """
    Turns a 1D array into a 2D array of sequential windows of window_size.
    """
    # 1. Create a window of specific window_size (add the horizon on the end for later labelling)
    window_step = np.expand_dims(np.arange(window_size+horizon), axis=0)
    # print(f"Window step:\n {window_step}")
    # 2. Create a 2D array of multiple window steps (minus 1 to account for 0 indexing)
    window_indexes = window_step + np.expand_dims(np.arange(len(x)-(window_size+horizon-1)), axis=0).T  # create 2D array of windows of size window_size
    # print(f"Window indexes:\n {window_indexes[:3], window_indexes[-3:], window_indexes.shape}")
    # 3. Index on the target array (time series) with 2D array of multiple window steps
    windowed_array = x[window_indexes]
    # 4. Get the labelled windows
    windows, labels = get_labelled_windows(windowed_array, horizon=horizon)
    return windows, labels
Let's see how it goes.
# View the first 3 windows/labels
full_windows, full_labels = make_windows(prices, window_size=WINDOW_SIZE, horizon=HORIZON)
for i in range(3):
    print(f"Window: {full_windows[i]} -> Label: {full_labels[i]}")

Turning windows into training and test sets
Almost like the stained glass windows of the Sistine Chapel... well, maybe not that good, but still. Time to turn our windows into training and test splits.
We could've windowed our existing training and test splits. However, because of the nature of windowing (it often requires an offset at some point in the data), it usually works better to window the data first, then split it into training and test sets.
Let's write a function that takes in full sets of windows and their labels and splits them into train and test splits.
# Make the train/test splits
def make_train_test_splits(windows, labels, test_split=0.2):
    """
    Splits matching pairs of windows and labels into train and test splits.
    """
    split_size = int(len(windows) * (1-test_split))  # this will default to 80% train/20% test
    train_windows = windows[:split_size]
    train_labels = labels[:split_size]
    test_windows = windows[split_size:]
    test_labels = labels[split_size:]
    return train_windows, test_windows, train_labels, test_labels
Look at that amazing function, let's test it.
train_windows, test_windows, train_labels, test_labels = make_train_test_splits(full_windows, full_labels)
len(train_windows), len(test_windows), len(train_labels), len(test_labels)

Make a modeling checkpoint
Because our models' performance will fluctuate from experiment to experiment, we'll want to make sure we're comparing apples to apples. What I mean is: for a fair comparison, we want to compare each model's best performance against each other model's best performance. For example, if model_1 performed incredibly well on epoch 55 but its performance fell off toward epoch 100, we want the version of the model from epoch 55 to compare against other models, rather than the version from epoch 100. The same goes for each of our other models: compare the best against the best.

To take care of this, we'll implement a ModelCheckpoint callback. It will monitor our model's performance during training and, with save_best_only=True, save only the best model to file. Because we're going to be running multiple experiments, it makes sense to keep track of them by saving models to file under different names. To do this, we'll write a small function that creates a ModelCheckpoint callback which saves a model to a specified filename.
import os
# Create a function to implement a ModelCheckpoint callback with a specific filename
def create_model_checkpoint(model_name, save_path="model_experiments"):
    return tf.keras.callbacks.ModelCheckpoint(filepath=os.path.join(save_path, model_name),  # create filepath to save model
                                              verbose=0,  # only output a limited amount of text
                                              save_best_only=True)  # save only the best model to file
Dense model (window = 7, horizon = 1) - Model 1
Time to build one of our models. If you think we've been through a fair bit of preprocessing before getting here, you're right. Often, preparing data for a model is one of the largest parts of any machine learning project. And once you've got a good model in place, you'll probably notice far more improvements from manipulating the data (e.g. collecting more, improving the quality) than from manipulating the model.
We're going to start by keeping it simple, model_1 will have:
A single dense layer with 128 hidden units and ReLU (rectified linear unit) activation
An output layer with linear activation (or no activation)
Adam optimizer and MAE loss function
Batch size of 128
100 epochs
Why these values? I picked them out of experimentation. A batch size of 32 works pretty well too and we could always train for fewer epochs but since the model runs so fast (you'll see in a second, it's because the number of samples we have isn't massive) we might as well train for more.
import tensorflow as tf
from tensorflow.keras import layers
# Set random seed for as reproducible results as possible
tf.random.set_seed(42)
# Construct model
model_1 = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(HORIZON, activation="linear")  # linear activation is the same as having no activation
], name="model_1_dense")  # give the model a name so we can save it

# Compile model
model_1.compile(loss="mae",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["mae"])  # we don't necessarily need this when the loss function is already MAE

# Fit model
model_1.fit(x=train_windows,  # train windows of 7 timesteps of Bitcoin prices
            y=train_labels,  # horizon value of 1 (using the previous 7 timesteps to predict next day)
            epochs=100,
            verbose=1,
            batch_size=128,
            validation_data=(test_windows, test_labels),
            callbacks=[create_model_checkpoint(model_name=model_1.name)])  # create ModelCheckpoint callback to save best model


Let's evaluate it.
# Evaluate model on test data
model_1.evaluate(test_windows, test_labels)

You'll notice the model achieves the same val_loss (in this case, MAE) as it did on the last epoch. But if we load the version of model_1 that was saved to file by the ModelCheckpoint callback, we should see an improvement in results.
# Load in saved best performing model_1 and evaluate on test data
model_1 = tf.keras.models.load_model("model_experiments/model_1_dense")
model_1.evaluate(test_windows, test_labels)

Making forecasts with a model (on the test dataset)
We've trained a model and evaluated it on the test data, but the project we're working on is called BitPredict, so how do you think we could use our model to make predictions?
Since we're going to be running more modeling experiments, let's write a function which:
Takes in a trained model (just like model_1)
Takes in some input data (just like the data the model was trained on)
Passes the input data to the model's predict() method
Returns the predictions
def make_preds(model, input_data):
    """
    Uses model to make predictions on input_data.

    Parameters
    ----------
    model: trained model
    input_data: windowed input data (same kind of data the model was trained on)

    Returns model predictions on input_data.
    """
    forecast = model.predict(input_data)
    return tf.squeeze(forecast)  # return 1D array of predictions
Now we've got a function to make predictions, let's use it on the test dataset and then compare the results to the ground truth with the evaluate_preds() function we created before.
# Make predictions using model_1 on the test dataset and view the results
model_1_preds = make_preds(model_1, test_windows)
# Evaluate preds
model_1_results = evaluate_preds(y_true=tf.squeeze(test_labels),  # reduce to right shape
                                 y_pred=model_1_preds)
model_1_results

Let's use the plot_time_series() function to plot model_1_preds against the test data.
offset = 300
plt.figure(figsize=(10, 7))
# Account for the test_window offset and index into test_labels to ensure correct plotting
plot_time_series(timesteps=X_test[-len(test_windows):], values=test_labels[:, 0], start=offset, label="Test_data")
plot_time_series(timesteps=X_test[-len(test_windows):], values=model_1_preds, start=offset, format="-", label="model_1_preds")

What's wrong with these predictions? As mentioned before, they're on the test dataset. So they're not actual forecasts. With our current model setup, how do you think we'd make forecasts for the future? We'll cover this later on.
Dense (window = 30, horizon = 1) - Model 2
A naïve model is currently beating our handcrafted deep learning model. Let's continue our modeling experiments. We'll keep the previous model architecture but use a window size of 30. In other words, we'll use the previous 30 days of Bitcoin prices to try and predict the next day's price.
Data Preparation
We'll start our second modeling experiment by preparing datasets using the functions we created earlier.
HORIZON = 1 # predict one step at a time
WINDOW_SIZE = 30 # use 30 timesteps in the past
# Make windowed data with appropriate horizon and window sizes
full_windows, full_labels = make_windows(prices, window_size=WINDOW_SIZE, horizon=HORIZON)
# Make train and testing windows
train_windows, test_windows, train_labels, test_labels = make_train_test_splits(windows=full_windows, labels=full_labels)
len(train_windows), len(test_windows), len(train_labels), len(test_labels)

Now let's construct model_2, a model with the same architecture as model_1 as well as the same training routine.
tf.random.set_seed(42)
# Create model (same model as model 1 but data input will be different)
model_2 = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(HORIZON)  # need to predict horizon number of steps into the future
], name="model_2_dense")

model_2.compile(loss="mae",
                optimizer=tf.keras.optimizers.Adam())

model_2.fit(train_windows,
            train_labels,
            epochs=100,
            batch_size=128,
            verbose=0,
            validation_data=(test_windows, test_labels),
            callbacks=[create_model_checkpoint(model_name=model_2.name)])

Let's evaluate our model's performance.
# Evaluate model 2 preds
model_2.evaluate(test_windows, test_labels)

How about we try loading in the best performing model_2 which was saved to file thanks to our ModelCheckpoint callback.
# Load in best performing model
model_2 = tf.keras.models.load_model("model_experiments/model_2_dense/")
model_2.evaluate(test_windows, test_labels)

But let's not stop there, let's make some predictions with model_2 and then evaluate them just as we did before.
# Get forecast predictions
model_2_preds = make_preds(model_2,
                           input_data=test_windows)
# Evaluate results for model 2 predictions
model_2_results = evaluate_preds(y_true=tf.squeeze(test_labels),  # remove 1 dimension of test labels
                                 y_pred=model_2_preds)
model_2_results
It looks like model_2 performs worse than the naïve model as well as model_1! Does this mean a smaller window size is better? How do the predictions look?
offset = 300
plt.figure(figsize=(10, 7))
# Account for the test_window offset
plot_time_series(timesteps=X_test[-len(test_windows):], values=test_labels[:, 0], start=offset, label="test_data")
plot_time_series(timesteps=X_test[-len(test_windows):], values=model_2_preds, start=offset, format="-", label="model_2_preds")

We can run more modeling experiments, such as the following:
Model 3: Dense (window = 30, horizon = 7) - Let's try and predict 7 days ahead given the previous 30 days.
Model 4: Conv1D - We'll be using a Conv1D model. Conv1D models can be used for seq2seq (sequence to sequence) problems. In our case, the input sequence is the previous 7 days of Bitcoin price data and the output is the next day (in seq2seq terms this is called a many-to-one problem). A rough sketch of this kind of model follows this list.
Model 5: RNN (LSTM) - Let's reuse the same data we used for the Conv1D model, except this time we'll create an LSTM-cell powered RNN to model our Bitcoin data.
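As a rough sketch, the Conv1D setup mentioned above might look like the following for window = 7, horizon = 1 (the filter count and kernel size are illustrative choices, not a definitive configuration):
# A minimal Conv1D sketch (hyperparameters are illustrative)
model_4 = tf.keras.Sequential([
    layers.Lambda(lambda x: tf.expand_dims(x, axis=1)),  # add a dimension: (batch, window) -> (batch, 1, window), the 3D input Conv1D expects
    layers.Conv1D(filters=128, kernel_size=5, padding="causal", activation="relu"),  # causal padding suits temporal data
    layers.Dense(HORIZON)
], name="model_4_conv1D")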
Make a multivariate time series
So far all of our models have barely kept up with the naïve forecast. And so far all of them have been trained on a single variable (also called a univariate time series): the historical price of Bitcoin. If predicting the price of Bitcoin using the price of Bitcoin hasn't worked out very well, maybe giving our model more information will help. "More information" is a vague term, because we could feed almost anything to our model(s) and they would still try to find patterns. For example, we could use the historical price of Bitcoin as well as whatever anyone with the name Daniel Bourke tweeted on that day to predict the future price of Bitcoin.
This will be different for almost every time series you work on but in our case, we could try to see if the Bitcoin block reward size adds any predictive power to our model(s). The Bitcoin block reward size is the number of Bitcoin someone receives from mining a Bitcoin block. At its inception, the Bitcoin block reward size was 50. But every four years or so, the Bitcoin block reward halves. For example, the block reward size went from 50 (starting January 2009) to 25 on November 28, 2012. Let's encode this information into our time series data and see if it helps a model's performance.
Alright, time to add another feature column, the block reward size. First, we'll need to create variables for the different block reward sizes as well as the dates they came into play.
The following block rewards and dates were sourced from cmcmarkets.com.
Block Reward | Start Date
50 | 3 January 2009
25 | 28 November 2012
12.5 | 9 July 2016
6.25 | 11 May 2020
3.125 | TBA (expected 2024)
1.5625 | TBA (expected 2028)
# Block reward values
block_reward_1 = 50 # 3 January 2009 (2009-01-03) - this block reward isn't in our dataset (it starts from 01 October 2013)
block_reward_2 = 25 # 28 November 2012
block_reward_3 = 12.5 # 9 July 2016
block_reward_4 = 6.25 # 11 May 2020
# Block reward dates (datetime form of the above date stamps)
block_reward_2_datetime = np.datetime64("2012-11-28")
block_reward_3_datetime = np.datetime64("2016-07-09")
block_reward_4_datetime = np.datetime64("2020-05-11")
# Get date indexes for when to add in different block dates
block_reward_2_days = (block_reward_3_datetime - bitcoin_prices.index[0]).days
block_reward_3_days = (block_reward_4_datetime - bitcoin_prices.index[0]).days
block_reward_2_days, block_reward_3_days

Now we can add another feature to our dataset, block_reward (this gets lower over time, so it may lead to increasing prices of Bitcoin).
# Add block_reward column
bitcoin_prices_block = bitcoin_prices.copy()
bitcoin_prices_block["block_reward"] = None
# Set values of block_reward column (it's the last column hence -1 indexing on iloc)
bitcoin_prices_block.iloc[:block_reward_2_days, -1] = block_reward_2
bitcoin_prices_block.iloc[block_reward_2_days:block_reward_3_days, -1] = block_reward_3
bitcoin_prices_block.iloc[block_reward_3_days:, -1] = block_reward_4
bitcoin_prices_block.head()

We've officially added another variable to our time series data.
Let's see what it looks like.
# Plot the block reward/price over time
# Note: Because of the different scales of our values we'll scale them to be between 0 and 1.
from sklearn.preprocessing import minmax_scale
scaled_price_block_df = pd.DataFrame(minmax_scale(bitcoin_prices_block[["Price", "block_reward"]]),  # we need to scale the data first
                                     columns=bitcoin_prices_block.columns,
                                     index=bitcoin_prices_block.index)
scaled_price_block_df.plot(figsize=(10, 7));

When we scale the block reward and the Bitcoin price, we can see the price goes up as the block reward goes down, perhaps this information will be helpful to our model's performance.
Making a windowed dataset with pandas
Previously, we used some custom-made functions to window our univariate time series.
However, since we've just added another variable to our dataset, those functions won't work. Since our data is in a pandas DataFrame, we can leverage the pandas.DataFrame.shift() method to create a windowed multivariate time series. The shift() method offsets an index by a specified number of periods.
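As a tiny sketch of what shift() does (toy values, purely for illustration):
import pandas as pd
toy = pd.Series([1, 2, 3, 4])
toy.shift(periods=1).tolist()  # [nan, 1.0, 2.0, 3.0] - each value moves forward one step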
# Setup dataset hyperparameters
HORIZON = 1
WINDOW_SIZE = 7
# Make a copy of the Bitcoin historical data with block reward feature
bitcoin_prices_windowed = bitcoin_prices_block.copy()
# Add windowed columns
for i in range(WINDOW_SIZE):  # shift values for each step in WINDOW_SIZE
    bitcoin_prices_windowed[f"Price+{i+1}"] = bitcoin_prices_windowed["Price"].shift(periods=i+1)
bitcoin_prices_windowed.head(10)

Now that we've got a windowed dataset, let's separate features (X) from labels (y).
Remember in our windowed dataset, we're trying to use the previous WINDOW_SIZE steps to predict HORIZON steps.
Window for a week (7) to predict a horizon of 1 (multivariate time series)
WINDOW_SIZE & block_reward -> HORIZON
[0, 1, 2, 3, 4, 5, 6, block_reward] -> [7]
[1, 2, 3, 4, 5, 6, 7, block_reward] -> [8]
[2, 3, 4, 5, 6, 7, 8, block_reward] -> [9]
We'll also remove the NaN values using pandas dropna() method; this is equivalent to starting our windowing function at sample 0 (the first sample) + WINDOW_SIZE.
# Let's create X & y, remove the NaN's and convert to float32 to prevent TensorFlow errors
X = bitcoin_prices_windowed.dropna().drop("Price", axis=1).astype(np.float32)
y = bitcoin_prices_windowed.dropna()["Price"].astype(np.float32)
X.head()

# Make train and test sets
split_size = int(len(X) * 0.8)
X_train, y_train = X[:split_size], y[:split_size]
X_test, y_test = X[split_size:], y[split_size:]
len(X_train), len(y_train), len(X_test), len(y_test)

Training and test multivariate time series datasets made! Time to build a model.
Model 6: Dense (multivariate time series)
To keep things simple, let's take the model_1 architecture and use it to train on and make predictions with our multivariate time series data. By replicating the model_1 architecture, we'll be able to see whether adding the block reward feature improves or detracts from model performance.
tf.random.set_seed(42)
# Make multivariate time series model
model_6 = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    # layers.Dense(128, activation="relu"),  # adding an extra layer here should lead to beating the naive model
    layers.Dense(HORIZON)
], name="model_6_dense_multivariate")

# Compile
model_6.compile(loss="mae",
                optimizer=tf.keras.optimizers.Adam())

# Fit
model_6.fit(X_train, y_train,
            epochs=100,
            batch_size=128,
            verbose=0,  # don't print training updates
            validation_data=(X_test, y_test),
            callbacks=[create_model_checkpoint(model_name=model_6.name)])

You might've noticed that the model inferred the input shape of our data automatically (the data now has an extra feature). Often this will be the case. However, if you're running into shape issues, you can always explicitly define the input shape using the input_shape parameter of the first layer in a model.
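For example, a sketch of defining the input shape explicitly for our multivariate model might look like this (WINDOW_SIZE + 1 features per sample: 7 shifted price columns plus the block reward):
# Explicit input shape sketch (illustrative; not required here since Keras infers it)
model_6_explicit = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(WINDOW_SIZE + 1,)),  # 8 features per sample
    layers.Dense(HORIZON)
])
Time to evaluate our multivariate model.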
# Make sure best model is loaded and evaluate
model_6 = tf.keras.models.load_model("model_experiments/model_6_dense_multivariate")
model_6.evaluate(X_test, y_test)

# Make predictions on multivariate data
model_6_preds = tf.squeeze(model_6.predict(X_test))
# Evaluate preds
model_6_results = evaluate_preds(y_true=y_test,
                                 y_pred=model_6_preds)
model_6_results

It looks like adding in the block reward may have helped our model slightly.
But there are a few more things we could try.
Model 7: N-BEATS algorithm
So far we've tried a bunch of smaller models, models with only a couple of layers.
But one of the best ways to improve a model's performance is to increase the number of layers in it. That's exactly what the N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series Forecasting) algorithm does. N-BEATS focuses on univariate time series problems and achieved state-of-the-art performance, beating the winner of the M4 competition (a forecasting competition). For our next modeling experiment, we're going to replicate the generic architecture of the N-BEATS algorithm (see section 3.3 of the N-BEATS paper).
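The full replication is fairly involved, but as a taste, here's one possible sketch of the basic N-BEATS building block as a custom Keras layer (an outline of the generic architecture, not a full replication; hyperparameter names follow the paper):
# A sketch of the N-BEATS basic block (generic architecture) as a custom layer
class NBeatsBlock(tf.keras.layers.Layer):
    def __init__(self, input_size, theta_size, horizon, n_neurons, n_layers, **kwargs):
        super().__init__(**kwargs)
        self.input_size = input_size
        self.theta_size = theta_size  # typically input_size + horizon
        self.horizon = horizon
        # Stack of fully connected layers, each with ReLU activation
        self.hidden = [tf.keras.layers.Dense(n_neurons, activation="relu") for _ in range(n_layers)]
        # Theta layer (linear activation) outputs backcast + forecast coefficients
        self.theta_layer = tf.keras.layers.Dense(theta_size, activation="linear")

    def call(self, inputs):
        x = inputs
        for layer in self.hidden:
            x = layer(x)
        theta = self.theta_layer(x)
        # Split theta into backcast (reconstruction of the input) and forecast (the prediction)
        backcast, forecast = theta[:, :self.input_size], theta[:, -self.horizon:]
        return backcast, forecast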
Model 8: Creating an ensemble (stacking different models together)
After all that effort, the N-BEATS algorithm's performance was underwhelming.
But again, this is part and parcel of machine learning. Not everything will work.
That's when we refer back to the motto: experiment, experiment, experiment.
Our next experiment is creating an ensemble of models. An ensemble involves training and combining multiple different models on the same problem. Ensemble models are often the types of models you'll see winning data science competitions on websites like Kaggle.
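As a minimal sketch of the idea (assuming a list of already-trained models that all accept the same windowed inputs), ensemble predictions can be as simple as averaging each model's predictions:
# Average the predictions of several trained models (a simple ensembling sketch)
def make_ensemble_preds(ensemble_models, input_data):
    ensemble_preds = [model.predict(input_data) for model in ensemble_models]
    return tf.reduce_mean(tf.squeeze(ensemble_preds), axis=0)  # mean prediction across models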
Model 9: Train a model on the full historical data to make predictions for future
What would a forecasting model be worth if we didn't use it to predict the future?
It's time we created a model that's able to make future predictions on the price of Bitcoin. To make predictions into the future, we'll train a model on the full dataset and then use it to make predictions over some future horizon. Previously, we split our data into training and test sets to evaluate how our model did on pseudo-future data (the test set). But since the goal of a forecasting model is to predict values into the actual future, we won't be using a test set.
We'll use the windowed DataFrame from earlier to get windows and labels for the full dataset, then turn them into performance-optimized TensorFlow Datasets by:
Turning X_all and y_all into tensor Datasets using tf.data.Dataset.from_tensor_slices()
Combining the features and labels into a Dataset tuple using tf.data.Dataset.zip()
Batch and prefetch the data using tf.data.Dataset.batch() and tf.data.Dataset.prefetch() respectively
# Train model on entire data to make prediction for the next day
X_all = bitcoin_prices_windowed.drop(["Price", "block_reward"], axis=1).dropna().to_numpy() # only want prices, our future model can be a univariate model
y_all = bitcoin_prices_windowed.dropna()["Price"].to_numpy()
# 1. Turn X and y into tensor Datasets
features_dataset_all = tf.data.Dataset.from_tensor_slices(X_all)
labels_dataset_all = tf.data.Dataset.from_tensor_slices(y_all)
# 2. Combine features & labels
dataset_all = tf.data.Dataset.zip((features_dataset_all, labels_dataset_all))
# 3. Batch and prefetch for optimal performance
BATCH_SIZE = 1024 # taken from Appendix D in N-BEATS paper
dataset_all = dataset_all.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
dataset_all

And now let's create a model similar to model_1 except with an extra layer, we'll also fit it to the entire dataset for 100 epochs (feel free to play around with the number of epochs or callbacks here, you've got the skills to now).
tf.random.set_seed(42)
# Create model (nice and simple, just to test)
model_9 = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(HORIZON)
])

# Compile
model_9.compile(loss=tf.keras.losses.mae,
                optimizer=tf.keras.optimizers.Adam())

# Fit model on all of the data to make future forecasts
model_9.fit(dataset_all,
            epochs=100,
            verbose=0)  # don't print out anything, we've seen this all before

Make predictions on the future
Let's predict the future and get rich! Well... maybe not.
As you've seen so far, our machine learning models have performed quite poorly at predicting the price of Bitcoin (time series forecasting in open systems is typically a game of luck), often worse than the naïve forecast. That doesn't mean we can't use our models to try and predict the future, right? To do so, let's start by defining a variable INTO_FUTURE which decides how many timesteps we'd like to predict into the future.
Let's create a function that returns INTO_FUTURE forecasted values using a trained model. The function will:
Take as input:
a list of values (the Bitcoin historical data)
a trained model (such as model_9)
a window into the future to predict (our INTO_FUTURE variable)
the window size the model was trained on (WINDOW_SIZE) - the model can only predict on the same kind of data it was trained on
Create an empty list for future forecasts (this will be returned at the end of the function) and extract the last WINDOW_SIZE values from the input values (predictions will start from the last WINDOW_SIZE values of the training data)
Loop INTO_FUTURE times, making a prediction on the current window, then update the window by dropping its first value and appending the latest prediction
Eventually, future predictions will be made using the model's own previous predictions as input
# How many timesteps to predict into the future?
INTO_FUTURE = 14 # since our Bitcoin data is daily, this is for 14 days
# 1. Create function to make predictions into the future
def make_future_forecast(values, model, into_future, window_size=WINDOW_SIZE) -> list:
    """
    Makes future forecasts into_future steps after values ends.
    Returns future forecasts as a list of floats.
    """
    # 2. Make an empty list for future forecasts/prepare data to forecast on
    future_forecast = []
    last_window = values[-window_size:]  # only want preds from the last window (this will get updated)
    # 3. Make into_future number of predictions, altering the data which gets predicted on each time
    for _ in range(into_future):
        # Predict on the last window, then append the prediction (the model starts to forecast on its own forecasts)
        future_pred = model.predict(tf.expand_dims(last_window, axis=0))
        print(f"Predicting on:\n {last_window} -> Prediction: {tf.squeeze(future_pred).numpy()}\n")
        # Append prediction to future_forecast
        future_forecast.append(tf.squeeze(future_pred).numpy())
        # Update last window with the new pred, keeping only the window_size most recent values (the model was trained on windows of window_size)
        last_window = np.append(last_window, future_pred)[-window_size:]
    return future_forecast
Time to bring BitPredict to life and make future forecasts of the price of Bitcoin.
# Make forecasts into future of the price of Bitcoin
# Note: if you're reading this at a later date, you may already be in the future, so the forecasts
# we're making may not actually be forecasts, if that's the case, readjust the training data.
future_forecast = make_future_forecast(values=y_all,
                                       model=model_9,
                                       into_future=INTO_FUTURE,
                                       window_size=WINDOW_SIZE)


future_forecast[:10]

Plot future forecasts
To plot our model's future forecasts against the historical data of Bitcoin, we're going to need a series of future dates (future dates from the final date of where our dataset ends).
How about we create a function to return a date range from a specified start date to a specified number of days into the future (INTO_FUTURE)? To do so, we'll use NumPy's datetime64 datatype (our Bitcoin dates are already in this datatype) along with NumPy's timedelta64, which helps create date ranges.
def get_future_dates(start_date, into_future, offset=1):
    """
    Returns an array of datetime values ranging from start_date to start_date + into_future.

    start_date: date to start range (np.datetime64)
    into_future: number of days to add onto start date for range (int)
    offset: number of days to offset start_date by (default 1)
    """
    start_date = start_date + np.timedelta64(offset, "D")  # specify start date, "D" stands for day
    end_date = start_date + np.timedelta64(into_future, "D")  # specify end date
    return np.arange(start_date, end_date, dtype="datetime64[D]")  # return a date range between start date and end date

# Last timestep of timesteps (currently in np.datetime64 format)
last_timestep = bitcoin_prices.index[-1]
# Get next two weeks of timesteps
next_time_steps = get_future_dates(start_date=last_timestep,
                                   into_future=INTO_FUTURE)
next_time_steps

We've now got a list of dates we can use to visualize our future Bitcoin predictions.
But to make sure the lines of the plot connect (try not running the cell below and then plotting the data to see what I mean), let's insert the last timestep and Bitcoin price of our training data to the next_time_steps and future_forecast arrays.
# Insert the last timestep/final price so the graph doesn't look disconnected
next_time_steps = np.insert(next_time_steps, 0, last_timestep)
future_forecast = np.insert(future_forecast, 0, btc_price[-1])
next_time_steps, future_forecast

Time to plot!
# Plot future price predictions of Bitcoin
plt.figure(figsize=(10, 7))
plot_time_series(bitcoin_prices.index, btc_price, start=2500, format="-", label="Actual BTC Price")
plot_time_series(next_time_steps, future_forecast, format="-", label="Predicted BTC Price")

It looks like our predictions are starting to form a bit of a cyclic pattern (up and down in the same way). Perhaps that's due to our model overfitting the training data and not generalizing well for future data. Also, as you could imagine, the further you predict into the future, the higher your chance for error (try seeing what happens when you predict 100 days into the future).
Model 10: Why forecasting is BS (the turkey problem)
When creating any kind of forecast, you must keep the turkey problem in mind.
The turkey problem is an analogy for when your observational data (your historical data) fails to capture a future event that is catastrophic and could lead you to ruin.
The story goes: a turkey lives a good life for 1000 days, being fed every day and taken care of by its owners, until the evening before Thanksgiving. Based on the turkey's observational data, it has no reason to believe things shouldn't keep going the way they are. In other words, how could a turkey possibly predict that on day 1001, after 1000 consecutive good days, it was about to have a far from ideal day?

How does this relate to predicting the price of Bitcoin (or the price of any stock or figure in an open market)? You could have the historical data of Bitcoin for its entire existence and build a model which predicts it perfectly. But then one day for some unknown and unpredictable reason, the price of Bitcoin plummets 100x in a single day.
Think about it in your own life: how many times have the most significant events happened seemingly out of the blue? As in, you could go to a cafe and run into the love of your life, despite visiting the same cafe for 10 years straight and never running into this person before. The same goes for predicting the price of Bitcoin: you could make money for 10 years straight and then lose it all in a single day.
# Let's introduce a Turkey problem to our BTC data (price BTC falls 100x in one day)
btc_price_turkey = btc_price.copy()
btc_price_turkey[-1] = btc_price_turkey[-1] / 100  # make the final price 100x lower
# View the last 10 prices (notice the artificial drop at the end, showcasing the turkey problem)
btc_price_turkey[-10:]

Notice the last value is 100x lower than what it actually was (remember, this is not a real data point, it's only here to illustrate the effects of the turkey problem). Now we've got Bitcoin prices including a turkey problem data point, let's get the timesteps.
# Get the timesteps for the turkey problem
btc_timesteps_turkey = np.array(bitcoin_prices.index)
btc_timesteps_turkey[-10:]

Let's see our artificially created turkey problem Bitcoin data.
plt.figure(figsize=(10, 7))
plot_time_series(timesteps=btc_timesteps_turkey,
                 values=btc_price_turkey,
                 format="-",
                 label="BTC Price + Turkey Problem",
                 start=2500)

Before we build a model, let's create some windowed datasets with our turkey data.
# Create train and test sets for turkey problem data
full_windows, full_labels = make_windows(np.array(btc_price_turkey), window_size=WINDOW_SIZE, horizon=HORIZON)
len(full_windows), len(full_labels)
X_train, X_test, y_train, y_test = make_train_test_splits(full_windows, full_labels)
len(X_train), len(X_test), len(y_train), len(y_test)

Building a turkey model (model to predict turkey data)
With our updated data, we only changed 1 value. Let's see how it affects a model.
To keep things comparable to previous models, we'll create a turkey_model which is a clone of model_1 (same architecture, but different data). That way, when we evaluate the turkey_model we can compare its results to model_1_results and see how much a single data point can influence a model's performance.
# Clone model 1 architecture for turkey model and fit the turkey model on the turkey data
turkey_model = tf.keras.models.clone_model(model_1)
turkey_model._name = "Turkey_Model"
turkey_model.compile(loss="mae",
                     optimizer=tf.keras.optimizers.Adam())
turkey_model.fit(X_train, y_train,
                 epochs=100,
                 verbose=0,
                 validation_data=(X_test, y_test),
                 callbacks=[create_model_checkpoint(turkey_model.name)])

# Evaluate turkey model on test data
turkey_model.evaluate(X_test, y_test)

# Load best model and evaluate on test data
turkey_model = tf.keras.models.load_model("model_experiments/Turkey_Model/")
turkey_model.evaluate(X_test, y_test)

Now let's make some predictions with our model and evaluate them on the test data.
# Make predictions with Turkey model
turkey_preds = make_preds(turkey_model, X_test)
# Evaluate turkey preds
turkey_results = evaluate_preds(y_true=y_test,
                                y_pred=turkey_preds)
turkey_results


Finally, we'll visualize the turkey predictions over the test turkey data.
plt.figure(figsize=(10, 7))
# plot_time_series(timesteps=btc_timesteps_turkey[:split_size], values=btc_price_turkey[:split_size], label="Train Data")
offset = 300
plot_time_series(timesteps=btc_timesteps_turkey[-len(X_test):],
                 values=btc_price_turkey[-len(y_test):],
                 format="-",
                 label="Turkey Test Data",
                 start=offset)
plot_time_series(timesteps=btc_timesteps_turkey[-len(X_test):],
                 values=turkey_preds,
                 label="Turkey Preds",
                 start=offset);

Think about it like this: just as a turkey who lives 1000 joyful days has, based on observation alone, no reason to believe day 1001 won't be as joyful as the last, a model trained on historical Bitcoin data containing no single event where the price decreased by 100x in a day has no reason to predict one in the future. A model cannot predict anything outside of the distribution it was trained on. In turn, price movements that are highly unlikely based on historical movements, upward or downward, will likely never be part of a forecast.