Training a Recurrent Neural Network Using Keras
Table of Contents
- Frame the Problem
- Get the Data
- Explore the Data
- Prepare the Data for Training
- A Non Machine Learning Baseline
- Machine Learning Baseline
- Building a RNN with Keras
- A RNN Baseline
- Extra
The appeal of RNNs comes from our desire to work with data that has some form of statistical dependency on previous (and future) values. This can take the form of text, such as learning how words in a sentence should flow, or of timeseries data, such as seismograms, to name just a few. A recurrent neural network processes a sequence by iterating through its elements while maintaining a state containing information about what it has seen so far. [Chollet, F. (2017)]
With our basic understanding of RNNs, it’s time to dive into a small example using real timeseries data.
Let’s remind ourselves what the general machine learning workflow is so that we don’t get lost:
Frame the Problem –> Get the Data –> Explore the Data –> Prepare the Data –> Short-List Promising Models –> Fine-Tune the System
We won’t have time to go through each of these steps in detail, but I encourage you to read Chapter 2 from Géron’s book [see Suggested Reading below].
# Run this first
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
Frame the Problem
We want to know whether it is possible to predict the temperature in the future. For a given day, we want to look back some number of timesteps and use that information to predict the weather ahead. We’ll use the following variables to constrain what we mean by ‘past’ and ‘future’:
lookback - number of timesteps to look back from
delay - number of timesteps into the future
step - our sample rate
Since the raw data is recorded every 10 minutes, we will set lookback = 1440 (1440 timesteps = 10 days of history), step = 6 (one data point per hour), and delay = 144 (144 timesteps = 24 hours into the future).
# How many timesteps back the input data should go.
# Note: one timestep is 10 minutes, so 1440 timesteps = 10 days
lookback = 1440
# The period, in timesteps, at which you sample data. Set to 6 in
# order to draw one data point every hour.
step = 6
# How many timesteps in the future the target should be.
# Note: 144 timesteps = 24 hours
delay = 144
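Because each timestep in the raw Jena data is 10 minutes apart, it is easy to conflate timesteps with minutes. Here is a quick arithmetic sanity check of the configuration used for training later in the notebook (lookback = 1440, step = 6, delay = 144):

```python
MINUTES_PER_TIMESTEP = 10  # the Jena data is recorded every 10 minutes

lookback = 1440  # timesteps of history fed to the model
step = 6         # keep one timestep out of every 6 (i.e. one per hour)
delay = 144      # timesteps between the end of the input and the target

lookback_days = lookback * MINUTES_PER_TIMESTEP / (60 * 24)
samples_per_sequence = lookback // step
delay_hours = delay * MINUTES_PER_TIMESTEP / 60

print(lookback_days)         # 10.0 -> 10 days of history
print(samples_per_sequence)  # 240  -> 240 hourly points per input sequence
print(delay_hours)           # 24.0 -> predicting 24 hours ahead
```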
Get the Data
Our data set is a timeseries from the Weather Station at the Max Planck Institute for Biogeochemistry in Jena, Germany. Lucky for us, this is already in a format we can quickly and easily work with. We just need to read it in, store it in a workable format, and do some very basic data processing.
Download the data and store it in your working directory. Once you have done that, execute this block:
data_dir = './'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')
# Read the raw csv into a single string
with open(fname) as f:
    data = f.read()
Since we are working with a .csv file we first need to read it in and then beat it into something we can work with. Depending on how large your data set is (both row- and column-wise) you might want to invest some time in learning Pandas, a library that allows for easy data manipulation and works great with .csv, .xlsx, and .txt file types.
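As an aside, the same load-and-parse could be done in a couple of lines with Pandas (a sketch, assuming pandas is installed; the tiny inline csv here stands in for the real file):

```python
import io

import pandas as pd

# Stand-in for open('jena_climate_2009_2016.csv'): a tiny csv with the same layout
csv_text = (
    '"Date Time","p (mbar)","T (degC)"\n'
    '01.01.2009 00:10:00,996.52,-8.02\n'
    '01.01.2009 00:20:00,996.57,-8.41\n'
)

df = pd.read_csv(io.StringIO(csv_text))
# Drop the "Date Time" column and convert to a float numpy array,
# mirroring what the manual parsing below produces
float_data = df.drop(columns=['Date Time']).to_numpy(dtype=float)
print(float_data.shape)  # (2, 2)
```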
We only have a few hundred thousand samples (420551, to be exact), so we will just use some built-in Python functions:
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print("Our header information: ",'\n')
print(header)
Our header information:
['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
Note that we separated the header titles and stored them in the “header” variable, and dropped the header row from “lines”. The first column, which contains the “Date Time” strings, will be dropped when we parse the remaining values into floats below.
Now that we have read our data in, let’s parse it and store it into a Numpy array so we can easily access the float values later:
Parse the Data into a Numpy Array
float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    # Skip the first column ("Date Time") and convert the rest to floats
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
I like to do sanity checks along the way, so let’s make sure our dimensions make sense:
print(float_data.shape)
(420551, 14)
We have 420551 timesteps (one every 10 minutes), with 14 columns/features, i.e. we created a 2D tensor with dimensions (# of samples, # of features).
Great. So far so good.
Explore the Data
It’s always a good idea to get a sense of what our data looks like. Data exploration is a necessary step to get a broad sense of what kind of data you are working with and to get an overall ‘feel’ for the data. Depending on the data type, you might want to make histograms or scatter plots for each numerical attribute. You could also compute some basic statistical attributes of your data set such as the mean, standard deviations, minimum and maximum values, and percentiles. (You can take advantage of Pandas’ built-in function for quick results).
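For instance, once float_data exists, the basic statistics take only a couple of numpy calls. A sketch, using a small random array in place of the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for float_data: 1000 samples of 3 features, mean ~5, std ~2
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Per-feature summary statistics, computed down the sample axis
print(data.mean(axis=0))                          # per-column means, near 5.0
print(data.std(axis=0))                           # per-column stds, near 2.0
print(data.min(axis=0), data.max(axis=0))         # per-column ranges
print(np.percentile(data, [25, 50, 75], axis=0))  # per-column quartiles
```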
Let’s do a VERY mild peek at our data. Specifically, let’s plot the temperature and see how it behaves over time:
temp = float_data[:, 1] # temperature (in degrees Celsius)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14,5))
ax.plot(range(len(temp)), temp)
ax.set_ylabel('Degrees [C]')
ax.set_xlabel('Timestep [one per 10 min]')
plt.show()
As one might expect (or hope to expect), we see a periodic temperature pattern over the years.
Prepare the Data for Training
Depending on your data set, you might consider using transformation Pipelines for preprocessing. In our case, we want to take care of the fact that our features have very different scales. Neural networks don’t perform well when input scales differ dramatically, so we’ll normalize our timeseries.
Note: we compute the mean and standard deviation only from the training data (our first 200000 timesteps), then apply them to the entire series:
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
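To verify the normalization did what we expect, check that the training slice now has (approximately) zero mean and unit standard deviation per feature. A small self-contained sketch with stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
float_data = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))  # stand-in data
n_train = 600  # stand-in for the 200000 training timesteps

# Normalize with statistics computed from the training slice only
mean = float_data[:n_train].mean(axis=0)
float_data -= mean
std = float_data[:n_train].std(axis=0)
float_data /= std

# The training slice is now standardized; later slices are merely rescaled
print(np.allclose(float_data[:n_train].mean(axis=0), 0.0, atol=1e-8))  # True
print(np.allclose(float_data[:n_train].std(axis=0), 1.0, atol=1e-8))   # True
```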
Split Data into Training, Validation, and Testing
Always split your data set into a training, validation, and test set. As a rule of thumb, leave about 20% of your data for testing and keep the rest for training and validation. The snippet of code below might look confusing at first, but it is just a Python generator for creating the data sets mentioned above.
# Generator yields a tuple (samples, targets), where samples is one batch
# of input data and targets is the corresponding array of
# target temperatures.
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    """
    Parameters
    ----------
    data: array, dtype = float or int
        array containing our data
    lookback: integer
        number of timesteps to take backward from current index
    delay: integer
        number of timesteps to take forward from current index
    min_index: integer
        index value used to set lower bound in data array
        e.g. array_test[min_index:]
    max_index: integer
        index value used to cap data array
        e.g. array_test[:max_index]
    shuffle: boolean
        whether to shuffle the samples or keep them in
        chronological order
    batch_size: integer
        number of samples per batch
    step: integer
        the period, in timesteps, at which we sample data

    Returns
    -------
    samples: array, dtype = float or int
        a single batch of input data
    targets: array, dtype = float, int, or str
        a single array of target values (in this case temperatures)
    """
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            # range(start, end, step): start lookback timesteps before the
            # sampled row, end at the row itself, and stride by `step`
            # (one point per hour) --> 1440 timesteps / 6 = 240 points
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            # Target is the temperature (column 1) `delay` timesteps ahead
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
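To convince ourselves the generator yields sensibly shaped batches with correctly offset targets, here is a condensed, self-contained sketch of the same idea run on a tiny synthetic series (column 1 plays the role of temperature):

```python
import numpy as np

def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j in range(len(rows)):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]  # 'temperature', delay steps ahead
        yield samples, targets

# Tiny stand-in series: 500 timesteps, 3 features
data = np.arange(500 * 3, dtype=float).reshape(500, 3)
gen = generator(data, lookback=60, delay=12, min_index=0, max_index=400,
                batch_size=8, step=6)
samples, targets = next(gen)

print(samples.shape)  # (8, 10, 3): 8 sequences, 60 // 6 = 10 points each
print(targets.shape)  # (8,)
# The first batch starts at row 60, so its target is data[60 + 12][1]
print(targets[0] == data[72][1])  # True
```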
Going back to our problem, we said we wanted to use data going back a certain amount of time (data points in our timeseries) to predict the weather in the future (defined by the variable delay). Let’s create the data sets we need to feed our recurrent neural network:
# How many timesteps back the input data should go.
# Note: one timestep is 10 minutes, so 1440 timesteps = 10 days
lookback = 1440
# The period, in timesteps, at which you sample data. Set to 6 in
# order to draw one data point every hour.
step = 6
# How many timesteps in the future the target should be.
# Note: 144 timesteps = 24 hours
delay = 144
# The number of samples per batch.
batch_size = 128
# Generate data
train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000, shuffle=True,
                      step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000, step=step,
                    batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=batch_size)
# How many steps to draw from val_gen in order to see the entire validation set.
# Each step consumes one batch of batch_size samples, hence the division.
val_steps = (300000 - 200001 - lookback) // batch_size
# How many steps to draw from test_gen in order to see the entire test set.
test_steps = (len(float_data) - 300001 - lookback) // batch_size
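Plugging in the actual numbers makes the step counts concrete (each generator call yields one batch of batch_size samples, hence the floor division by batch_size):

```python
lookback = 1440
batch_size = 128
n_total = 420551  # rows in the Jena csv

# Validation rows span indices 200001..300000; test rows run to the end
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (n_total - 300001 - lookback) // batch_size

print(val_steps)   # 769
print(test_steps)  # 930
```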
Establishing a Common Sense Non Machine Learning Baseline
Before running any model you want to establish a baseline. Baselines are useful for a couple of reasons: first, a baseline gives you something to compare the performance of your machine learning algorithm against. If your computationally heavy machine learning algorithm does not beat a simple common-sense heuristic, well…then there’s not really a point in using a machine learning algorithm. Second, baselines serve as a sanity check.
Since we are curious about predicting the temperature in the future, a common sense non machine learning baseline is to assume that the temperature 24 hours from now will equal the temperature right now.
We can score this heuristic with the mean absolute error (MAE). Here is a code snippet example:
np.mean(np.abs(preds - targets))
And the associated code snippet for the entire loop:
batch_maes = []
for step in range(val_steps):
    samples, targets = next(val_gen)
    # Baseline prediction: the last temperature in each input sequence
    preds = samples[:, -1, 1]
    mae = np.mean(np.abs(preds - targets))
    batch_maes.append(mae)
print(np.mean(batch_maes))
Running the code snippet above gives an MAE of 0.29. This number doesn’t make intuitive sense on its own, because our data was normalized to be centered at 0 with a standard deviation of 1. Translating it back into degrees by multiplying by the temperature column’s standard deviation, we have an absolute error of roughly 2.57 °C.
Which is not terrible, but it’s not great. We hope that our machine learning approach will do better. Let’s find out.
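Converting the normalized MAE back into degrees is just a multiplication by the temperature column’s training standard deviation. A sketch (the std value here is illustrative, roughly what the Jena temperature column gives):

```python
normalized_mae = 0.29   # baseline MAE on the normalized data
temperature_std = 8.85  # illustrative: std of the temperature column, in deg C

# Undo the normalization: error in original units = error * std
celsius_mae = normalized_mae * temperature_std
print(celsius_mae)  # roughly 2.57 deg C
```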
A Basic Machine Learning Approach
I would suggest executing the next block but then interrupting the kernel once you get a flavor of how long it will take to run.
Because of this, I’ve provided screenshots for the remaining parts of the notebook showing what you would see if you ran everything:
model = Sequential()
# input shape: (240, 14)
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
steps_per_epoch=100,
epochs=20,
validation_data=val_gen,
validation_steps=val_steps)
Interpreting the Results of a Basic Machine Learning Algorithm
Let’s plot the loss curves for both the training and validation data sets:
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(loss))
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
What do we see? On the x-axis we once again have epochs, and on the y-axis we see the output of our loss function, the mean absolute error. Looking at the training loss we note that it decreases with each epoch, i.e. the function we are trying to minimize is actually being minimized.
But we can’t celebrate just yet. Arguably more important is the validation loss. We would hope that it, too, gets smaller with each epoch. Instead, we see that our validation loss starts increasing after roughly epoch 5.
A Blurb on the Hypothesis Space
Interestingly enough, our ‘common sense’ approach earlier gave us an MAE of 0.29, which is actually better than our results here. Why is this? More specifically, why didn’t the network find the same hypothesis we used for the common sense approach?
Remember that the overall objective of a machine learning algorithm is to find the mapping function/hypothesis between an independent variable X and a dependent variable y that best minimizes the cost function. The hypothesis space, where our mapping function lives, is the space of all possible 2-layer networks with the configuration we defined. It may simply be that the hypothesis we wrote down by hand doesn’t live in the space our network can search.
Let’s now move on to the main purpose of this notebook: using a recurrent neural network.
Building a RNN with Keras
The simplest RNN we can use with Keras is literally called SimpleRNN.
For example:
from keras.models import Sequential
from keras import layers
from keras.layers import Embedding, Dense, SimpleRNN
model = Sequential()
model.add(SimpleRNN(10, input_shape=(3, 1)))
model.add(Dense(1))
Let’s look at this line by line:
Line 5: Defined our model architecture using the Sequential class.
Line 6: Added our RNN layer (which also serves as our input layer).
Line 7: Added a fully connected (i.e. Dense) layer as our output layer.
The model.summary() function is a convenient way of checking what our deep neural network looks like textually. It provides key information about our architecture, such as:
the layer type and the order of the layers from input (first row) to output (bottom row before the ‘=’);
the shape of the tensor for each output (and thus, what is going into the next layer);
and the number of weights (here labeled ‘parameters’) per layer along with a summary of the total number of weights.
For example:
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn_1 (SimpleRNN) (None, 10) 120
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________
What do we see? The first line is our header, where Output Shape is the shape of the tensor that leaves our first layer (SimpleRNN) and goes into the next layer, Dense (i.e. a fully connected layer).
In the next line we see that we have an output shape of (None, 10) and 120 parameters.
What does this mean? When we wrote line 6, model.add(SimpleRNN(10, input_shape=(3, 1))), we specified that we wanted 10 units and an input shape of (3, 1). The 3 here means each input sequence has 3 timesteps (e.g. three timeseries points) and the 1 means there is 1 feature (e.g. temperature).
The formula for the number of parameters/weights is num_units × num_units + num_units × num_features + num_units.
Note: a full explanation of these parameters is given below.
Finally, we have our output layer, which in this example we defined as a Dense layer. Since we wrote Dense(1), we are telling our neural network that we want a single output. This last Dense layer takes its 10 inputs (the output of the previous layer), assigns a weight to each, and adds one bias, giving us 10 + 1 = 11 parameters/weights.
A Recurrent Neural Network Baseline
Just like in our previous notebook, we’ll create our deep neural network by first defining our model architecture with Keras’ Sequential class. We’ll make one key change: instead of a SimpleRNN we’ll use a GRU layer. For brevity I won’t go into the details of what GRUs are (if you are curious to learn more, I would suggest the GRU blog post in the Suggested Reading below). But basically, GRUs are a variation of the recurrent neural network designed to mitigate the vanishing gradient problem. The link to the paper describing this, by Cho et al., can be found below in the Suggested Reading section.
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
# Note that our output layer is defined to have a single value(i.e. a temperature)
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
steps_per_epoch=500,
epochs=20,
validation_data=val_gen,
validation_steps=val_steps)
What does the above show us? For each epoch, Keras prints out:
1) how long the epoch took (e.g. about 10 seconds in this case), and
2) the loss on the training and validation data sets.
The second point here is worth spending some time thinking about. Remember that the overall objective is to create an algorithm which learns from the data we give it, i.e. we want our algorithm to generalize to data sets it has never seen before. We should expect, therefore, that the training loss decreases with every epoch. Does this happen in our case? Plotting the training and validation loss will help us understand a little more about how well our deep neural network did:
Interpreting Results
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(loss))
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
What do we see? On the x-axis we once again have epochs, and on the y-axis we see the output of our loss function, the mean absolute error (which we chose when we compiled our RNN). Looking at the training loss we note that it decreases with each epoch, i.e. the function we are trying to minimize is actually being minimized, just like in our basic machine learning approach.
Once again, however, our validation loss increases after a certain number of epochs.
The fact that our validation loss increases tells us two important things about our model: first, our algorithm did not generalize well beyond the training data, and second, it was starting to overfit on the training set.
That’s a bit disappointing…but don’t fret! As explained in Chollet’s Deep Learning with Python [1], there are some tricks we can use to help stop and/or prevent overfitting; namely, we can take advantage of dropout, regularization, and early stopping. I won’t go into the details here, but I would suggest reading Skalski’s blog post for more information.
More on How the Weights Are Computed for a RNN:
From the output above we have 120 parameters. Why do we have 120 parameters?
Remember, there are two things going on in our simple RNN. First, there is the recurrent loop, where the state is fed back into the model to generate the next step. The weights for the recurrent step number num_units × num_units.
Second, there is a new input from your sequence at each step, contributing num_units × num_features weights.
So now we have the weights; what’s missing are the biases - one bias for every unit, i.e. num_units biases.
In our case we have num_units = $10$ and num_features = $1$.
Putting this all together, the formula for the number of parameters/weights is:
num_units × num_units + num_units × num_features + num_units = 100 + 10 + 10 = 120
Where num_units is the number of units in the RNN (10) and num_features is the number of features of our input (in this case 1).
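The formula above is easy to wrap in a couple of throwaway helpers and check against the model.summary() output (a sketch; these are not Keras functions, just plain arithmetic):

```python
def simple_rnn_params(num_units, num_features):
    """Recurrent weights + input weights + one bias per unit."""
    return num_units * num_units + num_units * num_features + num_units

def dense_params(num_inputs, num_outputs):
    """One weight per input per output, plus one bias per output."""
    return num_inputs * num_outputs + num_outputs

rnn = simple_rnn_params(10, 1)  # 120, matching simple_rnn_1 in the summary
dense = dense_params(10, 1)     # 11, matching dense_1
print(rnn, dense, rnn + dense)  # 120 11 131 (the 'Total params' line)
```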
About this Notebook
The above example and code is from Ch.6 of Chollet’s Deep Learning with Python[1]. Content was added for further clarification and readability.
Suggested Reading Material
Cho, et al. (2014), “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” Association for Computational Linguistics https://arxiv.org/pdf/1406.1078v3.pdf
Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques for Building Intelligent Systems. O’Reilly UK Ltd.
Karpathy, Andrej. “The Unreasonable Effectiveness of Recurrent Neural Networks” Andrej Karpathy blog, 21 May 2015, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Kostadinov, Simeon. “Understanding GRU Networks” Towards Data Science, 16 Dec. 2017, https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
Skalski, P. “Preventing Deep Neural Network from Overfitting” Towards Data Science, 7 Sept. 2018 https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a
Sources
[1]Chollet, F. (2017). Deep Learning with Python. Manning Publications.