
LSTM multi-step hyperparameter optimization with Keras Tuner

Our black box is tuned by someone else, nice. But this also takes a lot of time and has a large carbon footprint, not so nice.

13 February 2022 Updated 2 March 2022
Image credit: https://www.pexels.com/nl-nl/@life-of-pix

A previous post was about Hyperparameter optimization with Talos. I could not get this to work with my LSTM model for univariate multi-step time series forecasting, because of the 3D input, so I switched to Keras Tuner.
In this post I try to predict the next period of a sine wave using the Hyperband tuning algorithm. To reduce tuning time, I limited both the number of hyperparameters to optimize and the possible values for each parameter.
Spoiler alert: always include a hyperparameter tuner like Keras Tuner in your code; it will save you a lot of time.

Keep the hyperparameters in one place

Most Keras Tuner examples on the internet have the hyperparameters scattered through the model code. I want them in one place, so I added some classes to handle this. At the moment I only use the 'Choice' option for the hyperparameters, which takes a list of discrete values. Here is how I did this:

# tuner parameters
class TunerParameter:
    def __init__(self, name, val):
        self.name = name
        self.val = val
        self.args = (name, val)

class TunerParameters:
    def __init__(self, pars=None):
        self.tpars = []
        for p in pars:
            tpar = TunerParameter(p[0], p[1])
            self.tpars.append(tpar)
            setattr(self, p[0], tpar)
    def get_pars(self):
        return self.tpars

# hyperparameters, name and value, are the inputs for hp.Choice()
tpars = TunerParameters(
    pars=[
        ('first_neuron', [64, 128]),
        ('second_neuron', [64, 128]),
        ('third_neuron', [64, 128]),
        ('learning_rate', [1e-4, 1e-2]),
        ('batch_size', [8, 32]),
    ],
)

Now we can refer to a parameter, for example:

tpars.learning_rate.args
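
The args attribute is just the (name, values) tuple, so it can be unpacked straight into hp.Choice(). Below is a minimal sketch of the equivalence; the hp object is normally provided by Keras Tuner inside the model-building function, here we create one only for illustration.

import keras_tuner as kt

# 'hp' is normally created by Keras Tuner and passed to the model-building
# function; we create one here only to illustrate the equivalence
hp = kt.HyperParameters()

# the two calls below register the same hyperparameter
lr_a = hp.Choice(*tpars.learning_rate.args)
lr_b = hp.Choice('learning_rate', [1e-4, 1e-2])
print(lr_a, lr_b)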

Hyperband and the batch_size parameter

By default you cannot specify a 'Choice' for the batch_size parameter. To do this we must subclass the Hyperband class, as described in 'How to tune the number of epochs and batch_size? #122', see links below. We do not tune the number of epochs because Hyperband sets the epochs to train for via its own logic. The extra hyperparameters we pass in for tuning must be popped from the keyword arguments before they reach the Hyperband constructor.

# subclass tuner to add hyperparameters (here, batch_size)
class MyTuner(kt.tuners.Hyperband):
    def __init__(self, *args, **kwargs):
        self.tpars = None
        if 'tpars' in kwargs:
            self.tpars = kwargs.pop('tpars')
        super(MyTuner, self).__init__(*args, **kwargs)

    def run_trial(self, trial, *args, **kwargs):
        if self.tpars is not None:
            for tpar in self.tpars:
                kwargs[tpar.name] = trial.hyperparameters.Choice(tpar.name, tpar.val)
        return super(MyTuner, self).run_trial(trial, *args, **kwargs)

Input data generation and processing

We generate the time series data using NumPy's sin() function. We split the input data into training data and test data: 4 periods for the training data and 2 periods for the test data, and then we predict the next period. The training data is also used for validation (validation_split=0.33). Note that we do not (!) shuffle the input data when splitting.

We use n_steps_in=5 input values. To predict a full period we set the number of prediction steps, n_steps_out, equal to the number of datapoints in a period. Here we use 20 datapoints per period. For more information about LSTM univariate multi-step forecasting, check the post 'How to Develop LSTM Models for Time Series Forecasting', see links below.

With n_steps_in=5 and n_steps_out=20 we have a block of 25 (n_steps_in + n_steps_out) datapoints that slides over all the datapoints. This means that after conversion we have 120 - 25 + 1 = 96 input samples, and the first predicted value is at offset n_steps_in = 5 datapoints.

We make a prediction with the last n_steps_in values of our dataset; this is illustrated below.

<--------------------- input data -------------------->

<----------- training data ---------><---test data --->

    0        1        2        3        4        5    
|--------|--------|--------|--------|--------|--------|
                                       
0                                                    120 datapoints


blocks of n_steps_in + n_steps_out

.....iiiioooooooooo....................................


to make a prediction we need the last n_steps_in values: 

                                                           6
                                                      |--------|
                                       
                                                  iiii
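
To double-check the bookkeeping of the sliding window: the number of samples is len(sequence) - (n_steps_in + n_steps_out) + 1. Here is a minimal sanity check, using only the numbers from this example:

# sliding-window bookkeeping for the sine wave example
n_datapoints = 6 * 20                 # 6 periods of 20 datapoints each
n_steps_in, n_steps_out = 5, 20

# each sample consumes a block of n_steps_in + n_steps_out consecutive points
n_samples = n_datapoints - (n_steps_in + n_steps_out) + 1
print(n_samples)                      # 96

# the first predicted value is at offset n_steps_in,
# the last one at n_datapoints - n_steps_out
print(n_steps_in, n_datapoints - n_steps_out)   # 5 100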

The code

There is not much more to say about the code. We use EarlyStopping; I still do not know whether this is good or bad practice. It reduces the total time of a run, but we may miss the best parameters. Then again, this is how Deep Learning optimization usually goes: we give some values and accept a result that is 'good enough'. Do not forget to run the code with other input data.

# optimizing hyperparameters with keras tuner
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import Adam
import keras_tuner as kt
import numpy as np
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

# tuner parameters
class TunerParameter:
    def __init__(self, name, val):
        self.name = name
        self.val = val
        self.args = (name, val)

class TunerParameters:
    def __init__(self, pars=None):
        self.tpars = []
        for p in pars:
            tpar = TunerParameter(p[0], p[1])
            self.tpars.append(tpar)
            setattr(self, p[0], tpar)
    def get_pars(self):
        return self.tpars

# hyperparameters, name and value, are the inputs for hp.Choice()
tpars = TunerParameters(
    pars=[
        ('first_neuron', [64, 128]),
        ('second_neuron', [64, 128]),
        ('third_neuron', [64, 128]),
        ('learning_rate', [1e-4, 1e-2]),
        ('batch_size', [8, 32]),
    ],
)

# split a univariate sequence into samples
# see:
# How to Develop LSTM Models for Time Series Forecasting
# https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting
def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

# choose a number of time steps, prediction = 20 steps
n_steps_in = 5
n_steps_out = 20

# train:test = 2:1 (0.33 split)
n_periods_train = 4
n_periods_test = 2
# total periods
n_periods = n_periods_train + n_periods_test
# data points per period, predict a full period (n_steps_out)
period_points = n_steps_out
# generate sine wave data points
xs = np.linspace(0, n_periods * 2 * np.pi, n_periods * period_points)
raw_seq = np.sin(xs)

# plot the input data
dp = [i for i in range(len(raw_seq))]
fig = go.Figure()
fig.add_trace(go.Scattergl(y=raw_seq, x=dp, name='Sin'))
fig.update_layout(height=500, width=700, xaxis_title='datapoints', yaxis_title='Sine wave')
fig.show()

'''
# another dataset you may want to try
raw_seq = []
for n in range(0, n_periods * period_points):
    raw_seq.append(n)
print('raw_seq = {}'.format(raw_seq))
print('len(raw_seq) = {}'.format(len(raw_seq)))
'''

# split into samples
X, y = split_sequence(raw_seq, n_steps_in, n_steps_out)
# reshape from [samples, timesteps] into [samples, timesteps, features]
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))

X_len = len(X)
y_len = len(y)
print('X_len = {}'.format(X_len))
print('y_len = {}'.format(y_len))

x_pred = np.array(raw_seq[-1*n_steps_in:])
x_pred = x_pred.reshape((1, n_steps_in, n_features))

class DLM:
    
    def __init__(
        self,
        n_steps_in=None,
        n_steps_out=None,
        n_features=None,
        tpars=None,
    ):
        self.n_steps_in = n_steps_in
        self.n_steps_out = n_steps_out
        self.n_features = n_features
        self.tpars = tpars
        # input_shape
        self.layer_input_shape = (self.n_steps_in, self.n_features)
        print('layer_input_shape = {}'.format(self.layer_input_shape))

    def data_split(self, X, y):
        X0, X1, y0, y1 = train_test_split(X, y, test_size=0.33, shuffle=False)
        return X0, X1, y0, y1

    def get_model(self, hp):
        model = Sequential()
        # add layers
        model.add(LSTM(
            hp.Choice(*self.tpars.first_neuron.args),
            activation='relu',
            return_sequences=True,
            input_shape=self.layer_input_shape,
        ))
        model.add(LSTM(
            hp.Choice(*self.tpars.second_neuron.args),
            activation='relu',
            return_sequences=True,
        ))
        model.add(LSTM(
            hp.Choice(*self.tpars.third_neuron.args),
            activation='relu',
        ))
        model.add(Dense(n_steps_out))

        # compile
        model.compile(
            optimizer=Adam(learning_rate=hp.Choice(*self.tpars.learning_rate.args)),
            loss='mean_absolute_error',
            metrics=['mean_absolute_error'],
            run_eagerly=True,
        )
        return model

    def get_X_pred(self, x):
        # take the first value of each timestep and reshape for the model
        a = [x[i][0] for i in range(0, self.n_steps_in)]
        X_pred = np.array(a)
        X_pred = X_pred.reshape((1, self.n_steps_in, self.n_features))
        return X_pred

    def get_plot_data(self, model, X, y, x_offset):
        x_plot = []
        y_plot = []
        y_predict_plot = []
        for i, x in enumerate(X):
            y_plot.append(y[i][0])
            X_pred = self.get_X_pred(x)
            predictions = model.predict(X_pred)
            y_predict_plot.append(predictions[0][0])
        x_plot = [x + x_offset for x in range(len(X))]
        return x_plot, y_plot, y_predict_plot


# subclass tuner to add hyperparameters (here, batch_size)
class MyTuner(kt.tuners.Hyperband):
    def __init__(self, *args, **kwargs):
        self.tpars = None
        if 'tpars' in kwargs:
            self.tpars = kwargs.pop('tpars')
        super(MyTuner, self).__init__(*args, **kwargs)

    def run_trial(self, trial, *args, **kwargs):
        if self.tpars is not None:
            for tpar in self.tpars:
                kwargs[tpar.name] = trial.hyperparameters.Choice(tpar.name, tpar.val)
        return super(MyTuner, self).run_trial(trial, *args, **kwargs)


dlm = DLM(
    n_steps_in=n_steps_in,
    n_steps_out=n_steps_out,
    n_features=n_features,
    tpars=tpars,
)

# split data 
X_train, X_test, y_train, y_test = dlm.data_split(X, y)
print('len(X_train) = {}, len(X_test) = {}, X_train.shape = {}'.format(len(X_train), len(X_test), X_train.shape))

# use subclassed HyperBand tuner
tuner = MyTuner(
    dlm.get_model,
    objective=kt.Objective('val_mean_absolute_error', direction='min'),
    allow_new_entries=True,
    tune_new_entries=True,
    hyperband_iterations=2,
    max_epochs=260,
    directory='keras_tuner_dir',
    project_name='keras_tuner_demo',
    tpars=[tpars.batch_size],
)

print('\n{}\ntuner.search_space_summary()\n{}\n'.format('-'*60, '-'*60))
tuner.search_space_summary()

tuner.search(
    X_train,
    y_train,
    validation_split=0.33,
    callbacks=[EarlyStopping('val_loss', patience=3)]
)

print('\n{}\ntuner.results_summary()\n{}\n'.format('-'*60, '-'*60))
tuner.results_summary()

print('\n{}\nbest_hps\n{}\n'.format('-'*60, '-'*60))
best_hps = tuner.get_best_hyperparameters()[0]
for tpar in tpars.get_pars():
    print('- {} = {}'.format(tpar.name, best_hps[tpar.name]))

h_model = tuner.hypermodel.build(best_hps)
print('\n{}\nh_model.summary()\n{}\n'.format('-'*60, '-'*60))
h_model.summary()

# fit the best model on the test data and plot the error curves
num_epochs = 100
history = h_model.fit(X_test, y_test, epochs=num_epochs, validation_split=0.33)

fig = go.Figure()
fig.add_trace(go.Scattergl(y=history.history['mean_absolute_error'], name='Test'))
fig.add_trace(go.Scattergl(y=history.history['val_mean_absolute_error'], name='Valid'))
fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Mean Absolute Error')
fig.show()

# plot train using predict(), test using predict(), and actual prediction
x_offset = n_steps_in
x_train_plot, y_train_plot, y_train_predict_plot = dlm.get_plot_data(h_model, X_train, y_train, x_offset)
x_offset += len(x_train_plot)
x_test_plot, y_test_plot, y_test_predict_plot = dlm.get_plot_data(h_model, X_test, y_test, x_offset)
x_offset += len(x_test_plot)
# use x_pred to get the prediction and show the n_steps_out
predictions = h_model.predict(x_pred)
y_predict_plot = predictions[0]
x_offset = n_periods * period_points
x_predict_plot = [x + x_offset for x in range(n_steps_out)]

fig = go.Figure()
fig.add_trace(go.Scattergl(y=y_train_predict_plot, x=x_train_plot, name='Train-pred'))
fig.add_trace(go.Scattergl(y=y_test_predict_plot, x=x_test_plot, name='Test-pred'))
fig.add_trace(go.Scattergl(y=y_predict_plot, x=x_predict_plot, name='Prediction'))
fig.update_layout(height=500, width=700, xaxis_title='datapoints', yaxis_title='Train, Test and Prediction')
fig.show()

h_eval_dict = h_model.evaluate(X_test, y_test, return_dict=True)
print('h_eval_dict = {}.'.format(h_eval_dict))

Results

I did not add plots to this post; if you want to see them, run the code. Note that in the final plot we show the values predicted for the training data and the test data. The predicted training data starts at datapoint n_steps_in = 5 and the predicted test data ends at datapoint (n_periods * period_points) - n_steps_out = 120 - 20 = 100, which is correct. The requested prediction starts at datapoint 120 and has 20 steps.

The sine wave prediction result is not bad for such a small dataset. When I tried a slope (see the commented-out dataset in the code), the prediction had much more noise. I guess this has to do with the limited number of parameters in the optimization, and with the fact that the predicted values of the sine wave are in the same range as the training and test data. Still, the prediction for the slope looks reasonable; with multi-step forecasting we get more values and can see a trend.

Energy consumption and neural network tuning

A lot of today's energy consumption goes into systems tuning neural networks. The article 'It takes a lot of energy for machines to learn – here's why AI is so power-hungry', see links below, mentions that training BERT (Bidirectional Encoder Representations from Transformers) just once has the carbon footprint of a passenger flying a round trip between New York and San Francisco. By training and tuning multiple times, the cost became the equivalent of 315 passengers, or an entire 747 jet.

How many neural networks are trained every day? And how many are like BERT? Let's make a calculation based on 12,000 BERT-like networks being optimized every two months (I believe it is much more). Then we have (12,000 / 60 =) 200 full 747s flying a round trip between New York and San Francisco every day! And then there are also the smaller networks. Hmmm ....

Summary

Hyperparameter tuning turned out not to be difficult with Keras Tuner. In fact, I like it very much, because I want the neural network to be a black box. With Keras Tuner the switches and knobs are still there, but somebody else is adjusting them. That is nice. But the tuning process takes a lot of time and energy. Unfortunately, there is no other way at the moment.

Links / credits

How to Develop LSTM Models for Time Series Forecasting
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting

How to tune the number of epochs and batch_size? #122
https://github.com/keras-team/keras-tuner/issues/122

It takes a lot of energy for machines to learn – here's why AI is so power-hungry
https://theconversation.com/it-takes-a-lot-of-energy-for-machines-to-learn-heres-why-ai-is-so-power-hungry-151825

KerasTuner API
https://keras.io/api/keras_tuner
