
Predicting values using Deep Learning and Keras

Using Keras we can create a Deep Learning black box that can predict future values

28 January 2022 Updated 28 January 2022

I have a dataset with many rows, each with N inputs and 1 output, and I want to predict the output value for any new combination of input values. I am also a data science noob, but stories on the internet about Deep Learning suggest we can easily create some kind of black box with some neurons, or nodes, in it, and then use the dataset to train that black box. After this we can feed it any inputs and our black box gives us output values with some error margin. Sounds too good to be true? Is it really this easy?

Start with KISS (Keep It Simple Stupid)

When you search the internet for 'python data prediction' you get a lot of hits, of course. There are many examples I like, but most are far too complex for a noob. Predicting stock prices using time series: too complex. Predicting house prices: much better. There is also a Wine Quality dataset. I like wine, so that could be fun, but maybe another time.

In this post I try to solve a trivial problem. I replace the function:

y = x0 + 2*x1

by:

  • a Linear Regression model, and
  • a Deep Learning model, in fact a neural network: a black box

A huge advantage is that I can generate the dataset myself. This also means I can compare predictions against expected values.

Machine Learning: Regression vs classification

To choose the right algorithm, it is important to first understand whether the Machine Learning task is a regression or a classification problem. Regression and classification are both types of Supervised Machine Learning algorithms. Supervised Machine Learning uses datasets with known output values to train a model that then makes predictions.

Regression is an algorithm that can be trained with a dataset to predict outputs that are numeric values. Classification is an algorithm that can be trained with a dataset to predict outputs that are labels or categories, usually 0s and 1s.

Regression example

We have a dataset consisting of rows with 2 inputs and 1 output; the output value is a number that can have 'any' value. Consider a housing dataset where the inputs are the number of persons that can live in the house and the number of rooms, and the output is the price.

 persons | rooms | price
---------+-------+--------
  5      |  4    | 20.000
  3      |  2    | 24.000

If we have a large enough dataset we can use a regression algorithm to predict the price for any combination of persons and rooms. The output will be values like 22.500, 18.100, etc.

Classification example

We have a dataset consisting of rows with 3 inputs and 1 output; the output value is a number that is 0 or 1. Consider a housing dataset where the inputs are the number of persons that can live in the house, the number of rooms and the price, and the output is 0 or 1 depending on whether the house is favored (liked) by people searching for a house on a website.

 persons | rooms | price   | liked
---------+-------+---------+--------
  5      |  4    | 20.000  | 0
  3      |  2    | 24.000  | 1

If we have a large enough dataset we can use a classification algorithm to predict if a house is liked for any combination of persons, rooms and price. The output will be 0 (not liked), or 1 (liked).
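
Since the rest of this post only works out the regression case, here is a minimal classification sketch using scikit-learn's LogisticRegression on a few made-up rows shaped like the table above. The rows and the model choice are just an illustration of mine, not part of the project:

# Classification sketch (illustrative): LogisticRegression on made-up rows
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical dataset: columns are [persons, rooms, price], target is liked
# in practice you would scale these inputs first
X = np.array([
    [5, 4, 20000],
    [3, 2, 24000],
    [4, 3, 19000],
    [2, 1, 30000],
])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# predict 0 (not liked) or 1 (liked) for a new house
print(clf.predict(np.array([[3, 2, 21000]])))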

Underfitting and overfitting

Our dataset is split into a training dataset and a test dataset. Using these we can determine how well or how poorly our model or algorithm performs.

Underfitting means that the model has not been able to determine the relevant relations in the data. Probably the model is too simple. An example is trying to represent nonlinear relations with a linear model.

Underfitting:

  • Works poorly with training data
  • Works poorly with test data

Overfitting means that the model has also learned relations between the data and random fluctuations (noise). Probably the model is too complex; it learns too well.

Overfitting:

  • Works well with training data
  • Works poorly with test data

Both cases can also occur when there is not enough data in our dataset.
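
To make this concrete, here is a minimal sketch of my own showing underfitting: a linear model fitted to quadratic data scores poorly on both the training and the test data. An overfit model would instead score high on the training data and poorly on the test data.

# Underfitting sketch (illustrative): a linear model on quadratic data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0]**2 + rng.normal(0, 0.1, size=100)  # nonlinear relation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# both scores are poor: the model is too simple for the data
print('training score = {}'.format(model.score(X_train, y_train)))
print('test score = {}'.format(model.score(X_test, y_test)))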

Linear regression

I suggest you read through the examples in the great tutorial 'Linear Regression in Python', see links below. Below I show my version with two inputs, training and test data, and predicting a value. Because the function we are modeling is itself linear, it is not surprising that the coefficient of determination is 1.0, and that we only need a very small amount of training data.

# Linear regression: y = x0 + 2*x1
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# generate dataset
X_items = []
y_items = []
for x0 in range(0, 3):
    for x1 in range(3, 5):
        y = x0 + 2*x1
        X_items.append([x0, x1])
        y_items.append(y)

X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)
print('X = {}'.format(X))
print('y = {}'.format(y))

# split dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train = {}'.format(X_train))
print('y_train = {}'.format(y_train))
print('X_test = {}'.format(X_test))
print('y_test = {}'.format(y_test))

# create model
model = LinearRegression()

# mess up a training value
#X_train[0][0] += 2

# calculate optimal values of the intercept and the slopes
model.fit(X_train, y_train)

# show results
print('model result:')
print('- intercept (b0) = {}'.format(model.intercept_))
print('- slope (b1) = {}'.format(model.coef_))
print('- coefficient of determination for training data = {}'.
    format(model.score(X_train, y_train)))
print('- coefficient of determination for test data = {}'.
    format(model.score(X_test, y_test)))

x = np.array([8, 9]).reshape((-1, 2))
y_pred = model.predict(x)
print('predicted response for x = {}: {}'.format(x, y_pred))

The script gives the following output:

X = [[0 3]
 [0 4]
 [1 3]
 [1 4]
 [2 3]
 [2 4]]
y = [ 6  8  7  9  8 10]
X_train = [[2 3]
 [0 3]
 [1 4]
 [2 4]]
y_train = [ 8  6  9 10]
X_test = [[1 3]
 [0 4]]
y_test = [7 8]
model result:
- intercept (b0) = -3.552713678800501e-15
- slope (b1) = [1. 2.]
- coefficient of determination for training data = 1.0
- coefficient of determination for test data = 1.0
predicted response for x = [[8 9]]: [26.]

To create a little trouble I uncommented the line above that messes up a training value. This gives the result:

model result:
- intercept (b0) = -2.3529411764705817
- slope (b1) = [0.52941176 2.76470588]
- coefficient of determination for training data = 0.9865546218487395
- coefficient of determination for test data = -0.5570934256055304
predicted response for x = [[8 9]]: [26.76470588]

By adding more training data we get a better fit.
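
For example (a variation of mine on the dataset loop above), widening the input ranges gives the model many more rows to average out the messed-up value:

# generate more training data by widening the input ranges
X_items = []
y_items = []
for x0 in range(0, 10):
    for x1 in range(0, 10):
        X_items.append([x0, x1])
        y_items.append(x0 + 2*x1)
# 100 rows instead of 6; one messed-up value now barely moves the fit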

Deep Learning with Keras

This should be more like the black box approach: I just add some neurons and layers and that should be it. But is it really? Here I use Keras because it appears to be very popular. The following example was very helpful: 'Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression', see links below. The plotting of both loss and mean absolute error made much sense.

We will do the following steps:

  • Load the data
  • Define the model
  • Compile the model
  • Train (fit) the model
  • Evaluate the model
  • Make predictions

Load the data: here we generate the data ourselves, see also above.

Define the model: the first Dense layer needs the input_shape parameter set. I started with 100 neurons in the first layer, 50 in the second and 25 in the third. Why? I have no idea, I have not looked into this yet.

Compile the model: not much to say about this.

Train (fit) the model: we are using the validation_split parameter; without it we cannot plot the validation curves. When validation_split is specified, part of the training data is used for validation. The data used for validation during fit can change between runs; it is probably better to use fixed validation data, otherwise we get a different fit every time, see the sketch below.
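
A sketch of that variation (my addition): Keras fit() also accepts a validation_data parameter, so we can split off a fixed validation set once. It assumes X_train, y_train and model as in the script below:

# split off a fixed validation set once, instead of using validation_split
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=1)

# the same rows are now used for validation on every run
history = model.fit(X_train, y_train, epochs=100, verbose=0,
                    validation_data=(X_val, y_val))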

Evaluate the model: we use the test data here.

Make predictions: simply use model.predict().

Of course we need more training data here, but not really that much more. I also added code to save and load the model. Below is the code I created.

# Keras deep learning: y = x0 + 2*x1
from keras.models import Sequential, load_model
from keras.layers import Dense
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

use_saved_model = False
#use_saved_model = True

# create dataset
def fx(x0, x1):
    return x0 + 2*x1

X_items = []
y_items = []
for x0 in range(0, 18, 3):
    for x1 in range(2, 27, 3):
        y = fx(x0, x1)
        X_items.append([x0, x1])
        y_items.append(y)

X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)
print('X = {}'.format(X))
print('y = {}'.format(y))
X_data_shape = X.shape
print('X_data_shape = {}'.format(X_data_shape))

class DLM:
    
    def __init__(
        self,
        model_name='default_model',
    ):
        self.model_name = model_name
        self.dense_input_shape=(2, )
        self.dense_neurons = [100, 50, 25]
        self.fit_params = {
            'epochs': 100, 
            'validation_split': 0.1, 
            'verbose': 0,
        }

    def data_split_train_test(
        self,
        X,
        y,
    ):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=1)
        print('self.X_train = {}'.format(self.X_train))
        print('self.X_test = {}'.format(self.X_test))
        print('self.y_train = {}'.format(self.y_train))
        print('self.y_test = {}'.format(self.y_test))

        print('training data row count = {}'.format(len(self.y_train)))
        print('test data row count = {}'.format(len(self.y_test)))

        X_train_data_shape = self.X_train.shape
        print('X_train_data_shape = {}'.format(X_train_data_shape))

    def get_model(
        self,
    ):
        self.model = Sequential()
        self.model.add(Dense(self.dense_neurons[0], input_shape=self.dense_input_shape, activation='relu', name='dense_input'))
        for i, n in enumerate(self.dense_neurons[1:]):
            self.model.add(Dense(n, activation='relu', name='dense_hidden_' + str(i)))
        self.model.add(Dense(1, activation='linear', name='dense_output'))
        self.model.compile(optimizer='adam', loss='mse', metrics=['mean_absolute_error'])
        self.model_summary()
        return self.model

    def model_summary(
        self,
    ):
        self.model.summary()

    def train(
        self, 
        model,
        plot=False,
    ):
        history = model.fit(self.X_train, self.y_train, **self.fit_params)

        if plot:

            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['loss'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_loss'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Loss')
            fig.show()

            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['mean_absolute_error'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_mean_absolute_error'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Mean Absolute Error')
            fig.show() 

        return history

    def evaluate(
        self, 
        model,
    ):
        mse_nn, mae_nn = model.evaluate(self.X_test, self.y_test)
        print('Mean squared error on test data: ', mse_nn)
        print('Mean absolute error on test data: ', mae_nn)
        return mse_nn, mae_nn

    def predict(
        self,
        model,
        x0,
        x1,
        fx=None,
    ):
        x = np.array([[x0, x1]]).reshape((-1, 2))
        predictions = model.predict(x)
        expected = ''
        if fx is not None:
            expected = ', expected = {}'.format(fx(x0, x1))
        print('for x = {}, predictions = {}{}'.format(x, predictions, expected))
        return predictions

    def save_model(
        self,
        model,
    ):
        model.save(self.model_name)

    def load_saved_model(
        self,
    ):
        self.model = load_model(self.model_name)
        return self.model

# create & save, or use saved model
dlm = DLM()
if use_saved_model:
    model = dlm.load_saved_model()    
else:
    dlm.data_split_train_test(X, y)
    model = dlm.get_model()
    dlm.train(model, plot=True)
    dlm.evaluate(model)
    dlm.save_model(model)

# predict
dlm.predict(model, 4, 17, fx=fx)
dlm.predict(model, 23, 79, fx=fx)
dlm.predict(model, 40, 33, fx=fx)

The script gives the following output, excluding the input data:

training data row count = 37
test data row count = 17
X_train_data_shape = (37, 2)
2022-01-28 16:00:30.598860: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-01-28 16:00:30.598886: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-01-28 16:00:30.598911: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (myra): /proc/driver/nvidia/version does not exist
2022-01-28 16:00:30.599110: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_input (Dense)         (None, 100)               300       
                                                                 
 dense_hidden_0 (Dense)      (None, 50)                5050      
                                                                 
 dense_hidden_1 (Dense)      (None, 25)                1275      
                                                                 
 dense_output (Dense)        (None, 1)                 26        
                                                                 
=================================================================
Total params: 6,651
Trainable params: 6,651
Non-trainable params: 0
_________________________________________________________________
1/1 [==============================] - 0s 25ms/step - loss: 0.1018 - mean_absolute_error: 0.2752
Mean squared error on test data:  0.10178931057453156
Mean absolute error on test data:  0.27519676089286804
2022-01-28 16:00:33.549267: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
for x = [[ 4 17]], predictions = [[38.0433]], expected = 38
for x = [[23 79]], predictions = [[177.94098]], expected = 181
for x = [[40 33]], predictions = [[103.54724]], expected = 106

Improving Deep Learning performance

While writing this post I changed the number of neurons, the number of dense layers and the validation_split parameter. It all resulted in some changes, sometimes good, sometimes bad. The biggest improvement without doubt comes from adding more training data, but how much is enough?
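
One way to find out: train on growing subsets of the training data and watch when the test error stops improving. A sketch of mine, reusing the DLM class from above:

# learning curve sketch: test error versus number of training rows
# assumes dlm.data_split_train_test(X, y) was called first
for n in (10, 20, 30, len(dlm.y_train)):
    model = dlm.get_model()
    model.fit(dlm.X_train[:n], dlm.y_train[:n], epochs=100, verbose=0)
    mse, mae = model.evaluate(dlm.X_test, dlm.y_test, verbose=0)
    print('rows = {}, test mean absolute error = {}'.format(n, mae))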

Summary

The most important thing I learned is that Deep Learning requires a large dataset: the bigger, the better. Do I like the black box? Yes, and so far I did not really have to look inside. I can now use this as a starting point for some real world projects. There are also things left to explore, like normalizing the inputs (a sketch below) and adding weights to the inputs. Much more to read ...
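
As a first example, normalizing the inputs could look like this (a sketch of mine with scikit-learn's StandardScaler; the scaler, like the model, must be fitted on the training data only):

# normalization sketch: scale inputs before training
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)
# train and evaluate on the scaled data, and scale new inputs with the
# same scaler before calling model.predict()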

Links / credits

How to find the value for Keras input_shape/input_dim?
https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim

Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression
https://towardsdatascience.com/keras-101-a-simple-and-interpretable-neural-network-model-for-house-pricing-regression-31b1a77f05ae

Keras examples
https://keras.io/examples

Linear Regression in Python
https://realpython.com/linear-regression-in-python

Predictive Analysis in Python
https://medium.com/my-data-camp-journey/predictive-analysis-in-python-97ca5b64e97f

Regression Tutorial with the Keras Deep Learning Library in Python
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python
