# Predicting values using Deep Learning and Keras

Using Keras we can create a Deep Learning black box that can predict future values.

I have a dataset, many rows with N inputs and 1 output, and want to predict the output value for any new combination of input values. I am also a data science noob, but stories on the internet about Deep Learning suggest we can easily create some kind of black box with some neurons, nodes, in it, and then use the dataset to train the black box. After this we can feed any inputs and our black box gives us output values with some error margin. Sounds too good to be true? Is it really this easy?

## Start with KISS (Keep It Simple Stupid)

When you search the internet for 'python data prediction' you get a lot of hits, of course. There are many examples I like, but most are far too complex for a noob. Predicting stock prices using time series: too complex. Predicting house prices: much better. There is also a Wine Quality dataset. I like wine, so that could be a good one, maybe another time.

In this post I try to solve a trivial problem. I replace the function:

`y = x0 + 2*x1`

by:

- a Linear Regression model, and,
- a Deep Learning model (in fact a neural network), a black box

A huge advantage is that I can generate the dataset myself. This also means I can compare predictions against expected values.

## Machine Learning: Regression vs classification

To choose the right algorithm, it is important to first understand whether the Machine Learning task is a regression or a classification problem. Regression and classification are both types of Supervised Machine Learning. Supervised Machine Learning uses datasets with known output values to learn how to make predictions.

A regression algorithm is trained with a dataset to predict outputs that are numeric values. A classification algorithm is trained with a dataset to predict outputs that are labels or categories, often 0 and 1.

### Regression example

We have a dataset consisting of rows with 2 inputs and 1 output, where the output is a number that can have 'any' value. Consider a housing dataset where the inputs are the number of persons that can live in the house and the number of rooms, and the output is the price.

```
persons | rooms | price
---------+-------+--------
5 | 4 | 20.000
3 | 2 | 24.000
```

If we have a large enough dataset we can use a regression algorithm to predict the price for any combination of persons and rooms. The output will be a number such as 22.500 or 18.100.

### Classification example

We have a dataset consisting of rows with 3 inputs and 1 output, where the output is a number that is 0 or 1. Consider a housing dataset where the inputs are the number of persons that can live in the house, the number of rooms and the price, and the output is 0 or 1 depending on whether the house is liked by people searching for a house on a website.

```
persons | rooms | price | liked
---------+-------+---------+--------
5 | 4 | 20.000 | 0
3 | 2 | 24.000 | 1
```

If we have a large enough dataset we can use a classification algorithm to predict if a house is liked for any combination of persons, rooms and price. The output will be 0 (not liked), or 1 (liked).
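To make the difference concrete, here is a minimal sketch that uses the function from this post instead of housing data. The grid of inputs and the threshold of 8 are arbitrary choices for illustration only. The regression model predicts the number itself, the classification model predicts a 0/1 label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Inputs: every combination of x0 and x1 in 0..5 (illustrative grid)
X = np.array([[x0, x1] for x0 in range(6) for x1 in range(6)])

# Regression target: the numeric value of y = x0 + 2*x1
y_value = X[:, 0] + 2 * X[:, 1]

# Classification target: a label, 1 if y is above the (arbitrary) threshold 8
y_label = (y_value > 8).astype(int)

regressor = LinearRegression().fit(X, y_value)
classifier = LogisticRegression().fit(X, y_label)

new_rows = np.array([[0, 0], [5, 5]])
predicted_values = regressor.predict(new_rows)   # numbers
predicted_labels = classifier.predict(new_rows)  # labels, 0 or 1
print('regression output:', predicted_values)
print('classification output:', predicted_labels)
```

Both models are trained on the same inputs, only the kind of output differs.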

## Underfitting and overfitting

Our dataset is split into a training dataset and a test dataset. The model is fitted on the training data only, and by comparing its performance on both datasets we can determine how well our model or algorithm performs.

Underfitting means that the model has not been able to determine the relevant relations between the data. Probably the model is too simple. An example is trying to represent nonlinear relations with a linear model.

Underfitting:

- Performs poorly on training data
- Performs poorly on test data

Overfitting means that the model has also learned relations between the data and random fluctuations. Probably the model is too complex, it learns the training data too well.

Overfitting:

- Performs well on training data
- Performs poorly on test data

Both cases can also occur when there is not enough data in our dataset.
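A rough way to see both effects is to fit polynomials of different degrees to the same noisy data. The sketch below is an assumption-laden toy: the relation y = x^2, the noise level, the random seed and the degrees 1, 2 and 9 are all picked for illustration. Degree 1 is too simple for the curve, degree 9 has enough freedom to follow the noise in the training rows.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Nonlinear relation with some random fluctuation: y = x^2 + noise
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(0, 0.5, size=x.shape)

# Even rows for training, odd rows for testing
x_train, y_train = x[0::2], y[0::2]
x_test, y_test = x[1::2], y[1::2]

def r2(y_true, y_pred):
    """Coefficient of determination, 1.0 is a perfect fit."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

scores = {}
for degree in (1, 2, 9):
    poly = Polynomial.fit(x_train, y_train, degree)
    scores[degree] = (r2(y_train, poly(x_train)), r2(y_test, poly(x_test)))
    print('degree {}: train R2 = {:.3f}, test R2 = {:.3f}'.format(degree, *scores[degree]))
```

The straight line scores badly on both datasets (underfitting). The degree 9 polynomial scores at least as well as degree 2 on the training rows, and comparing its test score with its training score shows whether it has started to fit the noise.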

## Linear regression

I suggest you read through the examples in the great tutorial 'Linear Regression in Python', see links below. Below I show my version with two inputs, training and test data, and predicting a value. Because we are modeling a linear function with Linear Regression, it is not surprising that the coefficient of determination is 1.0, and that we only need a very small amount of training data.

```
# Linear regression: y = x0 + 2*x1
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# generate dataset
X_items = []
y_items = []
for x0 in range(0, 3):
    for x1 in range(3, 5):
        y = x0 + 2*x1
        X_items.append([x0, x1])
        y_items.append(y)
X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)
print('X = {}'.format(X))
print('y = {}'.format(y))
# split dataset in training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train = {}'.format(X_train))
print('y_train = {}'.format(y_train))
print('X_test = {}'.format(X_test))
print('y_test = {}'.format(y_test))
# create model
model = LinearRegression()
# mess up a training value
#X_train[0][0] += 2
# calculate optimal values of weights b0 and b1
model.fit(X_train, y_train)
# show results, mess up test data
print('model result:')
print('- intercept (b0) = {}'.format(model.intercept_))
print('- slope (b1) = {}'.format(model.coef_))
print('- coefficient of determination for training data = {}'.format(
    model.score(X_train, y_train)))
print('- coefficient of determination for test data = {}'.format(
    model.score(X_test, y_test)))
x = np.array([8, 9]).reshape((-1, 2))
y_pred = model.predict(x)
print('predicted response for x = {}: {}'.format(x, y_pred))
```

The script gives the following output:

```
X = [[0 3]
[0 4]
[1 3]
[1 4]
[2 3]
[2 4]]
y = [ 6 8 7 9 8 10]
X_train = [[2 3]
[0 3]
[1 4]
[2 4]]
y_train = [ 8 6 9 10]
X_test = [[1 3]
[0 4]]
y_test = [7 8]
model result:
- intercept (b0) = -3.552713678800501e-15
- slope (b1) = [1. 2.]
- coefficient of determination for training data = 1.0
- coefficient of determination for test data = 1.0
predicted response for x = [[8 9]]: [26.]
```

To create a little trouble I changed a training value by enabling the commented-out line `X_train[0][0] += 2` in the script above. This gives the result:

```
model result:
- intercept (b0) = -2.3529411764705817
- slope (b1) = [0.52941176 2.76470588]
- coefficient of determination for training data = 0.9865546218487395
- coefficient of determination for test data = -0.5570934256055304
predicted response for x = [[8 9]]: [26.76470588]
```

By adding more training data we get a better fit.
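The effect of more data on one bad value can be sketched as follows. The helper `fit_with_one_bad_row` and the square n x n grids are my own assumptions for this illustration, they are not the dataset used in the script above. One input value is messed up in the same way, and the deviation of the fitted coefficients from the true values [1, 2] is compared for a small and a larger dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_with_one_bad_row(n):
    """Fit y = x0 + 2*x1 on an n x n grid after corrupting one input value."""
    X = np.array([[x0, x1] for x0 in range(n) for x1 in range(n)], dtype=float)
    y = X[:, 0] + 2 * X[:, 1]   # targets are computed before the corruption
    X[0][0] += 2                # mess up a single training value
    model = LinearRegression().fit(X, y)
    # largest deviation from the true coefficients [1, 2]
    return np.max(np.abs(model.coef_ - np.array([1.0, 2.0])))

error_small = fit_with_one_bad_row(3)    # 9 rows
error_large = fit_with_one_bad_row(20)   # 400 rows
print('coefficient error with   9 rows: {:.4f}'.format(error_small))
print('coefficient error with 400 rows: {:.4f}'.format(error_large))
```

With more rows the single bad value carries less weight, so the coefficients end up much closer to 1 and 2.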

## Deep Learning with Keras

This should be more like the black box approach. I just add some neurons and layers and that should be it. But is it really? Here I use Keras because it appears to be very popular. The following example was very helpful: 'Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression', see links below. The plotting of both loss and mean absolute error made much sense.
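For reference, this is how the two quantities are calculated. The loss used later in this post is the mean squared error. The three predictions and expected values are copied from the script output further down, the variable names are just for this sketch.

```python
import numpy as np

# Predictions and expected values copied from the script output in this post
predicted = np.array([38.0433, 177.94098, 103.54724])
expected = np.array([38.0, 181.0, 106.0])

errors = predicted - expected
mse = np.mean(errors ** 2)      # mean squared error, large errors count extra
mae = np.mean(np.abs(errors))   # mean absolute error, in the same units as y
print('mse = {:.4f}, mae = {:.4f}'.format(mse, mae))
```

The mean absolute error is the easier one to interpret: it tells how far off a prediction is on average.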

We will do the following steps:

- Load the data
- Define the model
- Compile the model
- Train (fit) the model
- Evaluate the model
- Make predictions

**Load the data** Here we generate the data ourselves, see also above.

**Define the model** The first Dense layer needs the input_shape parameter set. I started with 100 neurons in the first layer, 50 in the second layer, 25 in the third. Why? I have no idea, did not look into this yet.

**Compile the model** Not much to say about this.

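**Train (fit) the model** We are using the validation_split parameter; without validation data there is no val_loss or val_mean_absolute_error to plot. When validation_split is specified, part of the training data is held out for validation. Keras takes these rows from the end of the training data, before shuffling, so with a fixed random_state in train_test_split the validation rows are the same on every run. The fit itself can still differ between runs, because the initial weights and the batch order are random.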

**Evaluate the model** We use the test data here.

**Predictions** Simply use model.predict().
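If you want full control over which rows are used for validation, Keras also accepts a validation_data parameter in fit(). A minimal sketch, assuming the same generated dataset as in this post: the names X_fit and X_val are hypothetical, and the model.fit() call is left as a comment so the snippet runs without Keras installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same generated dataset as in this post: y = x0 + 2*x1
X = np.array([[x0, x1] for x0 in range(0, 18, 3) for x1 in range(2, 27, 3)])
y = X[:, 0] + 2 * X[:, 1]

# First split off the test data, then split the rest into fit and validation data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=1)

# Pass the validation rows explicitly instead of using validation_split
fit_params = {
    'epochs': 100,
    'validation_data': (X_val, y_val),
    'verbose': 0,
}
# history = model.fit(X_fit, y_fit, **fit_params)
print('fit rows = {}, validation rows = {}, test rows = {}'.format(
    len(y_fit), len(y_val), len(y_test)))
```

The history object then contains the same val_loss and val_mean_absolute_error entries for plotting.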

Of course we need more training data here. But also not really that much more. I also added code to save and load the model. Below is the code I created.

```
# Keras deep learning: y = x0 + 2*x1
from keras.models import Sequential, load_model
from keras.layers import Dense
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

use_saved_model = False
#use_saved_model = True

# create dataset
def fx(x0, x1):
    return x0 + 2*x1

X_items = []
y_items = []
for x0 in range(0, 18, 3):
    for x1 in range(2, 27, 3):
        y = fx(x0, x1)
        X_items.append([x0, x1])
        y_items.append(y)
X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)
print('X = {}'.format(X))
print('y = {}'.format(y))
X_data_shape = X.shape
print('X_data_shape = {}'.format(X_data_shape))

class DLM:

    def __init__(
        self,
        model_name='default_model',
    ):
        self.model_name = model_name
        self.dense_input_shape = (2, )
        self.dense_neurons = [100, 50, 25]
        self.fit_params = {
            'epochs': 100,
            'validation_split': 0.1,
            'verbose': 0,
        }

    def data_split_train_test(
        self,
        X,
        y,
    ):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=1)
        print('self.X_train = {}'.format(self.X_train))
        print('self.X_test = {}'.format(self.X_test))
        print('self.y_train = {}'.format(self.y_train))
        print('self.y_test = {}'.format(self.y_test))
        print('training data row count = {}'.format(len(self.y_train)))
        print('test data row count = {}'.format(len(self.y_test)))
        X_train_data_shape = self.X_train.shape
        print('X_train_data_shape = {}'.format(X_train_data_shape))

    def get_model(
        self,
    ):
        self.model = Sequential()
        # first layer: needs the shape of one input row
        self.model.add(Dense(self.dense_neurons[0], input_shape=self.dense_input_shape, activation='relu', name='dense_input'))
        # remaining hidden layers
        for i, n in enumerate(self.dense_neurons[1:]):
            self.model.add(Dense(n, activation='relu', name='dense_hidden_' + str(i)))
        # output layer: a single number, no activation
        self.model.add(Dense(1, activation='linear', name='dense_output'))
        self.model.compile(optimizer='adam', loss='mse', metrics=['mean_absolute_error'])
        self.model_summary()
        return self.model

    def model_summary(
        self,
    ):
        self.model.summary()

    def train(
        self,
        model,
        plot=False,
    ):
        history = model.fit(self.X_train, self.y_train, **self.fit_params)
        if plot:
            # loss (mean squared error) per epoch
            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['loss'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_loss'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Loss')
            fig.show()
            # mean absolute error per epoch
            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['mean_absolute_error'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_mean_absolute_error'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Mean Absolute Error')
            fig.show()
        return history

    def evaluate(
        self,
        model,
    ):
        mse_nn, mae_nn = model.evaluate(self.X_test, self.y_test)
        print('Mean squared error on test data: ', mse_nn)
        print('Mean absolute error on test data: ', mae_nn)
        return mse_nn, mae_nn

    def predict(
        self,
        model,
        x0,
        x1,
        fx=None,
    ):
        x = np.array([[x0, x1]]).reshape((-1, 2))
        predictions = model.predict(x)
        expected = ''
        if fx is not None:
            expected = ', expected = {}'.format(fx(x0, x1))
        print('for x = {}, predictions = {}{}'.format(x, predictions, expected))
        return predictions

    def save_model(
        self,
        model,
    ):
        # note: newer Keras releases may require a file name ending in .keras or .h5
        model.save(self.model_name)

    def load_saved_model(
        self,
    ):
        self.model = load_model(self.model_name)
        return self.model

# create & save, or use saved
dlm = DLM()
if use_saved_model:
    model = dlm.load_saved_model()
else:
    dlm.data_split_train_test(X, y)
    model = dlm.get_model()
    dlm.train(model, plot=True)
    dlm.evaluate(model)
    dlm.save_model(model)

# predict
dlm.predict(model, 4, 17, fx=fx)
dlm.predict(model, 23, 79, fx=fx)
dlm.predict(model, 40, 33, fx=fx)
```

The script gives the following output, excluding the input data:

```
training data row count = 37
test data row count = 17
X_train_data_shape = (37, 2)
2022-01-28 16:00:30.598860: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-01-28 16:00:30.598886: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-01-28 16:00:30.598911: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (myra): /proc/driver/nvidia/version does not exist
2022-01-28 16:00:30.599110: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_input (Dense) (None, 100) 300
dense_hidden_0 (Dense) (None, 50) 5050
dense_hidden_1 (Dense) (None, 25) 1275
dense_output (Dense) (None, 1) 26
=================================================================
Total params: 6,651
Trainable params: 6,651
Non-trainable params: 0
_________________________________________________________________
1/1 [==============================] - 0s 25ms/step - loss: 0.1018 - mean_absolute_error: 0.2752
Mean squared error on test data: 0.10178931057453156
Mean absolute error on test data: 0.27519676089286804
2022-01-28 16:00:33.549267: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
for x = [[ 4 17]], predictions = [[38.0433]], expected = 38
for x = [[23 79]], predictions = [[177.94098]], expected = 181
for x = [[40 33]], predictions = [[103.54724]], expected = 106
```
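The 'Param #' column in the model summary can be checked by hand. Every neuron in a Dense layer has one weight per input plus one bias. A small sketch, where the helper name `dense_param_count` is my own:

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 2 inputs, then layers with 100, 50 and 25 neurons, then 1 output
layer_sizes = [2, 100, 50, 25, 1]
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

This gives 300, 5050, 1275 and 26, together 6651, the same numbers as in the summary. That is a lot of parameters for a function that needs only two coefficients.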

## Improving Deep Learning performance

While writing this post I changed the number of neurons, the number of dense layers and the validation_split parameter. It all resulted in some changes, sometimes good, sometimes bad. The biggest improvement without doubt is to add more training data, but how much is enough?
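It is not only the amount of training data that matters, but also the range it covers. In the output above, the prediction for (4, 17) is very close, while (23, 79) and (40, 33) are off by a few units. A quick check, with a hypothetical helper `inside_training_range`, shows that the last two lie outside the range of the generated inputs, so the model has to extrapolate there:

```python
# Input values used to generate the dataset in this post
x0_values = list(range(0, 18, 3))   # 0, 3, ..., 15
x1_values = list(range(2, 27, 3))   # 2, 5, ..., 26

def inside_training_range(x0, x1):
    """True if both inputs lie between the smallest and largest generated values."""
    return (min(x0_values) <= x0 <= max(x0_values)
            and min(x1_values) <= x1 <= max(x1_values))

checks = {point: inside_training_range(*point) for point in [(4, 17), (23, 79), (40, 33)]}
for point, inside in checks.items():
    print('{} inside training range: {}'.format(point, inside))
```

Extending the ranges in the dataset to cover the inputs you expect is probably a more targeted improvement than just adding more rows.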

## Summary

The most important thing I learned is that Deep Learning requires a large dataset, the bigger the better. Do I like the black box? Yes, and so far I did not really look inside. I can now use this as a start for some real world projects. There are also things I did not try yet, like normalizing the inputs and adding weights to the inputs. Much more to read ...
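As a starting point for normalizing the inputs, here is a minimal sketch using StandardScaler from scikit-learn. This is one common option and not something I used in the script above. The scaler learns the mean and standard deviation from the training data only, and the same values are then applied to the test data (and later to any new input before calling model.predict()).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same generated dataset as in this post: y = x0 + 2*x1
X = np.array([[x0, x1] for x0 in range(0, 18, 3) for x1 in range(2, 27, 3)])
y = X[:, 0] + 2 * X[:, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same mean and std

print('mean per input after scaling:', X_train_scaled.mean(axis=0))
print('std per input after scaling: ', X_train_scaled.std(axis=0))
```

After scaling, both inputs have mean 0 and standard deviation 1 on the training data, so neither input dominates just because its numbers are bigger.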

## Links / credits

How to find the value for Keras input_shape/input_dim?

https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim

Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression

https://towardsdatascience.com/keras-101-a-simple-and-interpretable-neural-network-model-for-house-pricing-regression-31b1a77f05ae

Keras examples

https://keras.io/examples

Linear Regression in Python

https://realpython.com/linear-regression-in-python

Predictive Analysis in Python

https://medium.com/my-data-camp-journey/predictive-analysis-in-python-97ca5b64e97f

Regression Tutorial with the Keras Deep Learning Library in Python

https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python
