Un intento de resolver el Tic-Tac-Toe usando Keras y LSTM

Deep Learning no es la forma de resolver el tres en raya, pero sin duda es una bonita experiencia educativa.

2 marzo 2022

Después de implementar mi primer modelo Deep Learning LSTM para un proyecto pensé si Deep Learning también podría resolver un juego. El primer juego que me viene a la mente es el Tic-Tac-Toe. Entonces buscas en internet y parece que hay mucha gente que ha tenido la misma idea. Por supuesto.

A continuación presento mi solución para resolver el Tic-Tac-Toe usando Keras y LSTM (Long Short Term Memory). Se trata de Deep Learning y no de una solución de Aprendizaje por Refuerzo, que es un tema totalmente diferente.

La solución no utiliza un único modelo, sino un modelo para cada movimiento. ¿Por qué? Porque quería evitar que los datos de entrenamiento de una primera jugada se "contaminaran" con los datos de entrenamiento de futuras jugadas.

Sobre el juego Tic-Tac-Toe

Hay 255168 formas de jugar a este juego. De estas partidas, 131184 son ganadas por el primer jugador, 77904 van a parar al oponente y 46080 terminan en empate.

Un tablero consta de campos con contenido vacío, X y O. El número total de tableros posibles es 3^9 = 19.683.

Lo que sabemos del juego Tic-Tac-Toe

Nuestra caja negra Deep Learning sabe lo siguiente sobre el juego:

Hay dos jugadores
Se juega en 9 campos, llamémosle tablero
Los jugadores se turnan para hacer un movimiento
Hacer un movimiento significa asignar un campo no utilizado a un jugador
Esto significa que sabemos qué campos están disponibles cuando hacemos un movimiento
Después de algunos movimientos alguien nos dice que la partida ha terminado y quién ha ganado, perdido o si ha habido un empate

Eso es todo lo que sabemos, no mucho. ¿Podemos usar Deep Learning para jugar partidas ganadas?

Algunas preguntas y respuestas

¿Cómo representamos el tablero?
¿Cómo representamos una jugada?
¿Debemos usar regresión o clasificación y cómo usarla?
¿Debemos usar MLP (Precepción Multicapa) o LSTM (Memoria a Largo Plazo) o algo más?
¿Debería cada movimiento tener su propio modelo?
¿Deben tener ambos jugadores su propio modelo?

¿Cómo se representa el tablero?

El tablero es un vector de nueve números, uno para cada campo. El número es 0 cuando puede ser utilizado por el jugador1 o el jugador2. Es 1 cuando lo toma el jugador1 y 2 cuando lo toma el jugador1.

¿Cómo se representa la jugada?

Esto no es realmente importante, vamos a utilizar una lista de una posición de fila y columna.

¿Regresión o clasificación y cómo usarla?

Con la regresión podemos intentar predecir un valor para la siguiente jugada.

Con la clasificación podemos tratar de predecir un vector de mejores movimientos.

Aquí elijo la regresión. Tenemos dos jugadores. Los valores para el jugador1:

2: ganó
1: empate
0: perdido

La predicción será un valor entre 2 y 0. Si es nuestro turno, predecimos el valor para todas las jugadas disponibles. Entonces seleccionamos el mejor valor disponible, que es el máximo. Esto da nuestra jugada. Para el jugador 2, el mejor valor disponible es el mínimo.

MLP (Precepción Multicapa) o LSTM (Memoria de Largo Plazo) o algo más?

No sé, probemos con LSTM.

¿Cada movimiento tiene su propio modelo?

Creo que esto es importante. Si sólo tenemos un modelo, entonces los datos después del paso N también se incluyen. Eso parece totalmente erróneo.

¿Deben tener ambos jugadores su propio modelo?

Yo creo que sí. El jugador 1 está siempre un movimiento por delante, el jugador 2 está siempre un movimiento por detrás.

Datos de entrenamiento

Estamos utilizando la regresión, ver arriba, para generar un valor para cada movimiento posible en un momento determinado. El jugador1 selecciona la jugada con el valor máximo, el oppenente, el jugador2 selecciona la jugada con el valor mínimo.

El tablero es un vector de nueve campos. Los campos pueden estar vacíos, X para el jugador1 y O para el jugador2. Una partida se compone de varios vectores. Asignamos el resultado del juego a todos los vectores de una partida.

Por ejemplo, los datos de un juego tienen el aspecto siguiente

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0] -> 2 # player1 makes a move
[1, 0, 0, 0, 0, 0, 0, 2, 0, 0] -> 2 # player2
[1, 1, 0, 0, 0, 0, 0, 2, 0, 0] -> 2 # player1
[1, 1, 0, 0, 0, 0, 0, 2, 0, 2] -> 2 # player2
[1, 1, 1, 0, 0, 0, 0, 2, 0, 2] -> 2 # player1 won

Para obtener información sobre LSTM , consulte "Cómo desarrollar modelos LSTM para la previsión de series temporales", véanse los enlaces siguientes. El modelo es un multivariate LSTM utilizando n_steps_in=2 y n_steps_out=1.

Para generar los datos de entrenamiento jugamos el juego muchas veces.

Datos de juegos únicos

Para entrenar nuestro modelo generamos datos jugando un número de partidas. Lo hacemos utilizando jugadas aleatorias. Esto significa que podemos tener muchas partidas duplicadas en nuestro conjunto de datos, en otras palabras, el conjunto de datos está contaminado. Por el momento me aseguro de que no hay partidas duplicadas, utilizando una firma del tablero y el resultado final.

Un modelo para cada jugada

Si utilizamos un único modelo, los datos de entrenamiento se "contaminan" con los datos futuros. Cuando es nuestra jugada, queremos saber cuál es la siguiente mejor jugada. Todo lo que se hace después de ese paso contamina nuestros datos existentes. Por eso utilizo varios modelos, uno para cada paso.

move[n]    model[n] with data upto step[n+1]
move[n+1]  model[n+1] with data upto step[n+2]
etc.

Para el jugador 1 esto significa que en la jugada[n] utilizamos el modelo[n] que ha sido entrenado con datos hasta la jugada n+1. De esta manera podemos seleccionar el mejor movimiento disponible para hacer.

El parámetro n_steps_in=2 significa que no tenemos modelo para la primera jugada. Lo que hacemos es que el primer movimiento sea aleatorio, para ambos jugadores.

Tampoco utilizamos un modelo si sólo queda una jugada.

Rendimiento

Bien, he implementado esto. ¿Cómo funciona? El resultado del entrenamiento muestra para todos los modelos que el error medio_absoluto se reduce de alrededor de 0,8 a 0,6. No es muy bueno.

En los resultados de abajo tengo los dos jugadores. Un jugador puede ser

RD: hace movimientos aleatorios
NN: hace movimientos Neural Network

El primer jugador es el jugador1, el segundo es el jugador2. Los datos de entrenamiento se generan como se ha descrito anteriormente. Se generan hasta que tengamos, por ejemplo, 1000 x won-draw-lost = 1000xwon, 1000xlost, y 1000xdraw.

1. RD vs RD

Ambos jugadores hacen jugadas al azar.

Resultado de 4 ejecuciones de 1000 partidas:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   578 |   293 |   129 |     1.97 |       7.7 |     12.9% |
|   593 |   274 |   133 |     2.16 |       7.7 |     13.3% |
|   590 |   276 |   134 |     2.14 |       7.6 |     13.4% |
|   589 |   296 |   115 |     1.99 |       7.6 |     11.5% |

Esto no parece descabellado. El jugador1 comienza la partida y por tanto tiene ventaja sobre el jugador2.

2. NN vs RD, tren con 1000 x ganados-empate-perdidos

El jugador1 hace movimientos Neural Network , el jugador2 hace movimientos al azar.

Resultados de 4 ejecuciones de 1000 juegos:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   706 |   192 |   102 |     3.68 |       7.4 |     10.2% |
|   690 |   203 |   107 |     3.40 |       7.3 |     10.7% |
|   721 |   172 |   107 |     4.19 |       7.4 |     10.7% |
|   726 |   168 |   106 |     4.32 |       7.4 |     10.6% |

Mejor que las jugadas al azar, pero lejos de la perfección.

3. NN vs RD, entrenar con 2000 x ganadas-empate-perdidas

El jugador1 hace movimientos Neural Network , el jugador2 hace movimientos aleatorios.

Resultados de 4 ejecuciones de 1000 juegos:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   706 |   192 |   102 |     3.68 |       7.4 |     10.2% |
|   690 |   203 |   107 |     3.40 |       7.3 |     10.7% |
|   721 |   172 |   107 |     4.19 |       7.4 |     10.7% |
|   726 |   168 |   106 |     4.32 |       7.4 |     10.6% |

De nuevo mejor, pero de nuevo lejos de la perfección.

4. NN vs RD, tren con 5000 x ganadas-empatadas-perdidas

El jugador1 hace movimientos Neural Network , el jugador2 hace movimientos aleatorios.

Resultados de 4 ejecuciones de 1000 juegos:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   901 |    49 |    50 |    18.39 |       7.0 |      5.0% |
|   888 |    58 |    54 |    15.31 |       7.0 |      5.4% |
|   889 |    56 |    55 |    15.88 |       7.0 |      5.5% |
|   898 |    48 |    54 |    18.71 |       7.0 |      5.4% |

¡Esto se parece más! ¿Podemos hacerlo mejor?

5. NN vs RD, entrenar con 20000 x ganadas-empate-perdidas

El jugador1 hace movimientos Neural Network , el jugador2 hace movimientos al azar.

Resultados de 4 ejecuciones de 1000 juegos:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   955 |    21 |    24 |    45.48 |       6.7 |      2.4% |
|   959 |    15 |    26 |    63.93 |       6.7 |      2.6% |
|   962 |    13 |    25 |    74.00 |       6.7 |      2.5% |
|   969 |    12 |    19 |    80.75 |       6.7 |      1.9% |

Enorme mejora, ¡más datos (de entrenamiento) son importantes! ¿Las victorias del jugador2 son pura suerte?

6. RD vs NN, entrenar con 20000 x ganadas-empate-perdidas

Ahora cambiamos el orden. El jugador1 hace movimientos aleatorios y el jugador2 hace movimientos Neural Network .

Resultados de 4 ejecuciones de 1000 partidas:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   145 |   739 |   116 |     0.20 |       6.8 |     11.6% |
|   127 |   761 |   112 |     0.17 |       6.8 |     11.2% |
|   142 |   735 |   123 |     0.19 |       6.8 |     12.3% |
|   142 |   750 |   108 |     0.19 |       6.8 |     10.8% |

Esto parece razonable. Los resultados del jugador2 no son tan buenos como los del jugador1. También el número de empates aumentó. ¿Tiene esto que ver con el hecho de que el jugador1 comienza primero?

7. NN vs NN, entrenar con 20000 x ganado-empate-perdido

¡Ambos jugadores hacen movimientos Neural Network !

Resultados de las ejecuciones de 1000 partidas:

+-------+-------+-------+----------+-----------+-----------+
| won   | lost  | draw  | won/lost | avg_moves | perc draw |
+-------+-------+-------+----------+-----------+-----------+
|   684 |   104 |   212 |     6.58 |       7.2 |     21.2% |
|   672 |   102 |   226 |     6.59 |       7.2 |     22.6% |
|   675 |   112 |   213 |     6.03 |       7.2 |     21.3% |

Esperamos que las partidas ganadas y perdidas estén cerca, pero no es el caso. También esperamos ver muchas más tablas. Eso parece bien.

Sólo podemos concluir: Nuestra caja negra no es lo suficientemente inteligente ... :-(

El código

A continuación se muestra el código para que pueda probar usted mismo. Keras Tuner se utiliza para optimizar el hyperparameters.

Variables importantes:

tuner_dir
Este es el directorio creado por Keras Tuner.
must_train
Establezca True si desea entrenar los modelos.
Establezca False si desea utilizar los modelos entrenados.
won_draw_lost_count
Esto es sobre los datos de entrenamiento, ver arriba.

Normalmente se quiere entrenar una vez y luego hacer varias ejecuciones. Si cambia won_draw_lost_count, asegúrese de borrar primero el directorio tuner_dir. Incrementar won_draw_lost_count incrementará el tiempo de entrenamiento.

# tictactoe.py

# std
import copy
import os
import random
import sys
import time
import operator

# optimizing hyperparameters with keras tuner
from keras.callbacks import EarlyStopping
from keras.models import load_model, Sequential
from keras.layers import Dense, Dropout, LSTM
import keras_tuner as kt
import numpy as np
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

# get reproducible results
from numpy.random import seed
seed(1)
import tensorflow
tensorflow.random.set_seed(2)


# constants: game results
# note(s):
# - the order for player1, higher is better
# - result is float
GAME_RESULT_PLAYER1_WON = 2.
GAME_RESULT_DRAW = 1.
GAME_RESULT_PLAYER2_WON = 0.

# tuner parameters
class TunerParameter:
    def __init__(self, name, val):
        self.name = name
        self.val = val
        self.args = (name, val)

class TunerParameters:
    def __init__(self, pars=None):
        self.t_pars = []
        for p in pars:
            tpar = TunerParameter(p[0], p[1])
            self.t_pars.append(tpar)
            setattr(self, p[0], tpar)
    def get_pars(self):
        return self.t_pars

# subclassed hyperband tuner to add hyperparameters (here, batch_size)
class HyperbandTuner(kt.tuners.Hyperband):
    def __init__(self, *args, **kwargs):
        self.t_pars = None
        if 't_pars' in kwargs:
            self.t_pars = kwargs.pop('t_pars')
        super(HyperbandTuner, self).__init__(*args, **kwargs)

    def run_trial(self, trial, *args, **kwargs):
        if self.t_pars is not None:
            for tpar in self.t_pars:
                kwargs[tpar.name] = trial.hyperparameters.Choice(tpar.name, tpar.val)
        return super(HyperbandTuner, self).run_trial(trial, *args, **kwargs)

# tuner class
class Tuner:

    def __init__(
        self,
        input_shape=None,
        t_pars=None,
        t_dir='tuner_dir',
    ):
        self.input_shape = input_shape
        self.t_dir = t_dir
        self.t_pars = t_pars

    def get_model(self, hp):
        # model params: layer_input 
        model_params_layer_input = dict(
            units=hp.Choice(*self.t_pars.layer_input_units.args),
            activation='relu',
            input_shape=self.input_shape,
        )

        '''
        # model params: layer_hidden_1
        model_params_layer_hidden_1 = dict(
            units=hp.Choice(*self.t_pars.layer_hidden_1_units.args),
            activation='relu',
            input_shape=self.input_shape,
        )
        '''

        # model params: compile
        model_params_compile = dict(
            #optimizer=Adam(learning_rate=hp.Choice(*self.t_pars.learning_rate.args)),
            optimizer='adam', 
            loss='mean_absolute_error',
            metrics=['mean_absolute_error'],
            #metrics=['mean_absolute_error', 'val_mean_absolute_error'],
            #metrics=['mean_absolute_error', 'val_mae'],
        )

        # model
        model = Sequential()
        model.add(LSTM(**model_params_layer_input))
        model.add(Dense(
            units=1,
        ))
        # compile
        model.compile(**model_params_compile)

        return model

    def get_model_path(
        self,
        model_num,
    ):
        return os.path.join(self.t_dir, 'best_model_' + str(model_num))

    def tune(
        self,
        X,
        y,
        model=None,
        model_num=0,
        use_early_stopping=False,
        search_split_size=0.33,
        fit_verbose=0,
        plot_loss=False,
    ):
        if model is None:
            model = self.get_model

        # use subclassed HyperBand tuner
        tuner = HyperbandTuner(
            model,
            objective=kt.Objective('val_mean_absolute_error', direction='min'),
            #allow_new_entries=True,
            tune_new_entries=True,
            hyperband_iterations=2,
            max_epochs=260,
            directory=self.t_dir,
            project_name='model_' + str(model_num),
            t_pars=[self.t_pars.batch_size],
        )

        print('\n{}\ntuner.search_space_summary()\n{}\n'.format('-'*60, '-'*60))
        tuner.search_space_summary()

        callbacks = []
        if use_early_stopping:
            callbacks = [EarlyStopping('val_loss', patience=3)]
        print('\n{}\ntuner.search()\n{}\n'.format('-'*60, '-'*60))
        tuner.search(
            X,
            y,
            validation_split=search_split_size,
            shuffle=False,
            callbacks=callbacks
        )

        print('\n{}\ntuner.results_summary()\n{}\n'.format('-'*60, '-'*60))
        tuner.results_summary()

        print('\n{}\nbest_hps\n{}\n'.format('-'*60, '-'*60))
        best_hps = tuner.get_best_hyperparameters()[0]
        best_hps_dict = {}
        for tpar in self.t_pars.get_pars():
            try:
                best_val = best_hps[tpar.name]
                print('- {} = {}'.format(tpar.name, best_val))
                best_hps_dict[tpar.name] = best_val
            except:
                pass
        # add tuner epochs
        best_hps_dict['epochs'] = best_hps['tuner/epochs']

        # get best model
        h_model = tuner.hypermodel.build(best_hps)

        print('\n{}\nh_model.summary()\n{}\n'.format('-'*60, '-'*60))
        h_model.summary()

        # train best model
        epochs = 50
        print('training ...')    
        history = h_model.fit(
            X,
            y,
            epochs=epochs,
            workers=4,
            use_multiprocessing=True,
            verbose=fit_verbose,
        )

        if plot_loss:
            fig_title = 'Loss - model[{}]'.format(model_num)
            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['mean_absolute_error'], name='Train'))
            ##fig.add_trace(go.Scattergl(y=history.history['val_mean_absolute_error'], name='Valid'))
            #fig.add_trace(go.Scattergl(y=history.history['loss'], name='Loss'))
            fig.update_layout(height=500, width=1200, title=fig_title, xaxis_title='Epoch', yaxis_title='Mean Absolute Error')
            fig.show()

        # save best model
        h_model.save(self.get_model_path(model_num))

        print('\n{}\nh_model.evaluate()\n{}\n'.format('-'*60, '-'*60))
        h_eval_dict = h_model.evaluate(X, y, return_dict=True)
        print('h_eval_dict = {}'.format(h_eval_dict))

        return dict(
            h_model=h_model,
            h_eval_dict=h_eval_dict,
            best_hps_dict=best_hps_dict,
        )

    def load_model(
        self,
        model_num,
    ):
        try:
            return load_model(self.get_model_path(model_num))
        except:
            return None

class DataUtils:

    @classmethod
    def get_model_boards_and_results(cls, game_results):
        # generate datasets
        model_boards_and_results = {}
        for m in range(9):
            model_boards_and_results[m] = []
        for game_result in game_results:
            result = game_result.result
            for i, board in enumerate(game_result.boards):
                if i == 0:
                    # skip board0
                    continue
                # model_boards_data
                if i < 4:
                    # store for all models
                    for m in range(2, 9):
                        model_boards_and_results[m].append((board, result))
                else:
                    # store for models[i - 1]
                    for m in range(i - 1, 9):
                       model_boards_and_results[m].append((board, result))
        return model_boards_and_results

    @classmethod
    def get_model_data(cls, model_boards_and_results):
        model_data = {}
        for model_num in model_boards_and_results:
            boards_and_results = model_boards_and_results[model_num]
            boards = []
            results = []
            for board, result in boards_and_results:
                boards.append(board)
                results.append(result)
            boards_len = len(boards)
            results_len = len(results)
            print('model[{}]: boards_len = {}, results_len = {}'.format(model_num, boards_len, results_len))
            print('model[{}]:'.format(model_num))
            #print('- boards = {}'.format(boards))
            #print('- results = {}'.format(results))

            if boards_len == 0:
                continue

            # create data for model
            # player1 models: 2, 4, 6, 8
            # player2 models: 3, 5, 7
            # but we don't about this here
            # 2 = player1 won
            # 1 = draw
            # 0 = player2 won
            # this means:
            # - player1 looks for the maximum
            # - player2 looks for the minimum
            dataset = []
            for i, board in enumerate(boards):
                result = results[i]
                flattened_board_and_result = []
                for row in board:
                    print('row = {}'.format(row))
                    flattened_board_and_result.extend(row)
                flattened_board_and_result.append(result)
                dataset.append(flattened_board_and_result)
            dataset = np.array(dataset, dtype=object)
            print('dataset = {}'.format(dataset))
            X_in, y_in = cls.split_sequences(dataset, n_steps_in, n_steps_out)
            print('X_in = {}'.format(X_in))
            print('y_in = {}'.format(y_in))

            X_in_shape = X_in.shape
            print('X_in_shape = {}'.format(X_in_shape))

            model_data[model_num] = dict(
                X_in=X_in,
                y_in=y_in,
            )

        return model_data

    # split a multivariate sequence into samples
    @classmethod
    def split_sequences(cls, sequences, n_steps_in, n_steps_out):
        X, y = list(), list()
        for i in range(len(sequences)):
            # find the end of this pattern
            end_ix = i + n_steps_in
            out_end_ix = end_ix + n_steps_out-1
            # check if we are beyond the dataset
            if out_end_ix > len(sequences):
                break
            # gather input and output parts of the pattern
            seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1:out_end_ix, -1]
            X.append(seq_x)
            y.append(seq_y)
        if n_steps_out == 1:
            y1 = []
            for v in y:
                y1.append(v[0])
            y = y1
        return np.array(X), np.array(y)

    @classmethod
    def split_datasets(cls, X, y):
        return train_test_split(X, y, test_size=0.33, shuffle=False, random_state=42)


class GameResult:
    def __init__(
        self,
        boards=None,
        moves=None,
        result=None,
    ):
        self.boards = boards
        self.moves = moves
        self.result = result


class GamePlay:
    
    def __init__(
        self,
        n_steps_in=None,
        n_steps_out=None,
        n_features=None,
    ):
        self.n_steps_in = n_steps_in
        self.n_steps_out = n_steps_out
        self.n_features = n_features
        self.player_value2displays = {0: ' ', 1: 'X', 2: 'O'}

    def get_empty_board(self):
        return [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

    def show_boards(self, boards, show_board_num=True, horizontal=True):
        boards_strs = []
        for i, board in enumerate(boards):
            strs = self.show_board(board, board_num=i, return_strs=horizontal)
            boards_strs.append(strs)
        if horizontal:
            for i in range(len(strs)):
                line_parts = []
                for b in boards_strs:
                    line_parts.append(b[i])
                print('{}'.format('  '.join(line_parts)))

    def show_board(self, board, board_num=None, return_strs=False):
        strs = []    
        if board_num is not None:
            strs.append('board: {:<6}'.format(board_num))
        ds = '+---+---+---+'
        strs.append('{}'.format(ds))
        for r in range(3):
            row_items = []
            for c in range(3):
                row_items.append(self.player_value2displays[board[r][c]])
            dr = '| {} |'.format(' | '.join(row_items))
            strs.append('{}'.format(dr))
            strs.append('{}'.format(ds))
        if return_strs:
            return strs
        for s in strs:
            print(s)

    def get_available_moves(self, board):
        moves = []
        for r in range(3):
            for c in range(3):
                if board[r][c] == 0:
                    moves.append([r, c])
        return moves

    def player_won(self, player, board):
        # row, col win
        for r in range(3):
            if board[r][0] == player and board[r][1] == player and board[r][2] == player:
                return True
        for c in range(3):
            if board[0][c] == player and board[1][c] == player and board[2][c] == player:
                return True
        # diag win
        if board[0][0] == player and board[1][1] == player and board[2][2] == player:
                return True
        if board[2][0] == player and board[1][1] == player and board[0][2] == player:
                return True
        return False

    def get_other_player(self, player):
        return 2 if player == 1 else 1

    def get_inverted_board(self, board):
        board = copy.deepcopy(board)
        inverted_board = self.get_empty_board()
        for r in range(3):
            for c in range(3):
                if board[r][c] == 1:
                    inverted_board[r][c] = 2
                elif board[r][c] == 2:
                    inverted_board[r][c] = 1
        return inverted_board
        
    def get_deeplearning_move(self, h_model, player, board, dbg=False):
        # get moves
        available_moves = self.get_available_moves(board)
        if len(available_moves) == 1:
            return available_moves[0]
        board = copy.deepcopy(board)

        move_vs = []
        X_pred = []
        for i, move in enumerate(available_moves):
            next_board = copy.deepcopy(board)
            self.make_move(player, next_board, move)
            # create X_pred
            x_pred = [
                self.board2vector(board),
                self.board2vector(next_board),
            ]
            if dbg: print('x_pred = {}'.format(x_pred))
            X_pred.append(x_pred)
            y = [0, move[0], move[1]]
            move_vs.append(y)

        X_pred_len = len(X_pred)
        if X_pred_len > 0:
            if dbg: print('X_pred = {}'.format(X_pred))
            X_pred = np.array(X_pred).reshape(X_pred_len, self.n_steps_in, self.n_features)
            predictions = h_model.predict(X_pred)
            if dbg: print('for X_Pred = {}, predictions = {}'.format(X_pred, predictions))
            for i, y in enumerate(predictions):
                move_vs[i][0] = y[0]
            if dbg: print('move_vs = {}'.format(move_vs))
            # get maximum or minimum
            if player == 1:
                reverse = True
            else:
                reverse = False    
            sorted_move_vs = move_vs.sort(key=operator.itemgetter(0), reverse=reverse)
            sorted_move_vs = move_vs
            if dbg: print('sorted_move_vs = {}'.format(sorted_move_vs))
            #sys.exit()
            move = [sorted_move_vs[0][1], sorted_move_vs[0][2]]
            if dbg: print('player = {}, y = {} -> selected move = {}'.format(player, y, move))
            #sys.exit()
            return move    

        # random move
        if dbg: print('player = {}, dont know, move = {}'.format(player, move))
        move = random.choice(available_moves)
        return move

    def get_random_move(self, board):
        moves = self.get_available_moves(board)
        return random.choice(moves)

    def make_move(self, player, board, move):
        board[move[0]][move[1]] = player
        return board

    def play_game(self, player1_type=0, player2_type=0, h_models=[], show_game=False):
        # player_type: 0 = random, 1 = dl
        # first board is empty
        board = self.get_empty_board()
        boards = [board]
        moves = []
        player = 1
        i = 0
        while True:
            if player == 1:
                if player1_type == 0 or len(moves) < 2:
                    print('rd move: i = {}, player = {}'.format(i, player))
                    move = self.get_random_move(board)
                else:
                    print('nn move: i = {}, player = {}'.format(i, player))
                    move = self.get_deeplearning_move(h_models[i], player, board)
            else:
                if player2_type == 0 or len(boards) <= 2:
                    print('rd move: i = {}, player = {}'.format(i, player))
                    move = self.get_random_move(board)
                else:
                    print('nn move: i = {}, player = {}'.format(i, player))
                    move = self.get_deeplearning_move(h_models[i], player, board)

            # make the move
            self.make_move(player, board, move)

            if show_game:
                print('player = {}, made move = {}'.format(player, move))
                self.show_board(board)

            # store for players
            boards.append(copy.deepcopy(board))
            moves.append(copy.deepcopy(move))

            # check won, lost, draw
            if self.player_won(player, board):
                # player won
                if player == 1:
                    result = GAME_RESULT_PLAYER1_WON
                else:
                    result = GAME_RESULT_PLAYER2_WON
                return GameResult(
                    boards=boards,
                    moves=moves,
                    result=result,
                )

            other_player = self.get_other_player(player)
            if self.player_won(other_player, board):
                # other_player won 
                if other_player == 1:
                    result = GAME_RESULT_PLAYER1_WON
                else:
                    result = GAME_RESULT_PLAYER2_WON
                return GameResult(
                    boards=boards,
                    moves=moves,
                    result=result,
                )

            if len(self.get_available_moves(board)) == 0:
                # draw
                return GameResult(
                    boards=boards,
                    moves=moves,
                    result=GAME_RESULT_DRAW,
                )

            # change player
            player = other_player
            i += 1

    def play_games(self, n_games, player1_type=0, player2_type=0, h_models=None, show_games=False):
        player_wins = [0, 0]
        draws = 0
        total_moves = 0
        result_to_text = {
            0: 'lost',
            1: 'draw',
            2: 'won',
        }
        for i in range(n_games):
            print('playing game {} of {} ...'.format(i, n_games))
            game_result = self.play_game(
                player1_type=player1_type,
                player2_type=player2_type,
                h_models=h_models,
                show_game=False,
            )
            self.show_boards(game_result.boards)
            total_moves += len(game_result.moves)
            last_move = game_result.moves[-1]
            if game_result.result == GAME_RESULT_DRAW:
                draws += 1
            elif game_result.result == GAME_RESULT_PLAYER1_WON:
                player_wins[0] += 1
            elif game_result.result == GAME_RESULT_PLAYER2_WON:
                player_wins[1] += 1

        if player_wins[1] == 0:
            won_div_lost_fmt = '{:<8}'.format('')
        else:
            won_div_lost = player_wins[0]/player_wins[1]
            won_div_lost_fmt = '{:8.2f}'.format(won_div_lost)
        moves_avg = total_moves/n_games
        print('summary for player1 after {} games\n--------'.format(n_games))
        sep_line = '+-{}-+-{}-+-{}-+-{}-+-{}-+-{}-+'.format('-'*5, '-'*5, '-'*5, '-'*8, '-'*9, '-'*9)
        print(sep_line)
        print('| {:<5} | {:<5} | {:<5} | {:<8} | {:<9} | {:<9} |'.format('won', 'lost', 'draw', 'won/lost', 'avg_moves', 'perc draw'))
        print(sep_line)
        print('| {:5d} | {:5d} | {:5d} | {} | {:9.1f} | {:8.1f}% |'.format(player_wins[0], player_wins[1], draws, won_div_lost_fmt, moves_avg, (100*draws)/n_games))
        print('')
        print('')

    def board2vector(self, board):
        b = []
        for r in range(3):
            for c in range(3):
                b.append(board[r][c])
        return b

    def get_game_result_signature(self, game):
        sig = ''
        for board in game.boards:
            for r in range(3):
                for c in range(3):
                    sig += str(board[r][c])
        for move in game.moves:
            sig += str(move[0]) + str(move[1])
        sig += str(game.result)
        return sig

    def generate_datasets(self, n_games):
        # game_result signatures to prevent duplicate data
        sigs = []
        won_count = 0
        lost_count = 0
        draw_count = 0
        # generate game data
        game_results = []
        game_count = 0
        while won_count < n_games or lost_count < n_games or draw_count < n_games:
            if game_count > 255000:
                break

            print('playing game {}'.format(game_count))
            game_count += 1
            game_result = self.play_game()
            boards_len = len(game_result.boards)
            moves_len = len(game_result.moves)
            result = game_result.result
            
            print('boards_len = {}, moves_len = {}, result = {}'.format(boards_len, moves_len, result))
            if boards_len != (moves_len + 1):
                print('boards_len != moves_len:')
                print('game_result.boards = {}'.format(game_result.boards))
                print('game_result.moves = {}'.format(game_result.moves))
                sys.exit()

            # skip identical games
            sig = self.get_game_result_signature(game_result)
            if sig in sigs:
                continue
            # distribute won, draw, lost
            if result == GAME_RESULT_DRAW:
                if draw_count >= n_games:
                    continue
                draw_count += 1
            elif result == GAME_RESULT_PLAYER1_WON:
                if won_count >= n_games:
                    continue
                won_count += 1
            elif result == GAME_RESULT_PLAYER2_WON:
                if lost_count >= n_games:
                    continue
                lost_count += 1

            sigs.append(sig)
            game_results.append(game_result)

        print('game_count = {}'.format(game_count))
        print('len(sigs) = {}'.format(len(sigs)))
        print('won_count = {}, lost_count = {}, draw_count = {}'.format(won_count, lost_count, draw_count))
        input('Press ENTER to continue ...')
        return game_results

# start

# hyperparameters name and value, are the inputs for hp.Choice()
t_pars = TunerParameters(
    pars=[
        ('layer_input_units', [64, 128]),
        #('layer_hidden_1_units', [64, 128]),
        ('batch_size', [8, 32]),
    ],
)

# lstm
n_steps_in = 2
n_steps_out = 1
n_features = 9

# other
tuner_dir = 'tuner_dir_tictactoe'
won_draw_lost_count = 5000
#won_draw_lost_count = 20000

# train or use
must_train = True
#must_train = False

game_play = GamePlay(
    n_steps_in=n_steps_in,
    n_steps_out=n_steps_out,
    n_features=n_features,
)


h_models = [None]*9
if must_train:
    game_results = game_play.generate_datasets(won_draw_lost_count)
    model_boards_and_results = DataUtils.get_model_boards_and_results(game_results)
    model_data = DataUtils.get_model_data(model_boards_and_results)

    model_data_lens = [None]*9
    print('model_data:')
    for model_num in model_data:
        data = model_data[model_num]
        print('model_num = {}'.format(model_num))
        X_in = data['X_in']
        y_in = data['y_in']
        print('- X_in = {}'.format(X_in))
        print('- y_in = {}'.format(y_in))
        model_data_lens[model_num] = len(X_in)
        #continue

        # split into train and test
        X_train, X_test, y_train, y_test = DataUtils.split_datasets(X_in, y_in)

        # convert
        X_in = np.asarray(X_in).astype(int)
        y_in = np.asarray(y_in).astype(int)
        X_train = np.asarray(X_train).astype(int)
        y_train = np.asarray(y_train).astype(int)
        print('type(X_train) = {}, len(X_train) = {}, X_train = {}'.format(type(X_train), len(X_train), X_train))
        print('type(y_train) = {}, len(y_train) = {}, y_train = {}'.format(type(y_train), len(y_train), y_train))

        n_features = X_train.shape[2]
        print('n_features = {}'.format(n_features))

        tuner = Tuner(
            input_shape=(n_steps_in, n_features),
            t_pars=t_pars,
            t_dir=tuner_dir,
        )

        tune_result = tuner.tune(
            X_train,
            y_train,
            model_num=model_num,
            #plot_loss=False,
            plot_loss=True,
        )

        h_model = tune_result['h_model']

        # add to our list of models
        h_models[model_num] = h_model

        h_eval_dict = tune_result['h_eval_dict']
        print('h_model = {}'.format(h_model))
        print('h_eval_dict = {}'.format(h_eval_dict))

        # debug: show a prediction for the h_model
        X = [
            [
                [ [0, 0, 0], [1, 0, 0], [0, 0, 0] ],
                [ [0, 0, 0], [1, 2, 0], [0, 0, 0] ],
            ],
            [
                [ [0, 0, 0], [0, 1, 0], [0, 0, 0] ],
                [ [0, 0, 0], [0, 1, 2], [0, 0, 0] ],
            ]
        ]
        X_pred = np.array(X).reshape(2, n_steps_in, n_features)
        predictions = h_model.predict(X_pred)
        print('h_model: for X_pred = {}, predictions = {}'.format(X_pred, predictions))

        # debug: show a prediction for the previously saved h_model
        loaded_h_model = tuner.load_model(model_num)
        predictions = loaded_h_model.predict(X_pred)
        print('loaded_h_model: for X_pred = {}, predictions = {}'.format(X_pred, predictions))

    # here we have trained and saved the models 
    print('model_data_lens = {}'.format(model_data_lens))
    input('Press ENTER to continue ...')

# load saved models
if h_models[2] is None:
    # load saved
    print('loading saved models ...')
    h_models = [None]*9
    for model_num in range(9):
        print('loading saved model {}  ...'.format(model_num))
        tuner = Tuner(
            input_shape=(n_steps_in, n_features),
            t_pars=t_pars,
            t_dir=tuner_dir,
        )
        loaded_h_model = tuner.load_model(model_num)
        if loaded_h_model is None:
            continue

        '''
        # debug: show a prediction for the previously saved h_model
        X = [
            [
                [ [0, 0, 0], [1, 0, 0], [0, 0, 0] ],
                [ [0, 0, 0], [1, 2, 0], [0, 0, 0] ],
            ],
            [
                [ [0, 0, 0], [0, 1, 0], [0, 0, 0] ],
                [ [0, 0, 0], [0, 1, 2], [0, 0, 0] ],
            ]
        ]
        X_pred = np.array(X).reshape(2, n_steps_in, n_features)
        predictions = loaded_h_model.predict(X_pred)
        print('loaded_h_model: for X_pred = {}, predictions = {}'.format(X_pred, predictions))
        sys.exit()
        '''
        h_models[model_num] = loaded_h_model

# debug: show models
for model_num in range(9):
    print('h_models[{}] = {}'.format(model_num, h_models[model_num]))

print('play some games')

# play games and see results, player[1|2]_type: 0 = RD (random move), 1 = NN (neural network move)
game_play.play_games(
    1000,
    player1_type=1,
    player2_type=1,
    h_models=h_models, 
    show_games=True,
)

Mejorar el rendimiento

Podemos intentar mejorar el rendimiento de las siguientes maneras:

añadir más datos de entrenamiento
añadir más opciones hyperparameter
añadir más opciones hyperparameter
añadir más hyperparameter a los modelos, como la tasa de aprendizaje
añadir capas adicionales a los modelos
omitir los primeros movimientos reduciendo los datos de entrenamiento en

Sería bueno implementar estadísticas que también muestren cuántas partidas se completan en 5, 6, 7, 8 o 9 movimientos.

Resumen

Usar Deep Learning de esta manera no es la solución para resolver este juego. Yo sabía esto cuando empecé, pero todavía quería probar esto primero antes de buscar en el aprendizaje por refuerzo. Hay muchas mejoras posibles y seguramente probaré algunas en el futuro. El uso de Keras Tuner para afinar los hyperparameters de nuevo hizo mi vida más fácil. Este fue un buen proyecto, pero también un montón de trabajo. Sin embargo, aprendí mucho.

Enlaces / créditos

How many tic tac toe games are there?
https://tictactoebeast.com/post/how-many-tic-tac-toe-games-are-there

How many Tic-Tac-Toe (noughts and crosses) games are possible?
http://www.se16.info/hgb/tictactoe.htm

How to Develop LSTM Models for Time Series Forecasting
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting

Part 4 — Neural Network Q Learning, a Tic-Tac-Toe player that learns — kind of
https://medium.com/@carsten.friedrich/part-4-neural-network-q-learning-a-tic-tac-toe-player-that-learns-kind-of-2090ca4798d

Tic-Tac-Toe and Reinforcement Learning
https://medium.com/swlh/tic-tac-toe-and-deep-neural-networks-ea600bc53f51

Un intento de resolver el Tic-Tac-Toe usando Keras y LSTM

Sobre el juego Tic-Tac-Toe

Lo que sabemos del juego Tic-Tac-Toe

Algunas preguntas y respuestas

¿Cómo se representa el tablero?

¿Cómo se representa la jugada?

¿Regresión o clasificación y cómo usarla?

MLP (Precepción Multicapa) o LSTM (Memoria de Largo Plazo) o algo más?

¿Cada movimiento tiene su propio modelo?

¿Deben tener ambos jugadores su propio modelo?

Datos de entrenamiento

Datos de juegos únicos

Un modelo para cada jugada

Rendimiento

1. RD vs RD

2. NN vs RD, tren con 1000 x ganados-empate-perdidos

3. NN vs RD, entrenar con 2000 x ganadas-empate-perdidas

4. NN vs RD, tren con 5000 x ganadas-empatadas-perdidas

5. NN vs RD, entrenar con 20000 x ganadas-empate-perdidas

6. RD vs NN, entrenar con 20000 x ganadas-empate-perdidas

7. NN vs NN, entrenar con 20000 x ganado-empate-perdido

El código

Mejorar el rendimiento

Resumen

Enlaces / créditos

Leer más

Deep Learning Machine Learning

Deje un comentario

Comentarios

Deje una respuesta.

Recientes

Más vistos

Etiquetas