Tic-tac-toe and Connect-4 with Mini-Max

Branko Blagojevic

Published in ml-everything · 8 min read · January 29, 2018

NOTE: I'm moving this blog to a new home. Subscribe there to receive new posts or read older ones.

There is a lot of talk about machine learning techniques being used to beat people at games. From Deep Blue (chess) in the 1990s to AlphaGo (Go) more recently, computers seem to perform well against their human counterparts. Even games like poker are no longer quite safe from the machines.

In this series, I want to cover some of the basic machine learning methods used in games. Future posts will be about reinforcement learning, but in this first part we will focus only on exhaustive search and table lookup. While impractical for larger games, some form of table lookup remains at the heart of many reinforcement learning algorithms.

Let's start with tic-tac-toe and try to build a bot that can at least play it without losing.

Everything below is also available here, under Part 1 – Tic-tac-toe Minimax.ipynb.

Tic-tac-toe has 9 squares, each of which can take 3 possible values (X, O, or blank). Ignoring illegal setups (e.g. a board of all X's), there are at most 9 factorial (362,880) possible games and 3^9 (19,683) possible board states. The actual numbers are lower (by a factor of 100 or more) because some boards are impossible and symmetrically equivalent boards can be treated as the same, but we can ignore this because even the upper bound of the state space is easily manageable.
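(A quick sanity check of those counts in Python:)

import math

# Upper bounds on the tic-tac-toe state space discussed above
possible_games = math.factorial(9)   # 9! = 362,880 orderings of the nine squares
possible_boards = 3 ** 9             # 19,683 ways to mark 9 squares X, O, or blank

print(f'{possible_games:,} possible games, {possible_boards:,} possible boards (upper bounds)')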

Since the tic-tac-toe state space is so small, we can simply exhaust it and build a lookup table keyed on the board state and which player we are. Admittedly, xkcd beat me to it, but let's build one anyway.

Let's start with the game board. As tic-tac-toe boards go, it's pretty boring. It provides a user interface through a print function, and the squares are represented by the integers 1 to 9.

Below is an implementation you can use to play against a bot and see how it performs. It's pretty simple: it just requires a game object that can print itself, accept moves, and list legal moves. The bot needs a function called play that accepts the game object and returns a move.
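The notebook's classes aren't reproduced in this copy of the post, but a minimal sketch of the interface described above might look like the following (the names Board, legal_moves, and play_game, and the play(board, player) signature, are mine for illustration, not necessarily the original's):

class Board:
    """Minimal tic-tac-toe board: squares 1-9, ' ' means empty."""

    WINNING_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                     (0, 3, 6), (1, 4, 7), (2, 5, 8),
                     (0, 4, 8), (2, 4, 6)]

    def __init__(self):
        self.state = [' '] * 9

    def legal_moves(self):
        return [i + 1 for i, s in enumerate(self.state) if s == ' ']

    def move(self, square, player):
        self.state[square - 1] = player

    def undo(self, square):
        self.state[square - 1] = ' '

    def winner(self):
        for a, b, c in self.WINNING_LINES:
            if self.state[a] != ' ' and self.state[a] == self.state[b] == self.state[c]:
                return self.state[a]
        return None

    def print_board(self):
        rows = [self.state[i:i + 3] for i in (0, 3, 6)]
        print('\n---+---+---\n'.join(' ' + ' ¦ '.join(r) for r in rows))


def play_game(board, bot):
    """Text loop: the human picks a mark and enters squares 1-9; the bot answers."""
    human = ''
    while human not in ('X', 'O'):
        human = input("Select player: ['X', 'O'] ").upper()
    player = 'X'
    board.print_board()
    while board.winner() is None and board.legal_moves():
        if player == human:
            square = int(input('Enter move: '))
        else:
            square = bot.play(board, player)
        board.move(square, player)
        board.print_board()
        player = 'O' if player == 'X' else 'X'
    winner = board.winner()
    print(f'{winner} wins' if winner else "It's a tie")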

Now for the strategy. Since tic-tac-toe is a zero-sum game (what's good for me is bad for my opponent, and vice versa), we can use the mini-max decision rule and exhaust all possible moves. Mini-max basically asks: what is my best move (MAX), if my opponent follows it with the move that is worst for me (MIN), and I follow that with my best move (MAX), and so on. Below is an example of mini-max applied to a tic-tac-toe board.

[Figure: mini-max search tree for a tic-tac-toe position]

The states with a border are the ones selected at that level to be passed back up to the level above. Reading from the bottom up: the lowest branch has only one option, so that state is passed up. At the level above (a MIN level) we have two options, and since this is a minimizing level, we choose the worse of the two. The level above that is a MAX level with three options, two of which return -1 and one +1, so we choose the +1 since it is the largest value.

Of course there are more concise and elegant ways to write a generic mini-max rule, but below is a slightly more verbose mini-max implementation written specifically for our board:
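The embedded gist isn't reproduced in this copy, but a board-specific mini-max along the lines just described, built on the Board sketch above, could look roughly like this:

class MiniMaxBot:
    """Exhaustive mini-max for the Board sketch above (an illustrative stand-in for the original gist)."""

    def play(self, board, player):
        _, move = self._minimax(board, to_move=player, me=player)
        return move

    def _minimax(self, board, to_move, me):
        winner = board.winner()
        if winner is not None:
            return (1 if winner == me else -1), None   # win/loss from our perspective
        moves = board.legal_moves()
        if not moves:
            return 0, None                             # draw
        best_value, best_move = None, None
        for square in moves:
            board.move(square, to_move)
            value, _ = self._minimax(board, 'O' if to_move == 'X' else 'X', me)
            board.undo(square)
            # MAX on our own turns, MIN on the opponent's turns
            if best_value is None or (value > best_value if to_move == me else value < best_value):
                best_value, best_move = value, square
        return best_value, best_move

Something like play_game(Board(), MiniMaxBot()) then plays a full game against it.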

One quick, slightly odd note: this approach only works for games that have the Markov property. That means everything you need to know is contained in the state representation; there is no path dependence. This does not hold for many games. For example, in many games an object moves in a direction that is not apparent from a single frame. Whether the object is coming toward you or moving away changes what you should do, but you cannot tell the direction from a single image. Later we will come up with ways to deal with this problem.

How does it do? Well, tic-tac-toe is a draw under perfect play, so at the very least our bot shouldn't lose (the human is X).

Select player: ['X', 'O'] x
¦ ¦
---+---+---
¦ ¦
---+---+---
¦ ¦
Enter move: 5
O ¦ ¦
---+---+---
¦ X ¦
---+---+---
¦ ¦
Enter move: 2
O ¦ X ¦
---+---+---
¦ X ¦
---+---+---
¦ O ¦
Enter move: 6
O ¦ X ¦
---+---+---
O ¦ X ¦ X
---+---+---
¦ O ¦
Enter move: 7
O ¦ X ¦ O
---+---+---
O ¦ X ¦ X
---+---+---
X ¦ O ¦
Enter move: 9
O ¦ X ¦ O
---+---+---
O ¦ X ¦ X
---+---+---
X ¦ O ¦ X
It's a tie

Good! Let's see if the bot can win (human is O).

Select player: ['X', 'O'] o
X ¦ ¦
---+---+---
¦ ¦
---+---+---
¦ ¦
Enter move: 2
X ¦ O ¦
---+---+---
X ¦ ¦
---+---+---
¦ ¦
Enter move: 3
X ¦ O ¦ O
---+---+---
X ¦ X ¦
---+---+---
¦ ¦
Enter move: 6
X ¦ O ¦ O
---+---+---
X ¦ X ¦ O
---+---+---
X ¦ ¦
you lost :-(

Something interesting is going on here. It seems our bot is in no rush to win. If it is in a state where a winning move is available, but it could also win on a later move, it may decide to put the win off, since there is no longer any scenario in which it can lose. That's what happened after my first move.

The reason it is in no rush is that the way we wrote the decision rule places no value on time. To fix this, we can add a penalty for each time step. The value doesn't have to be large, just large enough that the rule favors game-ending moves. If the value is set too high, you can imagine the decision rule deliberately losing the game just to avoid paying the time penalty on future moves.

In the example below, a penalty of 0.1 is applied for each time step. Without the time penalty, the bot wouldn't care which action it takes, even though one of them takes the longest to end the game. With the time penalty, the bot is pushed into one of the two game-ending actions.

[Figure: mini-max tree with a 0.1 per-move time penalty]

Here is the time-valuing bot. Everything is the same except for the time_penalty argument and a few commented lines (lines 29:31).
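That gist isn't shown in this copy either, but the time-penalty idea, sketched on top of the MiniMaxBot above (the time_penalty name follows the text; the rest of the framing is mine), could look like this:

class TimePenaltyBot(MiniMaxBot):
    """Mini-max that charges a small penalty per elapsed move, so the bot prefers ending the game sooner."""

    def __init__(self, time_penalty=0.1):
        self.time_penalty = time_penalty

    def _minimax(self, board, to_move, me, depth=0):
        winner = board.winner()
        if winner is not None:
            # every elapsed ply costs time_penalty, so a delayed outcome is worth less
            return (1 if winner == me else -1) - self.time_penalty * depth, None
        moves = board.legal_moves()
        if not moves:
            return 0 - self.time_penalty * depth, None
        best_value, best_move = None, None
        for square in moves:
            board.move(square, to_move)
            value, _ = self._minimax(board, 'O' if to_move == 'X' else 'X', me, depth + 1)
            board.undo(square)
            if best_value is None or (value > best_value if to_move == me else value < best_value):
                best_value, best_move = value, square
        return best_value, best_move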

Let's see how it works:

Select player: ['X', 'O'] o
X ¦ ¦
---+---+---
¦ ¦
---+---+---
¦ ¦
Enter move: 2
X ¦ O ¦
---+---+---
X ¦ ¦
---+---+---
¦ ¦
Enter move: 3
X ¦ O ¦ O
---+---+---
X ¦ ¦
---+---+---
X ¦ ¦
you lost :-(

Yes, that's better. It goes for the quick kill. But something else odd is going on: the bot's first move is the upper-left corner. Any decent tic-tac-toe player knows this is an amateur move. Sure, you can still guarantee you won't lose, but the odds of winning are lower. Why doesn't the bot play the more traditional center opening?

The problem is that the bot assumes the opponent plays perfectly. What we might want is to tell the bot to account for sub-optimal play, even just a little, to nudge its decisions toward lines that increase the likelihood of winning if the opponent makes a mistake. There are several ways to do this, but a simple one is to weight the sub-optimal replies by some fraction (e.g. a sub-optimal weight of 0.1 means a move's value is 90% the optimal reply's value and 10% the average reply's value).

The implementation is below. Again, everything is the same except for accepting the sub-optimal weight and the annotated part (lines 38:41).
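As a sketch of that weighting idea (not the original gist), the value of an opponent node can be blended between the opponent's optimal reply and the average reply:

class SubOptimalBot(MiniMaxBot):
    """Mini-max that assumes the opponent may err: opponent nodes are valued as a blend
    of the best (minimum) reply and the average reply."""

    def __init__(self, sub_optimal_weight=0.1):
        self.sub_optimal_weight = sub_optimal_weight

    def _minimax(self, board, to_move, me):
        winner = board.winner()
        if winner is not None:
            return (1 if winner == me else -1), None
        moves = board.legal_moves()
        if not moves:
            return 0, None
        values = []
        for square in moves:
            board.move(square, to_move)
            value, _ = self._minimax(board, 'O' if to_move == 'X' else 'X', me)
            board.undo(square)
            values.append((value, square))
        if to_move == me:
            return max(values)                       # our own move: take the best outright
        # opponent's move: 90% their best (minimum) reply, 10% the average reply (for weight 0.1)
        optimal_value, optimal_move = min(values)
        average_value = sum(v for v, _ in values) / len(values)
        blended = (1 - self.sub_optimal_weight) * optimal_value + self.sub_optimal_weight * average_value
        return blended, optimal_move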

How does it do?

Select player: ['X', 'O'] o
¦ ¦
---+---+---
¦ X ¦
---+---+---
¦ ¦
Enter move: 1
O ¦ X ¦
---+---+---
¦ X ¦
---+---+---
¦ ¦
Enter move: 3
O ¦ X ¦ O
---+---+---
¦ X ¦
---+---+---
¦ X ¦
you lost :-(

Not bad…

The beauty of these bots is that they are fairly generic, so we can write our own games and apply them. I should point out, though, that since they rely on exhaustive search, the state space has to be quite small. I created a game class for Connect-4 that follows the same basic format as the tic-tac-toe board, which lets us reuse our bots and game-playing functions.
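The Connect-4 class itself isn't reproduced in this copy; a minimal board exposing the same interface as the tic-tac-toe sketch above (moves are column numbers 1-7; the names are again mine) could look like this:

class Connect4Board:
    """Minimal Connect-4 board with the same interface as the tic-tac-toe Board sketch."""

    def __init__(self, height=6, width=7):
        self.height, self.width = height, width
        self.columns = [[] for _ in range(width)]        # each column is a stack of 'X'/'O'

    def legal_moves(self):
        return [c + 1 for c, col in enumerate(self.columns) if len(col) < self.height]

    def move(self, column, player):
        self.columns[column - 1].append(player)

    def undo(self, column):
        self.columns[column - 1].pop()

    def _cell(self, row, col):
        column = self.columns[col]
        return column[row] if row < len(column) else ' '

    def winner(self):
        # check every occupied cell for four in a row vertically, horizontally, or diagonally
        for col in range(self.width):
            for row in range(len(self.columns[col])):
                player = self._cell(row, col)
                for dr, dc in ((1, 0), (0, 1), (1, 1), (-1, 1)):
                    if all(0 <= row + i * dr < self.height and 0 <= col + i * dc < self.width
                           and self._cell(row + i * dr, col + i * dc) == player for i in range(4)):
                        return player
        return None

    def print_board(self):
        for row in range(self.height - 1, -1, -1):
            print('|' + '|'.join(self._cell(row, c) for c in range(self.width)) + '|')
        print(' ' + ' '.join(str(c + 1) for c in range(self.width)))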

You can't really use mini-max on Connect-4 because the state space is too large. The standard board is 6 (height) x 7 (width) and has roughly 4,531,985,219,092 possible positions. So unless you need a good excuse to restart your computer, I don't recommend running the bots we've written on it as they are. However, we can limit the number of moves ahead our bot considers, so that at some point it simply gives up the search. Below I've edited our bot class to take a limit argument and stop the search when the depth exceeds this limit; it evaluates the state at that point and simply returns its value (lines 23:25).
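As a sketch of that limit (building on the MiniMaxBot sketch above; returning 0 for undecided positions is my simplification of "evaluates the state at that point"):

class LimitedMiniMaxBot(MiniMaxBot):
    """Mini-max that gives up after `limit` plies and treats undecided positions as neutral."""

    def __init__(self, limit=4):
        self.limit = limit

    def _minimax(self, board, to_move, me, depth=0):
        winner = board.winner()
        if winner is not None:
            return (1 if winner == me else -1), None
        moves = board.legal_moves()
        if not moves:
            return 0, None
        if depth >= self.limit:
            return 0, None                   # search limit reached: score the state as a draw
        best_value, best_move = None, None
        for square in moves:
            board.move(square, to_move)
            value, _ = self._minimax(board, 'O' if to_move == 'X' else 'X', me, depth + 1)
            board.undo(square)
            if best_value is None or (value > best_value if to_move == me else value < best_value):
                best_value, best_move = value, square
        return best_value, best_move

With that, something like play_game(Connect4Board(), LimitedMiniMaxBot(limit=4)) plays a (weak) game of Connect-4 without searching the full tree.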

A natural next step would be to make a better estimate of the board's value at that final, depth-limited step, but I'll leave it at that.

Many reinforcement learning algorithms boil down to this basic approach: take a board representation and search. That's it. The board can be a tic-tac-toe board, a 19x19 Go board, or a (210, 160, 3) array representing the pixel values of an Atari game. We take the state of the board, manipulate that state a great deal, and try to predict the expected future reward of a series of actions. Then we choose the action with the highest reward, the lowest risk, or whatever other consideration we care about. Much of the complication is in what we do with the state, whether passing it through a neural network or considering multiple frames. The other part is pruning, meaning we don't let our bots waste time on implausible, sub-optimal paths. So a bot will often play against another bot (or against the game logic itself), we record what happened given each state, and we run a search to try to better predict future rewards from those states.

Future posts will do just that, dealing with more complex games by manipulating pixel values. For those interested: a free reinforcement learning book is available online. It's very thorough and a great resource, as long as you don't mind heavier math notation and pseudocode (and no, I don't mean Python). There is also a great lecture series on YouTube from DeepMind (the Google team behind AlphaGo) that I recommend, as well as plenty of other resources if you go looking. Enjoy.

