The first major conquest of artificial intelligence was chess. The game has a dizzying number of possible combinations, but it was relatively tractable because it was structured by a set of clear rules. An algorithm could always have perfect knowledge of the state of the game and know every possible move that both it and its opponent could make. The state of the game could be evaluated just by looking at the board.
But many other games aren’t that simple. Take something like Pac-Man: figuring out the ideal move involves considering the shape of the maze, the location of the ghosts, the location of any additional areas to clear, the availability of power-ups, etc., and the best plan can end in disaster if Blinky or Clyde makes an unexpected move. We’ve developed AIs that can tackle these games, too, but they have had to take a very different approach from the ones that conquered chess and Go.
At least until now. Today, Google’s DeepMind division published a paper describing the structure of an AI that can tackle both chess and Atari classics.
The algorithms that have worked on games like chess and Go do their planning using a tree-based approach, in which they simply look ahead to all the branches that stem from different actions in the present. This approach is computationally expensive, and the algorithms rely on knowing the rules of the game, which allows them to project the current game status forward into possible future game states.
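The rule-based lookahead described above can be sketched in miniature. The toy game below (a simple subtraction game) is purely hypothetical and not DeepMind's code; the point is that when the rules are fully known, the search can enumerate every legal move and compute exact successor states.

```python
# Toy illustration of rule-based tree search (not DeepMind's code).
# Players alternately remove 1 or 2 stones; whoever takes the last stone wins.
# Because the rules are fully known, the search can project every move
# forward into an exact future state, just as chess engines do.

class SubtractionRules:
    def legal_moves(self, pile):
        return [m for m in (1, 2) if m <= pile]

    def apply(self, pile, move):
        return pile - move  # exact successor state, given by the rules

    def is_terminal(self, pile):
        return pile == 0

    def evaluate(self, pile):
        # At a terminal state the player to move has lost:
        # the opponent just took the last stone.
        return -1 if pile == 0 else 0


def negamax(state, rules):
    """Best achievable outcome for the player to move, via exhaustive lookahead."""
    if rules.is_terminal(state):
        return rules.evaluate(state)
    return max(-negamax(rules.apply(state, m), rules)
               for m in rules.legal_moves(state))
```

Here a pile of 3 is a lost position for the player to move, while a pile of 4 is won. In chess the tree is astronomically larger and must be truncated and approximated, but the principle is the same: known rules let the algorithm project the current state into exact futures.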
Other games have required algorithms that don’t really care about the state of the game. Instead, the algorithms simply evaluate what they “see”—typically, something like the position of pixels on a screen for an arcade game—and choose an action based on that. There’s no internal model of the state of the game, and the training process largely involves figuring out what response is appropriate given that information. There have been some attempts to model a game state based on inputs like the pixel information, but they’ve not done as well as the successful algorithms that just respond to what’s on-screen.
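The model-free alternative can be sketched as a direct mapping from pixels to action scores. The weights and array sizes below are invented stand-ins (random numbers in place of a trained deep network), but the structure shows the key property: there is no model of the game state anywhere, just observation in, action out.

```python
import numpy as np

# Schematic model-free policy: raw "pixels" in, action out, no game state.
# The random weights are a stand-in for what training would actually learn;
# the sizes (16 pixels, 4 actions) are invented for illustration.
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 4))  # 16 pixel values -> scores for 4 actions

def act(pixels):
    """Choose the action with the highest learned score for this screen."""
    scores = np.asarray(pixels) @ W
    return int(np.argmax(scores))
```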
The new system, which DeepMind is calling MuZero, is based in part on DeepMind’s work with the AlphaZero AI, which taught itself to master rule-based games like chess and Go. But MuZero also adds a new twist that makes it substantially more flexible.
That twist is called “model-based reinforcement learning.” In a system that uses this approach, the software uses what it can see of a game to build an internal model of the game state. Critically, that state isn’t prestructured based on any understanding of the game—the AI has a lot of flexibility regarding what information is or isn’t included in it. The reinforcement-learning part of the name refers to the training process, which lets the AI learn to recognize when the model it’s using is accurate and contains the information it needs to make decisions.
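In code, the core of that idea is two learned functions: one that encodes what the AI sees into a hidden state, and one that rolls that hidden state forward given an action, without ever consulting the game's rules. The sketch below uses fixed random matrices as stand-ins for trained networks, and all sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HIDDEN = 4, 8  # invented sizes for illustration
W_repr = rng.standard_normal((OBS, HIDDEN))        # stand-in for a trained net
W_dyn = rng.standard_normal((HIDDEN + 1, HIDDEN))  # stand-in for a trained net

def represent(observation):
    """Encode a raw observation into a learned hidden state.
    The hidden state has no imposed meaning: it is whatever vector
    turns out to support good predictions later."""
    return np.tanh(np.asarray(observation) @ W_repr)

def dynamics(hidden, action):
    """Advance the hidden state given an action, with no game rules in sight."""
    x = np.concatenate([hidden, [float(action)]])
    return np.tanh(x @ W_dyn)

state = represent([0.1, 0.5, -0.3, 0.9])
next_state = dynamics(state, action=1)
```

Training adjusts these functions so that the hidden states, however the network chooses to organize them, lead to accurate downstream predictions.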
The model it creates is used to make a number of predictions. These include the best possible move given the current state and the state of the game as a result of the move. Crucially, the prediction it makes is based on its internal model of game states—not the actual visual representation of the game, such as the location of chess pieces. The prediction itself is made based on past experience, which is also subject to training.
Finally, the value of the move is evaluated using the algorithm’s predictions of any immediate rewards gained from that move (the point value of a piece taken in chess, for example) and of the final state of the game, such as the win-or-lose outcome of chess. These evaluations can involve the same searches down trees of potential game states done by earlier chess algorithms, but in this case, the trees consist of the AI’s own internal game models.
If that’s confusing, you can also think of it this way: MuZero runs three evaluations in parallel. One (the policy process) chooses the next move given the current model of the game state. A second predicts the new state that results from that move, along with any immediate rewards. And a third draws on past experience to inform the policy decision. Each of these is the product of training, which focuses on minimizing the errors between these predictions and what actually happens in-game.
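Those three predictions, and the training signal that ties them together, can be written down schematically. The function below is a simplification of my own: the actual paper uses cross-entropy for the policy term and more elaborate scalar losses, while plain squared error keeps this sketch minimal.

```python
import numpy as np

def combined_loss(pred_policy, target_policy,
                  pred_value, target_value,
                  pred_reward, target_reward):
    """One error term per prediction head; training minimizes their sum.
    (Schematic: squared error everywhere, unlike the paper's actual losses.)"""
    policy_term = float(np.sum(
        (np.asarray(pred_policy) - np.asarray(target_policy)) ** 2))
    value_term = (pred_value - target_value) ** 2
    reward_term = (pred_reward - target_reward) ** 2
    return policy_term + value_term + reward_term
```

For example, a uniform policy guess against a one-hot target contributes 0.5, a value guess of 0.0 against an outcome of 1.0 contributes 1.0, and a correct reward prediction contributes nothing, for a total loss of 1.5; training pushes all three terms toward zero at once.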
Obviously, the folks at DeepMind would not have a paper in Nature if this didn’t work. MuZero took just under a million games against its predecessor, AlphaZero, to reach a similar level of performance in chess and shogi. For Go, it surpassed AlphaZero after only a half-million games. In all three of those cases, MuZero can be considered far superior to any human player.
But MuZero also excelled at a panel of Atari games, something that had previously required a completely different AI approach. Compared to the previous best algorithm, which doesn’t use an internal model at all, MuZero had a higher mean and median score in 42 out of the 57 games tested. So, while there are still some circumstances where it lags behind, it’s now made model-based AIs competitive in these games, while maintaining their ability to handle rule-based games like chess and Go.
Overall, this is an impressive achievement and an indication of how AIs are growing in sophistication. A few years back, training AIs at just one task, like recognizing a cat in photos, was an accomplishment. But now, we’re able to train multiple aspects of an AI at the same time—here, the algorithm that created the model, the one that chose the move, and the one that predicted future rewards were all trained simultaneously.
Partly, that’s the product of the availability of greater processing power, which makes playing millions of games of chess possible. But it’s also a recognition that this is what we need to do if an AI is ever going to be flexible enough to master multiple, distantly related tasks.