Which Games Are Useful For Testing Artificial General Intelligence?

It is very hard to make progress on artificial intelligence without having a good AI problem to work on. And it is impossible to verify that your software is intelligent without testing it on a relevant problem. For those who work on artificial general intelligence, the attempt to make AI that is generally intelligent as opposed to "narrow AI" for specific tasks, it is crucial to have reliable and accurate benchmarks of general intelligence.

I have previously written about why games are ideal as intelligence tests for AI. Here'd I like to go into more depth about what sort of games we would like to use to test AI, specifically AGI (artificial general intelligence). These are the properties I think games used to test AGI should have:


  • They should be good games. Well-designed games are more entertaining and/or immersive because they challenge our brains better; according to converging theories from game design, developmental psychology and machine learning the fun in playing largely comes from learning the game while playing. A game that is well-designed for humans is therefore probably a better AI benchmark.
  • They should challenge a broad range of cognitive skills. Classical board games largely focus on a rather narrow set of reasoning and planning skills. Video games can challenge a broader set of cognitive skills, including not only reasoning and planning but also e.g. perception, timing, coordination, attention, and even language and social skills.
  • * Most importantly, they should not be one game. Developing AI for a single game has limited value for general AI, as it is very easy to "overfit" your solution to particular game by implementing various domain-specific solutions (or, as they're usually called, hacks). In the past, we've seen this development over and over with AI developed for particular games (though occasionally something of great general value appears out of research on a particular game, such as the Monte Carlo Tree Search (MCTS) algorithm being invented to play Go). Therefore it is important that AI agents are tested on many different games as part of the same benchmark. Preferably these would games that the AI developer does not even know about when developing the AI.


So let's look at how the main game-based AI benchmarks stack up against these criteria.

To begin with, there are a number of game-based AI benchmarks based on individual games. A pretty big number, in fact. The annual IEEE Conference on Computational Intelligence and Games hosts a number of game-based AI competitions, where the software can also be used offline as a benchmark. And of course, classic board games such as Chess, Checkers and Go have long been used as AI benchmarks. An interesting recent addition is Microsoft's Project Malmo, which uses Minecraft as the base for an AI sandbox/benchmark.

But these are all focused on individual games, and therefore not well suited to benchmark general intelligence. Let's talk about general game playing frameworks.

General Game Playing Competition


First we have the General Game Playing Competition and its associated software. This competition has been going since 2005, initiated by Michael Genesereth. For the competition, a game description language was developed for encoding the games; this language is a logic programming language similar to Prolog, and allows the definition of in theory any turn-based game with discrete world state. (Initially these games could not have any hidden information, but that restriction has since been overcome with new versions of the language.) In practice, almost all games defined in this language are fairly simple in scope, and could very broadly be described as board games.

To compete in the General Game Playing competition, you submit an agent that can play any game defined in this language. The agents have access to the full game description, and typically a large part of the agent development goes into analyzing the game to find useful ways of playing it. The actual game-playing typically uses MCTS or some closely related algorithm. New games (or variations of old games) are used for each competition, so that competitors cannot tune their AIs to specific games. However, the complexity of developing games in the very verbose game description language limits the number and novelty of these games.

Arcade Learning Environment


The second entry in our list is the Arcade Learning Environment (ALE). This is a framework built on an emulator of the classic Atari 2600 game console from 1977 (though there are plans to include emulation of other platforms in the future). Marc Bellemare and Michael Bowling developed the first version of this framework in 2012, but opted to not organize a competition based on it. Agents can interface to the ALE framework directly and play any of several dozen games; in principle, any of the several hundred released Atari 2600 games can be adapted to work with the framework. Agents are only given a raw image feed for input, plus the score of the game. To play a game in the ALE framework, your agent therefore has to decipher the screen in some way to find out what all the colorful pixels mean.

Three different games in the ALE framework.


Most famously, the ALE framework was used in Google DeepMind's Nature paper from last year where they showed that they could train convolutional deep networks to play many of the classic Atari games. Based only on rewards (score and winning/losing) these neural networks taught themselves to play games as complex as Breakout and Space Invaders. This was undeniably an impressive feat. Figuring out what action to take based only on the screen input is far from a trivial transform, and the analogue to the human visuomotor loop suggests itself. However, each neural network was trained using more than a month of game time, which is clearly more than would be expected of e.g. a human learner to learn to play a single game. It should also be pointed out that the Atari 2600 is a simple machine with only 128 bytes of RAM, typically 2 kilobytes of ROM per game and no random number generator (because it has no system clock). Why does it take so long time to learn to play such simple games?

Also note that the networks trained for each of these games was only capable of playing the specific game it was trained on. To play another game, a new network needs to be trained. In other words, we are not talking about general intelligence here, more like a way of easily creating task-specific narrow AI. Unfortunately the ALE benchmark is mostly used in this way; researchers train on a specific game and test their trained AI's performance on the same game, instead of on some other game. Overfitting, in machine learning terms. As only a fixed number of games are available (and developing new games for the Atari 2600 is anything but a walk in the park) it is very hard to counter this by enforcing that researchers test their agents on new games.

General Video Game AI Competition


Which brings us to the third and final entry on my list, the General Video Game AI Competition (GVGAI) and its associated software. Let me start by admitting that I am biased when discussing GVGAI. I was part of the group of researchers that defined the structure of the Video Game Description Language (VGDL) that is used in the competition, and I'm also part of the steering committee for the competition. After the original concepts were defined at a Dagstuhl meeting in 2012, the actual implementation of the language and software was done first by Tom Schaul and then mostly by Diego Perez. The actual competition ran for the first year in 2014. A team centered at the University of Essex (but also including members of my group at NYU) now contributes to the software, game library and competition organization.

The "Freeway" game in the GVGAI framework.

The basic idea of GVGAI is that agents should always be tested on games they were not developed for. Therefore we develop ten new games each time the competition is run; we currently have a set of 60 public games, and after every competition we release ten new games into the public set. Most of these games are similar to (or directly based on) early eighties-style arcade games, though some are puzzle games and some have more similarities to modern 2D indie games.

In contrast to ALE, an agent developed for the GVGAI framework gets access to the game state in a nicely parsed format, so that it does not need to spend resources understanding the screen capture. It also gets access to a simulator, so it can explore the future consequences of each move. However, in contrast to both ALE and GGP, agents do not currently get any preparation time, but needs to start playing new games immediately. In contrast to GGP, GVGAI bots do also not currently get access to the actual game description - they must explore the dynamics of the game by attempting to play it. This setup advantages different agents than the ALE framework. While the best ALE-playing agents are based on neural networks, the best GVGAI agents tend to be based on MCTS and similar statistical tree search approaches.

The game "Run" in GVGAI.

The GVGAI competition and framework is very much under active development, and in addition to the planning track of the competition (with the rules described above), there is now a two-player track and a learning track is in the works, where agents get time to adapt to a particular game. We also just ran the level generation track for the first time, where competitors submit level generators rather than game-playing agents, and more tracks are being discussed. Eventually, we want to be able to automatically generate new games for the framework, but this research yet has some way to go.

To sum up, is any of these three frameworks a useful benchmark for artificial general intelligence? Well, let us acknowledge the limitations first. None of them test things skills such as natural language understanding, story comprehension, emotional reasoning etc. However, for the skills they test, I think they each offer something unique. GGP is squarely focused on logical reasoning and planning in a somewhat limited game domain. ALE focuses on perception and to some extent planning in a very different game domain, and benefits from using the original video games developed for human players. I would like to argue that GVGAI tests the broadest range of cognitive skills through having the broadest range of different games, and also the best way of preventing overfitting through the simplicity of creating new games for the framework. But you should maybe take this statement with a pinch of salt as I am clearly biased, being heavily involved in the GVGAI competition. In any case, I think it is fair to say that using any of these frameworks clearly beats working on a single game if you are interested in making progress on AI in general, as opposed to a solution for a particular problem. (But by all means, go on working on the individual games as well - it's a lot of fun.)


Comments