Jesse Hostetler, Dan Garmat, and I recently hacked on DeepMind's implementation of a neural network for the Arcade Learning Environment. The above video shows the agent learning to play a game called Breakout, in which the objective is to break the bricks at the top of the screen by bouncing a ball off a paddle, without letting the ball reach the bottom of the screen.

Our goal was to understand the network's model of the game with respect to the positions of the paddle and the ball. To do this, we compared the network's output for a true game state to its output for a series of "hallucinated" game states, which we constructed by moving either the paddle or the ball to a new position. From here it is best to proceed by example.

An animation of where the network thinks the paddle is located

In the animation shown above, we erased the true paddle from each frame. The white cloud at the bottom of the image marks the paddle positions that are consistent with the network's output from the true image: the brighter the cloud at a position, the more closely a paddle placed there reproduces that output. As you can see, the network doesn't distinguish between paddle positions until the ball is near enough to the bottom of the screen for the paddle's location to affect the game. If the ball is far from the bottom of the screen, the paddle's position doesn't matter, because the agent can always move the paddle under the ball before it is too late.
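To make the idea concrete, here is a minimal sketch of this kind of paddle sweep. It is not the code we ran: the frame encoding, the constants, the similarity score `exp(-distance)`, and especially `toy_q_values` (a hand-written stand-in for the trained network) are all assumptions made for illustration.

```python
import math

# All constants and the frame encoding below are illustrative assumptions.
HEIGHT, WIDTH = 10, 12
PADDLE_ROW = 9       # row holding the paddle
PADDLE_WIDTH = 2     # paddle width in pixels
PADDLE, BALL = 1, 2  # pixel values


def make_frame(ball_row, ball_col, paddle_col):
    """Build a toy frame containing one ball pixel and one paddle."""
    frame = [[0] * WIDTH for _ in range(HEIGHT)]
    frame[ball_row][ball_col] = BALL
    for col in range(paddle_col, paddle_col + PADDLE_WIDTH):
        frame[PADDLE_ROW][col] = PADDLE
    return frame


def hallucinate_paddle(frame, paddle_col):
    """Copy the frame, erase the true paddle, and draw one at paddle_col."""
    out = [row[:] for row in frame]
    out[PADDLE_ROW] = [0] * WIDTH
    for col in range(paddle_col, paddle_col + PADDLE_WIDTH):
        out[PADDLE_ROW][col] = PADDLE
    return out


def consistency_map(q_values, true_frame):
    """Score each candidate paddle column; 1.0 means the network's output
    on the hallucinated frame equals its output on the true frame."""
    reference = q_values(true_frame)
    scores = []
    for col in range(WIDTH - PADDLE_WIDTH + 1):
        candidate = q_values(hallucinate_paddle(true_frame, col))
        scores.append(math.exp(-math.dist(reference, candidate)))
    return scores


def toy_q_values(frame):
    """Stand-in for the trained network, NOT the real model.

    Returns values for (left, stay, right). It ignores the paddle while the
    ball is in the top half of the screen, mimicking the behavior we saw.
    """
    ball_row, ball_col = next(
        (r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v == BALL
    )
    paddle_col = frame[PADDLE_ROW].index(PADDLE)
    if ball_row < HEIGHT // 2:
        return [0.0, 0.0, 0.0]
    offset = float(ball_col - paddle_col)
    return [-offset, 0.0, offset]


# Ball near the top: every paddle position is equally consistent (flat cloud).
far = consistency_map(toy_q_values, make_frame(1, 5, 3))
# Ball near the bottom: the cloud peaks at the true paddle column.
near = consistency_map(toy_q_values, make_frame(7, 5, 3))
```

Plotting `near` as brightness along the paddle row would give a toy version of the white cloud. With the real network, `q_values` would be a forward pass on the preprocessed frames.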

Hallucination of ball position 1

Similarly, we compared the network's output from a true image to its output from a series of images with hallucinated ball positions. The images shown above give the true game state (left) and the hallucinated game state (right). Notice that the chaotic movements of the paddle on the left occur when the network momentarily believes the ball is located in that vicinity.
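One simple way to probe how a hallucinated ball changes the agent's behavior is to check where moving the ball flips the greedy action. The sketch below is an illustration under assumptions, not our analysis code: the action set, the frame encoding, and `toy_q_values` (a hand-written stand-in for the trained network) are all hypothetical.

```python
# All constants, the action set, and the frame encoding are assumptions.
HEIGHT, WIDTH = 10, 12
PADDLE_ROW = 9
PADDLE, BALL = 1, 2
ACTIONS = ("left", "stay", "right")


def make_frame(ball_row, ball_col, paddle_col):
    """Build a toy frame with one ball pixel and a two-pixel paddle."""
    frame = [[0] * WIDTH for _ in range(HEIGHT)]
    frame[ball_row][ball_col] = BALL
    frame[PADDLE_ROW][paddle_col] = PADDLE
    frame[PADDLE_ROW][paddle_col + 1] = PADDLE
    return frame


def move_ball(frame, row, col):
    """Copy the frame, erase the true ball, and draw one at (row, col)."""
    out = [[0 if v == BALL else v for v in r] for r in frame]
    out[row][col] = BALL
    return out


def greedy_action(q_values, frame):
    """Name of the action with the highest value for this frame."""
    q = q_values(frame)
    return ACTIONS[q.index(max(q))]


def action_flips(q_values, true_frame, candidate_positions):
    """Return the greedy action on the true frame, plus a dict mapping each
    hallucinated ball position that changes it to the new greedy action."""
    baseline = greedy_action(q_values, true_frame)
    flips = {}
    for row, col in candidate_positions:
        action = greedy_action(q_values, move_ball(true_frame, row, col))
        if action != baseline:
            flips[(row, col)] = action
    return baseline, flips


def toy_q_values(frame):
    """Stand-in for the trained network, NOT the real model.

    Chases the ball only once it is in the bottom half of the screen.
    """
    ball_row, ball_col = next(
        (r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v == BALL
    )
    paddle_col = frame[PADDLE_ROW].index(PADDLE)
    if ball_row < HEIGHT // 2:
        return [0.0, 0.5, 0.0]
    offset = float(ball_col - paddle_col)
    return [-offset, 0.5, offset]


true_frame = make_frame(7, 5, 3)  # ball to the right of the paddle
baseline, flips = action_flips(
    toy_q_values, true_frame, [(7, 1), (7, 3), (7, 8), (2, 5)]
)
```

In this toy setup the baseline action is "right", and hallucinating the ball to the left of the paddle flips it to "left". A frame-by-frame sequence of such flips is the kind of thing that would show up as jittery paddle movement.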

These results demonstrate that the network learned a "simple" policy of placing the paddle under the ball at the last possible moment. I suspect that if we model the bricks in the same way, we will find that the agent has no model of the game state in the top half of the screen.


Note: contact me if you want to continue this work. We had to hack through several implementation issues to generate this analysis. You will want our notes.