Training a Deep Reinforcement Learning Agent to Play Snake

Those of us who used a Nokia mobile phone two decades ago will remember the Snake game, first introduced on the Nokia 6110. An adaptation of an arcade game from 1976, it eventually found its way onto 400 million phones. Indeed, there is even a “World Snake Day” for nostalgic fans to remember this bygone era.

But can you train a deep reinforcement learning agent to play the game? Data scientist Hennie de Harder decided to find out and chronicled her journey of pitting an agent against a Python version of the game in a blog post on Towards Data Science.

There’s an ML agent for that

One of the three basic machine learning paradigms, reinforcement learning is concerned with software agents that take actions in order to maximize a predefined reward.

As the name suggests, deep reinforcement learning combines deep learning and reinforcement learning to simulate how humans learn from experience.

She describes it this way: “Deep Learning uses artificial neural networks to map inputs to outputs… The network [consists] of layers with nodes. The first layer is the input layer. Then the hidden layers transform the data with weights and activation functions. The last layer is the output layer, where the target is predicted. By adjusting the weights, the network can learn patterns and improve its predictions.”
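To make that description concrete, here is a minimal sketch of such a feed-forward network in PyTorch. The framework and the layer sizes are illustrative assumptions, not details taken from de Harder's post.

```python
import torch.nn as nn

# A minimal feed-forward network: input layer -> hidden layers -> output layer.
# Layer sizes here are illustrative, not taken from the original post.
class QNetwork(nn.Module):
    def __init__(self, n_inputs=12, n_hidden=64, n_actions=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input layer -> first hidden layer
            nn.ReLU(),                       # activation function
            nn.Linear(n_hidden, n_hidden),   # second hidden layer
            nn.ReLU(),
            nn.Linear(n_hidden, n_actions),  # output layer: one value per action
        )

    def forward(self, state):
        return self.layers(state)
```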

Training the agent entails rewarding or penalizing it based on its actions, which is what de Harder focused on. Eating the apple nets 10 points, moving closer to the apple earns 1, moving away from it costs 1, and dying by hitting a wall or its own body costs 100.
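A reward scheme like that might look as follows in code. Only the reward values come from the post; the function name and its arguments are hypothetical.

```python
# Hedged sketch of the reward scheme described above; only the numeric values
# come from the post, the function signature is an assumption.
def compute_reward(ate_apple, died, old_distance, new_distance):
    if died:                       # hit a wall or the snake's own body
        return -100
    if ate_apple:                  # the head reached the apple
        return 10
    if new_distance < old_distance:
        return 1                   # moved closer to the apple
    return -1                      # moved away from the apple
```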

She also experimented with various tweaks to the state space, changing the feedback the agent gets about the status of the game in the hope of improving learning. None of the alternatives panned out, however: the agent improved more slowly with every option other than the original.
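For readers unfamiliar with the idea, a compact state space for Snake is often a small vector of flags describing the agent's immediate surroundings. The sketch below shows one common representation; the exact features de Harder used may differ.

```python
import numpy as np

# A hypothetical compact state vector of the kind often used for Snake DQN
# agents; the exact features in de Harder's experiments may differ.
def build_state(head, apple, danger_ahead, danger_left, danger_right, direction):
    # head and apple are (x, y) grid coordinates; direction is "up"/"down"/"left"/"right"
    return np.array([
        danger_ahead, danger_left, danger_right,      # immediate obstacles
        direction == "up", direction == "down",
        direction == "left", direction == "right",    # current heading
        apple[1] < head[1], apple[1] > head[1],       # apple above / below
        apple[0] < head[0], apple[0] > head[0],       # apple left / right
    ], dtype=np.float32)
```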

Success! Sort of

Learning from the experience of previous plays clearly paid off, with the agent making fast progress after only 30 games. Indeed, taking this away saw the agent fare very poorly, catching just three apples even after 10,000 games. In the experiments, a batch size of four worked best.
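Replaying past experience is typically implemented with a buffer of stored transitions that the agent samples from during training. Below is a minimal sketch of that pattern; the buffer size and the train_step callback are assumptions, and only the batch size of four reflects the experiments described above.

```python
import random
from collections import deque

# Experience replay buffer: store past transitions and revisit them in batches.
memory = deque(maxlen=100_000)   # buffer size is an assumption

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(train_step, batch_size=4):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)   # sample earlier plays at random
    for state, action, reward, next_state, done in batch:
        train_step(state, action, reward, next_state, done)
```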

But while the high score of between 40 and 60 after just 50 games is much better than a random agent – and probably many of us – it is far from the maximum theoretical score of 399. The reason? The agent cannot see the entire game board, so it tends to eventually enclose itself and lose, a risk that grows as the snake gets longer.

One possible improvement, suggests de Harder, would be to apply convolutional neural networks to the state space, giving the agent visibility over the entire game instead of being limited to nearby obstacles. But that is probably a whole new project by itself.
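As a rough idea of what that would involve, the sketch below feeds the whole board grid through a small convolutional network in PyTorch. The board size, channel counts and framework are assumptions, not details from the post.

```python
import torch.nn as nn

# Sketch of the suggested improvement: a convolutional network that takes the
# whole board as input instead of a handful of local features. All sizes here
# are assumptions.
class ConvQNetwork(nn.Module):
    def __init__(self, board_size=20, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1 channel: the grid
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(32 * board_size * board_size, n_actions)

    def forward(self, board):                 # board: (batch, 1, H, W)
        features = self.conv(board)
        return self.head(features.flatten(start_dim=1))
```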

You can access the code on GitHub.

Photo credit: Screenshot/playsnake.org