Introduction
The paper reviewed in this article is "Playing Atari with Deep Reinforcement Learning" by Mnih et al., which describes the first time Deep Neural Networks (DNNs) were integrated with Reinforcement Learning.
Q-Learning has been used as a solution to Markov Decision Processes (see Markov Decision Processes): it is a reinforcement learning (RL) technique for determining the best policy an agent can follow to reap the maximum reward from its moves (see Rationalizing Q-Learning).
RL has provided impressive learning capabilities in games like Go and Backgammon, but this success has not transferred to all games, and it sometimes requires specific knowledge about the nature of the game in order to learn successfully.
This paper proposes using a non-linear function approximator (a neural network) to estimate and represent the Q-value function, as opposed to a traditional linear function.
The traditional means of updating or improving the estimate of the Q-value function, the iterative Bellman update, is replaced with gradient descent that trains the neural network acting as the Q-value function estimator (and, through Experience Replay, the training does not use sequential experienced samples).
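As a rough sketch of the contrast (notation paraphrased rather than quoted from the paper): the tabular Q-Learning update versus the loss the network parameters are trained to minimise by gradient descent, where D is the replay memory and the target uses the parameters from the previous iteration.

```latex
% Tabular Q-Learning (iterative Bellman-style update):
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]

% DQN: replace the table with a network Q(s,a;\theta) and minimise the squared
% temporal-difference error over transitions sampled from replay memory \mathcal{D}:
y_i = r + \gamma \max_{a'} Q(s',a';\theta_{i-1}), \qquad
L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \Big[ \big( y_i - Q(s,a;\theta_i) \big)^2 \Big]
```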
The approach I've used to structure my review process is outlined in Research Review Process.
- Research question
- Research aim
- Type of research
- Mode of enquiry
- Solution
- Methodology
- Data
- Information
- Knowledge
- Correlation vs Causation
- Literature review
- Reasoning method
- Subjectivity/Objectivity
- Relevance, Contribution, Originality and Novelty
- Implications & Contributions
Research question
Can past reinforcement learning successes in specific games (Backgammon, Go, etc.) be improved upon by using CNNs operating on sensory observations, and can the approach generalise to more games?
Research aim
The paper seeks to show that it is possible to combine a DNN operating on image (sensory) data with RL, specifically by teaching an automated player/agent to choose the best moves in multiple games, thereby demonstrating the effectiveness of an algorithm that couples a DNN with Q-Learning.
Problem
How to use a DNN with reinforcement learning.
There are inherent challenges to using Deep Neural Networks in reinforcement learning that need to be overcome:
- In RL there are no copious amounts of hand-labelled training data, which would otherwise be used for supervised learning with a CNN.
- DNNs expect distinct, independent (uncorrelated) training samples, whereas training on just-in-time, real-time data produces consecutive samples that are highly correlated with each other.
- RL usually learns from a sparse, scalar reward signal (e.g. via the Bellman update) rather than from richly labelled images.
- The delay between an action and its resulting reward can be thousands of timesteps long, compared to the direct association between inputs and labels in supervised neural network training.
Type of research
This is primarily quantitative/empirical research.
Mode of enquiry
Scientific. The paper uses experiments, a systematic process (the algorithm) for driving those experiments, and a systematic process for obtaining and processing images/observations.
The paper also draws on formal models: Markov Decision Processes to represent the games, and reinforcement learning techniques such as Q-Learning.
Solution
Use a combination of Q-Learning, Experience Replay and CNNs as an approach to implementing deep learning within a traditional reinforcement learning setting:
- Use Experience Replay to mitigate the problem of highly correlated sample input from the real-time environment.
- Replace the traditional representation of the Q-value function (learned via the Bellman update) with a neural network, which is then called a Q-Network or DQN (Deep Q-Network).
- Train the network by minimizing the difference between the DQN's predicted Q-value and a target Q-value derived from the agent's actual experienced reward; it is this difference that is backpropagated into the network (a minimal sketch follows this list).
- A mapping from a history of move-results (observations captured as images) is used to predict the value of the available actions.
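A minimal sketch of this combination, assuming a PyTorch-style Q-network (`q_net`) that maps a batch of stacked-frame states to one Q-value per action; the class names, buffer size and hyperparameters here are illustrative, not the paper's exact configuration.

```python
# Sketch: experience replay plus a gradient-descent update on the Q-network.
import random
from collections import deque

import torch
import torch.nn.functional as F


class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions so that
    training can sample them uniformly at random, breaking the correlation
    between consecutive frames."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states), torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))


def dqn_update(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient-descent step standing in for the tabular Bellman update."""
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Q-value the network predicts for the action the agent actually took.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target derived from the experienced reward plus the discounted best
    # predicted value of the next state (zero if the episode ended).
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    # Backpropagate the difference between prediction and target.
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```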
Methodology
Experimental, design-science based research, i.e. experimental development and performance evaluation of an artefact.
Research Methods
- Real-time experimental game-play and simulation
- Reinforcement learning (Machine learning)
- Algorithm development
- Mathematical Modeling
Primarily experimental methods that yield quantitative measurements of performance (rewards accumulated as the game learning progresses).
Seven popular ATARI games – Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders
A virtual emulator of various games is controlled programmatically by an algorithm that instructs the player to move based on the value of past moves (a history of move-results), represented only as sequences of past actions and their resulting screens.
The process has access to the history of results of moves (rewards and the resulting images in the emulator); these are randomly sampled via the Experience Replay mechanism and fed to a neural network that estimates the Q-value function over actions. Improvement comes from feeding in prior moves together with the rewards they produced and checking whether the network can predict values consistent with what the agent experienced.
The algorithm that is developed replaces the traditional Bellman update for a linear Q-value function with gradient descent, improving the DQN's estimate of the Q-value based on the agent's actual experiences.
Experiments are setup to play (by controlling a player/agent) a variety of games using the Atari 2600 Arcade Emulator and the algorithm that drives it. ("...it learned from nothing but the video input, the reward and terminal signals, and the set of possible actions—just as a human player would")
Research techniques
-
Neural Networks (Convolutional Neural Network)
- Gradient decent network updates instead of iterative Bellman Update
-
Q-Learning reinforcement learning (to solve Markov Decision Process)
-
Experience Replay (dealing with real-time samples)
- Frame-skipping
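A minimal sketch of frame-skipping, assuming a Gym-style environment with `reset`/`step`; the wrapper and the default skip of 4 are illustrative rather than the paper's exact code.

```python
# Sketch: frame-skipping wrapper. The agent chooses an action only every k-th
# frame; the chosen action is repeated on the skipped frames and the rewards
# earned in between are accumulated.
class FrameSkip:
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, done, obs, info = 0.0, False, None, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```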
Data
The actual output screens (images) rendered by playing game moves, obtained by capturing frames before and after the agent moves.
These are transformed into simpler representations (greyscale, reduced in size). This preprocessing simplifies the data so that the CNN can learn from it more effectively:
Initially the input is a 210x160 8-bit colour image, which is converted to greyscale, down-sampled to 110x84 and then cropped to the square 84x84 region required by the convolution kernel configuration used by the CNN ("high dimensional visual input (210 × 160 RGB video at 60Hz)").
The last 4 frames prior to making a move are captured and stacked, giving the network an 84x84x4 input.
The association between a history of observations (moves and their resulting screens) and the Q-value inferred from the reward for the move the agent made is what is learned; any difference between the predicted and inferred values is propagated back into the Q-network's weights via backpropagation and gradient descent.
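A minimal sketch of this preprocessing, using OpenCV for the grayscale conversion and resizing; the crop offset and the frame-stacking helper are assumptions for illustration, not values taken from the paper.

```python
# Sketch: preprocess a raw 210x160 RGB Atari frame into an 84x84 grayscale
# image and stack the last 4 processed frames into an 84x84x4 state.
from collections import deque

import cv2
import numpy as np


def preprocess(frame: np.ndarray, crop_top: int = 18) -> np.ndarray:
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # 210x160 RGB -> grayscale
    small = cv2.resize(gray, (84, 110))              # down-sample to 110x84
    return small[crop_top:crop_top + 84, :]          # crop a square 84x84 region


class FrameStack:
    """Keeps the 4 most recent preprocessed frames as the network input."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def push(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(frame))
        while len(self.frames) < self.frames.maxlen:  # pad at episode start
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=-1)         # shape (84, 84, 4)
```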
Information
- Four frames prior to the agent's move are recorded/observed to allow the neural network to 'see' motion (e.g. a change in direction) in the input images
- Prior RL techniques such as Sarsa, Contingency and HNeat used hand-crafted features from the games
- DQNs outperform prior RL algorithms on all seven games
Knowledge
- DQNs are superior at learning compared to classic RL algorithms such as Q-Learning and SARSA
- Experience Replay as a mechanism can mitigate the problem of correlated states in real-time environments.
- Experience Replay is also crucial for sampling and re-using past experiences multiple times, which avoids forgetting and preserves infrequent but high-value past experiences for re-training the adapting Q-Network.
- The Bellman update can be replaced with gradient descent learning on the DQN by minimizing the difference between the value derived from the experienced reward and the predicted value.
- You can learn the next move based on the history of past moves represented as images over time (a sketch of such a network follows this list).
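A sketch of a convolutional Q-network of this kind, mapping a stack of the last four 84x84 frames to one Q-value per action; the layer sizes follow the paper's description of its architecture, but the PyTorch framing and the example usage are my own assumptions.

```python
# Sketch: a CNN that maps the last four preprocessed 84x84 frames to one
# Q-value per action, so a history of past moves (as images) determines the
# estimated value of each possible next move.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84) tensor of stacked grayscale frames
        return self.net(frames)


# Example: pick the greedy action for a single stacked-frame state.
q_net = QNetwork(n_actions=6)
state = torch.zeros(1, 4, 84, 84)
action = q_net(state).argmax(dim=1).item()
```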
Correlation vs Causation
- The average accumulated reward was used to track the effectiveness of the agents over the long term, i.e. over multiple episodes of learning.
- As time increased, i.e. as more episodes were used for training, the rate of accumulated reward increased, suggesting that the Q-value function was improving as the games were being played.
- As the learning was based on the actual experiences of the agent, these measurements of reward were used to improve the Q-value function (Q-Network).
Literature review
Referenced papers

Figure: Referenced paper chronology
Citations
- Retrospectively, this is an influential paper as it is the first approach to integrating reinforcement learning with deep neural networks.
- The paper appears to be highly cited on Google Scholar and arXiv. There is no representation in other notable venues such as IEEE or ACM.
- The authors are all DeepMind researchers who are well respected in this field.
Reasoning method
Deduction. The authors reasoned theoretically that they could replace the linear Q-value function with a neural network, provided they could somehow adapt the RL learning process, which is inherently real-time and provides highly correlated samples of state, to DNNs, which show potential but are not easily usable under those real-time conditions. They also reason that hardware has improved since RL was last applied in this way.
They conduct experiments, and the experimental evidence suggests that the algorithm learns to play the games better than all previous approaches on six of the games, with three beating even expert human players, suggesting that the algorithm and the use of a CNN objectively improve learning in RL environments.
Subjectivity/Objectivity
Construct Validity
Strictly speaking, while the research describes its input as sensory data, it is not data sensed from the agent's own perspective (i.e. what the agent itself sees or hears); it is a generalised observation of the environment resulting from an agent action. This could be misleading, since sensory data usually refers to what the sensing entity itself perceives, not what an external observer records. Here it is the player's observation, i.e. what the player/algorithm sees, not the agent.
Internal Validity
No obvious flaws
Research Correctness
- There are about 24 referenced papers, most from around the time of publication, which suggests the work drew on the state of the art at the time.
- The creation of a systematic algorithm means this work is repeatable and can be reproduced.
- A standard process for obtaining image data from playing games is used through the emulator environment (the Atari 2600 emulator), which has been used in prior research.
- The learning uses a CNN model whose parameters/settings do not change across the experiments; the same is true of the algorithm.
- Only what a human player would be expected to input (moves) is required, and the learning approach only observes the results.
Research technique
No obvious flaws
Research techniques vs research question
TBD
Conclusion vs methods
- The approach is biased towards simple games and does not appear to handle more complex games that require more advanced strategies.
- The paper itself is 12 years old, so while the approach is useful for understanding how the first integration of DNNs (CNNs specifically) in an online learning capacity was achieved, it is unlikely to be optimal today.
External Validity
- This approach uses data that is specific to Atari-type games, i.e. 8-bit graphics. It is not clear that it can be generalized to more advanced, higher-bit graphics and resolutions, such as 24-bit. Unlike the Atari games in this emulator, not all games have a simple concept of a score, so the approach would be difficult to apply to those types of games; they are instead measured by other means that might not be readily obtainable from an emulator, which adds to the sense that this approach is tied to relatively simple Atari games. It does, however, show that the algorithm can learn well using game images/screens.
- Game complexity. It is not clear whether this approach will work for more complex games such as RTS games, where not only are there many more actions (into the hundreds), but rewards are not directly evident from immediate feedback (which Q-Learning relies on), as they can arrive only near the end of the game. In such games the environment can also change less predictably, with smarter enemies altering their behaviour each time. The authors quote the aim as being to "learn successful control policies from raw video data in complex RL environments"...
Data Validity
Data objectivity
- Screen images are sensory representations of what happens in the game, so images are an appropriate way to assess the value of the move that produced a given screen.
- The data is produced using a consistent approach - a controlled sampling of the last 4 game frames prior to each move/decision via controlling the emulator.
Data subjectivity (specificity/narrowness)
- The data is restricted to 8-bit images
- Only certain games are used in the research (seven popular ATARI games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders)
Data vs Research Question
- How can a DNN be used in traditional Reinforcement learning?
- Image data is used to map the results of the player's actions (observations as screen images) to labels, which in this case are the predicted Q-values representing the value of those actions. This makes sense.
- RL has previously been applied successfully to games (notably TD-Gammon), and image output from the Atari emulator has been used in other RL research.
Summary of general risks to validity
- Only works for Atari games with image data coming from Atari 2600 emulator
- Only single sensory source of information used - output from emulator.
- The approach does not learn strategies well in more complex games.
- The design of the CNN is simple and is likely to be out of date.
Credibility concerns
None
Relevance, Contribution, Originality and Novelty
- Superior learning over past algorithms (Q-Learning and SARSA)
- A CNN can be used in RL (online learning) and learn the reward from a randomly sampled mini-batch of experiences (Experience Replay)
Implications & Contributions
- First time DNNs integrated into RL using Q-Learning
Opinion
- Useful introduction to integrating a CNN with RL as a means to bridge the gap between a traditional offline model (a CNN) and online, model-free learning.
- Unlikely to be performant today and likely superseded by more recent approaches.
- Struggles to learn games that are more advanced.