Q-Learning is a method for solving Markov Decision Processes (MDPs) (see Markov Decision Processes), i.e., its goal is to determine (learn) the best policy, i.e., the moves an agent should take to reap the maximum reward, which in most cases means reaching the goal in the most efficient way possible.
In this way, optimization is a key aspect, and it gives rise to the equation used to achieve it in Q-Learning: the Bellman Optimality equation.
In Q-Learning, the Bellman Optimality equation is used to iteratively (that is, with every move) re-evaluate and re-compute the value of each move the agent takes, based on the reward observed after taking that move. The value of a move is its Q-Value, and the Q-Values of all moves are represented by the Q-Function. Initially these values are random and are not a true indication of the value of the moves. See the "Visualizing Q-Values" section in Rationalizing Q-Learning for how the values of moves are determined.
The Bellman Optimality equation teaches the Q-Function to value the moves the agent takes more accurately, based on how good each move is (its Q-Value). The agent interprets the value of a move through the rewards it sees as a result of taking that move. The Bellman Optimality equation updates the value of the move and so refines the Q-Function over time (see Rationalizing Q-Learning for a description of this equation):
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman Optimality Equation} \\ \]
As the Q-Value function is updated iteratively, each iteration builds on the results of previous ones by leaving traces of its updates in the Q-Value function, so they can be used in the following iterations. This is how past results remain readily available in the Q-Function, i.e., the Q-Function mutates over time.
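To make the iterative update concrete, here is a minimal sketch of tabular Q-Learning in Python. The environment interface (`env.reset()`, `env.step(action)`, `env.actions`) and the hyperparameter values are illustrative assumptions for this sketch, not part of the original description.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: apply the Bellman update after every move."""
    Q = defaultdict(float)  # Q[(state, action)] starts at 0 (arbitrary initial values)

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore randomly with probability epsilon, otherwise take the best known move
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)  # assumed environment interface

            # Bellman update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
```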
In this way, Q-Learning is an algorithmic model of Reinforcement Learning (RL), which is a discipline of Machine Learning, which in turn is a branch of Artificial Intelligence (AI).
In Q-Networks (first described here and reviewed here), the Q-Value function still exists, i.e., it holds the value of any move and continually evolves as the agent iteratively updates it. However, instead of using the Bellman Update to iteratively update the Q-Function from the agent's sequential moves, a Convolutional Neural Network (CNN) is used to estimate the Q-Value function, and therefore the model of move values.
However, it is not possible to simply drop in a CNN, because MDPs are simulations of real-time decision making that occurs step-by-step as the agent moves, and there is no explicit notion of training images (as required for CNNs) in MDPs and RL (only scalar rewards and an accumulating progression of such rewards).
The problem with an agent making sequential moves, from a CNN's point of view, is that if you were to represent each move as an image, the sampled sequential images would look very similar to one another, i.e., each would differ only slightly from the last. They are highly correlated, which is known to lead to unstable network updates (poor learning). This is exactly why, during training, it is best to provide variation in the training data, so that the network can generalize from the differences seen across samples. It is also why, when training a CNN, one typically augments the training input (blurring, rotation, etc.) to more readily differentiate each sample from the others. Sampling data in real time and passing it directly to the CNN is therefore a problem.
Therefore Q-Networks deviate from the Bellman Update not only in that they replace it, but also in that they do not refine/learn the Q-Function from what the agent has just done (the moves it just took), due to the correlated nature of sequential images. Instead, they learn from random results that occurred previously in time, i.e., the agent keeps a history of the results of its past moves (and the associated screen images) and learns from those, while saving what it is currently experiencing to this history of moves in real time.
This means the Q-Network is not learning about every experience as it experiences it, which is in distinct contrast to Q-Learning. This random-sampling strategy of learning from the move history is called Experience Replay.
In real time, the agent adds its current experience to the Replay Buffer as a 4-tuple consisting of the current state (\(\phi\)), the move taken (\(a\)), the reward received (\(r\)) and the resulting state (\(\phi_{n+1}\)).
During training/learning, while adding its current experiences to the Replay Buffer, the agent also randomly pulls samples from this buffer to learn from. The key is that the random samples are not sequential/consecutive and so are far less likely to be correlated. For example, here is a specific sample hypothetically taken from the Replay Buffer for learning purposes:
\[ (\phi_j, a_j, r_j, \phi_{j+1}) \\ \tag{a random experience sample from the Replay Buffer} \\ \]
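As a rough sketch, such a Replay Buffer can be implemented as a bounded queue of these 4-tuples plus a uniform random sampler; the capacity and batch size below are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) experiences and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def add(self, state, action, reward, next_state):
        # Called in real time as the agent plays
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Random, non-consecutive samples break the correlation between sequential frames
        return random.sample(self.buffer, batch_size)
```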
From this sample, the algorithm passes the sampled "before" state \(\phi_j\) (the state stored with that past experience) as the input to the CNN (acting as the Q-Value function) in order to compute/predict the value of the move taken, i.e., its Q-Value: \(Q(\phi_j, a_j)\).
If the CNN has learned well, the predicted value will be similar to the real value of the move, i.e., what the Bellman equation would calculate it as: \(y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta)\)
That is, it considers the immediate reward of the move (\(r_j\)) and the best possible action from the resulting state \(\phi_{j+1}\) to calculate the move's corresponding Q-Value, and checks whether the CNN can produce a comparable value. The squared difference between what the CNN produced and what it should be (\(y_j\)) is backpropagated as weight updates to the network accordingly, to refine it:
\[ (y_j - Q(\phi_j, a_j; \theta))^2 \\ \tag{Squared difference loss function} \\ \]
In this way, the Q-Network (CNN) is taught in real time by updating it with gradient descent and backpropagation, as opposed to the iterative updates of the Bellman Optimality equation, to estimate the Q-Value function.
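The following is a minimal sketch of one such training step, using PyTorch as an assumed framework; `q_net`, `optimizer` and the batch layout are illustrative names. It uses the same network parameters \(\theta\) for the targets, matching the equation above, and omits terminal-state handling for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on a random batch sampled from the Replay Buffer."""
    states, actions, rewards, next_states = batch  # tensors stacked from sampled 4-tuples

    # Q(phi_j, a_j; theta): the network's current estimate for the moves actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j + gamma * max_a' Q(phi_{j+1}, a'; theta): the Bellman-style target
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values

    # Squared-difference loss, backpropagated to refine the network weights
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```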
Here is a drawing of the architecture to better rationalize the above description:
Figure: Architecture of playing Atari with a Q-Network
In a nutshell, the Q-Network learns the value of possible moves from possible initial states. The agent then consults this evolving Q-Function for the best moves. The policy/strategy in MDPs is always to maximize the discounted accumulated reward, and this is achieved if the agent only takes the moves that have the highest Q-Value (by consulting the Q-Function).
When the Q-Function learns well, it ultimately converges to a version of itself called the optimal Q-Value function, which defines the highest Q-Values for states (through past learning). If those states are visited by the agent, the agent will consult the Q-Function, implicitly take the highest-Q-Value move, and therefore follow a sequence of moves that gets it to its reward (the maximum discounted accumulated reward).
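In symbols, once the Q-Function has converged to the optimal Q-Value function \(Q^*\), the agent's implicit policy is simply to take the highest-valued move in whichever state it finds itself:

\[ \pi^*(s) = \arg\max_{a} Q^*(s,a) \\ \tag{Greedy policy from the optimal Q-Value function} \\ \]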
Q-Networks can be briefly summarized as:
- Trying to solve the goal of an MDP (see Markov Decision Processes)
- Maximize the discounted accumulated reward
- Is used in online learning (a dynamic, experience-generated dataset)
- The agent consults an evolving Q-Function for the best moves (see the action-selection sketch after this list)
- The value of a move is determined by calculating its Q-Value (see Visualizing Q-Values in Understanding Q-Learning)
- Q-Function learns the best moves by determining their Q-Values while the agent moves (Explores)
- Q-Function is replaced by a Q-Network which conceptually is the same thing but is a CNN instead of a linear function
- Experience Replay is used to overcome the challenges that training a CNN faces in online learning (real-time/dynamic environments)
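As referenced in the list above, consulting the Q-Network for the best move (while still occasionally exploring) might look like the following sketch; the \(\epsilon\) value, the tensor shapes and the function name are illustrative assumptions.

```python
import random
import torch

def select_action(q_net, state, actions, epsilon=0.1):
    """Epsilon-greedy action selection: explore randomly, otherwise consult the Q-Network."""
    if random.random() < epsilon:
        # Explore: try a random move
        return random.choice(actions)
    # Exploit: ask the Q-Network (CNN) for the Q-Value of every move in this state
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0)).squeeze(0)  # shape: (num_actions,)
    return actions[int(torch.argmax(q_values))]
```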
From a philosophical stand-point:
- There is an implicit goal and reward: make moves that have the highest Q-Value in order to progress toward maximizing the discounted accumulated reward.
- The decision-making that underlies the policy/strategy in Q-Learning is simple: no other factors, conditions or situations are considered, only maximizing the discounted accumulated reward.
- In Q-Learning the agent 'explores' by initially randomly trying moves and then following a route of the best moves according to the Q-Value function
- Sensory observation is based on observing/seeing the last 4 frames/screens of the emulator (sketched after this list).
- The algorithm that represents the agent is not flexible; it has only one way to value moves
- This is a model-free, online learning approach
- 'Knowledge' is represented as the evolving Q-Function. This represents prior iterative experiences that contribute to the current state of knowledge (Q-Values). In many ways it appears that anything historical which influences the present is likely to be knowledge.
- Failing to meet expectations is implicitly represented as a prompt for learning, i.e., minimizing the loss function between expected and actual outcomes.
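A sketch of how the "last 4 frames" observation mentioned above could be maintained; the class and method names are illustrative, and the grayscale conversion and downsampling that DQN preprocessing also performs are omitted here.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keeps the last 4 frames as the agent's observation (phi)."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame at the start of an episode
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # Called after every move with the newest emulator screen
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape: (4, height, width) -- the input phi given to the CNN
        return np.stack(self.frames, axis=0)
```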
From a personal research stand-point:
- My first introduction to a mechanism of deep learning with reinforcement learning
- Ties together important dependent concepts such as Q-Learning, Q-Values, online learning, model-free learning, etc.
- First view of a formal depiction of modeling game playing as an online learning approach using Q-Learning and Markov Decision Processes.
A few notable limitations of Deep Q-Networks (DQNs):
- DQNs require an "extraordinary high number of training examples", with training taking from a few hours to a few days (2002, A Systematic Study of Deep Q-Networks and Its Variations)
