I've recently been trying to piece together what I've learned about how reinforcement learning is actually implemented algorithmically.
The fundamental idea is to simulate decision-making in an agent, specifically decision-making in which the agent learns (and then makes) the moves that are most beneficial to it. This is an admirable and useful decision-making process to try to simulate. It also means the agent must learn what 'good' moves are.
Generally, a 'good' move is one that yields a reward. The learning part involves trying moves, evaluating their resulting rewards, and noting them for future visits. This means there is storage involved, and that reinforcement learning is iterative: the moves are trialled multiple times.
This trial-and-error takes the form of the agent performing many state transitions (moves) over time; a run of transitions from a starting state to a terminal state is called an episode, and the agent runs many episodes. How the agent actually selects its next move depends on a policy, such as always taking the best-valued move (a greedy policy), as sketched below.
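To make the storage and policy ideas concrete, here is a minimal Python sketch of a Q-table and a greedy/ε-greedy move selector (the table layout and function names are my own illustrative assumptions, not part of any particular library):

```python
from collections import defaultdict
import random

# Q-table: maps (state, action) -> learned value; unseen moves default to 0.0.
Q = defaultdict(float)

def greedy_action(state, actions):
    """Greedy policy: take the move with the highest stored Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy_action(state, actions, epsilon=0.1):
    """Mostly greedy, but occasionally try a random move so new moves get explored."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(state, actions)
```

The ε-greedy variant is a common tweak to the pure greedy policy: without the occasional random move, a move whose value is still at its initial 0 might never get tried.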
The theory above is made concrete in a practical approach called Q-Learning.
At the heart of it is the Bellman equation:
\[ Q(s,a) \leftarrow Q(s,a) + \alpha\big( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)\big) \tag{Bellman equation} \]
Fundamentally, this equation updates the value of any particular move the agent decides to take:
- Before taking its next move, the agent consults the stored value of every possible next move.
- The value of any next move incorporates the value of the best move available after that (see the figure below).
- Which move is actually taken depends on the policy; usually the policy favours the most valuable move, i.e. the one expected to lead to the most rewarding state (greedy policy).
- The move that is taken then has its value updated, and the reward received for making it is folded into that updated value, so the reason the move was chosen (usually that it leads to a higher reward) is reflected in its value.
- As the agent moves state-by-state, it updates the value of the move that led it to each state, and it can consult what previous episodes stored for that move and for the possible next states (the Q-value function, i.e. Q(s,a)). The value of the move the agent is considering now has therefore likely been updated through prior iterations; that accumulated value is what influences the agent to take (or avoid) that particular action from that state.
- The agent cannot know how valuable a move is if that move has never been taken before (its Q-value will still be at its initial value, typically 0); it relies on previous episodes (visits) having updated the Q-value for that move with their findings.
- The process is not recursive; it relies on iteration (episodes/visits) to provide information (updated Q-values) about the moves previously taken.
- The policy is how the agent decides which move to take, e.g. a greedy policy only makes the most valuable moves, i.e. those with the highest Q-value. That is, the Q-value function is consulted to determine what the next step should be (see the sketch after this list).
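Putting the points above together, here is a hedged sketch of the Q-learning update and an episode loop. The `env` interface (`reset`, `step`, `actions`) and the hyperparameter values are illustrative assumptions; `Q` and `epsilon_greedy_action` come from the earlier sketch:

```python
ALPHA = 0.1   # learning rate: how far each update moves the old value
GAMMA = 0.9   # discount factor: how much future rewards count

def q_update(s, a, r, s_next, next_actions):
    """Bellman-style update of the move (s, a) that was just taken."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def run_episode(env):
    """One episode: move state by state, updating the value of each move taken."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy_action(s, env.actions(s))
        s_next, r, done = env.step(a)          # take the move, observe the reward
        # No future value is added once a terminal state is reached.
        next_actions = [] if done else env.actions(s_next)
        q_update(s, a, r, s_next, next_actions)
        s = s_next
```

Running `run_episode` many times is the iteration the bullet points describe: each visit leaves behind updated Q-values that inform later visits.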
Other important notes:
- Q-learning is model-free, meaning it does not require a model of state transitions, such as the state transition matrix used by a Markov chain (to define what can be done in the environment and with what probability), but instead learns to take actions based on experience (see the contrast sketched below).
- Q-Learning is an approach to solving MDPs (Markov Decision Processes)
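To illustrate what 'model-free' means in code terms, compare a model-based backup, which needs the full transition model up front, with the sampled update used above (the `P` and `V` structures here are illustrative assumptions, not a specific library's API):

```python
def expected_backup(s, a, P, V, gamma=0.9):
    """Model-based backup: requires the transition model
    P[(s, a)] = [(probability, next_state, reward), ...] to average over outcomes."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

# q_update() above, by contrast, only touches one transition (s, a, r, s_next)
# that the agent actually experienced -- no transition matrix is needed.
```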
Visualizing Q-Values
Here is a visual example of determining the value of moving down (an action) from state S, where the next possible moves from that point, i.e. the Right and Down moves (actions) from state C, are considered:
Figure: Determining the value of a move (Q-value)
As can be seen, valuing the move from state S down to state C first determines what being in state C is worth, based on the best possible reward that state C could provide. In this way, the values of both moves out of state C, i.e. to F and to E, are considered in valuing the move from state S to C. This behaviour is encoded in the Bellman equation shown earlier:
\[ Q(s,a) \leftarrow Q(s,a) + \alpha\big( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)\big) \tag{Bellman equation} \]
That is, we update the value of taking move/action \(a\) from state \(s\), i.e. \(Q(s,a)\) (which may already have been updated through past iterations to reflect the value of this move from this state), by nudging it, with learning rate \(\alpha\), towards a target: the reward \(r\) of making that move plus the value of the best possible move from the next state, \( \max_{a'}{Q(s',a')} \), dampened by the discount factor \(\gamma\) so that future rewards count for less than immediate ones.
\( \max_{a'}{Q(s',a')} \) means: over every possible action \(a'\) available from the next/resulting state \(s'\), take the largest Q-value. As illustrated above, this is what makes the next state's best possible move count towards the value of the current one.
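To plug made-up numbers into the figure's example (these values are purely illustrative): suppose \(Q(S,\text{down})\) is currently \(0\), moving down from S to C yields a reward \(r = 1\), the two moves out of C are currently valued at \(Q(C,\text{right}) = 2\) and \(Q(C,\text{down}) = 4\), and we use \(\alpha = 0.5\) and \(\gamma = 0.9\). The update is then:

\[ Q(S,\text{down}) \leftarrow 0 + 0.5\big(1 + 0.9 \times \max(2, 4) - 0\big) = 0.5 \times 4.6 = 2.3 \]

So the move from S down to C becomes more attractive in future episodes, partly because of the immediate reward and partly because of the good moves known to be available from C.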
When considering the goal of an MDP as described in Markov Decision Processes, updating the Q-values over time is a process of learning the policy that maximises the expected discounted accumulated reward. In other words, the Q-values used at the start of the learning process will not give the best path, but over time they evolve, through updates, into what we call the optimal Q-value function \(Q^*(s,a)\), which does. This means we are learning \(Q^*(s,a)\) from an initially poor \(Q(s,a)\) that evolves through updates (see the Bellman equation) into the policy that achieves the maximum discounted accumulated reward.
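Once \(Q^*(s,a)\) has been learned, the optimal policy is simply the greedy policy with respect to it, picking the best-valued move in every state:

\[ \pi^*(s) = \arg\max_{a}{Q^*(s,a)} \]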
