Q-Learning is used to solve Markov Decision Processes (MDPs) (see Markov Decision Processes); that is, its goal is to determine (learn) the best policy, i.e., the moves an agent can take to reap the maximum reward, which in most cases means reaching the goal in the optimal way.
In this way, optimization is a key aspect, and it gives rise to the equation used to achieve this in Q-Learning: the Bellman Optimality equation.
The Bellman Optimality equation is used to iteratively (that is, with every move) re-evaluate and re-compute the value of each move the agent takes, based on the reward observed from taking that move. The value of a move is its Q-value, and the Q-values of all moves are represented by the Q-function. Initially, these values are random and are not a true indication of the value of the moves. See Rationalizing Q-Learning, in the "Visualizing Q-values" section, for how the values of moves are determined.
The Bellman Optimality equation teaches the Q-function to value the moves the agent takes more accurately, based on how good the agent judges each move to have been. The agent interprets the value of a move from the rewards it sees as a result of taking it. The Bellman Optimality equation updates the value of the move and so helps refine the Q-function over time (see Rationalizing Q-Learning for a description of this equation):
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman Optimality Equation} \\ \]
As the Q-value function is updated iteratively, each iteration builds on past iterations' results, which leave traces of their updates in the Q-value function so that they can be used in the following iterations. This is how past results are readily available in the Q-function, i.e., the Q-function mutates over time.
In this way, Q-Learning is an algorithmic model of Reinforcement Learning (RL), which is a discipline of Machine Learning, itself a branch of Artificial Intelligence (AI).
In Q-Networks (first described here), the Q-value function still exists, i.e., it holds the value of any move and continually evolves as the agent iteratively updates it. However, instead of using the Bellman update to iteratively update the Q-function from the agent's sequential moves, a Convolutional Neural Network (CNN) is used instead.
However, it is not possible to simply drop in a CNN, because MDPs are simulations of real-time decision-making that occurs step by step as the agent moves, and there is no explicit notion of training images (as required by CNNs) in MDPs and RL.
The problem with an agent making sequential moves, from a CNN's point of view, is that if you were to represent each move as an image, the sampled sequential images (one per move) are going to look very similar to one another, i.e., each will differ only slightly from the last. This means they are highly correlated, which is known to lead to unstable network updates (poor learning). This is exactly why, during training, it is best to provide variation in the training data, so that the differences seen across samples can be used to generalize better. It is also why, when training a CNN, one typically augments the training input (blurring, rotation, etc.) to more readily differentiate each sample from the others. So sampling data in real time and passing it straight to the CNN is also a problem.
Therefore, Q-Networks deviate from the Bellman update not only in that they replace it, but also in that they do not refine/learn the Q-function from what the agent has just done (the moves it just took), because of the correlated nature of sequential images. Instead, they learn from random results that occurred previously in time: the agent keeps a history of the results of its past moves (and the associated screen images) and learns from those, while saving what it is currently experiencing to that history in real time.
This means the Q-Network is not learning about every experience as it experiences it, in distinct contrast to Q-Learning. This random-sampling strategy of learning from the move history is called Experience Replay.
In real time, the agent adds its current experience to the Replay Buffer as a 4-tuple consisting of the current state (\(\phi_t\)), the move taken (\(a_t\)), the reward received (\(r_t\)) and the resulting state (\(\phi_{t+1}\)).
During training/learning, while adding its current experiences to the Replay Buffer, the agent also randomly pulls samples from this buffer to learn from. The key is that the random samples are not sequential/consecutive and so are far less likely to be correlated. For example, here is a specific sample theoretically taken from the Replay Buffer for learning purposes:
\[ (\phi_j, a_j, r_j, \phi_{j+1}) \\ \tag{a random experience sample from the Replay Buffer} \\ \]
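As a rough illustration of the idea (not the paper's actual code), a Replay Buffer can be sketched as a fixed-capacity store of these 4-tuples from which random minibatches are drawn. The class name, capacity and batch size below are assumptions made for the example:

```python
import random
from collections import deque

# A minimal sketch of a Replay Buffer: a bounded store of (phi, a, r, phi_next)
# 4-tuples that supports uniform random sampling. Capacity and batch size are
# illustrative values, not taken from any particular implementation.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out when full

    def add(self, phi, action, reward, phi_next):
        """Save the agent's current experience in real time."""
        self.buffer.append((phi, action, reward, phi_next))

    def sample(self, batch_size=32):
        """Draw random, non-consecutive experiences, which are far less
        correlated than the agent's most recent sequential moves."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```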
From this sample, the algorithm passes the sampled "before" state \(\phi_j\) (which encodes the recent observation history for that move) as input to the CNN (acting as the Q-value function) in order to compute/predict the value of the move, i.e., its Q-value: \(Q(\phi_j, a_j)\).
If the CNN has learnt well, the predicted value will be similar to the real value of the move, i.e., what the Bellman equation would calculate it as: \( y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) \)
That is, it takes the immediate reward of the move (\(r_j\)) and the value of the best possible action from the resulting state \(\phi_{j+1}\) to calculate the move's corresponding target Q-value, and checks whether the CNN can produce a comparable value. The squared difference between what the CNN produced and what it should be (\(y_j\)) is backpropagated as weight updates in the network to refine it:
\[ (y_j - Q(\phi_j, a_j; \theta))^2 \\ \tag{Squared difference loss function} \\ \]
In this way, the Q-Network (CNN) is taught in real time by updating it with gradient descent and backpropagation, as opposed to the iterative tabular updates used to estimate the Q-value function under the Bellman Optimality equation.
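To make the loss and the gradient step concrete, here is a minimal sketch of one such update using PyTorch. The network, optimizer and tensor shapes are assumptions made for illustration; preprocessing, frame stacking and terminal-state handling are omitted for brevity:

```python
import torch
import torch.nn as nn

def dqn_update(q_network, optimizer, batch, gamma=0.99):
    """One gradient step on a random minibatch (phi_j, a_j, r_j, phi_{j+1})
    sampled from the replay buffer. q_network maps a state to one Q-value per action."""
    phi_j, a_j, r_j, phi_j1 = batch        # tensors: states, actions (int64), rewards, next states

    # Predicted value of the move actually taken: Q(phi_j, a_j; theta)
    q_pred = q_network(phi_j).gather(1, a_j.unsqueeze(1)).squeeze(1)

    # Target y_j = r_j + gamma * max_a' Q(phi_{j+1}, a'; theta)
    with torch.no_grad():
        y_j = r_j + gamma * q_network(phi_j1).max(dim=1).values

    # Squared-difference loss, backpropagated to refine the network weights
    loss = nn.functional.mse_loss(q_pred, y_j)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```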
Here is a drawing of the architecture to better rationalize the above description:
Figure: Architecture of playing Atari with a Q-Network
In a nutshell, the Q-Network learns the value of possible moves from possible initial states. The agent then consults an evolving Q-function for the best moves. The policy/strategy in MDPs is always to maximize the discounted accumulated reward, and this is achieved if the agent only takes the moves with the highest Q-value (by consulting the Q-function).
When the Q-function learns well, it ultimately converges to a version of itself called the optimal Q-value function, which defines the highest Q-values for states (through past learning). If those states are visited by the agent, the agent will consult the Q-function, implicitly take the highest-Q-value move, and therefore follow a sequence of moves that delivers its goal of maximum discounted accumulated reward.
Q-Networks can be briefly summarized as:
- Trying to solve the goal of an MDP (see Markov Decision Processes)
- Maximize the discounted accumulated reward
- The agent consults an evolving Q-Function for the best moves
- Q-Function learns the best moves by determining their Q-Values while the agent moves (Explores)
- Q-Function is replaced by a Q-Network which conceptually is the same thing but is a CNN instead of a linear function
- Experience Replay is used to overcome the challenges that training a CNN faces in online learning (real-time/dynamic environments), as sketched below
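Putting the pieces above together, a rough sketch of the interaction loop might look like the following. `env_reset`, `select_action`, `env_step` and `to_batch` are hypothetical helpers (initial state, epsilon-greedy action selection, one environment step, and minibatch collation), and `q_network`/`optimizer` are assumed to be defined as in the earlier sketch:

```python
# A rough sketch of the online loop: act, store the experience, then learn
# from a random minibatch of *past* experiences rather than the latest move.
buffer = ReplayBuffer()
phi = env_reset()                              # hypothetical: initial (preprocessed) state
for step in range(50_000):                     # number of steps is arbitrary here
    a = select_action(q_network, phi)          # hypothetical: epsilon-greedy over Q(phi, .)
    phi_next, r = env_step(a)                  # hypothetical: take the move, observe the reward
    buffer.add(phi, a, r, phi_next)            # save the current experience in real time
    if len(buffer) >= 32:
        dqn_update(q_network, optimizer, to_batch(buffer.sample(32)))
    phi = phi_next
```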
I've been reading papers presenting various approaches to Deep Learning that use CNNs (Convolutional Neural Networks) and DBNs (Deep Belief Networks) to yield important solutions to existing problems; these are collectively called Deep Neural Networks (DNNs). Other approaches use AEs (Autoencoders) to simplify the input for better learning. Together, these are referred to as Deep Learning approaches.
Generally, these approaches are good for learning and approximating functions, i.e., the mapping of inputs to outputs, without explicitly defining how inputs correspond to outputs. These associations are learnt automatically from the inputs and their known outputs, such that when only new data is provided, the output is as expected.
Most applications of DNNs look to correctly identify/label/predict the outcome merely from the input (after a long training process in which features of the inputs were learnt in order to make these predictions).
As I'm interested in modelling behaviour in artificial entities, I've also started reading about approaches to modelling decision-making, such as using Bayesian Networks. To this end, a fundamental learning approach to modelling behaviour is Q-Learning, which is a type of Reinforcement Learning algorithm/approach that solves a generic type of learning problem where the problem can be modelled as a Markov Decision Process (MDP).
MDPs define learning problems as being composed of 8 components, and two of the components are that the goal of the problem is to find the best way (often represented as 'the policy') to make a series of decisions such that all the decisions add up (literally) to the best outcome. The other six components are:
- 1) a definition of all the states that can be transitioned to
- 2) the types of actions that can be taken (these lead to new states)
- 3) the probabilities that states can be transitioned to given a previous state and an action (basically a definition of the possible outcomes for a state if any of the available moves were taken)
- 4) a reward function that defines the value of a transition (taking an action that leads to a new state)
- 5) a penalty constant that dampens the value of rewards as time goes on
- 6) the requirement that the states adhere to the Markov property, i.e., that the next state that can be transitioned to depends only on the current state (so no move history is considered when selecting the next move)
These components can be visualised in the following diagram:
Figure: A Markov Decision Process
As can be seen in the diagram, the 8 components of an MDP are listed underneath an example of a decision-making process.
Each block is a state, and together they compose the overall set of states required by an MDP. The possible actions are up, down, left and right, as indicated by the directional arrows. The optimal policy is a function that specifies which choices should be taken from state A so as to obtain the maximum discounted cumulative reward, which in this case is the optimal route of decisions (in blue) toward achieving the goal state G.
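To tie the diagram back to the components listed above, here is a small sketch of how such a grid could be written down as data. The specific route and rewards are made up for illustration; only the start state A, the goal state G and the four directional actions come from the description:

```python
# Illustrative only: a toy encoding of the grid example's MDP ingredients.
states  = ["A", "B", "C", "D", "E", "F", "G"]        # one entry per block in the grid
actions = ["up", "down", "left", "right"]

# An (assumed) optimal policy: which move to take in each state so that the
# route from A eventually reaches the goal state G.
policy = {"A": "right", "B": "right", "C": "down", "D": "down"}

# A simple reward function R(s, a, s'): reward only for reaching the goal.
def reward(s, a, s_next):
    return 1.0 if s_next == "G" else 0.0
```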
The maximum discounted accumulated reward is represented as the goal and can be described as follows:
\[ G_t = E [ \sum_{k=0}^{\infty} \gamma^kr_{t+k+1}] \tag{Goal} \\ \]
In the above equation, the goal \(G_t\) from time step \(t\) is to accumulate the maximum discounted reward over all future moves, where \(k\) indexes each subsequent decision and \(r_{t+k+1}\) is the reward it produces.
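As a small worked example of this sum (with made-up rewards and a discount factor of 0.9):

```python
# Computing the discounted accumulated reward for a finite list of rewards.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))   # sum_k gamma^k * r_{t+k+1}

# Four rewards of 1: 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return([1, 1, 1, 1], gamma=0.9))
```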
The Markov property can be articulated in Math as:
\[ P(s_{t+1}|s_t, a_t) = P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}...) \tag{The Markov Property } \]
In English, this means that the next state \(s_{t+1}\) depends only on the current state \(s_t\) and the action \(a_t\) taken in the current state, not on any earlier states or actions.
The point is that, using these components, we aim to find the policy and, in most cases, learn an estimate of the policy, such that consulting the policy in the future will tell us the route at each time step that is expected to lead to the ultimate prize: the maximum discounted accumulated reward. Q-Learning is one way to solve this problem, using reinforcement learning to learn/estimate/approximate a policy that achieves the maximum discounted cumulative reward.
In a nutshell, an MDP is a model of a game: it defines the environment (rules), possible moves (actions), rewards, a policy (strategy) and a goal:
Figure: Illustration of a Markov Decision Process
FIN
Side note:
A Markov chain (MC) is a model for representing state transitions, much like the optimal path in blue described above. However, it does not include anything in the model about a set of possible actions that lead to those states; it merely indicates a mapping of which state moves to which following state. In this way, it is a fixed model of the decisions to be taken. The definition of this mapping is referred to as the State Transition Matrix, represented as \(P[i,j] = P(s_{t+1} = j | s_t = i)\), which gives the probability of transitioning to the next state \(s_{t+1} = j\) given the current state \(s_t = i\). The MDP transition probabilities (one of the components of an MDP), on the other hand, include the possible actions, i.e., \( P(s'|s,a) \): the probability of the next state \(s'\) given both the current state \(s\) and the move \(a\) taken in the current state.
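To make the contrast concrete, here is a small sketch with made-up states and numbers: a Markov chain needs only a state-to-state matrix, whereas an MDP's transition probabilities also condition on the action taken.

```python
import numpy as np

# Markov chain: state transition matrix P[i, j] = P(s_{t+1} = j | s_t = i).
# No actions are involved; the probabilities below are illustrative.
mc_transition_matrix = np.array([
    [0.7, 0.3],   # from state 0: stay with probability 0.7, move to state 1 with 0.3
    [0.4, 0.6],   # from state 1
])

# MDP: transition probabilities P(s' | s, a) depend on the action as well.
mdp_transitions = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "down"):  {"s2": 1.0},
    ("s1", "down"):  {"s3": 0.8, "s1": 0.2},
}
```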
I've been recently trying to piece together my learning around how reinforcement learning is actually implemented algorithmically.
The fundamental idea is that you'd like to simulate decision-making in an agent, but specifically that the decision-making process involves having the agent learn (and then make) moves that are the most beneficial to it. This, therefore, is an admirable and useful decision-making process to try and simulate. This also means that the agent must learn what 'good' moves are.
Generally, the definition of a 'good' move is a reward. The learning part involves trying moves and evaluating their resulting rewards, and making note of them for future visits. This means there is storage involved and that reinforcement learning is iterative, meaning the moves are trialled multiple times.
This trial-and-error is usually referred to as the agent performing many state transitions (moves) over time, and these are collectively called episodes. How the agent actually selects its next move depends on a policy, like only taking the best move each time (greedy policy).
The theory above takes practical form in an approach called Q-Learning.
At the heart of it is the Bellman equation:
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman equation} \\ \]
Fundamentally, this equation updates the value of any particular move the agent decides to take:
- Before taking any next move, the value of all possible next moves must be calculated.
- The value of any next move considers the value of the next best move after that (see figure below)
- The reason why any move is taken depends on the policy, but usually it's because the policy favours selecting the most valuable move, i.e., the one that will lead to the most rewarding state (greedy policy)
- The move that is taken has its value updated. This incorporates the reason for taking the move into the updated value (usually that it results in a higher reward), i.e., the reward becomes part of its updated value.
- As the agent moves state by state, it updates the value of the move that led it to that state, and it can look at what previous episodes stored for it about the possible next states (the Q-value function, i.e., Q(s,a)). This means the value of the move the agent is considering now has likely been updated through prior iterations, so its value represents those iterations and influences the agent now: the agent might decide to take that particular action (from that state) because its value has increased over time through past updates.
- The agent might not know how valuable the next move is if that move has never been followed before (Q-value will be 0); therefore, it requires previous episodes (visits) to tell it about the findings they had by updating the Q-value for that move.
- It's not recursive but relies on iteration (episodes/visits) to provide information (updated Q-Values) about the moves previously taken.
- The policy is actually how the agent decides which move to take, e.g., greedy will only make the most valuable moves, i.e., those with the highest Q-value. That is, the Q-value function is consulted to determine what the next step might be (see the sketch after this list).
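The points above can be condensed into a short tabular Q-Learning sketch. The environment interface (`env.reset()`, `env.step()`, `env.actions`) is a hypothetical stand-in, and the hyperparameter values are illustrative, not prescriptive:

```python
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn a Q-value function by trial and error over many episodes."""
    Q = {}                                                  # Q(s, a); unseen moves default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Policy: epsilon-greedy -- usually take the most valuable move,
            # occasionally explore a random one.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q.get((s, act), 0.0))

            s_next, r, done = env.step(a)                   # take the move, observe the reward

            # Bellman update: fold the reward and the best next move's value
            # into the value of the move just taken.
            best_next = max(Q.get((s_next, act), 0.0) for act in env.actions)
            q_sa = Q.get((s, a), 0.0)
            Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
            s = s_next
    return Q
```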
Other important notes:
- Q-learning is model-free, meaning it does not require a model of state transitions, such as a state transition matrix that is used by a Markov chain (to define what can be done in the environment and with what probability), but instead learns to take actions based on experience.
- Q-Learning is an approach to solving MDPs (Markov Decision Processes)
Visualizing Q-Values
Here is a visual example of determining the value of moving down (action) from state S, where the next possible moves from that point, i.e., the Right and Down moves (actions) from state C, are considered:
Figure: Determining the value of a move (Q-value)
As can be seen, valuing the move from state S down to state C first determines what the value of state C is, based on the best possible reward that being in state C could provide. In this way, the value of both moves from state C, i.e., to F and to E, is considered in valuing the move from state S to C. This behaviour is encoded in the definition of the Bellman optimality equation given previously:
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman equation} \\ \]
That is, we update the value of taking the move/action \(a\) from state \(s\), i.e., its \(Q(s,a)\), by nudging its present value \(Q(s,a)\) (which might have been updated through past iterations to indicate the value of this move from this state) toward a target made up of the reward \(r\) of making that move plus the value of the best possible move from the resulting state, \( \max_{a'}{Q(s',a')} \), dampened by the discount \(\gamma\) for future rewards; the learning rate \(\alpha\) controls the size of that nudge.
\( \max_{a'}{Q(s',a')} \) means: find, over every possible action \(a'\), the largest value available in the next/resulting state \(s'\). As illustrated above, this takes into account the next state's best possible move thereafter.
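As a worked example of one such update with made-up numbers (a learning rate \(\alpha = 0.5\), a discount \(\gamma = 0.9\), a reward of 1 for the move, and assumed current Q-values for the two moves out of state C):

```python
alpha, gamma = 0.5, 0.9

Q = {
    ("S", "down"):  0.0,   # current value of the move being updated (S -> C)
    ("C", "right"): 2.0,   # assumed value of moving from C towards F
    ("C", "down"):  5.0,   # assumed value of moving from C towards E
}

r = 1.0                                                  # reward for moving down from S to C
best_next = max(Q[("C", "right")], Q[("C", "down")])     # max_a' Q(s', a') = 5.0

Q[("S", "down")] += alpha * (r + gamma * best_next - Q[("S", "down")])
print(Q[("S", "down")])                                  # 0.5 * (1 + 0.9*5 - 0) = 2.75
```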
When considering the goal of an MDP as described in Markov Decision Processes, the process of updating the Q-values over time is the process of learning to become the policy that maximises the expected discounted accumulated reward. In other words, the Q-values used at the start of the learning process will not provide the best path, but over time the Q-function evolves through updates into what we call the optimal Q-value function \(Q^*(s,a)\), which does. This means we are learning \(Q^*(s,a)\) from an initially poor \(Q(s,a)\) that evolves through updates (see the Bellman equation) to become the policy that achieves the maximum discounted accumulated reward.