I've been reading papers presenting various approaches to Deep Learning that use CNNs (Convolutional Neural Networks) and DBNs (Deep Belief Networks) to yield important solutions to existing problems; these are collectively called Deep Neural Networks (DNNs). Other approaches use AEs (Autoencoders) to simplify input for better learning. Together, these are referred to as Deep Learning approaches.

Generally, these approaches are good at learning and approximating functions, i.e. the mapping of inputs to outputs, without explicitly defining how inputs correspond to outputs. These associations are learnt automatically by pairing inputs with known outputs during training, such that when only new input data is provided, the expected output is produced.

Most applications of DNNs aim to correctly identify/label/predict the outcome from merely the input, after a long training process in which features of the inputs were learnt in order to make these predictions.

As I'm interested in modelling behaviour in artificial entities, I've also started reading about approaches to modelling decision-making, such as using Bayesian Networks. To this end, a fundamental learning approach to modelling behaviour is Q-Learning, a type of Reinforcement Learning algorithm that solves a generic class of learning problem, where the problem can be modelled as a Markov Decision Process (MDP).

According to Wikipedia, MDPs, "...provide a simplified representation of key elements of artificial intelligence challenges".

MDPs define learning problems as being composed of 8 components. Two of these are the goal and the policy: the goal of the problem is to find the best way to make a series of decisions such that all the decisions add up (literally) to the best outcome, and the policy is the decision-making rule (often written \(\pi\)) that achieves it. The other six components are: 1) a definition of all the states that can be transitioned to, 2) the actions that can be taken (these lead to new states), 3) the probabilities of transitioning to each state given a previous state and an action (basically a definition of the possible outcomes for a state if any of the available moves were taken), 4) a reward function that defines the value of a transition (taking an action that leads to a new state), 5) a discount factor that dampens the value of rewards as time goes on, and 6) the requirement that the states adhere to the Markov property, i.e. that the next state transitioned to depends only on the current state (so no move history is considered when selecting the next move). These components can be visualised in the following diagram:

 

Figure: A Markov Decision Process

As can be seen in the diagram, the 8 components of an MDP are listed underneath an example of a decision-making process. 

Each block is a state in the overall set of states required of an MDP. The possible actions are up, down, left and right, as indicated by the directional arrows. The optimal policy is a function that specifies which choices should be taken from state A so as to obtain the maximum discounted cumulative reward, which in this case corresponds to the optimal route of decisions (in blue) toward the goal state G.
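The grid example can be sketched as data. This is a minimal, illustrative sketch only: it assumes a 3x3 grid of coordinate states with a goal in one corner and deterministic moves, which is a simplification of the figure, not its exact layout.

```python
# Illustrative sketch of the grid example as an MDP (assumed 3x3 layout).
states = [(r, c) for r in range(3) for c in range(3)]
actions = ["up", "down", "left", "right"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
goal = (2, 2)  # stand-in for state G; state A would be (0, 0)

def transition(state, action):
    """Transition probabilities P(s' | s, a).

    Here each action succeeds deterministically (probability 1.0); a
    general MDP allows a distribution over several possible next states.
    """
    dr, dc = moves[action]
    candidate = (state[0] + dr, state[1] + dc)
    next_state = candidate if candidate in states else state  # wall: stay put
    return {next_state: 1.0}

def reward(state, action, next_state):
    """Reward function: value of a transition. +1 for reaching the goal."""
    return 1.0 if next_state == goal else 0.0

gamma = 0.9  # discount factor, dampens future rewards
```

Together, `states`, `actions`, `transition`, `reward` and `gamma` cover the listed components; the policy and goal are what we then solve for.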

That maximum discounted cumulative reward is represented as the goal and can be described as follows:

\[ G_t = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \right] \tag{Goal} \]

In the above equation, the goal \(G_t\) is the expected discounted cumulative reward from time-step \(t\) onwards: each future decision, made \(k\) steps ahead, yields a reward \(r_{t+k+1}\), which is weighted by \(\gamma^k\) (where \(0 \le \gamma < 1\) is the discount factor) so that rewards further in the future count for less.
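For a finite episode, the infinite sum truncates at the episode's end, and \(G_t\) is straightforward to compute. A small sketch, using made-up reward values:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * r_{t+k+1}, for an observed
    (finite) sequence of rewards following time-step t."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Reward only arrives on the third decision; gamma = 0.5 discounts it twice.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, 0.5))  # 0.5**2 * 1.0 = 0.25
```

Note how the same final reward would be worth more if it arrived sooner, which is exactly the pressure that pushes the policy toward shorter routes.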

The Markov property can be articulated mathematically as:

\[ P(s_{t+1}|s_t, a_t) = P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots) \tag{The Markov Property} \]

In English, this means that the probability of the next state \(s_{t+1}\) depends only on the current state \(s_t\) and the action \(a_t\) taken in that state, not on any of the earlier states and actions.

The point is that, using these components, we aim to find the policy, and in most cases learn an estimate of the policy, such that following the policy in the future will tell us the route to take at each time-step that is expected to lead to the ultimate prize: the maximum discounted cumulative reward. Q-Learning is one way to solve this problem, using reinforcement learning to learn/estimate/approximate a policy that achieves the maximum discounted cumulative reward.
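To make that concrete, here is a minimal tabular Q-Learning sketch. The environment (a toy 5-state chain), the hyperparameters and the episode count are all illustrative choices, not part of the algorithm itself; the essential piece is the update rule that nudges \(Q(s,a)\) toward the reward plus the discounted best value of the next state.

```python
import random

# Toy environment: states 0..4 in a row; reward 1.0 for reaching state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)  # walls at the ends
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

random.seed(0)
for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-Learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy should point right (+1) at every non-goal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The learned `Q` table is the estimate of the policy's value: acting greedily with respect to it is the route expected to maximise the discounted cumulative reward.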

In a nutshell, an MDP is a model of a game: it defines the environment (rules), possible moves (actions), rewards, a policy (strategy) and a goal:

Figure: Illustration of a Markov Decision Process

FIN

Side note:

A Markov chain (MC) is a model for representing state transitions, much like the optimal path in blue above. However, its model says nothing about a set of possible actions that lead to those states; it merely maps each state to its possible following states. In this way, it is a fixed model of the decisions to be taken. This mapping is referred to as the State Transition Matrix, represented as \(P[i,j] = P(s_{t+1} = j | s_t = i)\), i.e. the probability of transitioning to next state \(j\) given the current state \(i\). The MDP Transition Probabilities (one of the components of an MDP), by contrast, include the possible actions, i.e. \( P(s'|s,a) \): the probability of the next state \(s'\) given both the current state \(s\) and the action \(a\) taken in the current state.
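The difference shows up directly in the data structures. A small sketch with made-up numbers: the chain's matrix is indexed by state alone, the MDP's transition model by (state, action).

```python
# Markov chain over 3 states: P[i][j] = P(s_{t+1} = j | s_t = i).
# No actions anywhere; the dynamics are fixed.
P = [
    [0.1, 0.6, 0.3],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
]

# Each row must be a probability distribution over next states.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9

# MDP transitions: P(s' | s, a), keyed by both state and action.
# Different actions from the same state give different distributions.
P_mdp = {
    (0, "stay"): {0: 1.0},
    (0, "go"):   {1: 0.8, 0: 0.2},  # 'go' may fail and leave us in state 0
}

print(P[0][1])              # probability of 0 -> 1 in the chain
print(P_mdp[(0, "go")][1])  # probability of 0 -> 1 under action 'go'
```

In the chain, the transition out of state 0 is whatever row 0 says; in the MDP, it is only pinned down once an action is chosen, which is exactly the lever a policy gets to pull.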