Since Revisiting the Derivative, and in contrast to Understanding Q-Learning, I've been learning about Policy Gradient Methods, which are closer to how Deep Neural Networks are trained. That is, they use iterative updates based on calculating the gradient of a loss function. This is in contrast to Q-Learning, which uses iterative value-based updates derived from the Bellman optimality equation.

For example, the traditional approach to training a neural network is to let it predict an output as accurately as possible, then test that output against the ground truth using a loss function that measures how far the two are apart. The key part is that the loss function is computed from the network's output, so the loss function and the neural network depend on the same common parameters. If you take the gradient of the loss function with respect to those parameters, it tells you how the parameters would have to change to increase the loss. We don't want to increase the loss, so we invert the gradient and nudge the parameters in the opposite direction, making the loss function report less of a loss. But crucially, those updated parameters go straight back into the neural network, so its next prediction should sit closer to the ground truth and the loss should drop further. This way, we have taught the neural network how to reduce the loss and, in doing so, improved its predictions!

The above process is called gradient descent: it inverts (negates) the gradient of the loss function, which on its own would point us up the loss surface, so that each update steps us down it instead.
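As a toy illustration, here is a minimal sketch of gradient descent on a one-parameter model; the data, learning rate, and step count are made up purely for this example:

```python
import numpy as np

# A toy model y_hat = w * x trained with a mean-squared-error loss.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])           # ground truth (the "true" w is 2.0)

w = 0.0                                  # the parameter shared by model and loss
learning_rate = 0.1

for step in range(50):
    y_hat = w * x                        # the model's prediction
    loss = np.mean((y_hat - y) ** 2)     # how far prediction and ground truth are apart
    grad = np.mean(2 * (y_hat - y) * x)  # d(loss)/dw: the direction that increases the loss
    w -= learning_rate * grad            # step the opposite way: gradient descent

print(round(w, 3))                       # converges towards 2.0
```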

This principle carries over to reinforcement learning (RL). RL does not have a neural network predicting outputs for a loss function to test, but it does have a policy function that makes predictions, so the two are similar in that respect. In the context of RL, the prediction is the action to take given the state; in the neural network, it was the prediction given the input (the class of an image in a CNN, for example).

The goal in RL is usually to make predictions (select actions) that are good, and many good actions over time accumulate into a very successful trajectory, ideally the sequence of moves that yields the best end result. Numerically, we represent good actions as accumulated rewards, and the best outcome corresponds to maximising the sum of those rewards over time.
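Written out (using \( \gamma \) as a discount factor and \( r_t \) as the reward at step \( t \), notation I'm adding here), that sum is the discounted return, and the objective is its expected value under the policy:

\[ J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \]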

The policy, when given a current state, will tell you what move to make from that state. You normally don't have the luxury of knowing the policy function; you need to learn it, in the same way a DNN/CNN, for example, needs to learn. The DNN will learn to minimize the loss function using gradient descent, while the policy gradient-based reinforcement learning algorithm will aim to maximise the expected accumulated reward, i.e. the objective function.
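To make that concrete, here is a minimal sketch of a learnable policy as a small PyTorch network (my choice of library here); the state and action sizes are arbitrary and purely illustrative:

```python
import torch
import torch.nn as nn

# A small network whose weights play the role of theta: it maps a state to a
# distribution over actions. The sizes (4 state dims, 2 actions) are arbitrary.
policy = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

state = torch.randn(4)                                       # a made-up state
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                                       # the move the policy picks
log_prob = dist.log_prob(action)                             # reused later for the gradient
```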

The gradient-based idea works the same way in RL. You take the gradient of the reward/objective function and propagate it back to the policy function that made the prediction being evaluated. Interestingly enough, the Q-value is very important here: it weights the action the policy generated by an estimate of how good that action actually is, i.e. the expected return of taking it from that state. So policy gradient methods are dependent on Q-values:

\[ \nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\,Q^{\pi_{\theta}}(s_t,a_t)\right] \]

Which means you can calculate the gradient of the objective function \(J\) (which aims to maximise the total discounted reward) with respect to the policy parameters \( \theta \): take the gradient of the log-probability of the action \(a_t\) predicted/sampled by the policy \( \pi_{\theta} \), and multiply it by the Q-value, which estimates how good that action is from state \(s_t\). In these equations, \(\theta\) is used to designate the parameters of the policy.

Take it for granted that the above equation does, in fact, provide the gradient of the objective function once you have the gradient of the log of the policy that generated the action and the Q-value that scores it (those mathematicians are clever, aren't they?).
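In code, one update following that equation might look like the sketch below; log_probs, q_values and the optimizer are assumed to already exist, and the names are mine rather than from any particular library:

```python
import torch

def policy_gradient_step(log_probs: torch.Tensor,
                         q_values: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> None:
    # We want to *maximise* E[log pi(a|s) * Q(s, a)], so we minimise its
    # negative with ordinary gradient descent, just like the supervised case.
    loss = -(log_probs * q_values.detach()).mean()
    optimizer.zero_grad()
    loss.backward()       # gradients flow back into the policy's parameters theta
    optimizer.step()      # update theta so the objective goes up
```

Here log_probs would come from the policy (e.g. dist.log_prob(action) in the earlier sketch), and q_values from whichever estimator the algorithm uses, which is exactly the question addressed next.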

The Q-value function is, however, often unknown; that is, we don't know how good the action we (or rather the policy) took actually was. We obviously need the Q-value, as indicated above, to form the gradient that improves the reward/objective function, so we need to estimate it. Different RL algorithms do that differently:

  1. Use the total accumulated discounted reward/return (REINFORCE algorithm; see the sketch after this list)
  2. Use a learnt critic \(Q_w(s,a)\) or the value function \( V_w(s) \) (actor-critic algorithms)
  3. Use the advantage estimate \( \hat{A}_t \) (A2C, A3C algorithms)
  4. Use the advantage estimate \( \hat{A}_t \) inside a clipped or trust-region surrogate objective (PPO/TRPO)
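Here is a minimal sketch of option 1: estimating \( Q(s_t, a_t) \) with the discounted return actually observed from step \(t\) onwards in one episode (the rewards and discount factor below are made up):

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    returns: List[float] = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns   # returns[t] stands in for Q(s_t, a_t) in the gradient above

print(discounted_returns([1.0, 0.0, 2.0]))   # ≈ [2.9602, 1.98, 2.0]
```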

Either way, you're updating the policy function just as you would update a neural network: by using the gradient of the function that evaluates its predictions, whether against the ground truth (supervised learning) or against an estimate of the return (RL).

FIN

NB: A little off topic, but an important consideration: learning is required for adaptation, meaning that if you're interested in studying how something adapts, making it learn is a precursor to that adaptation.