Since Policy Gradient Methods, I've been curious about how LLMs are taught/trained.
It turns out that folks use reinforcement learning to train LLMs (Large Language Models) too, and it amounts to something very similar to the Policy Gradient Methods I recently discussed.
For example, an LLM predicts the next token given the prior tokens, and therefore there must be a way to evaluate the generated token against what the next token should be, i.e. there must be a loss function or a reward function. Let's explore.
Preliminaries
LLMs can be formally described thusly:
\[ p_{\theta}(x_{t+1}|x_{1:t}) \]
where \( p_{\theta}\) is the LLM which uses \(\theta\) parameters, and which generates/predicts the next token (\(x_{t+1}\)) given the previous sequence of tokens (\(x_{1:t} \)). So, in a nutshell, it generates the next token given the prior tokens.
If you want to generate tokens continuously (an autoregressive LLM), it merely keeps doing so by iteratively sampling the next token (\(x_{t+1}\)) from the LLM, i.e.:
\[ x_{t+1} \sim p_{\theta}(\cdot|x_{1:t}) \]
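To make that concrete, here is a minimal sketch of that sampling loop, assuming a Hugging Face-style causal LM (the choice of GPT-2, the prompt and the variable names are mine, purely for illustration):

```python
# Sketch: autoregressive generation by repeatedly sampling x_{t+1} ~ p_theta(.|x_{1:t})
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

tokens = tokenizer("The cat sat on", return_tensors="pt").input_ids  # x_{1:t}

with torch.no_grad():
    for _ in range(20):                                   # generate 20 more tokens
        logits = model(tokens).logits                     # [1, t, vocab_size]
        probs = torch.softmax(logits[0, -1], dim=-1)      # p_theta(. | x_{1:t})
        next_token = torch.multinomial(probs, 1)          # sample x_{t+1}
        tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)

print(tokenizer.decode(tokens[0]))
```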
If instead you want to do conditional generation, that is, use a prompt that conditions how the LLM will generate its next token, the following model describes that:
\[ p_{\theta}(x_{1:n}|c) = \prod_{t=1}^{n} p_{\theta}(x_t|c, x_{1:t-1})\]
This means (I think) the generated tokens \(x_{1:n}\) are conditioned on the context \(c\), which is a sequence of prior input tokens and can be considered the prompt to the LLM. Internally, this is the product of the per-token probabilities (what LLMs really output), each token conditioned on the prompt/context \(c\) and on all the tokens generated before it. It must be said that the above interpretation and my knowledge of exactly how this is done are not very well-defined, so tread lightly here! Anyway, the interesting bit is next.
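For what it's worth, here is how I would compute that conditional likelihood in practice, in log space. This is a sketch only: it reuses the model and tokenizer from the previous snippet, and the function name is my own.

```python
# Sketch: log p_theta(x_{1:n} | c) = sum_t log p_theta(x_t | c, x_{1:t-1})
import torch

def sequence_log_prob(model, context_ids, generated_ids):
    full = torch.cat([context_ids, generated_ids], dim=1)      # [c, x_{1:n}]
    with torch.no_grad():
        logits = model(full).logits                            # [1, |c|+n, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    n_ctx = context_ids.shape[1]
    total = 0.0
    for t in range(generated_ids.shape[1]):
        # the logits at position n_ctx + t - 1 predict the token at position n_ctx + t
        total = total + log_probs[0, n_ctx + t - 1, generated_ids[0, t]]
    return total                                               # summing logs = product of probs
```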
Treating LLM token generation as an MDP
You can define an LLM as a Markov Decision Process (MDP), that is, it can be formulated as a set of states, actions, transition probabilities, a reward function and a discount factor (all the necessary components of an MDP).
This can be represented by the form \(MDP=(S,A,P, R,\gamma)\). As the LLM generates a new token, a new state is reached and the token generated is considered the action which transitions the LLM to the next state:
\[ [c, x_{1:t-1}],\ [c, x_{1:t}],\ [c, x_{1:t+1}],\ \ldots \]
This represents the transition of states as the LLM generates actions (new tokens) that produce the resulting states. The next action is the token \(x_{t+1}\), and the state at that point is the context \(c\) together with the tokens generated so far (\(x_{1:t}\)).
The reward is based on the generated token \(x\) and the context it was generated under, namely, \( R(c,x)\) (the reward function).
Now we have all the components necessary to frame this as an MDP problem, with the goal of maximising the accumulated discounted return (as is always the goal with an MDP).
If the LLM generates actions, then it can be modelled as a policy that would select actions in reinforcement learning (RL). Therefore, as in RL, you can aim to optimise the objective function (and therefore the accumulated discounted total reward):
\[ L_{\theta}(c)= \mathbb{E}_{x \sim p_{\theta}}[R(c,x)] \]
where \(c\) is the context/prompt (the prior tokens used for conditioning) and \(x\) is the generated token.
We can then optimise the objective function by obtaining the gradient, which the REINFORCE algorithm specifies can be done this way:
\[ \nabla_{\theta} L_{\theta}(c) = \mathbb{E}_{x \sim p_{\theta}(\cdot|c)}[\hat{A}(c,x)\nabla_{\theta}\log p_{\theta}(x|c)] \]
Here, \(\hat{A}\) is the advantage estimate; it plays the role of the Q-value but with a baseline subtracted, which lowers the variance of the gradient estimate (\(\nabla_{\theta}\log p_{\theta}(x|c)\)) and makes the gradient updates less dramatic.
The key is that during reinforcement learning using policy gradient methods (such as the above), each subsequent action is sampled from the policy, in this case the LLM \(p_{\theta}\), and the gradients with respect to the parameters are calculated. The above uses the REINFORCE algorithm to determine the reward/objective gradient by using the policy gradient, meaning this is an on-policy reinforcement learning approach.
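Putting those pieces together, a single REINFORCE-style update for the LLM-as-policy might look something like this. It is a rough sketch, not the exact recipe any particular lab uses: the optimiser settings are arbitrary, and the advantage \(\hat{A}(c,x)\) is assumed to be computed elsewhere (e.g. a reward minus a baseline).

```python
# Sketch: gradient ascent on E[ A_hat(c,x) * log p_theta(x|c) ] for one sampled completion
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

def reinforce_step(model, context_ids, generated_ids, advantage):
    full = torch.cat([context_ids, generated_ids], dim=1)
    log_probs = torch.log_softmax(model(full).logits, dim=-1)
    n_ctx = context_ids.shape[1]
    # log p_theta(x|c): sum the log-probs of the generated tokens only
    seq_log_prob = sum(
        log_probs[0, n_ctx + t - 1, generated_ids[0, t]]
        for t in range(generated_ids.shape[1])
    )
    loss = -advantage * seq_log_prob   # negate so that minimising = ascending the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```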
Note that on-policy means you're using/sampling actions from the very policy you're trying to improve, while off-policy learning samples actions from another policy (often called the behaviour policy), which are then used to evaluate and improve the target policy. Q-Learning is off-policy.
The useful thing about off-policy learning is that you can explore using an exploratory behaviour policy or watch others' actions to dictate your action, and then use the outcome to update your target policy. With on-policy, you're learning ONLY from your own actions. SARSA is on-policy.
So you now have an LLM that acts as a policy, continually sampling/generating actions/tokens. When these are assessed by the reward function (which also depends on the token), they can be used to compute the gradient of the objective/reward function, which is then fed back to the LLM (the policy) via gradient ascent so that the LLM (policy) earns more reward in the future.
Since Revisiting the Derivative, and in contrast to Understanding Q-Learning, I've been learning about Policy Gradient Methods, which are closer to how Deep Neural Networks are trained. That is, they use iterative updates based on calculating the gradient of a loss function. This is in contrast to Q-Learning, which uses iterative value-based updates using the Bellman optimality equation.
For example, the traditional approach to training a neural network is to let it predict an output as accurately as possible, then test that output against the ground truth using a loss function to measure how far the two are apart. The key part is that the loss function is expressed in terms of the same parameters as the neural network, so the loss function and the network depend on the same common parameters. If you take the gradient of the loss function with respect to those parameters, it indicates how the parameters would need to change to increase the loss. We don't want to increase the loss, so we invert those changes and move the parameters in the opposite direction, which makes the loss function report less of a loss. But crucially, those updated parameters (the ones that produce less of a loss) are the neural network's own parameters, so the network's next prediction should score better against the loss function. This way, we have taught the neural network how to reduce the loss and improved its ability to keep reducing it!
The above process is called gradient descent, that is, it inverts the gradient of the loss function (which would normally allow us to ascend the loss function) such that we descend the loss function.
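As a toy illustration of that loop (my own minimal example, with a single parameter vector and a squared-error loss):

```python
# Sketch: gradient descent on a loss that shares its parameters with the "model"
import torch

theta = torch.randn(2, requires_grad=True)   # the shared parameters
x = torch.tensor([1.0, 2.0])                 # an input
y_true = torch.tensor(3.0)                   # the ground truth
lr = 0.1

for _ in range(100):
    y_pred = theta @ x                       # the network's prediction
    loss = (y_pred - y_true) ** 2            # the loss depends on the same theta
    loss.backward()                          # dL/dtheta points towards a larger loss
    with torch.no_grad():
        theta -= lr * theta.grad             # step the opposite way: descend the loss
        theta.grad.zero_()
```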
This principle carries over to reinforcement learning (RL), which does not have a neural network predicting outputs for a loss function to test, but instead has a policy function that makes predictions, so they are similar in this respect. In the context of RL, the prediction is the action to take given the state; in the neural network, it was the prediction given the input (for example, an image in a CNN).
The goal in RL is usually to make predictions (select actions) that are good; many good actions over a period of time theoretically accumulate into a very successful trajectory, so much so that it might be the best sequence of moves, yielding the best end result. Numerically, we can represent all the good actions as accumulating rewards, and the maximum return is the summation of all the good moves (rewards) over time.
The policy, when given a current state, will tell you what move to make from that state. You normally don't have the luxury of knowing the policy function; you need to learn it, in the same way a DNN/CNN, for example, needs to learn. The DNN will learn how to minimise the loss function using gradient descent, while a policy gradient-based reinforcement learning algorithm will aim to maximise the expected accumulated reward, i.e. the objective function.
The gradient-based idea works the same in RL. You take the gradient of the reward/objective function and propagate it back to the policy function that made the prediction being evaluated. Interestingly enough, the Q-value is very important here: it estimates how good the action the policy generated actually is (the expected return from taking that action in that state). So policy-based gradient methods are dependent on Q-values:
\[ \nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\, Q^{\pi_{\theta}}(s_t,a_t)\right] \]
This means you can calculate the gradient of the objective function \(J\) (which aims to maximise the total discounted reward) with respect to the parameters \(\theta\): take the gradient of the log-probability of the action \(a_t\) predicted/sampled by the policy \( \pi_{\theta}\), and multiply it by the Q-value \(Q^{\pi_{\theta}}(s_t,a_t)\) of that state-action pair. In these equations, \(\theta\) designates the parameters being learned.
Take it for granted that the above equation does, in fact, provide the gradient of the objective function once you have the gradient of the log of the policy that generated the action and the Q-value estimating how good that action is (those mathematicians are clever, aren't they?)
The Q-value function is, however, often unknown; that is, we don't actually know how good the action was, but we/the policy took an action regardless. We obviously need the Q-value, as indicated above, to generate the gradient that can improve the reward/objective function, so we need to estimate it. Different RL algorithms do that differently:
- Use total accumulated discounted reward/return (REINFORCE algorithm)
- Use a learnt critic \(Q_w(s,a)\) or value function \( V_w(s) \) (actor-critic algorithms)
- Use the advantage estimate \( \hat{A}_t \) (A2C, A3C algorithms)
- Use the advantage estimate \( \hat{A}_t \) (PPO/TRPO)
Either way, you're updating the policy function just as you would update the neural network: by using the gradient of the function that evaluates the prediction against the ground truth.
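As a sketch of what that shared update looks like (everything here, the tiny policy network and the state/action shapes, is my own stand-in; only the `advantage_estimates` term changes between REINFORCE, actor-critic, A2C/A3C and PPO/TRPO):

```python
# Sketch: the generic policy-gradient update, ascending E[ A_hat * grad log pi_theta(a|s) ]
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # 4-d states, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, advantage_estimates):
    """states: [T,4] floats, actions: [T] ints, advantage_estimates: [T] floats
    (the return, a learnt Q_w/V_w critic, or an advantage estimate A_hat)."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log pi_theta(a_t|s_t)
    loss = -(chosen * advantage_estimates).mean()             # minimise the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```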
FIN
NB: A little off topic, but an important consideration, is that learning is required for adaptation, meaning if you're interested in studying the adaptation of something, making it learn is a precursor to adaptation.
Since reading a survey on Agentic AI, I've formulated a basic blueprint of what I feel it is.
Agentic AI basically concerns creating AI workers that are:
- Autonomous, Adaptable and goal-driven
- Use a combination of reinforcement learning (RL) and goal-oriented architectures/approaches
- Implement adaptive control strategies and techniques
- Designed to be resilient to change (and the unknown)
- Implement smart, opportunistic learning, drawing on RL, imitation, pattern knowledge, priorities, social interactions, self-supervision, uncertainty, and dynamic, real-time datasets and environments.
- Long-term management of strategies concerning goals, tasks, context, priority and any other smarts such as patterns, cause and effect analysis, focus and attention.
- Management of complexity and overload
- Modular, flexible, composable, combinable, generalizable and evolvable concepts.
- Redefinable concepts of value
- Drawing from knowledge and historical events (cause and effect)
- Teaching methods
- Training/Learning methods, transfer learning
- Self-reflection (Performance, Quality, Goals, Learning)
- Safety and security, and ethical decisions
- Characterising a situation (defining its goals, priorities, subjects, interactions, attribution of cause and effects, etc)
- Monitoring
- Hardware and resources, and scalability
- Understanding changes in human behaviour (i.e. the changing of human goals, etc.)
This makes Agentic AI align very closely with simulating NPC behaviour in real-time environments such as computer games.