Since Policy Gradient Methods, I've been curious about how LLMs are taught/trained.
It turns out that folks use reinforcement learning to train LLMs (Large Language Models) too, and it amounts to something very similar to the Policy Gradient Methods I recently discussed.
For example, an LLM predicts the next token given the prior tokens, and therefore there must be a way to evaluate the generated token against what the next token should be, i.e. there must be a loss function or a reward function. Let's explore.
Preliminaries
LLMs can be formally described thusly:
\[ p_{\theta}(x_{t+1}|x_{1:t}) \]
where \( p_{\theta}\) is the LLM which uses \(\theta\) parameters, and which generates/predicts the next token (\(x_{t+1}\)) given the previous sequence of tokens (\(x_{1:t} \)). So, in a nutshell, it generates the next token given the prior tokens.
If you want to generate tokens continuously (an autoregressive LLM), it merely keeps doing so by iteratively sampling the next token (\(x_{t+1}\)) from the LLM, i.e.:
\[ x_{t+1} \sim p_{\theta}(\cdot|x_{1:t}) \]
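To make that concrete, here is a minimal sketch of that sampling loop, assuming a Hugging Face-style causal LM (the choice of GPT-2, the prompt and the variable names are mine, purely for illustration):

```python
# Sketch: autoregressive generation by repeatedly sampling x_{t+1} ~ p_theta(.|x_{1:t})
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

tokens = tokenizer("The cat sat on", return_tensors="pt").input_ids  # x_{1:t}

with torch.no_grad():
    for _ in range(20):                                   # generate 20 more tokens
        logits = model(tokens).logits                     # [1, t, vocab_size]
        probs = torch.softmax(logits[0, -1], dim=-1)      # p_theta(. | x_{1:t})
        next_token = torch.multinomial(probs, 1)          # sample x_{t+1}
        tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)

print(tokenizer.decode(tokens[0]))
```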
If instead you want to do conditional generation, that is, use a prompt that conditions how the LLM will generate its next token, the following model describes that:
\[ p_{\theta}(x_{1:n}|c) = \prod_{t=1}^{n} p_{\theta}(x_t|c, x_{1:t-1})\]
This means (I think) the generated tokens \(x_{1:n}\) are conditioned on the context \(c\), which is a sequence of prior input tokens and can be considered the prompt to the LLM. Internally, this is the product of the per-token probabilities (what LLMs really output), each token conditioned on the prompt/context \(c\) and on all the tokens generated before it. It must be said that the above interpretation and my knowledge of exactly how this is done are not very well-defined, so tread lightly here! Anyway, the interesting bit is next.
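For what it's worth, here is how I would compute that conditional likelihood in practice, in log space. This is a sketch only: it reuses the model and tokenizer from the previous snippet, and the function name is my own.

```python
# Sketch: log p_theta(x_{1:n} | c) = sum_t log p_theta(x_t | c, x_{1:t-1})
import torch

def sequence_log_prob(model, context_ids, generated_ids):
    full = torch.cat([context_ids, generated_ids], dim=1)      # [c, x_{1:n}]
    with torch.no_grad():
        logits = model(full).logits                            # [1, |c|+n, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    n_ctx = context_ids.shape[1]
    total = 0.0
    for t in range(generated_ids.shape[1]):
        # the logits at position n_ctx + t - 1 predict the token at position n_ctx + t
        total = total + log_probs[0, n_ctx + t - 1, generated_ids[0, t]]
    return total                                               # summing logs = product of probs
```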
Treating LLM token generation as an MDP
You can define an LLM as a Markov Decision Process (MDP), that is, it can be formulated as a set of states, actions, transition probabilities, a reward function and a discount factor (all the necessary components of an MDP).
This can be represented by the form \(MDP=(S,A,P, R,\gamma)\). As the LLM generates a new token, a new state is reached and the token generated is considered the action which transitions the LLM to the next state:
\[ [c, x_{1:t-1}],\ [c, x_{1:t}],\ [c, x_{1:t+1}],\ \ldots \]
This represents the transition of states as the LLM generates actions (new tokens) that produce the resulting states. The next action is the token \(x_{t+1}\), and the state at that point is the context \(c\) together with the tokens generated so far (\(x_{1:t}\)).
The reward is based on the generated token \(x\) and the context it was generated under, namely, \( R(c,x)\) (the reward function).
Now we have all the components necessary to frame this as an MDP problem, with the goal of maximising the accumulated discounted return (as is always the goal with an MDP).
If the LLM generates actions, then it can be modelled as a policy that would select actions in reinforcement learning (RL). Therefore, as in RL, you can aim to optimise the objective function (and therefore the accumulated discounted total reward):
\[ L_{\theta}(c)= \mathbb{E}_{x \sim p_{\theta}}[R(c,x)] \]
where \(c\) is the context/prompt (the prior tokens used for conditioning) and \(x\) is the generated token.
We can then optimise the objective function by obtaining the gradient, which the REINFORCE algorithm specifies can be done this way:
\[ \nabla_{\theta} L_{\theta}(c) = \mathbb{E}_{x \sim p_{\theta}(\cdot|c)}[\hat{A}(c,x)\nabla_{\theta}\log p_{\theta}(x|c)] \]
Here, \(\hat{A}\) is the advantage estimate; it plays the role of the Q-value but with a baseline subtracted, which lowers the variance of the gradient estimate (\(\nabla_{\theta}\log p_{\theta}(x|c)\)) and makes the gradient updates less dramatic.
The key is that during reinforcement learning using policy gradient methods (such as the above), each subsequent action is sampled from the policy, in this case the LLM \(p_{\theta}\), and the gradients with respect to the parameters are calculated. The above uses the REINFORCE algorithm to determine the reward/objective gradient by using the policy gradient, meaning this is an on-policy reinforcement learning approach.
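Putting those pieces together, a single REINFORCE-style update for the LLM-as-policy might look something like this. It is a rough sketch, not the exact recipe any particular lab uses: the optimiser settings are arbitrary, and the advantage \(\hat{A}(c,x)\) is assumed to be computed elsewhere (e.g. a reward minus a baseline).

```python
# Sketch: gradient ascent on E[ A_hat(c,x) * log p_theta(x|c) ] for one sampled completion
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

def reinforce_step(model, context_ids, generated_ids, advantage):
    full = torch.cat([context_ids, generated_ids], dim=1)
    log_probs = torch.log_softmax(model(full).logits, dim=-1)
    n_ctx = context_ids.shape[1]
    # log p_theta(x|c): sum the log-probs of the generated tokens only
    seq_log_prob = sum(
        log_probs[0, n_ctx + t - 1, generated_ids[0, t]]
        for t in range(generated_ids.shape[1])
    )
    loss = -advantage * seq_log_prob   # negate so that minimising = ascending the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```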
Note that on-policy means you're using/sampling actions from the very policy you're trying to improve, while off-policy learning samples actions from another policy (often called the behaviour policy), which are then used to evaluate and improve the target policy. Q-Learning is off-policy.
The useful thing about off-policy learning is that you can explore using an exploratory behaviour policy or watch others' actions to dictate your action, and then use the outcome to update your target policy. With on-policy, you're learning ONLY from your own actions. SARSA is on-policy.
So you now have an LLM that acts as a policy, continually sampling/generating actions/tokens. When these are assessed by the reward function (which also depends on the token), they can be used to compute the gradient of the objective/reward function, which is then fed back to the LLM (the policy) via gradient ascent so that the LLM (policy) earns more reward in the future.
Since Revisiting the Derivative, and in contrast to Understanding Q-Learning, I've been learning about Policy Gradient Methods, which are closer to how Deep Neural Networks are trained. That is, they use iterative updates based on calculating the gradient of a loss function. This is in contrast to Q-Learning, which uses iterative value-based updates using the Bellman optimality equation.
For example, the traditional approach to training a neural network is to let it predict an output as accurately as possible, then test that output against the ground truth using a loss function to measure how far the two are apart. The key part is that the loss function is expressed in terms of the same parameters as the neural network, so the loss function and the network depend on the same common parameters. If you take the gradient of the loss function with respect to those parameters, it indicates how the parameters would need to change to increase the loss. We don't want to increase the loss, so we invert those changes and move the parameters in the opposite direction, which makes the loss function report less of a loss. But crucially, those updated parameters (the ones that produce less of a loss) are the neural network's own parameters, so the network's next prediction should score better against the loss function. This way, we have taught the neural network how to reduce the loss and improved its ability to keep reducing it!
The above process is called gradient descent, that is, it inverts the gradient of the loss function (which would normally allow us to ascend the loss function) such that we descend the loss function.
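As a toy illustration of that loop (my own minimal example, with a single parameter vector and a squared-error loss):

```python
# Sketch: gradient descent on a loss that shares its parameters with the "model"
import torch

theta = torch.randn(2, requires_grad=True)   # the shared parameters
x = torch.tensor([1.0, 2.0])                 # an input
y_true = torch.tensor(3.0)                   # the ground truth
lr = 0.1

for _ in range(100):
    y_pred = theta @ x                       # the network's prediction
    loss = (y_pred - y_true) ** 2            # the loss depends on the same theta
    loss.backward()                          # dL/dtheta points towards a larger loss
    with torch.no_grad():
        theta -= lr * theta.grad             # step the opposite way: descend the loss
        theta.grad.zero_()
```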
This principle carries over to reinforcement learning (RL), which does not have a neural network predicting outputs for a loss function to test, but instead has a policy function that makes predictions, so they are similar in this respect. In the context of RL, the prediction is the action to take given the state; in the neural network, it was the prediction given the input (for example, an image in a CNN).
The goal in RL is usually to make predictions (select actions) that are good; many good actions over a period of time theoretically accumulate into a very successful trajectory, so much so that it might be the best sequence of moves, yielding the best end result. Numerically, we can represent all the good actions as accumulating rewards, and the maximum return is the summation of all the good moves (rewards) over time.
The policy, when given a current state, will tell you what move to make from that state. You normally don't have the luxury of knowing the policy function; you need to learn it, in the same way a DNN/CNN, for example, needs to learn. The DNN will learn how to minimise the loss function using gradient descent, while a policy gradient-based reinforcement learning algorithm will aim to maximise the expected accumulated reward, i.e. the objective function.
The gradient-based idea works the same in RL. You take the gradient of the reward/objective function and propagate it back to the policy function that made the prediction being evaluated. Interestingly enough, the Q-value is very important here: it estimates how good the action the policy generated actually is (the expected return from taking that action in that state). So policy-based gradient methods are dependent on Q-values:
\[ \nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\, Q^{\pi_{\theta}}(s_t,a_t)\right] \]
This means you can calculate the gradient of the objective function \(J\) (which aims to maximise the total discounted reward) with respect to the parameters \(\theta\): take the gradient of the log-probability of the action \(a_t\) predicted/sampled by the policy \( \pi_{\theta}\), and multiply it by the Q-value \(Q^{\pi_{\theta}}(s_t,a_t)\) of that state-action pair. In these equations, \(\theta\) designates the parameters being learned.
Take it for granted that the above equation does, in fact, provide the gradient of the objective function once you have the gradient of the log of the policy that generated the action and the Q-value estimating how good that action is (those mathematicians are clever, aren't they?)
The Q-value function is, however, often unknown; that is, we don't actually know how good the action was, but we/the policy took an action regardless. We obviously need the Q-value, as indicated above, to generate the gradient that can improve the reward/objective function, so we need to estimate it. Different RL algorithms do that differently:
- Use total accumulated discounted reward/return (REINFORCE algorithm)
- Use a learnt critic \(Q_w(s,a)\) or value function \( V_w(s) \) (actor-critic algorithms)
- Use the advantage estimate \( \hat{A}_t \) (A2C, A3C algorithms)
- Use the advantage estimate \( \hat{A}_t \) (PPO/TRPO)
Either way, you're updating the policy function just as you would update the neural network: by using the gradient of the function that evaluates the prediction against the ground truth.
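As a sketch of what that shared update looks like (everything here, the tiny policy network and the state/action shapes, is my own stand-in; only the `advantage_estimates` term changes between REINFORCE, actor-critic, A2C/A3C and PPO/TRPO):

```python
# Sketch: the generic policy-gradient update, ascending E[ A_hat * grad log pi_theta(a|s) ]
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # 4-d states, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, advantage_estimates):
    """states: [T,4] floats, actions: [T] ints, advantage_estimates: [T] floats
    (the return, a learnt Q_w/V_w critic, or an advantage estimate A_hat)."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log pi_theta(a_t|s_t)
    loss = -(chosen * advantage_estimates).mean()             # minimise the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```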
FIN
NB: A little off topic, but an important consideration, is that learning is required for adaptation, meaning if you're interested in studying the adaptation of something, making it learn is a precursor to adaptation.
Since reading a survey on Agentic AI, I've formulated a basic blueprint of what I feel it is.
Agentic AI basically concerns creating AI workers that are:
- Autonomous, Adaptable and goal-driven
- Use a combination of reinforcement learning (RL) and goal-oriented architectures/approaches
- Implement adaptive control strategies and techniques
- Designed to be resilient to change (and the unknown)
- Implement smart, opportunistic learning, drawing on RL, imitation, pattern knowledge, priorities, social interactions, self-supervision, uncertainty, and dynamic, real-time datasets and environments.
- Long-term management of strategies concerning goals, tasks, context, priority and any other smarts such as patterns, cause and effect analysis, focus and attention.
- Management of complexity and overload
- Modular, flexible, composable, combinable, generalizable and evolvable concepts.
- Redefinable concepts of value
- Drawing from knowledge and historical events (cause and effect)
- Teaching methods
- Training/Learning methods, transfer learning
- Self-reflection (Performance, Quality, Goals, Learning)
- Safety and security, and ethical decisions
- Characterising a situation (defining its goals, priorities, subjects, interactions, attribution of cause and effects, etc)
- Monitoring
- Hardware and resources, and scalability
- Understanding changes in human behaviour (i.e. the changing of human goals, etc.)
This makes Agentic AI align very closely with simulating NPC behaviour in real-time environments such as computer games.