Defining and Conceptualizing Agentic AI

By Stuart Mathews, 03 October 2025 (Last Updated: 03 October 2025)
  • Causality
  • Situation Detection
  • Experience
  • Autonomy
  • Self-aware
  • Characterising the unknown
  • Behavioral Adaptation
  • Exploration

Since reading a survey on Agentic AI, I've formulated a basic blueprint of what I feel it is.

Agentic AI is essentially concerned with creating AI workers that exhibit the following (a rough sketch of such an agent loop follows this list):

  1. Autonomy, adaptability and goal-driven behaviour
  2. A combination of reinforcement learning (RL) and goal-oriented architectures/approaches
  3. Adaptive control strategies and techniques
  4. Resilience to change (and the unknown)
  5. Smart, opportunistic learning such as RL, imitation, pattern knowledge, priorities, social interactions, self-supervision, handling uncertainty, and dynamic, real-time datasets and environments
  6. Long-term management of strategies concerning goals, tasks, context, priority and other smarts such as patterns, cause-and-effect analysis, focus and attention
  7. Management of complexity and overload
  8. Modular, flexible, composable, combinable, generalizable and evolvable concepts
  9. Redefinable concepts of value
  10. Drawing from knowledge and historical events (cause and effect)
  11. Teaching methods
  12. Training/learning methods, including transfer learning
  13. Self-reflection (performance, quality, goals, learning)
  14. Safety, security and ethical decisions
  15. Characterising a situation (defining its goals, priorities, subjects, interactions, attribution of cause and effect, etc.)
  16. Monitoring
  17. Hardware, resources and scalability
  18. Understanding changes in human behaviour (i.e. the changing of human goals, etc.)

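As a rough way of tying several of these properties together (goal-driven behaviour, characterising a situation, acting, and self-reflection), here is a minimal sketch of an agentic control loop. It is my own illustration rather than a reference implementation; the class and method names are assumptions, not an established framework.

```python
# Minimal, hypothetical sketch of an agentic control loop; all names
# (Agent, characterise_situation, reflect, ...) are illustrative assumptions.

class Agent:
    def __init__(self, goals):
        self.goals = goals   # long-term goals the agent manages and re-prioritises
        self.memory = []     # historical events, i.e. cause-and-effect pairs

    def characterise_situation(self, observation):
        # Characterise the situation: subjects, priorities, likely causes/effects.
        return {"observation": observation, "priority_goal": self.goals[0]}

    def decide(self, situation):
        # Placeholder policy; in practice this could be RL, imitation learning,
        # or a goal-oriented planner.
        return "act-towards:" + situation["priority_goal"]

    def reflect(self, action, outcome):
        # Self-reflection: record cause and effect for later behavioural adaptation.
        self.memory.append((action, outcome))

    def step(self, observation, environment):
        situation = self.characterise_situation(observation)
        action = self.decide(situation)
        outcome = environment(action)   # environment is any callable here
        self.reflect(action, outcome)
        return action, outcome


if __name__ == "__main__":
    agent = Agent(goals=["explore"])
    # A trivial stand-in environment that just echoes the action back.
    print(agent.step("initial-state", lambda a: "result-of:" + a))
```
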
Figure: Defining and conceptualizing Agentic AI

This makes Agentic AI closely comparable to simulating NPC behaviour in real-time environments such as computer games.

Revisiting the Derivative

By Stuart Mathews, 03 October 2025 (Last Updated: 05 October 2025)
  • Math

Since writing Understanding Q-Networks, I've been rationalising some concepts I've previously pursued, and I feel like revisiting the derivative.

The derivative is fundamentally a description of how a function works and what its characteristics are, and so it allows us to peek into how its behaviour is defined.

For example, \( f(x) \to y\) represents a function whose outward behaviour is the mapping of inputs x to outputs y. You can peek into how it does this if you can see how it represents this mapping. For example, we can represent its work as a table:

f(x) = y, i.e., x is the input to the function f, and it produces y:

  x    y
  1    5
  3    3
  ...  ...

The argument x is considered the primitive (and is the input), and so the derivative tells you about the nature of the primitive argument, x, in terms of its relationship with f() and y.

Back to the derivative: this tells us specifically about the relationship between x and y, the input and output of the function, and therefore fundamentally helps describe the behaviour of the function in terms of how x and y relate.

Really, the derivative is a description of the relationship between x and y: how x affects y and vice versa. To talk about this relationship, one needs to put it in the context of how, when x changes, y correspondingly changes. In this way, it makes sense to describe it as how a minimal change in x affects the associated change in y (remember, the output y depends entirely on the input x). That is, generally, how x affects y, and the derivative indicates or shows this.

The derivative, which we've only discussed in theory so far, is written in math notation like this: \(f'(x) = \frac{dy}{dx}\), i.e. the derivative of the function f(x) is a ratio (read: relationship) between how y varies and how x varies. The d represents the idea of 'change', so this is the relationship between the change in x and the corresponding change in y.

It can also be written as \(\lim\limits_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}\), which is another way of saying the same thing, i.e. the ratio/relationship between a change in x and the corresponding change in y. Specifically, this form says that we want the amount of change that x experiences to be vanishingly small, so that the relationship can also be seen in a correspondingly small change in y. This is done so that we can reason about how the most minute change in x causes a correspondingly minute change in y, and so we can start thinking of this as a fundamental characteristic of how x corresponds to y (this is why we shrink \(\Delta x\): to capture the essence of a change in x).

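To make the limit concrete, here is a small numerical sketch of my own (the function \(f(x)=x^2\) and the point \(x_0=3\) are assumptions chosen purely for illustration): as \(\Delta x\) shrinks, the difference quotient settles towards the true derivative \(2x_0 = 6\).

```python
# Approximating the derivative of f(x) = x**2 at x0 = 3 by shrinking delta_x.
# The difference quotient approaches the true derivative 2 * x0 = 6.
def f(x):
    return x ** 2

x0 = 3.0
for delta_x in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    quotient = (f(x0 + delta_x) - f(x0)) / delta_x  # change in y over change in x
    print(f"delta_x = {delta_x:<8g} quotient = {quotient:.6f}")
```
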
Intuitively:

A change in y only occurs because there was a corresponding change in x; therefore, quantify the change from its before state to its after state as how its dependency (x) changed:

Figure: Two points of function representing a before and after state of y

\(f(x_0)\) is the before state and \(f(x_0 + \Delta x)\) is the after state of y, while \(\Delta x\) is the difference between what x was in the before state and what it is in the after state.

Geometrically:

Figure: Visualising the derivative

The ratio between the opposite side (the change in y) and the adjacent side (the change in x) is what we want to quantify, based on what we've been describing the derivative to be. This is exactly the value given by \(\tan \alpha\), where \(\alpha\) is the angle between the hypotenuse (the line joining the two points) and the adjacent side.

You can calculate different derivatives of a function at various points on the function (see the graph above depicting the function), depending on which two points on the function you choose (corresponding to two y-values and their associated x-values) before minimizing \(\Delta x\) so that the two points converge to a point where \(\Delta x \to 0\), which shows where the tangent line would touch the function. The derivative, through its expression of the change in x and y, indicates the tangent, and the tangent can tell us about the characteristics of the function:

Figure: multiple derivatives

  • The hypotenuse is the line joining the two points defined by the two y-values (in the limit, the hypotenuse becomes the tangent!)
  • The horizontal run along the x-axis forms the adjacent side (when considering the tan of \(\alpha \)).

So the derivative tells you something about the nature of two points on the function and shows you something about the nature/form of the underlying function. What it tells you is what the function does between two inputs, so the nature of the function between these two inputs can be exposed.

The derivative tells you about the nature of the curve by using two input values (x):

  1. Positive derivative value, like 4: the function is increasing in value
  2. Negative derivative value, like -2: the function is decreasing in value
  3. Zero-valued derivative: indicates a local minimum, maximum, or saddle point of the function
  4. 2nd derivative: indicates the concavity of the function, i.e. the upward smile or downward frown form of the function/graph.

The first two points collectively are described as the function's rate of change, i.e., how quickly (the magnitude of the value) and in what direction the function is changing with respect to the input variable, x (a numerical sketch of these cases follows below).

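To illustrate points 1 to 4 numerically (my own sketch, with an assumed example function \(f(x)=x^3-3x\)), central differences give an estimate of the first and second derivatives at a point, from which increase/decrease and concavity can be read off:

```python
# Estimating f'(x) and f''(x) with central differences to read off whether the
# function is increasing, decreasing or flat, and its concavity, at sample points.
def f(x):
    return x ** 3 - 3 * x   # assumed example: has a local max, a local min and an inflection

def d1(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h ** 2)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    slope, curve = d1(f, x), d2(f, x)
    trend = "increasing" if slope > 1e-6 else "decreasing" if slope < -1e-6 else "flat (min/max)"
    shape = ("concave up (smile)" if curve > 1e-3
             else "concave down (frown)" if curve < -1e-3 else "near an inflection")
    print(f"x = {x:5.1f}: f' ~ {slope:7.3f} ({trend}), f'' ~ {curve:8.3f} ({shape})")
```
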
In physics:

  1. The derivative of position is velocity
  2. The derivative of velocity is acceleration

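A quick sketch of my own of those two physics facts (the time step and sample positions are assumed values): differencing sampled positions approximates velocity, and differencing the velocities approximates acceleration.

```python
# Finite differences over time: differencing positions gives (approximate)
# velocities, and differencing velocities gives (approximate) accelerations.
dt = 1.0                                   # assumed time step
position = [0.0, 1.0, 4.0, 9.0, 16.0]      # e.g. x(t) = t**2 sampled at t = 0..4

velocity = [(position[i + 1] - position[i]) / dt for i in range(len(position) - 1)]
acceleration = [(velocity[i + 1] - velocity[i]) / dt for i in range(len(velocity) - 1)]

print(velocity)       # [1.0, 3.0, 5.0, 7.0] -> derivative of position
print(acceleration)   # [2.0, 2.0, 2.0]      -> derivative of velocity (constant, as expected for t**2)
```
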
Partial Derivatives

Partial derivatives describe the same thing for multivariate functions (functions with multiple input variables), e.g. \(f(x,z) = y\); however, as multiple input variables (x and z) influence how the function changes, each has a partial influence on how the function changes with respect to that variable. Therefore, you have a partial derivative with respect to each of the input variables:

\( \frac{\partial f}{\partial x}\) is the partial derivative with respect to x, describing the change x has on the multivariate function; the other is \(\frac{\partial f}{\partial z}\).

Like the derivative, there are multiple partial derivatives.

For example, for a single-variable function there can be numerous derivatives, (for argument's sake) one at every point on the function. That is, along the curve you can have different rates of change that the function is experiencing, corresponding to how the x-value at that point causes the function to increase or decrease, represented by a positive or negative derivative at that point. The same idea holds for partial derivatives: you can have a partial derivative (for argument's sake) at each point.

A partial derivative is partial because it is only concerned with the change that occurs with respect to a single variable. The other variable also influences the function, but it gets its own partial derivative, and each partial derivative is defined to exclude the influence of the other variable. For example, for \(f(x,z) = y\) you have a series of partial derivatives along the x-axis and a series along the z-axis, represented symbolically as \( \frac{\partial f}{\partial x}\) and \( \frac{\partial f}{\partial z}\).

At a point in a multivariate function, i.e. where the input variables produce that point, you therefore have two partial derivatives, i.e. \( \frac{\partial f}{\partial x_i}\) and \( \frac{\partial f}{\partial z_i}\), which, if you combine them, become what is called the gradient at that point, i.e. \( \nabla f = [\frac{\partial f}{\partial x_i},  \frac{\partial f}{\partial z_i}]\), or more generally \( \nabla f = [\frac{\partial f}{\partial x},  \frac{\partial f}{\partial z}]\).

Also, like the derivative and the partial derivative, which exist along the points of the function, the gradient exists at every point of the function, and the specific gradient at a specific point of a multivariate function is indicated like this: \(\nabla f(x_0,z_0) = [ \frac{\partial f}{\partial x} |_{(x_0, z_0)}; \frac{\partial f}{\partial z}|_{(x_0,z_0)}]\)

The gradient

While the derivative in single-variable functions and the partial derivative in multi-variable functions describe how the function increases or decreases at a particular point in the function, the gradient describes the combined influence of both (if two variables) partial derivatives at that point and in doing so indicates the direction of the steepest ascent, that is where the function is increasing the most from that point.

In other words:

The gradient of a multi-variable function expresses how the function changes with respect to its input variables. A change in the function caused by a specific value of a variable is expressed as the partial derivative with respect to that variable at that point.

As the multi-variate function has two variables, x and z, changes in each affect the change in the function overall. The change with respect to each is called a partial derivative.

If we combine the partial derivatives that exist in unison at a point in the function (each partial derivative exists there), then we call that the function's gradient at that point, represented as \( \nabla f = [\frac{\partial f}{\partial x},  \frac{\partial f}{\partial z}]\). This represents the direction and the rate of the steepest local change, meaning where the function increases most rapidly (direction) and how fast it increases in that direction (magnitude). Together, this represents the direction of steepest ascent: the direction in which the function is increasing the most.

The gradient is specific to a point on the function, so there are many gradients. Each shows in which direction and how fast the function increases from/at that point.

The gradient is a vector of the partial derivatives, one for each (with respect to) input variable:

\(\nabla f = [\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \frac{\partial f}{\partial x_3}, ... \frac{\partial f}{\partial x_n}]\), assuming there are n variables (and therefore n partial derivatives) in the multivariate function f.
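
As a concrete sketch of my own (the example function \(f(x,z)=x^2+3z\) is an assumption chosen for illustration), the gradient vector can be approximated numerically by computing one partial difference quotient per input variable:

```python
import numpy as np

# Numerical gradient of an assumed example function f(x, z) = x**2 + 3*z,
# built from one partial difference quotient per input variable
# (the other variables are held fixed for each partial).
def f(v):
    x, z = v
    return x ** 2 + 3 * z

def numerical_gradient(f, point, h=1e-6):
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)  # partial w.r.t. variable i
    return grad

# The analytic gradient is [2x, 3], so at (2, 1) this prints approximately [4, 3].
print(numerical_gradient(f, [2.0, 1.0]))
```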

Reviewing Playing Atari with Deep Reinforcement Learning

By Stuart Mathews, 24 September 2025 (Last Updated: 01 October 2025)
  • Research Review

Introduction

The paper reviewed in this article is "Playing Atari with Deep Reinforcement Learning" by Mnih et al., which describes the first time Deep Neural Networks (DNNs) were integrated with Reinforcement Learning.

Q-Learning has been used as a solution to Markov Decision Processes (see Markov Decision Processes); it uses reinforcement learning (RL) to determine the best policy an agent can use to reap the maximum rewards through its moves. See Rationalizing Q-Learning.

It has been used to provide impressive learning capabilities in games like Go and Backgammon, but this successful strategy has not been transferable to some games and sometimes requires specific knowledge about the nature of the game in order to learn successfully.

This paper looks to use a non-linear function (a neural network) to estimate and represent the Q-value function, as opposed to a traditional linear function.

The traditional means of updating or improving the estimate of the Q-value function, the iterative Bellman Update, is instead replaced with gradient descent to teach/improve the neural network acting as an estimate of the Q-value function (and, through Experience Replay, it does not train on sequential experienced samples).

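To make that replacement concrete, here is a minimal sketch of my own (not code from the paper; a linear approximator stands in for the paper's CNN purely for brevity): the tabular Bellman/Q-Learning update adjusts one table entry directly, while the gradient-based alternative nudges the parameters of a Q-function approximator towards the same kind of target.

```python
import numpy as np

# Tabular Q-Learning: the Bellman-style update adjusts one table entry directly.
def tabular_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Function approximation: a (here linear, for brevity) Q-function is instead
# nudged by a gradient step on the squared error against the same kind of target.
def gradient_update(w, features, a, r, next_features, alpha=0.01, gamma=0.99):
    q_pred = features @ w[a]
    target = r + gamma * max(next_features @ w[b] for b in range(len(w)))
    td_error = target - q_pred
    w[a] += alpha * td_error * features   # gradient step on 0.5 * td_error**2

if __name__ == "__main__":
    Q = np.zeros((5, 2))                  # 5 states, 2 actions
    tabular_update(Q, s=0, a=1, r=1.0, s_next=2)
    w = np.zeros((2, 3))                  # 2 actions, 3 features per state
    gradient_update(w, np.array([1.0, 0.5, -1.0]), a=0, r=1.0,
                    next_features=np.array([0.0, 1.0, 0.5]))
    print(Q[0], w[0])
```
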
The approach I've used to structure my review process is outlined in Research Review Process.

  1. Research question
  2. Research aim
    1. Problem
  3. Type of research
  4. Mode of enquiry
  5. Solution
  6. Methodology
    1. Research Methods
    2. Research techniques
  7. Data
  8. Information
  9. Knowledge
  10. Correlation vs Causation
  11. Literature review
    1. Referenced papers
    2. Citations
  12. Reasoning method
  13. Subjectivity/Objectivity 
    1. Construct Validity
    2. Internal Validity
      1. Research Correctness
      2. Research technique
      3. Research techniques vs research question
      4. Conclusion vs methods
    3. External Validity
    4. Data Validity
      1. Data objectivity
      2. Data subjectivity (specificity/narrowness)
        1. Data vs Research Question
    5. Summary of general risks to validity
      1. Credibility concerns
  14. Relevance, Contribution, Originality and Novelty
  15. Implications & Contributions
    1. Opinion

Research question

Can past reinforcement learning success in specific games (Backgammon, Go, etc.) be improved using CNNs based on sensory observations, and can it be made more generalisable to more games?

Research aim

The paper seeks to show that it is possible to combine a DNN with image (sensory) data and RL, specifically by teaching an automated player/agent to choose the best moves in multiple games, demonstrating the effectiveness of an algorithm that couples a DNN with Q-Learning.

Problem

How to use a DNN with reinforcement learning. 

There are inherent challenges to using Deep Neural Networks in reinforcement learning that would need to be overcome.

  1. In RL there aren't copious amounts of hand-labelled training data, which would otherwise be used with CNN supervised learning.

  2. DNNs require training samples that are not correlated; they expect distinct, independent training samples, whereas training on just-in-time, real-time data produces training samples that are very similar to each other.

  3. RL usually acts on a scalar reward and not images (e.g. Bellman Update)

  4. The delay between action and resulting rewards can be thousands of timesteps long compared to the direct association between input and labels in neural networks

Type of research

This is primarily quantitative/empirical research.

Mode of enquiry

Scientific. The paper used experiments, a systematic process for driving the experiments (the algorithm), and a systematic process for obtaining and processing images/observations.

The paper also uses formal models to represent games, such as Markov Decision Processes, and reinforcement learning methods such as Q-Learning.

Solution

A combination of Q-Learning, Experience Replay and CNNs is used as an approach to implementing deep learning in a traditional reinforcement learning setting (a minimal sketch follows the list below).

  1. Use Experience Replay to mitigate the problem of highly correlated sample input from the real-time environment.

  2. Replace the traditional learning method of the Q-value function (the Bellman Update) with a neural network, which is then called a Q-Network or DQN (Deep Q-Network).

  3. Teaching the NN is done by minimizing the difference between the DQN's predicted Q-value and a target Q-value derived from the reward the agent actually experienced; it is this difference that is backpropagated into the network.

  4. A mapping from a history of move results (observations as images) to action values is used to predict the reward.

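A minimal sketch of my own of the Experience Replay idea in points 1 to 3 above (the names and the generic q_next callable are assumptions, not the paper's implementation): transitions are stored and randomly re-sampled so the network trains on decorrelated mini-batches, with targets of the form \(y = r + \gamma \max_{a'} Q(s', a')\).

```python
import random
from collections import deque
import numpy as np

# Minimal sketch of the Experience Replay idea: store transitions and sample
# them at random so mini-batches are decorrelated. Names are illustrative only.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

def td_targets(batch, q_next, gamma=0.99):
    """Targets of the form y = r + gamma * max_a' Q(s', a'); q_next is any
    callable returning the action values for a next state."""
    targets = []
    for state, action, reward, next_state, done in batch:
        y = reward if done else reward + gamma * np.max(q_next(next_state))
        targets.append((state, action, y))
    return targets
```
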
Methodology

Experimental, design-science-based research, i.e. experimental development and performance evaluation of an artefact.

Research Methods

  • Real-time experimental game-play and simulation
  • Reinforcement learning (Machine learning)
  • Algorithm development
  • Mathematical Modeling

Primarily experimental methods that yield quantitative measurements of the experiment's performance (rewards as the game learning progresses).

Seven popular Atari games were used: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders.

A virtual emulator of the various games is controlled programmatically using an algorithm that instructs the player to move based on the value of past moves (a history of move results), represented only as sequences of past move actions and their resulting screens.

The process has access to the history of the results of moves (rewards and resulting images in the emulator), and these are randomly sampled via the Experience Replay mechanism to feed a neural network that estimates the Q-value function of action values. The improvements are based on feeding in prior moves and the rewards they resulted in, and seeing whether the NN can predict the same reward as the agent experienced.

The algorithm that is developed replaces the traditional Bellman Update of a linear Q-value function with gradient descent, improving the DQN's estimation of the Q-value based on the agent's actual experiences.

Experiments are set up to play (by controlling a player/agent) a variety of games using the Atari 2600 Arcade Emulator and the algorithm that drives it ("...it learned from nothing but the video input, the reward and terminal signals, and the set of possible actions—just as a human player would").

Research techniques

  1. Neural Networks (Convolutional Neural Network)

  2. Gradient descent network updates instead of the iterative Bellman Update
  3. Q-Learning reinforcement learning (to solve Markov Decision Process)

  4. Experience Replay (dealing with real-time samples)

  5. Frame-skipping

Data

The data is the actual output screens (images) rendered by playing game moves, captured as frames before and after the agent moves.

These are transformed into simplified representations (greyscale, reduced in size). This preprocessing simplifies the data so that the CNN can learn from it more effectively:

Initially the input is 210x160 8-bit colour images, which are converted to greyscale and then reduced in size to 110x84, and further to the square 84x84 required by the convolution kernel settings used by the CNN ("high dimensional visual input (210 × 160 RGB video at 60Hz)").

The last 4 frames are captured before making a move, resulting in an input to the NN of 84x84x4.

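Here is a rough sketch of my own of this preprocessing and frame-stacking step (NumPy only; the paper's exact greyscale conversion, downsampling and cropping differ in detail): greyscale, downsample, crop to 84x84, and stack the last four frames into an 84x84x4 input.

```python
import numpy as np
from collections import deque

# Rough sketch of the frame preprocessing described above (NumPy only; the
# paper's exact resizing and cropping differ in detail).
def preprocess(frame_rgb):
    """frame_rgb: (210, 160, 3) uint8 array -> (84, 84) float32 greyscale crop."""
    grey = frame_rgb.mean(axis=2)            # crude greyscale conversion
    down = grey[::2, ::2]                    # naive downsample to 105x80
    canvas = np.zeros((110, 84), dtype=np.float32)
    canvas[:105, :80] = down                 # place on a roughly 110x84 canvas
    return canvas[13:97, :84]                # crop a square 84x84 playing region

frames = deque(maxlen=4)                     # keep only the last 4 frames
for _ in range(4):
    random_frame = np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8)
    frames.append(preprocess(random_frame))

state = np.stack(frames, axis=-1)            # network input of shape (84, 84, 4)
print(state.shape)
```
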
The association between a history of observations (moves and their resulting screens) is used to infer a Q-value (based on the reward for the move that the agent made), and any difference from the network's prediction is propagated back into the Q-network's weights via backpropagation and gradient descent.

Information

  1. Four frames prior to the agent's move are recorded/observed to allow the neural network to 'see' a change of direction in the input images
  2. Prior RL techniques such as Sarsa, Contingency and HNeat used hand-crafted features from the games
  3. DQNs outperform prior RL algorithms on six of the seven games

Knowledge

  1. DQNs are superior at learning compared to classic RL algorithms such as Q-Learning and SARSA
  2. Experience Replay as a mechanism can mitigate the problems of correlated states in real-time environments.

    1. Experience Replay is also crucial in sampling/re-using past experiences multiple times, to avoid forgetting and to maintain recollection of infrequent, rare, but otherwise high-value past experiences when re-training the adapting Q-Network

  3. The Bellman Update can be replaced with gradient-descent learning on a DQN by minimizing the difference between the experienced reward and the predicted reward

  4. You can learn the next move based on the history of past moves represented as images over time.

Correlation vs Causation

  1. The average accumulated rewards were used to track the effectiveness of the agents over the long term, i.e. over multiple episodes of learning

  2. As time increased, i.e. as more episodes were used for training, the rate of accumulated rewards increased, suggesting that the Q-value function was getting better as the games were being played.

  3. As the learning was based on the actual experiences of the agent, these measurements of reward were used to improve the Q-value function (Q-Network)

Literature review

Referenced papers

Figure: Referenced paper chronology


Citations

  1. Retrospectively, this is an influential paper as it is the first approach to integrating reinforcement learning with deep neural networks.

  2. The paper appears to be highly cited on Google Scholar and arXiv. There is no representation in other notable venues such as IEEE or ACM journals.

  3. The authors are all DeepMind researchers who are well respected in this field.

Reasoning method

Deduction. The authors reasoned theoretically that they could replace the linear Q-value function with a neural network, provided they could somehow adapt the RL learning process, which is inherently real-time and provides highly correlated samples of state, to DNNs, which show potential but aren't easily usable under those real-time conditions. They also reason that there have been improvements to hardware since the last time RL was applied in this way.

They conduct experiments, and the experimental results suggest that the algorithm learns to play the games better than all previous approaches in 6 of the games, beating even expert human players in 3 of them, suggesting that the algorithm, using a CNN, objectively improves learning in RL environments.

Subjectivity/Objectivity 

Construct Validity

Strictly speaking, while the research suggests the use of sensory data, it is certainly not sensory data sensed from the agent's perspective, i.e. what the agent sees or hears. It is more a generalised observation of the environment as a result of an agent action. This might be misleading, as sensory data usually means what the sensing entity itself perceives, not what an external observer sees. This is the player's observation, i.e. what the player/algorithm sees, not what the agent senses.

Internal Validity

No obvious flaws

Research Correctness

  1. There are about 24 referenced papers, most from around the time of publication, which suggests the paper was referring to the state of the art at the time.

  2. The creation of a systematic algorithm means this work is repeatable and can be reproduced.

  3. A standard process for obtaining image data from playing games is used, through the emulator environment (Atari 2600 emulator), which has been used in prior research.

  4. The learning uses a CNN model whose parameters are not changed across the experiments. This is true of the algorithm also.

  5. Only what a human player would be expected to input (moves) is required and the learning approach only observes the results

Research technique

No obvious flaws

Research techniques vs research question

TBD

Conclusion vs methods

  • The approach is biased towards simple games and does not appear to handle more complex games that require more advanced strategies
  • The paper itself is 12 years old, so while the approach is useful for understanding how the first integration of DNNs (CNNs specifically) in an online learning capacity was achieved, it is unlikely to be optimal today.

External Validity

  1. This approach uses data that is specific to Atari-type games, i.e. 8-bit graphics, etc. It's not clear that this can be generalized to more advanced, higher-bit graphics and resolutions such as 24-bit. Unlike the Atari games in this emulator, not all games have a simple concept of a score, so this approach would be difficult to use for those types of games. They are instead measured by other means that might not be readily obtainable from an emulator, adding to the fact that this approach is tied to relatively simple Atari games. It does, however, show that the algorithm can learn well using game images/screens.

  2. Game complexity. It's not clear if this approach will work for games that are more complex, such as RTS games, where not only are there many more actions (into the hundreds), but the rewards aren't directly evident from immediate rewards (which Q-learning is based on), as they can arrive only near the end of the game. Also, in such games the environment can change less predictably, with smarter enemies that change their behaviour each time. They quote the aim as being to show that it is possible to "Learn successful control policies from raw video data in complex RL environments"...

Data Validity

Data objectivity

  1. Screen images are sensory representations of what happens in the game, so images are an appropriate way to assess the value of a move that resulted in a screen.
  2. The data is produced using a consistent approach: a controlled sampling of the last 4 game frames prior to each move/decision, via controlling the emulator.

Data subjectivity (specificity/narrowness)

  1. The data is restricted to 8-bit images

  2. Only certain games are used in the research (Seven popular ATARI games – Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, Space Invaders)

Data vs Research Question
  • How can a DNN be used in traditional Reinforcement learning?
    1. Image data is used to map the results of the player's actions (observations as screen images) to labels, which in this case are the predicted Q-values representing the value of those actions. This makes sense.
    2. In the past, game output has been used in other RL research, notably TD-Gammon

Summary of general risks to validity

  1. Only works for Atari games, with image data coming from the Atari 2600 emulator
  2. Only a single sensory source of information is used: the output from the emulator.
  3. Strategies for complex games are not learned well.
  4. The design of the CNN is simple and is likely to be out of date.

Credibility concerns

None

Relevance, Contribution, Originality and Novelty

  1. Superior learning over past algorithms (Q-Learning and SARSA)
  2. A CNN can be used in RL (online learning) and learn the reward from a randomly sampled mini-batch of experiences (Experience Replay)

Implications & Contributions

  • First time DNNs integrated into RL using Q-Learning

Opinion

  • A useful introduction to integrating a CNN with RL as a means to bridge the gap between a traditional offline model (CNN) and online, model-free learning.
  • Unlikely to be performant and likely superseded by more recent approaches.
  • Struggles to learn games that are more advanced.
