Q-Learning is used to solve Markov Decision Processes (MDPs) (see Markov Decision Processes); that is, its goal is to determine (learn) the best policy, i.e., the moves an agent can take to reap the maximum reward, which in most cases means reaching the goal in the optimal way.
In this way, optimization is a key aspect, and it gives rise to the equation used to achieve this in Q-Learning: the Bellman Optimality equation.
The Bellman Optimality equation is used to iteratively (that is, with every move) re-evaluate and re-compute the value of each move the agent takes, based on the reward observed from taking that move. The value of a move is its Q-value, and the Q-values of all moves are represented by the Q-function. Initially, these values are random and are not a true indication of the value of the moves. See Rationalizing Q-Learning, in the "Visualizing Q-values" section, for how the values of moves are determined.
The Bellman Optimality equation teaches the Q-function to value the moves the agent takes more accurately, based on how good the agent judges each move to have been. The agent interprets the value of a move from the rewards it sees as a result of taking it. The Bellman Optimality equation updates the value of the move and so helps refine the Q-function over time (see Rationalizing Q-Learning for a description of this equation):
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman Optimality Equation} \\ \]
As the Q-value function is updated iteratively, each iteration builds on past iterations' results, which leave traces of their updates in the Q-value function so that they can be used in the following iterations. This is how past results are readily available in the Q-function, i.e., the Q-function mutates over time.
In this way, Q-Learning is an algorithmic model of Reinforcement Learning (RL), which is a discipline of Machine Learning, itself a branch of Artificial Intelligence (AI).
In Q-Networks (first described here), the Q-value function still exists, i.e., it holds the value of any move and continually evolves as the agent iteratively updates it. However, instead of using the Bellman update to iteratively update the Q-function from the agent's sequential moves, a Convolutional Neural Network (CNN) is used instead.
However, it is not possible to simply drop in a CNN, because MDPs are simulations of real-time decision-making that occurs step by step as the agent moves, and there is no explicit notion of training images (as required by CNNs) in MDPs and RL.
The problem with an agent making sequential moves, from a CNN's point of view, is that if you were to represent each move as an image, the sampled sequential images (one per move) are going to look very similar to one another, i.e., each will differ only slightly from the last. This means they are highly correlated, which is known to lead to unstable network updates (poor learning). This is exactly why, during training, it is best to provide variation in the training data, so that the differences seen across samples can be used to generalize better. It is also why, when training a CNN, one typically augments the training input (blurring, rotation, etc.) to more readily differentiate each sample from the others. So sampling data in real time and passing it straight to the CNN is also a problem.
Therefore, Q-Networks deviate from the Bellman update not only in that they replace it, but also in that they do not refine/learn the Q-function from what the agent has just done (the moves it just took), because of the correlated nature of sequential images. Instead, they learn from random results that occurred previously in time: the agent keeps a history of the results of its past moves (and the associated screen images) and learns from those, while saving what it is currently experiencing to that history in real time.
This means the Q-Network is not learning about every experience as it experiences it, in distinct contrast to Q-Learning. This random-sampling strategy of learning from the move history is called Experience Replay.
In real time, the agent adds its current experience to the Replay Buffer as a 4-tuple consisting of the current state (\(\phi_t\)), the move taken (\(a_t\)), the reward received (\(r_t\)) and the resulting state (\(\phi_{t+1}\)).
During training/learning, while adding its current experiences to the Replay Buffer, the agent also randomly pulls samples from this buffer to learn from. The key is that the random samples are not sequential/consecutive and so are far less likely to be correlated. For example, here is a specific sample theoretically taken from the Replay Buffer for learning purposes:
\[ (\phi_j, a_j, r_j, \phi_{j+1}) \\ \tag{a random experience sample from the Replay Buffer} \\ \]
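As a rough illustration of the idea (not the paper's actual code), a Replay Buffer can be sketched as a fixed-capacity store of these 4-tuples from which random minibatches are drawn. The class name, capacity and batch size below are assumptions made for the example:

```python
import random
from collections import deque

# A minimal sketch of a Replay Buffer: a bounded store of (phi, a, r, phi_next)
# 4-tuples that supports uniform random sampling. Capacity and batch size are
# illustrative values, not taken from any particular implementation.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out when full

    def add(self, phi, action, reward, phi_next):
        """Save the agent's current experience in real time."""
        self.buffer.append((phi, action, reward, phi_next))

    def sample(self, batch_size=32):
        """Draw random, non-consecutive experiences, which are far less
        correlated than the agent's most recent sequential moves."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```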
From this sample, the algorithm passes the sampled "before" state \(\phi_j\) (which encodes the recent observation history for that move) as input to the CNN (acting as the Q-value function) in order to compute/predict the value of the move, i.e., its Q-value: \(Q(\phi_j, a_j)\).
If the CNN has learnt well, the predicted value will be similar to the real value of the move, i.e., what the Bellman equation would calculate it as: \( y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) \)
That is, it takes the immediate reward of the move (\(r_j\)) and the value of the best possible action from the resulting state \(\phi_{j+1}\) to calculate the move's corresponding target Q-value, and checks whether the CNN can produce a comparable value. The squared difference between what the CNN produced and what it should be (\(y_j\)) is backpropagated as weight updates in the network to refine it:
\[ (y_j - Q(\phi_j, a_j; \theta))^2 \\ \tag{Squared difference loss function} \\ \]
In this way, the Q-Network (CNN) is taught in real time by updating it with gradient descent and backpropagation, as opposed to the iterative tabular updates used to estimate the Q-value function under the Bellman Optimality equation.
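To make the loss and the gradient step concrete, here is a minimal sketch of one such update using PyTorch. The network, optimizer and tensor shapes are assumptions made for illustration; preprocessing, frame stacking and terminal-state handling are omitted for brevity:

```python
import torch
import torch.nn as nn

def dqn_update(q_network, optimizer, batch, gamma=0.99):
    """One gradient step on a random minibatch (phi_j, a_j, r_j, phi_{j+1})
    sampled from the replay buffer. q_network maps a state to one Q-value per action."""
    phi_j, a_j, r_j, phi_j1 = batch        # tensors: states, actions (int64), rewards, next states

    # Predicted value of the move actually taken: Q(phi_j, a_j; theta)
    q_pred = q_network(phi_j).gather(1, a_j.unsqueeze(1)).squeeze(1)

    # Target y_j = r_j + gamma * max_a' Q(phi_{j+1}, a'; theta)
    with torch.no_grad():
        y_j = r_j + gamma * q_network(phi_j1).max(dim=1).values

    # Squared-difference loss, backpropagated to refine the network weights
    loss = nn.functional.mse_loss(q_pred, y_j)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```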
Here is a drawing of the architecture to better rationalize the above description:
Figure: Architecture of playing Atari with a Q-Network
In a nutshell, the Q-Network learns the value of possible moves from possible initial states. The agent then consults an evolving Q-function for the best moves. The policy/strategy in MDPs is always to maximize the discounted accumulated reward, and this is achieved if the agent only takes the moves with the highest Q-value (by consulting the Q-function).
When the Q-function learns well, it ultimately converges to a version of itself called the optimal Q-value function, which defines the highest Q-values for states (through past learning). If those states are visited by the agent, the agent will consult the Q-function, implicitly take the highest-Q-value move, and therefore follow a sequence of moves that delivers its goal of maximum discounted accumulated reward.
Q-Networks can be briefly summarized as:
- Trying to solve the goal of an MDP (see Markov Decision Processes)
- Maximize the discounted accumulated reward
- The agent consults an evolving Q-Function for the best moves
- Q-Function learns the best moves by determining their Q-Values while the agent moves (Explores)
- Q-Function is replaced by a Q-Network which conceptually is the same thing but is a CNN instead of a linear function
- Experience Replay is used to overcome the challenges that training a CNN faces in online learning (real-time/dynamic environments), as sketched below
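Putting the pieces above together, a rough sketch of the interaction loop might look like the following. `env_reset`, `select_action`, `env_step` and `to_batch` are hypothetical helpers (initial state, epsilon-greedy action selection, one environment step, and minibatch collation), and `q_network`/`optimizer` are assumed to be defined as in the earlier sketch:

```python
# A rough sketch of the online loop: act, store the experience, then learn
# from a random minibatch of *past* experiences rather than the latest move.
buffer = ReplayBuffer()
phi = env_reset()                              # hypothetical: initial (preprocessed) state
for step in range(50_000):                     # number of steps is arbitrary here
    a = select_action(q_network, phi)          # hypothetical: epsilon-greedy over Q(phi, .)
    phi_next, r = env_step(a)                  # hypothetical: take the move, observe the reward
    buffer.add(phi, a, r, phi_next)            # save the current experience in real time
    if len(buffer) >= 32:
        dqn_update(q_network, optimizer, to_batch(buffer.sample(32)))
    phi = phi_next
```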
I've been reading papers presenting various approaches to Deep Learning that use CNNs (Convolutional Neural Networks) and DBNs (Deep Belief Networks) to yield important solutions to existing problems; these are collectively called Deep Neural Networks (DNNs). Other approaches use AEs (Autoencoders) to simplify the input for better learning. Together, these are referred to as Deep Learning approaches.
Generally, these approaches are good for learning and approximating functions, i.e., the mapping of inputs to outputs, without explicitly defining how inputs correspond to outputs. These associations are learnt automatically from the inputs and their known outputs, such that when only new data is provided, the output is as expected.
Most applications of DNNs look to correctly identify/label/predict the outcome merely from the input (after a long training process in which features of the inputs were learnt in order to make these predictions).
As I'm interested in modelling behaviour in artificial entities, I've also started reading about approaches to modelling decision-making, such as using Bayesian Networks. To this end, a fundamental learning approach to modelling behaviour is Q-Learning, which is a type of Reinforcement Learning algorithm/approach that solves a generic type of learning problem where the problem can be modelled as a Markov Decision Process (MDP).
MDPs define learning problems as being composed of 8 components, and two of the components are that the goal of the problem is to find the best way (often represented as 'the policy') to make a series of decisions such that all the decisions add up (literally) to the best outcome. The other six components are:
- 1) a definition of all the states that can be transitioned to
- 2) the types of actions that can be taken (these lead to new states)
- 3) the probabilities that states can be transitioned to given a previous state and an action (basically a definition of the possible outcomes for a state if any of the available moves were taken)
- 4) a reward function that defines the value of a transition (taking an action that leads to a new state)
- 5) a penalty constant that dampens the value of rewards as time goes on
- 6) the requirement that the states adhere to the Markov property, i.e., that the next state that can be transitioned to depends only on the current state (so no move history is considered when selecting the next move)
These components can be visualised in the following diagram:
Figure: A Markov Decision Process
As can be seen in the diagram, the 8 components of an MDP are listed underneath an example of a decision-making process.
Each block is a state, and together they compose the overall set of states required by an MDP. The possible actions are up, down, left and right, as indicated by the directional arrows. The optimal policy is a function that specifies which choices should be taken from state A so as to obtain the maximum discounted cumulative reward, which in this case is the optimal route of decisions (in blue) toward achieving the goal state G.
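To tie the diagram back to the components listed above, here is a small sketch of how such a grid could be written down as data. The specific route and rewards are made up for illustration; only the start state A, the goal state G and the four directional actions come from the description:

```python
# Illustrative only: a toy encoding of the grid example's MDP ingredients.
states  = ["A", "B", "C", "D", "E", "F", "G"]        # one entry per block in the grid
actions = ["up", "down", "left", "right"]

# An (assumed) optimal policy: which move to take in each state so that the
# route from A eventually reaches the goal state G.
policy = {"A": "right", "B": "right", "C": "down", "D": "down"}

# A simple reward function R(s, a, s'): reward only for reaching the goal.
def reward(s, a, s_next):
    return 1.0 if s_next == "G" else 0.0
```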
The maximum discounted accumulated reward is represented as the goal and can be described as follows:
\[ G_t = E [ \sum_{k=0}^{\infty} \gamma^kr_{t+k+1}] \tag{Goal} \\ \]
In the above equation, the goal \(G_t\) from time step \(t\) is to accumulate the maximum discounted reward over all future moves, where \(k\) indexes each subsequent decision and \(r_{t+k+1}\) is the reward it produces.
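As a small worked example of this sum (with made-up rewards and a discount factor of 0.9):

```python
# Computing the discounted accumulated reward for a finite list of rewards.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))   # sum_k gamma^k * r_{t+k+1}

# Four rewards of 1: 1 + 0.9 + 0.81 + 0.729 = 3.439
print(discounted_return([1, 1, 1, 1], gamma=0.9))
```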
The Markov property can be articulated in Math as:
\[ P(s_{t+1}|s_t, a_t) = P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}...) \tag{The Markov Property } \]
In English, this means that the next state \(s_{t+1}\) depends only on the current state \(s_t\) and the action \(a_t\) taken in the current state, not on any earlier states or actions.
The point is that, using these components, we aim to find the policy and, in most cases, learn an estimate of the policy, such that consulting the policy in the future will tell us the route at each time step that is expected to lead to the ultimate prize: the maximum discounted accumulated reward. Q-Learning is one way to solve this problem, using reinforcement learning to learn/estimate/approximate a policy that achieves the maximum discounted cumulative reward.
In a nutshell, an MDP is a model of a game: it defines the environment (rules), possible moves (actions), rewards, a policy (strategy) and a goal:
Figure: Illustration of a Markov Decision Process
FIN
Side note:
A Markov chain (MC) is a model for representing state transitions, much like the optimal path in blue described above. However, it does not include anything in the model about a set of possible actions that lead to those states; it merely indicates a mapping of which state moves to which following state. In this way, it is a fixed model of the decisions to be taken. The definition of this mapping is referred to as the State Transition Matrix, represented as \(P[i,j] = P(s_{t+1} = j | s_t = i)\), which gives the probability of transitioning to the next state \(s_{t+1} = j\) given the current state \(s_t = i\). The MDP transition probabilities (one of the components of an MDP), on the other hand, include the possible actions, i.e., \( P(s'|s,a) \): the probability of the next state \(s'\) given both the current state \(s\) and the move \(a\) taken in the current state.
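To make the contrast concrete, here is a small sketch with made-up states and numbers: a Markov chain needs only a state-to-state matrix, whereas an MDP's transition probabilities also condition on the action taken.

```python
import numpy as np

# Markov chain: state transition matrix P[i, j] = P(s_{t+1} = j | s_t = i).
# No actions are involved; the probabilities below are illustrative.
mc_transition_matrix = np.array([
    [0.7, 0.3],   # from state 0: stay with probability 0.7, move to state 1 with 0.3
    [0.4, 0.6],   # from state 1
])

# MDP: transition probabilities P(s' | s, a) depend on the action as well.
mdp_transitions = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "down"):  {"s2": 1.0},
    ("s1", "down"):  {"s3": 0.8, "s1": 0.2},
}
```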
I've been recently trying to piece together my learning around how reinforcement learning is actually implemented algorithmically.
The fundamental idea is that you'd like to simulate decision-making in an agent, but specifically that the decision-making process involves having the agent learn (and then make) moves that are the most beneficial to it. This, therefore, is an admirable and useful decision-making process to try and simulate. This also means that the agent must learn what 'good' moves are.
Generally, the definition of a 'good' move is a reward. The learning part involves trying moves and evaluating their resulting rewards, and making note of them for future visits. This means there is storage involved and that reinforcement learning is iterative, meaning the moves are trialled multiple times.
This trial-and-error is usually referred to as the agent performing many state transitions (moves) over time, and these are collectively called episodes. How the agent actually selects its next move depends on a policy, like only taking the best move each time (greedy policy).
The theory above takes practical form in an approach called Q-Learning.
At the heart of it is the Bellman equation:
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman equation} \\ \]
Fundamentally, this equation updates the value of any particular move the agent decides to take:
- Before taking any next move, the value of all possible next moves must be calculated.
- The value of any next move considers the value of the next best move after that (see figure below)
- The reason why any move is taken depends on the policy, but usually it's because the policy favours selecting the most valuable move, i.e., the one that will lead to the most rewarding state (greedy policy)
- The move that is taken has its value updated. This incorporates the reason for taking the move into the updated value (usually that it results in a higher reward), i.e., the reward becomes part of its updated value.
- As the agent moves state by state, it updates the value of the move that led it to that state, and it can look at what previous episodes stored for it about the possible next states (the Q-value function, i.e., Q(s,a)). This means the value of the move the agent is considering now has likely been updated through prior iterations, so its value represents those iterations and influences the agent now: the agent might decide to take that particular action (from that state) because its value has increased over time through past updates.
- The agent might not know how valuable the next move is if that move has never been followed before (Q-value will be 0); therefore, it requires previous episodes (visits) to tell it about the findings they had by updating the Q-value for that move.
- It's not recursive but relies on iteration (episodes/visits) to provide information (updated Q-Values) about the moves previously taken.
- The policy is actually how the agent decides which move to take, e.g., greedy will only make the most valuable moves, i.e., those with the highest Q-value. That is, the Q-value function is consulted to determine what the next step might be (see the sketch after this list).
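The points above can be condensed into a short tabular Q-Learning sketch. The environment interface (`env.reset()`, `env.step()`, `env.actions`) is a hypothetical stand-in, and the hyperparameter values are illustrative, not prescriptive:

```python
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn a Q-value function by trial and error over many episodes."""
    Q = {}                                                  # Q(s, a); unseen moves default to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Policy: epsilon-greedy -- usually take the most valuable move,
            # occasionally explore a random one.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q.get((s, act), 0.0))

            s_next, r, done = env.step(a)                   # take the move, observe the reward

            # Bellman update: fold the reward and the best next move's value
            # into the value of the move just taken.
            best_next = max(Q.get((s_next, act), 0.0) for act in env.actions)
            q_sa = Q.get((s, a), 0.0)
            Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
            s = s_next
    return Q
```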
Other important notes:
- Q-learning is model-free, meaning it does not require a model of state transitions, such as a state transition matrix that is used by a Markov chain (to define what can be done in the environment and with what probability), but instead learns to take actions based on experience.
- Q-Learning is an approach to solving MDPs (Markov Decision Processes)
Visualizing Q-Values
Here is a visual example of determining the value of moving down (action) from state S, where the next possible moves from that point, i.e., the Right and Down moves (actions) from state C, are considered:
Figure: Determining the value of a move (Q-value)
As can be seen, valuing the move from state S down to state C first determines what the value of state C is, based on the best possible reward that being in state C could provide. In this way, the value of both moves from state C, i.e., to F and to E, is considered in valuing the move from state S to C. This behaviour is encoded in the definition of the Bellman optimality equation given previously:
\[ Q(s,a) = Q(s,a) + \alpha( r + \gamma\max_{a'}{Q(s',a')} - Q(s,a)) \\ \tag{Bellman equation} \\ \]
That is, we update the value of taking the move/action \(a\) from state \(s\), i.e., its \(Q(s,a)\), by nudging its present value \(Q(s,a)\) (which might have been updated through past iterations to indicate the value of this move from this state) toward a target made up of the reward \(r\) of making that move plus the value of the best possible move from the resulting state, \( \max_{a'}{Q(s',a')} \), dampened by the discount \(\gamma\) for future rewards; the learning rate \(\alpha\) controls the size of that nudge.
\( \max_{a'}{Q(s',a')} \) means: find, over every possible action \(a'\), the largest value available in the next/resulting state \(s'\). As illustrated above, this takes into account the next state's best possible move thereafter.
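As a worked example of one such update with made-up numbers (a learning rate \(\alpha = 0.5\), a discount \(\gamma = 0.9\), a reward of 1 for the move, and assumed current Q-values for the two moves out of state C):

```python
alpha, gamma = 0.5, 0.9

Q = {
    ("S", "down"):  0.0,   # current value of the move being updated (S -> C)
    ("C", "right"): 2.0,   # assumed value of moving from C towards F
    ("C", "down"):  5.0,   # assumed value of moving from C towards E
}

r = 1.0                                                  # reward for moving down from S to C
best_next = max(Q[("C", "right")], Q[("C", "down")])     # max_a' Q(s', a') = 5.0

Q[("S", "down")] += alpha * (r + gamma * best_next - Q[("S", "down")])
print(Q[("S", "down")])                                  # 0.5 * (1 + 0.9*5 - 0) = 2.75
```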
When considering the goal of an MDP as described in Markov Decision Processes, the process of updating the Q-values over time is the process of learning to become the policy that maximises the expected discounted accumulated reward. In other words, the Q-values used at the start of the learning process will not provide the best path, but over time the Q-function evolves through updates into what we call the optimal Q-value function \(Q^*(s,a)\), which does. This means we are learning \(Q^*(s,a)\) from an initially poor \(Q(s,a)\) that evolves through updates (see the Bellman equation) to become the policy that achieves the maximum discounted accumulated reward.