Blog - Stuart Mathews

Details: Category: Blog; By Stuart Mathews; 25.Jul; 25 July 2025; Last Updated: 02 August 2025; Hits: 692

Since Thoughts on Bayesian Networks, I've been thinking about how they actually work and why they work. I'm going to walk through the process behind the theory I presented previously.

From a learning perspective, i.e., how they learn, my research suggests that they rely on statistics gathered from a growing number of observations over time. As observations accumulate, they change the average rate at which any particular situation is seen to occur (or not occur).

For example, if you're designing a spam filter which is to learn and then ultimately decide which mail is spam and which is legitimate, you might start by collecting samples over time and then counting which aspects or conditions of the mail lead to known spam mail. As more email comes in, your data grows, and so does the number of occurrences of the conditions that led to mail being identified as spam.

Another example I gave earlier is in predicting the weather. I'll try and detail the process that is used using Bayesian Networks:

If we are going to make predictions about if it's going to rain or not, we need to collect information about what happens when it does actually rain or what happens when it does not rain, i.e we need to collect daily weather observations. For example:

Pattern	Observation #	Cloudy	Humid	Rain
A	1	Y (1)	H (1)	Y (1)
A	2	Y (1)	H (1)	Y (1)
B	3	Y	L (0)	N
C	4	N	H	Y
A	5	Y	H	N
D	6	N	L	N
A	7	Y	H	Y
C	8	N	H	N
A	9	Y	H	Y
D	10	N	L	N
B	11	Y	L	N
C	12	N	H	N

This shows 12 observations, each recording whether it was cloudy, whether it was humid, and whether it rained. There are 4 unique combinations of the Cloudy and Humid conditions. These patterns are labelled as A, B, C, D.

These are boolean conditions, so if the condition occurred, e.g, it was humid, we use the value of 1, otherwise 0, etc.

We will then work out which combination of conditions or patterns within that data seem to correlate with rain occurring on those days. From these patterns, we will see how many times those conditions occurred over time to determine the average that those particular conditions correlated with rain.

For example, we might say that a particular pattern in the observations almost always also correlates with rain, so we might say then that of all observations that included rain, that particular pattern was present 90% of the time.

From this historical data (observations), we can now start answering probability questions about rain. That is, we can make an inference:

What is the historical probability that it rains, given that it is cloudy and humid (Pattern A)?

We can tell this by seeing that from the historical data (the 12 past observations), there are 4 situations where it is cloudy and humid and rains. But there is another occurrence of cloud and humid conditions where it did not rain. So we have 4 out of 5 occurrences of humid and cloudy, where it also rained, i.e there is a 4/5 chance or 80% probability of rain when it's cloudy and humid:

P(Rain=1|Cloudy=1, Humid=1) = 4/5 or 0.8 or 80% (Pattern A)

But this is only part of the story. Other situations/patterns/conditions also coincide with rain:

Pattern	Cloudy	Humid	Rained = Yes (1)	Rained = No (0)	Occurrence/frequency of condition patterns	P(Rained=1|C,H)	P(Rained=0|C,H)
A	1	1	4	1	5	0.8 (4/5)	0.2 (1/5)
B	1	0	0	2	2	0.00 (0/2)	1.0 (2/2)
C	0	1	1	2	3	0.33 (1/3)	0.67 (2/3)
D	0	0	0	2	2	0.00 (0/2)	1.0 (2/2)

For example, there is a 33% probability of rain when it is humid, but not cloudy. The other combinations (situations) show there is a 0% probability of rain in those situations (again only based on historical data).

To clarify what you are seeing in the above table, the last two columns show ratios. The second-to-last column divides the number of times it rained (Rain=1) under a particular pattern of cloudy and humid conditions by the number of times that pattern occurred in all the historical data (whether it rained or not). The very last column does the same for the negative case, i.e. the times the pattern occurred and it did not rain. So we can now say we have a full idea of which situations coincide with rain. This is called a Conditional Probability Table (CPT), and the last two columns are conditional probabilities: the probability of rain (or no rain), given the pattern of conditions.
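The counting behind the CPT can be sketched in a few lines of Python. This is only an illustration: the `observations` list re-encodes the 12 rows of the observation table as (cloudy, humid, rain) flags, and the name `build_cpt` is my own, not something from a library.

```python
from collections import defaultdict

# The 12 observations from the table, encoded as (cloudy, humid, rain) flags.
observations = [
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (0, 1, 1),
    (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0),
    (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0),
]

def build_cpt(rows):
    """Return {(cloudy, humid): P(Rain=1 | cloudy, humid)} from raw counts."""
    rained = defaultdict(int)  # times it rained under each pattern
    seen = defaultdict(int)    # times each pattern occurred at all
    for cloudy, humid, rain in rows:
        seen[(cloudy, humid)] += 1
        rained[(cloudy, humid)] += rain
    return {pattern: rained[pattern] / seen[pattern] for pattern in seen}

cpt = build_cpt(observations)
for pattern in sorted(cpt, reverse=True):
    # P(Rain=0 | pattern) is simply the complement of P(Rain=1 | pattern).
    print(pattern, round(cpt[pattern], 2), round(1 - cpt[pattern], 2))
```

Each key is a (Cloudy, Humid) pattern, so `cpt[(1, 1)]` corresponds to Pattern A and `cpt[(0, 1)]` to Pattern C.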

The idea is that we now know the historical probability of rain for every combination of humid and cloudy conditions. This is all based on historical data, however, and doesn't say anything about what will happen today. But if we know, with some degree of probability, whether it is cloudy or humid today, we can combine those probabilities with the CPT, whose entries describe what happened historically when each condition was known for certain to be present (1) or absent (0).

For example, if we know the probability of cloud and humid conditions today:

P(Cloudy=1)	P(Cloudy=0)	P(Humid=1)	P(Humid=0)
0.6	0.4	0.5	0.5

What this is telling us is that we know that there is a 60% chance that it IS cloudy today, 40% chance that it isn't, and also that there is a 50% chance it is humid. So with today's values, we can infer the probability that it rains today by marginalising over the conditions (H,C):

P(Rain=1) = Σ P(Rain=1|C,H) x P(C) x P(H), summed over every combination of C and H, which can be laid out in a table like this:

C	H	P(Rain=1|C,H)	P(C)	P(H)	Contribution	Note
1	1	0.8	0.6	0.5	0.8 x 0.6 x 0.5 = 0.24	0.24 from the case where it is cloudy and humid today
1	0	0	0.6	0.5	0 x 0.6 x 0.5 = 0	0 from the case where it is cloudy but not humid today
0	1	0.33	0.4	0.5	0.33 x 0.4 x 0.5 = 0.066	0.066 from the case where it is humid but not cloudy today
0	0	0	0.4	0.5	0 x 0.4 x 0.5 = 0	0 from the case where it is neither humid nor cloudy today
Sum total:					0.306

The sum of 0.306 says that of all the possible outcomes (scenario combinations), there's a 30.6% chance of rain today, given the possible probabilities for cloud and humid today!

P(Rain=1) = 30.6%
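As a rough sketch, the same marginalisation can be written in Python. The `cpt` values are the conditional probabilities derived from the 12 observations, and `p_rain` is a name I've made up for illustration. Multiplying P(C) by P(H) treats Cloudy and Humid as independent of each other, which is the same assumption the table makes. Because the code uses the exact 1/3 rather than the rounded 0.33, it gives roughly 0.307 instead of 0.306.

```python
from itertools import product

# P(Rain=1 | Cloudy, Humid), as derived from the 12 historical observations.
cpt = {(1, 1): 4 / 5, (1, 0): 0.0, (0, 1): 1 / 3, (0, 0): 0.0}

def p_rain(cpt, p_cloudy, p_humid):
    """Marginalise over Cloudy and Humid, treating them as independent."""
    total = 0.0
    for c, h in product((1, 0), repeat=2):
        p_c = p_cloudy if c else 1 - p_cloudy  # P(Cloudy=c) today
        p_h = p_humid if h else 1 - p_humid    # P(Humid=h) today
        total += cpt[(c, h)] * p_c * p_h       # one row of the table
    return total

print(round(p_rain(cpt, p_cloudy=0.6, p_humid=0.5), 3))
```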

We can also make the above inference more accurate, if we know for certain something about the condition being true instead of using a probability such as 0.6 for cloudy:

If you knew that it was Cloudy today, i.e P(C=1) = 1, i.e you have known evidence (not a probability) then you can be more sure about the probability of rain:

P(R=1|C=1) = Σ P(R=1|C=1,H) x P(H), summed over H:

0.8 [P(R=1|C=1, H=1)] x 0.5 [P(H=1)]
0 [P(R=1|C=1,H=0)] x 0.5 [P(H=0)]

Into a table:

0.8 [P(R=1|C=1,H=1)] x 0.5 [P(H=1)]	0 [P(R=1|C=1,H=0)] x 0.5 [P(H=0)]
0.4	0
Total (0.4 + 0) = 0.4

Meaning that if we know that it is cloudy, we can say that there is a larger chance of rain today, 40% instead of 30.6%, given the historical data.
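Fixing a condition as known evidence can be sketched like this. The function name `p_rain_given_cloudy` is hypothetical, and the `cpt` values are again the ones derived from the 12 observations. Only Humid is summed over, because Cloudy is no longer uncertain.

```python
# P(Rain=1 | Cloudy, Humid), as derived from the 12 historical observations.
cpt = {(1, 1): 4 / 5, (1, 0): 0.0, (0, 1): 1 / 3, (0, 0): 0.0}

def p_rain_given_cloudy(cpt, cloudy, p_humid):
    """P(Rain=1 | Cloudy=cloudy), marginalising over Humid only."""
    return sum(
        cpt[(cloudy, h)] * (p_humid if h else 1 - p_humid)
        for h in (1, 0)
    )

known_cloudy = p_rain_given_cloudy(cpt, cloudy=1, p_humid=0.5)
known_clear = p_rain_given_cloudy(cpt, cloudy=0, p_humid=0.5)
print(round(known_cloudy, 3), round(known_clear, 3))
```

With the same 50% chance of humidity, knowing it is cloudy gives 0.4, while knowing it is not cloudy gives about 0.167 (0.33 x 0.5).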

So in summary, we can consolidate this process into fixed steps:

1. Work out the conditional probability of rain using past observations of rain and its conditions (e.g. humid and cloudy).
2. To infer the probability of rain today, use the conditional probability of rain and use today's conditions to scale the conditional probability for rain for today's values, and sum up all possible conditional probabilities of rain to obtain today's probability of rain. This is also known as marginalising over today's conditions.
3. If you know what a condition is today, e.g. it's cloudy today, use its value in looking up the conditional probability in the conditional probability table (CPT) and marginalise over the remaining conditions.

Details: Category: Blog; By Stuart Mathews; 23.Jul; 16 June 2025; Last Updated: 30 August 2025; Hits: 1121

Introduction

As part of my academic research endeavours, I'm undertaking to train myself to analyse research papers with a more methodical and critical eye.

The particular paper reviewed in this article is "A Fast Learning Algorithm for Deep Belief Nets" by Hinton et al. and is part of a larger survey entitled "A survey of deep neural network architectures and their applications" by Liu, W. et al.

The approach I've used to structure my review process is outlined in Research Review Process.

Research question

How can the performance of neural networks with many hidden layers be improved?

Research aim

This is descriptive and explanatory research.

The research aims to describe how a new learning algorithm (greedy layer-by-layer algorithm) works to solve the problem of 'explaining away' that occurs in DNNs (specifically stacked RBMs with many hidden layers), and that when coupled with a fine-tuning algorithm (top-down), this vastly improves the performance of this type of deep neural network.

Type of research

This is primarily quantitative/empirical research.

This research describes a new pre-training algorithm, how it works and why. It then models it using mathematical concepts, followed by quantitatively testing the design and implementation of the algorithm using a neural network to measure the algorithm's effectiveness at improving the neural network's performance (learning/inference).

For example, through experimentation using the new learning process (algorithm) and a multi-layer neural network using the standard MNIST dataset, a quantitative result of 1.25% error discrepancy was recorded using this dataset. Similarly, other experiments, e.g., using SVM, were conducted to indicate a comparative error discrepancy using the same dataset.

Mode of enquiry

This is scientific research based on a systematic research procedure and empirical testing. It uses an approach that focuses on specifying and applying well-defined algorithms to a neural network, and working with a common, unvarying dataset, i.e MNIST. Furthermore, it follows a repeatable design where experimentation is used to produce observable outcomes that indicate the effectiveness of the algorithm while other variables remain constant, i.e, underlying dataset and neural network.

For example, the learning and fine-tuning algorithms (greedy layer-by-layer & up-down, respectively) are by definition repeatable, the MNIST data set is unvarying, and the neural network is a fixed configuration/design. Furthermore, the work also uses an SVM to compare results with that of the neural network. The learning process is also described as a repeatable process, e.g the application of the Gibbs sampling process that underlies the fine-tuning algorithms described in the paper. These are all hallmarks of a well-defined, repeatable scientific approach.

Methodology

The research is primarily empirical in nature, utilising experimental outcomes to inform observation. It uses a measurable design-science-based approach that uses a well-defined neural network configuration and the application of algorithms which help define and evaluate a repeatable learning process, i.e it designs, develops and evaluates the performance of the research model.

For example, the theory of how 'explaining away' reduces the performance of densely connected networks is explained and then modelled mathematically before the algorithms are described. The algorithms are then tested experimentally by applying them to a neural network acting on the test data.

The main research objectives are:

Show how using 'complementary priors' removes 'explaining away', using theoretical explanation
Derive a new unsupervised learning algorithm (greedy layer-by-layer) that uses complementary priors
Describe a hybrid neural network that uses associative memory and 3 hidden layers
Use the unsupervised learning algorithm to pre-train a neural network to test/prove the algorithm's effectiveness (fast and accurate)
Show how to determine what the model has learnt by generating an image based on the learnt weights

Research Methods

Mathematical modeling
Algorithm design
Algorithm implementation (greedy-layer-by-layer)
Classification experiments using neural networks

The primary research method is using experiments to measure/evaluate the performance of the neural network when it is pre-trained using the greedy-layer-by-layer algorithm.

The problem of explaining away is modeled mathematically, a new algorithm is developed to prevent it, the research then uses experiments that use an existing character digit dataset (MNIST) as input to a pre-trained neural network (using the pre-training algorithm defined in this paper), and the results are then evaluated, specifically how well the data is mapped to the character symbols/labels. Other experimental procedures are also carried out to see how the pre-trained model performs in comparison to other models that do not use the algorithm (most notably, SVM).

Research techniques

Contrastive divergence algorithm
Gibbs sampling (a Markov chain Monte Carlo method)
Greedy layer-by-layer algorithm (complementary priors)
DBN (Deep Belief Network)
Various learning algorithms (Backpropagation, SVMs, squared error and online updates, LeNet5 CNN, cross entropy, etc.)
Generation of an image using a learnt model

The main technique is experimentation using a neural network with 3 hidden layers and using the developed pre-training algorithm to test how well it removes the 'explaining away' in order to improve the network's inference performance. The MNIST data set is used as training data for the network.

The algorithm that is developed is applied to a DNN model (DBN), and it is then tested by experimenting on the model to see the performance that results. This results in an error rate of 1.25%, in comparison to the closest rival, which is SVM at 1.4%.

Data

MNIST dataset: grey-scale 28x28 images of handwritten digits (60,000 training images and 10,000 test images), used to train and test the neural network (pre-trained with a new algorithm)

The input dataset for the model testing is the MNIST dataset of character digits, which is a repository of 2D images that are well-known and used by researchers for the classification of character digits. The output of the model is numerical data that indicates/predicts the classification of the input data belonging to specific classes of digit characters.

Information

The MNIST data is processed using the neural network, resulting in output from the neural network (model).

The results from the neural network show that using a pre-training algorithm that configures/trains each layer using complementary priors improves the performance of DBNs, i.e. reduces the error discrepancy in predicted vs actual outputs.

Knowledge

The phenomenon of 'explaining away' that occurs in DNNs (of stacked RBMs) restricts their performance.

Using complementary priors to configure/train each layer to establish initial weights removes 'explaining away' and results in a better-performing neural network.

Correlation vs Causation

Variables:

Learning approach (with or without greedy layer-by-layer algorithm)
MNIST dataset
Neural network configuration

Various alternative learning algorithms are compared with the research's approach (which uses the greedy layer-by-layer pre-training algorithm). The same dataset (MNIST) is used throughout, so only the approach to learning changes. Each model's error rate is evaluated to see which model produces the lowest value. The neural network configuration is unchanged.

Literature review

Referenced papers

See reference chronology here

This paper was published in 2006.

Citations

IEEE reports 3888 citations while ACM reports 3280 citations.

IEEE: https://ieeexplore.ieee.org/document/6796673
ACM: https://dl.acm.org/doi/10.1162/neco.2006.18.7.1527

This suggests that this is a very popular piece of research.

Reasoning method (deduction)

The research works from a theoretical description of the 'explaining away' phenomenon, models it mathematically and from this basis, derives a learning process that incorporates the design of a new algorithm and applies it to a neural network. The neural network is used to test the pre-training algorithm and to see if it indeed improves the learning/mapping process as proposed. This is a deductive process:

See this paper's deduction process here

Subjectivity/Objectivity

Construct Validity

No obvious flaws

Internal Validity

Research Correctness

Objectivity

1. The same constant data is used as was used by others (MNIST)
2. The research uses an objective measure of performance (inference error) on the dataset, which shows that its error rate is better with this model than with previous models.

Subjectivity/Specificity

No obvious subjectivity

Research technique

Objectivity:

Algorithms used in this research are repeatable and inherently automatable. This means all parts of the process, i.e, data, model, and algorithm, are non-varying in nature and therefore can be replicated/verified by third parties.

The comparison of alternate models' performance on the MNIST dataset is suitable for evaluating how the pre-trained model's performance compares to those models that do not use it. The research techniques fit the requirements of this research.

Subjectivity/Specificity:

The research techniques are only applied to 2D image data (pixels)
Only a 3-layer neural network is used

Research techniques vs research question

Objectivity:

Varying the application of the learning algorithm while keeping other parameters constant, i.e. the common data (MNIST) and the design of the neural network, means that it is simple to evaluate the effect of varying a single variable, i.e. the application of the algorithm, making the algorithm the only independent variable.

This supports the research question as the neural networks' output/performance (error function) of the classification task directly indicates if the neural network worked better than other results from other models that had not used the pre-training algorithm.

Subjectivity/Specificity:

The performance is only measured using data that reflects 2D character images. Larger images or more complex images are not assessed.
Only a specific configuration/design of the DBN (3-layer) is used to remove the effects of 'explaining away'

Conclusion vs methods

Objectivity:

Using experimental results based on empirical testing, observation, and comparison supports the conclusion that the research's specific approach is better than the other approaches that were tested.

Subjectivity/Specificity:

Experiments were based on only 2D image data (pixels) so the conclusions can only be representative of character-based image data

External Validity

Subjectivity/Specificity:

There is no evidence presented that this approach has or will generalise well to wider applications (beyond showing 2D character inference optimisation by eliminating 'explaining away')

Data Validity

Data subjectivity (specificity/narrowness)

Objectivity:

Image data for a neural network classification task is appropriate for evaluating the learning of a neural network for the classification of this data against known classification labels.

The research data is also a well-known dataset that is often used for testing classification performance in models, and so it is appropriate for this type of research.

Subjectivity/Specificity:

Only the MNIST dataset is used, so the data used to represent the solution presented in this research is limited.
This limits the research's outcomes and approaches to dealing with small geometric character recognition.

Data vs Research Question

Objectivity:

Varying the application of the learning algorithm while keeping other parameters constant, i.e. the common data (MNIST) and the design of the neural network, means that it is simple to evaluate the effect of varying a single variable, i.e. the application of the algorithm, making the algorithm the only independent variable.

This supports the research question as the neural networks' output/performance (error function) of the classification task directly indicates if the neural network worked better than other results from other models that had not used the pre-training algorithm.

Subjectivity/Specificity:

Only 2D pixel data of character digits is used to show how the techniques in the research improve inference performance.

Summary of general risks to validity

The paper is very technical
1. It relies on an understanding of many different ideas and processes that draw deeply on existing knowledge.
2. Inexperienced researchers may find it difficult to assess construct and internal validity without being well acquainted with the theory, algorithms and approaches discussed.

Credibility concerns

Objectivity:

There are gaps in the referenced papers; however, as this paper tests a new algorithm using experimentation and comparison with other approaches, the literature is less influential. In this respect, the literature is relatively objective.

Subjectivity:

The research is 19 years old (as of 2025), and the techniques here could be outdated or have been improved by subsequent research, possibly superseding this research.

Relevance, Contribution, Originality and Novelty

Implications & Contributions

A key aspect is that Hinton et al. have identified and understood exactly what the problem of explaining away is, and so were able to create an algorithm to circumvent it.

The improvements to the performance/learning of neural networks as a result of this new algorithm improve the performance of all DBNs, and therefore have wide applicability to all domains that use DBNs. The results of the paper are very generalizable.

Another particularly interesting aspect is that the paper shows a way to determine what the model learnt by generating an image based on the learnt weights to 'see' what and how it learned the dataset.

Opinion

This research takes an objectivist approach, which aims to uncover hidden truths (such as the benefit of using complementary priors and pre-training), favouring empirical testing, a systematic approach and an investigation into cause and effect.

Details: Category: Blog; By Stuart Mathews; 23.Jul; 23 July 2025; Last Updated: 04 August 2025; Hits: 843

I've recently started thinking about how to simulate/model reinforcement learning and how it is implemented. I knew what reinforcement learning was because I knew it described the learning that took place when Ivan Pavlov conducted his famous experiments on conditioning behaviour/learning in dogs. This however, of course, is just the theory and is different to actually implementing it as an algorithm to model learning in a computer.

I read a paper by Mnih et al. on how reinforcement learning was implemented in an agent to learn how to beat a set of Atari games better and faster than human experts could. This was done by determining the best set of actions to take based on how those actions benefited the agent's progress in the game. This is similar to how the dogs in Pavlov's experiment learnt that ringing the bell was a favourable action as it resulted in food. In the game, however, not losing health was the reward, so actions that kept the agent's health from going down were reinforced as favourable things to do, and those are the actions it learnt to take.

That specific paper discussed using a concept called Q-Learning, which is an implementation of reinforcement learning as an algorithm, and combining it with CNNs (Convolutional neural networks) to help it while it's doing its reinforcement learning and, as a result, improve the reinforcement learning considerably. Incidentally, that approach is called Deep Q-Networks (DQN), and it's the first time I've seen a deep learning technology (CNNs) used in combination with a reinforcement learning technology (Q-learning). The neural network (CNN) that is used with the Q-learning reinforcement algorithm is called a Q-Network.

That approach clearly reveals the underlying mechanism of most reinforcement algorithms, i.e. that through experience (training via trial and error), the value of each action taken needs to be evaluated by the effect it causes, and if that effect is favourable, then the action is favoured (or reinforced) in the future. This way, when the same situations present themselves again, the actions that led to favourable outcomes are automatically taken; in effect, they have been learnt.
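To make the reinforcement idea concrete, here is a minimal sketch of the tabular Q-learning update in Python. It is not the DQN from the Mnih et al. paper: there is no neural network, and the corridor environment, the reward of 1 at the goal, and the values of `ALPHA` and `GAMMA` are all assumptions I've made up for illustration. To keep the result repeatable, it also sweeps over every state and action instead of exploring by random trial and error.

```python
# Hypothetical toy environment: a corridor of 4 cells; reaching cell 3 pays a reward of 1.
N_STATES, GOAL = 4, 3
ACTIONS = (-1, +1)        # step left, step right
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor (assumed values)

def step(state, action):
    """Move within the corridor and return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

def train(sweeps=50):
    """Apply the Q-learning update to every (state, action) pair, repeatedly."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):  # the goal cell is terminal, so it is never updated
            for a in ACTIONS:
                s2, reward = step(s, a)
                best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
                # Nudge Q(s, a) towards the reward plus the discounted best future value.
                q[(s, a)] += ALPHA * (reward + GAMMA * best_next - q[(s, a)])
    return q

q = train()
# The learnt policy picks, in each cell, the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

Actions that lead towards the reward end up with higher Q-values, so the learnt policy steps right in every cell. That is the "favour the action that had a favourable effect" idea in its simplest form.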

Interesting.

How Bayesian Networks learn

Reviewing A Fast Learning Algorithm for Deep Belief Nets

Thoughts on Reinforcement learning