I've recently started thinking about how to model reinforcement learning and how it is implemented. I knew what reinforcement learning was because it describes the kind of learning that took place when Ivan Pavlov conducted his famous conditioning experiments on dogs. That, of course, is just the theory, which is different from actually implementing it as an algorithm to model learning in a computer.

I read a paper by Mnih et al. on how reinforcement learning was implemented in an agent that learnt how to beat a set of Atari games better and faster than human experts could. This was done by determining the most optimal set of actions to take based on how those actions benefited the agent's progress in the game. It's similar to how the dogs in Pavlov's experiment learnt that the ringing of the bell was favourable because it predicted food; in the game, however, not losing health was the reward, so actions that resulted in the agent's health not going down were reinforced as favourable things to do, and those are the actions it learnt to take.

That specific paper discussed a concept called Q-learning, which is an implementation of reinforcement learning as an algorithm, and combined it with CNNs (convolutional neural networks) to help it while it's doing its reinforcement learning, improving the results considerably. That approach is called a Deep Q-Network (DQN), and it's the first time I've seen a deep learning technology (CNNs) used in combination with a reinforcement learning technology (Q-learning). The neural network (CNN) that is used with the Q-learning algorithm is called a Q-network.
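To make the Q-learning part concrete, here's a minimal sketch of the classic tabular update rule, which is the thing the paper's Q-network approximates with a CNN instead of a table. The function and variable names here are my own illustration, not from the paper.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One Q-learning step: nudge Q[state][action] toward the reward
    plus the discounted value of the best action in the next state.
    Q is a dict of dicts: Q[state][action] -> estimated value."""
    best_next = max(Q[next_state].values())
    target = reward + gamma * best_next
    Q[state][action] += alpha * (target - Q[state][action])
```

The key idea is that the value of an action isn't just its immediate reward but also the value of the best situation it leads to, which is how a single reward can propagate backwards through a long chain of earlier actions.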

A clear result of that approach reveals the underlying mechanism of most reinforcement learning algorithms: through experience (training via trial and error), the value of the actions taken is evaluated by the effect they cause, and if that effect is favourable, the action is favoured (reinforced) in the future. This way, when the same situations present themselves again, the actions that lead to favourable outcomes are automatically taken; in effect, the agent has learnt to take them.
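That trial-and-error loop can be shown on a toy problem of my own (nothing like the Atari setup): two possible actions, where only action 1 ever pays a reward. Over repeated trials the value estimate for the rewarding action grows, so the greedy choice shifts toward it, which is the "reinforcement" in miniature.

```python
import random

random.seed(0)
values = [0.0, 0.0]   # estimated value of each action
alpha = 0.1           # learning rate
epsilon = 0.2         # probability of exploring a random action

for _ in range(500):
    if random.random() < epsilon:
        action = random.randrange(2)                      # explore
    else:
        action = max(range(2), key=lambda a: values[a])   # exploit
    reward = 1.0 if action == 1 else 0.0                  # only action 1 pays
    values[action] += alpha * (reward - values[action])   # reinforce

# after training, the favourable action dominates the estimates
assert values[1] > values[0]
```

Without the occasional random exploration the agent could get stuck favouring action 0 forever, which is why trial and error (not just exploitation of what's already known) is essential to the learning.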

Interesting.