The Asynchronous Advantage Actor Critic (A3C) algorithm is one of the newer algorithms in the field of Deep Reinforcement Learning. It was developed by DeepMind, the Artificial Intelligence division of Google, and was first introduced in a 2016 research paper appropriately named Asynchronous Methods for Deep Reinforcement Learning. Decoding the different parts of the algorithm's name:
- Asynchronous: Unlike other popular Deep Reinforcement Learning algorithms such as Deep Q-Learning, which use a single agent and a single environment, this algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. These agents interact with their respective environments asynchronously, learning with each interaction. Each agent is coordinated by a global network: as an agent gains more knowledge, it contributes to the total knowledge of the global network, and the global network in turn gives every agent access to more diversified training experience. This setup mimics the real-life environment in which humans live, where each person gains knowledge from the experiences of others, allowing the whole "global network" to become better.
- Actor-Critic: Unlike simpler techniques that are based on either value-iteration methods or policy-gradient methods alone, the A3C algorithm combines the best parts of both, i.e. it predicts both the value function V(s) and the optimal policy function π(s). The learning agent uses the output of the value function (the Critic) to update the optimal policy function (the Actor). Note that here the policy function means a probability distribution over the action space. To be exact, the learning agent determines the conditional probability π(a|s; θ), i.e. the parameterized probability that the agent chooses the action a when in state s.
- Advantage: Typically, an implementation of policy gradient uses the discounted return (R) to tell the agent which of its actions were rewarding and which ones were penalized. By using the advantage instead, the agent also learns how much better the rewards were than its expectation. This gives the agent a new-found insight into the environment, and thus the learning process is better. The advantage metric is given by the following expression:

Advantage: A = Q(s, a) – V(s)

The following pseudo-code is adapted from the research paper mentioned above.
```
// Define global shared parameter vectors θ and θv, and global shared counter T = 0
// Define thread-specific parameter vectors θ' and θ'v, and thread step counter t = 1

while (T <= Tmax) {
    Reset gradients: dθ = 0, dθv = 0
    Synchronize thread-specific parameters: θ' = θ, θ'v = θv
    tstart = t
    Get state st
    while (st is not terminal and t - tstart < tmax) {
        Simulate action at according to policy π(at | st; θ')
        Receive reward rt and next state st+1
        t++
        T++
    }
    if (st is terminal) {
        R = 0
    } else {
        R = V(st; θ'v)
    }
    for (i = t-1; i >= tstart; i--) {
        R = ri + γ * R
        Accumulate policy gradient: dθ = dθ + ∇θ' log π(ai | si; θ') * (R - V(si; θ'v))
        Accumulate value gradient:  dθv = dθv + ∂(R - V(si; θ'v))² / ∂θ'v
    }
    Perform asynchronous update of θ using dθ and of θv using dθv
}
```
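To make the update concrete, here is a minimal sketch of one worker's update step written in PyTorch under assumed names and hyperparameters (ActorCritic, worker_update, a single 128-unit hidden layer, gamma = 0.99). It is an illustration of the idea rather than the reference implementation from the paper, and it assumes the optimizer is constructed over the global model's parameters and shared by all workers.

```python
# Illustrative sketch of one A3C worker update; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Shared trunk with a policy (actor) head and a value (critic) head."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for pi(a|s; theta)
        self.value_head = nn.Linear(hidden, 1)           # V(s; theta_v)

    def forward(self, s):
        h = self.body(s)
        return self.policy_head(h), self.value_head(h)

def worker_update(global_model, local_model, optimizer, rollout, gamma=0.99):
    """One asynchronous update: local rollout -> n-step returns -> gradients -> global params."""
    states, actions, rewards, last_state, done = rollout

    # Bootstrap R with V(s_t) if the episode did not terminate, as in the pseudo-code above.
    with torch.no_grad():
        R = 0.0 if done else local_model(last_state.unsqueeze(0))[1].item()

    # Compute n-step discounted returns, iterating backwards: R = r_i + gamma * R
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    logits, values = local_model(torch.stack(states))
    values = values.squeeze(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    actions = torch.as_tensor(actions)
    chosen = log_probs[torch.arange(len(actions)), actions]

    advantage = returns - values                          # A = R - V(s), the n-step advantage estimate
    policy_loss = -(chosen * advantage.detach()).mean()   # actor: log-probability weighted by advantage
    value_loss = advantage.pow(2).mean()                  # critic: regression towards the n-step return

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    # Copy the locally computed gradients into the shared global model, then step.
    for lp, gp in zip(local_model.parameters(), global_model.parameters()):
        gp.grad = lp.grad
    optimizer.step()
    # Re-synchronize the worker with the updated global parameters.
    local_model.load_state_dict(global_model.state_dict())
```

In the full asynchronous setup, several such workers would run in parallel (for example as separate processes or threads), each with its own copy of the environment, repeatedly collecting a short rollout and calling worker_update so that their gradients asynchronously update the shared global network.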
Where,
- Tmax – maximum number of iterations
- dθ, dθv – accumulated changes in the global parameter vectors
- R – total (discounted) reward
- π – policy function
- V – value function
- γ – discount factor

Advantages:
- This algorithm is faster and more robust than the standard reinforcement learning algorithms.
- It performs better than other reinforcement learning techniques because of the diversification of knowledge explained above.
- It can be used on discrete as well as continuous action spaces.
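To illustrate the last point, below is a rough sketch of how only the policy head needs to change between discrete and continuous action spaces. It assumes PyTorch and its torch.distributions module, and the class names (DiscretePolicy, GaussianPolicy) are purely illustrative.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Policy head for a finite action set: a categorical distribution over logits."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.logits = nn.Linear(obs_dim, n_actions)

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.logits(s))

class GaussianPolicy(nn.Module):
    """Policy head for a continuous action vector: a diagonal Gaussian."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s):
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

# Both heads expose sample() and log_prob(), so the same advantage-weighted
# log-probability loss from the earlier sketch applies unchanged.
dist = DiscretePolicy(obs_dim=4, n_actions=2)(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```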