Introduction
Let’s solve OpenAI’s CartPole, Lunar Lander, and Pong environments with the REINFORCE algorithm.
Reinforcement learning is arguably the coolest branch of artificial intelligence. It has already proven its prowess, stunning the world by beating world champions at Chess, Go, and even Dota 2.
In this article, I will walk through a fairly rudimentary algorithm and show how even this can achieve a superhuman level of performance in certain games.
Reinforcement learning deals with designing “agents” that interact with an “environment” and learn by themselves how to “solve” the environment through systematic trial and error. An environment could be a game like chess or racing, or it could even be a task like solving a maze or achieving an objective. The agent is the bot that performs the activity.
An agent receives “rewards” by interacting with the environment, and it learns to perform the “actions” required to maximize the reward it receives. An environment is considered solved if the agent’s accumulated reward crosses some predefined threshold. This nerd talk is how we teach bots to play superhuman chess or bipedal androids to walk.
REINFORCE Algorithm
REINFORCE belongs to a special class of Reinforcement Learning algorithms called Policy Gradient algorithms. A simple implementation of this algorithm would involve creating a Policy: a model that takes a state as input and generates the probability of taking an action as output. A policy is essentially a guide or cheat-sheet for the agent telling it what action to take at each state. The policy is then iterated on and tweaked slightly at each step until we get a policy that solves the environment.
The policy is usually a neural network that takes the state as input and generates a probability distribution across the action space as output.
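As a rough sketch (the architecture and layer sizes here are illustrative, not the exact ones from my repo), such a policy network in PyTorch might look like this:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an environment state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # probabilities over the action space
        )

    def forward(self, state):
        return self.net(state)
```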
The objective of the policy is to maximize the “Expected reward”.
Each policy generates the probability of taking an action in each state of the environment.
The agent samples from these probabilities and selects an action to perform in the environment. At the end of an episode, we know the total reward the agent can get if it follows that policy. We then propagate the reward backwards along the path the agent took to estimate the “expected reward” at each state for the given policy.
Here the discounted reward is the sum of all the rewards the agent receives in the future, discounted by a factor gamma.
The discounted reward at any step is the reward the agent receives at the next step plus the discounted sum of all the rewards it receives after that.
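In symbols (writing G_t for the discounted reward at step t, r_t for the reward, and \gamma for the discount factor), this is:

G_t = r_{t+1} + \gamma G_{t+1} = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}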
Given these discounted rewards, this is how we calculate the expected reward: as per the original REINFORCE algorithm, the expected reward is the sum of the products of the log of the action probabilities and the discounted rewards.
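Written out, with \pi_\theta(a_t \mid s_t) denoting the probability the policy (with weights \theta) assigns to the action taken at step t, the quantity being maximized is:

J(\theta) = \sum_{t} \log \pi_\theta(a_t \mid s_t) \, G_t

In practice we perform gradient ascent on J(\theta), or equivalently gradient descent on its negative.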
Algorithm steps
The steps involved in the implementation of REINFORCE would be as follows:
1. Initialize a random policy (a NN that takes the state as input and returns the probability of each action)
2. Use the policy to play N steps of the game, recording the action probabilities (from the policy), the rewards (from the environment), and the actions (sampled by the agent)
3. Calculate the discounted reward for each step by working backwards through the episode
4. Calculate the expected reward G
5. Adjust the weights of the policy (backpropagate the error in the NN) to increase G
6. Repeat from step 2
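Putting these steps together, a minimal REINFORCE loop in PyTorch could look like the sketch below. It reuses the PolicyNetwork sketched earlier; the environment ID, hyperparameters, and the classic Gym reset/step API are assumptions for illustration, not the exact setup from my repo.

```python
import gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")                       # assumed environment
policy = PolicyNetwork(state_dim=4, n_actions=2)    # step 1: randomly initialized policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(2000):
    state = env.reset()
    log_probs, rewards = [], []
    done = False

    # Step 2: play an episode, recording log-probabilities and rewards
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()                      # action sampled by the agent
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Step 3: discounted reward for each step, computed backwards through the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Steps 4-5: expected reward = sum(log-prob * discounted reward);
    # minimizing its negative adjusts the policy weights to increase it
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```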
Check out the full implementation using PyTorch on my GitHub.
Demos
I have tested the algorithm on Pong, CartPole, and Lunar Lander. It takes forever to train on Pong and Lunar Lander: over 96 hours of training each on a cloud GPU. There are several improvements to this algorithm that can make it converge faster, which I haven’t discussed or implemented here. Check out Actor-Critic models and Proximal Policy Optimization if you’re interested in learning further.
CartPole
State:
Horizontal Position, Horizontal Velocity, Angle of the pole, Angular Velocity
Actions:
Push cart left, Push cart right
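For reference, a random policy simply samples one of these two actions at every step. A minimal sketch (assuming the Gym CartPole-v1 environment and the classic Gym API):

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()   # random policy: 0 = push left, 1 = push right
    state, reward, done, _ = env.step(action)
    total_reward += reward

print(f"Episode reward with a random policy: {total_reward}")
```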
Random Policy Play:
Agent Policy Trained with REINFORCE:
Lunar Lander
State:
The state is an array of 8 values: the lander’s horizontal and vertical position, its horizontal and vertical velocity, its angle and angular velocity, and two flags indicating whether each leg is in contact with the ground.
Actions:
0: Do nothing
1: Fire Left Engine
2: Fire Down Engine
3: Fire Right Engine
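A quick way to confirm this is to inspect the environment’s spaces directly (assuming the Gym LunarLander-v2 environment ID):

```python
import gym

env = gym.make("LunarLander-v2")
print(env.observation_space)  # Box with 8 dimensions: position, velocity, angle, angular velocity, leg contacts
print(env.action_space)       # Discrete(4): the four actions listed above
```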
Agent Policy Trained with REINFORCE:
Pong
This was much harder to train; it took days of training on a cloud GPU server.
State: Image
Actions: Move Paddle Left, Move Paddle Right
Reinforcement Learning has progressed leaps and bounds beyond REINFORCE. My goal in this article was to 1. learn the basics of reinforcement learning and 2. show how powerful even such simple methods can be in solving complex problems. I would love to try these on some money-making “games” like stock trading … guess that’s the holy grail among data scientists.
Github Repo: https://github.com/kvsnoufal/reinforce
Shoulders of giants:
- Policy Gradient Algorithms (https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html)
- Deriving REINFORCE (https://medium.com/@thechrisyoon/deriving-policy-gradients-and-implementing-reinforce-f887949bd63)
- Udacity’s reinforcement learning course (https://github.com/udacity/deep-reinforcement-learning)
About The Author
Noufal kvs
I work as a data scientist at Dubai Holding, UAE. You can reach out to me at [email protected] or https://www.linkedin.com/in/kvsnoufal/