The temporal difference (TD) learning algorithm was introduced by Richard S. Sutton in 1988. TD learning became popular because it combines the advantages of dynamic programming (DP) and the Monte Carlo (MC) method. But what are those advantages?
This article is an excerpt from the book Deep Reinforcement Learning with Python, Second Edition by Sudharsan Ravichandiran – a comprehensive guide for beginners to become proficient in implementing state-of-the-art RL and deep RL algorithms.
Let’s quickly recap the advantages and disadvantages of DP and the MC method.
Dynamic programming: the advantage of the DP method is that it uses the Bellman equation to compute the value of a state. According to the Bellman equation, the value of a state can be obtained as the sum of the immediate reward and the discounted value of the next state. This is called bootstrapping: to compute the value of a state, we don’t have to wait until the end of the episode; instead, using the Bellman equation, we can estimate it just from the value of the next state.
Remember how we estimated the value function in DP methods (value and policy iteration)? For a given policy π, we estimated the value function (the value of a state) as:

V(s) = Σ_a π(a|s) Σ_s′ P(s′|s, a)[R(s, a, s′) + γV(s′)]

As you may recollect, in order to find the value of a state, we didn’t have to wait till the end of the episode. Instead, we bootstrap; that is, we estimate the value of the current state V(s) using the value of the next state V(s′).
However, the disadvantage of DP is that we can apply the DP method only when we know the model dynamics of the environment. That is, DP is a model-based method and we should know the transition probability in order to use it. When we don’t know the model dynamics of the environment, we cannot apply the DP method.
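To make the bootstrapping idea concrete, here is a minimal sketch of DP-style policy evaluation, assuming we already know the model dynamics. The arrays P[s, a, s′] (transition probabilities), R[s, a, s′] (rewards), and the policy array are hypothetical inputs used only for illustration; this is not the book’s implementation.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation using the Bellman equation.

    P[s, a, s_next]: known transition probabilities (model dynamics)
    R[s, a, s_next]: known rewards
    policy[s, a]: probability of taking action a in state s
    """
    num_states, num_actions = P.shape[0], P.shape[1]
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            # Bootstrapping: immediate reward + discounted value of the next
            # state, weighted by the policy and the transition probabilities.
            v_new = sum(
                policy[s, a] * P[s, a, s_next] * (R[s, a, s_next] + gamma * V[s_next])
                for a in range(num_actions)
                for s_next in range(num_states)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop when the value estimates have converged
            break
    return V
```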
Monte Carlo method: the advantage of the MC method is that it is a model-free method, which means that it does not require the model dynamics of the environment to be known in order to estimate the value and Q functions.
However, the disadvantage of the MC method is that in order to estimate the state value or Q value we need to wait until the end of the episode, and if the episode is long then it will cost us a lot of time. Also, we cannot apply MC methods to continuous tasks (non-episodic tasks).
Now, let’s get back to temporal difference learning. The TD learning algorithm combines the benefits of the DP and MC methods. Just like DP, it performs bootstrapping, so we don’t have to wait until the end of an episode to compute the state value or Q value; and just like the MC method, it is model-free, so it does not require the model dynamics of the environment to compute the state value or Q value. Now that we have the basic idea behind the TD learning algorithm, let’s get into the details and learn exactly how it works.
We can use the TD learning algorithm for both the prediction and control tasks, and so we can categorize TD learning into:
- TD prediction
- TD control
We learned what the prediction and control methods mean in the previous chapter. Let’s recap that a bit before going forward.
In the prediction method, a policy is given as an input and we try to predict the value function or Q function using the given policy. If we predict the value function using the given policy, then we can say how good it is for the agent to be in each state if it uses the given policy. That is, we can say what return an agent can expect in each state if it acts according to the given policy.
In the control method, we are not given a policy as input, and the goal in the control method is to find the optimal policy. So, we initialize a random policy and then we try to find the optimal policy iteratively. That is, we try to find an optimal policy that gives us the maximum return.
First, let’s see how to use TD learning to perform the prediction task, and then we will learn how to use TD learning for the control task.
Temporal Difference Learning Prediction
In the TD prediction method, the policy is given as input and we try to estimate the value function using the given policy. TD learning bootstraps like DP, so it does not have to wait till the end of the episode, and like the MC method, it does not require the model dynamics of the environment to compute the value function or the Q function. Now, let’s see how the update rule of TD learning is designed, taking the preceding advantages into account.
In the MC method, we estimate the value of a state by taking its return:
V(s) ≈ R(s)
However, a single return value cannot approximate the value of a state perfectly. So, we generate N episodes and compute the value of a state as the average return of that state across the N episodes:

V(s) ≈ (1/N) Σ_i R_i(s)

But with the MC method, we need to wait until the end of the episode to compute the value of a state, and when the episode is long, that takes a lot of time. One more problem with the MC method is that we cannot apply it to non-episodic tasks (continuous tasks).
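As a rough illustration, the MC averaging step could be written as the following sketch. It assumes we have already collected episodes as lists of (state, reward) pairs; the data format and function name are hypothetical, not the book’s exact code.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """Every-visit Monte Carlo: value of a state = average return across episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backward: the return can only be computed
        # once the episode has fully ended.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```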
So, in TD learning, we make use of bootstrapping and estimate the value of a state as:
V(s) ≈ r + γV(s′)
The preceding equation tells us that we can estimate the value of the state by only taking the immediate reward r and the discounted value of the next state γV(s′).
As you may observe from the preceding equation, similar to what we learned in DP methods (value and policy iteration), we perform bootstrapping but here we don’t need to know the model dynamics.
Thus, using temporal difference learning, the value of a state is approximated as:
V(s) ≈ r + γV(s′)
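For example, with made-up numbers r = 1, γ = 0.9, and V(s′) = 2, the bootstrap estimate would be V(s) ≈ 1 + 0.9 × 2 = 2.8, and we obtain it without waiting for the episode to end.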
However, a single value of r + γV(s′) cannot approximate the value of a state perfectly. So, we take a mean, and instead of the arithmetic mean, we use the incremental mean.
In the MC method, we learned how to use the incremental mean to estimate the value of a state, and it is given as follows:
V(s) = V(s) + α(R – V(s))
Similarly, here in TD learning, we can use the incremental mean and estimate the value of the state, as shown here:
V(s) = V(s) + α(r + γV(s′) – V(s))
This equation is called the temporal difference learning update rule. As we can observe, the only difference between TD learning and the MC method is in how we compute the value of the state: the MC method uses the full return R, which requires the complete episode, whereas TD learning uses the bootstrap estimate r + γV(s′), so we don’t have to wait until the end of the episode to compute the value of the state. Thus, we can apply TD learning to non-episodic tasks as well.
Thus, our temporal difference learning update rule is:
V(s) = V(s) + α(r + γV(s′) – V(s))
We learned that r + γV(s′) is an estimate of the value of the state, V(s). So, we call r + γV(s′) the TD target. Subtracting V(s) from r + γV(s′) means we are subtracting the predicted value from the target value, and this difference is usually called the TD error. Okay, what about that α? It is basically the learning rate, also called the step size. That is:
Our TD learning update rule basically implies:
Value of a state = value of a state + learning rate × (reward + discount factor × (value of next state) – value of a state)
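Putting this together, a minimal TD(0) prediction loop could look like the following sketch. It assumes a Gym-style environment and a hypothetical policy(state) function that returns an action; it illustrates the update rule rather than reproducing the book’s exact implementation.

```python
from collections import defaultdict

def td_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) prediction: estimate V for a given policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target: immediate reward + discounted value of the next state
            td_target = reward + gamma * V[next_state]
            # TD error: target value minus the current predicted value
            td_error = td_target - V[state]
            # Update rule: V(s) = V(s) + alpha * (r + gamma * V(s') - V(s))
            V[state] += alpha * td_error
            state = next_state
    return V
```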
Summary of Temporal Difference Learning
In this article, we explored the TD learning update rule and how temporal difference learning is used to estimate the value of a state. The book further explores the TD prediction algorithm so readers can get a clearer understanding of the TD learning method. Master classic RL, deep RL, distributional RL, inverse RL, and more with OpenAI Gym and TensorFlow with Deep Reinforcement Learning with Python, 2nd Edition by Sudharsan Ravichandiran.
About the author
Sudharsan Ravichandiran is a data scientist and artificial intelligence enthusiast. He holds a Bachelor’s degree in Information Technology from Anna University. His research focuses on practical implementations of deep learning and reinforcement learning, including natural language processing and computer vision. He is an open-source contributor and loves answering questions on Stack Overflow.