Simply watching how an agent acts in its environment tells you little about why it behaves the way it does or what is going on internally. That's why it is crucial to establish metrics that tell you WHY the agent performs in a certain way.
This is especially challenging when the agent doesn't behave the way we would like it to… which is basically always. Every AI practitioner knows that whatever we work on, most of the time it won't simply work out of the box (they wouldn't pay us so much otherwise).
In this blog post, you'll learn what to keep track of to inspect/debug your agent's learning trajectory. I'll assume you are already familiar with the Reinforcement Learning (RL) agent-environment setting (see Figure 1) and that you've heard about at least some of the most common RL algorithms and environments.
Nevertheless, don't worry if you are just beginning your journey with RL. I've tried not to depend too much on the reader's prior knowledge, and where I couldn't omit some details, I've added references to useful materials.
Figure 1: The Reinforcement Learning framework (Sutton & Barto, 2018).
I’ll start by discussing useful metrics that give us a glimpse into the training and decision processes of the agent.
Then we will focus on aggregated statistics of these metrics, like the average, which help us analyze them across the many episodes the agent plays throughout training. These will help root-cause any issues with the agent.
At each step, I’ll base my suggestions on my own experience in RL research. Let’s jump right into it!
Metrics I use to inspect RL agent training
There are multiple types of metrics to follow, and each of them gives you different information about the model's performance. So the researcher can get information about…
…how the agent is doing
Here, we will take a closer look at three metrics that diagnose the overall performance of the agent.
Episode return
This is what we care about the most. The whole agent training is all about getting to the highest expected return possible (see Figure 2). If this metric goes up throughout the training, it’s a good sign.
Figure 2: The RL Problem. Find a policy π that maximizes the objective J. The objective J is an expected return E[R] under the environment dynamics P. τ is the trajectory played by the agent (or its policy π).
However, the episode return is much more useful to us when we know what return to expect, i.e. what a good score is.
That's why you should always look for baselines – other people's results in the environment you work on – and compare your results against them.
A random agent baseline is often a good start: it lets you recalibrate and get a feel for the true "zero" score in the environment – the minimal return you can get simply from mashing the controller (see Figure 3).
Figure 3. Table 3 from the SimPLe paper with their results on Atari environments compared to many baselines alongside the random agent and human scores.
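To give a sense of how such a baseline can be collected, here is a minimal sketch of a random-agent evaluation, assuming the classic Gym step API (4-tuple return); the environment name and the number of episodes are just placeholders.

```python
import gym
import numpy as np

# A minimal random-agent baseline (classic Gym step API assumed);
# the environment name and episode count are placeholders.
env = gym.make("CartPole-v1")
returns = []
for _ in range(20):
    env.reset()
    done, episode_return = False, 0.0
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        episode_return += reward
    returns.append(episode_return)

print(f"Random agent return: {np.mean(returns):.1f} +/- {np.std(returns):.1f}")
```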
Episode length
This is a useful metric to analyze in conjunction with the episode return. It tells us whether our agent is able to survive for some time before termination. In MuJoCo environments, where diverse creatures learn to walk (see Figure 4), it tells you, for example, whether your agent makes some moves before flipping over and being reset to the beginning of the episode.
Figure 4. A humanoid falling. Source: A Survey of Reinforcement Learning Techniques for 2D and 3D Bipedal Locomotion.
Solve rate
Yet another metric to analyze together with the episode return. If your environment has a notion of being solved, it's useful to check how many episodes the agent can solve. For instance, in Sokoban (see Figure 5) the agent gets partial rewards for pushing a box onto a target, but the room is only solved when all boxes are on targets.
Figure 5. Sokoban is a transportation puzzle where the player has to push all boxes in the room onto the storage targets.
So it is possible for the agent to have a positive episode return but still not finish the task it is required to solve.
Another example is Google Research Football (see Figure 6) with its academies. There are partial rewards for moving towards the opponent's goal, but an academy episode (e.g. practicing a counterattack situation with fewer players) is only considered "solved" when the agent's team scores a goal.
Figure 6. Google Research Football, the "Academy Pass and Shoot" environment.
…progress of training
There are multiple ways of representing the notion of "time" in RL and of what to measure progress against. Here are my top four picks.
Total environment steps
This simple metric tells you how much experience, in terms of environment steps or timesteps, the agent has already gathered. It is often a more informative measure of training progress than wall-time, which depends heavily on how fast your machine can simulate the environment and run the neural network computations (see Figure 7).
Figure 7. DDPG training on the MuJoCo Ant environment. Both runs took 24h, but on different machines. One did ~5M steps and the other ~9.5M. For the latter, that was enough time to converge; for the former it wasn't, and it scored worse.
Moreover, we report the final agent score together with how many environment steps (often called samples) it took to train it. The higher the score achieved with fewer samples, the more sample-efficient the agent is.
Training steps
We train neural networks with the Stochastic Gradient Descent (SGD) algorithm (see Deep Learning Book).
The training steps metric tells us how many batch updates we did to the network. When training from the off-policy replay buffer, we can match it with total environment steps in order to better understand how many times, on average, each sample from the environment is shown to the network to learn from it:
batch size * training steps / total environment steps = batch size / rollout length
where rollout length is the number of new timesteps we gather, on average, during the data collection phase in between training steps (when data collection and training are run sequentially).
The above ratio, sometimes called training intensity, shouldn't be below 1, as that would mean some samples aren't shown to the network even once! In fact, it should be much higher than 1, e.g. 256 (as set in the RLlib implementation of DDPG – look for "training intensity").
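For concreteness, here is a small helper that computes this ratio; the function name and the example numbers are illustrative, not taken from any particular library.

```python
def training_intensity(batch_size: int, training_steps: int, env_steps: int) -> float:
    """Average number of times each environment sample is used for learning.

    Equals batch_size / rollout_length when data collection and training
    alternate sequentially.
    """
    return batch_size * training_steps / env_steps

# E.g. batches of 256 samples with one gradient update per environment step:
print(training_intensity(batch_size=256, training_steps=100_000, env_steps=100_000))  # 256.0
```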
Wall time
This simply tells us how long an experiment has been running.
It is useful for planning how much time future experiments will need to simply finish:
- 2-3 hours?
- full night??
- or a couple of days???
- whole week?!?!?!
Yes, some experiments might take even a whole week on your PC to fully converge, that is, to reach the maximum episode return your method can achieve.
Thankfully, in the development phase shorter experiments (a few hours, up to 24h) are usually good enough to tell whether the agent is working at all, or to test some improvement ideas.
Note that you always want to plan your work so that some experiments run in the background while you work on something else: code, read, write, think, etc.
This is why a dedicated workstation used only for running experiments might be useful.
Steps per second
How many environment steps the agent takes per second. The average of this value lets you calculate how much time you'll need to run a given number of environment steps.
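A simple way to track this is a small counter like the sketch below; it's an illustrative helper, not part of any framework.

```python
import time

class Throughput:
    """Tracks environment steps per second and estimates the time left to a step budget."""

    def __init__(self):
        self.start_time = time.time()
        self.steps = 0

    def update(self, new_steps: int) -> None:
        self.steps += new_steps

    def steps_per_second(self) -> float:
        return self.steps / (time.time() - self.start_time)

    def eta_seconds(self, total_steps: int) -> float:
        """Estimated time (in seconds) to reach `total_steps` at the current rate."""
        return (total_steps - self.steps) / self.steps_per_second()
```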
…what the agent is thinking/doing
Finally, let's take a look inside the agent's brain. In my research – depending on the project – I use the value function and the policy entropy to get a hint of what is going on.
State/Action value function
Q-learning and actor-critic methods make use of value functions (VFs).
It’s useful to look at the values they predict to detect some anomalies and see how the agent evaluates its odds in the environment.
In the simplest case, I log the network's state value estimate at each timestep of an episode and then average the estimates across the whole episode (more on this in the next section). As training progresses, this metric should start to match the logged episode return (see Figure 8) or, more often, the discounted episode return, since that is what the VF is trained on. If it doesn't, it's a bad sign.
Figure 8. An experiment on the Google Research Football environment. As the agent trains, its value function estimate approaches the mean episode return.
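As a sketch of what such logging can look like, assuming you record the per-timestep value estimates and rewards of an episode, the per-episode diagnostics could be computed as follows (names are illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted episode return, i.e. the kind of target the value function is trained towards."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

def episode_value_diagnostics(value_estimates, rewards, gamma=0.99):
    """Per-episode numbers to log: mean VF estimate vs. (discounted) episode return."""
    return {
        "value_estimate_mean": float(np.mean(value_estimates)),
        "episode_return": float(np.sum(rewards)),
        "discounted_return": float(discounted_return(rewards, gamma)),
    }
```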
Moreover, on the VF values chart, we can see if some additional data processing is required.
For instance, in the Cart Pole environment the agent gets a reward of 1 for every timestep until the pole falls and the episode ends. The episode return quickly reaches the order of tens and hundreds. A VF network initialized to output small values around zero at the beginning of training has a hard time catching up with this range of values (see Figure 9).
That's why some additional normalization of the return is required before training on it. The easiest approach is to simply divide by the maximum possible return, but we might not know the maximum return, or there might be none (see e.g. the Q-value normalization in the MuZero paper, Appendix B – Backup).
Figure 9. An experiment on the Cart Pole environment. The value function target isn't normalized and the network has a hard time catching up with it.
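When the maximum return is unknown, one option is to normalize with running min/max statistics, loosely in the spirit of the MuZero-style normalization mentioned above. A minimal sketch, not the paper's exact procedure:

```python
class RunningMinMaxNormalizer:
    """Min-max normalizes values using running statistics gathered during training."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value: float) -> None:
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value: float) -> float:
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value  # not enough statistics gathered yet
```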
In the next section I'll discuss an example where this particular metric, combined with the min/max aggregation, helped me detect a bug in my code.
Policy entropy
Some RL methods make use of stochastic policies, so we can calculate their entropy: a measure of how random they are. Even with deterministic policies, we often use an epsilon-greedy exploration policy, whose entropy we can still calculate.
The policy entropy: H(π) = −Σ_a p(a) · ln p(a), where a is an action and p(a) is its probability.
The maximum entropy value equals ln(N), where N is the number of actions, and it means that the policy chooses actions uniformly at random. The minimum entropy value equals 0 and means that the policy always picks one particular action (with 100% probability).
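For a discrete policy this is straightforward to compute from the action probabilities, e.g.:

```python
import numpy as np

def policy_entropy(action_probs) -> float:
    """Shannon entropy H(pi) = -sum_a p(a) * ln p(a) of a discrete policy."""
    p = np.asarray(action_probs, dtype=np.float64)
    p = p[p > 0]  # convention: 0 * ln(0) = 0
    return float(-np.sum(p * np.log(p)))

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # ln(4) ~= 1.386, uniform random policy
print(policy_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0, fully deterministic policy
```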
If you observe that the entropy of the agent's policy drops rapidly, it's a bad sign: your agent stops exploring very quickly. If you use stochastic policies, you should think about entropy regularization methods (e.g. Soft Actor-Critic). If you use a deterministic policy with epsilon-greedy exploration, your epsilon decay schedule is probably too aggressive.
…how the training goes
Last but not least, we have some more standard Deep Learning metrics.
KL divergence
On-policy methods like Vanilla Policy Gradient (VPG) train on batches of experience sampled from the current policy (they don’t use any replay buffer with experience to train on).
This means that what the agent does has a high impact on what it learns. If you set the learning rate too high, the approximate gradient update might take too big a step in some seemingly promising direction and push the agent into a worse region of the state space.
The agent will then do worse than before the update (see Figure 10)! This is why we need to monitor the KL divergence between the old and the new policy. It can help us, for example, tune the learning rate.
Figure 10. VPG training on the Cart Pole environment. On the y-axis we have the episode length (which equals the episode return in this environment). The orange line is the sliding-window average of the score. In the left diagram the learning rate is too big and training is unstable. In the right diagram the learning rate was properly fine-tuned (I found it by hand).
KL divergence is a measure of the distance between two distributions. In our case, these are action distributions (policies). We don’t want our policy to differ too much before and after the update. There are methods like PPO that put a constraint on the KL divergence and won’t allow too big updates at all!
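For discrete action distributions logged before and after an update, the KL divergence can be computed like this (a sketch; in practice you would average it over the states in the batch):

```python
import numpy as np

def kl_divergence(p_old, p_new) -> float:
    """KL(pi_old || pi_new) between two discrete action distributions for one state."""
    p_old, p_new = np.asarray(p_old, dtype=np.float64), np.asarray(p_new, dtype=np.float64)
    mask = p_old > 0
    return float(np.sum(p_old[mask] * np.log(p_old[mask] / p_new[mask])))

print(kl_divergence([0.5, 0.5], [0.6, 0.4]))    # small, a gentle policy update
print(kl_divergence([0.5, 0.5], [0.99, 0.01]))  # large, a suspiciously aggressive update
```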
Network weights/gradients/activations histograms
Logging the activations, gradients, and weights histograms of each layer can help you monitor the artificial neural network training dynamics. You should look for signs of:
– Dying ReLUs:
If a ReLU neuron outputs zero in the forward pass, it won't receive a gradient signal in the backward pass. It can even happen that some neurons never get excited (never return a non-zero output) for any input, because of unfortunate initialization or a too-big update during training (a minimal check for this is sketched at the end of this section).
“Sometimes you can forward the entire training set <i.e. the replay buffer in RL> through a trained network and find that a large fraction (e.g. 40%) of your neurons were zero the entire time.” ~ Yes you should understand backprop by Andrej Karpathy
– Vanishing or Exploding gradients:
Very large values of gradient updates can indicate exploding gradients. Gradient clipping may help.
On the other hand, very low values of gradient updates can indicate vanishing gradients. Using ReLU activations and Glorot uniform initializer (a.k.a. Xavier uniform initializer) should help with it.
– Vanishing or Exploding activations:
A good standard deviation for the activations is on the order of 0.5 to 2.0. Values significantly outside this range may indicate vanishing or exploding activations, which in turn may cause problems with gradients. Try Layer/Batch Normalization to keep your activation distributions under control.
In general, layer weight (and activation) distributions that are close to normal (values around zero without many outliers) are a sign of healthy training.
The above tips should help you keep your network healthy through training.
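As for the dying-ReLU check mentioned above, a minimal version could look like the following, assuming you can grab the post-ReLU activations of a layer for a batch of inputs (e.g. sampled from the replay buffer):

```python
import numpy as np

def dead_relu_fraction(activations) -> float:
    """Fraction of ReLU units that output zero for every input in the batch.

    `activations` is the post-ReLU output of one layer with shape (batch, units).
    A large fraction (e.g. ~40%) suggests dying ReLUs.
    """
    activations = np.asarray(activations)
    return float(np.mean(np.all(activations == 0, axis=0)))
```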
Policy/Value/Quality/… heads losses
Even though we do optimize some loss function to train an agent, you should know that this isn’t a loss function in the typical sense of the word. Specifically, it is different from the loss functions used in supervised learning.
We optimize the objective from Figure 2. To do so, in Policy Gradient methods you derive the gradient of this objective (called the Policy Gradient). However, because TensorFlow and other DL frameworks are built around auto-grad, you define a surrogate loss function which, after auto-grad is run on it, yields a gradient equal to the Policy Gradient.
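For vanilla Policy Gradient, such a surrogate loss could be sketched as follows, assuming you already have the log-probabilities of the taken actions and return/advantage estimates (the function name is illustrative):

```python
import tensorflow as tf

def vpg_surrogate_loss(log_probs: tf.Tensor, advantages: tf.Tensor) -> tf.Tensor:
    """Surrogate loss whose auto-grad gradient equals the vanilla Policy Gradient.

    `log_probs` are log pi(a_t | s_t) of the actions actually taken; gradients
    must not flow through the `advantages`, hence the stop_gradient.
    """
    return -tf.reduce_mean(log_probs * tf.stop_gradient(advantages))
```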
Note that the data distribution depends on the policy and changes with training. This means that the loss function doesn't have to decrease monotonically for training to proceed. It can sometimes increase when the agent discovers a new area of the state space (see Figure 11).
Figure 11. SAC training on the MuJoCo Humanoid environment. When the episode return starts to go up (our agent learns successfully), the Q-function loss goes up too! It starts to go down again after some time.
Moreover, the loss doesn't measure the agent's performance! The true measure of the agent's performance is the episode return. It's useful to log losses as a sanity check, but don't base your judgment of training progress on them.
Aggregated statistics
Of course, for some metrics (like state/action-values) it’s infeasible to log them for every environment timestep for each experiment. Typically, you would calculate statistics every episode or couple of episodes.
For other metrics, we deal with randomness (e.g. the episode return when the environment and/or the policy are stochastic). Therefore, we have to use sampling to estimate the expected metric value (sample = one agent episode in the episode return case).
In either case, the aggregate statistics are the solution!
Average and standard deviation
When you deal with a stochastic environment (e.g. the ghosts in Pac-Man act randomly) and/or your policy draws actions at random (e.g. the stochastic policy in VPG), you should:
- play multiple episodes (10-20 should be fine),
- average metrics across them,
- log this average and standard deviation.
The average estimates the true expected return better than a single episode, and the standard deviation gives you a hint of how much the metric varies across episodes.
If the variance is too high, you should average over more samples (play more episodes) or use a smoothing technique like the Exponential Moving Average.
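In code, the aggregation is as simple as the following illustrative helpers:

```python
import numpy as np

def mean_and_std(episode_returns):
    """Average and standard deviation over the evaluation episodes."""
    returns = np.asarray(episode_returns, dtype=np.float64)
    return float(returns.mean()), float(returns.std())

def ema(previous: float, new: float, decay: float = 0.9) -> float:
    """Exponential Moving Average update, used to smooth a noisy metric over time."""
    return decay * previous + (1 - decay) * new
```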
Minimum/Maximum value
It's really useful to inspect extremes when looking for a bug. I'll discuss it with an example.
In experiments on Google Research Football with my agent that used random rollouts from the current timestep to calculate action qualities, I noticed some strange minimum values of these action qualities.
The average statistic made sense, but something was wrong with the minimum values: they were below the reasonable minimum (below minus one, see Figure 12).
Figure 12. The mean qualities are all above zero. The minimum qualities are very often below minus one, which is lower than should be possible.
After some digging, it turned out that I had used np.empty to create the array for action qualities.
np.empty looks like np.zeros, but it only allocates memory and doesn't initialize the NumPy array's values.
Because of that, from time to time some actions had scores that came not from my updates (which would override the initial values in the array) but from leftover values in the allocated memory locations that had never been erased!
I changed np.empty to np.zeros and it fixed the problem.
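In short, the pitfall looks like this (the action count is illustrative):

```python
import numpy as np

n_actions = 16  # illustrative size of the action set

# Buggy: np.empty only allocates memory; the initial "values" are whatever
# garbage happened to be at those memory locations.
qualities = np.empty(n_actions)

# Fixed: np.zeros actually initializes every entry to 0.
qualities = np.zeros(n_actions)
```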
Median
The same idea that we used for averaging over stochastic episodes can be applied to whole training runs!
As we know, the algorithm used to train deep networks is called Stochastic Gradient Descent. It's stochastic because we draw training samples at random and pack them into batches. This means that running the same training multiple times will yield different results.
You should always run your training multiple times with different seeds (pseudo-random number generator initializations) and report the median of these runs, to make sure the score isn't that high or that low simply by chance.
Figure 13. SAC training on the MuJoCo Ant environment. All runs have the same hyper-parameters, only different seeds. Three runs, three results.
As "Deep Reinforcement Learning Doesn't Work Yet" points out, your agent might fail to learn anything even if your implementation is correct. It can simply fail by chance, e.g. because of unlucky initialization (see Figure 13).
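To see why the median is a safer summary than the mean when one seed fails, consider the following (the numbers are purely illustrative):

```python
import numpy as np

# Final returns of the same configuration run with three different seeds
# (illustrative numbers; one unlucky seed failed to train).
final_returns_per_seed = [3120.0, 2980.0, 410.0]

print(np.mean(final_returns_per_seed))    # 2170.0, dragged down by the failed run
print(np.median(final_returns_per_seed))  # 2980.0, closer to the typical performance
```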
Conclusions
Now you know what to log, and why, to get the full picture of the agent training process. Moreover, you know what to look for in these logs and even how to deal with some common problems.
Before we finish, please take a look at Figure 13 once again. We see that the training curves, though different, follow similar paths, and two out of three even converge to a similar result. Any ideas what that could mean?
Stay tuned for future posts!
Piotr Januszewski
Research Software Engineer at University of Warsaw and PhD student at Gdansk University of Technology