Friday, November 15, 2024

Pervasive Simulator Misuse with Reinforcement Learning

The surge of interest in reinforcement learning is great fun, but I often see confused choices in applying RL algorithms to solve problems. There are two purposes for which you might use a world simulator in reinforcement learning:

  1. Reinforcement Learning Research: You might be interested in creating reinforcement learning algorithms for the real world and use the simulator as a cheap alternative to actual real-world application.
  2. Problem Solving: You want to find a good policy solving a problem for which you have a good simulator.

In the first instance I have no problem, but in the second instance, I'm seeing many head-scratcher choices.

A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity cost is "advantage".) Thinking about this in the context of a 2-person game, at a given state, with your existing rollout policy, is taking the first action leading to a win 1/2 the time good or bad? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action's long-term value relative to another's is the essence of the opportunity-cost trade-off at the core of many reinforcement learning algorithms.
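In standard RL notation, the opportunity cost described above can be written as the advantage of an action under the rollout policy. The numbers below are hypothetical, chosen only to illustrate the "well behind" and "well ahead" cases for an action that wins 1/2 the time:

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)

% Hypothetical values, with Q^{\pi}(s, a) = 0.5 in both cases:
% player well behind: V^{\pi}(s) = 0.3 \implies A^{\pi}(s, a) = +0.2  (the action is good)
% player well ahead:  V^{\pi}(s) = 0.7 \implies A^{\pi}(s, a) = -0.2  (the action is bad)
```

Here Q is the expected outcome of taking action a and then following the rollout policy, and V is the expected outcome of following the rollout policy from the start.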

If you have a choice between an algorithm that estimates the opportunity cost and one which observes the opportunity cost, which works better? Using the observed opportunity cost is an almost pure winner because it cuts out the effect of estimation error. In the real world you can't observe the opportunity cost directly, Groundhog Day style. How many times have you left a conversation and thought to yourself: I wish I had said something else? A simulator is different, though: you can reset a simulator. And when you do reset a simulator, you can directly observe the opportunity cost of an action, which can then directly drive learning updates.
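As a minimal sketch of what "observing" the opportunity cost with resets might look like, the code below restores the same state before trying each action and then rolls out with the existing policy. Everything here is an assumption made for illustration: the `ToySimulator` class, its `get_state`/`set_state` methods, the 80% move success rate, and the uniform random rollout policy are hypothetical and not taken from any particular library or algorithm.

```python
import random

N = 6  # hypothetical toy game: reaching position N wins, position 0 loses
ACTIONS = ("forward", "back")


class ToySimulator:
    """Stand-in for a good simulator that can save and restore its state."""

    def __init__(self, position, rng):
        self.position = position
        self.rng = rng

    def get_state(self):
        return self.position

    def set_state(self, state):
        self.position = state

    def done(self):
        return self.position in (0, N)

    def step(self, action):
        # The intended move succeeds 80% of the time; otherwise it is reversed.
        intended = 1 if action == "forward" else -1
        move = intended if self.rng.random() < 0.8 else -intended
        self.position += move
        return 1.0 if self.position == N else 0.0  # reward only for a win


def rollout_policy(sim):
    # The existing rollout policy; here simply uniform random.
    return sim.rng.choice(ACTIONS)


def observed_action_values(sim, num_rollouts=2000):
    """Reset to the same state for every action and average observed returns."""
    start = sim.get_state()
    values = {}
    for action in ACTIONS:
        total = 0.0
        for _ in range(num_rollouts):
            sim.set_state(start)        # the "reset cheat"
            ret = sim.step(action)      # try this action first...
            while not sim.done():       # ...then follow the rollout policy
                ret += sim.step(rollout_policy(sim))
            total += ret
        values[action] = total / num_rollouts
    sim.set_state(start)                # leave the simulator where we found it
    return values


sim = ToySimulator(position=3, rng=random.Random(0))
q = observed_action_values(sim)
# The uniform rollout policy's own value at this state is the mean over actions.
baseline = sum(q.values()) / len(q)
advantage = {a: q[a] - baseline for a in ACTIONS}
print(q, advantage)
```

In this toy game the averaged returns come from actual outcomes of each action at the same state, so no learned value function sits between the simulator and the advantage that drives an update. The only remaining error is sampling noise, which shrinks as `num_rollouts` grows.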

If you are coming from viewpoint 1, using a "reset cheat" is unappealing since it doesn't work in the real world and the goal is making algorithms which work in the real world. On the other hand, if you are operating from viewpoint 2, the "reset cheat" is a gigantic opportunity to dramatically improve learning algorithms. So, why are many people with goal 2 using algorithms designed for goal 1? I don't know, but here are some hypotheses.

  1. Maybe people just aren't aware that goal 2 style algorithms exist? They are out there. The most prominent examples of goal 2 style algorithms are from Learning to search and AlphaGo Zero.
  2. Maybe people are worried about the additional sample complexity of doing multiple rollouts from reset points? But these algorithms typically require little additional sample complexity in the worst case and can provide gigantic wins. People commonly use a discount factor d that values future rewards t timesteps ahead with a discount of d^t. Alternatively, you can terminate rollouts with probability 1 - d at each timestep and value future rewards with no discount while preserving the expected value. Using this approach, a rollout terminates after an expected 1/(1 - d) timesteps, bounding the cost of a reset and rollout. Since it is common to use very heavy discounting (e.g. d = 0.9, which gives an expected rollout length of 10 timesteps), the worst-case additional sample complexity is only a small factor larger. On the upside, eliminating estimation error can radically reduce sample complexity in theory and practice.
  3. Maybe the implementation overhead for a second family of algorithms is too difficult? But the choice of whether or not you use resets is far more important than "oh, we'll just run things for 10x longer". It can easily make or break the outcome.
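The equivalence in hypothesis 2 between discounting by d^t and terminating with probability 1 - d can be checked numerically. The sketch below assumes a made-up deterministic reward stream (`reward(t)` is hypothetical) and compares the discounted sum with the average undiscounted return of randomly terminated rollouts:

```python
import random


def reward(t):
    # Hypothetical deterministic reward stream, used only for illustration.
    return 1.0 + (t % 3)


def discounted_value(d, horizon=2000):
    """Sum of d**t * reward(t), truncated where d**t is negligible."""
    return sum((d ** t) * reward(t) for t in range(horizon))


def terminated_rollout(d, rng, max_steps=100_000):
    """One undiscounted rollout that stops with probability 1 - d after each step."""
    total, steps = 0.0, 0
    while steps < max_steps:
        total += reward(steps)
        steps += 1
        if rng.random() >= d:  # terminate with probability 1 - d
            break
    return total, steps


d = 0.9
rng = random.Random(0)
num_rollouts = 20_000
results = [terminated_rollout(d, rng) for _ in range(num_rollouts)]
mean_return = sum(r for r, _ in results) / num_rollouts
mean_length = sum(n for _, n in results) / num_rollouts

print("discounted value:        ", discounted_value(d))
print("mean undiscounted return:", mean_return)
print("mean rollout length:     ", mean_length, "vs 1/(1-d) =", 1 / (1 - d))
```

The rollout survives to timestep t with probability d^t, so the reward at t contributes d^t times its value in expectation, which is why the two quantities should agree up to sampling noise. The mean rollout length should likewise land near 1/(1 - d).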

Maybe there is some other reason? As I said above, this is a head-scratcher that I find myself trying to address regularly.
