Prerequisites: Q-Learning.
The derivations below use the symbols defined in the prerequisite article.
The Q-learning technique is based on the Bellman Equation:

$$v(s) = \mathbb{E}\left[R_{t+1} + \gamma\, v(S_{t+1})\right]$$

where,
- $\mathbb{E}$ : expectation
- $t+1$ : the next time step
- $\gamma$ : discount factor
Rephrasing the above equation in terms of the Q-value:

$$q_{\pi}(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, q_{\pi}(S_{t+1}, A_{t+1})\right]$$

The optimal Q-value is given by

$$q_{*}(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a')\right]$$
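To make the recursion concrete, here is a minimal Python sketch that repeatedly applies the optimal backup $q_{*}(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a')]$ to a toy two-state MDP. The MDP, the transition table `P`, and all constants are illustrative assumptions, not part of the article.

```python
GAMMA = 0.9  # discount factor

# Hypothetical toy MDP: P[(s, a)] -> list of (probability, next_state, reward)
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
ACTIONS = ["stay", "go"]

# Current estimate of q*(s, a), initialized to zero
Q = {sa: 0.0 for sa in P}

def backup(s, a):
    """One application of q*(s, a) = E[r + gamma * max_a' q*(s', a')]."""
    return sum(p * (r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
               for p, s2, r in P[(s, a)])

# Sweep the backup until the values stop changing (Q-value iteration)
for _ in range(200):
    Q = {sa: backup(*sa) for sa in P}

print({sa: round(v, 2) for sa, v in Q.items()})
# e.g. q*("s1", "stay") converges to 2 / (1 - 0.9) = 20.0
```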
Policy Iteration: It is the process of determining the optimal policy for the model, and it consists of the following two steps:
- Policy Evaluation: This step estimates the long-term value function V under the greedy policy obtained from the last Policy Improvement step.
- Policy Improvement: This step updates the policy with the action that maximizes V for each state. The two steps are repeated until the policy converges; a runnable sketch of the full loop is given after the steps below.
Steps Involved:
- Initialization:
  V(s) = any random real number
  π(s) = any a ∈ A(s), chosen arbitrarily
- Policy Evaluation:
  while (true) {
      Δ = 0
      for each s in S {
          v = V(s)
          V(s) = Σ_{s', r} p(s', r | s, π(s)) [r + γ V(s')]
          Δ = max(Δ, |v − V(s)|)
      }
      if (Δ < θ) break   // θ is a small positive threshold
  }
- Policy Improvement:
  policy_stable = true
  for each s in S {
      old_action = π(s)
      π(s) = argmax_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
      if (old_action ≠ π(s)) policy_stable = false
  }
  if (policy_stable) return V, π
  else go back to Policy Evaluation
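Putting Policy Evaluation and Policy Improvement together, the following is a minimal, self-contained Python sketch of Policy Iteration on a hypothetical five-state random-walk MDP. The MDP, the threshold θ = 1e-8, and all other constants are illustrative assumptions.

```python
import random

GAMMA, THETA = 0.9, 1e-8
STATES = list(range(5))        # states 0..4; state 4 is terminal
ACTIONS = [-1, +1]             # step left or right

def transitions(s, a):
    """Toy deterministic dynamics: list of (prob, next_state, reward)."""
    s2 = max(0, min(4, s + a))
    return [(1.0, s2, 1.0 if s2 == 4 else 0.0)]  # reward only at the goal

# Initialization: V(s) = random real number, pi(s) = arbitrary action
V = {s: random.random() for s in STATES}
V[4] = 0.0                     # terminal state is pinned to value 0
pi = {s: random.choice(ACTIONS) for s in STATES[:-1]}

while True:
    # Policy Evaluation: sweep until the value function stops changing
    while True:
        delta = 0.0
        for s in STATES[:-1]:
            v = V[s]
            V[s] = sum(p * (r + GAMMA * V[s2])
                       for p, s2, r in transitions(s, pi[s]))
            delta = max(delta, abs(v - V[s]))
        if delta < THETA:
            break
    # Policy Improvement: act greedily with respect to the current V
    policy_stable = True
    for s in STATES[:-1]:
        old_action = pi[s]
        pi[s] = max(ACTIONS, key=lambda a: sum(
            p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a)))
        if old_action != pi[s]:
            policy_stable = False
    if policy_stable:
        break                  # pi is greedy w.r.t. its own value: optimal

print("V:", {s: round(V[s], 3) for s in STATES})
print("pi:", pi)               # every non-terminal state should choose +1
```

The two-level loop mirrors the steps above: the inner while is Policy Evaluation, the greedy sweep is Policy Improvement, and policy_stable implements the convergence check.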
Value Iteration: This process updates the function V according to the Optimal Bellman Equation:

$$v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_k(s')\right]$$
Working Steps:
- Initialization: Initialize the array V with arbitrary random real values.
- Computing the optimal value:
  while (true) {
      Δ = 0
      for each s in S {
          v = V(s)
          V(s) = max_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
          Δ = max(Δ, |v − V(s)|)
      }
      if (Δ < θ) break
  }
  return the deterministic policy π(s) = argmax_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
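Below is a matching Python sketch of Value Iteration on the same hypothetical random-walk MDP used above; it folds the max over actions directly into the value backup, so no explicit policy is kept until the end (again, the MDP and constants are illustrative assumptions).

```python
GAMMA, THETA = 0.9, 1e-8
STATES = list(range(5))        # states 0..4; state 4 is terminal
ACTIONS = [-1, +1]

def transitions(s, a):
    """Toy deterministic dynamics: list of (prob, next_state, reward)."""
    s2 = max(0, min(4, s + a))
    return [(1.0, s2, 1.0 if s2 == 4 else 0.0)]

def q(s, a, V):
    """Expected return of taking a in s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))

# Initialization: arbitrary values (terminal pinned to 0)
V = {s: 0.0 for s in STATES}

# Computing the optimal value: apply the optimal Bellman backup until the
# largest change in a full sweep falls below the threshold theta
while True:
    delta = 0.0
    for s in STATES[:-1]:
        v = V[s]
        V[s] = max(q(s, a, V) for a in ACTIONS)
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:
        break

# Extract the deterministic greedy policy from the converged V
pi = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in STATES[:-1]}
print("V*:", {s: round(V[s], 3) for s in STATES})
print("pi*:", pi)
```

Compared with Policy Iteration, Value Iteration avoids the nested evaluation loop at the cost of a max over actions in every backup, which usually makes each sweep slightly more expensive but the overall loop simpler.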