Before we begin, let me just define a few terms:
- $S_t$ is the state at time $t$.
- $A_t$ is the action performed at time $t$.
- $R_t$ is the reward received at time $t$.
- $G_t$ is the return, that is the sum of discounted rewards received from time $t$ onwards, defined as $G_t = \sum_{i=0}^\infty \gamma^i R_{t+i+1}$ (a small numerical sketch follows these definitions).
- $V^\pi(s)$ is the value of a state when following a policy $\pi$, that is the expected return when starting in state $s$ and following policy $\pi$, defined as $V^\pi(s) = \mathrm{E}_\pi[G_t | S_t = s]$.
- $Q^\pi(s, a)$ is the value of a state $s$ when performing an action $a$ and then following the policy $\pi$, that is $Q^\pi(s, a) = \mathrm{E}_\pi[G_t | S_t = s, A_t = a]$.
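To make the return concrete, here is a minimal sketch in Python; the reward sequence and discount factor are made-up values chosen purely for illustration.

```python
# Minimal sketch: computing the return G_t for a finite episode.
# The reward sequence and discount factor gamma are made-up illustrative values.

def discounted_return(rewards, gamma):
    """G_t = sum_i gamma^i * R_{t+i+1} for a finite list of rewards."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 5.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

g_t = discounted_return(rewards, gamma)
print(g_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265

# The recursive rewriting G_t = R_{t+1} + gamma * G_{t+1}, used just below,
# can be checked numerically on this example:
g_t1 = discounted_return(rewards[1:], gamma)
assert abs(g_t - (rewards[0] + gamma * g_t1)) < 1e-9
```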
Before moving further, note a small algebraic trick for re-writing $G_t$ in terms of itself:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) = R_{t+1} + \gamma G_{t+1}
$$
We can use this in the definition of $V^\pi(s)$ and get

$$
\begin{aligned}
V^\pi(s) &= \mathrm{E}_\pi[G_t | S_t = s] \\
&= \mathrm{E}_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&= \mathrm{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) | S_t = s]
\end{aligned}
$$

The last equation is called the Bellman equation for $V^\pi(s)$ and it shows a recursive relationship between the value of the current state and the values of the possible next states. This in and of itself is not that interesting, but we’ll use it to derive a way of finding the optimal policy.
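As a quick illustration of the Bellman equation as a computational tool, here is a minimal sketch of iterative policy evaluation; the two-state MDP (its transitions `P` and rewards `R`) and the stochastic policy `pi` are made-up assumptions, not anything defined in this post.

```python
import numpy as np

# Minimal sketch of iterative policy evaluation using the Bellman equation:
#   V(s) <- sum_a pi(a|s) sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# The two-state, two-action MDP (P, R) and the policy pi are made up
# purely for illustration.

gamma = 0.9

# P[s, a, s'] : probability of landing in s' after taking action a in state s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

# R[s, a, s'] : reward received for that transition.
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.5]]])

# A stochastic policy pi[s, a] = pi(a | s).
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

V = np.zeros(P.shape[0])
for _ in range(1000):
    # Expected one-step return for every (s, a) pair under the current V.
    Q = (P * (R + gamma * V)).sum(axis=2)
    # Average over the policy's action choices to get the new state values.
    V_new = (pi * Q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # approximates V^pi(s) for the two states of this toy MDP
```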
Let us now define the optimal value function, that is the value function of the optimal policy (denoted $\pi^*$):

$$
V^*(s) = \max_\pi V^\pi(s)
$$

that is, the optimal value of a state is the maximum over all possible policies. Going one step further, we also define the optimal action-value function as

$$
Q^*(s, a) = \max_\pi Q^\pi(s, a)
$$
It’s easy to see now that $V^*(s) = \max_a Q^*(s, a)$, that is, the maximum value of a state is obtained by performing the best possible action. We can use this further to arrive at a simplified Bellman equation as follows:

$$
\begin{aligned}
V^*(s) &= \max_a Q^*(s, a) \\
&= \max_a \mathrm{E}_{\pi^*}[G_t | S_t = s, A_t = a] \\
&= \max_a \mathrm{E}_{\pi^*}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\
&\stackrel{*}{=} \max_a \mathrm{E}[R_{t+1} + \gamma V^*(S_{t+1}) | S_t = s, A_t = a]
\end{aligned}
$$
Here we managed to do a small but important trick in the step marked with $\stackrel{*}{=}$. But let us first unpack what $\mathrm{E}_{\pi^*}$ actually means. Because $G_t$ is the discounted sum of rewards, it is only defined in terms of a policy. And since we assume the policy to be stochastic, we need to take an expectation over all possible actions chosen by the policy, as well as over the possible rewards.

This changes at the marked step: we are no longer referring to the policy $\pi^*$, but rather to the value function $V^*$, which is not stochastic. As a result, the expectation in that equation is only over $R_{t+1}$ (and over the next state $S_{t+1}$, if the environment’s transitions are stochastic), because we still assume stochastic rewards.
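One common way to put this equation to work (only sketched here, not derived in this post) is to apply its right-hand side repeatedly as an update, a procedure known as value iteration. Below is a minimal sketch on a made-up two-state MDP; `P`, `R`, the iteration cap, and the convergence tolerance are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of value iteration: repeatedly apply the Bellman optimality
# backup  V(s) <- max_a E[R_{t+1} + gamma * V(S_{t+1}) | S_t = s, A_t = a].
# The toy two-state MDP (P, R) below is made up purely for illustration.

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.5]]])   # R[s, a, s']

V = np.zeros(P.shape[0])
for _ in range(1000):
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[s, a] under the current V
    V_new = Q.max(axis=1)                  # V(s) <- max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)  # act greedily with respect to the final V
print(V, greedy_policy)
```

Once $V$ stops changing, acting greedily with respect to it, i.e. picking $\arg\max_a Q(s, a)$ in each state, gives an optimal policy for a finite MDP with $\gamma < 1$.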
References
- Equations for the value function and notation borrowed from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.