Before we begin, let me just define a few terms:
$S_t$ is the state at time $t$. $A_t$ is the action performed at time $t$. $R_t$ is the reward received at time $t$. $G_t$ is the return, that is the sum of discounted rewards received from time $t$ onwards, defined as $G_t = \sum_{i=0}^\infty \gamma^i R_{t+i+1}$. $V^\pi(s)$ is the value of a state when following a policy $\pi$, that is the expected return when starting in state $s$ and following a policy $\pi$, defined as $V^\pi(s) = E[G_t | S_t = s]$.
Read More →