Bellman Equation

4 min read • Published: November 22, 2018

Before we begin, let me just define a few terms:

  • $S_t$ is the state at time $t$.
  • $A_t$ is the action performed at time $t$.
  • $R_t$ is the reward received at time $t$.
  • $G_t$ is the return, that is the sum of discounted rewards received from time $t$ onwards, defined as $G_t = \sum_{i=0}^\infty \gamma^i R_{t+i+1}$ (see the numeric sketch after this list).
  • $V^\pi(s)$ is the value of a state when following a policy $\pi$, that is the expected return when starting in state $s$ and following the policy $\pi$, defined as $V^\pi(s) = \mathrm{E}[G_t \mid S_t = s]$.
  • $Q^\pi(s, a)$ is the value of a state $s$ when performing an action $a$ and then following the policy $\pi$, that is $Q^\pi(s, a) = \mathrm{E}[G_t \mid S_t = s, A_t = a]$.
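
To make the return concrete, here is a minimal numeric sketch of $G_0$ computed straight from the definition. The reward sequence and the discount factor $\gamma = 0.9$ are made-up illustrative values, and the episode is assumed to end after four rewards.

```python
# Minimal sketch: the discounted return G_0 computed directly from its definition.
# The rewards and gamma are arbitrary illustrative values; the episode ends after R_4.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_1, R_2, R_3, R_4

G_0 = sum(gamma**i * r for i, r in enumerate(rewards))
print(G_0)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```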

Before moving further, note a small algebraic trick for rewriting $G_t$ in terms of itself:

$$
\begin{aligned}
G_t &= \sum_{i=0}^\infty \gamma^i R_{t+i+1} \\
&= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
&= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \\
&= R_{t+1} + \gamma G_{t+1}.
\end{aligned}
$$
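
The same identity can be checked numerically: filling in the returns backwards with $G_t = R_{t+1} + \gamma G_{t+1}$ reproduces the directly computed sums. The rewards below are the same made-up values as in the sketch above.

```python
# Sketch: returns computed backwards via G_t = R_{t+1} + gamma * G_{t+1}.
# Illustrative values only; past the end of the episode the return is 0.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_1 .. R_4

G = [0.0] * (len(rewards) + 1)  # G[t] holds G_t, with the terminal G_4 = 0
for t in reversed(range(len(rewards))):
    G[t] = rewards[t] + gamma * G[t + 1]  # rewards[t] is R_{t+1}

print(G[:-1])  # ≈ [3.349, 2.61, 2.9, 1.0]; G[0] matches the direct sum above
```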

We can use this in the definition of $V^\pi(s)$ and get

$$
\begin{aligned}
V^\pi(s) &= \mathrm{E}[G_t \mid S_t = s] \\
&= \mathrm{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \big[ r + \gamma \mathrm{E}[G_{t+1} \mid S_{t+1} = s'] \big] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \big[ r + \gamma V^\pi(s') \big].
\end{aligned}
$$

The last equation is called the Bellman equation for $V^\pi(s)$, and it shows a recursive relationship between the value of the current state and the values of the possible next states. This is not that interesting on its own, but we’ll use it to derive a way of finding the optimal policy.
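
To see the recursion in action, here is a minimal sketch of iterative policy evaluation: start from an arbitrary guess and repeatedly apply the Bellman equation as an update until the values stop changing. The two-state MDP and the uniform random policy are hypothetical, made up purely for illustration, and for brevity the sketch uses expected immediate rewards $R(s, a)$, which folds the inner sum over $r$ in the equation above.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only to illustrate the update.
# P[s, a, s'] are transition probabilities, R[s, a] the expected immediate rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],   # pi(a|s): a uniform random policy
               [0.5, 0.5]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman equation used as an update:
    # V(s) <- sum_a pi(a|s) [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # approximate V^pi for the random policy
```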

Let us now define the optimal value function, that is the value function of the optimal policy (denoted $\pi^*$):

$$
V^*(s) = \max_\pi V^\pi(s),
$$

that is, the optimal value of a state is the maximum value achievable over all possible policies. Going one step further, we also define the optimal action-value function as

$$
Q^*(s, a) = \max_\pi Q^\pi(s, a).
$$

It’s easy to see now that $V^*(s) = \max_a Q^*(s, a)$, that is, the maximum value of a state is obtained by performing the best possible action in it. We can use this further to arrive at a simplified Bellman equation as follows:

$$
\begin{aligned}
V^*(s) &= \max_a Q^{\pi^*}(s, a) \\
&= \max_a \mathrm{E}_{\pi^*} [G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathrm{E}_{\pi^*} [R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&\stackrel{*}{=} \max_a \mathrm{E} [R_{t+1} + \gamma V^*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s'} \sum_r p(s', r \mid s, a) \big[ r + \gamma V^*(s') \big].
\end{aligned}
$$

Here we managed to do a small but important trick in the second to last equation (marked with $\stackrel{*}{=}$). But let us first unpack what $\mathrm{E}_{\pi^*}$ actually means. Because $G_t$ is the discounted sum of rewards, it is only defined in terms of a policy. And since we assume the policy to be stochastic, we need to take an expectation over all possible actions chosen by the policy, as well as over the possible rewards.

This changes at the marked equation, because we are no longer referring to the policy $\pi^*$, but rather to the value function $V^*$, which is not stochastic. As a result, the expectation in the second to last equation is only over the environment's randomness, that is over the reward $R_{t+1}$ and the next state $S_{t+1}$, which we still assume to be stochastic.
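
The Bellman optimality equation suggests the analogous update: replace the expectation over the policy's actions with a max over actions and iterate. This is value iteration, sketched below on the same hypothetical MDP as in the policy evaluation sketch; none of the numbers come from the article.

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the policy evaluation sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality equation used as an update:
    # V*(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V*(s') ]
    Q = R + gamma * P @ V    # Q[s, a]
    V_new = Q.max(axis=1)    # V*(s) = max_a Q*(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)                  # approximate optimal state values
print(Q.argmax(axis=1))   # greedy actions, i.e. the optimal deterministic policy
```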




Discussion of "Bellman Equation"

If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear, don't be afraid to mention it. The goal of these articles is to be as informative as possible.
