
Value Iteration and Policy Iteration#

Value Iteration#

Value Iteration approximates the optimal value function $v_*$ by iteratively updating a value estimate $v_k$. The core iteration formula is:

$$v_{k+1} = f(v_k) = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right), \quad k = 1, 2, 3, \dots$$

This process can be decomposed into two steps:

  1. Policy Update:

     $$\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right)$$

  2. Value Update:

     $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$

Note: Here, $v_k$ represents the estimated value vector at the $k$-th iteration, not the final state value.
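
For concreteness, here is a minimal NumPy sketch of one such iteration on a made-up 2-state, 2-action MDP. The data layout (`P[s, a, s'] = p(s'|s,a)`, `r[s, a]` = expected immediate reward) and the specific numbers are assumptions of this sketch, not taken from the text; the later sketches reuse the same layout.

```python
import numpy as np

# Toy MDP, invented purely for illustration.
# P[s, a, s'] = p(s' | s, a); each row sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])
# r[s, a] = expected immediate reward, i.e. sum_r p(r|s,a) * r
r = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
gamma = 0.9

v_k = np.zeros(2)                    # current value estimate v_k
q_k = r + gamma * P @ v_k            # q_k(s, a) for every state-action pair

# Policy update: greedy action a_k^*(s) in every state
a_star = q_k.argmax(axis=1)

# Value update: v_{k+1} = r_{pi_{k+1}} + gamma * P_{pi_{k+1}} v_k,
# which under the greedy policy is just q_k(s, a_k^*(s))
v_k1 = q_k[np.arange(2), a_star]
print(a_star, v_k1)
```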

1. Policy Update#

$$\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right)$$

Its elementwise form is:

$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}$$

The resulting $\pi_{k+1}$ is a greedy policy: in each state $s$, it selects the action $a_k^*(s)$ that maximizes $q_k(s,a)$:

$$a_k^*(s) = \arg\max_{a} q_k(s, a)$$

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \neq a_k^*(s) \end{cases}$$
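
A small sketch of this greedy selection, assuming $q_k$ is stored as an `[S, A]` NumPy array in the layout introduced above; the function name `greedy_policy` is ours, and ties are broken toward the lowest-index action by `argmax`.

```python
import numpy as np

def greedy_policy(q: np.ndarray) -> np.ndarray:
    """Policy update step: turn an action-value table q_k of shape [S, A]
    into a deterministic greedy policy, returned as an [S, A] matrix of
    probabilities with a single 1 in each row."""
    n_states, n_actions = q.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q.argmax(axis=1)] = 1.0   # pi_{k+1}(a|s) = 1 iff a = a_k^*(s)
    return pi
```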

2. Value Update#

$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$

Its elementwise form is:

$$v_{k+1}(s) = \sum_{a} \pi_{k+1}(a|s) \underbrace{\left( \sum_{r} p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}$$

Since $\pi_{k+1}$ is a greedy policy, the above equation is equivalent to:

$$v_{k+1}(s) = \max_{a} q_k(s, a)$$
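
Putting the two steps together gives the following sketch of the full loop (same assumed `[S, A, S]` / `[S, A]` layout as above; the tolerance and iteration cap are arbitrary choices of ours, not part of the algorithm as stated).

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration sketch for a tabular MDP.
    P: [S, A, S] transition tensor, r: [S, A] expected rewards (assumed layout)."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)                       # v_0: arbitrary initial estimate
    for _ in range(max_iter):
        q = r + gamma * P @ v                    # q_k(s, a)
        v_new = q.max(axis=1)                    # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:      # stop when the estimate has stabilised
            v = v_new
            break
        v = v_new
    q = r + gamma * P @ v                        # final policy update: greedy w.r.t. q
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q.argmax(axis=1)] = 1.0
    return v, pi
```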

Policy Iteration#

Policy Iteration consists of the following steps:

  1. Initialization: Start from a random initial policy $\pi_0$.

  2. Policy Evaluation (PE): Calculate the state value $v_{\pi_k}$ of the current policy:

     $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

  3. Policy Improvement (PI): Generate a better policy based on this value:

     $$\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$$
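
One step worth making explicit before the evaluation subsection: the policy-evaluation equation is linear in $v_{\pi_k}$, and because $\gamma < 1$ the matrix $I - \gamma P_{\pi_k}$ is invertible, so in principle it can be solved in closed form:

$$v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$$

Inverting this matrix is expensive for large state spaces, however, which is why the iterative solution below is normally used instead.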

1. Policy Evaluation#

Solving for $v_{\pi_k}$ usually employs an iterative method:

  • Matrix-vector form:

    $$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots$$

  • Elementwise form:

    $$v_{\pi_k}^{(j+1)}(s) = \sum_{a} \pi_k(a|s) \left( \sum_{r} p(r|s, a)\,r + \gamma \sum_{s'} p(s'|s, a)\,v_{\pi_k}^{(j)}(s') \right), \quad s \in \mathcal{S}$$

Stopping condition: in theory the iteration converges only as $j \to \infty$; in practice, stop once $\| v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} \|$ is sufficiently small.
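
A sketch of this iterative evaluation, assuming the policy is stored as an `[S, A]` matrix `pi` with `pi[s, a]` $= \pi_k(a|s)$ (the layout and names are ours):

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma, tol=1e-8, max_sweeps=10_000):
    """Iterative policy evaluation sketch.
    P: [S, A, S] transitions, r: [S, A] expected rewards,
    pi: [S, A] policy matrix with pi[s, a] = pi_k(a|s)."""
    r_pi = (pi * r).sum(axis=1)                  # r_pi(s) = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)        # P_pi(s, s') = sum_a pi(a|s) p(s'|s, a)
    v = np.zeros(r_pi.shape)                     # v_{pi_k}^{(0)}: arbitrary start
    for _ in range(max_sweeps):
        v_new = r_pi + gamma * P_pi @ v          # v^{(j+1)} = r_pi + gamma * P_pi * v^{(j)}
        if np.max(np.abs(v_new - v)) < tol:      # successive iterates are close enough
            return v_new
        v = v_new
    return v
```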

2. Policy Improvement#

  • Matrix-vector form:

    $$\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$$

  • Elementwise form:

    $$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s, a)\,r + \gamma \sum_{s'} p(s'|s, a)\,v_{\pi_k}(s') \right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S}$$

Let $a_k^*(s) = \arg\max_{a} q_{\pi_k}(s, a)$; the updated policy is then a deterministic greedy policy:

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \neq a_k^*(s) \end{cases}$$
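
Below is a compact sketch of the whole Policy Iteration loop under the same assumed layout. One liberty taken here: instead of running the evaluation sweeps to convergence, each evaluation solves the linear system $(I - \gamma P_{\pi_k}) v = r_{\pi_k}$ directly (the closed form noted earlier); this is a substitution of ours, not how the text states the algorithm.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iter=1_000):
    """Policy iteration sketch with exact policy evaluation via a linear solve.
    P: [S, A, S] transitions, r: [S, A] expected rewards (assumed layout)."""
    n_states, n_actions = r.shape
    pi = np.zeros((n_states, n_actions))
    pi[:, 0] = 1.0                                   # arbitrary initial policy pi_0
    for _ in range(max_iter):
        # Policy evaluation: v_{pi_k} solves (I - gamma * P_pi) v = r_pi
        r_pi = (pi * r).sum(axis=1)
        P_pi = np.einsum('sa,sat->st', pi, P)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to q_{pi_k}
        q = r + gamma * P @ v
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), q.argmax(axis=1)] = 1.0
        if np.array_equal(pi_new, pi):               # policy is stable => optimal
            break
        pi = pi_new
    return v, pi
```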

Truncated Policy Iteration#

We can unify Value Iteration and Policy Iteration from the perspective of “the number of policy evaluation steps”:

$$\begin{aligned} \text{Policy Iteration: } & \pi_0 \xrightarrow{PE} v_{\pi_0} \xrightarrow{PI} \pi_1 \xrightarrow{PE} v_{\pi_1} \xrightarrow{PI} \pi_2 \dots \\ \text{Value Iteration: } & \phantom{\pi_0 \xrightarrow{PE}} v_0 \xrightarrow{PU} \pi_1' \xrightarrow{VU} v_1 \xrightarrow{PU} \pi_2' \dots \end{aligned}$$

  • Policy Iteration: $\text{P} \to \text{vvvv...} \to \text{P} \to \text{vvvv...}$ (Evaluation to convergence)
  • Value Iteration: $\text{P} \to \text{v} \to \text{P} \to \text{v} \to \text{P} \to \text{v}$ (Evaluation for only one step)

Algorithm comparison under a unified perspective:

$$\begin{array}{rll} & v_{\pi_1}^{(0)} = v_0 & \text{Initial value} \\ \text{Value Iteration} \leftarrow v_1 \longleftarrow & v_{\pi_1}^{(1)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)} & \text{(Iterate only 1 time)} \\ & v_{\pi_1}^{(2)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)} & \\ & \quad \vdots & \\ \text{Truncated Policy Iteration} \leftarrow \bar{v}_1 \longleftarrow & v_{\pi_1}^{(j)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)} & \text{(Iterate } j \text{ times)} \\ & \quad \vdots & \\ \text{Policy Iteration} \leftarrow v_{\pi_1} \longleftarrow & v_{\pi_1}^{(\infty)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)} & \text{(Iterate to convergence)} \end{array}$$

Note: Standard Policy Iteration requires an exact solution for $v_{\pi_k}$ at each step (i.e., $j \to \infty$), which is often infeasible or inefficient in practice. Truncated Policy Iteration, which caps the number of evaluation steps at a finite $j$, is therefore what is commonly used.
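
A sketch of the truncated variant, again under the assumed `[S, A, S]` / `[S, A]` layout; `j_truncate` is the evaluation budget per outer step, and the outer stopping test is an addition of ours. Setting `j_truncate = 1` recovers Value Iteration, while a very large `j_truncate` approaches Policy Iteration.

```python
import numpy as np

def truncated_policy_iteration(P, r, gamma, j_truncate=5, max_iter=1_000, tol=1e-8):
    """Truncated policy iteration sketch: only j_truncate evaluation sweeps per step."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Policy improvement: greedy policy with respect to the current v
        q = r + gamma * P @ v
        a_star = q.argmax(axis=1)
        # Truncated policy evaluation: j_truncate sweeps under the greedy policy
        r_pi = r[np.arange(n_states), a_star]        # r_pi(s) = r(s, a_star(s))
        P_pi = P[np.arange(n_states), a_star]        # P_pi(s, s') = p(s'|s, a_star(s))
        v_prev = v
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v - v_prev)) < tol:         # outer loop has converged
            break
    return v
```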
