# Value Iteration and Policy Iteration
## Value Iteration
Value Iteration approximates the optimal value function $v_*$ by iteratively updating the value estimate $v_k$. The core iteration formula is:
$$v_{k+1} = f(v_k) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k), \quad k = 0, 1, 2, \dots$$
This process can be decomposed into two steps:
**Policy Update**:

$$\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k)$$
**Value Update**:

$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
Note: Here, $v_k$ represents the estimated value vector at the $k$-th iteration, not the final state value.
### 1. Policy Update
$$\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k)$$
Its elementwise form is:
$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}$$
The resulting $\pi_{k+1}$ is a greedy policy: in each state $s$, it selects the action $a_k^*(s)$ that maximizes $q_k(s,a)$:

$$a_k^*(s) = \arg\max_{a} q_k(s, a)$$

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \neq a_k^*(s) \end{cases}$$
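As a concrete illustration, below is a minimal sketch of this policy update step for a tabular MDP. It assumes a transition tensor `P[s, a, s']` holding $p(s'|s,a)$, an expected-reward matrix `r[s, a]` holding $\sum_r p(r|s,a)\,r$, and a discount factor `gamma`; these names and array layouts are illustrative, not fixed by the text.

```python
import numpy as np

def policy_update(P, r, gamma, v_k):
    """Greedy policy update: pi_{k+1}(s) = argmax_a q_k(s, a).

    P:   (|S|, |A|, |S|) transition probabilities p(s'|s, a)   [assumed layout]
    r:   (|S|, |A|) expected immediate rewards E[r | s, a]     [assumed layout]
    v_k: (|S|,) current value estimate
    """
    # q_k(s, a) = E[r | s, a] + gamma * sum_{s'} p(s'|s, a) * v_k(s')
    q_k = r + gamma * (P @ v_k)        # shape (|S|, |A|)
    return np.argmax(q_k, axis=1)      # greedy action a_k^*(s) for every state
```

Returning one action index per state encodes the deterministic policy $\pi_{k+1}(a|s)$ above.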
### 2. Value Update
$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
Its elementwise form is:
$$v_{k+1}(s) = \sum_{a} \pi_{k+1}(a|s) \underbrace{\left( \sum_{r} p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}$$
Since $\pi_{k+1}$ is a greedy policy, the above equation is equivalent to:

$$v_{k+1}(s) = \max_{a} q_k(s, a)$$
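Putting the two steps together gives the full value iteration loop. The sketch below uses the same assumed `P`, `r`, `gamma` layout as above and stops once successive value vectors are sufficiently close.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Value iteration: alternate policy update and value update until v_k converges."""
    v = np.zeros(r.shape[0])                 # arbitrary initial guess v_0
    while True:
        q = r + gamma * (P @ v)              # q_k(s, a) for all states and actions
        pi = np.argmax(q, axis=1)            # policy update: greedy w.r.t. q_k
        v_new = q.max(axis=1)                # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:  # ||v_{k+1} - v_k|| small enough
            return v_new, pi
        v = v_new
```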
## Policy Iteration
Policy Iteration consists of the following steps:
**Initialization**: Start with an arbitrary initial policy $\pi_0$.
**Policy Evaluation (PE)**: Calculate the state value $v_{\pi_k}$ of the current policy:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
**Policy Improvement (PI)**: Generate a better policy based on the current value:

$$\pi_{k+1} = \arg\max_{\pi} (r_\pi + \gamma P_\pi v_{\pi_k})$$
### 1. Policy Evaluation
Solving for $v_{\pi_k}$ usually employs an iterative method:
**Matrix-vector form**:

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots$$
**Elementwise form**:

$$v_{\pi_k}^{(j+1)}(s) = \sum_{a} \pi_k(a|s) \left( \sum_{r} p(r|s, a)r + \gamma \sum_{s'} p(s'|s, a)v_{\pi_k}^{(j)}(s') \right), \quad s \in \mathcal{S}$$
Stopping condition: in theory the iteration runs until $j \to \infty$; in practice, stop when $\| v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} \|$ is sufficiently small.
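A minimal sketch of this iterative evaluation, under the same assumed `P`, `r`, `gamma` layout as before and with a deterministic policy `pi` given as an array of action indices:

```python
import numpy as np

def policy_evaluation(P, r, gamma, pi, tol=1e-8):
    """Iteratively solve v_pi = r_pi + gamma * P_pi v_pi for a deterministic policy pi."""
    n = r.shape[0]
    r_pi = r[np.arange(n), pi]               # r_pi(s)     = E[r | s, pi(s)]
    P_pi = P[np.arange(n), pi]               # P_pi(s, s') = p(s' | s, pi(s))
    v = np.zeros(n)                          # v_pi^(0), arbitrary initial guess
    while True:
        v_new = r_pi + gamma * P_pi @ v      # one evaluation sweep
        if np.max(np.abs(v_new - v)) < tol:  # stop when the update is sufficiently small
            return v_new
        v = v_new
```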
### 2. Policy Improvement
**Matrix-vector form**:

$$\pi_{k+1} = \arg\max_{\pi} (r_\pi + \gamma P_\pi v_{\pi_k})$$
**Elementwise form**:

$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s, a)r + \gamma \sum_{s'} p(s'|s, a)v_{\pi_k}(s') \right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S}$$
Let $a_k^*(s) = \arg\max_{a} q_{\pi_k}(s, a)$; the updated policy is then a deterministic greedy policy:

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s) \\ 0, & a \neq a_k^*(s) \end{cases}$$
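Combining the two steps gives the loop below, which reuses the `policy_evaluation` sketch from the previous subsection (same assumptions) and stops once the greedy policy no longer changes.

```python
import numpy as np

def policy_iteration(P, r, gamma, tol=1e-8):
    """Policy iteration: evaluation (to tolerance) followed by greedy improvement."""
    pi = np.zeros(r.shape[0], dtype=int)                  # arbitrary initial policy pi_0
    while True:
        v_pi = policy_evaluation(P, r, gamma, pi, tol)    # Policy Evaluation (PE)
        q_pi = r + gamma * (P @ v_pi)                     # q_{pi_k}(s, a)
        pi_new = np.argmax(q_pi, axis=1)                  # Policy Improvement (PI): greedy
        if np.array_equal(pi_new, pi):                    # policy no longer changes: stop
            return v_pi, pi
        pi = pi_new
```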
## Truncated Policy Iteration
We can unify Value Iteration and Policy Iteration from the perspective of “the number of policy evaluation steps”:
$$\begin{aligned}
\text{Policy Iteration: } & \pi_0 \xrightarrow{PE} v_{\pi_0} \xrightarrow{PI} \pi_1 \xrightarrow{PE} v_{\pi_1} \xrightarrow{PI} \pi_2 \dots \\
\text{Value Iteration: } & \phantom{\pi_0 \xrightarrow{PE}} v_0 \xrightarrow{PU} \pi_1' \xrightarrow{VU} v_1 \xrightarrow{PU} \pi_2' \dots
\end{aligned}$$
**Policy Iteration**: $\text{P} \to \text{vvvv...} \to \text{P} \to \text{vvvv...}$ (evaluation to convergence)

**Value Iteration**: $\text{P} \to \text{v} \to \text{P} \to \text{v} \to \text{P} \to \text{v}$ (evaluation for only one step)
Algorithm comparison under a unified perspective:
$$\begin{array}{rll}
& v_{\pi_1}^{(0)} = v_0 & \text{Initial value} \\
\text{Value Iteration} \leftarrow v_1 \longleftarrow & v_{\pi_1}^{(1)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)} & \text{(Iterate only 1 time)} \\
& v_{\pi_1}^{(2)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)} & \\
& \quad \vdots & \\
\text{Truncated Policy Iteration} \leftarrow \bar{v}_1 \longleftarrow & v_{\pi_1}^{(j)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)} & \text{(Iterate } j \text{ times)} \\
& \quad \vdots & \\
\text{Policy Iteration} \leftarrow v_{\pi_1} \longleftarrow & v_{\pi_1}^{(\infty)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)} & \text{(Iterate to convergence)}
\end{array}$$
**Note**: Standard Policy Iteration requires an exact solution for $v_{\pi_k}$ at each step (i.e., $j \to \infty$), which is often infeasible or inefficient in practical calculations. Therefore, Truncated Policy Iteration, which limits the number of evaluation steps $j$, is commonly used in practice.
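As a sketch under the same assumptions as before (`P[s, a, s']`, `r[s, a]`, `gamma`), the loop below performs the greedy policy update and then runs at most `j_truncate` evaluation sweeps, warm-started from the previous value vector. Setting `j_truncate = 1` recovers value iteration, and letting it grow large approaches policy iteration, matching the unified view above.

```python
import numpy as np

def truncated_policy_iteration(P, r, gamma, j_truncate=10, num_iters=500):
    """Truncated policy iteration: at most j_truncate evaluation sweeps per policy."""
    n = r.shape[0]
    v = np.zeros(n)                                   # initial value estimate v_0
    for _ in range(num_iters):
        # Policy update: greedy with respect to the current value estimate.
        pi = np.argmax(r + gamma * (P @ v), axis=1)
        # Truncated policy evaluation: j_truncate sweeps, warm-started from the previous v.
        r_pi = r[np.arange(n), pi]                    # r_pi(s)     = E[r | s, pi(s)]
        P_pi = P[np.arange(n), pi]                    # P_pi(s, s') = p(s' | s, pi(s))
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
    return v, pi
```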