
Bellman Optimality Equation#

Definition#

For all states $s \in \mathcal{S}$, if the state value function $v_{\pi^*}(s)$ of policy $\pi^*$ is not less than the state value function $v_{\pi}(s)$ of any other policy $\pi$, that is:

$$v_{\pi^*}(s) \geq v_{\pi}(s), \quad \forall s \in \mathcal{S},\ \forall \pi$$

Then policy $\pi^*$ is called the optimal policy. The state value corresponding to the optimal policy is called the optimal state value function, denoted as $v^*(s)$.

Derivation of the Optimality Equation#

The optimal state value function $v^*(s)$ satisfies the Bellman Optimality Equation. The core idea is: the optimal value equals the expected return obtained by taking the optimal action in the current state.

Scalar Form#

$$
\begin{aligned}
v^*(s) &= \max_{a \in \mathcal{A}} q^*(s, a) \\
&= \max_{a \in \mathcal{A}} \left( \sum_{r \in \mathcal{R}} p(r|s,a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^*(s') \right)
\end{aligned}
$$
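As a concrete illustration of this backup, the sketch below (a minimal example with made-up dynamics arrays `p_r`, `p_s`, and `rewards`, none of which come from the post) computes $q^*(s,a)$ for each action from the reward and transition distributions and then takes the maximum over actions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only for illustration.
# p_r[s, a, i]  : probability of the i-th reward value given (s, a)
# rewards[i]    : the possible reward values
# p_s[s, a, t]  : transition probability p(s' = t | s, a)
gamma = 0.9
rewards = np.array([0.0, 1.0])
p_r = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.3, 0.7]],
                [[0.6, 0.4], [0.0, 1.0]]])

v = np.array([0.5, 2.0])  # some current estimate of v*(s')

def bellman_optimality_backup(s, v):
    """v(s) <- max_a [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]"""
    q = p_r[s] @ rewards + gamma * (p_s[s] @ v)  # q[a] for every action a
    return q.max(), q.argmax()

print(bellman_optimality_backup(0, v))  # (backed-up value, greedy action) for state 0
```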

If written as a maximization over the policy $\pi$, we have:

$$v^*(s) = \max_{\pi} \sum_{a \in \mathcal{A}} \pi(a|s)\, q^*(s, a)$$

Since a weighted average cannot exceed the maximum value:

$$\sum_{a \in \mathcal{A}} \pi(a|s)\, q^*(s, a) \leq \max_{a \in \mathcal{A}} q^*(s, a)$$

Equality holds exactly when the policy $\pi$ assigns all of its probability to an action that maximizes $q^*(s,a)$. This implies that an optimal policy $\pi^*$ can always be chosen to be deterministic:

$$\pi^*(a|s) = \begin{cases} 1, & a = \arg\max_{a' \in \mathcal{A}} q^*(s, a') \\ 0, & \text{otherwise} \end{cases}$$
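A quick numeric check of this argument (with made-up $q^*$ values, purely illustrative): any stochastic policy's weighted average of $q^*(s,\cdot)$ is bounded by the maximum, and the one-hot greedy policy attains it.

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])          # hypothetical q*(s, a) for three actions

pi_random = np.array([0.2, 0.5, 0.3])  # an arbitrary stochastic policy pi(a|s)
pi_greedy = np.zeros_like(q)
pi_greedy[q.argmax()] = 1.0            # one-hot greedy policy

print(pi_random @ q)  # 2.3 -- weighted average, below the maximum
print(pi_greedy @ q)  # 3.0 -- equals max_a q*(s, a)
print(q.max())        # 3.0
```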

Vector Form#

We treat solving for $v^*$ as applying an operator. Defining the Bellman optimality operator $\mathcal{T}^*$, the Bellman Optimality Equation is a fixed-point equation:

$$v^* = \mathcal{T}^*(v^*)$$

Expanded explicitly:

$$v^* = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v^* \right)$$

Where:

  • $r_{\pi}$ is the average immediate reward vector under policy $\pi$, defined as $[r_{\pi}]_s = \sum_{a} \pi(a|s) \sum_{r} p(r|s,a)\, r$
  • $P_{\pi}$ is the state transition matrix under policy $\pi$, defined as $[P_{\pi}]_{s,s'} = \sum_{a} \pi(a|s)\, p(s'|s,a)$
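In the tabular case these objects are easy to build explicitly. The sketch below (hypothetical arrays `r_sa`, `p_sas`, and `pi`, chosen only for illustration) constructs $r_{\pi}$ and $P_{\pi}$ for a given stochastic policy and applies one step of $r_{\pi} + \gamma P_{\pi} v$.

```python
import numpy as np

gamma = 0.9

# Hypothetical tabular model, for illustration only.
r_sa = np.array([[0.1, 0.8],                  # expected immediate reward r(s, a) = sum_r p(r|s,a) r
                 [0.5, 0.0]])
p_sas = np.array([[[1.0, 0.0], [0.3, 0.7]],   # transition probabilities p(s'|s, a)
                  [[0.6, 0.4], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],                    # a stochastic policy pi(a|s)
               [0.2, 0.8]])

# [r_pi]_s = sum_a pi(a|s) * r(s, a)
r_pi = (pi * r_sa).sum(axis=1)                # shape (n_states,)

# [P_pi]_{s,s'} = sum_a pi(a|s) * p(s'|s, a)
P_pi = np.einsum("sa,sat->st", pi, p_sas)     # shape (n_states, n_states), rows sum to 1

v = np.zeros(r_sa.shape[0])
print(r_pi + gamma * P_pi @ v)                # one application of r_pi + gamma * P_pi * v
```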

Contraction Mapping and Fixed Points#

The Bellman optimality operator $\mathcal{T}^*$ is a contraction mapping whenever $\gamma \in [0, 1)$. By the Contraction Mapping Theorem, this implies:

  1. Existence and uniqueness: There exists a unique fixed point $v^*$ satisfying $v^* = \mathcal{T}^*(v^*)$.
  2. Convergence: For any initial value $v_0$, the iterative sequence $v_{k+1} = \mathcal{T}^*(v_k)$ converges to $v^*$.
    • That is, $\lim_{k \to \infty} v_k = v^*$.
    • The convergence is geometric (exponentially fast), at a rate controlled by the discount factor $\gamma$.
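To make "geometric" concrete (this is the standard contraction argument, spelled out here rather than taken from the original post): since $\mathcal{T}^*$ is a $\gamma$-contraction in the sup-norm,

$$\| \mathcal{T}^*(u) - \mathcal{T}^*(v) \|_\infty \leq \gamma\, \| u - v \|_\infty,$$

applying it $k$ times to $v_k = \mathcal{T}^*(v_{k-1})$ and to the fixed point $v^* = \mathcal{T}^*(v^*)$ gives

$$\| v_k - v^* \|_\infty \leq \gamma^k\, \| v_0 - v^* \|_\infty \to 0 \quad (k \to \infty).$$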

The Essence of Value Iteration#

The Value Iteration algorithm utilizes the aforementioned fixed-point property with the iterative formula:

$$v_{k+1} = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right)$$

This step actually contains two implicit processes:

  1. Implicit Policy Improvement: Based on the current value estimate $v_k$, find the greedy policy, i.e., select the action that currently appears to have the highest $q$ value.

    $$\pi_{\text{greedy}} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right)$$
  2. Value Update (Policy Evaluation): Assuming the greedy action is taken, calculate its one-step expected return as the new value estimate $v_{k+1}$.

Summary: Value Iteration essentially “seizes” the currently best action in every round, calculates its value, and then “seizes” the best action again in the next round based on the new value.
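A minimal tabular Value Iteration sketch that makes the two implicit steps explicit (the MDP arrays and variable names are hypothetical, reused from the earlier sketches; an illustration under those assumptions, not the post's own code):

```python
import numpy as np

gamma = 0.9

# Hypothetical tabular MDP, for illustration only.
r_sa = np.array([[0.1, 0.8],                  # expected immediate reward r(s, a)
                 [0.5, 0.0]])
p_sas = np.array([[[1.0, 0.0], [0.3, 0.7]],   # transition probabilities p(s'|s, a)
                  [[0.6, 0.4], [0.0, 1.0]]])

v = np.zeros(r_sa.shape[0])                   # v_0: any initial guess converges
for k in range(1000):
    # q_k(s, a) = r(s, a) + gamma * sum_s' p(s'|s,a) v_k(s')
    q = r_sa + gamma * np.einsum("sat,t->sa", p_sas, v)
    greedy = q.argmax(axis=1)                 # (1) implicit policy improvement: greedy policy
    v_new = q.max(axis=1)                     # (2) value update: v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:     # geometric convergence => stops quickly
        v = v_new
        break
    v = v_new

print("v*:", v, "greedy policy:", greedy)
```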

Determinants of the Optimal Policy#

The optimal policy $\pi^*$ is determined by the following formula:

$$\pi^*(s) = \arg\max_{a} \left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v^*(s') \right)$$

Key Factors#

  1. System Dynamics: $p(s'|s, a)$ and $p(r|s, a)$. These are the physical laws of the environment and are generally immutable.
  2. Discount Factor $\gamma$:
    • $\gamma \to 0$: The agent becomes “myopic,” focusing only on immediate rewards.
    • $\gamma \to 1$: The agent becomes “far-sighted,” valuing long-term cumulative returns (a toy example follows this list).
  3. Reward Function $r$:
    • The relative numerical values of the rewards are more important than their absolute values.
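The toy example for the discount factor (my own illustration, not from the post): suppose action $A$ yields reward $1$ immediately and $0$ afterwards, while action $B$ yields $0$ now and $1$ on every subsequent step. Then

$$q(A) = 1, \qquad q(B) = 0 + \gamma + \gamma^2 + \cdots = \frac{\gamma}{1-\gamma},$$

so a myopic agent with $\gamma = 0.3$ prefers $A$ (since $q(B) \approx 0.43 < 1$), while a far-sighted agent with $\gamma = 0.9$ prefers $B$ (since $q(B) = 9 > 1$).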

Affine Transformation of the Reward Function#

If an affine transformation is applied to the reward function:

$$r'(s, a, s') = \alpha \cdot r(s, a, s') + \beta$$

where $\alpha > 0$ and $\beta$ is a constant.

  • Impact on Value Function: The new value function $v'$ is an affine function of the original value function $v$, since the constant shift $\beta$ accumulates over time as $\sum_{t=0}^{\infty} \gamma^t \beta = \frac{\beta}{1-\gamma}$:
$$v'(s) = \alpha v(s) + \frac{\beta}{1-\gamma}$$
  • Impact on Policy: The optimal policy remains unchanged, because $\alpha > 0$ preserves the ordering of the $q$ values and the constant shift is the same for every action:
$$\arg\max_a q'(s,a) = \arg\max_a \left( \alpha\, q(s,a) + \frac{\beta}{1-\gamma} \right) = \arg\max_a q(s,a)$$

This indicates that as long as the transformation preserves the ordering of the rewards (which any positive scaling plus a constant shift does), the specific numerical magnitudes do not alter the optimal behavior pattern.
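A quick numerical check of both claims (a sketch on a randomly generated tabular MDP; all names, sizes, and constants are illustrative assumptions): solve for $v^*$ under $r$ and under $r' = \alpha r + \beta$ with value iteration, then compare.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9
alpha, beta = 2.0, 5.0

# Random tabular MDP (illustrative): expected rewards and transition probabilities.
r_sa = rng.normal(size=(n_s, n_a))
p_sas = rng.random((n_s, n_a, n_s))
p_sas /= p_sas.sum(axis=2, keepdims=True)

def value_iteration(r_sa):
    """Iterate v <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ] to convergence."""
    v = np.zeros(n_s)
    for _ in range(10_000):
        q = r_sa + gamma * np.einsum("sat,t->sa", p_sas, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-12:
            break
        v = v_new
    return v_new, q.argmax(axis=1)

v1, pi1 = value_iteration(r_sa)                 # original rewards r
v2, pi2 = value_iteration(alpha * r_sa + beta)  # transformed rewards r' = alpha*r + beta

print(np.allclose(v2, alpha * v1 + beta / (1 - gamma)))  # expected: True, v' = alpha*v + beta/(1-gamma)
print(np.array_equal(pi1, pi2))                          # expected: True, optimal policy unchanged
```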

RL Study Notes: Bellman Optimality Equation
https://en.maxtonniu.com/blog/rl_chapter03
Author Maxton Niu
Published at February 17, 2026