RL Study Notes: Monte Carlo Methods
RL Monte Carlo methods: MC Basic, Exploring Starts, GPI, and epsilon-Greedy for model-free optimization.
## Monte Carlo Methods
### The Simplest MC-based RL Algorithm: MC Basic
- Core Process: Starting from a state-action pair $(s, a)$, follow a policy $\pi$ to generate an episode.
- Return Calculation:
  - The (discounted) return obtained from this episode is denoted $g(s, a)$.
  - $g(s, a)$ is a sample of the return $G_t$, i.e. a sample that can be used to estimate $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$.
- Law of Large Numbers Estimation: If many episodes starting from $(s, a)$ yield a set of return samples $\{ g^{(j)}(s, a) \}_{j=1}^{N}$, then
  $$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{j=1}^{N} g^{(j)}(s, a).$$
- Core Idea: Without a model, data is required; without data, a model is required. In RL, this data is called experience.
- Positioning: MC Basic is a variant of the Policy Iteration algorithm in which the model-based policy-evaluation step is replaced by Monte Carlo estimation of the action values (a minimal sketch follows this list).
- Drawbacks: MC Basic is too inefficient and is rarely used directly in practice.
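To make the procedure concrete, here is a minimal sketch of MC Basic in Python. The simulator function `step(s, a, rng)` returning `(next_state, reward)` is an assumed interface, and the hyperparameters are illustrative rather than taken from the notes.

```python
import numpy as np

def mc_basic(step, n_states, n_actions, gamma=0.9,
             n_iterations=10, episodes_per_pair=30, episode_len=50, seed=0):
    """MC Basic: policy iteration with the model-based policy-evaluation
    step replaced by Monte Carlo estimation of q_pi(s, a).
    `step(s, a, rng)` is an assumed simulator returning (next_state, reward)."""
    rng = np.random.default_rng(seed)
    policy = np.zeros(n_states, dtype=int)            # deterministic policy: state -> action

    for _ in range(n_iterations):
        q = np.zeros((n_states, n_actions))
        # Policy evaluation: estimate q_{pi_k}(s, a) by averaging sampled returns.
        for s in range(n_states):
            for a in range(n_actions):
                returns = []
                for _ in range(episodes_per_pair):
                    g, discount = 0.0, 1.0
                    s_t, a_t = s, a                   # episode starts from the pair (s, a)
                    for _ in range(episode_len):      # sufficiently long, finite episode
                        s_next, r = step(s_t, a_t, rng)
                        g += discount * r
                        discount *= gamma
                        s_t, a_t = s_next, policy[s_next]   # follow pi_k afterwards
                    returns.append(g)
                q[s, a] = np.mean(returns)            # law-of-large-numbers estimate
        # Policy improvement: act greedily with respect to the estimated q values.
        policy = q.argmax(axis=1)
    return policy, q
```

The triple loop over state-action pairs and episodes is exactly why MC Basic is data-inefficient in practice.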
#### Considerations on Episode Length
- Too Short: If episodes are too short, only states close enough to the target obtain accurate return estimates, so only those states find the optimal policy.
- Increasing Length: As the episode length increases, states farther from the target are covered, and the optimal policy can eventually be found from every state.
- Conclusion: Episodes must be sufficiently long, but they do not need to be infinitely long.
### MC Exploring Starts
#### Sampling Sequence Example

Following a policy $\pi$, a sampled episode takes the form
$$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_{T-1}} s_T,$$
i.e. a sequence of state-action pairs $(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots$
#### Definition of a Visit
In an episode, every occurrence of a state-action pair is called a visit. Based on the decomposition of the sequence above, the sub-episode starting at each visited pair $(s_t, a_t)$ yields its own discounted return, which is a sample of $q_\pi(s_t, a_t)$; a single episode therefore provides return samples for every pair it visits.
#### Data Efficiency Methods
- First-visit: Use only the return following the first occurrence of each state-action pair in an episode.
- Every-visit: Use the return following every occurrence of a state-action pair, extracting more samples from each episode. Both variants are sketched below.
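A small sketch of the two variants, assuming an episode is represented as a list of `(state, action, reward)` tuples; the function name `mc_return_samples` is illustrative.

```python
from collections import defaultdict

def mc_return_samples(episode, gamma=0.9, first_visit=True):
    """Collect return samples g(s, a) from a single episode given as a
    list of (state, action, reward) tuples."""
    # Discounted return obtained from step t onward, accumulated backwards.
    returns_from = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        _, _, r = episode[t]
        g = r + gamma * g
        returns_from[t] = g

    samples = defaultdict(list)
    visited = set()
    for t, (s, a, _) in enumerate(episode):
        if first_visit and (s, a) in visited:
            continue                     # first-visit: skip repeated occurrences
        visited.add((s, a))
        samples[(s, a)].append(returns_from[t])
    return samples                       # average each list to estimate q(s, a)
```

Every-visit extracts more samples from each episode (better data efficiency) at the cost of using correlated samples.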
### Generalized Policy Iteration (GPI)
Considerations on Update Timing:
- Wait until all episodes are collected, then average them for estimation.
- Or, update the estimate immediately after each single episode is obtained, refining it episode by episode.
Core Concepts of GPI:
- GPI is not a specific algorithm, but a general term/framework.
- It embodies the process of continuously switching between Policy Evaluation and Policy Improvement.
- Many model-based and model-free algorithms fall within this framework (a concrete sketch follows this list).
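As one concrete instance of GPI, here is a sketch in the MC Exploring Starts style that updates after every single episode. The helper `sample_episode(policy, rng)` is an assumption: it is taken to start from a randomly chosen state-action pair and return a list of `(state, action, reward)` tuples.

```python
import numpy as np

def mc_exploring_starts(sample_episode, n_states, n_actions,
                        gamma=0.9, n_episodes=5000, seed=0):
    """GPI with Monte Carlo evaluation: evaluation and improvement are
    interleaved after every single episode."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    policy = rng.integers(n_actions, size=n_states)   # random initial policy

    for _ in range(n_episodes):
        episode = sample_episode(policy, rng)         # exploring start, then follow policy
        g = 0.0
        for s, a, r in reversed(episode):             # every-visit, backwards accumulation
            g = r + gamma * g
            # Policy evaluation: incremental (running-average) update of q(s, a).
            counts[s, a] += 1
            q[s, a] += (g - q[s, a]) / counts[s, a]
            # Policy improvement: immediately act greedily in the visited state.
            policy[s] = q[s].argmax()
    return policy, q
```

The two commented steps mark the continuous switching between policy evaluation and policy improvement that defines GPI.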
### MC $\epsilon$-Greedy
#### Soft Policies
- Definition: A policy is called a “soft policy” if it selects every action in every state with non-zero probability.
- Role: With a soft policy, a few sufficiently long episodes can visit every state-action pair, so there is no need to generate a massive number of episodes starting from every state-action pair; this removes the strong Exploring Starts assumption.
#### $\epsilon$-Greedy Policy
The $\epsilon$-greedy policy is defined as
$$\pi(a \mid s) =
\begin{cases}
1 - \dfrac{\epsilon}{|\mathcal{A}(s)|}\bigl(|\mathcal{A}(s)| - 1\bigr), & \text{for the greedy action,} \\[4pt]
\dfrac{\epsilon}{|\mathcal{A}(s)|}, & \text{for each of the other } |\mathcal{A}(s)| - 1 \text{ actions,}
\end{cases}$$
where $\epsilon \in [0, 1]$ and $|\mathcal{A}(s)|$ is the total number of available actions in state $s$.
Balancing Exploitation and Exploration (see the sampling sketch below):
- When $\epsilon = 0$: the policy becomes greedy (full exploitation).
- When $\epsilon = 1$: the policy becomes a uniform distribution over actions (full exploration).
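A minimal sketch of sampling from the $\epsilon$-greedy distribution defined above; the function name and the example q-values are illustrative.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Sample an action: the greedy action gets probability
    1 - (|A| - 1) * eps / |A|, every other action gets eps / |A|."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_values))] = 1 - (n_actions - 1) * epsilon / n_actions
    return rng.choice(n_actions, p=probs)

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])
epsilon_greedy_action(q_values, 0.0, rng)   # always returns action 1 (pure exploitation)
epsilon_greedy_action(q_values, 1.0, rng)   # uniform over {0, 1, 2} (pure exploration)
```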
#### Policy Improvement
The goal of the improvement step is to maximize the expected action value within the set $\Pi_\epsilon$ of all $\epsilon$-greedy policies:
$$\pi_{k+1}(s) = \arg\max_{\pi \in \Pi_\epsilon} \sum_{a} \pi(a \mid s)\, q_{\pi_k}(s, a).$$
The derived update rule assigns the greatest probability to the greedy action $a_k^*(s) = \arg\max_a q_{\pi_k}(s, a)$:
$$\pi_{k+1}(a \mid s) =
\begin{cases}
1 - \dfrac{\epsilon}{|\mathcal{A}(s)|}\bigl(|\mathcal{A}(s)| - 1\bigr), & a = a_k^*(s), \\[4pt]
\dfrac{\epsilon}{|\mathcal{A}(s)|}, & a \neq a_k^*(s).
\end{cases}$$
Conclusion: By introducing the $\epsilon$-greedy policy, the exploring-starts condition (the requirement that episodes start from every state-action pair to ensure coverage) is no longer needed.
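Putting the pieces together, here is a minimal end-to-end sketch of MC $\epsilon$-Greedy, assuming the same hypothetical simulator interface `step(s, a, rng) -> (next_state, reward)` as before; the fixed start state and hyperparameters are illustrative.

```python
import numpy as np

def mc_epsilon_greedy(step, n_states, n_actions, start_state=0, gamma=0.9,
                      epsilon=0.1, n_episodes=2000, episode_len=100, seed=0):
    """MC epsilon-Greedy: episodes are generated by the current soft policy
    from a fixed start state, so no exploring starts are required."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    greedy = np.zeros(n_states, dtype=int)          # greedy action a*(s) of the policy

    def sample_action(s):
        # epsilon-greedy distribution built on the current greedy action.
        probs = np.full(n_actions, epsilon / n_actions)
        probs[greedy[s]] = 1 - (n_actions - 1) * epsilon / n_actions
        return int(rng.choice(n_actions, p=probs))

    for _ in range(n_episodes):
        # Generate one episode with the soft policy; long episodes ensure coverage.
        s, episode = start_state, []
        for _ in range(episode_len):
            a = sample_action(s)
            s_next, r = step(s, a, rng)
            episode.append((s, a, r))
            s = s_next
        # Every-visit evaluation and epsilon-greedy improvement, episode by episode.
        g = 0.0
        for s_t, a_t, r in reversed(episode):
            g = r + gamma * g
            counts[s_t, a_t] += 1
            q[s_t, a_t] += (g - q[s_t, a_t]) / counts[s_t, a_t]
            greedy[s_t] = int(q[s_t].argmax())      # stay within the eps-greedy policy class
    return greedy, q
```

Because the behavior policy stays soft, every state-action pair keeps a non-zero probability of being visited, which is exactly what replaces the exploring-starts condition.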