https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=2
Learning Objectives
1. Memorize the definitions of MP, MRP, MDP, Bellman operator, contraction operator, model, Q-value, and policy
2. Be able to compute Value Iteration and Policy Iteration by hand
3. Know the pros and cons of the various policy evaluation approaches
4. Be able to prove the contraction properties
5. Know the limitations of MP, MRP, MDP and the Markov assumption - which policy evaluation methods require the Markov assumption?
Markov Process
1. Memoryless random process
- Sequence of random states with Markov property
2. Definition of Markov Process
- S : (finite) set of states (s ∈ S)
- P : dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
3. Note : no rewards, no actions
4. If finite number (N) of states, can express P as a matrix
Example of Markov Process
s1 : right = 0.4, rotate = 0.6
s2 : right = 0.4, rotate = 0.2, left = 0.4
s3 : right = 0.4, rotate = 0.2, left = 0.4
s4 : right = 0.4, rotate = 0.2, left = 0.4
s5 : right = 0.4, rotate = 0.2, left = 0.4
s6 : right = 0.4, rotate = 0.2, left = 0.4
s7 : left = 0.4, rotate = 0.6
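Since the number of states is finite, the dynamics above can be written as a transition matrix. Below is a minimal NumPy sketch that builds a 7×7 matrix from the rows above (assuming "right"/"left" mean moving to the neighboring state in the chain and "rotate" means staying put, which is my reading of the example) and samples a short trajectory; the start state and trajectory length are arbitrary choices.

```python
import numpy as np

# 7x7 transition matrix for the example chain above (rows sum to 1).
P = np.zeros((7, 7))
P[0, 0], P[0, 1] = 0.6, 0.4                    # s1: rotate 0.6, right 0.4
for s in range(1, 6):                          # s2..s6: left 0.4, rotate 0.2, right 0.4
    P[s, s - 1], P[s, s], P[s, s + 1] = 0.4, 0.2, 0.4
P[6, 5], P[6, 6] = 0.4, 0.6                    # s7: left 0.4, rotate 0.6

assert np.allclose(P.sum(axis=1), 1.0)         # sanity check: valid stochastic matrix

# Sample a short trajectory; the next state depends only on the current one (Markov property).
rng = np.random.default_rng(0)
s = 3                                          # start in s4 (index 3)
trajectory = [s]
for _ in range(10):
    s = rng.choice(7, p=P[s])
    trajectory.append(s)
print([f"s{i + 1}" for i in trajectory])
```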
Markov Reward Processes (MRPs)
1. Markov Reward Process is a Markov Chain + rewards
2. Definition of Markov Reward Process (MRP)
- S : (finite) set of states (s ∈ S)
- P : dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
- R : reward function R(s_t = s) = E[r_t | s_t = s]
- Discount factor γ ∈ [0, 1]
3. Note : yes rewards, no actions
4. If finite number (N) of states, can express R as a vector and P as a matrix
Example of Markov Reward Process
s1 : right = 0.4, rotate = 0.6, reward = 1
s2 : right = 0.4, rotate = 0.2, left = 0.4, reward = 0
s3 : right = 0.4, rotate = 0.2, left = 0.4, reward = 0
s4 : right = 0.4, rotate = 0.2, left = 0.4, reward = 0
s5 : right = 0.4, rotate = 0.2, left = 0.4, reward = 0
s6 : right = 0.4, rotate = 0.2, left = 0.4, reward = 0
s7 : left = 0.4, rotate = 0.6, reward = 10
Markov Decision Processes (MDPs)

1. Markov Decision Process is a Markov Reward Process + actions
2. Definition of MDP
- S : (finite) set of states (s ∈ S)
- A : (finite) set of actions (a ∈ A)
- P : dynamics/transition model for each action, P(s_{t+1} = s' | s_t = s, a_t = a)
- R : reward function R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
3. MDP is a tuple : (S, A, P, R, γ)
4. MDP can model a huge number of interesting problems and settings
- Bandits : single state MDP
- Optimal control mostly about continuous-state MDP
- Partially observable MDPs = MDP where state is history
5. Note : yes rewards, yes actions
Example of Markov Decision Process
Return & Value Function
1. Definition of Horizon (H)
- Number of time steps in each episode
- Can be infinite
- Otherwise called finite Markov reward process
2. Definition of Return, G_t (for a MRP)
- Discounted sum of rewards from time step t to horizon H : G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{H-1} r_{t+H-1}
3. Definition of State Value Function, V(s) (for a MRP)
- Expected return from starting in state s : V(s) = E[G_t | s_t = s]
Computing the Value of a MRP (Markov Reward Process)
- Markov property provides structure
- MRP value function satisfies V(s) = R(s) + γ Σ_{s'} P(s'|s) V(s')
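Because this equation is linear in V, a finite MRP can also be solved directly: in matrix form V = R + γPV, so V = (I − γP)^{-1} R. Below is a minimal NumPy sketch of that direct solve, reusing the 7-state example chain and its 1 / 10 rewards from above; the discount factor γ = 0.9 is an arbitrary choice for illustration.

```python
import numpy as np

# Transition matrix P for the 7-state example MRP (rows sum to 1).
P = np.zeros((7, 7))
P[0, 0], P[0, 1] = 0.6, 0.4                      # s1: rotate 0.6, right 0.4
for s in range(1, 6):                            # s2..s6: left 0.4, rotate 0.2, right 0.4
    P[s, s - 1], P[s, s], P[s, s + 1] = 0.4, 0.2, 0.4
P[6, 5], P[6, 6] = 0.4, 0.6                      # s7: left 0.4, rotate 0.6

R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])         # reward per state, from the table
gamma = 0.9                                      # discount factor (assumed)

# Analytic solution of V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V, 2))
```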
Iterative Algorithm for Computing Value of a MRP
- Dynamic programming
- Initialize V_0(s) = 0 for all s
- For k = 1 until convergence
- For all s in S
- V_k(s) = R(s) + γ Σ_{s'} P(s'|s) V_{k-1}(s')
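A minimal sketch of this dynamic-programming loop as a reusable function; the convergence tolerance and the tiny 2-state MRP used to demonstrate it are made up for illustration.

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iteratively compute V for an MRP: V_k = R + gamma * P @ V_{k-1}."""
    V = np.zeros(len(R))                        # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V               # Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < tol:     # stop when the update barely changes V
            return V_new
        V = V_new

# Tiny 2-state toy MRP just to show usage (numbers are made up).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([0.0, 1.0])
print(mrp_value_iterative(P, R, gamma=0.9))
```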
MDP Policies
1. Policy specifies what action to take in each state
- Can be deterministic or stochastic
2. For generality, consider as a conditional distribution
- Given a state, specifies a distribution over actions
3. Policy : π(a|s) = P(a_t = a | s_t = s)
MDP + Policy
1. MDP + π(a|s) = Markov Reward Process
2. Precisely, it is the MRP (S, R^π, P^π, γ), where R^π(s) = Σ_a π(a|s) R(s, a) and P^π(s'|s) = Σ_a π(a|s) P(s'|s, a)
3. This implies we can use the same techniques to evaluate the value of a policy for an MDP as we could to compute the value of an MRP, by defining an MRP with R^π and P^π
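A sketch of this reduction, assuming the MDP model is stored as arrays R[s, a] and P[s, a, s'] and the policy as a distribution pi[s, a]; these array conventions (and the toy numbers) are mine, not from the lecture.

```python
import numpy as np

def mdp_plus_policy_to_mrp(P, R, pi):
    """Collapse an MDP with transitions P[s, a, s'] and rewards R[s, a],
    under a stochastic policy pi[s, a] = pi(a|s), into the induced MRP."""
    R_pi = np.einsum("sa,sa->s", pi, R)        # R^pi(s)    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)      # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return P_pi, R_pi

# Toy 2-state, 2-action MDP (numbers made up) and a uniform policy.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.full((2, 2), 0.5)
P_pi, R_pi = mdp_plus_policy_to_mrp(P, R, pi)
print(P_pi, R_pi)
```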
MDP Policy Evaluation, Iterative Algorithm
- Initialize V_0^π(s) = 0 for all s
- For k = 1 until convergence
- For all s in S
- V_k^π(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_{k-1}^π(s') ]
- This is a Bellman backup for a particular policy
- Note that if the policy is deterministic then the above update simplifies to
- V_k^π(s) = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) V_{k-1}^π(s')
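A minimal sketch of this update for a deterministic policy, using the same (assumed) P[s, a, s'] / R[s, a] array layout as above; the tolerance and toy MDP are made up.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iterative evaluation of a deterministic policy (policy[s] = action index).
    Applies V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V_{k-1}(s')."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                           # V_0(s) = 0
    while True:
        V_new = np.array([
            R[s, policy[s]] + gamma * P[s, policy[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:          # converged: update barely changes V
            return V_new
        V = V_new

# Usage with a toy 2-state, 2-action MDP (made-up numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(policy_evaluation(P, R, policy=np.array([1, 0]), gamma=0.9))
```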
Example Iteration of Policy Evaluation 1
- Dynamics :
- Reward : for all actions, +1 in state s1, +10 in state s7, 0 otherwise
- Let
Q. compute
Example Iteration of Policy Evaluation 2
- 7 discrete states
- 2 actions
Q. How many deterministic policies are there?
Q. Is the highest reward policy for a MDP always unique?
MDP Control
- Compute the optimal policy
- π*(s) = arg max_π V^π(s)
- There exists a unique optimal value function
- Optimal policy for a MDP in an infinite horizon problem (agent acts forever) is
- Deterministic
- Stationary (does not depend on time step)
- Unique? Not necessarily, may have two policies with identical (optimal) values
Policy Search
- One option is searching to compute best policy
- Number of deterministic policies is |A|^|S|
- Policy iteration is generally more efficient than enumeration
MDP Policy Iteration (PI)
- set i = 0
- Initialize π_0(s) randomly for all states s
- While i == 0 or ||π_i − π_{i−1}||_1 > 0 (L1-norm : measures whether the policy changed for any state) :
- V^{π_i} ← MDP V function policy evaluation of π_i
- π_{i+1} ← Policy improvement
- i = i + 1
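Putting the two steps together, here is a sketch of the full policy iteration loop under the same assumed P[s, a, s'] / R[s, a] layout; the evaluation tolerance and toy MDP are arbitrary.

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Policy iteration sketch: evaluate the current policy, then improve it
    greedily with respect to Q^pi, until the policy stops changing."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)           # pi_0: arbitrary initial policy
    while True:
        # Policy evaluation: V(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V(s')
        V = np.zeros(n_states)
        while True:
            V_new = R[np.arange(n_states), policy] + gamma * P[np.arange(n_states), policy] @ V
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        V = V_new
        # Policy improvement: pi_{i+1}(s) = argmax_a Q^pi(s, a)
        Q = R + gamma * P @ V                        # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # policy unchanged -> done
            return policy, V
        policy = new_policy

# Toy 2-state, 2-action MDP (made-up numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(policy_iteration(P, R, gamma=0.9))
```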
State Action Value Q
1. State-action value of a policy
- Q^π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V^π(s')
2. Take action a, then follow the policy π thereafter
Policy Improvement
1. Compute the state-action value of a policy π_i
- For s in S and a in A :
- Q^{π_i}(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) V^{π_i}(s')
2. Compute the new policy π_{i+1}, for all s ∈ S
- π_{i+1}(s) = arg max_a Q^{π_i}(s, a)
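The improvement step on its own, as a small function under the same assumed array layout; V_pi stands for the value function obtained from policy evaluation, and the usage numbers are made up.

```python
import numpy as np

def policy_improvement(P, R, V_pi, gamma):
    """Greedy improvement: Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s'),
    then pi_{i+1}(s) = argmax_a Q^pi(s, a)."""
    Q = R + gamma * P @ V_pi            # shape (n_states, n_actions)
    return Q.argmax(axis=1), Q

# Usage with a toy 2-state, 2-action MDP and some evaluated V^pi (made-up numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(policy_improvement(P, R, V_pi=np.array([2.0, 5.0]), gamma=0.9))
```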
Example Policy Iteration
1. If policy doesn't change, can it ever change again?
2. Is there a maximum number of iterations of policy iteration?
Bellman Equation and Bellman Backup Operators
- Value function of a policy must satisfy the Bellman Equation
- V^π(s) = R^π(s) + γ Σ_{s'} P^π(s'|s) V^π(s')
- Bellman backup operator B
- Applied to a value function
- Returns a new value function
- Improves the value if possible
- (BV)(s) = max_a [ R(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]
- BV yields a value function over all states s
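A one-line NumPy version of this backup operator under the same assumed P[s, a, s'] / R[s, a] layout, applied here to V = 0 on a made-up toy MDP.

```python
import numpy as np

def bellman_backup(P, R, V, gamma):
    """Bellman backup: (BV)(s) = max_a [ R(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)

# Usage on a toy 2-state, 2-action MDP (made-up numbers), starting from V = 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(bellman_backup(P, R, V=np.zeros(2), gamma=0.9))
```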
Value Iteration (VI)
1. set k = 1
2. Initialize V_0(s) = 0 for all states s
3. Loop until convergence: (for ex. ||V_{k+1} − V_k||_∞ ≤ ε)
- for each state s
- V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s') ]
- View as Bellman backup on value function
- V_{k+1} = BV_k
- π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_{k+1}(s') ]
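A compact sketch of this loop under the same assumed array layout; the stopping threshold and the toy MDP are arbitrary choices.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration: repeatedly apply the Bellman backup V_{k+1} = B V_k
    until the value function stops changing, then extract a greedy policy."""
    V = np.zeros(R.shape[0])                         # V_0(s) = 0
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)      # V_{k+1}(s) = max_a [R + gamma * P V]
        if np.max(np.abs(V_new - V)) <= eps:         # ||V_{k+1} - V_k||_inf <= eps
            break
        V = V_new
    policy = (R + gamma * P @ V_new).argmax(axis=1)  # greedy policy from the final V
    return V_new, policy

# Toy 2-state, 2-action MDP (made-up numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(value_iteration(P, R, gamma=0.9))
```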
It's starting to make my head spin. I wish I were one of those people who can understand the equations just by looking at them.
Policy Iteration as Bellman Operations
- Bellman backup operator B^π for a particular policy is defined as
- (B^π V)(s) = R^π(s) + γ Σ_{s'} P^π(s'|s) V(s')
- Policy evaluation amounts to computing the fixed point of B^π
- To do policy evaluation, repeatedly apply the operator until V stops changing
- V^π = B^π B^π ... B^π V
- To do Policy Improvement :
- π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V^{π_k}(s') ]
Going Back to Value Iteration (VI)
1. set k = 1
2. Initialize V_0(s) = 0 for all states s
3. Loop until [finite horizon, convergence]
- for each state s
- V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s') ]
- View as Bellman backup on value function
- V_{k+1} = BV_k
- To extract optimal policy if can act for k + 1 more steps :
- π(s) = arg max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_{k+1}(s') ]
Value Iteration for Finite Horizon H
- Initialize V_0(s) = 0 for all states s
- For k = 1 : H
- For each state s
- V_k(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_{k-1}(s') ]
- π_k(s) = arg max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_{k-1}(s') ]
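A sketch of the finite-horizon version; it returns one greedy policy per number of remaining steps k, which is exactly why the optimal policy need not be stationary. Same assumed array layout as above; the horizon and toy MDP are made up.

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Finite-horizon VI: compute V_k and pi_k for k = 1..H, where k = steps remaining."""
    V = np.zeros(R.shape[0])                     # V_0(s) = 0
    policies = []
    for k in range(1, H + 1):
        Q = R + gamma * P @ V                    # Q_k(s, a)
        policies.append(Q.argmax(axis=1))        # pi_k: best action with k steps left
        V = Q.max(axis=1)                        # V_k(s)
    return V, policies

# Toy 2-state, 2-action MDP (made-up numbers), horizon H = 4.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(finite_horizon_value_iteration(P, R, gamma=1.0, H=4))
```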
Computing the Value of a Policy in a Finite Horizon
- Alternatively can estimate by simulation
- Generate a large number of episodes
- Average returns
- Concentration inequalities bound how quickly the average concentrates to the expected value
- Requires no assumption of Markov structure
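A minimal Monte Carlo sketch of this idea: simulate H-step episodes under a fixed policy and average the discounted returns. Here the episodes are sampled from a model only to make the example runnable; in practice the returns could come from any simulator or logged data, with no Markov assumption needed. The array layout, toy MDP, and episode count are my own choices.

```python
import numpy as np

def mc_policy_value(P, R, policy, gamma, H, start_state, n_episodes=10_000, seed=0):
    """Estimate V^pi(start_state) over horizon H by averaging sampled returns."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    returns = np.empty(n_episodes)
    for ep in range(n_episodes):
        s, G, discount = start_state, 0.0, 1.0
        for _ in range(H):                              # one H-step episode
            a = policy[s]
            G += discount * R[s, a]                     # accumulate discounted reward
            s = rng.choice(n_states, p=P[s, a])         # sample next state
            discount *= gamma
        returns[ep] = G
    return returns.mean()

# Toy 2-state, 2-action MDP (made-up numbers), deterministic policy [1, 0].
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(mc_policy_value(P, R, policy=[1, 0], gamma=0.9, H=4, start_state=0))
```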
Example of Finite Horizon H
s1 : right = 0.4, rotate = 0.6
s2 : right = 0.4, rotate = 0.2, left = 0.4
s3 : right = 0.4, rotate = 0.2, left = 0.4
s4 : right = 0.4, rotate = 0.2, left = 0.4
s5 : right = 0.4, rotate = 0.2, left = 0.4
s6 : right = 0.4, rotate = 0.2, left = 0.4
s7 : left = 0.4, rotate = 0.6
- Reward : +1 in s1, +10 in s7, 0 otherwise
- Sample returns for sample 4-step (H = 4) episodes :
-
-
-
Question : Finite Horizon Policies
1. set k = 1
2. Initialize V_0(s) = 0 for all states s
3. Loop until k == H :
- For each state s
- V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s') ]
- π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s') ]
Is the optimal policy stationary (independent of the time step) in finite horizon tasks?
In general no.
Value Iteration vs Policy Iteration
Value Iteration :
- Compute optimal value for horizon = k
- Note this can be used to compute optimal policy if horizon = k
- Increment k
Policy Iteration :
- Compute infinite horizon value of a policy
- Use it to select another (better) policy
- Closely related to a very popular method in RL : Policy gradient