
[강화학습] CS234 class 2

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=2 

 

Learning Objectives

1. Memorize the definitions of MP, MRP, MDP, Bellman operator, contraction operator, model, Q-value, and policy

2. Be able to compute Value Iteration and Policy Iteration

3. Know the pros and cons of the various Policy Evaluation approaches

4. Be able to prove the contraction properties

5. Know the limitations of MP, MRP, MDP and the Markov assumption - which policy evaluation methods require the Markov assumption?

 

Markov Process

1. Memoryless random process

  - Sequence of random states with Markov property

2. Definition of Markov Process

  - $S$ is a (finite) set of states ($s \in S$)

  - $P$ is a dynamics/transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$

3. Note : no rewards, no actions

4. If finite number of states ($N$), can express $P$ as a matrix

$$P = \begin{pmatrix} P(s_1 \mid s_1) & \cdots & P(s_N \mid s_1) \\ \vdots & \ddots & \vdots \\ P(s_1 \mid s_N) & \cdots & P(s_N \mid s_N) \end{pmatrix}$$

 

Example of Markov Process

[Figure: Mars-rover Markov chain with states $s_1, \dots, s_7$. From $s_1$: rotate (stay) 0.6, right 0.4. From $s_2$–$s_6$: left 0.4, rotate 0.2, right 0.4. From $s_7$: left 0.4, rotate 0.6.]

 

$$P = \begin{pmatrix} 0.6 & 0.4 & 0 & 0 & 0 & 0 & 0 \\ 0.4 & 0.2 & 0.4 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0.2 & 0.4 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0.2 & 0.4 & 0 & 0 \\ 0 & 0 & 0 & 0.4 & 0.2 & 0.4 & 0 \\ 0 & 0 & 0 & 0 & 0.4 & 0.2 & 0.4 \\ 0 & 0 & 0 & 0 & 0 & 0.4 & 0.6 \end{pmatrix}$$
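As a quick sanity check on the definition, here is a minimal NumPy sketch (my own, not from the lecture) that builds this transition matrix and samples a trajectory using only the Markov property:

```python
import numpy as np

# 7-state Markov chain from the rover example.
# P[i, j] = probability of moving from state s_{i+1} to state s_{j+1}.
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
state = 3                               # start in s4 (0-indexed)
trajectory = [state]
for _ in range(10):
    # Markov property: the next state depends only on the current state.
    state = rng.choice(7, p=P[state])
    trajectory.append(state)
print([f"s{s + 1}" for s in trajectory])
```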

 

Markov Reward Processes (MRPs)

1. Markov Reward Process is a Markov Chain + rewards

2. Definition of Markov Reward Process(MRP)

  - $S$ is a (finite) set of states ($s \in S$)

  - $P$ is a dynamics/transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$

  - $R$ is a reward function $R(s_t = s) = \mathbb{E}[r_t \mid s_t = s]$

  - Discount factor $\gamma \in [0, 1]$

3. Note : yes rewards, no actions

4. If finite number ($N$) of states, can express $R$ as a vector

 

Example of Markov Reward Process

[Figure: the same 7-state chain as above, now with rewards attached: reward +1 in $s_1$, reward +10 in $s_7$, reward 0 in $s_2$–$s_6$.]

 

Markov Decision Processes (MDPs)

1. Markov Decision Process is a Markov Reward Process + actions

2. Definition of MDP

  - $S$ is a (finite) set of Markov states ($s \in S$)

  - $A$ is a (finite) set of actions ($a \in A$)

  - $P$ is a dynamics/transition model for each action, that specifies $p(s_{t+1} = s' \mid s_t = s, a_t = a)$

  - $R$ is a reward function $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$

3. MDP is a tuple: $(S, A, P, R, \gamma)$

4. MDP can model a huge number of interesting problems and settings

  - Bandits : single state MDP

  - Optimal control mostly about continuous-state MDP

  - Partially observable MDPs = MDP where state is history

5. Note : yes rewards, yes actions

 

Example of Markov Decision Process

[Figure: the same 7 states $s_1, \dots, s_7$ with two deterministic actions, $a_1$ = left and $a_2$ = right.]

$$P(s' \mid s, a_1 = \text{left}) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

$$P(s' \mid s, a_2 = \text{right}) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

 

Return & Value Function

1. Definition of Horizon (H)

  - Number of time steps in each episode

  - Can be infinite

  - Otherwise called finite Markov reward process

 

2. Definition of Return, $G_t$ (for a MRP)

  - Discounted sum of rewards from time step $t$ to horizon $H$

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{H-1} r_{t+H-1}$$

 

 

3. Definition of State Value Function, $V(s)$ (for a MRP)

  - Expected return from starting in state $s$

$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{H-1} r_{t+H-1} \mid s_t = s]$$

 

Computing the Value of a MRP (Markov Reward Process)

- Markov property provides structure

- MRP value function satisfies

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V(s')$$

$R(s)$ : Immediate reward

$\gamma \sum_{s' \in S} P(s' \mid s) V(s')$ : Discounted sum of future rewards

 

 

$$V = R + \gamma P V$$

$$V - \gamma P V = R$$

$$(I - \gamma P) V = R$$

$$V = (I - \gamma P)^{-1} R$$

$(I - \gamma P)$ is invertible for $\gamma < 1$
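For a small MRP like the rover example this closed form can be computed directly. A minimal sketch, using the reward vector from the MRP example above (+1 in $s_1$, +10 in $s_7$) and an illustrative choice of $\gamma = 0.5$:

```python
import numpy as np

# Rover MRP: transitions P (same matrix as the Markov-process example),
# rewards R (+1 in s1, +10 in s7, 0 elsewhere), discount gamma (assumed 0.5 here).
P = np.zeros((7, 7))
for i in range(7):
    P[i, max(i - 1, 0)] += 0.4   # move to the left neighbor (s1 bounces back into s1)
    P[i, i] += 0.2               # stay in place
    P[i, min(i + 1, 6)] += 0.4   # move to the right neighbor (s7 bounces back into s7)
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])
gamma = 0.5

# Closed form V = (I - gamma P)^{-1} R; solve() avoids forming the inverse explicitly.
V_exact = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V_exact, 3))
```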

 

Iterative Algorithm for Computing Value of a MRP

- Dynamic programming

- Initialize $V_0(s) = 0$ for all $s$

- For $k = 1$ until convergence

  - For all $s \in S$

     - $V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V_{k-1}(s')$
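The same $V$ can be obtained iteratively; a minimal sketch (my own, with the same array conventions as the previous sketch) of this dynamic-programming loop:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterative MRP evaluation: V_k(s) = R(s) + gamma * sum_s' P(s'|s) V_{k-1}(s')."""
    V = np.zeros(len(R))                     # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V            # one synchronous backup over all states
        if np.max(np.abs(V_new - V)) < tol:  # stop once V has effectively converged
            return V_new
        V = V_new

# e.g. mrp_value_iterative(P, R, 0.5) with the rover P and R from the previous sketch
# agrees with the closed-form solution, since each sweep is a gamma-contraction.
```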

 

MDP Policies

1. Policy specifies what action to take in each state

  - Can be deterministic or stochastic

2. For generality, consider as a conditional distribution

  - Given a state, specifies a distribution over actions

3. Policy: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$

 

MDP + Policy

1. MDP + $\pi(a \mid s)$ = Markov Reward Process

2. Precisely, it is the MRP $(S, R^\pi, P^\pi, \gamma)$, where

$$R^\pi(s) = \sum_{a \in A} \pi(a \mid s) R(s, a)$$

$$P^\pi(s' \mid s) = \sum_{a \in A} \pi(a \mid s) P(s' \mid s, a)$$

3. This implies we can use the same techniques to evaluate the value of a policy for a MDP as we could to compute the value of a MRP, by defining a MRP with $R^\pi$ and $P^\pi$
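A minimal sketch of point 3, assuming the MDP is stored as arrays P[a, s, s'] and R[s, a] and the policy as pi[s, a] (this layout is my own choice, not the lecture's):

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Collapse an MDP plus a stochastic policy pi(a|s) into the MRP (R_pi, P_pi).

    P  : shape (A, S, S), P[a, s, s2] = p(s2 | s, a)
    R  : shape (S, A),    R[s, a]
    pi : shape (S, A),    pi[s, a] = pi(a | s)
    """
    R_pi = np.einsum('sa,sa->s', pi, R)    # R_pi(s)     = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P_pi(s, s') = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi

# The resulting (R_pi, P_pi, gamma) can be fed to either MRP evaluation routine above.
```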

 

MDP Policy Evaluation, Iterative Algorithm

- Initialize $V_0(s) = 0$ for all $s$

- For $k = 1$ until convergence

  - For all $s \in S$

     - $V_k^\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_{k-1}^\pi(s') \right]$

- This is a Bellman backup for a particular policy

- Note that if the policy is deterministic then the above update simplifies to

     - $V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V_{k-1}^\pi(s')$
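A minimal sketch of this iterative policy evaluation, with the same assumed array layout as the MDP-plus-policy sketch above:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation:
    V_k(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V_{k-1}(s') ].
    Layout: P[a, s, s'], R[s, a], pi[s, a]."""
    V = np.zeros(R.shape[0])                           # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # backup of every (s, a) pair
        V_new = np.sum(pi * Q, axis=1)                 # average over the policy's action distribution
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```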

 

Example Iteration of Policy Evaluation 1

- Dynamics : $p(s_6 \mid s_6, a_1) = 0.5$, $p(s_7 \mid s_6, a_1) = 0.5$

- Reward : for all actions, +1 in state $s_1$, +10 in state $s_7$, 0 otherwise

- Let $\pi(s) = a_1$ for all $s$, assume $V_k^\pi = [1\ 0\ 0\ 0\ 0\ 0\ 10]$ and $k = 1$, $\gamma = 0.5$

Q. Compute $V_{k+1}^\pi(s_6)$
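Working the backup out with these numbers (my own answer, so worth checking against the lecture):

$$V_{k+1}^\pi(s_6) = r(s_6, a_1) + \gamma \left[ 0.5\, V_k^\pi(s_6) + 0.5\, V_k^\pi(s_7) \right] = 0 + 0.5 \left( 0.5 \cdot 0 + 0.5 \cdot 10 \right) = 2.5$$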

 

Example Iteration of Policy Evaluation 2

- 7 discrete states

- 2 actions

Q. How many deterministic policies are there?

Q. Is the highest reward policy for a MDP always unique?
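For the first question, plugging into the count $|A|^{|S|}$ given in the Policy Search section below (my own arithmetic):

$$|A|^{|S|} = 2^7 = 128 \text{ deterministic policies}$$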

 

MDP Control

- Compute the optimal policy

  - $\pi^*(s) = \arg\max_\pi V^\pi(s)$

- There exists a unique optimal value function

- Optimal policy for a MDP in an infinite horizon problem (agent acts forever) is

  - Deterministic

  - Stationary (does not depend on time step)

  - Unique? Not necessarily; may have two policies with identical (optimal) values

 

Policy Search

- One option is to search over policies to compute the best one

- Number of deterministic policies is $|A|^{|S|}$

- Policy iteration is generally more efficient than enumeration

 

MDP Policy Iteration (PI)

- Set $i = 0$

- Initialize $\pi_0(s)$ randomly for all states $s$

- While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ :

  - $V^{\pi_i}$ : MDP $V$ function policy evaluation of $\pi_i$

  - $\pi_{i+1}$ : Policy improvement

  - $i = i + 1$

 

State Action Value Q

1. State-action value of a policy 

  - $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s')$

2. Take action $a$, then follow the policy $\pi$

 

Policy Improvement

1. Compute the state-action value of a policy $\pi_i$

  - For $s \in S$ and $a \in A$

    - $Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi_i}(s')$

2. Compute new policy $\pi_{i+1}$, for all $s \in S$

    - $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$
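Putting evaluation and improvement together, a minimal policy-iteration sketch (same assumed layout P[a, s, s'], R[s, a]; the deterministic policy is stored as one action index per state):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for an MDP stored as P[a, s, s'] and R[s, a]."""
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)                     # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly for the current policy.
        R_pi = R[np.arange(S), policy]
        P_pi = P[policy, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s')
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):          # unchanged policy -> converged
            return policy, V
        policy = new_policy
```

Evaluation here uses the exact linear solve; the iterative sweep shown earlier works just as well.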

 

Example Policy Iteration 

1. If policy doesn't change, can it ever change again?

2. Is there a maximum number of iterations of policy iteration?

 

Bellman Equation and Bellman Backup Operators

- Value function of a policy must satisfy the Bellman Equation 

  - $V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \mid s) V^\pi(s')$

- Bellman backup operator

  - Applied to a value function 

  - Returns a new value function

  - Improves the value if possible

    - $BV(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V(s') \right]$

  - $BV$ yields a value function over all states $s$

 

Value Iteration (VI)

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until convergence: (e.g. $\|V_{k+1} - V_k\|_\infty \le \epsilon$)

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

  - View as Bellman backup on value function

    - $V_{k+1} = B V_k$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$
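A minimal sketch of this loop (same assumed layout as before), which also extracts the greedy policy $\pi_{k+1}$ from the final backup:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration for an MDP stored as P[a, s, s'] and R[s, a]."""
    V = np.zeros(R.shape[0])                           # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Bellman backup B V evaluated at every (s, a)
        V_new = np.max(Q, axis=1)                      # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) <= eps:           # ||V_{k+1} - V_k||_inf <= eps
            return V_new, np.argmax(Q, axis=1)         # value estimate and greedy policy pi_{k+1}
        V = V_new
```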

 

This is starting to make my head spin. I wish I were one of those people who can look at the equations and just get them immediately.

 

Policy Iteration as Bellman Operations

- Bellman backup operator $B^\pi$ for a particular policy is defined as

  - $B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \mid s) V(s')$

- Policy evaluation amounts to computing the fixed point of $B^\pi$

- To do policy evaluation, repeatedly apply the operator until $V$ stops changing

  - $V^\pi = B^\pi B^\pi \cdots B^\pi V$

- To do policy improvement

  - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi_k}(s') \right]$

 

Going Back to Value Iteration (VI)

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until [finite horizon, convergence]

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

  - View as Bellman backup on value function

    - $V_{k+1} = B V_k$

  - To extract the optimal policy if we can act for $k+1$ more steps,

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

 

Value Iteration for Finite Horizon H

$V_k$ = optimal value if making $k$ more decisions

$\pi_k$ = optimal policy if making $k$ more decisions

- Initialize $V_0(s) = 0$ for all states $s$

- For $k = 1 : H$

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$
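A minimal sketch (same assumed layout) that keeps one value function and one greedy policy per number of remaining decisions, which is exactly why the finite-horizon optimal policy can depend on the time step:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Finite-horizon VI: V[k] and pi[k-1] are optimal when k decisions remain.
    MDP stored as P[a, s, s'] and R[s, a]."""
    S = R.shape[0]
    V = [np.zeros(S)]                                     # V_0(s) = 0: no decisions left
    pi = []
    for k in range(H):
        Q = R + gamma * np.einsum('ast,t->sa', P, V[k])   # backup from the k-steps-to-go values
        V.append(np.max(Q, axis=1))                       # V_{k+1}
        pi.append(np.argmax(Q, axis=1))                   # pi_{k+1}: greedy with k+1 steps left
    return V, pi
```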

 

Computing the Value of a Policy in a Finite Horizon

- Alternatively can estimate by simulation

  - Generate a large number of episodes

  - Average returns

  - Concentration inequalities bound how quickly the average concentrates to the expected value

  - Requires no assumption of Markov structure
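A minimal sketch of this simulation-based estimate. The episode sampler here is a hypothetical callable I assume returns a list of rewards; nothing about it needs to be Markov:

```python
import numpy as np

def mc_value_estimate(sample_rewards, start_state, gamma, H, n_episodes=10_000):
    """Monte Carlo estimate of V(start_state): average the discounted return over many episodes.
    `sample_rewards(start_state, H)` is a user-supplied simulator returning H rewards."""
    returns = []
    for _ in range(n_episodes):
        rewards = sample_rewards(start_state, H)
        G = sum(gamma ** t * r for t, r in enumerate(rewards))  # discounted return of one episode
        returns.append(G)
    return np.mean(returns)  # concentrates to E[G_t | s_t = start_state] as n_episodes grows
```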

 

Example of Finite Horizon H

[Figure: the same 7-state Mars-rover chain as in the earlier examples.]

- Reward : +1 in $s_1$, +10 in $s_7$, 0 in all other states

- Sample returns for sample 4-step ($h = 4$) episodes, start state $s_4$, $\gamma = 1/2$

    - $s_4, s_5, s_6, s_7$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 10 = 1.25$

    - $s_4, s_4, s_5, s_4$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 0 = 0$

    - $s_4, s_3, s_2, s_1$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 1 = 0.125$

 

Question : Finite Horizon Policies

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until $k == H$ :

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

 

Is the optimal policy stationary (independent of time step) in finite-horizon tasks?

In general, no.

 

Value Iteration vs Policy Iteration

Value Iteration :

  - Compute optimal value for horizon = $k$

    - Note this can be used to compute the optimal policy if horizon = $k$

  - Increment $k$

Policy Iteration :

  - Compute infinite-horizon value of a policy

  - Use to select another (better) policy

  - Closely related to a very popular method in RL : Policy gradient