
[Reinforcement Learning] CS234 class 2

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=2 

 

Learning Objectives

1. Memorize the definitions of MP, MRP, MDP, Bellman operator, contraction operator, model, Q-value, and policy

2. Be able to compute Value Iteration and Policy Iteration

3. Know the strengths and weaknesses of the various Policy Evaluation Approaches

4. Be able to prove the Contraction Properties

5. Know the limitations of MP, MRP, MDP and the Markov assumption - which policy evaluation methods require the Markov assumption?

 

Markov Process

1. Memoryless random process

  - Sequence of random states with Markov property

2. Definition of Markov Process

  - \(S\) is a (finite) set of states (\(s \in S\))

  - \(P\) is a dynamics/transition model that specifies \(p(s_{t+1}=s' | s_t = s)\)

3. Note : no rewards, no actions

4. If finite number of states (\(N\)), can express \(P\) as a matrix

\( P = \left( \begin {array}{cc} P(s_1|s_1) &... & P(s_N|s_1)\\...&...&...\\ P(s_1|s_N) &...&P(s_N|s_N)\end {array} \right) \)

 

Example of Markov Process

(Figure: a 7-state chain \(s_1, \dots, s_7\). Each interior state moves right with probability 0.4, left with probability 0.4, and stays in place ("rotate") with probability 0.2; the end states \(s_1\) and \(s_7\) stay in place with probability 0.6 and move inward with probability 0.4.)

 

\( P = \left( \begin {array}{cc} 0.6 & 0.4 &0&0&0&0 & 0\\ 0.4 & 0.2 &0.4&0&0&0 & 0\\0 & 0.4 &0.2&0.4&0&0 & 0\\0 & 0 &0.4&0.2&0.4&0 & 0\\0 & 0 &0&0.4&0.2&0.4 & 0\\0 & 0 &0&0&0.4&0.2 & 0.4\\ 0 & 0 &0&0&0&0.4 & 0.6\end {array} \right) \)

 

Markov Reward Processes (MRPs)

1. Markov Reward Process is a Markov Chain + rewards

2. Definition of Markov Reward Process(MRP)

  - \(S\) is a (finite) set of states (\(s \in S\))

  - \(P\) is a dynamics/transition model that specifies \(p(s_{t+1}=s' | s_t =s)\)

  - \(R\) is a reward function \(R(s_t=s) = \mathbb {E}[r_t|s_t=s]\)

  - Discount factor \(\gamma \in [0,1]\)

3. Note : yes rewards, no actions

4. If finite number (\(N\)) of states, can express \(R\) as a vector

 

Example of Markov Reward Process

(Figure: the same 7-state chain with the same transition probabilities, now annotated with rewards: reward = +1 in \(s_1\), reward = +10 in \(s_7\), and reward = 0 in every other state.)

 

Markov Decision Processes (MDPs)

1. Markov Decision Process is a Markov Reward Process + actions

2. Definition of MDP

  - \(S\) is a (finite) set of Markov states \(s\in S\)

  - \(A\) is a (finite) set of actions \(a\in A\)

  - \(P\) is dynamics/transition model for each action, that specifies \(p(s_{t+1}=s' | s_t =s, a_t = a)\)

  - \(R\) is a reward function \( R(s_t = s, a_t = a) = \mathbb {E}[r_t|s_t=s, a_t=a]\)

3. MDP is a tuple : \((S, A, P, R,\gamma)\)

4. MDP can model a huge number of interesting problems and settings

  - Bandits : single state MDP

  - Optimal control mostly about continuous-state MDP

  - Partially observable MDPs = MDP where state is history

5. Note : yes rewards, yes actions

 

Example of Markov Decision Process

(Figure: the same 7 states \(s_1, \dots, s_7\), now with two deterministic actions, \(a_1 = \) left and \(a_2 = \) right, whose transition matrices are given below.)

\( P(s'|s, a_1=left) = \left( \begin {array}{cc} 1 & 0 &0&0&0&0 & 0\\ 1 & 0 &0&0&0&0 & 0\\0 & 1 & 0 &0&0&0&0 \\0 & 0 &1&0&0&0 & 0\\0 & 0 &0&1&0&0 & 0\\0& 0 &0&0&1&0 & 0\\ 0 & 0 &0&0&0&1 & 0\end {array} \right) \)

 

\( P(s'|s, a_2=right) = \left( \begin {array}{cc} 0 & 1 & 0 &0&0&0&0 \\0 & 0 &1&0&0&0 & 0\\0 & 0 &0&1&0&0 & 0\\0& 0 &0&0&1&0 & 0\\ 0 & 0 &0&0&0&1 & 0 \\ 0 & 0 &0&0&0&0 & 1\\ 0 & 0 &0&0&0&0 & 1 \end {array} \right) \)

 

Return & Value Function

1. Definition of Horizon (H)

  - Number of time steps in each episode

  - Can be infinite

  - Otherwise called finite Markov reward process

 

2. Definition of Return, \(G_t\) (for a MRP)

  - Discounted sum of rewards from time step \(t\) to horizon \(H\)

\( G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +... + \gamma^{H-1} r_{t+H-1} \)

 

 

3. Definition of State Value Function, \( V(s)\) (for a MRP)

  - Expected return from starting in state \(s\)

\( V(s) = \mathbb {E}[G_t|s_t = s] =\mathbb {E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +... + \gamma^{H-1} r_{t+H-1}| s_t = s] \)

 

Computing the Value of a MRP (Markov Reward Process)

- Markov property provides structure

- MRP value function satisfies

\( V(s) = R(s) + \gamma \sum\limits_{s' \in S} P(s'| s) V(s') \)

\(R(s)\) : Immediate reward

\( \gamma \sum\limits_{s' \in S} P(s'| s) V(s') \) : Discounted sum of future rewards

 

 

\(V = R+ \gamma PV \)

\(V - \gamma PV = R\)

\((I-\gamma P) V = R\)

\(V = (I-\gamma P)^{-1} R\)

\((I-\gamma P)\) is invertible for \(\gamma < 1\), so \(V\) can be solved for directly (at the cost of a matrix inverse, roughly \(O(N^3)\))
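As a concrete check, here is a minimal NumPy sketch of this analytic solution, assuming the 7-state transition matrix from the example above, the reward vector from the MRP example (+1 in \(s_1\), +10 in \(s_7\)), and \(\gamma = 0.5\):

```python
import numpy as np

# Transition matrix of the 7-state example MRP (row = current state, column = next state)
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])   # reward: +1 in s1, +10 in s7, 0 elsewhere
gamma = 0.5

# V = (I - gamma * P)^(-1) R, solved as a linear system instead of an explicit inverse
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(V)
```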

 

Iterative Algorithm for Computing Value of a MRP

- Dynamic programming

- Initialize \(V_0 (s) = 0 \) for all \(s\)

- For \(k=1\) until convergence

  - For all \(s \in S\) 

     - \(V_k(s) = R(s) + \gamma \sum\limits_{s'\in S} P(s'|s) V_{k-1}(s')\)
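A minimal iterative sketch of the same computation, reusing the P, R, and gamma arrays defined in the previous snippet (the tolerance tol is an assumed stopping criterion, not part of the lecture):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Dynamic-programming evaluation of an MRP: V_k = R + gamma * P @ V_{k-1}."""
    V = np.zeros(len(R))                       # V_0(s) = 0 for all s
    while True:
        V_next = R + gamma * P @ V             # one backup for every state at once
        if np.max(np.abs(V_next - V)) < tol:   # stop once the update no longer changes V
            return V_next
        V = V_next

# e.g. mrp_value_iterative(P, R, gamma) converges to the analytic V above
```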

 

MDP Policies

1. Policy specifies what action to take in each state

  - Can be deterministic or stochastic

2. For generality, consider as a conditional distribution

  - Given a state, specifies a distribution over actions

3. Policy: \(\pi(a|s) = P(a_t =a| s_t = s) \)

 

MDP + Policy

1. MDP + \(\pi(a|s) = \) Markov Reward Process

2. Precisely, it is the MRP \( (S, R^{\pi}, P^{\pi},\gamma ) \), where

\( R^{\pi} (s) = \sum\limits_{a \in A} \pi(a|s) R(s, a)\)

\( P^{\pi} (s'|s) = \sum\limits_{a \in A} \pi(a|s) P(s'|s, a)\)

3. This implies we can use the same techniques to evaluate the value of a policy for a MDP as we could to compute the value of a MRP, by defining a MRP with \(R^{\pi}\) and \(P^{\pi}\) (see the sketch below)
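A small sketch of that reduction, assuming MDP arrays laid out as P[a, s, s'] = \(p(s'|s,a)\), R[s, a] = \(R(s,a)\), and a stochastic policy pi[s, a] = \(\pi(a|s)\) (these array shapes are my assumption, not notation from the lecture):

```python
import numpy as np

def mrp_from_policy(P, R, pi):
    """Induce the MRP (R_pi, P_pi) obtained by following policy pi in the MDP."""
    R_pi = np.sum(pi * R, axis=1)              # R_pi(s)    = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,asn->sn', pi, P)      # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi

# The resulting (R_pi, P_pi, gamma) can then be evaluated with the MRP methods above.
```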

 

MDP Policy Evaluation, Iterative Algorithm

- Initialize \(V_0(s) = 0 \) for all \(s\)

- For \(k=1\) until convergence

  - For all \(s \in S\) 

     - \(V^{\pi}_k(s) = \sum\limits_{a}\pi(a|s)[r(s, a) + \gamma \sum\limits_{s'\in S} P(s'|s, a) V^{\pi}_{k-1}(s')] \)

- This is a Bellman backup for a particular policy

- Note that if the policy is deterministic then the above update simplifies to

     - \(V^{\pi}_k(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} P(s'|s,\pi (s)) V^{\pi}_{k-1}(s')\)
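The same update as a NumPy sketch, under the array-shape assumptions used above (P[a, s, s'], R[s, a], pi[s, a]):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative Bellman backup for a fixed stochastic policy pi."""
    V = np.zeros(R.shape[0])                            # V_0(s) = 0
    while True:
        # Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        V_next = np.sum(pi * Q, axis=1)                 # average over actions with pi(a|s)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```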

 

Example Iteration of Policy Evaluation 1

- Dynamics : \( p(s_6|s_6, a_1) = 0.5, p(s_7|s_6, a_1) = 0.5\)

- Reward : for all actions, +1 in state \(s_1\), +10 in state \(s_7\), 0 otherwise

- Let \( \pi(s) = a_1\) for all \(s\); assume \(V^{\pi}_k = [1~0~0~0~0~0~10]\), \(k=1\), \(\gamma = 0.5\)

Q. compute \(V^{\pi}_{k+1}(s_6)\)
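A worked answer, assuming (per the reward definition above) that \(r(s_6, a_1) = 0\):

\( V^{\pi}_{k+1}(s_6) = r(s_6, a_1) + \gamma\,[\,0.5\, V^{\pi}_k(s_6) + 0.5\, V^{\pi}_k(s_7)\,] = 0 + 0.5\,(0.5 \times 0 + 0.5 \times 10) = 2.5 \)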

 

Example Iteration of Policy Evaluation 2

- 7 discrete states

- 2 actions

Q. How many deterministic policies are there?

Q. Is the highest reward policy for a MDP always unique?
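A quick answer to the first question: with \(|A| = 2\) actions and \(|S| = 7\) states there are \(|A|^{|S|} = 2^7 = 128\) deterministic policies. As for the second, the optimal value is unique but the optimal policy need not be: two different policies can achieve the same highest value, as noted in the next section.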

 

MDP Control

- Compute the optimal policy

  - \(\pi^{*} (s) = arg\max\limits_{\pi} V^{\pi} (s) \) 

- There exists a unique optimal value function

- Optimal policy for a MDP in an infinite horizon problem(agent acts forever) is

  - Deterministic

  - Stationary (does not depend on time step)

  - Unique? Not necessarily; may have two policies with identical (optimal) values

 

Policy Search

- One option is searching to compute best policy

- Number of deterministic policies is \(|A|^{|S|}\)

- Policy iteration is generally more efficient than enumeration

 

MDP Policy Iteration (PI)

- set \( i=0 \)

- Initialize \(\pi_0(s) \) randomly for all states \(s\)

- While \(i==0\) or \(||\pi_i-\pi_{i-1}||_1>0\)

  - \(V^{\pi_i}\) : MDP \(V\) function policy evaluation of \(\pi_i\)

  - \(\pi_{i+1}\) : Policy improvement

  - \( i=i+1\)

 

State Action Value Q

1. State-action value of a policy 

  - \( Q^{\pi}(s, a)  =  R(s, a)  + \gamma \sum\limits_{s'\in S} P(s'|s, a) V^{\pi} (s') \) 

2. Take action \(a\), then follow the policy \(\pi\)

 

Policy Improvement

1. Compute state-action value of a policy \(\pi_i\)

  - For \(s\in S\) and \(a\in A\)

    - \(Q^{\pi_i}(s, a) = R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V^{\pi_i}(s') \)

2. Compute new policy \(\pi_{i+1}\), for all \(s \in S\)

    - \(\pi_{i+1}(s) = arg\max_a Q^{\pi_i}(s,a) \)
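Putting evaluation and improvement together, here is a compact policy-iteration sketch under the same assumed array shapes (P[a, s, s'], R[s, a]); it evaluates each deterministic policy exactly with a linear solve rather than by iteration:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for an MDP given P[a, s, s'] = p(s'|s,a) and R[s, a] = r(s,a)."""
    num_actions, num_states, _ = P.shape
    pi = np.zeros(num_states, dtype=int)            # arbitrary deterministic initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy
        P_pi = P[pi, np.arange(num_states), :]      # P_pi[s, s'] = P(s'|s, pi(s))
        R_pi = R[np.arange(num_states), pi]         # R_pi[s]     = r(s, pi(s))
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

        # Policy improvement: pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a)
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        pi_next = np.argmax(Q, axis=1)

        if np.array_equal(pi_next, pi):             # policy unchanged -> it will never change again
            return pi, V
        pi = pi_next
```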

 

Example Policy Iteration 

1. If policy doesn't change, can it ever change again?

2. Is there a maximum number of iterations of policy iteration?

 

Bellman Equation and Bellman Backup Operators

- Value function of a policy must satisfy the Bellman Equation 

  - \(V^{\pi}(s) = R^{\pi} (s) + \gamma \sum\limits_{s'\in S}P^{\pi} (s'|s)V^{\pi} (s') \)

- Bellman backup operator

  - Applied to a value function 

  - Returns a new value function

  - Improves the value if possible

    - \( BV(s) = \max_a [R(s,a) + \gamma \sum\limits_{s'\in S} p(s'|s,a) V(s')] \)

  - BV yields a value function over all states \(s\)
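A one-step Bellman backup operator as a small sketch, again assuming the P[a, s, s'] / R[s, a] layout used above:

```python
import numpy as np

def bellman_backup(V, P, R, gamma):
    """Apply B once: (BV)(s) = max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Q[s, a]
    return np.max(Q, axis=1)                       # a new value function over all states s
```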

 

Value Iteration (VI)

1. set \(k=1\)

2. Initialize \(V_0(s) = 0\) for all states \(s\)

3. Loop until convergence (e.g. \(||V_{k+1}-V_k||_\infty \leq \epsilon\)):

  - for each state \(s\)

    - \( V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S}P(s'|s,a) V_k(s') ]\)

  - View as Bellman backup on value function 

    - \( V_{k+1} = BV_k\)

    - \(\pi_{k+1}(s) = arg\max_a [R(s,a)+\gamma \sum\limits_{s'\in S} P(s'|s,a) V_k(s')]  \)
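Value iteration is then just repeated application of this backup until the value stops changing; a sketch under the same assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeat V_{k+1} = B V_k until ||V_{k+1} - V_k||_inf <= eps; also extract a greedy policy."""
    V = np.zeros(P.shape[1])                           # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        V_next = np.max(Q, axis=1)                     # V_{k+1} = B V_k
        if np.max(np.abs(V_next - V)) <= eps:
            return V_next, np.argmax(Q, axis=1)        # pi(s) = argmax_a Q(s, a)
        V = V_next
```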

 

This is starting to make my head spin. I wish I were one of those people who can just look at the equations and understand them right away.

 

Policy Iteration as Bellman Operations

- Bellman backup operator \(B^{\pi}\) for a particular policy is defined as 

  - \(B^{\pi} V(s) = R^{\pi}(s) +\gamma \sum\limits_{s'\in S} P^{\pi}(s'|s)V(s')\)

- Policy evaluation amounts to computing the fixed point of \(B^\pi\)

- To do policy evaluation, repeatedly apply the operator until \(V\) stops changing

  - \(V^\pi = B^\pi B^\pi ...B^\pi V\)

- To do Policy Improvement

 - \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a) V^{\pi_k}(s')]\)

 

Going Back to Value Iteration (VI)

1. set \(k=1\)

2. Initialize \(V_0(s) = 0\) for all states \(s\)

3. Loop until [finite horizon, convergence]

  - for each state \(s\)

    - \( V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S}P(s'|s,a) V_k(s') ]\)

  - View as Bellman backup on value function 

    - \( V_{k+1} = BV_k\)

  - To extract the optimal policy if we can act for \(k+1\) more steps,

    - \(\pi_{k+1}(s) = arg\max_a [R(s,a)+\gamma \sum\limits_{s'\in S} P(s'|s,a) V_k(s')]  \)

 

Value Iteration for Finite Horizon H

\(V_k\) = optimal value if making \(k\) more decisions

\(\pi_k\) = optimal policy if making \(k\) more decisions

- Initialize \(V_0(s) = 0\) for all states \(s\)

- For \(k=1 :H \) 

  - For each state \(s\)

  -  \(V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)

  - \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)
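A sketch for the finite-horizon case, which keeps one (possibly different) greedy policy per number of remaining decisions, under the same assumed array shapes:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Compute V_k and pi_k for k = 1..H (optimal when k more decisions remain)."""
    V = np.zeros(P.shape[1])                           # V_0(s) = 0
    policies = []                                      # policies[k-1] = pi_k
    for k in range(1, H + 1):
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        policies.append(np.argmax(Q, axis=1))          # pi_k(s) = argmax_a Q(s, a)
        V = np.max(Q, axis=1)                          # V_k(s)  = max_a Q(s, a)
    return V, policies
```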

 

Computing the Value of a Policy in a Finite Horizon

- Alternatively can estimate by simulation

  - Generate a large number of episodes

  - Average returns

  - Concentration inequalities bound how quickly the average concentrates to the expected value

  - Requires no assumption of Markov structure
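A Monte Carlo sketch of this idea, assuming a deterministic policy given as an array of action indices and the same P[a, s, s'] / R[s, a] layout (episode count and seed are arbitrary choices):

```python
import numpy as np

def monte_carlo_value(P, R, pi, gamma, s0, horizon, num_episodes=10000, seed=0):
    """Estimate V^pi(s0) by averaging discounted returns over sampled episodes."""
    rng = np.random.default_rng(seed)
    num_states = R.shape[0]
    returns = []
    for _ in range(num_episodes):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[s]
            G += discount * R[s, a]                     # collect r_t with its discount
            s = rng.choice(num_states, p=P[a, s])       # sample s' ~ p(. | s, a)
            discount *= gamma
        returns.append(G)
    return np.mean(returns)                             # average return concentrates to V^pi(s0)
```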

 

Example of Finite Horizon H

(Figure: the same 7-state chain as above; each interior state moves left with probability 0.4, right with probability 0.4, and stays in place with probability 0.2, while the end states stay with probability 0.6.)

- Reward : +1 in \(s_1\), +10 in \(s_7\), 0 in all other states

- Sample returns for sampled 4-step \((h=4)\) episodes, start state \(s_4\), \(\gamma = \frac12\)

    - \(s_4,s_5,s_6,s_7\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times10 = 1.25\)

    - \(s_4,s_4,s_5,s_4\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times0 = 0\)

    - \(s_4,s_3,s_2,s_1\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times1 = 0.125\)

 

Question : Finite Horizon Policies

1. set \(k=1\)

2. Initialize \(V_0(s) = 0\) for all states \(s\)

3. Loop until \(k == H\):

  - For each state \(s\)

  -  \(V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)

  - \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)

 

Is the optimal policy stationary (independent of time step) in finite horizon tasks?

In general, no: with a finite horizon the best action can depend on how many time steps remain, so the optimal policy is generally non-stationary.

 

Value Iteration vs Policy Iteration

Value Iteration :

  - Compute optimal value for horizon = \(k\)

    - Note this can be used to compute the optimal policy if horizon = \(k\)

  - Increment \(k\)

Policy Iteration : 

  - Compute infinite horizon value of a policy

  - Use to select another (better) policy

  - Closely related to a very popular method in RL : Policy gradient

    
