https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=2
Learning Objectives
1. Memorize the definitions of MP, MRP, MDP, Bellman operator, contraction operator, model, Q-value, and policy
2. Be able to compute Value Iteration and Policy Iteration
3. Know the pros and cons of the various Policy Evaluation Approaches
4. Be able to prove the Contraction Properties
5. Know the limitations of MP, MRP, MDP and Markov assumptions - which policy evaluation methods require the Markov assumption?
Markov Process
1. Memoryless random process
- Sequence of random states with Markov property
2. Definition of Markov Process
- \(S\) is a (finite) set of states(\( s \in S\))
- \(P\) is dynamics/transition model that specifies \(p(s_{t+1}=s' | s_t = s)\)
3. Note : no rewards, no actions
4. If finite number (\(N\)) of states, can express \(P\) as a matrix
\( P = \left( \begin {array}{cc} P(s_1|s_1) &... & P(s_N|s_1)\\...&...&...\\ P(s_1|s_N) &...&P(s_N|s_N)\end {array} \right) \)
Example of Markov Process
| State | Transitions |
|---|---|
| \(s_1\) | right = 0.4, rotate = 0.6 |
| \(s_2\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_3\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_4\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_5\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_6\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_7\) | left = 0.4, rotate = 0.6 |
\( P = \left( \begin {array}{cc} 0.6 & 0.4 &0&0&0&0 & 0\\ 0.4 & 0.2 &0.4&0&0&0 & 0\\0 & 0.4 &0.2&0.4&0&0 & 0\\0 & 0 &0.4&0.2&0.4&0 & 0\\0 & 0 &0&0.4&0.2&0.4 & 0\\0 & 0 &0&0&0.4&0.2 & 0.4\\ 0 & 0 &0&0&0&0.4 & 0.6\end {array} \right) \)
Markov Reward Processes (MRPs)
1. Markov Reward Process is a Markov Chain + rewards
2. Definition of Markov Reward Process(MRP)
- \(S\) is a (finite) set of states\( s\in S\)
- \(P\) is dynamics/transition model that specifies \(p(s_{t+1}=s' | s_t =s)\)
- \( R\) is a reward function \(R(s_t=s) = \mathbb {E}[r_t|s_t=s]\)
- Discount factor \(\gamma \in [0,1]\)
3. Note : yes rewards, no actions
4. If finite number (\(N\)) of states, can express \(R\) as a vector
Example of Markov Reward Process
| State | Transitions | Reward |
|---|---|---|
| \(s_1\) | right = 0.4, rotate = 0.6 | 1 |
| \(s_2\) | right = 0.4, rotate = 0.2, left = 0.4 | 0 |
| \(s_3\) | right = 0.4, rotate = 0.2, left = 0.4 | 0 |
| \(s_4\) | right = 0.4, rotate = 0.2, left = 0.4 | 0 |
| \(s_5\) | right = 0.4, rotate = 0.2, left = 0.4 | 0 |
| \(s_6\) | right = 0.4, rotate = 0.2, left = 0.4 | 0 |
| \(s_7\) | left = 0.4, rotate = 0.6 | 10 |
Markov Decision Processes (MDPs)
1. Markov Decision Process is a Markov Reward Process + actions
2. Definition of MDP
- \(S\) is a (finite) set of Markov states \(s\in S\)
- \(A\) is a (finite) set of actions \(a\in A\)
- \(P\) is dynamics/transition model for each action, that specifies \(p(s_{t+1}=s' | s_t =s, a_t = a)\)
- \(R\) is a reward function \( R(s_t = s, a_t = a) = \mathbb {E}[r_t|s_t=s, a_t=a]\)
3. MDP is a tuple : \((S, A, P, R,\gamma)\)
4. MDP can model a huge number of interesting problems and settings
- Bandits : single state MDP
- Optimal control mostly about continuous-state MDP
- Partially observable MDPs = MDP where state is history
5. Note : yes rewards, yes actions
Example of Markov Decision Process
States \(s_1, s_2, \ldots, s_7\), with two actions \(a_1 = \) left and \(a_2 = \) right:
\( P(s'|s, a_1=left) = \left( \begin {array}{cc} 1 & 0 &0&0&0&0 & 0\\ 1 & 0 &0&0&0&0 & 0\\0 & 1 & 0 &0&0&0&0 \\0 & 0 &1&0&0&0 & 0\\0 & 0 &0&1&0&0 & 0\\0& 0 &0&0&1&0 & 0\\ 0 & 0 &0&0&0&1 & 0\end {array} \right) \)
\( P(s'|s, a_2=right) = \left( \begin {array}{cc} 0 & 1 & 0 &0&0&0&0 \\0 & 0 &1&0&0&0 & 0\\0 & 0 &0&1&0&0 & 0\\0& 0 &0&0&1&0 & 0\\ 0 & 0 &0&0&0&1 & 0 \\ 0 & 0 &0&0&0&0 & 1\\ 0 & 0 &0&0&0&0 & 1 \end {array} \right) \)
Return & Value Function
1. Definition of Horizon (H)
- Number of time steps in each episode
- Can be infinite
- Otherwise called finite Markov reward process
2. Definition of Return, \(G_t\) (for a MRP)
- Discounted sum of rewards from time step \(t\) to horizon \(H\)
\( G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +... + \gamma^{H-1} r_{t+H-1} \)
3. Definition of State Value Function, \( V(s)\) (for a MRP)
- Expected return from starting in state \(s\)
\( V(s) = \mathbb {E}[G_t|s_t = s] =\mathbb {E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +... + \gamma^{H-1} r_{t+H-1}| s_t = s] \)
Computing the Value of a MRP (Markov Reward Process)
- Markov property provides structure
- MRP value function satisfies
\( V(s) = R(s) + \gamma \sum\limits_{s' \in S} P(s'| s) V(s') \)
\(R(s)\) : Immediate reward
\( \gamma \sum\limits_{s' \in S} P(s'| s) V(s') \) : Discounted sum of future rewards
\(V = R+ \gamma PV \)
\(R = V-\gamma PV\)
\((I-\gamma P) V = R\)
\(V = (I-\gamma P)^{-1} R\)
\((I-\gamma P)\) is invertible for \(\gamma < 1\)
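As a concrete illustration of the closed-form solution above, here is a minimal Python sketch using NumPy. The `P` and `R` values are taken from the 7-state example MRP earlier; `gamma = 0.5` is an assumption borrowed from the later worked examples.

```python
import numpy as np

# Transition matrix of the 7-state example Markov process (rows sum to 1)
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
# Reward vector: +1 in s1, +10 in s7, 0 elsewhere (as in the MRP example)
R = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])
gamma = 0.5  # assumed discount factor

# Analytic solution: V = (I - gamma * P)^{-1} R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```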
Iterative Algorithm for Computing Value of a MRP
- Dynamic programming
- Initialize \(V_0 (s) = 0 \) for all \(s\)
- For \(k=1\) until convergence
- For all \(s \in S\)
- \(V_k(s) = R(s) + \gamma \sum\limits_{s'\in S} P(s'|s) V_{k-1}(s')\)
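The same value can also be approximated with the iterative algorithm above. A minimal sketch, reusing the `P`, `R`, `gamma` arrays from the previous snippet (the tolerance `tol` is an assumed stopping criterion):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Dynamic programming for an MRP: V_k(s) = R(s) + gamma * sum_{s'} P(s'|s) V_{k-1}(s')."""
    V = np.zeros_like(R)                  # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V         # one synchronous backup over all states
        if np.max(np.abs(V_new - V)) < tol:
            return V_new                  # converged (up to tol)
        V = V_new
```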
MDP Policies
1. Policy specifies what action to take in each state
- Can be deterministic or stochastic
2. For generality, consider as a conditional distribution
- Given a state, specifies a distribution over actions
3. Policy: \(\pi(a|s) = P(a_t =a| s_t = s) \)
MDP + Policy
1. MDP + \(\pi(a|s) = \) Markov Reward Process
2. Precisely, it is the MRP \( (S, R^{\pi}, P^{\pi},\gamma ) \), where
\( R^{\pi} (s) = \sum\limits_{a \in A} \pi(a|s) R(s, a)\)
\( P^{\pi} (s'|s) = \sum\limits_{a \in A} \pi(a|s) P(s'|s, a)\)
3. This implies we can use the same techniques to evaluate the value of a policy for an MDP as we use to compute the value of an MRP, by defining an MRP with \(R^{\pi}\) and \(P^{\pi}\)
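A minimal sketch of this construction. The array layout (`P` of shape (A, S, S), `R` of shape (S, A), `pi` of shape (S, A)) is an assumption for illustration, not the lecture's notation:

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP plus policy pi(a|s) into the induced MRP (R^pi, P^pi).

    Assumed shapes (illustrative):
      P  : (A, S, S) with P[a, s, s2] = p(s2 | s, a)
      R  : (S, A)    with R[s, a]     = expected immediate reward
      pi : (S, A)    with pi[s, a]    = pi(a | s)
    """
    S, A = R.shape
    R_pi = np.sum(pi * R, axis=1)             # R^pi(s)   = sum_a pi(a|s) R(s,a)
    P_pi = np.zeros((S, S))
    for a in range(A):
        P_pi += pi[:, [a]] * P[a]             # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi
```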
MDP Policy Evaluation, Iterative Algorithm
- Initialize \(V_0(s) = 0 \) for all \(s\)
- For \(k=1\) until convergence
- For all \(s \in S\)
- \(V^{\pi}_k(s) = \sum\limits_{a}\pi(a|s)[r(s, a) + \gamma \sum\limits_{s'\in S} P(s'|s, a) V^{\pi}_{k-1}(s')] \)
- This is a Bellman backup for a particular policy
- Note that if the policy is deterministic then the above update simplifies to
- \(V^{\pi}_k(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} P(s'|s,\pi (s)) V^{\pi}_{k-1}(s')\)
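A minimal sketch of this iterative policy evaluation, in the same assumed array conventions as the previous snippet (`P`, `R`, `pi` are illustrative names, not from the lecture):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation: repeat the Bellman backup for policy pi until V stops changing."""
    S, A = R.shape
    V = np.zeros(S)                                            # V_0(s) = 0 for all s
    while True:
        # Q[s, a] = R(s,a) + gamma * sum_{s'} P(s'|s,a) V_{k-1}(s')
        Q = R + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        V_new = np.sum(pi * Q, axis=1)                         # V_k(s) = sum_a pi(a|s) Q[s, a]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```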
Example Iteration of Policy Evaluation 1
- Dynamics : \( p(s_6|s_6, a_1) = 0.5, p(s_7|s_6, a_1) = 0.5\)
- Reward : for all actions, +1 in state \(s_1\), +10 in state \(s_7\), 0 otherwise
- Let \(\pi(s) = a_1\) for all \(s\); assume \(V^{\pi}_k = [1~0~0~0~0~0~10]\), \(k=1\), and \(\gamma = 0.5\)
Q. Compute \(V^{\pi}_{k+1}(s_6)\)
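A worked answer, plugging the given numbers into the Bellman backup above:
\( V^{\pi}_{k+1}(s_6) = r(s_6, a_1) + \gamma\,[\,0.5\, V^{\pi}_k(s_6) + 0.5\, V^{\pi}_k(s_7)\,] = 0 + 0.5 \times (0.5 \times 0 + 0.5 \times 10) = 2.5 \)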
Example Iteration of Policy Evaluation 2
- 7 discrete states
- 2 actions
Q. How many deterministic policies are there?
Q. Is the highest reward policy for a MDP always unique?
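For the first question, a worked count using the \(|A|^{|S|}\) formula stated under Policy Search below: \( |A|^{|S|} = 2^7 = 128 \) deterministic policies.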
MDP Control
- Compute the optimal policy
- \(\pi^{*} (s) = arg\max\limits_{\pi} V^{\pi} (s) \)
- There exists a unique optimal value function
- Optimal policy for a MDP in an infinite horizon problem(agent acts forever) is
- Deterministic
- Stationary (does not depend on time step)
- Unique? Not necessarily, may have two policies with identical (optimal) values
Policy Search
- One option is to search over all policies (enumeration) to compute the best one
- Number of deterministic policies is \(|A|^{|S|}\)
- Policy iteration is generally more efficient than enumeration
MDP Policy Iteration (PI)
- set \( i=0 \)
- Initialize \(\pi_0(s) \) randomly for all states \(s\)
- While \(i==0\) or \(||\pi_i-\pi_{i-1}||_1>0\)
- \(V^{\pi_i}\) : MDP \(V\) function policy evaluation of \(\pi_i\)
- \(\pi_{i+1}\) : Policy improvement
- \( i=i+1\)
State Action Value Q
1. State-action value of a policy
- \( Q^{\pi}(s, a) = R(s, a) + \gamma \sum\limits_{s'\in S} P(s'|s, a) V^{\pi} (s') \)
2. Take action \(a\), then follow the policy \(\pi\)
Policy Improvement
1. Compute the state-action value of a policy \(\pi_i\)
- For \(s\in S\) and \(a\in A\)
- \(Q^{\pi_i}(s, a) = R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V^{\pi_i}(s') \)
2. Compute new policy \(\pi_{i+1}\), for all \(s \in S\)
- \(\pi_{i+1}(s) = arg\max_a Q^{\pi_i}(s,a) \)
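Putting evaluation and improvement together, a minimal sketch of the full policy iteration loop. It reuses the hypothetical `policy_evaluation` helper and the assumed (A, S, S) / (S, A) array conventions from the earlier snippets:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q^pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V^pi(s')."""
    A = len(P)
    return R + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)

def policy_iteration(P, R, gamma):
    S, A = R.shape
    policy = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    while True:
        pi = np.eye(A)[policy]                       # one-hot pi(a|s) for the deterministic policy
        V = policy_evaluation(P, R, pi, gamma)       # policy evaluation step
        new_policy = np.argmax(q_from_v(P, R, V, gamma), axis=1)   # policy improvement step
        if np.array_equal(new_policy, policy):       # stop when the policy no longer changes
            return policy, V
        policy = new_policy
```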
Example Policy Iteration
1. If policy doesn't change, can it ever change again?
2. Is there a maximum number of iterations of policy iteration?
Bellman Equation and Bellman Backup Operators
- Value function of a policy must satisfy the Bellman Equation
- \(V^{\pi}(s) = R^{\pi} (s) + \gamma \sum\limits_{s'\in S}P^{\pi} (s'|s)V^{\pi} (s') \)
- Bellman backup operator
- Applied to a value function
- Returns a new value function
- Improves the value if possible
- \( BV(s) = \max_a [R(s,a) + \gamma \sum\limits_{s'\in S} p(s'|s,a) V(s')] \)
- BV yields a value function over all states \(s\)
Value Iteration (VI)
1. set \(k=1\)
2. Initialize \(V_0(s) = 0\) for all states \(s\)
3. Loop until convergence (e.g. \(||V_{k+1}-V_k||_\infty \leq \epsilon\)):
- for each state \(s\)
- \( V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S}P(s'|s,a) V_k(s') ]\)
- View as Bellman backup on value function
- \( V_{k+1} = BV_k\)
- \(\pi_{k+1}(s) = arg\max_a [R(s,a)+\gamma \sum\limits_{s'\in S} P(s'|s,a) V_k(s')] \)
It's starting to make my head spin... I wish I were the kind of person who could understand these equations right away just by looking at them.
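For readers who prefer code to formulas, here is a minimal sketch of the value iteration loop above, in the same assumed `P` (A, S, S) / `R` (S, A) array conventions as the earlier snippets:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup B until V converges."""
    S, A = R.shape
    V = np.zeros(S)                                            # V_0(s) = 0 for all s
    while True:
        # (BV)(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
        Q = R + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) <= eps:                   # ||V_{k+1} - V_k||_inf <= eps
            return V_new, np.argmax(Q, axis=1)                 # optimal value and a greedy policy
        V = V_new
```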
Policy Iteration as Bellman Operations
- Bellman backup operator \(B^{\pi}\) for a particular policy is defined as
- \(B^{\pi} V(s) = R^{\pi}(s) +\gamma \sum\limits_{s'\in S} P^{\pi}(s'|s)V(s')\)
- Policy evaluation amounts to computing the fixed point of \(B^\pi\)
- To do policy evaluation, repeatedly apply operator until \(V\) stops changing
- \(V^\pi = B^\pi B^\pi ...B^\pi V\)
- To do Policy Improvement
- \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a) V^{\pi_k}(s')]\)
Going Back to Value Iteration (VI)
1. set \(k=1\)
2. Initialize \(V_0(s) = 0\) for all states \(s\)
3. Loop until [finite horizon, convergence]
- for each state \(s\)
- \( V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S}P(s'|s,a) V_k(s') ]\)
- View as Bellman backup on value function
- \( V_{k+1} = BV_k\)
- To extract the optimal policy if we can act for \(k+1\) more steps,
- \(\pi_{k+1}(s) = arg\max_a [R(s,a)+\gamma \sum\limits_{s'\in S} P(s'|s,a) V_k(s')] \)
Value Iteration for Finite Horizon H
\(V_k\) = optimal value if making \(k\) more decisions
\(\pi_k\) = optimal policy if making \(k\) more decisions
- Initialize \(V_0(s) = 0\) for all states \(s\)
- For \(k=1 :H \)
- For each state \(s\)
- \(V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)
- \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)
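A minimal sketch of the finite-horizon variant, which keeps a separate value function and policy for every number of remaining decisions (illustrating why the optimal policy need not be stationary). Same assumed array conventions as above:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Returns lists V[k], pi[k] = optimal value / policy when k decisions remain (k = 0..H)."""
    S, A = R.shape
    V = [np.zeros(S)]                                # V_0(s) = 0 for all s
    pi = [None]                                      # no decision left when k = 0
    for k in range(1, H + 1):
        Q = R + gamma * np.stack([P[a] @ V[k - 1] for a in range(A)], axis=1)
        V.append(np.max(Q, axis=1))                  # V_k
        pi.append(np.argmax(Q, axis=1))              # pi_k may differ from pi_{k-1}: non-stationary
    return V, pi
```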
Computing the Value of a Policy in a Finite Horizon
- Alternatively can estimate by simulation
- Generate a large number of episodes
- Average returns
- Concentration inequalities bound how quickly the average concentrates to the expected value
- Requires no assumption of Markov structure
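A minimal sketch of this simulation-based (Monte Carlo) estimate. The `sample_next_state(s)` function and the state-indexed reward array `R` are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def mc_policy_value(sample_next_state, R, gamma, start_state, horizon, num_episodes=10_000):
    """Estimate the value of start_state by averaging discounted returns over sampled episodes."""
    returns = []
    for _ in range(num_episodes):
        s, G, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R[s]                     # accumulate discounted reward r(s)
            discount *= gamma
            s = sample_next_state(s)                 # only needs sampling access, not the model P
        returns.append(G)
    return float(np.mean(returns))                   # sample mean concentrates on E[G_t]
```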
Example of Finite Horizon H
| State | Transitions |
|---|---|
| \(s_1\) | right = 0.4, rotate = 0.6 |
| \(s_2\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_3\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_4\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_5\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_6\) | right = 0.4, rotate = 0.2, left = 0.4 |
| \(s_7\) | left = 0.4, rotate = 0.6 |
- Reward : +1 in \(s_1\), +10 in \(s_7\), 0 in all other states
- Sample returns for sample 4-step \((h=4) \) episodes, start state \(s_4\), \(\gamma = \frac12\)
- \(s_4, s_5, s_6, s_7\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times10 = 1.25\)
- \(s_4, s_4, s_5, s_4\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times0 = 0\)
- \(s_4, s_3, s_2, s_1\) : \(G = 0+\frac12\times0+\frac14\times0+\frac18\times1 = 0.125\)
Question : Finite Horizon Policies
1. set \(k=1\)
2. Initialize \(V_0(s) = 0\) for all states \(s\)
3. Loop until \(k == H\):
- For each state \(s\)
- \(V_{k+1}(s) = \max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)
- \(\pi_{k+1}(s) = arg\max_a[R(s,a) + \gamma \sum\limits_{s'\in S} P(s'|s,a)V_k(s')]\)
Q. Is the optimal policy stationary (independent of time step) in finite horizon tasks?
In general, no.
Value Iteration vs Policy Iteration
Value Iteration :
- Compute optimal value for horizon = \(k\)
- Note this can be used to compute the optimal policy if horizon = \(k\)
- Increment \(k\)
Policy Iteration :
- Compute infinite horizon value of a policy
- Use to select another (better) policy
- Closely related to a very popular method in RL : Policy gradient