
[강화학습] CS234 class 2

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=2 

 

Learning Objectives

1. Memorize the definitions of MP, MRP, MDP, Bellman operator, contraction operator, model, Q-value, and policy

2. Be able to compute Value Iteration and Policy Iteration

3. Know the pros and cons of the various Policy Evaluation approaches

4. Be able to prove the contraction properties

5. Know the limitations of MP, MRP, MDP and the Markov assumption - which policy evaluation methods require the Markov assumption?

 

Markov Process

1. Memoryless random process

  - Sequence of random states with Markov property

2. Definition of Markov Process

  - $S$ is a (finite) set of states ($s \in S$)

  - $P$ is a dynamics/transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$

3. Note : no rewards, no actions

4. If finite number of states ($N$), can express $P$ as a matrix

$$P = \begin{pmatrix} P(s_1 \mid s_1) & \cdots & P(s_N \mid s_1) \\ \vdots & \ddots & \vdots \\ P(s_1 \mid s_N) & \cdots & P(s_N \mid s_N) \end{pmatrix}$$

 

Example of Markov Process

[Figure: Mars-rover Markov chain with states $s_1, \dots, s_7$. From $s_1$: rotate (stay) 0.6, right 0.4. From $s_2$–$s_6$: left 0.4, rotate 0.2, right 0.4. From $s_7$: left 0.4, rotate 0.6.]

 

$$P = \begin{pmatrix} 0.6 & 0.4 & 0 & 0 & 0 & 0 & 0 \\ 0.4 & 0.2 & 0.4 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0.2 & 0.4 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0.2 & 0.4 & 0 & 0 \\ 0 & 0 & 0 & 0.4 & 0.2 & 0.4 & 0 \\ 0 & 0 & 0 & 0 & 0.4 & 0.2 & 0.4 \\ 0 & 0 & 0 & 0 & 0 & 0.4 & 0.6 \end{pmatrix}$$
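As a quick sanity check on the definition, here is a minimal NumPy sketch (my own, not from the lecture) that builds this transition matrix and samples a trajectory using only the Markov property:

```python
import numpy as np

# 7-state Markov chain from the rover example.
# P[i, j] = probability of moving from state s_{i+1} to state s_{j+1}.
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
state = 3                               # start in s4 (0-indexed)
trajectory = [state]
for _ in range(10):
    # Markov property: the next state depends only on the current state.
    state = rng.choice(7, p=P[state])
    trajectory.append(state)
print([f"s{s + 1}" for s in trajectory])
```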

 

Markov Reward Processes (MRPs)

1. Markov Reward Process is a Markov Chain + rewards

2. Definition of Markov Reward Process(MRP)

  - $S$ is a (finite) set of states ($s \in S$)

  - $P$ is a dynamics/transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$

  - $R$ is a reward function $R(s_t = s) = \mathbb{E}[r_t \mid s_t = s]$

  - Discount factor $\gamma \in [0, 1]$

3. Note : yes rewards, no actions

4. If finite number ($N$) of states, can express $R$ as a vector

 

Example of Markov Reward Process

[Figure: the same 7-state chain as above, now with rewards attached: reward +1 in $s_1$, reward +10 in $s_7$, reward 0 in $s_2$–$s_6$.]

 

Markov Decision Processes (MDPs)

1. Markov Decision Process is a Markov Reward Process + actions

2. Definition of MDP

  - $S$ is a (finite) set of Markov states ($s \in S$)

  - $A$ is a (finite) set of actions ($a \in A$)

  - $P$ is a dynamics/transition model for each action, that specifies $p(s_{t+1} = s' \mid s_t = s, a_t = a)$

  - $R$ is a reward function $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$

3. MDP is a tuple: $(S, A, P, R, \gamma)$

4. MDP can model a huge number of interesting problems and settings

  - Bandits : single state MDP

  - Optimal control mostly about continuous-state MDP

  - Partially observable MDPs = MDP where state is history

5. Note : yes rewards, yes actions

 

Example of Markov Decision Process

[Figure: the same 7 states $s_1, \dots, s_7$ with two deterministic actions, $a_1$ = left and $a_2$ = right.]

$$P(s' \mid s, a_1 = \text{left}) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

$$P(s' \mid s, a_2 = \text{right}) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

 

Return & Value Function

1. Definition of Horizon (H)

  - Number of time steps in each episode

  - Can be infinite

  - Otherwise called finite Markov reward process

 

2. Definition of Return, $G_t$ (for a MRP)

  - Discounted sum of rewards from time step $t$ to horizon $H$

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{H-1} r_{t+H-1}$$

 

 

3. Definition of State Value Function, $V(s)$ (for a MRP)

  - Expected return from starting in state $s$

$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{H-1} r_{t+H-1} \mid s_t = s]$$

 

Computing the Value of a MRP (Markov Reward Process)

- Markov property provides structure

- MRP value function satisfies

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V(s')$$

$R(s)$ : Immediate reward

$\gamma \sum_{s' \in S} P(s' \mid s) V(s')$ : Discounted sum of future rewards

 

 

$$V = R + \gamma P V$$

$$V - \gamma P V = R$$

$$(I - \gamma P) V = R$$

$$V = (I - \gamma P)^{-1} R$$

$(I - \gamma P)$ is invertible for $\gamma < 1$
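For a small MRP like the rover example this closed form can be computed directly. A minimal sketch, using the reward vector from the MRP example above (+1 in $s_1$, +10 in $s_7$) and an illustrative choice of $\gamma = 0.5$:

```python
import numpy as np

# Rover MRP: transitions P (same matrix as the Markov-process example),
# rewards R (+1 in s1, +10 in s7, 0 elsewhere), discount gamma (assumed 0.5 here).
P = np.zeros((7, 7))
for i in range(7):
    P[i, max(i - 1, 0)] += 0.4   # move to the left neighbor (s1 bounces back into s1)
    P[i, i] += 0.2               # stay in place
    P[i, min(i + 1, 6)] += 0.4   # move to the right neighbor (s7 bounces back into s7)
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])
gamma = 0.5

# Closed form V = (I - gamma P)^{-1} R; solve() avoids forming the inverse explicitly.
V_exact = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V_exact, 3))
```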

 

Iterative Algorithm for Computing Value of a MRP

- Dynamic programming

- Initialize $V_0(s) = 0$ for all $s$

- For $k = 1$ until convergence

  - For all $s \in S$

     - $V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) V_{k-1}(s')$
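The same $V$ can be obtained iteratively; a minimal sketch (my own, with the same array conventions as the previous sketch) of this dynamic-programming loop:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterative MRP evaluation: V_k(s) = R(s) + gamma * sum_s' P(s'|s) V_{k-1}(s')."""
    V = np.zeros(len(R))                     # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V            # one synchronous backup over all states
        if np.max(np.abs(V_new - V)) < tol:  # stop once V has effectively converged
            return V_new
        V = V_new

# e.g. mrp_value_iterative(P, R, 0.5) with the rover P and R from the previous sketch
# agrees with the closed-form solution, since each sweep is a gamma-contraction.
```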

 

MDP Policies

1. Policy specifies what action to take in each state

  - Can be deterministic or stochastic

2. For generality, consider as a conditional distribution

  - Given a state, specifies a distribution over actions

3. Policy: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$

 

MDP + Policy

1. MDP + $\pi(a \mid s)$ = Markov Reward Process

2. Precisely, it is the MRP $(S, R^\pi, P^\pi, \gamma)$, where

$$R^\pi(s) = \sum_{a \in A} \pi(a \mid s) R(s, a)$$

$$P^\pi(s' \mid s) = \sum_{a \in A} \pi(a \mid s) P(s' \mid s, a)$$

3. This implies we can use the same techniques to evaluate the value of a policy for a MDP as we could to compute the value of a MRP, by defining a MRP with $R^\pi$ and $P^\pi$
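A minimal sketch of point 3, assuming the MDP is stored as arrays P[a, s, s'] and R[s, a] and the policy as pi[s, a] (this layout is my own choice, not the lecture's):

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Collapse an MDP plus a stochastic policy pi(a|s) into the MRP (R_pi, P_pi).

    P  : shape (A, S, S), P[a, s, s2] = p(s2 | s, a)
    R  : shape (S, A),    R[s, a]
    pi : shape (S, A),    pi[s, a] = pi(a | s)
    """
    R_pi = np.einsum('sa,sa->s', pi, R)    # R_pi(s)     = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P_pi(s, s') = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi

# The resulting (R_pi, P_pi, gamma) can be fed to either MRP evaluation routine above.
```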

 

MDP Policy Evaluation, Iterative Algorithm

- Initialize $V_0(s) = 0$ for all $s$

- For $k = 1$ until convergence

  - For all $s \in S$

     - $V_k^\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_{k-1}^\pi(s') \right]$

- This is a Bellman backup for a particular policy

- Note that if the policy is deterministic then the above update simplifies to

     - $V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V_{k-1}^\pi(s')$
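A minimal sketch of this iterative policy evaluation, with the same assumed array layout as the MDP-plus-policy sketch above:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation:
    V_k(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V_{k-1}(s') ].
    Layout: P[a, s, s'], R[s, a], pi[s, a]."""
    V = np.zeros(R.shape[0])                           # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # backup of every (s, a) pair
        V_new = np.sum(pi * Q, axis=1)                 # average over the policy's action distribution
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```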

 

Example Iteration of Policy Evaluation 1

- Dynamics : $p(s_6 \mid s_6, a_1) = 0.5$, $p(s_7 \mid s_6, a_1) = 0.5$

- Reward : for all actions, +1 in state $s_1$, +10 in state $s_7$, 0 otherwise

- Let $\pi(s) = a_1$ for all $s$, assume $V_k^\pi = [1\ 0\ 0\ 0\ 0\ 0\ 10]$ and $k = 1$, $\gamma = 0.5$

Q. Compute $V_{k+1}^\pi(s_6)$
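Working the backup out with these numbers (my own answer, so worth checking against the lecture):

$$V_{k+1}^\pi(s_6) = r(s_6, a_1) + \gamma \left[ 0.5\, V_k^\pi(s_6) + 0.5\, V_k^\pi(s_7) \right] = 0 + 0.5 \left( 0.5 \cdot 0 + 0.5 \cdot 10 \right) = 2.5$$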

 

Example Iteration of Policy Evaluation 2

- 7 discrete states

- 2 actions

Q. How many deterministic policies are there?

Q. Is the highest reward policy for a MDP always unique?
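For the first question, plugging into the count $|A|^{|S|}$ given in the Policy Search section below (my own arithmetic):

$$|A|^{|S|} = 2^7 = 128 \text{ deterministic policies}$$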

 

MDP Control

- Compute the optimal policy

  - $\pi^*(s) = \arg\max_\pi V^\pi(s)$

- There exists a unique optimal value function

- Optimal policy for a MDP in an infinite horizon problem (agent acts forever) is

  - Deterministic

  - Stationary (does not depend on time step)

  - Unique? Not necessarily; may have two policies with identical (optimal) values

 

Policy Search

- One option is to search over policies to compute the best one

- Number of deterministic policies is $|A|^{|S|}$

- Policy iteration is generally more efficient than enumeration

 

MDP Policy Iteration (PI)

- Set $i = 0$

- Initialize $\pi_0(s)$ randomly for all states $s$

- While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ :

  - $V^{\pi_i}$ : MDP $V$ function policy evaluation of $\pi_i$

  - $\pi_{i+1}$ : Policy improvement

  - $i = i + 1$

 

State Action Value Q

1. State-action value of a policy 

  - $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s')$

2. Take action $a$, then follow the policy $\pi$

 

Policy Improvement

1. Compute the state-action value of a policy $\pi_i$

  - For $s \in S$ and $a \in A$

    - $Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi_i}(s')$

2. Compute new policy $\pi_{i+1}$, for all $s \in S$

    - $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$
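Putting evaluation and improvement together, a minimal policy-iteration sketch (same assumed layout P[a, s, s'], R[s, a]; the deterministic policy is stored as one action index per state):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for an MDP stored as P[a, s, s'] and R[s, a]."""
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)                     # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly for the current policy.
        R_pi = R[np.arange(S), policy]
        P_pi = P[policy, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s')
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):          # unchanged policy -> converged
            return policy, V
        policy = new_policy
```

Evaluation here uses the exact linear solve; the iterative sweep shown earlier works just as well.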

 

Example Policy Iteration 

1. If policy doesn't change, can it ever change again?

2. Is there a maximum number of iterations of policy iteration?

 

Bellman Equation and Bellman Backup Operators

- Value function of a policy must satisfy the Bellman Equation 

  - $V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \mid s) V^\pi(s')$

- Bellman backup operator

  - Applied to a value function 

  - Returns a new value function

  - Improves the value if possible

    - $BV(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V(s') \right]$

  - $BV$ yields a value function over all states $s$

 

Value Iteration (VI)

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until convergence: (e.g. $\|V_{k+1} - V_k\|_\infty \le \epsilon$)

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

  - View as Bellman backup on value function

    - $V_{k+1} = B V_k$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$
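A minimal sketch of this loop (same assumed layout as before), which also extracts the greedy policy $\pi_{k+1}$ from the final backup:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration for an MDP stored as P[a, s, s'] and R[s, a]."""
    V = np.zeros(R.shape[0])                           # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Bellman backup B V evaluated at every (s, a)
        V_new = np.max(Q, axis=1)                      # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) <= eps:           # ||V_{k+1} - V_k||_inf <= eps
            return V_new, np.argmax(Q, axis=1)         # value estimate and greedy policy pi_{k+1}
        V = V_new
```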

 

This is starting to make my head spin. I wish I were one of those people who can look at the equations and just get them immediately.

 

Policy Iteration as Bellman Operations

- Bellman backup operator $B^\pi$ for a particular policy is defined as

  - $B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \mid s) V(s')$

- Policy evaluation amounts to computing the fixed point of $B^\pi$

- To do policy evaluation, repeatedly apply the operator until $V$ stops changing

  - $V^\pi = B^\pi B^\pi \cdots B^\pi V$

- To do policy improvement

  - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi_k}(s') \right]$

 

Going Back to Value Iteration (VI)

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until [finite horizon, convergence]

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

  - View as Bellman backup on value function

    - $V_{k+1} = B V_k$

  - To extract the optimal policy if we can act for $k+1$ more steps,

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

 

Value Iteration for Finite Horizon H

$V_k$ = optimal value if making $k$ more decisions

$\pi_k$ = optimal policy if making $k$ more decisions

- Initialize $V_0(s) = 0$ for all states $s$

- For $k = 1 : H$

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$
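A minimal sketch (same assumed layout) that keeps one value function and one greedy policy per number of remaining decisions, which is exactly why the finite-horizon optimal policy can depend on the time step:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Finite-horizon VI: V[k] and pi[k-1] are optimal when k decisions remain.
    MDP stored as P[a, s, s'] and R[s, a]."""
    S = R.shape[0]
    V = [np.zeros(S)]                                     # V_0(s) = 0: no decisions left
    pi = []
    for k in range(H):
        Q = R + gamma * np.einsum('ast,t->sa', P, V[k])   # backup from the k-steps-to-go values
        V.append(np.max(Q, axis=1))                       # V_{k+1}
        pi.append(np.argmax(Q, axis=1))                   # pi_{k+1}: greedy with k+1 steps left
    return V, pi
```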

 

Computing the Value of a Policy in a Finite Horizon

- Alternatively can estimate by simulation

  - Generate a large number of episodes

  - Average returns

  - Concentration inequalities bound how quickly the average concentrates to the expected value

  - Requires no assumption of Markov structure
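A minimal sketch of this simulation-based estimate. The episode sampler here is a hypothetical callable I assume returns a list of rewards; nothing about it needs to be Markov:

```python
import numpy as np

def mc_value_estimate(sample_rewards, start_state, gamma, H, n_episodes=10_000):
    """Monte Carlo estimate of V(start_state): average the discounted return over many episodes.
    `sample_rewards(start_state, H)` is a user-supplied simulator returning H rewards."""
    returns = []
    for _ in range(n_episodes):
        rewards = sample_rewards(start_state, H)
        G = sum(gamma ** t * r for t, r in enumerate(rewards))  # discounted return of one episode
        returns.append(G)
    return np.mean(returns)  # concentrates to E[G_t | s_t = start_state] as n_episodes grows
```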

 

Example of Finite Horizon H

[Figure: the same 7-state Mars-rover chain as in the earlier examples.]

- Reward : +1 in $s_1$, +10 in $s_7$, 0 in all other states

- Sample returns for sample 4-step ($h = 4$) episodes, start state $s_4$, $\gamma = 1/2$

    - $s_4, s_5, s_6, s_7$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 10 = 1.25$

    - $s_4, s_4, s_5, s_4$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 0 = 0$

    - $s_4, s_3, s_2, s_1$ : $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 1 = 0.125$

 

Question : Finite Horizon Policies

1. Set $k = 1$

2. Initialize $V_0(s) = 0$ for all states $s$

3. Loop until $k == H$ :

  - For each state $s$

    - $V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

    - $\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_k(s') \right]$

 

Is the optimal policy stationary (independent of time step) in finite-horizon tasks?

In general, no.

 

Value Iteration vs Policy Iteration

Value Iteration :

  - Compute optimal value for horizon = $k$

    - Note this can be used to compute the optimal policy if horizon = $k$

  - Increment $k$

Policy Iteration :

  - Compute infinite-horizon value of a policy

  - Use to select another (better) policy

  - Closely related to a very popular method in RL : Policy gradient