https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3 https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4
Model-Free Evaluation - Policy Evaluation without knowing how the world works
Learning objectives
- Dynamic programming [class 3]
- Monte Carlo policy Evaluation [class 3]
- Temporal Difference (TD) [class 3]
- \(\epsilon\)-greedy policy [class 4]
1. Dynamic Programming - the model-based baseline: when we do know how the world works (the dynamics and reward model are given)
- If we know the dynamics and reward model, we can do policy evaluation
[Alg] Dynamic Programming
- Initialize \(V^\pi_0(s) = 0\) for all \(s \in S\)
- For \(k = 1, 2, \dots\) until convergence
- \(\forall s \in S, ~ V^\pi_k(s) = r(s,\pi(s))+\gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V^\pi_{k-1}(s')\)
\(V^\pi_k(s)\) is exactly the \(k\)-horizon value of state \(s\) under policy \(\pi\)
\(V^\pi_k(s)\) is an estimate of the infinite-horizon value of state \(s\) under policy \(\pi\)
\(V^\pi(s) = \mathbb {E}_\pi [G_t|s_t=s] \approx \mathbb {E}_\pi[r_t+\gamma V^\pi_{k-1}(s_{t+1})|s_t=s]\)
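The backup above translates directly into a short loop over states. Below is a minimal tabular sketch, assuming the dynamics \(P[s,a,s']\), expected rewards \(R[s,a]\), and a deterministic policy are given as NumPy arrays (the names and the convergence tolerance are illustrative, not from the lecture):

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a known tabular MDP (a sketch).

    P[s, a, s'] : transition probabilities (assumed given)
    R[s, a]     : expected immediate reward (assumed given)
    policy[s]   : deterministic action pi(s)
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) * V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:   # stop once the backup has converged
            return V_new
        V = V_new
```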
2. Monte Carlo
- \(G_t = r_t + \gamma r_{t+1}+ \gamma^2 r_{t+2} + \dots\) in MDP M under policy \(\pi\)
- \(V^\pi (s) = \mathbb{E}_{T\sim \pi}[G_t|s_t=s]\)
- Expectation over trajectories \(T\) generated by following \(\pi\)
- Simple idea : Value = mean return
- If trajectories are all finite, sample set of trajectories & average returns
- Does not require MDP dynamics/rewards; only requires sampled trajectories
- Does not assume state is Markov
- Can only be applied to episodic MDPs
- Averaging over returns from a complete episode
- Requires each episode to terminate
- Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
- \(s_1,a_1,r_1,s_2,a_2,r_2,... \) where the actions are sampled from \(\pi\)
- \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +...\) in MDP M under policy \(\pi\)
- \(V^\pi(s) = \mathbb {E}[G_t|s_t=s]\)
- MC computes empirical mean return
- Often do this in an incremental fashion
- After each episode, update estimate of \(V^\pi\)
[Alg] Monte Carlo
- Initialize \(N(s)=0\) (number of first visits to \(s\)) and \(G(s) = 0\) (total return), \(\forall s \in S\)
- Loop
- Sample episode \(i = s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},...,s_{i,T_i}\)
- Define \(G_{i,t} = r_{i,t} + \gamma r_{i,t+1}+ \gamma^2 r_{i,t+2}+...+\gamma^{T_i-1} r_{i,T_i}\) as the return from time step \(t\) onwards in the \(i\)-th episode
- For each time step \(t\) until the end of episode \(i\)
- If this is the first time \(t\) that state \(s\) is visited in episode \(i\)
- Increment counter of total first visits: \(N(s) = N(s) + 1\)
- Increment total return: \(G(s) = G(s) + G_{i,t}\)
- Update estimate: \(V^\pi(s) = \frac{G(s)}{N(s)}\)
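A minimal sketch of the first-visit MC estimator above, assuming episodes are supplied as lists of (state, reward) pairs generated by following \(\pi\) (the data format and function name are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (a sketch).

    episodes : iterable of episodes, each a list of (state, reward) pairs
               collected under pi (this data format is an assumption).
    Returns V(s) = G(s) / N(s), the empirical mean of first-visit returns.
    """
    N = defaultdict(int)          # N(s): number of first visits to s
    G_sum = defaultdict(float)    # G(s): total first-visit return from s
    V = {}
    for episode in episodes:
        # Compute G_{i,t} for every t by scanning the episode backwards.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # First-visit update: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G_sum[s] += returns[t]
                V[s] = G_sum[s] / N[s]
    return V
```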
3. Temporal Difference (TD)
- A fusion of Monte Carlo and Dynamic Programming!
- Model-free
- Can be used in any setting (episodic or continuing)!
- Immediately updates the estimate of \(V\) after each \((s, a, r, s')\) tuple
- Both bootstraps (like DP) and samples (like MC)
- Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
- \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ ...\) in MDP M under policy \(\pi\)
- \( V^\pi(s) = \mathbb {E}[G_t|s_t=s] \)
- Recall the Bellman operator (applicable if we know the MDP model)
- \(B^{\pi} V(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V(s')\)
- In incremental every-visit MC, update the estimate using one sample of the return (from the current \(i\)-th episode)
- \(V^\pi(s) = V^\pi(s) +\alpha(G_{i, t}-V^\pi (s))\)
- Insight: we already have an estimate of \(V^\pi\); use it to estimate the expected return
- \(V^\pi(s) = V^\pi(s) + \alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s))\)
[Alg] Temporal Difference
- Initialize \(V^\pi(s) = 0, \forall s \in S\)
- Loop
- Sample tuple \((s_t, a_t, r_t, s_{t+1})\)
- \(V^{\pi}(s_t) = V^\pi(s_t) +\alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s_t))\)
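A minimal sketch of the TD(0) loop above. It assumes the environment is exposed through a hypothetical `sample_transition(s, a)` callable returning \((r_t, s_{t+1})\); this interface and the step count are assumptions, not the lecture's code:

```python
def td0_policy_evaluation(sample_transition, n_states, policy,
                          gamma=0.9, alpha=0.1, n_steps=10_000):
    """TD(0) policy evaluation from sampled (s, a, r, s') tuples (a sketch).

    sample_transition : callable (s, a) -> (r, s_next), standing in for the
                        unknown environment (this interface is an assumption).
    policy            : deterministic policy, policy[s] = a
    """
    V = [0.0] * n_states            # V(s) = 0 for all s
    s = 0                           # arbitrary start state
    for _ in range(n_steps):
        a = policy[s]
        r, s_next = sample_transition(s, a)
        # Move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}).
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V
```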
| | Dynamic Programming | Monte Carlo | Temporal Difference |
|---|---|---|---|
| Can use w/out access to true MDP models | | o | o |
| Usable in continuing (non-episodic) setting | o | | o |
| Assumes Markov process | o | | o |
| Converges to true value in limit | o | o | o |
| Unbiased estimate of value | | o | |
| Usable when no models of current domain | | o | o |
4. \(\epsilon\)-greedy policy
A way to balance exploration and exploitation!
Given a state-action value \(Q(s,a)\), the \(\epsilon\)-greedy policy is:
\(\pi(a|s)\): take \(\arg\max\limits_a Q(s,a)\) with probability \(1-\epsilon\), otherwise a uniformly random action; the greedy action therefore has total probability \(1-\epsilon +\frac{\epsilon}{|A|}\) and every other action has probability \(\frac{\epsilon}{|A|}\)
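A minimal sketch of \(\epsilon\)-greedy action selection over a tabular \(Q\), with the explore/exploit split made explicit (the table layout `Q[s][a]` is an assumption):

```python
import random

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Pick an action epsilon-greedily from a state-action value table Q[s][a].

    With probability 1 - epsilon: exploit, i.e. argmax_a Q(s, a).
    With probability epsilon:     explore, i.e. a uniformly random action.
    The greedy action thus has total probability 1 - epsilon + epsilon/|A|,
    every other action has probability epsilon/|A|.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)                  # explore
    return max(range(n_actions), key=lambda a: Q[s][a])     # exploit
```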