https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3 https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4
Model-Free Evaluation - Policy Evaluation without knowing how the world works
Learning objectives
- Dynamic programming [class 3]
- Monte Carlo policy Evaluation [class 3]
- Temporal Difference (TD) [class 3]
- \(\epsilon\)-greedy policy [class 4]
1. Dynamic Programming - the model-based baseline: when we do know how the world works (the dynamics and reward model are given)
- If we know the dynamics and reward model, we can do policy evaluation
[Alg] Dynamic Programming
- Initialize \(V^\pi_0(s) = 0\) for all \(s \in S\)
- For \(k = 1, 2, \dots\) until convergence
- \(\forall s \in S, ~ V^\pi_k(s) = r(s,\pi(s))+\gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V^\pi_{k-1}(s')\)
\(V^\pi_k(s)\) is exactly the \(k\)-horizon value of state \(s\) under policy \(\pi\)
\(V^\pi_k(s)\) is an estimate of the infinite-horizon value of state \(s\) under policy \(\pi\)
\(V^\pi(s) = \mathbb {E}_\pi [G_t|s_t=s] \approx \mathbb {E}_\pi[r_t+\gamma V^\pi_{k-1}(s_{t+1})|s_t=s]\)
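The backup above translates directly into a short loop over states. Below is a minimal tabular sketch, assuming the dynamics \(P[s,a,s']\), expected rewards \(R[s,a]\), and a deterministic policy are given as NumPy arrays (the names and the convergence tolerance are illustrative, not from the lecture):

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a known tabular MDP (a sketch).

    P[s, a, s'] : transition probabilities (assumed given)
    R[s, a]     : expected immediate reward (assumed given)
    policy[s]   : deterministic action pi(s)
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) * V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:   # stop once the backup has converged
            return V_new
        V = V_new
```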
2. Monte Carlo
- \(G_t = r_t + \gamma r_{t+1}+ \gamma^2 r_{t+2} + \dots\) in MDP M under policy \(\pi\)
- \(V^\pi (s) = \mathbb{E}_{T\sim \pi}[G_t|s_t=s]\)
- Expectation over trajectories \(T\) generated by following \(\pi\)
- Simple idea : Value = mean return
- If trajectories are all finite, sample set of trajectories & average returns
- Does not require MDP dynamics/rewards; only requires sampled trajectories
- Does not assume state is Markov
- Can only be applied to episodic MDPs
- Averaging over returns from a complete episode
- Requires each episode to terminate
- Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
- \(s_1,a_1,r_1,s_2,a_2,r_2,... \) where the actions are sampled from \(\pi\)
- \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} +...\) in MDP M under policy \(\pi\)
- \(V^\pi(s) = \mathbb {E}[G_t|s_t=s]\)
- MC computes empirical mean return
- Often do this in an incremental fashion
- After each episode, update estimate of \(V^\pi\)
[Alg] Monte Carlo
- Initialize \(N(s)=0\) (number of first visits to \(s\)) and \(G(s) = 0\) (total return), \(\forall s \in S\)
- Loop
- Sample episode \(i = s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},...,s_{i,T_i}\)
- Define \(G_{i,t} = r_{i,t} + \gamma r_{i,t+1}+ \gamma^2 r_{i,t+2}+...+\gamma^{T_i-1} r_{i,T_i}\) as the return from time step \(t\) onwards in the \(i\)-th episode
- For each time step \(t\) until the end of episode \(i\)
- If this is the first time \(t\) that state \(s\) is visited in episode \(i\)
- Increment counter of total first visits: \(N(s) = N(s) + 1\)
- Increment total return: \(G(s) = G(s) + G_{i,t}\)
- Update estimate: \(V^\pi(s) = \frac{G(s)}{N(s)}\)
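A minimal sketch of the first-visit MC estimator above, assuming episodes are supplied as lists of (state, reward) pairs generated by following \(\pi\) (the data format and function name are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (a sketch).

    episodes : iterable of episodes, each a list of (state, reward) pairs
               collected under pi (this data format is an assumption).
    Returns V(s) = G(s) / N(s), the empirical mean of first-visit returns.
    """
    N = defaultdict(int)          # N(s): number of first visits to s
    G_sum = defaultdict(float)    # G(s): total first-visit return from s
    V = {}
    for episode in episodes:
        # Compute G_{i,t} for every t by scanning the episode backwards.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # First-visit update: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G_sum[s] += returns[t]
                V[s] = G_sum[s] / N[s]
    return V
```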
3. Temporal Difference (TD)
- A fusion of Monte Carlo and Dynamic Programming!
- Model-free
- Can be used in any setting (episodic or continuing)!
- Immediately updates the estimate of \(V\) after each \((s, a, r, s')\) tuple
- Both bootstraps (like DP) and samples (like MC)
- Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
- \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ ...\) in MDP M under policy \(\pi\)
- \( V^\pi(s) = \mathbb {E}[G_t|s_t=s] \)
- Recall the Bellman operator (applicable if we know the MDP model)
- \(B^{\pi} V(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V(s')\)
- In incremental every-visit MC, update the estimate using one sample of the return (from the current \(i\)-th episode)
- \(V^\pi(s) = V^\pi(s) +\alpha(G_{i, t}-V^\pi (s))\)
- Insight: we already have an estimate of \(V^\pi\); use it to estimate the expected return
- \(V^\pi(s) = V^\pi(s) + \alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s))\)
[Alg] Temporal Difference
- Initialize \(V^\pi(s) = 0, \forall s \in S\)
- Loop
- Sample tuple \((s_t, a_t, r_t, s_{t+1})\)
- \(V^{\pi}(s_t) = V^\pi(s_t) +\alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s_t))\)
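A minimal sketch of the TD(0) loop above. It assumes the environment is exposed through a hypothetical `sample_transition(s, a)` callable returning \((r_t, s_{t+1})\); this interface and the step count are assumptions, not the lecture's code:

```python
def td0_policy_evaluation(sample_transition, n_states, policy,
                          gamma=0.9, alpha=0.1, n_steps=10_000):
    """TD(0) policy evaluation from sampled (s, a, r, s') tuples (a sketch).

    sample_transition : callable (s, a) -> (r, s_next), standing in for the
                        unknown environment (this interface is an assumption).
    policy            : deterministic policy, policy[s] = a
    """
    V = [0.0] * n_states            # V(s) = 0 for all s
    s = 0                           # arbitrary start state
    for _ in range(n_steps):
        a = policy[s]
        r, s_next = sample_transition(s, a)
        # Move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}).
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V
```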
| | Dynamic Programming | Monte Carlo | Temporal Difference |
|---|---|---|---|
| Can use w/out access to true MDP models | | o | o |
| Usable in continuing (non-episodic) setting | o | | o |
| Assumes Markov process | o | | o |
| Converges to true value in limit | o | o | o |
| Unbiased estimate of value | | o | |
| Usable when no models of current domain | | o | o |
4. \(\epsilon\)-greedy policy
A way to balance exploration and exploitation!
Given a state-action value \(Q(s,a)\), the \(\epsilon\)-greedy policy is:
\(\pi(a|s)\): take \(\arg\max\limits_a Q(s,a)\) with probability \(1-\epsilon\), otherwise a uniformly random action; the greedy action therefore has total probability \(1-\epsilon +\frac{\epsilon}{|A|}\) and every other action has probability \(\frac{\epsilon}{|A|}\)
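A minimal sketch of \(\epsilon\)-greedy action selection over a tabular \(Q\), with the explore/exploit split made explicit (the table layout `Q[s][a]` is an assumption):

```python
import random

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Pick an action epsilon-greedily from a state-action value table Q[s][a].

    With probability 1 - epsilon: exploit, i.e. argmax_a Q(s, a).
    With probability epsilon:     explore, i.e. a uniformly random action.
    The greedy action thus has total probability 1 - epsilon + epsilon/|A|,
    every other action has probability epsilon/|A|.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)                  # explore
    return max(range(n_actions), key=lambda a: Q[s][a])     # exploit
```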