https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3
https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4
Model Free Evaluation - Policy Evaluation without knowing how the world works
Learning objectives
- Dynamic programming [class 3]
- Monte Carlo policy evaluation [class 3]
- Temporal Difference (TD) [class 3]
- Policy learning / control [class 4]
1. Dynamic Programming - the baseline before the model-free setting, where we don't know how the world works, i.e., the reward/dynamics model is unknown
- If we knew dynamics and reward model, we can do policy evaluation
[Alg] Dynamic Programming (iterative policy evaluation)
- Initialize $V_0^{\pi}(s) = 0$ for all $s$
- For $k = 1, 2, \dots$: for all $s$, $V_k^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) \, V_{k-1}^{\pi}(s')$
- Until convergence (stop when $\max_s |V_k^{\pi}(s) - V_{k-1}^{\pi}(s)|$ is below a small threshold)
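A minimal NumPy sketch of the iterative policy evaluation loop above, assuming a tabular MDP given as a hypothetical transition array `P[s, a, s']`, reward array `R[s, a]`, and deterministic `policy` array (these names are illustrative, not from the lecture):

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.99, tol=1e-6):
    """Iterative policy evaluation with known dynamics P[s, a, s'] and rewards R[s, a].

    policy[s] gives the (deterministic) action taken in state s.
    Returns V^pi as a NumPy array of shape (num_states,).
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)                 # V_0^pi(s) = 0 for all s
    while True:
        V_new = np.empty(num_states)
        for s in range(num_states):
            a = policy[s]
            # Bellman backup: r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) * V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:  # until convergence
            return V_new
        V = V_new
```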
2. Monte Carlo
- $V^{\pi}(s) = \mathbb{E}_{T \sim \pi}[G_t \mid s_t = s]$ in MDP $M$ under policy $\pi$, where $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
- Expectation over trajectories $T$ generated by following $\pi$
- Simple idea : Value = mean return
- If trajectories are all finite, sample set of trajectories & average returns
- Does not require MDP dynamics/rewards; only requires sampled episodes
- Does not assume state is Markov
- Can only be applied to episodic MDPs
- Averaging over returns from a complete episode
- Requires each episode to terminate
- Aim: estimate $V^{\pi}(s)$ given episodes generated under policy $\pi$ (i.e., the actions are sampled from $\pi$) in MDP $M$
- MC computes the empirical mean return, averaged over the sampled episodes
- Often do this in an incremental fashion
- After each episode, update the estimate of $V^{\pi}$
[Alg] Monte Carlo (first-visit)
- Initialize $N(s) = 0$, $G(s) = 0$, $V^{\pi}(s) = 0$ for all $s$ ($N(s)$: number of times state $s$ was visited for the first time in an episode)
- Loop
  - Sample episode $i$: $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \dots, s_{i,T_i}$
  - Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1 - t} r_{i,T_i}$ as the return from time step $t$ onwards in episode $i$
  - For each time step $t$ till the end of episode $i$
    - If this is the first time $t$ that state $s$ is visited in episode $i$
      - Increment counter of total first visits: $N(s) = N(s) + 1$
      - Increment total return: $G(s) = G(s) + G_{i,t}$
      - Update estimate: $V^{\pi}(s) = G(s) / N(s)$
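A minimal sketch of the first-visit MC estimator above, assuming each episode has already been rolled out under $\pi$ and is given as a list of `(state, reward)` pairs (this episode format and the function name are assumptions for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte Carlo policy evaluation.

    episodes: iterable of episodes, each a list of (state, reward) pairs
              obtained by following pi until termination.
    Returns a dict mapping state -> estimated V^pi(state).
    """
    N = defaultdict(int)          # N(s): number of first visits to s
    G_total = defaultdict(float)  # G(s): total return accumulated at first visits
    V = {}

    for episode in episodes:
        # Backward pass: returns[t] = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G

        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:              # only the first visit to s in this episode counts
                continue
            seen.add(s)
            N[s] += 1
            G_total[s] += returns[t]
            V[s] = G_total[s] / N[s]   # V^pi(s) = G(s) / N(s)
    return V
```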
3. Temporal Difference (TD)
- A fusion of Monte Carlo and dynamic programming!
- Model-free
- Can be used in any setting (episodic or continuing)!
- Immediately updates the estimate of $V^{\pi}$ after each tuple $(s_t, a_t, r_t, s_{t+1})$
- Bootstraps and samples (DP-style bootstrapping combined with MC-style sampling)
- Aim: estimate $V^{\pi}(s)$ given episodes generated under policy $\pi$ in MDP $M$
- Recall the Bellman operator (if we know the MDP model): $B^{\pi} V(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) V(s')$
- In incremental every-visit MC, update the estimate using one sample of the return (for the current $i$-th episode): $V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha (G_{i,t} - V^{\pi}(s_t))$
- Insight: we already have an estimate of $V^{\pi}$, so use it to estimate the expected return: $V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha ([r_t + \gamma V^{\pi}(s_{t+1})] - V^{\pi}(s_t))$
[Alg] Temporal Difference - TD(0)
- Initialize $V^{\pi}(s) = 0$ for all $s$
- Loop
  - Sample tuple $(s_t, a_t, r_t, s_{t+1})$
  - Update: $V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha ([r_t + \gamma V^{\pi}(s_{t+1})] - V^{\pi}(s_t))$
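A minimal TD(0) sketch, assuming experience arrives as a stream of `(s, a, r, s_next, done)` tuples collected while acting with $\pi$ (the tuple format, `alpha`, and function name are assumptions, not from the lecture):

```python
from collections import defaultdict

def td0_evaluation(transitions, gamma=0.99, alpha=0.1):
    """TD(0) policy evaluation from a stream of experience tuples.

    transitions: iterable of (s, a, r, s_next, done) tuples generated by
                 acting with the policy pi being evaluated.
    Returns a dict mapping state -> estimated V^pi(state).
    """
    V = defaultdict(float)  # V^pi(s) initialized to 0
    for s, a, r, s_next, done in transitions:
        # TD target bootstraps from the current estimate of the next state's value;
        # if the episode ended, there is no next-state value to bootstrap from.
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # V(s) <- V(s) + alpha * (TD target - V(s))
    return dict(V)
```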
| | Dynamic Programming | Monte Carlo | Temporal Difference |
|---|---|---|---|
| Usable without access to the true MDP models (no model of the current domain) | | o | o |
| Usable in the continuing (non-episodic) setting | o | | o |
| Assumes the state is Markov | o | | o |
| Converges to the true value in the limit (tabular case) | o | o | o |
| Unbiased estimate of the value | | o | |
4. Looking ahead to class 4 (control)
- How to balance exploration and exploitation when choosing actions!
- $Q^{\pi}(s, a)$: a state-action value
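As a tiny preview of how state-action values are commonly used to balance exploration and exploitation, here is an ε-greedy action-selection sketch; the `Q` dictionary, `actions` list, and `epsilon` value are illustrative assumptions, not from these notes:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (pick a random action); otherwise exploit
    the action with the highest estimated state-action value Q[(state, action)]."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```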