
[Reinforcement Learning] CS 234 class 3 & class 4

by 에아오요이가야 2022. 5. 2.

https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3 https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4 

Model Free Evaluation - Policy Evaluation without knowing how the world works

Learning objectives

  1. Dynamic programming [class 3]
  2. Monte Carlo policy Evaluation [class 3]
  3. Temporal Difference (TD) [class 3]
  4. \(\epsilon-greedy\) policy [class 4]

1. Dynamic Programming - the baseline case where we do know how the world works, i.e., the dynamics and reward model are given

  • If we know the dynamics and reward model, we can do policy evaluation

[Alg] Dynamic Programming

  1. Initialize \(V^\pi_0(s) = 0\) for all \(s\)
  2. For \(k=1, 2, \dots\) until convergence
  3. \(\forall s \in S, ~ V^\pi_k(s) = r(s,\pi(s))+\gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V^\pi_{k-1}(s')\)

\(V^\pi_k(s)\) is exactly the \(k\)-horizon value of state \(s\) under policy \(\pi\)

\(V^\pi_k(s)\) is an estimate of the infinite-horizon value of state \(s\) under policy \(\pi\)

\(V^\pi(s) = \mathbb{E}_\pi [G_t|s_t=s] \approx \mathbb{E}_\pi[r_t+\gamma V_{k-1}(s_{t+1})|s_t=s]\)
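
A minimal sketch of this iterative backup, assuming the model is available as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards) with a deterministic policy array `policy[s]`; these names and layouts are illustrative assumptions, not from the lecture:

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a known MDP (hypothetical array layout)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)                      # V^pi_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman backup under the fixed policy pi:
            # V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:     # stop once the backup has converged
            return V_new
        V = V_new
```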

 

2. Monte Carlo

  • \(G_t = r_t + \gamma r_{t+1}+ \gamma^2 r_{t+2} + \dots\) in MDP M under policy \(\pi\) 
  • \(V^\pi (s) = \mathbb{E}_{T\sim \pi}[G_t|s_t=s]\)
    • Expectation over trajectories \(T\) generated by following \(\pi\)
  • Simple idea : Value = mean return
  • If trajectories are all finite, sample set of trajectories & average returns
  • Does not require MDP dynamics/rewards -> requires some samples
  • Does not assume state is Markov
  • Can only be applied to episodic MDPs
    • Averaging over returns from a complete episode
    • Requires each episode to terminate
  • Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
    • \(s_1,a_1,r_1,s_2,a_2,r_2,... \) where the actions are sampled from \(\pi\)
  • MC computes empirical mean return
  • Often do this in an incremental fashion
    • After each episode, update estimate of \(V^\pi\) 

[Alg] Monte Carlo

  1. Initialize \(N(s)=0\) (number of first visits to \(s\)) and \(G(s) = 0\), \(\forall s \in S\)
  2. Loop
    1. Sample episode \(i = s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},...,s_{i,T_i}\) 
    2. Define \(G_{i,t} = r_{i,t} + \gamma r_{i,t+1}+ \gamma^2 r_{i,t+2}+...+\gamma^{T_i-1} r_{i,T_i}\) as the return from time step \(t\) onwards in the \(i\)-th episode
    3. For each time step \(t\) till the end of episode \(i\)
      1. If this is the first time \(t\) that state \(s\) is visited in episode \(i\)
        1. Increment counter of total first visits : \(N(s) = N(s) +1\) 
        2. Increment total return \(G(s) = G(s)+G_{i,t}\) 
        3. Update estimate \(V^\pi(s) = \frac{G(s)}{N(s)}\)
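
A minimal sketch of this first-visit Monte Carlo estimator, assuming each episode has already been collected under \(\pi\) as a list of `(state, action, reward)` tuples; the data format and names are assumptions for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation: V^pi(s) = G(s) / N(s)."""
    N = defaultdict(int)        # number of first visits to s
    G_sum = defaultdict(float)  # total return accumulated over first visits

    for episode in episodes:
        # Compute G_t for every time step by scanning the episode backwards.
        returns, G = [], 0.0
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        visited = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in visited:        # first visit to s in this episode
                visited.add(s)
                N[s] += 1
                G_sum[s] += returns[t]

    return {s: G_sum[s] / N[s] for s in N}
```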

3. Temporal Difference (TD)

  • A combination of Monte Carlo and Dynamic Programming!
  • Model-free
  • Usable in any setting, episodic or continuing!
  • Immediately updates the estimate of \(V\) after each \((s, a, r, s')\) tuple
  • Both samples (like MC) and bootstraps (like DP)
  • Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
  • \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ ...\) in MDP M under policy \(\pi\) 
  • \( V^\pi(s) = \mathbb {E}[G_t|s_t=s] \)
  • Recall Bellman operator (if know MDP models)
    • \(B^{\pi} V(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V(s')\)
  • In incremental every-visit MC, update the estimate using one sample of the return (for the current \(i\)-th episode)
    • \(V^\pi(s) = V^\pi(s) +\alpha(G_{i, t}-V^\pi (s))\)
  • Insight: we already have an estimate of \(V^\pi\); use it to estimate the expected return
    • \(V^\pi(s) = V^\pi(s) + \alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s))\)

[Alg] Temporal Difference

  1. Initialize \(V^\pi(s) = 0, \forall s \in S\)
  2. Loop
    1. Sample tuple \((s_t, a_t, r_t, s_{t+1})\)
    2. \(V^{\pi}(s_t) = V^\pi(s_t) +\alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s_t))\)
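
A minimal sketch of the TD(0) loop above, assuming the experience is available as episodes of `(s, a, r, s')` tuples generated under \(\pi\) (the data format is an assumption):

```python
from collections import defaultdict

def td0_policy_evaluation(episodes, gamma=0.9, alpha=0.1):
    """TD(0) policy evaluation from sampled (s, a, r, s') tuples."""
    V = defaultdict(float)                  # V^pi(s) initialized to 0 for all s
    for episode in episodes:
        for (s, _, r, s_next) in episode:
            td_target = r + gamma * V[s_next]
            # Move the current estimate a step alpha toward the TD target.
            V[s] += alpha * (td_target - V[s])
    return dict(V)
```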

| Property | Dynamic Programming | Monte Carlo | Temporal Difference |
| --- | --- | --- | --- |
| Can use w/out access to true MDP models | o | o | o |
| Usable in continuing (non-episodic) setting | o |  | o |
| Assumes Markov process | o |  | o |
| Converges to true value in limit | o | o | o |
| Unbiased estimate of value |  | o |  |
| Usable when no models of current domain |  | o | o |

 

4. \( \epsilon-greedy\) policy

 

A method for balancing exploration and exploitation!

 

Given a state-action value \(Q(s,a)\), the \(\epsilon\)-greedy policy is:

\(\pi(a|s) = \arg\max\limits_a Q(s,a)\) with probability \(1-\epsilon +\frac{\epsilon}{|A|}\); otherwise a random action, each taken with probability \(\frac{\epsilon}{|A|}\)
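
A minimal sketch of \(\epsilon\)-greedy action selection, assuming a tabular `Q` array of shape `(n_states, n_actions)`; the array layout and names are assumptions:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    """Pick an action epsilon-greedily w.r.t. a tabular state-action value Q."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: uniform random action, probability epsilon / |A| for each action.
        return int(rng.integers(n_actions))
    # Exploit: greedy action, so its total probability is 1 - epsilon + epsilon / |A|.
    return int(np.argmax(Q[state]))
```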
