
[Reinforcement Learning] CS 234 class 3 & class 4

by 에아오요이가야 2022. 5. 2.

https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3
https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4

Model-Free Evaluation - Policy Evaluation without knowing how the world works

Learning objectives

  1. Dynamic Programming [class 3]
  2. Monte Carlo Policy Evaluation [class 3]
  3. Temporal Difference (TD) [class 3]
  4. $\epsilon$-greedy policy [class 4]

1. Dynamic Programming - the case where we do know how the world works (the dynamics and reward model are given)

  • If we know the dynamics and reward model, we can do policy evaluation

[Alg] Dynamic Programming

  1. Initialize $V_0^\pi(s) = 0$ for all $s$
  2. For $k = 1$ until convergence
  3. For all $s \in S$: $V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s)) V_{k-1}^\pi(s')$

$V_k^\pi(s)$ is exactly the $k$-horizon value of state $s$ under policy $\pi$

$V_k^\pi(s)$ is an estimate of the infinite-horizon value of state $s$ under policy $\pi$

$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s] \approx \mathbb{E}[r_t + \gamma V_{k-1} \mid s_t = s]$
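
Below is a minimal Python sketch of the iterative policy evaluation update above. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for rewards, `pi[s]` for a deterministic policy) is an assumed representation, not something fixed by the lecture.

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a deterministic policy pi.

    P[s, a, s']: transition probability p(s' | s, a)  (assumed layout)
    R[s, a]:     expected immediate reward r(s, a)    (assumed layout)
    pi[s]:       action chosen by the policy in state s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                   # V_0^pi(s) = 0 for all s
    while True:
        # V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) * V_{k-1}(s')
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:  # stop once the values stop changing
            return V_new
        V = V_new
```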

 

2. Monte Carlo

  • $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ in MDP $M$ under policy $\pi$
  • $V^\pi(s) = \mathbb{E}_{T \sim \pi}[G_t \mid s_t = s]$
    • Expectation over trajectories $T$ generated by following $\pi$
  • Simple idea : Value = mean return
  • If trajectories are all finite, sample set of trajectories & average returns
  • Does not require MDP dynamics/rewards; it only needs sampled episodes
  • Does not assume state is Markov
  • Can only be applied to episodic MDPs
    • Averaging over returns from a complete episode
    • Requires each episode to terminate
  • Aim: estimate $V^\pi(s)$ given episodes generated under policy $\pi$
    • $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$ where the actions are sampled from $\pi$
  • $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ in MDP $M$ under policy $\pi$
  • $V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]$
  • MC computes the empirical mean return
  • Often done in an incremental fashion
    • After each episode, update the estimate of $V^\pi$

[Alg] Monte Carlo

  1. Initialize $N(s) = 0$ (number of times $s$ has been visited) and $G(s) = 0$, $\forall s \in S$
  2. Loop
    1. Sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
    2. Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode
    3. For each time step $t$ until the end of episode $i$:
      1. If this is the first time $t$ that state $s$ is visited in episode $i$:
        1. Increment counter of total first visits: $N(s) = N(s) + 1$
        2. Increment total return: $G(s) = G(s) + G_{i,t}$
        3. Update estimate: $V^\pi(s) = G(s) / N(s)$
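
A rough Python sketch of the first-visit Monte Carlo estimator above. The `sample_episode(pi)` function is a hypothetical helper assumed to roll out one finite episode under $\pi$ and return its states and per-step rewards.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, pi, gamma=0.9, n_episodes=1000):
    """First-visit Monte Carlo policy evaluation.

    sample_episode(pi) is assumed to return (states, rewards) for one
    finite episode generated by following policy pi.
    """
    N = defaultdict(int)          # N(s): number of first visits to s
    G_sum = defaultdict(float)    # G(s): total first-visit return from s
    V = defaultdict(float)        # current estimate of V^pi(s)

    for _ in range(n_episodes):
        states, rewards = sample_episode(pi)
        # Compute G_{i,t} = r_t + gamma r_{t+1} + ... for every t in one backward pass
        returns, G = [0.0] * len(states), 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        visited = set()
        for t, s in enumerate(states):
            if s not in visited:          # first visit of s in this episode
                visited.add(s)
                N[s] += 1
                G_sum[s] += returns[t]
                V[s] = G_sum[s] / N[s]    # V^pi(s) = G(s) / N(s)
    return V
```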

3. Temporal Difference (TD)

  • A fusion of Monte Carlo and Dynamic Programming!
  • Model-free
  • Can be used in any setting!
  • Immediately updates the estimate of $V$ after each $(s, a, r, s')$ tuple
  • Bootstraps (like DP) and samples (like MC)
  • Aim: estimate $V^\pi(s)$ given episodes generated under policy $\pi$
  • $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ in MDP $M$ under policy $\pi$
  • $V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]$
  • Recall the Bellman operator (usable if the MDP model is known)
    • $B^\pi V(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s)) V(s')$
  • In incremental every-visit MC, update the estimate using one sample of the return (from the current $i$-th episode)
    • $V^\pi(s) = V^\pi(s) + \alpha (G_{i,t} - V^\pi(s))$
  • Insight: we already have an estimate of $V^\pi$, so use it to estimate the expected return
    • $V^\pi(s) = V^\pi(s) + \alpha ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s))$

[Alg] Temporal Difference

  1. Initialize $V^\pi(s) = 0$, $\forall s \in S$
  2. Loop
    1. Sample tuple $(s_t, a_t, r_t, s_{t+1})$
    2. $V^\pi(s_t) = V^\pi(s_t) + \alpha ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$
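
A minimal Python sketch of the TD(0) update above. The `env_reset` / `env_step` interface is an assumed stand-in for whatever environment produces the $(s, a, r, s')$ tuples.

```python
from collections import defaultdict

def td0_policy_evaluation(env_reset, env_step, pi, gamma=0.9, alpha=0.1,
                          n_episodes=1000):
    """TD(0) policy evaluation: update V after every (s, a, r, s') tuple.

    Assumed interface:
        env_reset()    -> initial state
        env_step(s, a) -> (reward, next_state, done)
        pi(s)          -> action
    """
    V = defaultdict(float)                  # V^pi(s) = 0 for all s
    for _ in range(n_episodes):
        s, done = env_reset(), False
        while not done:
            a = pi(s)
            r, s_next, done = env_step(s, a)
            target = r if done else r + gamma * V[s_next]  # TD target
            V[s] += alpha * (target - V[s])                # TD(0) update
            s = s_next
    return V
```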

| | Dynamic Programming | Monte Carlo | Temporal Difference |
|---|---|---|---|
| Can use w/out access to true MDP models | | o | o |
| Usable in continuing (non-episodic) setting | o | | o |
| Assumes Markov process | o | | o |
| Converges to true value in limit | o | o | o |
| Unbiased estimate of value | | o | |
| Usable when no models of current domain | | o | o |

 

4. $\epsilon$-greedy policy

 

A way to balance exploration and exploitation!

 

Given a state-action value function $Q(s, a)$, the $\epsilon$-greedy policy is

$\pi(a \mid s) = \arg\max_a Q(s, a)$ with probability $1 - \epsilon + \frac{\epsilon}{|A|}$; otherwise a random action (each with probability $\frac{\epsilon}{|A|}$)
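
A small Python sketch of $\epsilon$-greedy action selection over a table of Q-values; the dictionary keyed by `(state, action)` is just an assumed representation.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick argmax_a Q(s, a) with probability 1 - epsilon, otherwise a
    uniformly random action, so the greedy action ends up chosen with
    total probability 1 - epsilon + epsilon / |A|."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit
```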