
[Reinforcement Learning] CS 234 class 3 & class 4

by 에아오요이가야 2022. 5. 2.

https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=3 https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=4 

Model Free Evaluation - Policy Evaluation without knowing how the world works

Learning objectives

  1. Dynamic programming [class 3]
  2. Monte Carlo policy Evaluation [class 3]
  3. Temporal Difference (TD) [class 3]
  4. \(\epsilon-greedy\) policy [class 4]

1. Dynamic Programming - the baseline case where we do know how the world works, i.e., the dynamics and reward model are given

  • If we know the dynamics and reward model, we can do policy evaluation

[Alg] Dynamic Programming

  1. Initialize \(V^\pi_0(s) = 0\) for all \(s\)
  2. For \(k=1, 2, \dots\) until convergence
  3. \(\forall s \in S, ~ V^\pi_k(s) = r(s,\pi(s))+\gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V^\pi_{k-1}(s')\)

\(V^\pi_k(s)\) is exactly the \(k\)-horizon value of state \(s\) under policy \(\pi\)

\(V^\pi_k(s)\) is an estimate of the infinite-horizon value of state \(s\) under policy \(\pi\)

\(V^\pi(s) = \mathbb{E}_\pi [G_t|s_t=s] \approx \mathbb{E}_\pi[r_t+\gamma V_{k-1}(s_{t+1})|s_t=s]\)
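
A minimal sketch of this iterative backup, assuming the model is available as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards) with a deterministic policy array `policy[s]`; these names and layouts are illustrative assumptions, not from the lecture:

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a known MDP (hypothetical array layout)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)                      # V^pi_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman backup under the fixed policy pi:
            # V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:     # stop once the backup has converged
            return V_new
        V = V_new
```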

 

2. Monte Carlo

  • \(G_t = r_t + \gamma r_{t+1}+ \gamma^2 r_{t+2} + \dots\) in MDP M under policy \(\pi\) 
  • \(V^\pi (s) = \mathbb{E}_{T\sim \pi}[G_t|s_t=s]\)
    • Expectation over trajectories \(T\) generated by following \(\pi\)
  • Simple idea : Value = mean return
  • If trajectories are all finite, sample set of trajectories & average returns
  • Does not require MDP dynamics/rewards -> requires some samples
  • Does not assume state is Markov
  • Can only be applied to episodic MDPs
    • Averaging over returns from a complete episode
    • Requires each episode to terminate
  • Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
    • \(s_1,a_1,r_1,s_2,a_2,r_2,... \) where the actions are sampled from \(\pi\)
  • MC computes empirical mean return
  • Often do this in an incremental fashion
    • After each episode, update estimate of \(V^\pi\) 

[Alg] Monte Carlo

  1. Initialize \(N(s)=0\) (number of first visits to \(s\)) and \(G(s) = 0\), \(\forall s \in S\)
  2. Loop
    1. Sample episode \(i = s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},...,s_{i,T_i}\) 
    2. Define \(G_{i,t} = r_{i,t} + \gamma r_{i,t+1}+ \gamma^2 r_{i,t+2}+...+\gamma^{T_i-1} r_{i,T_i}\) as the return from time step \(t\) onwards in the \(i\)-th episode
    3. For each time step \(t\) till the end of episode \(i\)
      1. If this is the first time \(t\) that state \(s\) is visited in episode \(i\)
        1. Increment counter of total first visits : \(N(s) = N(s) +1\) 
        2. Increment total return \(G(s) = G(s)+G_{i,t}\) 
        3. Update estimate \(V^\pi(s) = \frac{G(s)}{N(s)}\)
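
A minimal sketch of this first-visit Monte Carlo estimator, assuming each episode has already been collected under \(\pi\) as a list of `(state, action, reward)` tuples; the data format and names are assumptions for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation: V^pi(s) = G(s) / N(s)."""
    N = defaultdict(int)        # number of first visits to s
    G_sum = defaultdict(float)  # total return accumulated over first visits

    for episode in episodes:
        # Compute G_t for every time step by scanning the episode backwards.
        returns, G = [], 0.0
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        visited = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in visited:        # first visit to s in this episode
                visited.add(s)
                N[s] += 1
                G_sum[s] += returns[t]

    return {s: G_sum[s] / N[s] for s in N}
```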

3. Temporal Difference (TD)

  • A combination of Monte Carlo and Dynamic Programming!
  • Model-free
  • Usable in any setting, episodic or continuing!
  • Immediately updates the estimate of \(V\) after each \((s, a, r, s')\) tuple
  • Both samples (like MC) and bootstraps (like DP)
  • Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\)
  • \(G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ ...\) in MDP M under policy \(\pi\) 
  • \( V^\pi(s) = \mathbb {E}[G_t|s_t=s] \)
  • Recall Bellman operator (if know MDP models)
    • \(B^{\pi} V(s) = r(s,\pi(s)) + \gamma \sum\limits_{s'\in S} p(s'|s,\pi(s))V(s')\)
  • In incremental every-visit MC, update the estimate using one sample of the return (for the current \(i\)-th episode)
    • \(V^\pi(s) = V^\pi(s) +\alpha(G_{i, t}-V^\pi (s))\)
  • Insight: we already have an estimate of \(V^\pi\); use it to estimate the expected return
    • \(V^\pi(s) = V^\pi(s) + \alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s))\)

[Alg] Temporal Difference

  1. Initialize \(V^\pi(s) = 0, \forall s \in S\)
  2. Loop
    1. Sample tuple \((s_t, a_t, r_t, s_{t+1})\)
    2. \(V^{\pi}(s_t) = V^\pi(s_t) +\alpha([r_t+\gamma V^\pi(s_{t+1})]-V^\pi(s_t))\)
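
A minimal sketch of the TD(0) loop above, assuming the experience is available as episodes of `(s, a, r, s')` tuples generated under \(\pi\) (the data format is an assumption):

```python
from collections import defaultdict

def td0_policy_evaluation(episodes, gamma=0.9, alpha=0.1):
    """TD(0) policy evaluation from sampled (s, a, r, s') tuples."""
    V = defaultdict(float)                  # V^pi(s) initialized to 0 for all s
    for episode in episodes:
        for (s, _, r, s_next) in episode:
            td_target = r + gamma * V[s_next]
            # Move the current estimate a step alpha toward the TD target.
            V[s] += alpha * (td_target - V[s])
    return dict(V)
```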

| Property | Dynamic Programming | Monte Carlo | Temporal Difference |
| --- | --- | --- | --- |
| Can use w/out access to true MDP models | o | o | o |
| Usable in continuing (non-episodic) setting | o |  | o |
| Assumes Markov process | o |  | o |
| Converges to true value in limit | o | o | o |
| Unbiased estimate of value |  | o |  |
| Usable when no models of current domain |  | o | o |

 

4. \( \epsilon-greedy\) policy

 

A method for balancing exploration and exploitation!

 

Given a state-action value \(Q(s,a)\), the \(\epsilon\)-greedy policy is:

\(\pi(a|s) = \arg\max\limits_a Q(s,a)\) with probability \(1-\epsilon +\frac{\epsilon}{|A|}\); otherwise a random action, each taken with probability \(\frac{\epsilon}{|A|}\)
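
A minimal sketch of \(\epsilon\)-greedy action selection, assuming a tabular `Q` array of shape `(n_states, n_actions)`; the array layout and names are assumptions:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    """Pick an action epsilon-greedily w.r.t. a tabular state-action value Q."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: uniform random action, probability epsilon / |A| for each action.
        return int(rng.integers(n_actions))
    # Exploit: greedy action, so its total probability is 1 - epsilon + epsilon / |A|.
    return int(np.argmax(Q[state]))
```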
