
[Reinforcement Learning] CS234 Lecture 1

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=1 

 

Reinforcement learning = learning to make good sequences of decisions under uncertainty.

 

Reinforcement learning involves five categories:

1. Optimization - the goal is to find an optimal way of acting that yields the best outcome

2. Delayed Consequences - balancing the immediate benefit of a decision (action) against its longer-term benefit

3. Generalization

4. Learns from experience [Exploitation] - the benefit obtained from actions whose outcomes are already known

5. Exploration - the benefit obtained from random (previously untried) actions

 

 

Differences between reinforcement learning and other AI approaches

|  | AI Planning | Supervised Learning | Unsupervised Learning | Imitation Learning | Reinforcement Learning |
|---|---|---|---|---|---|
| Optimization | O |  |  | O | O |
| Delayed Consequences | O |  |  | O | O |
| Generalization | O | O | O | O | O |
| Exploitation |  | O | O | O | O |
| Exploration |  |  |  |  | O |

 

Learning Objectives

1. The key difference between RL and other ML approaches -> as the table above shows, it is Exploration

2. Define application problems as RL problems in the form [State, Action, Dynamics, Reward] and determine which algorithm is a good fit

3. Understand and apply criteria for evaluating RL algorithms, such as [Regret, Sample Complexity, Computational Complexity, Empirical Performance, Convergence]

4. Compare and analyze Exploration vs. Exploitation from the perspectives of [Performance, Scalability, Complexity of Implementation, Theoretical Guarantees]

 

|  | Exploitation | Exploration |
|---|---|---|
| Movies | Watch a favorite movie you've already seen | Watch a new movie |
| Advertising | Show the most effective ad so far | Show a different ad |
| Driving | Try the fastest route given prior experience | Try a different route |

Example 1.

- A student does not know addition or subtraction.

- An AI tutor gives the student addition and subtraction practice problems.

- The student receives a reward of +1 for a correct answer and -1 for an incorrect one.

Model this process in the form [state, action, reward]. Which policy optimizes the expected discounted sum of rewards?

 

Example 2.

- A student does not know addition or subtraction.

- An AI tutor gives the student addition and subtraction activities.

- The student receives a reward of +1 for a correct answer and -1 for an incorrect one.

Model this process in the form [state, action, reward]. Is this the best we can do?
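
Below is a minimal sketch of one possible [state, action, reward] formulation for the tutor examples above; the state encoding, the problem types, and the mastery-update rule are my own illustrative assumptions, not something given in the lecture.

```python
import random

# Hypothetical formulation (illustration only):
# state  = the tutor's estimate of the student's mastery of each skill
# action = which type of problem to pose
# reward = +1 if the student answers correctly, -1 otherwise

ACTIONS = ["give_addition_problem", "give_subtraction_problem"]

def step(state, action):
    """One interaction: pose a problem, observe correctness, return reward."""
    skill = "addition" if action == "give_addition_problem" else "subtraction"
    correct = random.random() < state[skill]              # chance of a correct answer
    reward = 1 if correct else -1                         # reward as specified above
    next_state = dict(state)
    next_state[skill] = min(1.0, state[skill] + 0.1)      # practice improves mastery (assumption)
    return next_state, reward

state = {"addition": 0.2, "subtraction": 0.2}             # initially low mastery
state, r = step(state, "give_addition_problem")
print(state, r)
```

Note that the reward depends only on correctness, so a reward-maximizing tutor could keep posing whichever problems the student already answers well; this is presumably what the "is this the best?" question is getting at.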

 

Markov Assumption - the foundation of decision making

- Information state : a sufficient statistic of the history

- A state s_t is Markov <=> p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)

- The future is independent of the past given the present

 

Why is the Markov assumption so popular?

1. Can always be satisfied

  - Setting the state to be the full history is always Markov: s_t = h_t

2. In practice, we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t

3. State representation has big implications for:

  - Computational complexity

  - Data required

  - Resulting performance

 

Q&A - If the entire history can be stored, any decision-making problem can be made Markov (by treating the history itself as the state).
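
As a small illustration of the point above (my own sketch, not from the lecture), here are the two state representations side by side: the full history as the state, which is always Markov but grows with time, versus only the most recent observation, which is cheap but may discard relevant information.

```python
from typing import List, Tuple

History = List[Tuple[str, float]]  # list of (observation, reward) pairs

def state_full_history(history: History) -> tuple:
    """s_t = h_t : the state is the entire history (always Markov)."""
    return tuple(history)

def state_last_observation(history: History) -> str:
    """s_t = o_t : the state is only the most recent observation."""
    return history[-1][0] if history else "start"

history: History = [("o1", 0.0), ("o2", 1.0), ("o3", 0.0)]
print(state_full_history(history))      # grows with t -> higher computational/data cost
print(state_last_observation(history))  # fixed size -> may lose information
```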

 

Bandit - a simple example of an MDP (Markov Decision Process)

The current action does not influence what the agent faces next -> no delayed rewards
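
To make the exploration/exploitation trade-off concrete, here is a minimal ε-greedy sketch for a Bernoulli bandit; the arm probabilities, ε, and the number of pulls are assumptions chosen purely for illustration.

```python
import random

# Epsilon-greedy agent for a 3-armed Bernoulli bandit.
# Each pull is an independent decision: the chosen arm does not change
# what the agent faces next, which is why bandits have no delayed consequences.

TRUE_PROBS = [0.3, 0.5, 0.7]      # unknown to the agent (assumed for the demo)
EPSILON = 0.1                     # fraction of pulls spent exploring at random

counts = [0] * len(TRUE_PROBS)    # pulls per arm
values = [0.0] * len(TRUE_PROBS)  # running mean reward per arm

for t in range(10_000):
    if random.random() < EPSILON:                              # explore: random arm
        arm = random.randrange(len(TRUE_PROBS))
    else:                                                      # exploit: best arm so far
        arm = max(range(len(TRUE_PROBS)), key=lambda a: values[a])
    reward = 1.0 if random.random() < TRUE_PROBS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]        # incremental mean update

print(counts)   # most pulls should concentrate on the best arm (index 2)
print(values)   # estimates should be close to TRUE_PROBS
```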

 

RL algorithm components often include one or more of: Model, Policy, Value Function

- Model : mathematical model of the dynamics and the reward

- Policy : function mapping the agent's state to an action

- Value Function : expected future rewards from being in a state and/or taking an action when following a particular policy

 

Example of a Mars Rover Stochastic Markov Model

s_n denotes a state

r denotes the reward received in that state

| s1 | s2 | s3 | s4 | s5 | s6 | s7 |
|---|---|---|---|---|---|---|
| r=1 | r=0 | r=0 | r=0 | r=0 | r=0 | r=10 |

Part of agent's transition model:

  - P(s1 | s1, Right) = P(s2 | s1, Right) = 0.5

  - P(s2 | s2, Right) = P(s3 | s2, Right) = 0.5 ...

Model may be wrong
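
Here is one way the rover's transition and reward model could be written down in code. Only the s1 and s2 rows are given above; extending the same stay-0.5/move-right-0.5 pattern to the remaining states is my assumption for the sake of a runnable sketch.

```python
import random

STATES = [f"s{i}" for i in range(1, 8)]
REWARD = {"s1": 1, "s7": 10}          # all other states give reward 0

# P[(state, action)] -> {next_state: probability}
P = {}
for i, s in enumerate(STATES):
    right = STATES[min(i + 1, len(STATES) - 1)]
    P[(s, "Right")] = {s: 0.5, right: 0.5} if right != s else {s: 1.0}

def sample_next_state(state: str, action: str) -> str:
    """Draw the next state from the (possibly wrong) transition model."""
    dist = P[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def reward(state: str) -> float:
    return REWARD.get(state, 0)

print(P[("s1", "Right")])                       # {'s1': 0.5, 's2': 0.5}
print(sample_next_state("s4", "Right"), reward("s7"))
```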

 

Policy π determines how the agent chooses actions

π : S → A, a mapping from states to actions

Deterministic policy : π(s) = a

Stochastic policy : π(a|s) = Pr(a_t = a | s_t = s)
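
A tiny sketch of the two kinds of policy for the rover states; the particular action set and probabilities are made up for illustration.

```python
import random

# Deterministic policy: a plain lookup table, pi(s) = a.
deterministic_pi = {f"s{i}": "Right" for i in range(1, 8)}

def act_deterministic(state: str) -> str:
    return deterministic_pi[state]

# Stochastic policy: a distribution over actions per state, pi(a|s) = Pr(a_t = a | s_t = s).
stochastic_pi = {f"s{i}": {"Left": 0.2, "Right": 0.8} for i in range(1, 8)}

def act_stochastic(state: str) -> str:
    dist = stochastic_pi[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(act_deterministic("s4"))  # always "Right"
print(act_stochastic("s4"))     # "Right" about 80% of the time
```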

 

Quick Question :

If the rover is in s4 and π(s1) = π(s2) = ... = π(s7) = Right,

then is this deterministic or stochastic policy?

 

Value Function V^π : the expected discounted sum of future rewards under a particular policy π

V^π(s_t = s) = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s]

It can be used to quantify the goodness/badness of states and actions, and to decide how to act by comparing policies.
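
As a sketch of what this definition means operationally, the value of a state can be estimated by averaging discounted returns over sampled rollouts. The policy here is "always Right", the transition pattern repeats the stay-0.5/move-right-0.5 rule from the rover example, and γ and the rollout count are assumptions for illustration.

```python
import random

STATES = [f"s{i}" for i in range(1, 8)]
REWARD = {"s1": 1, "s7": 10}
GAMMA = 0.9      # discount factor (assumed for the demo)
HORIZON = 50     # truncate rollouts; gamma**50 is already negligible

def next_state(state: str) -> str:
    """Under the policy pi(s) = Right: stay or move right, each with prob 0.5."""
    i = STATES.index(state)
    return random.choice([state, STATES[min(i + 1, len(STATES) - 1)]])

def value_estimate(start: str, n_rollouts: int = 5_000) -> float:
    """Monte Carlo estimate of V^pi(start) = E[r_t + gamma*r_{t+1} + ...]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, discount = start, 1.0
        for _ in range(HORIZON):
            total += discount * REWARD.get(s, 0)
            discount *= GAMMA
            s = next_state(s)
    return total / n_rollouts

print({s: round(value_estimate(s), 2) for s in STATES})
```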

 

 

Key Challenges in learning to make sequence of good decisions

1. AI Planning (agent's internal computation)

  - Given model of how the world works : dynamics and reward model

  - Algorithm computes how to act in order to maximize expected reward : with no interaction with environment

2. Reinforcement learning

  - Agent doesn't know how world works

  - Interactions with world to implicitly/explicitly learn how world works

  - Agent improves policy (may involve planning)

 

Evaluation : Estimate/predict the expected rewards from following a given policy

Control : find the best policy (optimization)

 

Evaluation Example 

| s1 | s2 | s3 | s4 | s5 | s6 | s7 |
|---|---|---|---|---|---|---|
| Right | Right | Right | Right | Right | Right | Right |

- π(s1) = π(s2) = ... = π(s7) = Right

- γ = 0

- What is the value of this policy?

 

Answer 

- First, the value function is V^π(s_t = s) = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s]

- With γ = 0, every term after r_t vanishes, so V^π(s_t = s) = r(s): the value is 1 in s1, 10 in s7, and 0 in every other state.
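
The same answer can be checked with exact policy evaluation: V^π solves the linear system V = r + γ P V, where P is the transition matrix under the "always Right" policy (built here from the stay-0.5/move-right-0.5 pattern, which is my assumption for the states not listed in the excerpt).

```python
import numpy as np

n = 7
r = np.zeros(n)
r[0], r[6] = 1.0, 10.0                  # r(s1) = 1, r(s7) = 10, 0 elsewhere

# Transition matrix under pi(s) = Right: stay / move right with prob 0.5 each;
# the last state s7 is absorbing.
P = np.zeros((n, n))
for i in range(n):
    j = min(i + 1, n - 1)
    P[i, i] += 0.5
    P[i, j] += 0.5

for gamma in (0.0, 0.9):
    V = np.linalg.solve(np.eye(n) - gamma * P, r)   # solve (I - gamma*P) V = r
    print(gamma, np.round(V, 2))

# With gamma = 0 the system reduces to V = r, matching the answer above:
# V = [1, 0, 0, 0, 0, 0, 10].
```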