
[Reinforcement Learning] CS234 Class 1

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=1 

 

Reinforcement learning = learning to make good sequences of decisions under uncertainty.

 

Reinforcement learning involves the following five categories.

1. Optimization - the goal is to find an optimal way of acting that yields the best outcome

2. Delayed Consequences - balancing the immediate benefit of a decision (action) against its longer-term benefit

3. Generalization

4. Learns from experience [Exploitation] - benefit obtained from actions the agent already knows about

5. Exploration - benefit obtained from random (previously untried) actions

 

 

Differences between reinforcement learning and other AI approaches

|                      | AI Planning | Supervised Learning | Unsupervised Learning | Imitation Learning | Reinforcement Learning |
| -------------------- | ----------- | ------------------- | --------------------- | ------------------ | ---------------------- |
| Optimization         | O           |                     |                       | O                  | O                      |
| Delayed Consequences | O           |                     |                       | O                  | O                      |
| Generalization       | O           | O                   | O                     | O                  | O                      |
| Exploitation         |             | O                   | O                     | O                  | O                      |
| Exploration          |             |                     |                       |                    | O                      |

 

Learning goals

1. The characteristic difference between RL and other ML approaches -> as the table above shows, it is Exploration

2. Formulate application problems as RL problems in the [State, Action, Dynamics, Reward] form and decide which algorithm is best suited to them

3. Understand and apply the criteria used to evaluate RL algorithms, such as [Regret, Sample Complexity, Computational Complexity, Empirical Performance, Convergence]

4. Compare and analyze Exploration and Exploitation in terms of [Performance, Scalability, Complexity of Implementation, Theoretical Guarantees]

 

|             | Exploitation                                 | Exploration           |
| ----------- | -------------------------------------------- | --------------------- |
| Movies      | Watch a favorite movie you've seen           | Watch a new movie     |
| Advertising | Show the most effective ad so far            | Show a different ad   |
| Driving     | Try the fastest route given prior experience | Try a different route |

Example 1.

- The student does not know addition or subtraction.

- The AI tutor gives the student addition and subtraction practice problems.

- The student gets a reward of +1 for a correct answer and -1 for a wrong answer.

Model this process in [state, action, reward] form. Which policy optimizes the expected discounted sum of rewards?

 

Example 2.

- The student does not know addition or subtraction.

- The AI tutor gives the student addition and subtraction activities.

- The student gets a reward of +1 for a correct answer and -1 for a wrong answer.

Model this process in [state, action, reward] form. Is this the best way to model it?
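
As a concrete illustration, here is a rough sketch of one possible [state, action, reward] encoding of the tutor examples. It is only an assumption for illustration: the state names, the learning probability p_learn, and the toy dynamics in step() are not from the lecture.

```python
import random

# One assumed encoding of the AI-tutor example (not from the lecture).
STATES = ["knows_neither", "knows_add", "knows_sub", "knows_both"]  # student's skill level
ACTIONS = ["addition_problem", "subtraction_problem"]               # what the tutor gives

def step(state: str, action: str, p_learn: float = 0.3) -> tuple[str, int]:
    """Toy dynamics: the student answers correctly iff they know the practiced skill,
    gets reward +1/-1 as in the example, and may learn that skill with probability p_learn."""
    skill = "add" if "addition" in action else "sub"
    knows = state == "knows_both" or skill in state
    reward = 1 if knows else -1
    if not knows and random.random() < p_learn:
        state = "knows_both" if state != "knows_neither" else f"knows_{skill}"
    return state, reward

print(step("knows_neither", "addition_problem"))  # e.g. ('knows_add', -1)
```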

 

Markov Assumption - the foundation of decision making

- Information state : a sufficient statistic of the history

- State \(s_t\) is Markov <=> \(p(s_{t+1}| s_t, a_t) = p(s_{t+1}| h_t, a_t)\)

- Future is independent of past given present

 

Why is the Markov assumption popular?

1. Can always be satisfied

  - Setting state as history is always Markov: \(s_t = h_t \)

2. In practice often assume most recent observation is sufficient statistic of history: \(s_t = o_t\)

3. State representation has big implications for:

  - Computational complexity

  - Data required

  - Resulting performance

 

Q&A - If you can store the entire history, any decision-making process can be expressed as a Markov process.

 

Bandit - a simple example of an MDP (Markov Decision Process)

The current action does not affect what comes next -> No Delayed Reward
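
Below is a minimal sketch of that idea, assuming a 3-armed bandit with made-up arm means and an epsilon-greedy agent (none of these numbers come from the lecture): each pull yields only an immediate reward, and the exploration/exploitation trade-off shows up directly in the epsilon parameter.

```python
import random

# Assumed example: a 3-armed bandit with an epsilon-greedy agent.
true_means = [0.2, 0.5, 0.8]      # unknown expected reward of each arm (made up)
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]       # running estimate of each arm's mean reward
epsilon = 0.1                     # probability of exploring a random arm

for t in range(1000):
    if random.random() < epsilon:                                   # Exploration
        arm = random.randrange(len(true_means))
    else:                                                           # Exploitation
        arm = max(range(len(true_means)), key=lambda a: estimates[a])
    r = random.gauss(true_means[arm], 0.1)                          # immediate reward only
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]            # incremental mean update

print(estimates)  # should roughly recover true_means; the best arm is pulled most often
```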

 

RL Algorithm Components often include one or more of Model, Policy, Value Function

- Model : mathematical models of dynamics and reward

- Policy : function mapping the agent's state to an action

- Value Function : expected future rewards from being in a state and/or taking an action when following a particular policy

 

Example of Mars Rover Stochastic Markov Model

\(s_n\) is the state

\(\hat{r}\) is the reward

| \(s_1\) | \(s_2\) | \(s_3\) | \(s_4\) | \(s_5\) | \(s_6\) | \(s_7\) |
| --- | --- | --- | --- | --- | --- | --- |
| \(\hat{r}=1\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=10\) |

Part of agent's transition model:

  - \(0.5 = P(s_1|s_1, Right) = P(s_2|s_1, Right)\)

  - \(0.5 = P(s_2|s_2, Right) = P(s_3|s_2, Right)... \)

Model may be wrong
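
One way to picture this (just a sketch with an assumed dictionary representation, not lecture code) is to store the reward model and the transition-model fragment above as plain data; a "wrong" model is then simply a table whose probabilities or rewards do not match the real environment.

```python
# Assumed representation of the Mars-rover model fragment above.
r = {f"s{i}": 0.0 for i in range(1, 8)}        # reward model: r(s)
r["s1"], r["s7"] = 1.0, 10.0

P = {                                          # transition model: P[(s, a)] -> {s': prob}
    ("s1", "Right"): {"s1": 0.5, "s2": 0.5},   # 0.5 = P(s1|s1,Right) = P(s2|s1,Right)
    ("s2", "Right"): {"s2": 0.5, "s3": 0.5},   # 0.5 = P(s2|s2,Right) = P(s3|s2,Right) ...
}

def transition_prob(s: str, a: str, s_next: str) -> float:
    """P(s_next | s, a) under this (possibly wrong) model."""
    return P.get((s, a), {}).get(s_next, 0.0)

print(transition_prob("s1", "Right", "s2"))  # 0.5
```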

 

Policy \(\pi\) determines how the agent chooses actions

\(\pi : S \rightarrow A\), mapping from states to actions

Deterministic policy : \(\pi(s) =a \)

Stochastic policy : \(\pi(a|s) = Pr(a_t =a | s_t = s) \)
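
As a small sketch of these two definitions (the state names, action set, and probabilities below are assumptions for illustration):

```python
import random

ACTIONS = ["Left", "Right"]

# Deterministic policy pi(s) = a : a fixed lookup from state to action (arbitrary choice here).
DET_PI = {f"s{i}": ("Right" if i < 7 else "Left") for i in range(1, 8)}

def stochastic_pi(a: str, s: str) -> float:
    """pi(a|s) = Pr(a_t = a | s_t = s): here, Right with probability 0.9 in every state."""
    return 0.9 if a == "Right" else 0.1

def sample_action(s: str) -> str:
    """Draw an action from the stochastic policy at state s."""
    weights = [stochastic_pi(a, s) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

print(DET_PI["s4"], sample_action("s4"))
```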

 

Quick Question :

If the Rover is in \(s_4\) and \( \pi(s_1)=\pi(s_2)=...=\pi(s_7) = Right \),

then is this deterministic or stochastic policy?

 

Value Function \(V^\pi\) : expected discounted sum of future rewards under a particular policy \(\pi\)

\(V^\pi(s_t=s) = \mathbb{E}_\pi [r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+... |s_t = s] \)

Can be used to quantify goodness/badness of states and actions and decide how to act by comparing policies.
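
A sketch of one way to estimate this quantity by simulation, under assumptions: env_step(s, a) -> (s', r) and pi(s) -> a are hypothetical stand-ins for an environment and a policy, and the horizon truncates the infinite sum.

```python
def rollout_return(env_step, pi, s, gamma: float = 0.9, horizon: int = 100) -> float:
    """Discounted return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... of one rollout from s."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        s, r = env_step(s, a)
        G += discount * r
        discount *= gamma
    return G

def estimate_value(env_step, pi, s, gamma: float = 0.9, n: int = 1000) -> float:
    """Monte Carlo estimate of V^pi(s): the average discounted return over n rollouts."""
    return sum(rollout_return(env_step, pi, s, gamma) for _ in range(n)) / n
```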

 

 

Key Challenges in learning to make a sequence of good decisions

1. AI Planning (agent's internal computation)

  - Given model of how the world works : dynamics and reward model

  - Algorithm computes how to act in order to maximize expected reward : with no interaction with environment

2. Reinforcement learning

  - Agent doesn't know how world works

  - Interactions with world to implicitly/explicitly learn how world works

  - Agent improves policy (may involve planning)

 

Evaluation : Estimate/predict the expected rewards from following a given policy

Control : find the best policy (optimization)
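
For a known finite MDP stored in the dictionary form used in the rover sketch above, policy evaluation can be sketched as repeated Bellman backups (an assumed illustration, not the lecture's algorithm); control would then, for example, repeatedly improve the policy greedily against the resulting V.

```python
def evaluate_policy(states, P, r, pi, gamma=0.9, iters=1000):
    """Iteratively apply V(s) <- r(s) + gamma * sum_s' P(s'|s, pi[s]) * V(s').
    Assumes r is a dict r[s], pi is a dict pi[s] -> action, and P[(s, a)] maps
    next states to probabilities; states with no transition entry are treated as terminal."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: r[s] + gamma * sum(p * V[s2] for s2, p in P.get((s, pi[s]), {}).items())
             for s in states}
    return V
```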

 

Evaluation Example 

| \(s_1\) | \(s_2\) | \(s_3\) | \(s_4\) | \(s_5\) | \(s_6\) | \(s_7\) |
| --- | --- | --- | --- | --- | --- | --- |
| Right | Right | Right | Right | Right | Right | Right |

- \( \pi(s_1)=\pi(s_2)=...=\pi(s_7) = Right \)

- \(\gamma = 0\)

- What is the value of this policy?

 

Answer 

- First, Value Function \(V^{\pi}(s_t = s) = \mathbb {E}_{\pi}[r_t +\gamma r_{t+1} + \gamma^2 r_{t+2} +... | s_t =s ]\)

- Since \(\gamma = 0\), all future terms vanish, so \(V^{\pi}(s_t=s)= r(s)\).
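
For the Mars-rover rewards above, this gives \(V^{\pi}(s_1)=1\), \(V^{\pi}(s_2)=\dots=V^{\pi}(s_6)=0\), and \(V^{\pi}(s_7)=10\).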
