
[Reinforcement Learning] CS234 Lecture 1

by 에아오요이가야 2022. 4. 26.

https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=1 

 

Reinforcement learning = learning to make good sequences of decisions under uncertainty.

 

Reinforcement learning involves five categories:

1. Optimization - the goal is to find an optimal way of acting that yields the best outcome

2. Delayed Consequences - balancing the immediate benefit of a decision (action) against its longer-term benefit

3. Generalization

4. Learns from experience [Exploitation] - the benefit obtained from actions whose outcomes are already known

5. Exploration - the benefit obtained from random (previously untried) actions

 

 

Differences between reinforcement learning and other AI approaches

|  | AI Planning | Supervised Learning | Unsupervised Learning | Imitation Learning | Reinforcement Learning |
|---|---|---|---|---|---|
| Optimization | O |  |  | O | O |
| Delayed Consequences | O |  |  | O | O |
| Generalization | O | O | O | O | O |
| Exploitation |  | O | O | O | O |
| Exploration |  |  |  |  | O |

 

Learning Objectives

1. The key difference between RL and other ML approaches -> as the table above shows, it is Exploration

2. Define application problems as RL problems in the form [State, Action, Dynamics, Reward] and determine which algorithm is a good fit

3. Understand and apply criteria for evaluating RL algorithms, such as [Regret, Sample Complexity, Computational Complexity, Empirical Performance, Convergence]

4. Compare and analyze Exploration vs. Exploitation from the perspectives of [Performance, Scalability, Complexity of Implementation, Theoretical Guarantees]

 

|  | Exploitation | Exploration |
|---|---|---|
| Movies | Watch a favorite movie you've already seen | Watch a new movie |
| Advertising | Show the most effective ad so far | Show a different ad |
| Driving | Try the fastest route given prior experience | Try a different route |

Example 1.

- A student does not know addition or subtraction.

- An AI tutor gives the student addition and subtraction practice problems.

- The student receives a reward of +1 for a correct answer and -1 for an incorrect one.

Model this process in the form [state, action, reward]. Which policy optimizes the expected discounted sum of rewards?

 

Example 2.

- A student does not know addition or subtraction.

- An AI tutor gives the student addition and subtraction activities.

- The student receives a reward of +1 for a correct answer and -1 for an incorrect one.

Model this process in the form [state, action, reward]. Is this the best we can do?
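
Below is a minimal sketch of one possible [state, action, reward] formulation for the tutor examples above; the state encoding, the problem types, and the mastery-update rule are my own illustrative assumptions, not something given in the lecture.

```python
import random

# Hypothetical formulation (illustration only):
# state  = the tutor's estimate of the student's mastery of each skill
# action = which type of problem to pose
# reward = +1 if the student answers correctly, -1 otherwise

ACTIONS = ["give_addition_problem", "give_subtraction_problem"]

def step(state, action):
    """One interaction: pose a problem, observe correctness, return reward."""
    skill = "addition" if action == "give_addition_problem" else "subtraction"
    correct = random.random() < state[skill]              # chance of a correct answer
    reward = 1 if correct else -1                         # reward as specified above
    next_state = dict(state)
    next_state[skill] = min(1.0, state[skill] + 0.1)      # practice improves mastery (assumption)
    return next_state, reward

state = {"addition": 0.2, "subtraction": 0.2}             # initially low mastery
state, r = step(state, "give_addition_problem")
print(state, r)
```

Note that the reward depends only on correctness, so a reward-maximizing tutor could keep posing whichever problems the student already answers well; this is presumably what the "is this the best?" question is getting at.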

 

Markov Assumption - the foundation of decision making

- Information state : a sufficient statistic of the history

- A state s_t is Markov <=> p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)

- The future is independent of the past given the present

 

Why is the Markov assumption so popular?

1. Can always be satisfied

  - Setting the state to be the full history is always Markov: s_t = h_t

2. In practice, we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t

3. State representation has big implications for:

  - Computational complexity

  - Data required

  - Resulting performance

 

Q&A - If the entire history can be stored, any decision-making problem can be made Markov (by treating the history itself as the state).
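
As a small illustration of the point above (my own sketch, not from the lecture), here are the two state representations side by side: the full history as the state, which is always Markov but grows with time, versus only the most recent observation, which is cheap but may discard relevant information.

```python
from typing import List, Tuple

History = List[Tuple[str, float]]  # list of (observation, reward) pairs

def state_full_history(history: History) -> tuple:
    """s_t = h_t : the state is the entire history (always Markov)."""
    return tuple(history)

def state_last_observation(history: History) -> str:
    """s_t = o_t : the state is only the most recent observation."""
    return history[-1][0] if history else "start"

history: History = [("o1", 0.0), ("o2", 1.0), ("o3", 0.0)]
print(state_full_history(history))      # grows with t -> higher computational/data cost
print(state_last_observation(history))  # fixed size -> may lose information
```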

 

Bandit - a simple example of an MDP (Markov Decision Process)

The current action does not influence what the agent faces next -> no delayed rewards
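
To make the exploration/exploitation trade-off concrete, here is a minimal ε-greedy sketch for a Bernoulli bandit; the arm probabilities, ε, and the number of pulls are assumptions chosen purely for illustration.

```python
import random

# Epsilon-greedy agent for a 3-armed Bernoulli bandit.
# Each pull is an independent decision: the chosen arm does not change
# what the agent faces next, which is why bandits have no delayed consequences.

TRUE_PROBS = [0.3, 0.5, 0.7]      # unknown to the agent (assumed for the demo)
EPSILON = 0.1                     # fraction of pulls spent exploring at random

counts = [0] * len(TRUE_PROBS)    # pulls per arm
values = [0.0] * len(TRUE_PROBS)  # running mean reward per arm

for t in range(10_000):
    if random.random() < EPSILON:                              # explore: random arm
        arm = random.randrange(len(TRUE_PROBS))
    else:                                                      # exploit: best arm so far
        arm = max(range(len(TRUE_PROBS)), key=lambda a: values[a])
    reward = 1.0 if random.random() < TRUE_PROBS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]        # incremental mean update

print(counts)   # most pulls should concentrate on the best arm (index 2)
print(values)   # estimates should be close to TRUE_PROBS
```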

 

RL algorithm components often include one or more of: Model, Policy, Value Function

- Model : mathematical model of the dynamics and the reward

- Policy : function mapping the agent's state to an action

- Value Function : expected future rewards from being in a state and/or taking an action when following a particular policy

 

Example of a Mars Rover Stochastic Markov Model

s_n denotes a state

r denotes the reward received in that state

| s1 | s2 | s3 | s4 | s5 | s6 | s7 |
|---|---|---|---|---|---|---|
| r=1 | r=0 | r=0 | r=0 | r=0 | r=0 | r=10 |

Part of agent's transition model:

  - P(s1 | s1, Right) = P(s2 | s1, Right) = 0.5

  - P(s2 | s2, Right) = P(s3 | s2, Right) = 0.5 ...

Model may be wrong
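
Here is one way the rover's transition and reward model could be written down in code. Only the s1 and s2 rows are given above; extending the same stay-0.5/move-right-0.5 pattern to the remaining states is my assumption for the sake of a runnable sketch.

```python
import random

STATES = [f"s{i}" for i in range(1, 8)]
REWARD = {"s1": 1, "s7": 10}          # all other states give reward 0

# P[(state, action)] -> {next_state: probability}
P = {}
for i, s in enumerate(STATES):
    right = STATES[min(i + 1, len(STATES) - 1)]
    P[(s, "Right")] = {s: 0.5, right: 0.5} if right != s else {s: 1.0}

def sample_next_state(state: str, action: str) -> str:
    """Draw the next state from the (possibly wrong) transition model."""
    dist = P[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def reward(state: str) -> float:
    return REWARD.get(state, 0)

print(P[("s1", "Right")])                       # {'s1': 0.5, 's2': 0.5}
print(sample_next_state("s4", "Right"), reward("s7"))
```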

 

Policy π determines how the agent chooses actions

π : S → A, a mapping from states to actions

Deterministic policy : π(s) = a

Stochastic policy : π(a|s) = Pr(a_t = a | s_t = s)
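
A tiny sketch of the two kinds of policy for the rover states; the particular action set and probabilities are made up for illustration.

```python
import random

# Deterministic policy: a plain lookup table, pi(s) = a.
deterministic_pi = {f"s{i}": "Right" for i in range(1, 8)}

def act_deterministic(state: str) -> str:
    return deterministic_pi[state]

# Stochastic policy: a distribution over actions per state, pi(a|s) = Pr(a_t = a | s_t = s).
stochastic_pi = {f"s{i}": {"Left": 0.2, "Right": 0.8} for i in range(1, 8)}

def act_stochastic(state: str) -> str:
    dist = stochastic_pi[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(act_deterministic("s4"))  # always "Right"
print(act_stochastic("s4"))     # "Right" about 80% of the time
```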

 

Quick Question :

If the rover is in s4 and π(s1) = π(s2) = ... = π(s7) = Right,

then is this deterministic or stochastic policy?

 

Value Function V^π : the expected discounted sum of future rewards under a particular policy π

V^π(s_t = s) = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s]

It can be used to quantify the goodness/badness of states and actions, and to decide how to act by comparing policies.
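
As a sketch of what this definition means operationally, the value of a state can be estimated by averaging discounted returns over sampled rollouts. The policy here is "always Right", the transition pattern repeats the stay-0.5/move-right-0.5 rule from the rover example, and γ and the rollout count are assumptions for illustration.

```python
import random

STATES = [f"s{i}" for i in range(1, 8)]
REWARD = {"s1": 1, "s7": 10}
GAMMA = 0.9      # discount factor (assumed for the demo)
HORIZON = 50     # truncate rollouts; gamma**50 is already negligible

def next_state(state: str) -> str:
    """Under the policy pi(s) = Right: stay or move right, each with prob 0.5."""
    i = STATES.index(state)
    return random.choice([state, STATES[min(i + 1, len(STATES) - 1)]])

def value_estimate(start: str, n_rollouts: int = 5_000) -> float:
    """Monte Carlo estimate of V^pi(start) = E[r_t + gamma*r_{t+1} + ...]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, discount = start, 1.0
        for _ in range(HORIZON):
            total += discount * REWARD.get(s, 0)
            discount *= GAMMA
            s = next_state(s)
    return total / n_rollouts

print({s: round(value_estimate(s), 2) for s in STATES})
```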

 

 

Key Challenges in learning to make sequence of good decisions

1. AI Planning (agent's internal computation)

  - Given model of how the world works : dynamics and reward model

  - Algorithm computes how to act in order to maximize expected reward : with no interaction with environment

2. Reinforcement learning

  - Agent doesn't know how world works

  - Interactions with world to implicitly/explicitly learn how world works

  - Agent improves policy (may involve planning)

 

Evaluation : Estimate/predict the expected rewards from following a given policy

Control : find the best policy (optimization)

 

Evaluation Example 

| s1 | s2 | s3 | s4 | s5 | s6 | s7 |
|---|---|---|---|---|---|---|
| Right | Right | Right | Right | Right | Right | Right |

- π(s1) = π(s2) = ... = π(s7) = Right

- γ = 0

- What is the value of this policy?

 

Answer 

- First, the value function is V^π(s_t = s) = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s]

- With γ = 0, every term after r_t vanishes, so V^π(s_t = s) = r(s): the value is 1 in s1, 10 in s7, and 0 in every other state.
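
The same answer can be checked with exact policy evaluation: V^π solves the linear system V = r + γ P V, where P is the transition matrix under the "always Right" policy (built here from the stay-0.5/move-right-0.5 pattern, which is my assumption for the states not listed in the excerpt).

```python
import numpy as np

n = 7
r = np.zeros(n)
r[0], r[6] = 1.0, 10.0                  # r(s1) = 1, r(s7) = 10, 0 elsewhere

# Transition matrix under pi(s) = Right: stay / move right with prob 0.5 each;
# the last state s7 is absorbing.
P = np.zeros((n, n))
for i in range(n):
    j = min(i + 1, n - 1)
    P[i, i] += 0.5
    P[i, j] += 0.5

for gamma in (0.0, 0.9):
    V = np.linalg.solve(np.eye(n) - gamma * P, r)   # solve (I - gamma*P) V = r
    print(gamma, np.round(V, 2))

# With gamma = 0 the system reduces to V = r, matching the answer above:
# V = [1, 0, 0, 0, 0, 0, 10].
```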