https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=1
Reinforcement learning = learn to make good sequences of decisions under uncertainty.
Reinforcement learning involves five key aspects:
1. Optimization - the goal is to find an optimal way of making decisions that yields the best outcome
2. Delayed Consequences - balancing the immediate benefit of a decision (action) against its longer-term benefit
3. Generalization
4. Learns from experience [Exploitation] - the benefit obtained from actions the agent already knows about
5. Exploration - the benefit obtained from trying random (previously untried) actions
How reinforcement learning differs from other AI and ML approaches
| | AI Planning | Supervised Learning | Unsupervised Learning | Imitation Learning | Reinforcement Learning |
|---|---|---|---|---|---|
| Optimization | O | | | O | O |
| Delayed Consequences | O | | | O | O |
| Generalization | O | O | O | O | O |
| Exploitation | | O | O | O | O |
| Exploration | | | | | O |
Learning objectives
1. The characteristic difference between RL and other ML -> Exploration, as the table above summarizes
2. Formulate application problems as RL problems in terms of [State, Action, Dynamics, Reward] and determine which algorithm is best suited to them
3. Understand and apply measures for evaluating RL algorithms, such as [Regret, Sample Complexity, Computational Complexity, Empirical Performance, Convergence]
4. Compare and contrast exploration and exploitation in terms of [Performance, Scalability, Complexity of Implementation, Theoretical Guarantees]
| | Exploitation | Exploration |
|---|---|---|
| Movie | Watch a favorite movie you've seen | Watch a new movie |
| Advertising | Show the most effective ad so far | Show a different ad |
| Driving | Try the fastest route given prior experience | Try a different route |
Example 1.
- The student does not know addition or subtraction.
- An AI tutor agent gives the student addition or subtraction practice problems.
- The reward is +1 if the student answers correctly and -1 if not.
Model this process in terms of [state, action, reward]. Which policy optimizes the expected discounted sum of rewards? (One possible formulation is sketched after Example 2.)
Example 2.
- The student does not know addition or subtraction.
- An AI tutor agent gives the student addition or subtraction activities.
- The reward is +1 if the student answers correctly and -1 if not.
Model this process in terms of [state, action, reward]. Is this the best approach?
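To make the [state, action, reward] framing concrete, here is a minimal sketch of one possible formulation. The state encoding, the skill-improvement dynamics, and all names (`StudentState`, `step`, the 0.05 increment) are illustrative assumptions, not the lecture's intended answer.

```python
# A toy [state, action, reward] formulation of the AI-tutor example.
# Everything here (state encoding, learning dynamics, numbers) is an
# illustrative assumption, not the lecture's answer.
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StudentState:
    p_addition: float      # chance the student answers an addition problem correctly
    p_subtraction: float   # chance the student answers a subtraction problem correctly


ACTIONS = ["give_addition", "give_subtraction"]


def step(state: StudentState, action: str) -> Tuple[StudentState, int]:
    """One tutoring interaction: the student answers, the tutor receives +1/-1,
    and in this toy model the student improves slightly at the practiced skill."""
    if action == "give_addition":
        correct = random.random() < state.p_addition
        state.p_addition = min(1.0, state.p_addition + 0.05)
    else:
        correct = random.random() < state.p_subtraction
        state.p_subtraction = min(1.0, state.p_subtraction + 0.05)
    reward = 1 if correct else -1
    return state, reward
```

Note that under this reward a tutor that keeps giving the problem type the student is already best at collects high reward without the student ever mastering both skills, which is presumably the issue Example 2's question is hinting at.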
Markov Assumption - the foundation of decision making
- Information state : sufficient statistic of the history
- State \(s_t\) is Markov <=> \(p(s_{t+1}| s_t, a_t) = p(s_{t+1}| h_t, a_t)\)
- Future is independent of past given present
Why is the Markov assumption so popular?
1. Can always be satisfied
- Setting state as history is always Markov: \(s_t = h_t \)
2. In practice often assume most recent observation is sufficient statistic of history: \(s_t = o_t\)
3. State representation has big implications for:
- Computational complexity
- Data required
- Resulting performance
Q&A - Any decision-making problem can be represented as a Markov process, as long as the entire history can be stored as the state.
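A tiny sketch contrasting the two state representations discussed above (the names and the toy history format are assumptions for illustration, not from the lecture):

```python
# Two state representations for the same observation history.
from typing import List, Tuple

# A history is a list of (action, observation) pairs, e.g. [("Right", 3), ("Right", 4)].
History = List[Tuple[str, int]]


def history_state(history: History) -> Tuple[Tuple[str, int], ...]:
    """s_t = h_t: the whole history as the state. Always Markov by construction,
    but the state space grows with every time step."""
    return tuple(history)


def last_observation_state(history: History) -> Tuple[str, int]:
    """s_t = o_t: only the most recent observation. Much cheaper, but Markov only
    if the latest observation really is a sufficient statistic of the history."""
    return history[-1]
```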
Bandit - a simple example of an MDP (Markov Decision Process)
The current action does not affect what happens next -> no delayed rewards
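As a concrete illustration, here is a minimal two-armed bandit sketch (the two arms and their success probabilities are made up, not from the lecture): the reward depends only on the arm pulled right now, and nothing carries over between pulls.

```python
# A minimal two-armed bandit: the reward is drawn i.i.d. given the chosen arm,
# and pulling an arm changes nothing about future pulls -- no delayed consequences.
import random

ARM_SUCCESS_PROB = [0.3, 0.7]   # assumed success probabilities for illustration


def pull(arm: int) -> int:
    """Return a Bernoulli reward for the chosen arm."""
    return 1 if random.random() < ARM_SUCCESS_PROB[arm] else 0
```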
RL algorithm components often include one or more of: Model, Policy, Value Function
- Model : Mathematical model of the dynamics and reward
- Policy : Function mapping agent's state to action
- Value Function : Future rewards from being in a state and/or taking an action when following a particular policy
Example of Mars Rover Stochastic Markov Model
\(s_n\) is state
\(\hat {r}\) is reward
| \(s_1\) | \(s_2\) | \(s_3\) | \(s_4\) | \(s_5\) | \(s_6\) | \(s_7\) |
|---|---|---|---|---|---|---|
| \(\hat{r}=1\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=0\) | \(\hat{r}=10\) |
Part of agent's transition model:
- \(0.5 = P(s_1|s_1, Right) = P(s_2|s_1, Right)\)
- \(0.5 = P(s_2|s_2, Right) = P(s_3|s_2, Right)... \)
Model may be wrong
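One way to write this part of the transition model down in code, as a sketch: only the probabilities shown above come from the lecture; the lookup-table encoding is just one possible choice.

```python
# Part of the rover's transition model as a lookup table:
# P[(state, action)] -> {next_state: probability}.
P = {
    ("s1", "Right"): {"s1": 0.5, "s2": 0.5},
    ("s2", "Right"): {"s2": 0.5, "s3": 0.5},
    # ... the remaining states would follow the same pattern
}


def next_state_distribution(state: str, action: str) -> dict:
    """Return the distribution over next states for a (state, action) pair."""
    return P[(state, action)]
```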
Policy \(\pi\) determines how the agent chooses actions
\(\pi : S \rightarrow A\), mapping from states to actions
Deterministic policy : \(\pi(s) =a \)
Stochastic policy : \(\pi(a|s) = Pr(a_t =a | s_t = s) \)
Quick Question :
If the Rover is in \(s_4\) and \( \pi(s_1)=\pi(s_2)=...=\pi(s_7) = Right \),
then is this deterministic or stochastic policy?
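A minimal sketch of both policy types in code (the action set and the stochastic policy's 30/70 probabilities are assumptions for illustration):

```python
# Deterministic vs. stochastic policies over the rover's two actions.
import random

ACTIONS = ["Left", "Right"]


def deterministic_policy(state: str) -> str:
    """pi(s) = Right for every state -- the policy in the quick question above."""
    return "Right"


def stochastic_policy(state: str) -> str:
    """pi(a|s): sample an action from a probability distribution over actions."""
    return random.choices(ACTIONS, weights=[0.3, 0.7])[0]
```

Since the quick question's policy assigns exactly one action to every state, it is deterministic; the state the rover happens to occupy does not change that.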
Value Function \(V^\pi\) : expected discounted sum of future rewards under a particular policy \(\pi\)
\(V^\pi(s_t=s) = \mathbb{E}_\pi [r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+... |s_t = s] \)
Can be used to quantify goodness/badness of states and actions and decide how to act by comparing policies.
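As a small worked sketch, the sum inside the expectation can be computed from a single observed reward sequence as below; \(V^\pi\) is then the expectation of this quantity over rollouts that follow \(\pi\). The example rewards are made up.

```python
# Discounted return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# computed from one finite reward sequence by backward accumulation.
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([0, 0, 10], gamma=0.5))  # 0 + 0.5*0 + 0.25*10 = 2.5
```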
Key challenges in learning to make sequences of good decisions
1. AI Planning (agent's internal computation)
- Given model of how the world works : dynamics and reward model
- Algorithm computes how to act in order to maximize expected reward : with no interaction with environment
2. Reinforcement learning
- Agent doesn't know how world works
- Interactions with world to implicitly/explicitly learn how world works
- Agent improves policy (may involve planning)
Evaluation : Estimate/predict the expected rewards from following a given policy
Control : Find the best policy (optimization)
Evaluation Example
| \(s_1\) | \(s_2\) | \(s_3\) | \(s_4\) | \(s_5\) | \(s_6\) | \(s_7\) |
|---|---|---|---|---|---|---|
| Right | Right | Right | Right | Right | Right | Right |
- \( \pi(s_1)=\pi(s_2)=...=\pi(s_7) = Right \)
- \(\gamma = 0\)
- What is the value of this policy?
Answer
- Recall the value function: \(V^{\pi}(s_t = s) = \mathbb{E}_{\pi}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... | s_t = s]\)
- With \(\gamma = 0\), every term beyond the immediate reward vanishes, so \(V^{\pi}(s_t = s) = r(s)\): the value is 1 in \(s_1\), 10 in \(s_7\), and 0 everywhere else.
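A short numeric check, as a sketch: the code below runs iterative policy evaluation on the rover, extending the 0.5/0.5 transition pattern from above to all states (treating \(s_7\) as absorbing is an assumption, since the notes do not specify it). With \(\gamma = 0\) the dynamics drop out entirely and the result is just \(V^\pi(s) = r(s)\).

```python
# Iterative policy evaluation for the always-Right rover policy:
# V <- r + gamma * P V, repeated many times.
# The transition matrix extends the 0.5/0.5 pattern from the notes;
# making s7 absorbing is an assumption.
import numpy as np

rewards = np.array([1, 0, 0, 0, 0, 0, 10], dtype=float)
n = len(rewards)

P = np.zeros((n, n))
for s in range(n - 1):
    P[s, s] = 0.5       # stay put
    P[s, s + 1] = 0.5   # move one state to the right
P[n - 1, n - 1] = 1.0   # assumed: s7 is absorbing


def evaluate_policy(gamma: float, n_iters: int = 500) -> np.ndarray:
    V = np.zeros(n)
    for _ in range(n_iters):
        V = rewards + gamma * P @ V
    return V


print(evaluate_policy(gamma=0.0))  # [ 1.  0.  0.  0.  0.  0. 10.]  ->  V(s) = r(s)
```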