https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLRQmQC3wIq9yxKVK1qc0r2nPuInn92LmK&index=1
Reinforcement learning = learn to make good sequences of decisions under uncertainty.
Reinforcement learning involves five key aspects:
1. Optimization - the goal is to find an optimal way of acting that yields the best outcome
2. Delayed Consequences - balancing the immediate benefit of a decision (action) against its longer-term benefit
3. Generalization
4. Learns from experience [Exploitation] - the benefit obtained from actions the agent already knows about
5. Exploration - the benefit obtained from random (previously untried) actions
Differences between reinforcement learning and other AI approaches
| | AI Planning | Supervised Learning | Unsupervised Learning | Imitation Learning | Reinforcement Learning |
|---|---|---|---|---|---|
| Optimization | O | | | O | O |
| Delayed Consequences | O | | | O | O |
| Generalization | O | O | O | O | O |
| Exploitation (learns from experience) | | O | O | O | O |
| Exploration | | | | | O |
Learning goals
1. The distinguishing feature of RL compared to other ML approaches -> exploration, as the table above summarizes
2. Formulate application problems as RL problems in [State, Action, Dynamics, Reward] form and determine which algorithm suits them
3. Understand and apply criteria for evaluating RL algorithms, such as [Regret, Sample Complexity, Computational Complexity, Empirical Performance, Convergence]
4. Compare and analyze exploration vs. exploitation in terms of [Performance, Scalability, Complexity of Implementation, Theoretical Guarantees]
| | Exploitation | Exploration |
|---|---|---|
| Movies | Watch a favorite movie you've seen | Watch a new movie |
| Advertising | Show the most effective ad so far | Show a different ad |
| Driving | Try the fastest route given prior experience | Try a different route |
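The exploit-or-explore choice in the table is often handled with an ε-greedy rule, a standard technique sketched below for illustration (not something taken from the lecture slides): exploit the best-known action most of the time, explore a random one with probability ε.

```python
import random

# epsilon-greedy action selection: with probability epsilon pick a
# random action (explore), otherwise pick the action with the highest
# estimated value (exploit).
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With ε = 0 this always exploits (pure "watch the favorite movie"); with ε = 1 it always explores.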
Example 1.
- The student does not know addition or subtraction.
- The AI tutor gives the student addition and subtraction practice problems.
- The student receives a reward of +1 for a correct answer and -1 for a wrong one.
Model this process in [state, action, reward] form. Which policy optimizes the expected discounted sum of rewards?
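One way to sketch Example 1 in [state, action, reward] form. Everything here - the state encoding, the success probabilities, the learning dynamic - is a made-up assumption for illustration, not part of the lecture:

```python
import random

# Hypothetical [state, action, reward] sketch of the AI-tutor process.
# State: (knows_add, knows_sub) -- whether the student has each skill.
# Action: which practice problem the tutor poses ("add" or "sub").
# Reward: +1 for a correct answer, -1 for a wrong one.
def step(state, action):
    knows_add, knows_sub = state
    knows = knows_add if action == "add" else knows_sub
    # Invented dynamics: a known skill is answered correctly 90% of the
    # time, an unknown one 20% of the time, and a correct answer marks
    # the skill as learned.
    correct = random.random() < (0.9 if knows else 0.2)
    reward = 1 if correct else -1
    if action == "add":
        state = (knows_add or correct, knows_sub)
    else:
        state = (knows_add, knows_sub or correct)
    return state, reward
```

A policy for the tutor maps the current state to the next problem to pose; the question then asks which such mapping maximizes the expected discounted sum of these rewards.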
Example 2.
- The student does not know addition or subtraction.
- The AI tutor gives the student addition and subtraction activities.
- The student receives a reward of +1 for a correct answer and -1 for a wrong one.
Model this process in [state, action, reward] form. Is this the best we can do?
Markov Assumption - the ancestor of modern decision making
- Information state: a sufficient statistic of the history
- State: the future is independent of the past given the present
Why is the Markov assumption so popular?
1. Can always be satisfied
- Setting the state to be the full history is always Markov: s_t = h_t
2. In practice we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t
3. State representation has big implications for:
- Computational complexity
- Data required
- Resulting performance
Q&A - any decision-making process can be represented as a Markov chain if the full history can be stored.
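The Q&A point can be made concrete: if the state is defined as the tuple of everything observed so far, the next state depends only on the current state plus the new observation, so the process is Markov by construction. A toy sketch:

```python
# Toy sketch: define the state as the full history of observations.
# The next state is a function of (current state, new observation)
# alone, so the process is trivially Markov (s_t = h_t).
def update_state(history, observation):
    return history + (observation,)

state = ()
for obs in [1, 0, 1]:
    state = update_state(state, obs)
# state now holds the full history of observations
```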
Bandit - a simple example of an MDP (Markov Decision Process)
The current action does not affect future decisions -> no delayed rewards
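A bandit can be sketched as a one-state process: pulling an arm yields a reward but never changes the situation. The payout probabilities below are invented for illustration:

```python
import random

# Minimal two-armed bandit sketch (payout probabilities are made up).
# Pulling an arm yields reward 1 with that arm's probability, else 0;
# nothing about the problem changes afterwards -- no delayed rewards.
ARM_PROBS = [0.3, 0.7]

def pull(arm, rng):
    return 1 if rng.random() < ARM_PROBS[arm] else 0

rng = random.Random(42)
total = sum(pull(1, rng) for _ in range(1000))  # repeatedly pull arm 1
```

Over 1000 pulls of arm 1 the total reward lands near 700, reflecting its 0.7 payout probability.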
RL algorithm components often include one or more of: model, policy, value function.
- Model: mathematical model of the dynamics and reward
- Policy: function mapping the agent's state to an action
- Value function: expected future rewards from being in a state and/or taking an action when following a particular policy
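The three components can be sketched for a toy two-state chain (all numbers hypothetical): the model is the transition/reward tables, the policy maps states to actions, and the value function is computed from the model and policy by iterating the Bellman expectation backup.

```python
# Toy two-state chain illustrating the three RL components
# (all numbers hypothetical, not from the lecture).

# Model: transition probabilities P[s][a] -> {next_state: prob}
# and rewards R[s][a].
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {1: 1.0}}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 1.0, "go": 1.0}}

# Policy: a function mapping the agent's state to an action.
def policy(s):
    return "go"

# Value function: expected discounted future reward of following the
# policy, computed by repeatedly applying the Bellman expectation backup.
def evaluate(pi, gamma=0.9, iters=100):
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: R[s][pi(s)] + gamma * sum(p * V[s2]
             for s2, p in P[s][pi(s)].items()) for s in P}
    return V
```

Here `evaluate(policy)` converges to V(1) ≈ 10 and V(0) ≈ 9, since the agent reaches the rewarding state in one step and then collects reward 1 forever, discounted by γ = 0.9.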
Example: Mars Rover stochastic Markov model
- Only part of the agent's transition model is given.
- The model may be wrong.
Policy
- Deterministic policy: π(s) = a
- Stochastic policy: π(a|s) = Pr(a_t = a | s_t = s)
Quick question: if the rover always selects the same action whenever it is in a given state, is the policy deterministic or stochastic?
Value Function
Can be used to quantify how good or bad states and actions are, and to decide how to act by comparing policies.
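Concretely, the value of a state under a policy π is the expected discounted sum of future rewards (the standard definition, matching the expected discounted sum of rewards mentioned in Example 1):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s \right]
```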
Key challenges in learning to make sequences of good decisions
1. AI Planning (agent's internal computation)
- Given model of how the world works : dynamics and reward model
- Algorithm computes how to act in order to maximize expected reward : with no interaction with environment
2. Reinforcement learning
- Agent doesn't know how world works
- Interactions with world to implicitly/explicitly learn how world works
- Agent improves policy (may involve planning)
Evaluation: estimate/predict the expected rewards from following a given policy
Control: find the best policy (optimization)
Evaluation Example
- Policy: π(s) = Right for every state (the same action, Right, in all seven states)
- What is the value of this policy?
Answer
- First, write down the value function: the expected discounted sum of rewards from each state when following the policy.
- Then evaluate it for each state.
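The evaluation above can be sketched in code. The reward numbers (+1 in the leftmost state, +10 in the rightmost, 0 elsewhere) and the discount γ = 0 are my reading of the lecture's Mars rover example and should be treated as assumptions; with γ = 0 only the immediate reward matters.

```python
# Policy evaluation sketch for the "always Right" Mars rover policy.
# Rewards (+1 leftmost, +10 rightmost, 0 elsewhere) and gamma = 0 are
# assumptions based on the lecture's example.
REWARDS = [1, 0, 0, 0, 0, 0, 10]

def value_always_right(rewards, gamma, horizon=20):
    n = len(rewards)
    V = [0.0] * n
    for _ in range(horizon):
        # Bellman backup for the deterministic "always Right" policy;
        # the rightmost state transitions to itself.
        V = [rewards[s] + gamma * V[min(s + 1, n - 1)] for s in range(n)]
    return V

V = value_always_right(REWARDS, gamma=0)
```

With γ = 0 each state's value equals its immediate reward, so V = [1, 0, 0, 0, 0, 0, 10].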