Deep Reinforcement Learning

RL limitations

Model-free algorithms are sample inefficient.
They require a lot of samples (sometimes millions of interactions) to learn something useful.
That’s why most of the successes in RL were achieved on games or in simulation only.
So, in general, getting better performance means running training for a very large number of timesteps.
Reward engineering (i.e. reward shaping) is also required.
One last limitation of RL is the instability of training: performance can drop sharply in the middle of training.
- This behavior is particularly present in DDPG, which is why its extension TD3 tries to tackle that issue.
- Other methods, like TRPO or PPO, make use of a trust region to minimize that problem by avoiding updates that are too large.
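As a rough illustration of "avoiding too large an update", here is a minimal sketch of a PPO-style clipped surrogate objective for a single sample. PPO approximates a trust region by clipping the probability ratio rather than enforcing a hard constraint as TRPO does. The function name, the argument names, and the default `clip_eps=0.2` are assumptions made for this sketch and are not taken from any particular library.

```python
import math

def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    new_logp / old_logp: log-probability of the action under the new / old policy.
    advantage: estimated advantage of the action.
    clip_eps: assumed clipping range (hypothetical default for this sketch).
    """
    # Probability ratio between the new and the old policy.
    ratio = math.exp(new_logp - old_logp)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the policy far from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`, so a single update gains nothing from pushing the policy further away.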

Box: An N-dimensional box that contains every point in the action space.
Discrete: A list of possible actions, where each timestep only one of the actions can be used.
MultiDiscrete: A list of possible actions, where each timestep only one action of each discrete set can be used.
MultiBinary: A list of possible actions, where each timestep any of the actions can be used in any combination.
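To make the four space types concrete, the sketch below draws one random action of each kind using only the Python standard library. It does not use the actual Gym API; the helper names (`sample_box`, `sample_discrete`, and so on) are hypothetical and exist only to show the shape of an action in each space.

```python
import random

def sample_box(low, high):
    """Box: one real value per dimension, each within [low[i], high[i]]."""
    return [random.uniform(lo, hi) for lo, hi in zip(low, high)]

def sample_discrete(n):
    """Discrete: exactly one action index out of n possible actions."""
    return random.randrange(n)

def sample_multi_discrete(nvec):
    """MultiDiscrete: one action index from each discrete set."""
    return [random.randrange(n) for n in nvec]

def sample_multi_binary(n):
    """MultiBinary: each of the n actions is independently off (0) or on (1)."""
    return [random.randint(0, 1) for _ in range(n)]

# Example shapes (values are random):
box_action = sample_box([-1.0, -1.0], [1.0, 1.0])   # two floats in [-1, 1]
discrete_action = sample_discrete(4)                # one int in 0..3
multi_discrete_action = sample_multi_discrete([3, 2])  # [int in 0..2, int in 0..1]
multi_binary_action = sample_multi_binary(5)        # five 0/1 flags
```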

Discrete, MultiDiscrete, Binary, MultiBinary

Discrete Actions - Single Process
- DQN: usually slower to train (in wall-clock time), but the most sample efficient of the algorithms listed here, thanks to its replay buffer.
Discrete Actions - Multiprocessed
- PPO
- A2C