Merge branch 'master' of git://github.com/dennybritz/deeplearning-papernotes
mikami235 · Dec 7, 2017 · c131899
2 parents 60f2658 + 59005c4
commit c131899
Showing 4 changed files with 60 additions and 3 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
-.DS_Store
+.DS_Store
+.vscode
diff --git a/README.md b/README.md
@@ -138,7 +138,7 @@ Weakly-Supervised Classification and Localization of Common Thorax Diseases [[CV
 - Emergence of Locomotion Behaviours in Rich Environments [[arXiv](https://arxiv.org/abs/1707.02286)] [[article](https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/)]
 - Learning human behaviors from motion capture by adversarial imitation [[arXiv](https://arxiv.org/abs/1707.02201)] [[article](https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/)]
 - Robust Imitation of Diverse Behaviors [[arXiv](https://deepmind.com/documents/95/diverse_arxiv.pdf)] [[article](https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/)]
-- Hindsight Experience Replay [[arXiv](https://arxiv.org/abs/1707.01495)]
+- [Hindsight Experience Replay](notes/hindsight-ep.md) [[arXiv](https://arxiv.org/abs/1707.01495)]
 - Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks [[arXiv](https://arxiv.org/abs/1707.01836)] [[article](https://stanfordmlgroup.github.io/projects/ecg/)]
 - End-to-End Learning of Semantic Grasping [[arXiv](https://arxiv.org/abs/1707.01932)]
 - ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games [[arXiv](https://arxiv.org/abs/1707.01067)] [[code](https://github.com/facebookresearch/ELF)] [[article](https://code.facebook.com/posts/132985767285406/introducing-elf-an-extensive-lightweight-and-flexible-platform-for-game-research/)]
@@ -202,7 +202,7 @@ Weakly-Supervised Classification and Localization of Common Thorax Diseases [[CV
 - Learning to Skim Text [[arXiv](https://arxiv.org/abs/1704.06877)]
 - Get To The Point: Summarization with Pointer-Generator Networks [[arXiv](https://arxiv.org/abs/1704.04368)] [[code](https://github.com/abisee/pointer-generator)] [[article](http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html)]
 - Adversarial Neural Machine Translation [[arXiv](https://arxiv.org/abs/1704.06933)]
-- Deep Q-learning from Demonstrations [[arXiv](https://arxiv.org/abs/1704.03732)]
+- [Deep Q-learning from Demonstrations](notes/dqn-demonstrations.md) [[arXiv](https://arxiv.org/abs/1704.03732)]
 - Learning from Demonstrations for Real World Reinforcement Learning [[arXiv](https://arxiv.org/abs/1704.03732)]
 - DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks [[arXiv](https://arxiv.org/abs/1704.02470)] [[article](http://people.ee.ethz.ch/~ihnatova/)] [[code](https://github.com/aiff22/DPED)]
 - A Neural Representation of Sketch Drawings [[arXiv](https://arxiv.org/abs/1704.03477)] [[code](https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn)] [[article](https://research.googleblog.com/2017/04/teaching-machines-to-draw.html)]

diff --git a/notes/dqn-demonstrations.md b/notes/dqn-demonstrations.md
@@ -0,0 +1,27 @@
+## [Deep Q-learning from Demonstrations](https://arxiv.org/abs/1704.03732)
+
+TLDR; The authors combine the DQN algorithm with human demonstration data in a method called DQfD. They pre-train the agent on human demonstration data using a combination of four losses (supervised and TD losses). Once the agent starts interacting with the environment, both the human demonstration data and the agent's own transitions are kept in the same replay buffer. Transitions are sampled with prioritized experience replay. The algorithm learns much faster than most other DQN variants, does not need large amounts of demonstration data, and achieves new high scores on some of the games.
+
+
+#### Key Points
+
+- Most real-world problems don't have good (or any) simulators. But we often have some sample plays from human controllers.
+- Four losses are used when learning from transitions:
+    - 1-step Q-Learning loss
+    - n-step Q-Learning loss
+    - supervised large-margin classification loss (this loss is only added for the demonstration transitions)
+    - L2 regularization loss
+- Difference from Imitation Learning
+    - Imitation Learning uses a pure supervised loss. It can never exceed the performance of the human demonstrator
+    - DQfD continues learning on-line and can learn to become better than the human policy
+- Replay Buffer
+    - Agent and demonstration data are mixed in the same buffer
+    - Demonstration data is fixed in the buffer, never replaced by agent transitions. Extra probability is added to the demonstration data to encourage sampling it more often.
+- Human demonstration data ranges from 5k to 75k transitions depending on the game
+- Experiments show that the combination of the four losses is crucial: taking out the n-step returns or the supervised loss significantly degrades agent performance.
+- Very good performance, especially on games that require longer-term planning (human demonstrations are very useful here)
+
+#### Thoughts
+
+Very nice paper and good results. This is a relatively simple technique to bootstrap agents and speed up the learning process. I'm not really sure the experimental results are fair given the hyperparameter tuning and extra data, and there is no comparison to "better" techniques like Rainbow, A3C, Reactor, etc. The authors give good arguments for why they don't compare, but I still would've liked to see the difference in scores.
+
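Reviewer sketch for `notes/dqn-demonstrations.md` (not part of the commit): the four-loss combination listed under Key Points can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the weights `w_nstep`, `w_sup`, `w_l2`, and the `margin` value are hypothetical choices made for illustration, and the TD and L2 terms are assumed to be computed elsewhere and passed in as plain numbers.

```python
def large_margin_loss(q_values, expert_action, margin=0.8):
    """Supervised large-margin loss for one demonstration transition.

    Computes max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where l is 0 for the
    expert action a_E and `margin` otherwise. The margin value here is an
    assumption for illustration.
    """
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


def combined_loss(td_1step, td_nstep, l2_term, q_values, expert_action,
                  is_demo, w_nstep=1.0, w_sup=1.0, w_l2=1e-5):
    """Weighted sum of the four losses (weights are hypothetical).

    The supervised term is only added for demonstration transitions,
    as described in the notes.
    """
    supervised = large_margin_loss(q_values, expert_action) if is_demo else 0.0
    return td_1step + w_nstep * td_nstep + w_sup * supervised + w_l2 * l2_term
```

The margin term is zero only when the expert action beats every other action by at least `margin`, which pushes the Q-values of actions never seen in the demonstrations below the expert's action.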
diff --git a/notes/hindsight-ep.md b/notes/hindsight-ep.md
@@ -0,0 +1,29 @@
+## [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495)
+
+TLDR; The authors present a novel way to deal with sparse rewards in Reinforcement Learning. The key idea (called HER, or Hindsight Experience Replay) is that when an agent does not achieve the desired goal during an episode, it still has learned to achieve *some other* goal, which it can learn about and generalize from. This is done by framing the RL problem in a multi-goal setting, and adding transitions with different goals (and rewards) to the experience buffer. When updating the policy, the additional goals with positive rewards lead to faster learning. Note that this requires an off-policy RL algorithm (such as Q-Learning).
+
+#### Key Points
+
+- Proper reward shaping can be difficult. Thus, it is important to develop algorithms that can learn from sparse binary reward signals.
+- HER requires an off-policy Reinforcement Learning algorithm, for example DQN or DDPG.
+- Multi-Goal RL vs. "Standard RL"
+    - Policy depends on the goal
+    - Reward function depends on the goal
+    - Goal is sampled at the start of each episode
+- HER
+    - Assume that the goal is some *state* that the agent can achieve
+    - Needs a way to sample/generate a set of additional goals for an episode (hyperparameter)
+        - For example: The goal is the last state visited in the episode
+    - Store transitions with newly sampled goals (in addition to the original goal) in the replay buffer
+    - Induces a form of implicit curriculum as goals become more difficult
+        - Because the agent becomes better over time, the states it visits become "more difficult"
+- Experiments: Robot Arm simulation
+    - Clearly outperforms DDPG and DDPG with count-based exploration on binary rewards
+    - Works whether we care about a single or multiple goals
+    - Shows that shaped rewards may hinder exploration
+
+#### Notes/Questions
+
+- The idea that shaped rewards can hinder exploration is a good one; I really enjoyed that.
+- How does this approach relate to model-based learning? While there is no direct relationship, you learn to generalize across goals, and learning about the environment can have a similar effect.
+- Not really sold/convinced on the implicit curriculum learning. I see how it applies to some problems, but not to all. Just because an agent becomes better at achieving G, the states it visits are not necessarily more "difficult" to achieve. Maybe I'm missing something.
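
Reviewer sketch for `notes/hindsight-ep.md` (not part of the commit): the relabeling step described under "HER" could look roughly like this, using the "goal is the last state visited in the episode" example from the notes. It is a minimal sketch, not the paper's code. The `Transition` tuple, the function names, and the reward convention (1.0 when the goal is reached, 0.0 otherwise) are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical container for a goal-conditioned transition.
Transition = namedtuple("Transition", "state action next_state goal reward")


def sparse_reward(next_state, goal):
    """Binary reward (assumed convention): 1.0 if the goal is reached."""
    return 1.0 if next_state == goal else 0.0


def her_relabel(episode, original_goal):
    """Return transitions for the original goal plus a hindsight goal.

    `episode` is a list of (state, action, next_state) tuples. The hindsight
    goal is the last state visited in the episode.
    """
    hindsight_goal = episode[-1][2]
    goals = [original_goal]
    if hindsight_goal != original_goal:
        goals.append(hindsight_goal)

    transitions = []
    for state, action, next_state in episode:
        for goal in goals:
            reward = sparse_reward(next_state, goal)
            transitions.append(
                Transition(state, action, next_state, goal, reward)
            )
    return transitions
```

For an episode that never reaches the original goal, every original-goal transition has reward 0.0, while the final transition relabeled with the hindsight goal has reward 1.0. That extra signal is what an off-policy learner can then sample from the replay buffer.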