This is a deep reinforcement learning project that was made as part of a master's thesis. It implements the following deep reinforcement learning algorithms: Deep Q Network (DQN), Double Deep Q Network (DDQN), Dueling Double Deep Q Network (DDDQN) and Asynchronous Advantage Actor-Critic (A3C).
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
To run this project, the following requirements need to be installed:
- Python 3.4
- Keras
- Tensorflow-GPU (the CPU version can be installed and will work, but training will be too slow to reach good results in a reasonable time)
- scikit-image
- Numpy
- OpenAI Gym (with support for Atari games)
In order to install the GPU version of Tensorflow run:
sudo pip3 install tensorflow-gpu
For more information about the installation procedure read the official Tensorflow documentation available at https://www.tensorflow.org/install/.
In order to install Keras run:
sudo pip3 install keras
In order to install scikit-image run:
sudo pip3 install scikit-image
In order to install OpenAI Gym with Atari games run:
git clone https://github.com/openai/gym.git
cd gym
sudo pip3 install -e .
sudo pip3 install -e '.[atari]' # make sure you have cmake installed
For more information read the official documentation available at https://github.com/openai/gym.
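After installation, it can be handy to confirm that every requirement is importable before starting a long training run. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the list uses the import names of the packages above (scikit-image is imported as `skimage`).

```python
import importlib.util

# Import names of the packages listed in the requirements.
# Note that scikit-image is imported as "skimage".
REQUIRED_MODULES = ["tensorflow", "keras", "skimage", "numpy", "gym"]


def find_missing(modules):
    """Return the module names that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that the packages can be located; it does not verify that Tensorflow can actually see a GPU or that the Atari environments are installed.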
The next two sections explain how to run the value based and the policy based algorithms.
This section explains how to set up the value based algorithms (DQN, DDQN, DDDQN).
Value based algorithms can be trained in one of two ways. The first way is to use the 'config.json' file, which stores all the configuration data described below. The second way is to set the configuration data programmatically inside the AgentRunner.py script.
The following configuration data can be set:
- start_eps: starting epsilon value for epsilon greedy exploration strategy (original papers recommend 1.0)
- end_eps: final epsilon value for epsilon greedy exploration strategy (original papers recommend 0.1)
- observing_frames: number of frames to observe without any learning done
- exploring_frames: number of frames to perform learning
- replay_memory_size: size of the experience buffer (must be <= observing_frames) (30000 should be enough but the bigger the better)
- replay_batch_size: number of experiences to consider in one train batch (original papers recommend 32)
- learning_rate: learning rate of the Adam optimizer (recommended value is 1e-4)
- log_freq: frequency of testing the agent (log_freq=10 means that the agent will be tested every 10 learning episodes - testing is done by letting the agent play 5 consecutive games, and the average reward and episode length are recorded)
- saving_freq: frequency of saving model parameters
- saving_dir: directory in which logs and models are stored
- img_width: width of input image (original papers recommend 84)
- img_height: height of input image (original papers recommend 84)
- num_consecutive_frames: number of consecutive frames stacked to form one input to the neural network (num_consecutive_frames=3 means that the last 3 frames are used as the state representation, so the input to the neural network is WxHx3) (original papers recommend 4)
- max_ep_length: maximum episode length
- game_name: name of the game to learn (only games that give image as a state representation are supported)
- gamma: discount factor applied to future rewards
- update_freq: frequency at which to update target network (used in DDQN and DDDQN algorithms)
- log_filename: path of the file where the log is saved
- MemoryType: which memory to use (supported values are ExperienceReplayMemory, MemoryPrioritizedForgetting and PrioritizedExperienceReplayMemory)
- PEREps: epsilon parameter in prioritized experience replay memory
- PERAlfa: alpha parameter in prioritized experience replay memory
- ExplorationStrategy: which exploration strategy to use (supported values are EpsilonGreedyExplorationStrategy and BoltzmannExplorationStrategy)
- tau: parameter that describes how fast the target network updates its values towards the primary network (the parameters of the target network $\theta_{t}$ are updated towards the parameters of the primary network $\theta_{p}$ like this: $\theta_{t} = \tau \theta_{p} + (1-\tau) \theta_{t}$)
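Two of the parameters above correspond to short formulas that are easy to show in code. The sketch below is illustrative only and is not taken from the repository: the function names are hypothetical, and the epsilon schedule is assumed to be linear from start_eps to end_eps over exploring_frames, starting once the observing_frames have passed. The soft update follows the tau formula given in the list.

```python
import numpy as np


def linear_epsilon(frame, start_eps, end_eps, observing_frames, exploring_frames):
    """Epsilon for a given frame index under an ASSUMED linear schedule."""
    if frame < observing_frames:
        # Still observing: no learning yet, so keep the starting epsilon.
        return start_eps
    # Fraction of the annealing period that has elapsed, capped at 1.
    progress = min(1.0, (frame - observing_frames) / exploring_frames)
    return start_eps + progress * (end_eps - start_eps)


def soft_update(target_weights, primary_weights, tau):
    """Return tau * theta_p + (1 - tau) * theta_t for each pair of weight arrays."""
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_weights, primary_weights)]
```

With Keras models, the weight lists could be obtained with `get_weights()` and written back with `set_weights()`. A tau of 1.0 copies the primary network outright, while small values make the target network change slowly.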
After the configuration has been set, either in the AgentRunner.py script or in the 'config.json' file, training is started by running the AgentRunner.py script like this:
python3 AgentRunner.py
Warning: if parameters are set both in 'config.json' and inside the AgentRunner.py script, the values defined in the script will be used.
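The precedence rule in the warning can be pictured as a dictionary merge in which the script values are applied last. This is a sketch under stated assumptions, not the code in AgentRunner.py: the function name `load_config` is hypothetical, and 'config.json' is assumed to be a flat JSON object whose keys are the parameter names listed above.

```python
import json
import os


def load_config(path, script_overrides):
    """Merge settings from a JSON file with values set in the script.

    Values in script_overrides win over values read from the file.
    """
    file_config = {}
    if os.path.exists(path):
        with open(path) as f:
            file_config = json.load(f)
    merged = dict(file_config)
    # Applied last, so entries defined in the script take precedence.
    merged.update(script_overrides)
    return merged
```

For example, if the file sets gamma to 0.9 and the script sets gamma to 0.99, the merged configuration contains 0.99, while keys that appear only in the file are kept unchanged.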
To see how the agent plays the game, run the TestAgent.py script and give it the path to the model and the name of the game you want it to play. For instance, to load the model model_episode2300.h5 with the game BreakoutDeterministic-v4, start it like this:
python3 TestAgent.py model_episode2300.h5 BreakoutDeterministic-v4
Models will be saved into the saving_dir. There will also be a Tensorboard record in the Tensorboard folder that tracks the loss, the average episode length and the average episode reward.
To start Tensorboard, run:
cd <__saving_dir__>
cd Tensorboard
tensorboard --logdir='Tensorboard':Tensorboard
This section explains how to set up the policy based algorithm (A3C).
Since this version uses pure Tensorflow instead of Keras and is asynchronous, it is not integrated into the framework that was made for the value based algorithms.
There are two versions of the A3C algorithm implemented in this repository: one with an LSTM layer and one without it.
The A3C algorithm is located in the Asynchronous folder. Its parameters are configured inside the A3C.py script (LSTM version) or inside the A3C_no_lstm.py script (version without LSTM).
Configurable parameters are:
- IMG_WIDTH: width of input image
- IMG_HEIGHT: height of input image
- CNT_FRAMES: number of consecutive frames to form the state of the environment (this parameter is not available in the LSTM version)
- GLOBAL_SCOPE: name of the global scope
- VALUE_MODIFIER: scaling factor for the value loss
- POLICY_MODIFIER: scaling factor for the policy loss
- ENTROPY_MODIFIER: scaling factor for the entropy loss
- MAX_STEPS: how many steps to take into account before making an update
- DISCOUNT: discount factor applied to future rewards
- ENV_NAME: name of the game to learn
- MAX_EP_LENGTH: maximum length of episode (feel free to set it to some big number)
- LEARNING_RATE: learning rate of the Adam optimizer
- CLIP_VALUE: gradient clipping value (since this algorithm uses n-step returns there is a greater possibility of exploding gradients)
- SAVE_DIR: directory in which logs and models are stored
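MAX_STEPS and DISCOUNT together determine the n-step returns mentioned under CLIP_VALUE. The sketch below shows the standard way such returns are computed for a rollout; it is an illustration and not the code in A3C.py, and the function name `n_step_returns` and its arguments are hypothetical.

```python
import numpy as np


def n_step_returns(rewards, bootstrap_value, discount):
    """Discounted returns for one rollout of at most MAX_STEPS rewards.

    bootstrap_value is the critic's value estimate for the state reached
    after the last step, or 0.0 if the episode ended there.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value
    # Walk backwards so each return reuses the one computed after it.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + discount * running
        returns[i] = running
    return returns
```

Subtracting the critic's value estimates from these returns gives the advantages that weight the policy loss. Because each return sums up to MAX_STEPS rewards plus a bootstrap term, the targets can become large, which is why gradient clipping is useful here.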
In order to start training of the LSTM version of the A3C algorithm you just need to run:
python3 A3C.py
In order to test the LSTM version of the A3C algorithm you just need to run:
python3 A3C_test.py <model_path> <should_render> # should_render is a y/n character that indicates whether the game is rendered
Testing is performed by playing the game NUM_GAMES times. NUM_GAMES can be changed in A3C_test.py. IMG_WIDTH, IMG_HEIGHT, ENV_NAME and CNT_FRAMES can be configured as well. Make sure to use the same IMG_WIDTH, IMG_HEIGHT and CNT_FRAMES as during training in order to avoid errors when loading the model.
In order to check Tensorboard output you can start start_tensorboard.sh script once inside the Tensorboard directory by running:
. start_tensorboard.sh
The start_tensorboard.sh script needs to be copied to the Tensorboard directory first in order to make this work.
In order to start training of the A3C algorithm version without the LSTM layer you just need to run:
python3 A3C_no_lstm.py
In order to test the version of the A3C algorithm without the LSTM layer you just need to run:
python3 A3C_no_lstm_test.py <model_path> <should_render> # should_render is a y/n character that indicates whether the game is rendered
Testing is performed by playing the game NUM_GAMES times. NUM_GAMES can be changed in A3C_no_lstm_test.py. IMG_WIDTH, IMG_HEIGHT, ENV_NAME and CNT_FRAMES can be configured as well. Make sure to use the same IMG_WIDTH, IMG_HEIGHT and CNT_FRAMES as during training in order to avoid errors when loading the model.
In order to check Tensorboard output you can start start_tensorboard.sh script once inside the Tensorboard directory by running:
. start_tensorboard.sh
The start_tensorboard.sh script needs to be copied to the Tensorboard directory first in order to make this work.
This project is licensed under the MIT License - see the LICENSE.md file for details.
- Big thanks to Arthur Juliani for his great series of posts about reinforcement learning, Simple Reinforcement Learning with Tensorflow, available at https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0.
- Big thanks to Chris aka cgnicholls whose post helped me gain more insight into the maths behind the A3C algorithm. His post can be found at https://cgnicholls.github.io/reinforcement-learning/2016/08/20/reinforcement-learning.html.
- Big thanks to Jaromír Janisch for a great tutorial related to Prioritized Experience Replay available at https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/.
This section lists the most important papers used for implementing the algorithms in this repository.
- Playing Atari with Deep Reinforcement Learning, https://arxiv.org/abs/1312.5602
- Prioritized Experience Replay, https://arxiv.org/abs/1511.05952
- Deep Reinforcement Learning with Double Q-learning, https://arxiv.org/abs/1509.06461
- Dueling Network Architectures for Deep Reinforcement Learning, https://arxiv.org/abs/1511.06581