Rl-atari implements a set of reinforcement learning algorithms that learn control policies for playing Atari games.
This implementation recreates the Deep Q-Network (DQN) model and configuration first proposed by Google DeepMind.
It combines computer vision (convolutional neural networks) with reinforcement learning (Q-learning) to train an agent
that autonomously (read: without supervision) learns how to play a computer game from pixel and reward inputs alone.
The following deep-RL variants and features are implemented; their integrated combination is known as the Rainbow agent.
Training these agents takes a very long time, which means an extended turnaround to obtain (new) results. New progress and figures will be pushed as they come in.
| Enabled | Algorithm | Reference |
|---|---|---|
| ✔️ | Deep Q-Learning (DQN) | https://arxiv.org/abs/1312.5602 https://www.nature.com/articles/nature14236 |
| ✔️ | Double Q-Learning (DDQN) | https://arxiv.org/abs/1509.06461 |
| ✔️ | Prioritized Experience Replay (PER) | https://arxiv.org/abs/1511.05952 |
| ✔️ | Multi-step Learning | https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf |
| ✔️ | Dueling Network Architecture | https://arxiv.org/abs/1511.06581 |
| ✔️ | Noisy Network | https://arxiv.org/abs/1706.10295 |
| ✔️ | Distributional Network (C51) | https://arxiv.org/abs/1707.06887 |
| | Async Advantage Actor-Critic (A3C) | https://arxiv.org/abs/1602.01783 |
| ✔️ | Rainbow Agent | https://arxiv.org/abs/1710.02298 |
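To make the Double Q-learning entry above concrete, here is a minimal numpy sketch of the DDQN target: the online network selects the greedy next action while the target network evaluates it, which is the decoupling that reduces Q-value overestimation. The function name and shapes are illustrative, not taken from this repo.

```python
import numpy as np

def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double Q-learning (DDQN) target, per van Hasselt et al. (2015).
    rewards, dones: shape (batch,); next_q_*: shape (batch, n_actions).
    Hypothetical helper -- shapes/names are assumptions for illustration."""
    # Action *selection* uses the online network ...
    best_actions = np.argmax(next_q_online, axis=1)
    # ... action *evaluation* uses the target network.
    next_values = next_q_target[np.arange(len(rewards)), best_actions]
    # Terminal states (dones == 1) bootstrap no future value.
    return rewards + gamma * (1.0 - dones) * next_values
```

In plain DQN both selection and evaluation would use `next_q_target`, which systematically overestimates values under noise.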
The virtual environment and dependencies are managed with Conda and Poetry, respectively. Conda builds the environment from the yaml file; Poetry installs the exact requirements and dependencies from the .lock and .toml files.
Custom functions output comprehensive animations of the agent playing episodes during training, with details such as Q-values, Value and Advantage stream values, the Z distribution, and activation maps. Animations are generated by functions in utils.py.
Example 1: Dueling agent
The following shows a (partially) trained DQN agent playing Space Invaders, attaining a training score of 2145. Displayed are:
- **Left:** Original Atari frame.
- **Right:** Preprocessed frame as viewed by the agent, overlaid with convolutional-network saliency maps that show pixel attribution in the agent's decision making. A dueling network was used; the value stream is displayed in blue and the advantage stream in red.
- **Middle:** Max Q-value series as estimated by the agent, along with the Value and Advantage stream values.
- **Bottom:** Q-values for each action.
Note:
- The animation is sped up to 60 fps.
- Saliency/activation maps can often be rather noisy, but some points of attention stand out, such as the agent focusing on the bonus (round) flying saucer.
6929_train_2115.0.mp4
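The saliency overlay described above can be sketched as follows: normalize the absolute input gradients to [0, 1] and alpha-blend a color tint onto the grayscale frame. This is a hypothetical helper mirroring what the utils.py animations do, not the repo's actual function; names, the color convention, and the blend factor are assumptions.

```python
import numpy as np

def overlay_saliency(frame_gray, grads, color=(1.0, 0.0, 0.0), alpha=0.6):
    """Blend a |gradient| heatmap onto a grayscale frame (illustrative sketch).
    frame_gray: (H, W) in [0, 1]; grads: (H, W) raw input gradients.
    color: tint for this stream, e.g. red for advantage, blue for value.
    Returns an (H, W, 3) RGB image."""
    sal = np.abs(grads)
    sal = sal / (sal.max() + 1e-8)                 # normalize saliency to [0, 1]
    rgb = np.repeat(frame_gray[..., None], 3, -1)  # grayscale -> RGB
    tint = np.array(color) * sal[..., None]        # colored heatmap
    # Blend: strong saliency pulls the pixel toward the tint color.
    return np.clip((1 - alpha * sal[..., None]) * rgb + alpha * tint, 0.0, 1.0)
```

In practice the gradients would come from differentiating the chosen stream's output with respect to the input frame (e.g. with `tf.GradientTape` in TensorFlow 2).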
Example 2: Distributional agent
Same as Example 1, except the Q-values are replaced with a categorical value distribution.
23744_train_2460.0.mp4
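For readers unfamiliar with C51, the categorical value distribution shown in this example reduces to scalar Q-values by taking the expectation over a fixed atom support. A minimal sketch, assuming the standard C51 support of 51 atoms on [-10, 10] (the actual bounds used here are not stated in this README):

```python
import numpy as np

def c51_q_values(logits, v_min=-10.0, v_max=10.0, n_atoms=51):
    """Turn a categorical (C51) value distribution into scalar Q-values.
    logits: (n_actions, n_atoms). Support bounds follow the C51 paper;
    treat the exact values as assumptions for this sketch."""
    z = np.linspace(v_min, v_max, n_atoms)                  # fixed atom support
    p = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    p = p / p.sum(axis=1, keepdims=True)                    # per-action probabilities
    return (p * z).sum(axis=1)                              # Q(s, a) = sum_i z_i * p_i(s, a)
```

The animation plots the full distribution `p` per action; the expectation above is what a greedy policy would act on.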
Visual inspection of metrics during training is essential for measuring model performance, analyzing agent behavior, and predicting progress and runtimes. The first plot below displays statistics aggregated per training episode: total reward (score), episode length (in played frames), mean of the max Q-values, mean TD-error, and runtime. The second plot shows the distribution of actions taken over time.
Plots are generated from functions in utils.py.
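The per-episode aggregation feeding the first plot can be sketched like this; the function and key names are illustrative, not the actual utils.py API:

```python
import numpy as np

def episode_stats(rewards, max_qs, td_errors):
    """Aggregate per-step logs into per-episode statistics
    (hypothetical helper mirroring the plotted quantities)."""
    return {
        "score": float(np.sum(rewards)),             # total reward this episode
        "length": len(rewards),                      # frames played
        "mean_max_q": float(np.mean(max_qs)),        # mean of max Q-values
        "mean_td_error": float(np.mean(td_errors)),  # mean TD-error
    }
```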
AI agents are built on top of convolutional neural networks implemented in TensorFlow 2. Both a small network (2 conv layers) and a large network (3 conv layers) are supported.
Example 1: A large network with Dueling (V and A) streams, as well as noisy layers.
Example 2: A large network with Dueling (V and A) streams, as well as a distributional (C51) output.
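The defining step of the dueling architecture in both examples is how the V and A streams are recombined into Q-values. A minimal numpy sketch of that aggregation (the network heads themselves are omitted; the function name is an assumption):

```python
import numpy as np

def dueling_aggregate(value, advantages):
    """Dueling aggregation (Wang et al., 2016): combine the value stream
    V(s) and advantage stream A(s, a) into Q-values. Subtracting the mean
    advantage makes the decomposition identifiable.
    value: (batch, 1); advantages: (batch, n_actions)."""
    return value + advantages - advantages.mean(axis=1, keepdims=True)
```

Without the mean-advantage baseline, a constant could be shifted freely between V and A; centering the advantages pins the split down so the two streams learn meaningfully separate quantities.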



