# Training PPO with decentralized averaging

This tutorial walks you through setting up collaborative training of the on-policy reinforcement learning algorithm [PPO](https://arxiv.org/pdf/1707.06347.pdf) to play Atari Breakout. It uses the [stable-baselines3 implementation of PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html); hyperparameters for the algorithm are taken from [rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml), and collaborative training is built on `hivemind.Optimizer`, which exchanges information between peers.

## Preparation

* Install hivemind: `pip install git+https://github.com/learning-at-home/hivemind.git`
* Install the dependencies: `pip install -r requirements.txt`

## Running an experiment

### First peer

Run the first DHT peer to welcome trainers and record training statistics (e.g., loss and performance):

- In this example, we use [tensorboard](https://www.tensorflow.org/tensorboard) to plot training metrics. If you're unfamiliar with TensorBoard, here's a [quickstart tutorial](https://www.tensorflow.org/tensorboard/get_started).
- Run `python3 ppo.py`

```
$ python3 ppo.py
To connect other peers to this one, use --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt
A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd) [Powered by Stella]
Using cuda device
Wrapping the env in a VecTransposeImage.
[W NNPACK.cpp:51] Could not initialize NNPACK! Reason: Unsupported hardware.
Jun 20 13:23:20.515 [INFO] Found no active peers: None
Jun 20 13:23:20.533 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Logging to logs/bs-256.target_bs-32768.n_envs-8.n_steps-128.n_epochs-1_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 521      |
|    ep_rew_mean     | 0        |
| time/              |          |
|    fps             | 582      |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 1024     |
| train/             |          |
|    timesteps       | 1024     |
---------------------------------
Jun 20 13:23:23.525 [INFO] ppo_hivemind accumulated 1024 samples for epoch #0 from 1 peers. ETA 52.20 sec (refresh in 1.00 sec)
```
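
The first line of the output shows the multiaddress that other trainers need in order to join this run. To add another peer, open a new terminal and pass that address (yours will differ) via `--initial_peers`, as the script itself suggests:

```
$ python3 ppo.py --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt
```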
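
Training metrics are written to the directory shown in the `Logging to ...` line (here, a subdirectory of `logs/`). To plot them, point TensorBoard at that directory and open the URL it prints:

```
$ tensorboard --logdir logs
```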
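
One way to combine the two libraries is to replace the optimizer of the stable-baselines3 policy with a `hivemind.Optimizer` that wraps it. Below is a minimal sketch of that idea, assuming the hyperparameters visible in the log directory name above (`bs-256`, `target_bs-32768`, `n_steps-128`, `n_epochs-1`) and the run id `ppo_hivemind`; it is not the exact contents of `ppo.py`, which additionally handles argument parsing, vectorized environment setup, and logging.

```python
import hivemind
from stable_baselines3 import PPO

# Start a DHT node; a joining peer would pass initial_peers=["/ip4/..."] here.
dht = hivemind.DHT(start=True)

# Plain stable-baselines3 PPO (simplified single-env setup for this sketch).
model = PPO("CnnPolicy", "BreakoutNoFrameskip-v4",
            n_steps=128, n_epochs=1, batch_size=256, verbose=1)

# Replace the policy's local optimizer with hivemind.Optimizer: each local step
# contributes batch_size_per_step samples, and once all peers collectively
# accumulate target_batch_size samples, their parameters are averaged.
model.policy.optimizer = hivemind.Optimizer(
    dht=dht,
    run_id="ppo_hivemind",             # unique name of the run; must match on every peer
    batch_size_per_step=256,           # samples contributed per optimizer step (bs-256 in the log)
    target_batch_size=32768,           # collective samples per averaging round (target_bs-32768)
    optimizer=model.policy.optimizer,  # wrap the optimizer created by stable-baselines3
    use_local_updates=True,            # apply local updates immediately, average in background
    matchmaking_time=3.0,              # time to gather peers before each averaging round
    averaging_timeout=10.0,            # give up on a round if it takes longer than this
    verbose=True,
)

model.learn(total_timesteps=10_000_000)
```

The final log line above (`accumulated 1024 samples for epoch #0 ... ETA 52.20 sec`) is this optimizer counting samples toward `target_batch_size` before the next averaging round.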