
Training PPO with decentralized averaging

This tutorial will walk you through the steps to set up collaborative training of PPO, an on-policy reinforcement learning algorithm, to play Atari Breakout. It uses the stable-baselines3 implementation of PPO, takes its hyperparameters from rl-baselines3-zoo, and builds the collaborative training on hivemind.Optimizer, which exchanges information between peers.
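At its core, the script replaces the PPO model's regular PyTorch optimizer with a hivemind.Optimizer that averages updates across peers. Below is a minimal sketch of that wiring, not the repository's ppo.py verbatim: the run_id and batch sizes mirror the log further down, while the environment id and total_timesteps are illustrative assumptions.

    import hivemind
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_atari_env
    from stable_baselines3.common.vec_env import VecFrameStack

    # Standard Atari preprocessing: 8 parallel envs with 4 stacked frames,
    # so each rollout collects n_envs * n_steps = 1024 samples (as in the log).
    env = VecFrameStack(make_atari_env("BreakoutNoFrameskip-v4", n_envs=8), n_stack=4)
    model = PPO("CnnPolicy", env, n_steps=128, n_epochs=1, batch_size=256, verbose=1)

    # The first peer starts its own DHT; subsequent peers would pass
    # initial_peers=[...] with the multiaddress printed here.
    dht = hivemind.DHT(start=True)
    print("To connect other peers to this one, use --initial_peers",
          " ".join(str(addr) for addr in dht.get_visible_maddrs()))

    # Swap in a decentralized optimizer: peers sharing the same run_id jointly
    # accumulate target_batch_size samples per epoch and average their updates.
    model.policy.optimizer = hivemind.Optimizer(
        dht=dht,
        run_id="ppo_hivemind",             # peers with the same run_id train together
        optimizer=model.policy.optimizer,  # wrap the optimizer PPO already created
        batch_size_per_step=256,           # samples this peer contributes per step()
        target_batch_size=32768,           # global samples per averaging epoch
        use_local_updates=True,            # apply steps locally, average parameters periodically
        verbose=True,
    )

    model.learn(total_timesteps=10_000_000)  # illustrative; pick your own budget

Because hivemind.Optimizer follows the torch.optim.Optimizer interface, stable-baselines3 can call it like any other optimizer during its usual training loop.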

Preparation

  • Install hivemind: pip install git+https://github.com/learning-at-home/hivemind.git
  • Dependencies: pip install -r requirements.txt

Running an experiment

First peer

Run the first DHT peer to welcome trainers and record training statistics (e.g., loss and performance):

  • In this example, we use TensorBoard to plot training metrics. If you're unfamiliar with TensorBoard, here's a quickstart tutorial; the command to launch it is shown after the sample log below.
  • Run python3 ppo.py

    $ python3 ppo.py
    To connect other peers to this one, use --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt
    A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd)
    [Powered by Stella]
    Using cuda device
    Wrapping the env in a VecTransposeImage.
    [W NNPACK.cpp:51] Could not initialize NNPACK! Reason: Unsupported hardware.
    Jun 20 13:23:20.515 [INFO] Found no active peers: None
    Jun 20 13:23:20.533 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
    Logging to logs/bs-256.target_bs-32768.n_envs-8.n_steps-128.n_epochs-1_1
    ---------------------------------
    | rollout/           |          |
    |    ep_len_mean     | 521      |
    |    ep_rew_mean     | 0        |
    | time/              |          |
    |    fps             | 582      |
    |    iterations      | 1        |
    |    time_elapsed    | 1        |
    |    total_timesteps | 1024     |
    | train/             |          |
    |    timesteps       | 1024     |
    ---------------------------------
    Jun 20 13:23:23.525 [INFO] ppo_hivemind accumulated 1024 samples for epoch #0 from 1 peers. ETA 52.20 sec (refresh in 15.00 sec)
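
To add more trainers, run the same script in other terminals or on other machines and pass the multiaddress printed by the first peer (substitute the one your own peer prints):

    $ python3 ppo.py --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt

Since metrics are written under logs/ (see the "Logging to" line above), you can monitor training with:

    $ tensorboard --logdir logs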