
Training PPO with decentralized averaging

This tutorial will walk you through the steps to set up collaborative training of PPO, an on-policy reinforcement learning algorithm, to play Atari Breakout. It uses the stable-baselines3 implementation of PPO, takes its hyperparameters from rl-baselines3-zoo, and builds the collaborative training on hivemind.Optimizer, which exchanges information between peers.
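At its core, the script replaces the PPO model's regular PyTorch optimizer with a hivemind.Optimizer that averages updates across peers. Below is a minimal sketch of that wiring, not the repository's ppo.py verbatim: the run_id and batch sizes mirror the log further down, while the environment id and total_timesteps are illustrative assumptions.

    import hivemind
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_atari_env
    from stable_baselines3.common.vec_env import VecFrameStack

    # Standard Atari preprocessing: 8 parallel envs with 4 stacked frames,
    # so each rollout collects n_envs * n_steps = 1024 samples (as in the log).
    env = VecFrameStack(make_atari_env("BreakoutNoFrameskip-v4", n_envs=8), n_stack=4)
    model = PPO("CnnPolicy", env, n_steps=128, n_epochs=1, batch_size=256, verbose=1)

    # The first peer starts its own DHT; subsequent peers would pass
    # initial_peers=[...] with the multiaddress printed here.
    dht = hivemind.DHT(start=True)
    print("To connect other peers to this one, use --initial_peers",
          " ".join(str(addr) for addr in dht.get_visible_maddrs()))

    # Swap in a decentralized optimizer: peers sharing the same run_id jointly
    # accumulate target_batch_size samples per epoch and average their updates.
    model.policy.optimizer = hivemind.Optimizer(
        dht=dht,
        run_id="ppo_hivemind",             # peers with the same run_id train together
        optimizer=model.policy.optimizer,  # wrap the optimizer PPO already created
        batch_size_per_step=256,           # samples this peer contributes per step()
        target_batch_size=32768,           # global samples per averaging epoch
        use_local_updates=True,            # apply steps locally, average parameters periodically
        verbose=True,
    )

    model.learn(total_timesteps=10_000_000)  # illustrative; pick your own budget

Because hivemind.Optimizer follows the torch.optim.Optimizer interface, stable-baselines3 can call it like any other optimizer during its usual training loop.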

Preparation

  • Install hivemind: pip install git+https://github.com/learning-at-home/hivemind.git
  • Dependencies: pip install -r requirements.txt

Running an experiment

First peer

Run the first DHT peer to welcome trainers and record training statistics (e.g., loss and performance):

  • In this example, we use TensorBoard to plot training metrics. If you're unfamiliar with TensorBoard, here's a quickstart tutorial; the command to launch it is shown after the sample log below.
  • Run python3 ppo.py

    $ python3 ppo.py
    To connect other peers to this one, use --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt
    A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd)
    [Powered by Stella]
    Using cuda device
    Wrapping the env in a VecTransposeImage.
    [W NNPACK.cpp:51] Could not initialize NNPACK! Reason: Unsupported hardware.
    Jun 20 13:23:20.515 [INFO] Found no active peers: None
    Jun 20 13:23:20.533 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
    Logging to logs/bs-256.target_bs-32768.n_envs-8.n_steps-128.n_epochs-1_1
    ---------------------------------
    | rollout/           |          |
    |    ep_len_mean     | 521      |
    |    ep_rew_mean     | 0        |
    | time/              |          |
    |    fps             | 582      |
    |    iterations      | 1        |
    |    time_elapsed    | 1        |
    |    total_timesteps | 1024     |
    | train/             |          |
    |    timesteps       | 1024     |
    ---------------------------------
    Jun 20 13:23:23.525 [INFO] ppo_hivemind accumulated 1024 samples for epoch #0 from 1 peers. ETA 52.20 sec (refresh in 15.00 sec)
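
To add more trainers, run the same script in other terminals or on other machines and pass the multiaddress printed by the first peer (substitute the one your own peer prints):

    $ python3 ppo.py --initial_peers /ip4/127.0.0.1/tcp/41926/p2p/QmUmiebP4BxdEPEpQb28cqyhaheDugFRn7MCoLJr556xYt

Since metrics are written under logs/ (see the "Logging to" line above), you can monitor training with:

    $ tensorboard --logdir logs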