
Fix minor issues in documentation (#392)

- Refer to Discord in docs
- Highlight bibtex syntax
- Update macOS compatibility info
- Make bibtex formatting consistent
- Make PyPI badge blue instead of orange
- Remove link to the Learning@home homepage
- Update log format in examples/albert/README.md
Alexander Borzunov, 3 years ago
parent
commit 91d1d31796
3 changed files with 59 additions and 59 deletions
  1. README.md (+40, -40)
  2. docs/index.rst (+3, -3)
  3. examples/albert/README.md (+16, -16)

README.md (+40, -40)

@@ -1,7 +1,7 @@
 ## Hivemind: decentralized deep learning in PyTorch
 
 [![Documentation Status](https://readthedocs.org/projects/learning-at-home/badge/?version=latest)](https://learning-at-home.readthedocs.io/en/latest/?badge=latest)
-[![PyPI version](https://img.shields.io/pypi/v/hivemind.svg)](https://pypi.org/project/hivemind/)
+[![PyPI version](https://img.shields.io/pypi/v/hivemind.svg?color=blue)](https://pypi.org/project/hivemind/)
 [![Discord](https://img.shields.io/static/v1?style=default&label=Discord&logo=discord&message=join)](https://discord.gg/uGugx9zYvN)
 [![CI status](https://github.com/learning-at-home/hivemind/actions/workflows/run-tests.yml/badge.svg?branch=master)](https://github.com/learning-at-home/hivemind/actions)
 ![Codecov](https://img.shields.io/codecov/c/github/learning-at-home/hivemind)
@@ -23,8 +23,8 @@ large model on hundreds of computers from different universities, companies, and
 * Train neural networks of arbitrary size: parts of their layers are distributed across the participants with the
   Decentralized Mixture-of-Experts ([paper](https://arxiv.org/abs/2002.04013)).
 
-To learn more about the ideas behind this library, see https://learning-at-home.github.io or read
-the [NeurIPS 2020 paper](https://arxiv.org/abs/2002.04013).
+To learn more about the ideas behind this library,
+see the [full list](https://github.com/learning-at-home/hivemind/tree/refer-to-discord-in-docs#citation) of our papers below.
 
 ## Installation
 
@@ -65,8 +65,8 @@ of [Go toolchain](https://golang.org/doc/install) (1.15 or higher).
 
 - __Linux__ is the default OS for which hivemind is developed and tested. We recommend Ubuntu 18.04+ (64-bit), but
   other 64-bit distros should work as well. Legacy 32-bit is not recommended.
-- __macOS 10.x__ mostly works but requires building hivemind from source, and some edge cases may fail. To ensure full
-  compatibility, we recommend using [our Docker image](https://hub.docker.com/r/learningathome/hivemind).
+- __macOS 10.x__ can run hivemind using [Docker](https://docs.docker.com/desktop/mac/install/).
+  We recommend using [our Docker image](https://hub.docker.com/r/learningathome/hivemind).
 - __Windows 10+ (experimental)__ can run hivemind
   using [WSL](https://docs.microsoft.com/ru-ru/windows/wsl/install-win10). You can configure WSL to use GPU by
   following sections 1–3 of [this guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html) by NVIDIA. After
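As a quick illustration of the Docker route recommended for macOS above, here is a minimal sketch using the published learningathome/hivemind image; the `:latest` tag and the command run inside the container are assumptions, not taken from this commit:

```shell
# Pull the published hivemind image and verify the package is importable inside the container.
# The :latest tag and the python one-liner are illustrative assumptions.
docker pull learningathome/hivemind:latest
docker run --rm -it learningathome/hivemind:latest python -c "import hivemind"
```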
@@ -83,13 +83,13 @@ of [Go toolchain](https://golang.org/doc/install) (1.15 or higher).
 * API reference and additional tutorials are available
   at [learning-at-home.readthedocs.io](https://learning-at-home.readthedocs.io)
 
-If you have any questions about installing and using hivemind, you can ask them in
+If you have any questions about installing and using hivemind, feel free to ask them in
 [our Discord chat](https://discord.gg/uGugx9zYvN) or file an [issue](https://github.com/learning-at-home/hivemind/issues).
 
 ## Contributing
 
 Hivemind is currently at the active development stage, and we welcome all contributions. Everything, from bug fixes and
-documentation improvements to entirely new features, is equally appreciated.
+documentation improvements to entirely new features, is appreciated.
 
 If you want to contribute to hivemind but don't know where to start, take a look at the
 unresolved [issues](https://github.com/learning-at-home/hivemind/issues). Open a new issue or
@@ -105,9 +105,9 @@ our [guide](https://learning-at-home.readthedocs.io/en/latest/user/contributing.
 
 If you found hivemind or its underlying algorithms useful for your research, please cite the following source:
 
-```
+```bibtex
 @misc{hivemind,
-  author = {Learning@home team},
+  author = {Learning{@}home team},
   title = {{H}ivemind: a {L}ibrary for {D}ecentralized {D}eep {L}earning},
   year = 2020,
   howpublished = {\url{https://github.com/learning-at-home/hivemind}},
@@ -118,17 +118,17 @@ Also, you can cite [the paper](https://arxiv.org/abs/2002.04013) that inspired t
 (prototype implementation of hivemind available
 at [mryab/learning-at-home](https://github.com/mryab/learning-at-home)):
 
-```
+```bibtex
 @inproceedings{ryabinin2020crowdsourced,
- author = {Ryabinin, Max and Gusev, Anton},
- booktitle = {Advances in Neural Information Processing Systems},
- editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
- pages = {3659--3672},
- publisher = {Curran Associates, Inc.},
- title = {Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts},
- url = {https://proceedings.neurips.cc/paper/2020/file/25ddc0f8c9d3e22e03d3076f98d83cb2-Paper.pdf},
- volume = {33},
- year = {2020}
+  author = {Ryabinin, Max and Gusev, Anton},
+  booktitle = {Advances in Neural Information Processing Systems},
+  editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
+  pages = {3659--3672},
+  publisher = {Curran Associates, Inc.},
+  title = {Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts},
+  url = {https://proceedings.neurips.cc/paper/2020/file/25ddc0f8c9d3e22e03d3076f98d83cb2-Paper.pdf},
+  volume = {33},
+  year = {2020}
 }
 ```
 
@@ -137,40 +137,40 @@ at [mryab/learning-at-home](https://github.com/mryab/learning-at-home)):
 
 ["Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices"](https://arxiv.org/abs/2103.03239)
 
-```
+```bibtex
 @misc{ryabinin2021moshpit,
-      title={Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices}, 
-      author={Max Ryabinin and Eduard Gorbunov and Vsevolod Plokhotnyuk and Gennady Pekhimenko},
-      year={2021},
-      eprint={2103.03239},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG}
+  title = {Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices},
+  author = {Max Ryabinin and Eduard Gorbunov and Vsevolod Plokhotnyuk and Gennady Pekhimenko},
+  year = {2021},
+  eprint = {2103.03239},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG}
 }
 ```
 
 ["Distributed Deep Learning in Open Collaborations"](https://arxiv.org/abs/2106.10207)
 
-```
+```bibtex
 @misc{diskin2021distributed,
-      title={Distributed Deep Learning in Open Collaborations}, 
-      author={Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
-      year={2021},
-      eprint={2106.10207},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG}
+  title = {Distributed Deep Learning in Open Collaborations},
+  author = {Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
+  year = {2021},
+  eprint = {2106.10207},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG}
 }
 ```
 
 ["Secure Distributed Training at Scale"](https://arxiv.org/abs/2106.11257)
 
-```
+```bibtex
 @misc{gorbunov2021secure,
-      title={Secure Distributed Training at Scale}, 
-      author={Eduard Gorbunov and Alexander Borzunov and Michael Diskin and Max Ryabinin},
-      year={2021},
-      eprint={2106.11257},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG}
+  title = {Secure Distributed Training at Scale},
+  author = {Eduard Gorbunov and Alexander Borzunov and Michael Diskin and Max Ryabinin},
+  year = {2021},
+  eprint = {2106.11257},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG}
 }
 ```
 

docs/index.rst (+3, -3)

@@ -9,9 +9,9 @@ of computers, whether you're running a very capable computer or a less reliable
 Learn how to create or join a Hivemind run in the `quickstart tutorial <./user/quickstart.html>`__ or browse the API
 documentation below.
 
-| Hivemind is currently in active development, so expect some adventures. If you encounter any issues, please let us know
-  `on github <https://github.com/learning-at-home/hivemind/issues>`__.
-
+| Hivemind is currently in active development, so expect some adventures. If you have any questions, feel free to ask them
+  in `our Discord chat <https://discord.com/invite/uGugx9zYvN>`_ or
+  file an `issue <https://github.com/learning-at-home/hivemind/issues>`__.
 
 **Table of contents:**
 ~~~~~~~~~~~~~~~~~~~~~~

examples/albert/README.md (+16, -16)

@@ -22,15 +22,16 @@ Run the first DHT peer to welcome trainers and record training statistics (e.g.,
   & Biases, here's a [quickstart tutorial](https://docs.wandb.ai/quickstart).
 - Run `python run_training_monitor.py --experiment_prefix YOUR_EXPERIMENT_NAME --wandb_project YOUR_WANDB_PROJECT`
 
-  - `YOUR_EXPERIMENT_NAME` must be a unique name of this training run, e.g. `my-first-albert`. It cannot contain `.`
+  - `YOUR_EXPERIMENT_NAME` must be a unique name of this training run, e.g. `my-albert-v1`. It cannot contain `.`
     due to naming conventions.
   - `YOUR_WANDB_PROJECT` is a name of wandb project used to track training metrics. Multiple experiments can have the
     same project name.
 
 ```
 $ python run_training_monitor.py --experiment_prefix my-albert-v1 --wandb_project Demo-run
-[2021/06/17 16:26:36.083][INFO][root.log_visible_maddrs:54] Running a DHT peer. To connect other peers to this one over the Internet,
+Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:47] Running a DHT peer. To connect other peers to this one over the Internet,
 use --initial_peers /ip4/1.2.3.4/tcp/1337/p2p/XXXX /ip4/1.2.3.4/udp/31337/quic/p2p/XXXX
+Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:50] Full list of visible multiaddresses: ...
 wandb: Currently logged in as: XXX (use `wandb login --relogin` to force relogin)
 wandb: Tracking run with wandb version 0.10.32
 wandb: Syncing run dry-mountain-2
@@ -38,12 +39,12 @@ wandb:  View project at https://wandb.ai/XXX/Demo-run
 wandb:  View run at https://wandb.ai/XXX/Demo-run/runs/YYY
 wandb: Run data is saved locally in /path/to/run/data
 wandb: Run `wandb offline` to turn off syncing.
-[2021/04/19 02:26:41.064][INFO][optim.collaborative.fetch_collaboration_state:323] Found no active peers: None
-[2021/04/19 02:26:44.068][INFO][optim.collaborative.fetch_collaboration_state:323] Found no active peers: None
+Oct 14 16:26:41.064 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
+Oct 14 16:26:44.068 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
 ...
-[2021/04/19 02:37:37.246][INFO][__main__.<module>:194] Step #1  loss = 11.05164
-[2021/04/19 02:39:37.441][INFO][__main__.<module>:194] Step #2  loss = 11.03771
-[2021/04/19 02:40:37.541][INFO][__main__.<module>:194] Step #3  loss = 11.02886
+Oct 14 16:37:37.246 [INFO] [__main__.<module>:209] Step #1  loss = 11.05164
+Oct 14 16:39:37.441 [INFO] [__main__.<module>:209] Step #2  loss = 11.03771
+Oct 14 16:40:37.541 [INFO] [__main__.<module>:209] Step #3  loss = 11.02886
 ```
 
 ### GPU trainers
@@ -88,14 +89,13 @@ See the ["Tips and tricks"](#tips-and-tricks) section for more information on se
 As the peer begins training, it will periodically report training logs in the following form:
 
 ```
-[...][INFO][...] Collaboration accumulated 448 samples from 17 peers; ETA 18.88 seconds (refresh in 15.73s.)
-[...][INFO][...] Collaboration accumulated 4096 samples from 16 peers; ETA 0.00 seconds (refresh in 0.50s.)
-[...][INFO][optim.collaborative.step:195] Averaged tensors successfully with 17 peers
-[...][INFO][optim.collaborative.step:211] Optimizer step: done!
-06/17/2021 18:58:23 - INFO - __main__ -   Step 0
-06/17/2021 18:58:23 - INFO - __main__ -   Your current contribution: 892 samples
-06/17/2021 18:58:23 - INFO - __main__ -   Local loss: 11.023
-
+... [INFO] [...] my-albert-v1 accumulated 448 samples from 17 peers for step #0. ETA 18.88 sec (refresh in 15.73 sec)
+... [INFO] [...] my-albert-v1 accumulated 4096 samples from 16 peers for step #0. ETA 0.00 sec (refresh in 0.50 sec)
+... [INFO] [optim.collaborative.step:283] Averaged tensors successfully with 17 peers
+... [INFO] [optim.collaborative.step:317] Optimizer step: done!
+Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:141] Step 1
+Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:142] Your current contribution: 892 samples
+Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:143] Local loss: 11.023
 ```
 
 __Sanity check:__ a healthy peer will periodically report `Averaged tensors successfully with [N > 1]` peers.
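One quick way to run this sanity check against a saved log is sketched below; the log file path is an assumption, while the message text comes from the example output above:

```shell
# Count how often the peer reports successful averaging; a healthy peer logs this periodically.
grep -c "Averaged tensors successfully" trainer.log
```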
@@ -171,7 +171,7 @@ Here's an example of a full trainer script for Google Colab:
 !curl -L YOUR_HOSTED_DATA | tar xzf -
 !ulimit -n 4096 && python ./hivemind/examples/albert/run_trainer.py \
     --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
-    --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
+    --logging_dir ./logs --logging_first_step --output_dir ./outputs --overwrite_output_dir \
     --client_mode --averaging_expiration 10 --batch_size_lead 300 --gradient_accumulation_steps 1
 ```