Hivemind: decentralized deep learning in PyTorch

Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.

Key Features

  • Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized network.
  • Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too long to respond.
  • Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to synchronize across the entire network (paper).
  • Train neural networks of arbitrary size: parts of their layers are distributed across the participants with the Decentralized Mixture-of-Experts (paper).

To learn more about the ideas behind this library, see https://learning-at-home.github.io or read the NeurIPS 2020 paper.
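
For a feel of how these pieces fit together, below is a minimal sketch (not an official example) that starts a DHT node and a decentralized averager on a single machine. The class names come from hivemind's public API, but exact constructor arguments and defaults may differ between versions.

```python
# Minimal sketch: one DHT node plus a decentralized averager. Assumes a recent hivemind API;
# argument names such as `prefix` and `target_group_size` may differ between versions.
import torch
import hivemind

# Start a DHT node. A second peer would join the same network by passing
# initial_peers=first_dht.get_visible_maddrs() to its own hivemind.DHT(...).
dht = hivemind.DHT(start=True)

# Tensors to be averaged with other peers (e.g., model parameters or gradients).
local_tensors = [torch.zeros(16)]

averager = hivemind.averaging.DecentralizedAverager(
    averaged_tensors=local_tensors,
    dht=dht,
    prefix="demo_run",        # peers that share a prefix are averaged together
    target_group_size=2,      # number of peers gathered per averaging round
    start=True,
)

# averager.step() forms a group and runs all-reduce; with a single peer it would
# block waiting for a partner, so it is left commented out here.
# averager.step(timeout=60)
```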

Installation

Before installing, make sure that your environment has Python 3.7+ and PyTorch 1.6.0 or newer. They can be installed either natively or with Anaconda.

You can get the latest release with pip or build hivemind from source.

With pip

If your versions of Python and PyTorch match the requirements, you can install hivemind from pip:

```
pip install hivemind
```
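
To check that the package is importable after installation, you can print its version (a trivial sanity check, not part of the official docs):

```python
# Quick post-install check: importing hivemind and reading __version__ should succeed.
import hivemind

print(hivemind.__version__)
```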

From source

To install hivemind from source, simply run the following:

```
git clone https://github.com/learning-at-home/hivemind.git
cd hivemind
pip install .
```

If you would like to verify that your installation is working properly, you can install with pip install -e .[dev] instead. Then, you can run the tests with pytest tests/.

By default, hivemind uses a precompiled binary of the go-libp2p-daemon library. If you face compatibility issues or want to build the binary yourself, you can recompile it by running pip install . --global-option="--buildgo". Before compiling, please ensure that your machine has a recent version of the Go toolchain (1.15 or higher).
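
As a quick way to confirm that the bundled daemon binary runs on your machine, you can start a standalone DHT node, which launches the daemon under the hood (a hedged sketch; get_visible_maddrs is part of the DHT API in recent hivemind versions):

```python
import hivemind

# Starting a DHT node launches the p2p daemon, so this fails early if the binary cannot run.
dht = hivemind.DHT(start=True)
print("Visible multiaddresses:", dht.get_visible_maddrs())
dht.shutdown()
```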

System requirements

  • Linux is the default OS for which hivemind is developed and tested. We recommend Ubuntu 18.04+ (64-bit), but other 64-bit distros should work as well. Legacy 32-bit is not recommended.
  • macOS 10.x mostly works but requires building hivemind from source, and some edge cases may fail. To ensure full compatibility, we recommend using our Docker image.
  • Windows 10+ (experimental) can run hivemind using WSL. You can configure WSL to use GPU by following sections 1–3 of this guide by NVIDIA. After that, you can simply follow the instructions above to install with pip or from source.

Documentation

If you have any questions about installing and using hivemind, you can ask them in our Discord chat or file an issue.

Contributing

Hivemind is currently under active development, and we welcome all contributions. Everything, from bug fixes and documentation improvements to entirely new features, is equally appreciated.

If you want to contribute to hivemind but don't know where to start, take a look at the unresolved issues. Open a new issue or join our chat room if you want to discuss new functionality or report a possible bug. Bug fixes are always welcome, but new features should preferably be discussed with the maintainers beforehand.

If you want to start contributing to the source code of hivemind, please see the contributing guidelines first. To learn more about other ways to contribute, read our guide.

Citation

If you found hivemind or its underlying algorithms useful for your research, please cite the following source:

@misc{hivemind,
  author = {Learning@home team},
  title = {{H}ivemind: a {L}ibrary for {D}ecentralized {D}eep {L}earning},
  year = 2020,
  howpublished = {\url{https://github.com/learning-at-home/hivemind}},
}

Also, you can cite the paper that inspired the creation of this library (a prototype implementation of hivemind is available at mryab/learning-at-home):

@inproceedings{ryabinin2020crowdsourced,
 author = {Ryabinin, Max and Gusev, Anton},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {3659--3672},
 publisher = {Curran Associates, Inc.},
 title = {Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts},
 url = {https://proceedings.neurips.cc/paper/2020/file/25ddc0f8c9d3e22e03d3076f98d83cb2-Paper.pdf},
 volume = {33},
 year = {2020}
}
Additional publications

["Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices"](https://arxiv.org/abs/2103.03239)

@misc{ryabinin2021moshpit,
  title={Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices},
  author={Max Ryabinin and Eduard Gorbunov and Vsevolod Plokhotnyuk and Gennady Pekhimenko},
  year={2021},
  eprint={2103.03239},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

["Distributed Deep Learning in Open Collaborations"](https://arxiv.org/abs/2106.10207)

@misc{diskin2021distributed,
  title={Distributed Deep Learning in Open Collaborations},
  author={Michael Diskin and Alexey Bukhtiyarov and Max Ryabinin and Lucile Saulnier and Quentin Lhoest and Anton Sinitsin and Dmitry Popov and Dmitry Pyrkin and Maxim Kashirin and Alexander Borzunov and Albert Villanova del Moral and Denis Mazur and Ilia Kobelev and Yacine Jernite and Thomas Wolf and Gennady Pekhimenko},
  year={2021},
  eprint={2106.10207},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

["Secure Distributed Training at Scale"](https://arxiv.org/abs/2106.11257)

@misc{gorbunov2021secure,
  title={Secure Distributed Training at Scale},
  author={Eduard Gorbunov and Alexander Borzunov and Michael Diskin and Max Ryabinin},
  year={2021},
  eprint={2106.11257},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

We also maintain a list of related projects and acknowledgements.