Developer zone
#### Collaboration best practices
Hivemind is still in an early stage of development, so we expect only a handful of collaborators with individual roles.
- Before you write any code, please contact us to avoid duplicate work:
  - Report bugs and propose new features via issues. We don't have strict templates at this point;
  - If you decide to implement a feature or fix a bug, first leave a comment in the appropriate issue or create a new one;
  - Please follow Contributor Covenant v2.0.
- When you code, follow the best practices:
  - The code must follow PEP8 unless a deviation is absolutely necessary. We recommend the PyCharm IDE;
  - All user-facing interfaces must be documented with docstrings and/or Sphinx;
  - We highly encourage the use of typing, where applicable.
- After you write the code, make sure others can use it:
  - Any function exposed to a user must have a docstring compatible with Read the Docs;
  - For new features, please write test(s) to make sure your functionality won't be broken by subsequent changes;
  - If you face any challenges or want feedback, please submit a draft pull request.
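
As a minimal sketch of the guidelines above, the hypothetical helper below combines type hints, a Sphinx-compatible docstring and a small pytest-style test. The name `moving_average` is an illustration only and is not part of hivemind.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """
    Compute a simple moving average over a sequence of numbers.

    :param values: input numbers, in order
    :param window: number of consecutive elements to average, must be positive
    :returns: len(values) - window + 1 averages, or an empty list if values is shorter than window
    :raises ValueError: if window is not positive
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i : i + window]) / window for i in range(len(values) - window + 1)]


def test_moving_average():
    # pytest collects functions named test_*; plain asserts are enough
    assert moving_average([1.0, 2.0, 3.0, 4.0], window=2) == [1.5, 2.5, 3.5]
    assert moving_average([1.0], window=2) == []
```

Placing the test function in a file under `./tests` lets pytest pick it up together with the existing tests.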
#### Developer quickstart
First, install hivemind in development mode, preferably with Python 3.8 on Linux or macOS.
git clone https://github.com/learning-at-home/hivemind
cd hivemind
pip install -e .
To run tests, you will also need to `pip install -e .[dev]`.
You can run all tests with `pytest ./tests` or choose a specific set, e.g. `pytest ./tests/test_dht.py`.
To build docs locally:
- `pip install -e .[docs]`
- make sure you have installed hivemind in development mode (see above)
- `cd ./docs && make html`

The documentation root will be available in `./docs/_build/html/index.html`
#### Benchmark throughput
You can use this benchmark to check the performance impact of your changes to `hivemind.client` and `hivemind.server`.
The benchmark starts one server without a DHT that hosts several experts, and then spawns trainer processes that bombard the server with requests.
The two main statistics in this benchmark are samples/s and startup time.

`python benchmark_throughput.py --preset default` (aka `ffn_forward_backward`)
Console outputs
```sh
Benchmark finished, status:Success
Server parameters: num_experts=16, num_handlers=64, max_batch_size=8192, expert_cls=ffn, hid_dim=1024, device=cuda
Client parameters: num_clients=128, num_batches_per_client=16, batch_size=2048, backprop=True
Results:
Server startup took 10.965 s. (3.075 s. experts + 7.889 s. networking)
Processed 4194304 examples in 146.750
Throughput for forward + backward passes: 28581.213 samples / s.
Benchmarking took 157.948 s.
Using device: cuda
GeForce GTX 1080 Ti
Memory Usage:
Allocated: 6.0 GB
Cached: 7.7 GB
```
`python benchmark_throughput.py --preset ffn_forward`
Console outputs
```sh
Benchmark finished, status:Success
Server parameters: num_experts=16, num_handlers=64, max_batch_size=8192, expert_cls=ffn, hid_dim=1024, device=cuda
Client parameters: num_clients=128, num_batches_per_client=16, batch_size=2048, backprop=False
Results:
Server startup took 19.941 s. (3.065 s. experts + 16.877 s. networking)
Processed 4194304 examples in 42.973
Throughput for forward passes: 97604.282 samples / s.
Benchmarking took 63.167 s.
Using device: cuda
GeForce GTX 1080 Ti
Memory Usage:
Allocated: 1.5 GB
Cached: 3.2 GB
```
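
The numbers in the two console outputs above appear to be related by simple arithmetic. The sketch below is inferred from the printed values, not taken from the benchmark source, and the function names are hypothetical: the number of processed examples is the product of the client parameters, and throughput is that number divided by the processing time.

```python
def total_examples(num_clients: int, num_batches_per_client: int, batch_size: int) -> int:
    """Number of examples sent to the server over the whole benchmark."""
    return num_clients * num_batches_per_client * batch_size


def throughput(num_examples: int, elapsed_seconds: float) -> float:
    """Average number of examples processed per second."""
    return num_examples / elapsed_seconds


# Values copied from the console outputs above
examples = total_examples(num_clients=128, num_batches_per_client=16, batch_size=2048)
forward_backward = throughput(examples, elapsed_seconds=146.750)
forward_only = throughput(examples, elapsed_seconds=42.973)

print(examples)  # 4194304
print(f"{forward_backward:.0f} samples/s (forward + backward)")
print(f"{forward_only:.0f} samples/s (forward only)")
```

Both results agree with the reported throughput up to the rounding of the printed processing time.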
All tests were performed on a single machine with Ubuntu Server 18.04 x64, an MSI 1080 Ti Turbo GPU, a Xeon Gold 6149 CPU,
384 GB of LRDIMM (6x64G), Python 3.8, torch 1.6.0 (pip-installed) and grpcio 1.31.0.
The results fluctuate by around ±5% between consecutive runs.
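
Because consecutive runs differ by roughly 5%, a throughput change smaller than that is hard to attribute to a code change. One possible way to make this check explicit is sketched below. The helper names and the candidate throughput values are hypothetical; only the baseline is taken from the run above.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative difference of candidate with respect to baseline."""
    return (candidate - baseline) / baseline


def exceeds_noise(baseline: float, candidate: float, noise: float = 0.05) -> bool:
    """True if the change is larger than the assumed run-to-run fluctuation."""
    return abs(relative_change(baseline, candidate)) > noise


baseline = 28581.213  # samples/s, forward + backward run reported above

# Hypothetical measurements after a code change
for candidate in (28000.0, 27000.0):
    change = relative_change(baseline, candidate)
    verdict = "outside" if exceeds_noise(baseline, candidate) else "within"
    print(f"{candidate:.0f} samples/s: {change:+.1%}, {verdict} the noise band")
```

A result within the noise band is inconclusive. Repeating the benchmark several times gives a better picture than a single comparison.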
#### Benchmark DHT
In turn, [this benchmark](https://github.com/learning-at-home/hivemind/blob/master/tests/benchmark_dht.py) can be used
to measure the performance impact of changes to `hivemind.dht`. It spawns a DHT with `num_peers` participants,
then chooses one peer that will declare `num_experts` total experts in batches of `expert_batch_size`.
Then, another peer will consecutively retrieve all declared experts and check that they are present.
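
The batching described above can be sketched as follows. This is an illustration rather than the benchmark's actual code: `iterate_batches` and the `expert.{i}` identifier format are hypothetical, while the sizes are those of the example run in this section.

```python
from typing import Iterator, List, Sequence


def iterate_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Hypothetical expert identifiers; the real benchmark samples its own ids
expert_ids = [f"expert.{i}" for i in range(16384)]
batches = list(iterate_batches(expert_ids, batch_size=64))

print(len(batches))  # 256
```

With 16384 experts and a batch size of 64 this gives 256 batches, which matches the 256 steps shown in the store and get progress bars of the example run.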
Here's a run with 1024 participants on the same machine that was used for benchmark_throughput:
`python benchmark_dht.py --num_peers 1024 --num_experts 16384 --expert_batch_size 64 --expiration 99999 --increase_file_limit`
Console outputs
```sh
Increasing file limit - soft 1024=>32768, hard 1048576=>32768
Creating peers...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [01:45<00:00, 9.74it/s]
Sampled 16384 unique ids (after deduplication)
Storing peers to dht in batches of 64...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [12:07<00:00, 2.84s/it]
Store success rate: 100.0% (48920 / 48920)
Mean store time: 0.01487, Total: 727.46
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [01:48<00:00, 2.35it/s]
Get success rate: 100.0 (16384 / 16384)
Mean get time: 0.00664, Total: 108.73952
Node survival rate: 100.000%
```
The three main statistics in this benchmark are total store time, total get time and get success rate.
Please also note that this benchmark does not emulate node failures or network latency, and does not benefit from caching.
If you want to account for these factors, you must introduce them manually by changing the code.
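
One way such factors could be introduced is to wrap each request in a helper that adds a random delay and occasionally drops the call. The sketch below is generic Python and does not use any hivemind API. The names `call_with_faults`, `failure_rate` and `max_latency` are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_faults(
    fn: Callable[[], T], failure_rate: float, max_latency: float, rng: random.Random
) -> Optional[T]:
    """
    Call fn after a random delay; with probability failure_rate, skip the call and return None.

    :param fn: zero-argument callable standing in for a single store or get request
    :param failure_rate: probability in [0, 1] that the request is dropped
    :param max_latency: upper bound of the uniformly sampled delay, in seconds
    :param rng: random generator, passed explicitly so that runs are reproducible
    """
    time.sleep(rng.uniform(0.0, max_latency))  # emulate network latency
    if rng.random() < failure_rate:  # emulate an unresponsive node
        return None
    return fn()


rng = random.Random(0)  # fixed seed for reproducible runs
results = [
    call_with_faults(lambda: "value", failure_rate=0.2, max_latency=0.001, rng=rng)
    for _ in range(100)
]
success_rate = sum(result is not None for result in results) / len(results)
print(f"Get success rate: {success_rate:.1%}")
```

In the real benchmark, the callable would stand in for a single DHT request. Where exactly to apply such a wrapper depends on the benchmark code.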
#### Tips & tricks
- You can find a wealth of PyTorch debugging tricks on the PyTorch contributing page.
- Hivemind is optimized for development in PyCharm CE 2019.3 or newer.
- When working on tests, please mark "tests" as sources root.