This page describes the benchmark scripts that can be used to measure the performance impact of different changes to hivemind.
You can use this benchmark to check the performance impact of your changes to hivemind.moe. The benchmark starts one server with several experts and no DHT, then spawns trainer processes that load the server with requests. The two main statistics in this benchmark are samples/s and startup time.
python benchmark_throughput.py --preset default
(aka ffn_forward_backward)
python benchmark_throughput.py --preset ffn_forward
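To make the two statistics concrete, here is a minimal sketch of how samples/s and startup time could be measured. It is not the code of benchmark_throughput.py: `fake_expert_call` and `run_benchmark` are hypothetical names, the "server" is a sleep call, and trainers are threads rather than processes, all assumed purely for illustration.

```python
import threading
import time


def fake_expert_call(batch_size: int) -> int:
    """Hypothetical stand-in for one request to the server; returns samples processed."""
    time.sleep(0.001)  # pretend the server spends about 1 ms on each batch
    return batch_size


def run_benchmark(num_trainers: int = 4, requests_per_trainer: int = 20, batch_size: int = 16) -> dict:
    # Startup phase: everything that happens before the first request is sent.
    startup_begin = time.perf_counter()
    counts = [0] * num_trainers  # one slot per trainer, so no lock is needed

    def trainer(index: int) -> None:
        for _ in range(requests_per_trainer):
            counts[index] += fake_expert_call(batch_size)

    threads = [threading.Thread(target=trainer, args=(i,)) for i in range(num_trainers)]
    startup_time = time.perf_counter() - startup_begin

    # Load phase: all trainers send requests concurrently.
    load_begin = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - load_begin

    total_samples = sum(counts)
    return {
        "total_samples": total_samples,
        "samples_per_second": total_samples / elapsed,
        "startup_time": startup_time,
    }


if __name__ == "__main__":
    print(run_benchmark())
```

Timing the startup and load phases separately keeps a slow startup from lowering the reported samples/s.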
The DHT benchmark (benchmark_dht.py), in turn, can be used to measure the performance impact of changes to hivemind.dht. It spawns a DHT with num_peers participants, then chooses one peer that declares num_experts experts in total, in batches of expert_batch_size. Another peer then retrieves all of them consecutively and checks that they are present.
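The procedure described above can be sketched as follows. This is an illustration only, not the code of benchmark_dht.py: a plain dictionary (`fake_dht`) stands in for the DHT, and the function name and id format are hypothetical.

```python
import math
import random
import time


def benchmark_store_and_get(num_experts: int = 1024, expert_batch_size: int = 64, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Sample ids and deduplicate them, so the final count may be below num_experts.
    expert_ids = sorted({f"expert.{rng.randrange(10**9)}" for _ in range(num_experts)})
    batches = [
        expert_ids[start:start + expert_batch_size]
        for start in range(0, len(expert_ids), expert_batch_size)
    ]

    fake_dht = {}  # hypothetical in-memory stand-in for the real DHT

    # One "peer" declares all experts, one batch at a time.
    store_begin = time.perf_counter()
    for batch in batches:
        for uid in batch:
            fake_dht[uid] = "declared"
    total_store_time = time.perf_counter() - store_begin

    # Another "peer" looks up every expert and counts how many are found.
    found = 0
    get_begin = time.perf_counter()
    for batch in batches:
        found += sum(uid in fake_dht for uid in batch)
    total_get_time = time.perf_counter() - get_begin

    return {
        "num_experts": len(expert_ids),
        "num_batches": len(batches),
        "total_store_time": total_store_time,
        "total_get_time": total_get_time,
        "get_success_rate": found / len(expert_ids),
    }


if __name__ == "__main__":
    print(benchmark_store_and_get())
```

In the real benchmark the store and get calls go over the network between peers, so the timings are dominated by DHT traffic rather than by dictionary access.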
Here's a run with 1024 participants on the same machine that was used for benchmark_throughput:
python benchmark_dht.py --num_peers 1024 --num_experts 16384 --expert_batch_size 64 --expiration 99999 --increase_file_limit
Console outputs
Increasing file limit - soft 1024=>32768, hard 1048576=>32768
Creating peers...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [01:45<00:00, 9.74it/s]
Sampled 16384 unique ids (after deduplication)
Storing peers to dht in batches of 64...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [12:07<00:00, 2.84s/it]
Store success rate: 100.0% (48920 / 48920)
Mean store time: 0.01487, Total: 727.46
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [01:48<00:00, 2.35it/s]
Get success rate: 100.0 (16384 / 16384)
Mean get time: 0.00664, Total: 108.73952
Node survival rate: 100.000%
The three main statistics in this benchmark are total store time, total get time and get success rate. Please note that this benchmark does not emulate node failure or network latency, and it does not benefit from caching. To account for these factors, you must introduce them manually by changing the code.
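As a sanity check, the numbers in the console output above can be related to each other with simple arithmetic. The sketch below assumes the reported times are in seconds and that each mean is the total divided by the operation count shown in the matching success-rate line. The variable names are chosen here for illustration.

```python
# Values copied from the command line and console output above.
num_experts = 16384
expert_batch_size = 64
total_store_time = 727.46    # "Mean store time: ..., Total: 727.46"
num_stores = 48920           # "Store success rate: 100.0% (48920 / 48920)"
total_get_time = 108.73952   # "Mean get time: ..., Total: 108.73952"
num_gets = 16384             # "Get success rate: 100.0 (16384 / 16384)"

num_batches = num_experts // expert_batch_size       # length of both progress bars
mean_store_time = total_store_time / num_stores      # reported as 0.01487
mean_get_time = total_get_time / num_gets            # reported as 0.00664
store_seconds_per_batch = total_store_time / num_batches  # progress bar shows 2.84s/it
get_batches_per_second = num_batches / total_get_time     # progress bar shows 2.35it/s

print(num_batches, round(mean_store_time, 5), round(mean_get_time, 5))
print(round(store_seconds_per_batch, 2), round(get_batches_per_second, 2))
```

This prints `256 0.01487 0.00664` followed by `2.84 2.35`, matching the figures in the log.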