@@ -4,212 +4,184 @@ This tutorial will teach you how to install `hivemind`, host your own experts an

## Installation

-Just `pip install hivemind` to get the latest release.
+Just `pip install hivemind` to get the latest release (requires Python 3.7 or newer).

You can also install the bleeding edge version from GitHub:

```
git clone https://github.com/learning-at-home/hivemind
cd hivemind
-pip install .
+pip install -e .
```
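+
+To check that the installation worked, you can, for example, import the package and print its version:
+
+```python
+import hivemind
+print(hivemind.__version__)  # prints the installed version if the package was set up correctly
+```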
+
+## Decentralized Training

-You can also install it in the editable mode with `pip install -e .`.
+Hivemind is a set of building blocks for decentralized training.
+In this tutorial, we'll use two of these blocks to train a simple neural network to classify CIFAR-10 images.
+We assume that you are already familiar with the official [CIFAR-10 example](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html)
+from the PyTorch website.

-* __Dependencies:__ Hivemind requires Python 3.7+.
-  The [requirements](https://github.com/learning-at-home/hivemind/blob/master/requirements.txt) are installed
-  automatically.
-* __OS support:__ Linux and macOS should just work. We do not officially support Windows, but you are welcome to
-  contribute your windows build :)

+We build on top of the official example to spin up distributed training of a small convolutional network by averaging its weights.
+For simplicity, this tutorial will use two non-GPU peers running on the same machine. If you get to the end of this
+tutorial, we'll give you an example of actual distributed training of Transformers ;)

-## Host a server
-
-`hivemind.moe.Server` hosts one or several experts (PyTorch modules) for remote access. These experts are responsible for
-most of the model parameters and computation. The server can be started using either Python or
-[a shell script](https://github.com/learning-at-home/hivemind/blob/master/hivemind/hivemind_cli/run_server.py). We'll use the shell
-for now. To host a server with default experts, run this in your shell:
+For now, let's run our first training peer:
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torchvision import datasets, transforms
+from tqdm.auto import tqdm

-```sh
-python hivemind/hivemind_cli/run_server.py --expert_cls ffn --hidden_dim 512 --num_experts 5 --expert_pattern "expert.[0:5]" \
-    --listen_on 0.0.0.0:1337 --dht_port 1338
-# note: if you omit listen_on and/or dht_port, they will be chosen automatically and printed to stdout.
-```
+
+import hivemind

-<details style="margin-top:-24px; margin-bottom: 16px;">
-  <summary><i>Console outputs</i></summary>
-
-```sh
-[2020/08/26 11:54:52.645][INFO][server.create:101] Bootstrapping DHT node, initial peers = []
-[2020/08/26 11:54:52.660][INFO][server.create:105] Running dht node on port 1338
-[2020/08/26 11:54:53.182][INFO][server.task_pool.run:130] expert.0_forward starting, pid=19382
-[2020/08/26 11:54:53.189][INFO][server.task_pool.run:130] expert.0_backward starting, pid=19384
-[2020/08/26 11:54:53.196][INFO][server.task_pool.run:130] expert.1_forward starting, pid=19386
-[2020/08/26 11:54:53.206][INFO][server.task_pool.run:130] expert.1_backward starting, pid=19388
-[2020/08/26 11:54:53.212][INFO][server.task_pool.run:130] expert.2_forward starting, pid=19390
-[2020/08/26 11:54:53.218][INFO][server.task_pool.run:130] expert.2_backward starting, pid=19392
-[2020/08/26 11:54:53.225][INFO][server.task_pool.run:130] expert.3_forward starting, pid=19394
-[2020/08/26 11:54:53.232][INFO][server.task_pool.run:130] expert.3_backward starting, pid=19396
-[2020/08/26 11:54:53.235][INFO][server.task_pool.run:130] expert.4_forward starting, pid=19398
-[2020/08/26 11:54:53.241][INFO][server.task_pool.run:130] expert.4_backward starting, pid=19400
-[2020/08/26 11:54:53.244][INFO][server.runtime.run:60] Started
-[2020/08/26 11:54:53.245][INFO][server.create:136] Server started at 0.0.0.0:1337
-[2020/08/26 11:54:53.245][INFO][server.create:137] Got 5 active experts of type ffn: ['expert.0', 'expert.1', 'expert.2', 'expert.3', 'expert.4']
-```

+# Create dataset and model, same as in the basic tutorial
+# For this basic tutorial, we download only the training set
+transform = transforms.Compose(
+    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

-</details>

+trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
+
+model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
+                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
+                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
+opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

-This server accepts requests to experts on port 1337 and start a DHT peer on port 1338. In total, it serves 5
-feedforward experts with ReLU and LayerNorm
-(see
-architecture [here](https://github.com/learning-at-home/hivemind/blob/master/hivemind/server/layers/__init__.py#L7-L21))
-.
-
-You can create additional servers in the same decentralized network using `--initial_peers` argument:
+
+# Create DHT: a decentralized key-value storage shared between peers
+dht = hivemind.DHT(start=True)
+print("To join the training, use initial_peers =", [str(addr) for addr in dht.get_visible_maddrs()])

-```sh
-python hivemind/hivemind_cli/run_server.py --expert_cls ffn --hidden_dim 512 --num_experts 10 --expert_pattern "expert.[5:250]" \
-    --initial_peers localhost:1338
+
+# Set up a decentralized optimizer that will average with peers in background
+opt = hivemind.optim.DecentralizedOptimizer(
+    opt,  # wrap the SGD optimizer defined above
+    dht,  # use a DHT that is connected with other peers
+    average_parameters=True,  # periodically average model weights in opt.step
+    average_gradients=False,  # do not average accumulated gradients
+    prefix='my_cifar_run',  # unique identifier of this collaborative run
+    target_group_size=16,  # maximum concurrent peers for this run
+    verbose=True  # print logs incessantly
+)
+# Note: if you intend to use GPU, switch to it only after the decentralized optimizer is created
+
+with tqdm() as progressbar:
+    while True:
+        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=256):
+            opt.zero_grad()
+            loss = F.cross_entropy(model(x_batch), y_batch)
+            loss.backward()
+            opt.step()
+
+            progressbar.desc = f"loss = {loss.item():.3f}"
+            progressbar.update()
```

-<details style="margin-top:-24px; margin-bottom: 16px;">
-  <summary>Console outputs</summary>
-
-```sh
-[2020/08/26 13:15:05.078][INFO][server.create:103] Bootstrapping DHT node, initial peers = ['localhost:1338']
-[2020/08/26 13:15:05.101][INFO][server.create:107] Running dht node on port 44291
-expert.[5:250]
-[2020/08/26 13:15:06.326][INFO][server.task_pool.run:130] expert.113_forward starting, pid=29517
-[2020/08/26 13:15:06.333][INFO][server.task_pool.run:130] expert.113_backward starting, pid=29519
-[2020/08/26 13:15:06.340][INFO][server.task_pool.run:130] expert.149_forward starting, pid=29521
-[2020/08/26 13:15:06.352][INFO][server.task_pool.run:130] expert.149_backward starting, pid=29523
-[2020/08/26 13:15:06.363][INFO][server.task_pool.run:130] expert.185_forward starting, pid=29525
-[2020/08/26 13:15:06.375][INFO][server.task_pool.run:130] expert.185_backward starting, pid=29527
-[2020/08/26 13:15:06.381][INFO][server.task_pool.run:130] expert.189_forward starting, pid=29529
-[2020/08/26 13:15:06.388][INFO][server.task_pool.run:130] expert.189_backward starting, pid=29531
-[2020/08/26 13:15:06.400][INFO][server.task_pool.run:130] expert.191_forward starting, pid=29533
-[2020/08/26 13:15:06.407][INFO][server.task_pool.run:130] expert.191_backward starting, pid=29535
-[2020/08/26 13:15:06.415][INFO][server.task_pool.run:130] expert.196_forward starting, pid=29537
-[2020/08/26 13:15:06.426][INFO][server.task_pool.run:130] expert.196_backward starting, pid=29539
-[2020/08/26 13:15:06.435][INFO][server.task_pool.run:130] expert.225_forward starting, pid=29541
-[2020/08/26 13:15:06.445][INFO][server.task_pool.run:130] expert.225_backward starting, pid=29543
-[2020/08/26 13:15:06.454][INFO][server.task_pool.run:130] expert.227_forward starting, pid=29545
-[2020/08/26 13:15:06.467][INFO][server.task_pool.run:130] expert.227_backward starting, pid=29547
-[2020/08/26 13:15:06.475][INFO][server.task_pool.run:130] expert.36_forward starting, pid=29549
-[2020/08/26 13:15:06.482][INFO][server.task_pool.run:130] expert.36_backward starting, pid=29551
-[2020/08/26 13:15:06.497][INFO][server.task_pool.run:130] expert.58_forward starting, pid=29553
-[2020/08/26 13:15:06.507][INFO][server.task_pool.run:130] expert.58_backward starting, pid=29555
-[2020/08/26 13:15:06.509][INFO][server.runtime.run:60] Started
-[2020/08/26 13:15:06.510][INFO][server.create:166] Server started at 0.0.0.0:40089
-[2020/08/26 13:15:06.510][INFO][server.create:167] Got 10 active experts of type ffn: ['expert.113', 'expert.149', 'expert.185', 'expert.189', 'expert.191', 'expert.196', 'expert.225', 'expert.227', 'expert.36', 'expert.58']
+
+As you can see, this code is regular PyTorch with one notable exception: it wraps your regular optimizer with a
+`DecentralizedOptimizer`. This optimizer uses the `DHT` to find other peers and tries to exchange weights with them. When you run
+the code (please do so), you will see the following output:
+
+```shell
+To join the training, use initial_peers = ['/ip4/127.0.0.1/tcp/XXX/p2p/YYY']
+[...] Starting a new averaging round with current parameters.
```

-</details>

+This is `DecentralizedOptimizer` telling you that it's looking for peers. Since there are no other peers yet, we'll need to create
+them ourselves.

-Here and below, if you are running on a different machine, replace `localhost:1338` with your original server's public
-IP address (e.g. `12.34.56.78:1338`). Hivemind supports both ipv4 and ipv6 protocols and uses the same notation
-as [gRPC](https://grpc.io/docs/languages/python/basics/#starting-the-server).
+Copy the entire script (or notebook) and modify this line:

-## Train the experts
+```python
+# old version:
+dht = hivemind.DHT(start=True)

-Now let's put these experts to work. Create a python console (or a jupyter) and run:
+
+# new version: added initial_peers
+dht = hivemind.DHT(initial_peers=['/ip4/127.0.0.1/tcp/COPY_FULL_ADDRESS_FROM_PEER1_OUTPUTS'], start=True)
+```
+
+<details style="margin-top:-16px; margin-bottom: 16px;">
+  <summary>Here's the full code of the second peer</summary>

```python
import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torchvision import datasets, transforms
+from tqdm.auto import tqdm
+
import hivemind

-dht = hivemind.DHT(initial_peers=["localhost:1338"], listen=False, start=True)
-# note: listen=False means that your peer will operate in "client only" mode:
-# this means that it can request other peers, but will not accept requests in return
+# Create dataset and model, same as in the basic tutorial
+# For this basic tutorial, we download only the training set
+transform = transforms.Compose(
+    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
+
+trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
+
+model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
+                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
+                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
+opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
+
+# Create DHT: a decentralized key-value storage shared between peers
+dht = hivemind.DHT(initial_peers=[COPY_FROM_ANOTHER_PEER_OUTPUTS], start=True)
+print("To join the training, use initial_peers =", [str(addr) for addr in dht.get_visible_maddrs()])
+
+# Set up a decentralized optimizer that will average with peers in background
+opt = hivemind.optim.DecentralizedOptimizer(
+    opt,  # wrap the SGD optimizer defined above
+    dht,  # use a DHT that is connected with other peers
+    average_parameters=True,  # periodically average model weights in opt.step
+    average_gradients=False,  # do not average accumulated gradients
+    prefix='my_cifar_run',  # unique identifier of this collaborative run
+    target_group_size=16,  # maximum concurrent peers for this run
+    verbose=True  # print logs incessantly
+)

-expert1, expert4 = hivemind.get_experts(dht, ["expert.1", "expert.4"])
-assert expert1 is not None and expert4 is not None, "server hasn't declared experts (yet?)"
-```
+opt.averager.load_state_from_peers()

-The experts (e.g. `expert1`) can be used as a pytorch module with autograd support:
+# Note: if you intend to use GPU, switch to it only after the decentralized optimizer is created
+with tqdm() as progressbar:
+    while True:
+        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=256):
+            opt.zero_grad()
+            loss = F.cross_entropy(model(x_batch), y_batch)
+            loss.backward()
+            opt.step()

-```python
-dummy = torch.randn(3, 512)
-out = expert1(dummy)  # forward pass
-out.sum().backward()  # backward pass
+
+            progressbar.desc = f"loss = {loss.item():.3f}"
+            progressbar.update()
```
+</details>

-When called, expert1 will submit a request to the corresponding server (which you created above) and return the output
-tensor(s) or raise an exception. During backward, pytorch will submit the backward requests for the experts as they
-appear in the computation graph.
-
-By default, the experts will automatically update their parameters with one step of SGD after each backward pass. This
-allows you to quickly run training using both local and remote layers:

-```python
-# generate dummy data
-x = torch.randn(3, 512)
-y = 0.01 * x.sum(dim=-1, keepdim=True)
+Instead of setting up a new DHT, the second peer will link up with the existing DHT node from the first peer.
+If you run the second peer, you will see that both the first and the second peer periodically report averaging parameters:

-# local torch module
-proj_out = torch.nn.Sequential(
-    torch.nn.Linear(512, 3)
-)
-opt = torch.optim.SGD(proj_out.parameters(), lr=0.01)
-
-for i in range(100):
-    prediction = proj_out(expert1(expert4(x)))
-    loss = torch.mean(abs(prediction - y))
-    print(loss.item())
-    opt.zero_grad()
-    loss.backward()
-    opt.step()
+
+```shell
+[...] Starting a new averaging round with current parameters.
+[...] Finished averaging round in with 2 peers.
```

-Finally, you can create a Mixture-of-Experts layer over these experts:
+This message means that the optimizer has averaged model parameters with another peer in the background and applied them
+during one of the calls to `opt.step()`. You can start more peers by replicating the same code as the second peer,
+using either the first or the second peer as `initial_peers`.
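+
+For example, a third peer could pass the multiaddrs of both existing peers when it creates its own DHT node
+(this replaces the single `dht = ...` line in the script; the addresses below are placeholders, copy the real
+ones from either peer's output). Listing more than one address is optional, but it lets the new peer join even
+if one of the listed peers has already left:
+
+```python
+dht = hivemind.DHT(
+    initial_peers=[
+        '/ip4/127.0.0.1/tcp/XXX/p2p/PEER1_ID',  # copied from the first peer's output
+        '/ip4/127.0.0.1/tcp/ZZZ/p2p/PEER2_ID',  # and/or from the second peer's output
+    ],
+    start=True,
+)
+```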

+The only issue with this code is that each new peer starts with a different untrained network and blends its untrained
+parameters with those of the other peers, resetting their progress. You can see this effect as a spike in training loss
+immediately after a new peer joins the training. To avoid this problem, the second peer can download the
+current model/optimizer state from an existing peer right before it begins training on minibatches:

```python
-import nest_asyncio
-
-nest_asyncio.apply()  # asyncio patch for jupyter. for now, we recommend using MoE from console
-dmoe = hivemind.RemoteMixtureOfExperts(in_features=512, uid_prefix="expert", grid_size=(5,),
-                                       dht=dht, k_best=2)
-
-out = dmoe(torch.randn(3, 512))
-out.sum().backward()
+opt.averager.load_state_from_peers()
```

-The `dmoe` layer dynamically selects the right experts using a linear gating function. It will then dispatch parallel
-forward (and backward) requests to those experts and collect results. You can find more details on how DMoE works in
-Section 2.3 of the [paper](https://arxiv.org/abs/2002.04013)
+Congrats, you've just started a pocket-sized experiment with decentralized deep learning!

-Congratulations, you've made it through the basic tutorial. Give yourself a pat on the back :)
+However, this is just the bare minimum of what hivemind can do. In [this example](https://github.com/learning-at-home/hivemind/tree/master/examples/albert),
+we show how to use a more advanced version of DecentralizedOptimizer to collaboratively train a large Transformer over the internet.

-More advanced tutorials are coming soon :)
+If you want to learn more about each individual component:
+- Learn how to use `hivemind.DHT` with this basic [DHT tutorial](https://learning-at-home.readthedocs.io/en/latest/user/dht.html) (a small taste of it is sketched right below).
+- Learn the underlying math behind DecentralizedOptimizer in
+  [(Li et al. 2020)](https://arxiv.org/abs/2005.00124) and [(Ryabinin et al. 2021)](https://arxiv.org/abs/2103.03239).
+- Read about setting up Mixture-of-Experts training in [this guide](https://learning-at-home.readthedocs.io/en/latest/user/moe.html).
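+
+As a small taste of the DHT interface covered in that tutorial, here is a minimal sketch of storing and reading
+a key-value pair (the key name is arbitrary; it assumes the top-level `hivemind.get_dht_time` helper used in the
+DHT tutorial, and the exact return types may differ between hivemind versions):
+
+```python
+import hivemind
+
+dht = hivemind.DHT(start=True)
+# store a value that other peers in this DHT can read for the next 600 seconds
+dht.store('my_key', 'hello, world', expiration_time=hivemind.get_dht_time() + 600)
+print(dht.get('my_key'))  # prints the stored value together with its expiration time
+```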
+