@@ -35,7 +35,7 @@ This snippet shows how to **(a)** generate text with BLOOM and **(b)** solve a s
```python
# Initialize distributed BLOOM and connect to the swarm
model = DistributedBloomForCausalLM.from_pretrained(
-    "bigscience/distributed-bloom", tuning_mode="ptune", initial_peers=SEE_BELOW
+    "bigscience/bloom-petals", tuning_mode="ptune", initial_peers=SEE_BELOW
) # Embeddings & prompts are on your device, BLOOM blocks are distributed

print("Generated:", model.generate(tokenized_prefix, max_new_tokens=5))
@@ -78,90 +78,86 @@ This is important because it's technically possible for peers serving model laye

## Installation

-🚧 **Note:** These are short instructions for running a private swarm with a test 6B version of BLOOM. We will replace them with instructions involving the full 176B BLOOM and more detailed explanations soon (in a day or two).
-
--------------------------------------------------------------------------------
-
-```bash
-conda install -y -c conda-forge cudatoolkit-dev==11.3.1 cudatoolkit==11.3.1 cudnn==8.2.1.32
-pip install torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
+Here's how to install the dependencies with conda:
+```bash
+conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
+pip install bitsandbytes==0.33.2  # for 8-bit quantization
pip install -r requirements.txt
-pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
```

+This script uses Anaconda to install CUDA-enabled PyTorch.
+If you don't have Anaconda, you can get it from [here](https://www.anaconda.com/products/distribution).
+If you don't want Anaconda, you can install PyTorch [any other way](https://pytorch.org/get-started/locally/).
+If you want to run models with 8-bit weights, please install **PyTorch with CUDA 11** or newer for compatibility with [bitsandbytes](https://github.com/timDettmers/bitsandbytes).
+
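+For reference, here is a sketch of a pip-only setup (it assumes you already have CUDA 11.3 drivers installed; adjust the wheel tag for other CUDA versions):
+```bash
+# install CUDA-enabled PyTorch straight from the PyTorch wheel index, then the project requirements
+pip install torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
+pip install -r requirements.txt
+```
+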
+__OS support:__ currently, PETALS only supports Linux operating systems. On Windows 11, you can run PETALS with GPU enabled inside WSL2 ([read more](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl)).
+For macOS, you can *probably* run everything normally if you manage to install the dependencies, but we do not guarantee this.
+
+
### Basic functionality

-All tests is run on localhost
+This is a toy example that runs a small BLOOM model on a local machine without a GPU.
+For more detailed instructions with larger models, see ["Launch your own swarm"](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm).

-First, run one or more servers like this:
+First, run a couple of servers, each in a separate shell. Start the first server like this:
```bash
-# minimalistic server with non-trained bloom blocks
-python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
-  --block_indices 3:5 --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337
-# when running multiple servers:
-# - give each server a unique --identity_path (or remote --identity_path arg when debugging)
-# - if running multiple servers on the same machine, give each a unique port (last integer in --host_maddrs, 0 means random port)
-# - when running over the internet, change --host_maddrs according to https://learning-at-home.readthedocs.io/en/latest/user/dht.html#running-across-the-internet
-# - each server except first should have --initial_peers pointing to one of pre-existing servers
+python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 8 --torch_dtype float32 \
+  --host_maddrs /ip4/127.0.0.1/tcp/31337  # use port 31337, local connections only
```

-Then open a python notebook or console and run:
-```python
-import torch
-import hivemind
-from src import DistributedBloomConfig, get_remote_module
-
-
-dht = hivemind.DHT(
-    initial_peers=[TODO_COPY_FULL_ADDRESS_FROM_ANY_OF_THE_SERVERS], # e.g. /ip4/127.0.0.1/...
-    client_mode=True, start=True,
-)
-config = DistributedBloomConfig.from_pretrained("bigscience/test-bloom-6b3")
-layer3, layer4 = get_remote_module(dht, ['bigscience/test-bloomd-6b3.3', 'bigscience/test-bloomd-6b3.4'], config)
-assert layer3 is not None and layer4 is not None, "one or both layers were not found in DHT"
-# test forward/backward, two blocks
-outputs = layer4(layer3(torch.randn(1, 64, 4096)))
-loss = (outputs * torch.randn_like(outputs)).norm()
-loss.backward()
-
-# test inference, one block
-with layer3.inference_session(max_length=10) as sess:
-    for i in range(10):
-        res = sess.step(torch.ones(1, 1, 4096))
+Once you run the server, it will print out a ton of information, including a line like this:
+```bash
+Mon Day 01:23:45.678 [INFO] Running DHT node on ['/ip4/127.0.0.1/tcp/31337/p2p/ALongStringOfCharacters'], initial peers = []
```

-
-### Convert regular BLOOM into distributed
+You can use this address (`/ip4/whatever/else`) to connect additional servers. Open another terminal and run:
```bash
-
-# convert model from HF hub to a distributed format (can take hours depending on your connection!)
-MY_WRITE_TOKEN=TODO_WRITE_TOKEN_FROM_https://huggingface.co/settings/token
-python -m cli.convert_model --model bigscience/bloom-6b3 \
-  --output_path ./converted_model --output_repo bigscience/test-bloomd-6b3 \
-  --use_auth_token $MY_WRITE_TOKEN # ^-- todo replace output repo with something you have access to
+python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 8 --torch_dtype float32 \
+  --host_maddrs /ip4/127.0.0.1/tcp/0 --initial_peers /ip4/127.0...<TODO! copy the address of another server>
+# e.g. --initial_peers /ip4/127.0.0.1/tcp/31337/p2p/QmS1GecIfYouAreReadingThisYouNeedToCopyYourServerAddressCBBq
```

+You can set `--initial_peers` to one or more addresses of other servers, not necessarily the first one.
+The only requirement is that at least one of them is alive, i.e. still running when you connect.

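+For example, here is a sketch of launching a server bootstrapped from two peers at once (both addresses are placeholders; copy real ones from your servers' logs):
+```bash
+# hypothetical example: --initial_peers takes one or more space-separated addresses
+python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 8 --torch_dtype float32 \
+  --host_maddrs /ip4/127.0.0.1/tcp/0 \
+  --initial_peers /ip4/127.0.0.1/tcp/31337/p2p/PEER_ID_ONE /ip4/127.0.0.1/tcp/31338/p2p/PEER_ID_TWO
+```
+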
-### Test local vs remote block (allclose)
+Before you proceed, __please run 3 servers__ for a total of 24 blocks (3x8). If you are running a different model,
+make sure your servers have enough total `--num_blocks` to cover that model.

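+If you are not sure how many blocks your model has, you can read it from the model config (a sketch; `n_layer` is the attribute name used by BLOOM-style configs):
+```python
+import transformers
+
+config = transformers.AutoConfig.from_pretrained("bloom-testing/test-bloomd-560m-main")
+print(config.n_layer)  # your servers' --num_blocks must add up to at least this
+```
+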
-To test distributed inference, run one or more servers, then open a new shell and run pytest with environment variables:
-```bash
-# shell A: serve model
-python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
-  --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337
-
-# shell B:
-export PYTHONPATH=.
-export INITIAL_PEERS="/ip4/TODO_COPY_INITIAL_PEERS_FROM_SERVER_OUTPUT"
-export MODEL_NAME="bigscience/test-bloomd-6b3"
+Once you have enough servers, you can use them to run inference and/or train the model:
+```python
+import torch
+import torch.nn.functional as F
+import transformers
+from src import DistributedBloomForCausalLM
+
+initial_peers = [TODO_put_one_or_more_server_addresses_here] # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
+tokenizer = transformers.BloomTokenizerFast.from_pretrained("bloom-testing/test-bloomd-560m-main")

-# test individual random blocks for exact match
-pytest tests/test_block_exact_match.py
+model = DistributedBloomForCausalLM.from_pretrained(
+    "bloom-testing/test-bloomd-560m-main", initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
+) # this model has only embeddings / logits, all transformer blocks rely on remote servers
+inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
+remote_outputs = model.generate(inputs, max_length=10)
+print(tokenizer.decode(remote_outputs[0])) # "a cat sat in the back of the car,"

-# test the full model
-pytest tests/test_full_model.py
+# "train" input embeddings by backprop through distributed transformer blocks
+model.transformer.word_embeddings.weight.requires_grad = True
+outputs = model.forward(input_ids=inputs)
+loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
+loss.backward()
+print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())
```

+Of course, this is a simplified code snippet. For actual training, see our example on "deep" prompt-tuning here: [examples/prompt-tuning-personachat.ipynb](./examples/prompt-tuning-personachat.ipynb).
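+
+In the meantime, here is a minimal sketch of a training loop built on the snippet above (it reuses `model`, `inputs`, and `F` from that snippet; the optimizer choice and hyperparameters are illustrative placeholders):
+```python
+# hypothetical micro-loop: optimize the input embeddings against the same tiny batch
+opt = torch.optim.Adam([model.transformer.word_embeddings.weight], lr=1e-4)
+for step in range(10):
+    loss = F.cross_entropy(model.forward(input_ids=inputs).logits.flatten(0, 1), inputs.flatten())
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    print(f"step {step}: loss = {loss.item():.3f}")
+```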
+
+Here's a [more advanced tutorial](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) that covers 8-bit quantization and best practices for running PETALS.
+
+
--------------------------------------------------------------------------------
<p align="center">