
ugly version 1

justheuristic, 2 years ago
parent commit e30be704f3
1 changed file with 102 additions and 105 deletions

docs/index.md (+102, -105)

@@ -2,115 +2,69 @@
 
 This tutorial will walk you through the steps of setting up your private swarm to run inference on and fine-tune BLOOM.
 
-### I. Get the model ready
 
-<p align="center">
-    <img src="https://i.imgur.com/7eR7Pan.png" width="400"><br>
-    Decentralized platform for running 100B+ language models<br><br>
-    <a href="https://github.com/bigscience-workshop/petals/actions">
-        <img src="https://github.com/bigscience-workshop/petals/actions/workflows/run-tests.yaml/badge.svg?branch=main">
-    </a>
-    <a href="https://github.com/psf/black">
-        <img src="https://img.shields.io/badge/code%20style-black-000000.svg">
-    </a>
-</p>
-
-## Key features
-
-- Run inference or fine-tune large language models like [BLOOM-176B](https://huggingface.co/bigscience/bloom) by joining compute resources with people all over the Internet. No need to have high-end GPUs.
-- It's difficult to fit the whole BLOOM-176B into GPU memory [unless](https://twitter.com/Tim_Dettmers/status/1559892918395031552) you have multiple high-end GPUs. Instead, **Petals** allows to load and serve a small part of the model, then team up with people serving all the other parts to run inference or fine-tuning.
-- This way, one inference step takes ≈ 1 sec — much faster than possible with offloading. Enough for chatbots and other interactive apps.
-- Beyond traditional language model APIs — you can employ any fine-tuning and sampling methods by executing custom paths through the model or accessing its hidden states. This allows for the comforts of an API with the flexibility of PyTorch.
-
-<p align="center">
-    <b><a href="https://arxiv.org/pdf/2209.01188.pdf">[Read paper]</a></b> | <b><a href="https://petals.ml/">[View website]</a></b>
-</p>
-
-## How it works?
+### Spin up a server
 
-<p align="center">
-    <img src="https://i.imgur.com/RTYF3yW.png" width="800">
-</p>
+Before you can use a model, you (or someone else) needs to host its transformer blocks. In PETALS, this is done by running
+servers: each server hosts one or more transformer blocks, and the servers connect to each other to form the full model.
 
-### 🛠️ Examples
+__Run the first server:__ every swarm begins with one server. You can start a basic server with this script:
 
-Petals integrates seamlessly with PyTorch and the Hugging Face [Transformers](https://github.com/huggingface/transformers) library.
+```bash
 
-This snippet shows how to **(a)** generate text with BLOOM and **(b)** solve a sequence classification task via soft prompt tuning:
+export CUDA_VISIBLE_DEVICES=  # choose a GPU index (e.g. "0") or leave blank to run on CPU 
+export IPV4=$(dig -4 TXT +short o-o.myaddr.l.google.com @ns1.google.com |  tr -d '"')
+echo "My IP:[ " $IPV4 " ] - must be non-empty"
+# if IP is empty, you can set it manually. To test PETALS on your local machine, export IPV4=127.0.0.1
 
-```python
-# Initialize distributed BLOOM and connect to the swarm
-model = DistributedBloomForCausalLM.from_pretrained(
-    "bigscience/distributed-bloom", tuning_mode="ptune", initial_peers=SEE_BELOW
-)  # Embeddings & prompts are on your device, BLOOM blocks are distributed
-
-print("Generated:", model.generate(tokenized_prefix, max_new_tokens=5))
+export PORT=12345 # pick a free, open port; if you're not sure what this means, see the "Details" section below
 
-# Training (updates only local prompts / adapters)
-optimizer = torch.optim.AdamW(model.parameters())
-for input_ids, labels in data_loader:
-    outputs = model.forward(input_ids)
-    loss = cross_entropy(outputs.logits, labels)
-    optimizer.zero_grad()
-    loss.backward()
-    optimizer.step()
+python -m cli.run_server \
+ --identity_path ./serverA.id  --host_maddrs /ip4/$IPV4/tcp/$PORT /ip4/$IPV4/udp/$PORT/quic \
+ --converted_model_name_or_path bigscience/test-bloomd-6b3 `# model name on the Hugging Face Hub; must be converted to the PETALS format first (see below)` \
+ --num_blocks 8 `# serve this many transformer layers; layer indices are determined automatically` \
+ --throughput 1 `# the server's relative performance, used for load-balancing; leave it blank to auto-detect with a speed test`
 ```
 
-### 🚧 This project is in active development
-
-Be careful: some features may not work, interfaces may change, and we have no detailed docs yet (see [roadmap](https://github.com/bigscience-workshop/petals/issues/12)).
-
-A stable version of the code and a public swarm open to everyone will be released in November 2022. You can [subscribe](https://petals.ml/) to be emailed when it happens or fill in [this form](https://forms.gle/TV3wtRPeHewjZ1vH9) to help the public launch by donating GPU time. In the meantime, you can launch and use your own private swarm.
+* __TODO__ example outputs as in hivemind/moe
+* __TODO__ describe the outputs and explain `--initial_peers`
 
-### 🔒 Privacy and security
+This initial server has 8 out of 30 total blocks. To run the full model, we will need to add more servers.
 
-If you work with sensitive data, you should only use a private swarm (or a subset of servers in the public swarm) hosted by people and institutions you trust, who are authorized to process this data.
+__Additional servers__ can join the swarm using the `--initial_peers` option. To add one, open a new console
+and run a script similar to the one above, but with two changes:
 
-This is important because it's technically possible for peers serving model layers to recover input data or model outputs. Also, if there are malicious peers, they may alter their outputs to influence the model outputs. See a more detailed discussion in Section 4 of our [paper](https://arxiv.org/pdf/2209.01188.pdf).
+1. `--initial_peers /ip4/.../tcp/...` - copy this address string from the output of a running server.
+  If there are multiple active servers, you can use any of them (or several at once): the swarm is fully decentralized.
+2. Replace `--identity_path` and `--host_maddrs` with a unique name and address for each new server. When testing
+  on a local machine, you can also omit these options altogether.
 
-## FAQ
+For example, this is how your second server might look:
 
-1. **What's the motivation for people to host model layers in the public swarm?**
-
-    People who run inference and fine-tuning themselves get a certain speedup if they host a part of the model locally. Some may be also motivated to "give back" to the community helping them to run the model (similarly to how [BitTorrent](https://en.wikipedia.org/wiki/BitTorrent) users help others by sharing data they have already downloaded).
-
-    Since it may be not enough for everyone, we are also working on introducing explicit __incentives__ ("bloom points") for people donating their GPU time to the public swarm. Once this system is ready, people who earned these points will be able to spend them on inference/fine-tuning with higher priority or increased security guarantees, or (maybe) exchange them for other rewards.
-
-2. **Why is the platform named "Petals"?**
-
-    "Petals" is a metaphor for people serving different parts of the model. Together, they host the entire language model &mdash; [BLOOM](https://huggingface.co/bigscience/bloom).
-
-    While our platform focuses on BLOOM now, we aim to support more [foundation models](https://arxiv.org/abs/2108.07258) in future.
+```bash
+python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
+  --num_blocks 8 --initial_peers /ip4/127.0.0.1/tcp/12345/p2p/QmcTODOReplaceThisWithTheActualAddressOfAnotherServer
+```
 
-## Installation
+__TODO capture outputs__
 
-🚧 **Note:** These are short instructions for running a private swarm with a test 6B version of BLOOM. We will replace them with instructions involving the full 176B BLOOM and more detailed explanations soon (in a day or two).
+Note that the second server chose a different subset of layers (8-15), since layers 0-7 are already being served.
+To cover the entire 30-layer model, __run two more servers like this__, or run one more server with `--num_blocks 14`.
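+
+For instance, one more server with `--num_blocks 14` would cover the remaining blocks (16-29), reusing the placeholder peer address from above:
+
+```bash
+python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
+  --num_blocks 14 --initial_peers /ip4/127.0.0.1/tcp/12345/p2p/QmcTODOReplaceThisWithTheActualAddressOfAnotherServer
+```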
 
---------------------------------------------------------------------------------
+Running the full model requires 12-24 GB of RAM in total across the servers, depending on the numeric precision.
+For large models, PETALS can reduce memory usage with 8-bit quantization: see the "Running at scale" section below for how to enable it.
 
-```bash
-conda install -y -c conda-forge cudatoolkit-dev==11.3.1 cudatoolkit==11.3.1 cudnn==8.2.1.32
-pip install torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
-pip install -r requirements.txt
-pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
-```
 
-### Basic functionality
+__Details:__
+* TODO about ports and how to open them
+* __The host_maddrs__ option contains so-called multiaddresses: learn more about them in [this guide](https://docs.libp2p.io/concepts/addressing/), and see the example after this list.
+* TODO about identity
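+
+For example, here are valid `--host_maddrs` values (illustrative, reusing addresses from the scripts above):
+
+```bash
+# listen on localhost only (handy when testing everything on one machine):
+#   --host_maddrs /ip4/127.0.0.1/tcp/31337
+# listen on a public IPv4 address over both TCP and QUIC (UDP):
+#   --host_maddrs /ip4/$IPV4/tcp/$PORT /ip4/$IPV4/udp/$PORT/quic
+```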
 
-All tests is run on localhost
 
-First, run one or more servers like this:
-```bash
-# minimalistic server with non-trained bloom blocks
-python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
-  --block_indices 3:5 --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337
-# when running multiple servers:
-# - give each server a unique --identity_path (or remote --identity_path arg when debugging)
-# - if running multiple servers on the same machine, give each a unique port (last integer in --host_maddrs, 0 means random port)
-# - when running over the internet, change --host_maddrs according to https://learning-at-home.readthedocs.io/en/latest/user/dht.html#running-across-the-internet
-# - each server except first should have --initial_peers pointing to one of pre-existing servers
-```
+### Use the model
 
+* TODO disclaimer - 6b3 is for testing and is not very efficient; petals is optimized for 100B+
+ 
 Then open a python notebook or console and run:
 ```python
 import torch
@@ -137,7 +91,51 @@ with layer3.inference_session(max_length=10) as sess:
 ```
 
 
-### Convert regular BLOOM into distributed
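+This snippet shows how to **(a)** generate text with BLOOM and **(b)** fine-tune the model for your own task via soft prompt tuning:
+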
+```python
+import torch
+from torch.nn.functional import cross_entropy
+# DistributedBloomForCausalLM is provided by this repo's client code (see src/client)
+
+# Initialize distributed BLOOM and connect to the swarm
+model = DistributedBloomForCausalLM.from_pretrained(
+    "bigscience/distributed-bloom", tuning_mode="ptune", initial_peers=SEE_BELOW
+)  # Embeddings & prompts are on your device, BLOOM blocks are distributed
+
+print("Generated:", model.generate(tokenized_prefix, max_new_tokens=5))
+
+# Training (updates only local prompts / adapters)
+optimizer = torch.optim.AdamW(model.parameters())
+for input_ids, labels in data_loader:
+    outputs = model.forward(input_ids)
+    loss = cross_entropy(outputs.logits, labels)
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+```
+
+__TODO link to Artek's training notebook__
+
+### Running at scale
+
+This section contains a step-by-step guide to spinning up the full-scale BLOOM (176B parameters) across multiple servers.
+
+- TODO about hivemind-dht as persistent initial peers  
+- TODO about cheap servers (e.g. hetzner)
+```
+ --torch_dtype bfloat16  # we recomment model's original dtype for GPU and float32 for CPU
+ --load_in_8bit   # requires Turing or newer gpu, see LLM.8bit paper https://arxiv.org/abs/2208.07339
+ # ^-- remove load_in_8bit when running on CPU or an older GPU (e.g. 1080Ti or V100)
+ 
+```
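+
+Putting these together, a server joining a larger run might combine the flags above with the `run_server` command from earlier. This is only a sketch: the model name below is the small test checkpoint used throughout this tutorial (substitute your converted 176B checkpoint), and the peer address is the same placeholder as before:
+
+```bash
+python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
+  --num_blocks 8 --initial_peers /ip4/127.0.0.1/tcp/12345/p2p/QmcTODOReplaceThisWithTheActualAddressOfAnotherServer \
+  --torch_dtype bfloat16 --load_in_8bit
+```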
+__TODO activation quantization__
+
+__TODO blocks per GPU memory__
+
+
+
+### Deploy your own model with PETALS 
+
+To run PETALS servers with your own model, you need to convert the model weights into a PETALS-compatible format.
+This conversion splits each individual block into a separate branch of the model repository. This way, each peer downloads
+only the layers it needs, instead of the entire 350GB model.
+
+You can convert BLOOM models using the following script:
 ```bash
 
 # convert model from HF hub to a distributed format (can take hours depending on your connection!)
@@ -147,28 +145,27 @@ python -m cli.convert_model --model bigscience/bloom-6b3  \
   --use_auth_token $MY_WRITE_TOKEN  # ^-- todo replace output repo with something you have access to
 ```
 
+If you want to run a non-BLOOM model (e.g. [OPT](https://arxiv.org/abs/2205.01068) or [YALM](https://github.com/yandex/YaLM-100B)),
+you will need to edit the code a bit.
+Currently, PETALS uses a vanilla implementation of BLOOM in `src/bloom`, so it is possible to replace it with other models from Hugging Face transformers. 
 
-### Test local vs remote block (allclose)
+Assuming your model is already compatible with Hugging Face Transformers, you will need 3 extra steps (a minimal sketch of the block interface from step 2 appears after this list):
 
-To test distributed inference, run one or more servers, then open a new shell and run pytest with environment variables:
-```bash
-# shell A: serve model
-python -m cli.run_server --converted_model_name_or_path bigscience/test-bloomd-6b3 \
-  --torch_dtype float32 --identity_path ./server1.id --host_maddrs /ip4/127.0.0.1/tcp/31337
+1. Edit `cli/convert_model.py` to partition your model checkpoint into individual blocks and non-transformer layers.
+   Once you are done, run this script to convert your model and upload it to Hugging Face. If your model is private,
+   you can use your internal storage instead (see next step).
+2. In `src/bloom/from_pretrained.py`, edit `load_pretrained_block` to load a single block of your custom model.
+  Your block should be able to run `.forward(hidden_states=..., use_cache=true_or_false, layer_past=optional_tensors)`.
+  After this step, you should be able to launch a server with the new model name.
+3. Open `src/client/remote_model.py` and change `DistributedBloomModel` to load the model of your choice.
+  Create non-transformer layers (e.g. embeddings and logits) as usual. Instead of loading transformer blocks,
+  create a RemoteSequential instance. 
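+
+As a reference, here is a minimal sketch of the block interface described in step 2. Only the `.forward` keyword arguments come from this guide; the class name and internals are placeholders for your own model:
+
+```python
+import torch
+from torch import nn
+
+
+class MyCustomBlock(nn.Module):
+    """Placeholder block: only the forward signature is prescribed by step 2 above."""
+
+    def __init__(self, hidden_size: int):
+        super().__init__()
+        # stand-in for your model's attention + MLP sub-layers
+        self.body = nn.Linear(hidden_size, hidden_size)
+
+    def forward(self, hidden_states: torch.Tensor, use_cache: bool = False, layer_past=None):
+        # a real block would apply attention (updating layer_past as its key/value cache) and an MLP
+        hidden_states = self.body(hidden_states)
+        present = None  # a real block would return its updated cache tensors here
+        return (hidden_states, present) if use_cache else (hidden_states,)
+```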
 
-# shell B:
-export PYTHONPATH=.
-export INITIAL_PEERS="/ip4/TODO_COPY_INITIAL_PEERS_FROM_SERVER_OUTPUT"
-export MODEL_NAME="bigscience/test-bloomd-6b3"
+Once you are done, run `tests/test_full_model.py` to verify that your conversion went correctly.
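+For example, building on the repository's testing instructions (the peer address is a placeholder; point `MODEL_NAME` at your converted model):
+
+```bash
+export PYTHONPATH=.
+export INITIAL_PEERS="/ip4/TODO_COPY_INITIAL_PEERS_FROM_SERVER_OUTPUT"
+export MODEL_NAME="bigscience/test-bloomd-6b3"  # replace with your converted model
+pytest tests/test_full_model.py
+```
+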
+In the future, we hope to streamline this process, making it possible to serve any language model available on Hugging Face.
+If you want this future to come sooner and are willing to work on a pull request, please contact us.
 
-# test individual random blocks for exact match
-pytest tests/test_block_exact_match.py
-
-# test the full model
-pytest tests/test_full_model.py
-```
 
---------------------------------------------------------------------------------
 
 <p align="center">
     This project is a part of the <a href="https://bigscience.huggingface.co/">BigScience</a> research workshop.