@@ -9,7 +9,7 @@ using `hivemind.CollaborativeOptimizer` to exchange information between peers.
* Install hivemind: `pip install git+https://github.com/learning-at-home/hivemind.git`
* Dependencies: `pip install -r requirements.txt`
-* Preprocess data: `python tokenize_wikitext103.py`
+* Preprocess data: `./tokenize_wikitext103.py`
* Upload the data to a publicly available location or ask volunteers to preprocess it locally
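For orientation, the preprocessing step above amounts to tokenizing WikiText-103 and saving the result to disk so that trainers can load it locally. Below is a minimal sketch using the `datasets` and `transformers` libraries; the model name, tokenizer arguments, and output path are illustrative assumptions and may not match what `tokenize_wikitext103.py` actually does.

```python
from datasets import load_dataset
from transformers import AlbertTokenizerFast

# Illustrative sketch only: tokenize WikiText-103 and save it for local loading.
tokenizer = AlbertTokenizerFast.from_pretrained("albert-large-v2")  # assumed model name
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train")

tokenized = wikitext.map(
    lambda batch: tokenizer(batch["text"]),  # exact tokenizer arguments may differ
    batched=True,
    remove_columns=["text"],
)
tokenized.save_to_disk("./data/albert_tokenized_wikitext")  # assumed output path
```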
## Running an experiment
@@ -20,18 +20,16 @@ Run the first DHT peer to welcome trainers and record training statistics (e.g.,
- In this example, we use [wandb.ai](https://wandb.ai/site) to plot training metrics. If you're unfamiliar with Weights
& Biases, here's a [quickstart tutorial](https://docs.wandb.ai/quickstart).
-- Run `python run_training_monitor.py --experiment_prefix YOUR_EXPERIMENT_NAME --wandb_project YOUR_WANDB_PROJECT`
+- Run `./run_training_monitor.py --wandb_project YOUR_WANDB_PROJECT`
- - `YOUR_EXPERIMENT_NAME` must be a unique name of this training run, e.g. `my-albert-v1`. It cannot contain `.`
- due to naming conventions.
- `YOUR_WANDB_PROJECT` is the name of the wandb project used to track training metrics. Multiple experiments can have the
same project name.
```
-$ python run_training_monitor.py --experiment_prefix my-albert-v1 --wandb_project Demo-run
-Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:47] Running a DHT peer. To connect other peers to this one over the Internet,
+$ ./run_training_monitor.py --wandb_project Demo-run
+Oct 14 16:26:36.083 [INFO] Running a DHT peer. To connect other peers to this one over the Internet,
use --initial_peers /ip4/1.2.3.4/tcp/1337/p2p/XXXX /ip4/1.2.3.4/udp/31337/quic/p2p/XXXX
-Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:50] Full list of visible multiaddresses: ...
+Oct 14 16:26:36.083 [INFO] Full list of visible multiaddresses: ...
wandb: Currently logged in as: XXX (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.32
wandb: Syncing run dry-mountain-2
@@ -39,12 +37,12 @@ wandb: View project at https://wandb.ai/XXX/Demo-run
wandb: View run at https://wandb.ai/XXX/Demo-run/runs/YYY
wandb: Run data is saved locally in /path/to/run/data
wandb: Run `wandb offline` to turn off syncing.
-Oct 14 16:26:41.064 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
-Oct 14 16:26:44.068 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
+Oct 14 16:26:41.064 [INFO] Found no active peers: None
+Oct 14 16:26:44.068 [INFO] Found no active peers: None
...
-Oct 14 16:37:37.246 [INFO] [__main__.<module>:209] Step #1 loss = 11.05164
-Oct 14 16:39:37.441 [INFO] [__main__.<module>:209] Step #2 loss = 11.03771
-Oct 14 16:40:37.541 [INFO] [__main__.<module>:209] Step #3 loss = 11.02886
+Oct 14 16:37:37.246 [INFO] Step #1 loss = 11.05164
+Oct 14 16:39:37.441 [INFO] Step #2 loss = 11.03771
+Oct 14 16:40:37.541 [INFO] Step #3 loss = 11.02886
```
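Under the hood, the monitor acts as the first DHT peer: it starts a `hivemind.DHT` node and prints the multiaddresses that trainers pass via `--initial_peers` (the actual script additionally aggregates training statistics and reports them to wandb). A minimal sketch of that entry-point part, assuming hivemind's `DHT` API:

```python
import hivemind

# Start a standalone DHT node that other peers can use as an entry point.
dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],  # listen on TCP and QUIC
    start=True,
)

# These are the addresses trainers should pass as --initial_peers.
print("To connect other peers to this one, use --initial_peers",
      " ".join(str(addr) for addr in dht.get_visible_maddrs()))
```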
### GPU trainers
@@ -57,8 +55,8 @@ To join the collaboration with a GPU trainer,
(see [default paths](./arguments.py#L117-L134) for reference)
- Run:
```bash
- python run_trainer.py \
- --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+ ./run_trainer.py \
+ --initial_peers ONE_OR_MORE_PEERS \
--logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
```
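Internally, the trainer joins the same DHT through `--initial_peers` and wraps an ordinary PyTorch optimizer with `hivemind.CollaborativeOptimizer` (mentioned at the top of this document) so that gradients are accumulated and averaged across all peers. A rough sketch of that pattern; the model, the placeholder peer address, and the constructor arguments below are illustrative assumptions and may differ from what `run_trainer.py` actually passes:

```python
import torch
import hivemind

model = torch.nn.Linear(128, 128)  # stand-in for the ALBERT model, for illustration only

# Join the DHT via one or more multiaddresses printed by the monitor peer.
dht = hivemind.DHT(
    initial_peers=["/ip4/1.2.3.4/tcp/1337/p2p/XXXX"],  # placeholder address
    start=True,
)

local_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap the local optimizer; argument names and values here are assumptions.
opt = hivemind.CollaborativeOptimizer(
    opt=local_opt,
    dht=dht,
    prefix="albert",            # shared experiment key in the DHT (assumed)
    target_batch_size=4096,     # samples accumulated globally per optimizer step (assumed)
    batch_size_per_step=4,      # local samples processed per opt.step() call (assumed)
)

# The training loop then calls loss.backward() followed by opt.step() as usual.
```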
@@ -89,16 +87,18 @@ See the ["Tips and tricks"](#tips-and-tricks) section for more information on se
As the peer begins training, it will periodically report training logs in the following form:
```
-... [INFO] [...] my-albert-v1 accumulated 448 samples from 17 peers for step #0. ETA 18.88 sec (refresh in 15.73 sec)
-... [INFO] [...] my-albert-v1 accumulated 4096 samples from 16 peers for step #0. ETA 0.00 sec (refresh in 0.50 sec)
-... [INFO] [optim.collaborative.step:283] Averaged tensors successfully with 17 peers
-... [INFO] [optim.collaborative.step:317] Optimizer step: done!
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:141] Step 1
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:142] Your current contribution: 892 samples
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:143] Local loss: 11.023
+Dec 28 00:15:31.482 [INFO] albert accumulated 4056 samples for epoch #0 from 2 peers. ETA 0.75 sec (refresh in 0.50 sec)
+Dec 28 00:15:31.990 [INFO] albert accumulated 4072 samples for epoch #0 from 2 peers. ETA 0.24 sec (refresh in 0.50 sec)
+...
+Dec 28 00:15:32.857 [INFO] Step #1
+Dec 28 00:15:32.857 [INFO] Your current contribution: 2144 samples
+Dec 28 00:15:32.857 [INFO] Performance: 20.924 samples/sec
+Dec 28 00:15:32.857 [INFO] Local loss: 11.06709
+Dec 28 00:15:33.580 [INFO] Averaged gradients with 2 peers
+Dec 28 00:15:38.336 [INFO] Averaged parameters with 2 peers
```
-__Sanity check:__ a healthy peer will periodically report `Averaged tensors successfully with [N > 1]` peers.
+__Sanity check:__ a healthy peer will periodically report `Averaged gradients/parameters with [N > 1]` peers.
For convenience, you can view (and share!) the learning curves of your collaborative experiments in wandb:
@@ -169,8 +169,8 @@ Here's an example of a full trainer script for Google Colab:
!pip install transformers datasets sentencepiece torch_optimizer==0.1.0
!git clone https://github.com/learning-at-home/hivemind && cd hivemind && pip install -e .
!curl -L YOUR_HOSTED_DATA | tar xzf -
-!ulimit -n 4096 && python ./hivemind/examples/albert/run_trainer.py \
- --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+!ulimit -n 4096 && ./hivemind/examples/albert/run_trainer.py \
+ --initial_peers ONE_OR_MORE_PEERS \
--logging_dir ./logs --logging_first_step --output_dir ./outputs --overwrite_output_dir \
--client_mode --averaging_expiration 10 --batch_size_lead 300 --gradient_accumulation_steps 1
```