@@ -20,15 +20,16 @@ Run the first DHT peer to welcome trainers and record training statistics (e.g.,
 
- In this example, we use [wandb.ai](https://wandb.ai/site) to plot training metrics. If you're unfamiliar with Weights
& Biases, here's a [quickstart tutorial](https://docs.wandb.ai/quickstart).
-- Run `python run_training_monitor.py --experiment_prefix NAME_YOUR_EXPERIMENT --wandb_project WANDB_PROJECT_HERE`
-- `NAME_YOUR_EXPERIMENT` must be a unique name of this training run, e.g. `my-first-albert`. It cannot contain `.`
- due to naming conventions.
-- `WANDB_PROJECT_HERE` is a name of wandb project used to track training metrics. Multiple experiments can have the
- same project name.
+- Run `python run_training_monitor.py --experiment_prefix YOUR_EXPERIMENT_NAME --wandb_project YOUR_WANDB_PROJECT`
+
+ - `YOUR_EXPERIMENT_NAME` must be a unique name of this training run, e.g. `my-first-albert`. It cannot contain `.`
+ due to naming conventions.
+ - `YOUR_WANDB_PROJECT` is the name of the wandb project used to track training metrics. Multiple experiments can have the
+ same project name.
 
```
$ python run_training_monitor.py --experiment_prefix my-albert-v1 --wandb_project Demo-run
-[2021/06/17 16:26:36.083][INFO][root.log_visible_maddrs:54] Running a DHT peer. To connect other peers to this one over the Internet,
+[2021/06/17 16:26:36.083][INFO][root.log_visible_maddrs:54] Running a DHT peer. To connect other peers to this one over the Internet,
use --initial_peers /ip4/1.2.3.4/tcp/1337/p2p/XXXX /ip4/1.2.3.4/udp/31337/quic/p2p/XXXX
wandb: Currently logged in as: XXX (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.32
@@ -56,8 +57,8 @@ To join the collaboration with a GPU trainer,
- Run:
```bash
python run_trainer.py \
- --experiment_prefix SAME_AS_IN_RUN_TRAINING_MONITOR --initial_peers ONE_OR_MORE_PEERS --seed 42 \
- --logging_first_step --logging_steps 100 --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
+ --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+ --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
```
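
For illustration only, here is a hypothetical filled-in version of the command above. The peer addresses are the placeholder multiaddresses printed by the monitor in the log earlier, and `my-albert-v1` is the prefix from the monitor example; substitute the values your own monitor actually reports.

```bash
# Sketch with placeholder values copied from the monitor's log output above.
# --initial_peers takes one or more space-separated multiaddresses, hence the unquoted expansion.
export INITIAL_PEERS="/ip4/1.2.3.4/tcp/1337/p2p/XXXX /ip4/1.2.3.4/udp/31337/quic/p2p/XXXX"
python run_trainer.py \
    --experiment_prefix my-albert-v1 --initial_peers $INITIAL_PEERS \
    --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
```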
 
Here, `ONE_OR_MORE_PEERS` stands for multiaddresses of one or multiple existing peers (training monitors or existing
@@ -135,7 +136,7 @@ incoming connections (e.g. when in colab or behind a firewall), add `--client_mo
below). In case of high network latency, you may want to increase `--averaging_expiration` by a few seconds or
set `--batch_size_lead` to start averaging a bit earlier than the rest of the collaboration. GPU-wise, each peer should
be able to process one local microbatch each 0.5–1 seconds (see trainer's progress bar). To achieve that, we
-recommend tuning `--per_device_train_batch_size` and `--gradient_accumulation_steps`.
+recommend tuning `--per_device_train_batch_size` and `--gradient_accumulation_steps`.
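
As a rough, hypothetical sketch of what such tuning can look like (the specific values are illustrative only, not recommendations, and `$INITIAL_PEERS` is the placeholder variable from the sketch above):

```bash
# Hypothetical settings for a peer with a modest GPU and a high-latency connection:
# smaller per-device batches, no local gradient accumulation, a longer --averaging_expiration,
# and --batch_size_lead to start averaging a bit ahead of the rest of the collaboration.
python run_trainer.py \
    --experiment_prefix my-albert-v1 --initial_peers $INITIAL_PEERS \
    --per_device_train_batch_size 4 --gradient_accumulation_steps 1 \
    --averaging_expiration 10 --batch_size_lead 300 \
    --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
```

If the trainer's progress bar shows each microbatch taking much longer than a second, lowering `--per_device_train_batch_size` is usually the first thing to try.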
 
The example trainer supports
multiple GPUs via DataParallel. However, using advanced distributed training strategies (
@@ -155,7 +156,7 @@ collaborative experiment. Here's how to best use them:
- Most free GPUs are running behind a firewall, which requires you to run trainer with `--client_mode` (see example
below). Such peers can only exchange gradients if there is at least one non-client-mode peer (GPU server or desktop
with public IP). We recommend using a few preemptible instances with the cheapest GPU you can find. For example, we
- tested this code on preemptible
+ tested this code on preemptible
[`g4dn.xlarge`](https://aws.amazon.com/blogs/aws/now-available-ec2-instances-g4-with-nvidia-t4-tensor-core-gpus/)
nodes for around $0.15/h apiece with 8 AWS nodes and up to 61 Colab/Kaggle participants.
- You can create starter notebooks to make it more convenient for collaborators to join your training
@@ -169,10 +170,9 @@ Here's an example of a full trainer script for Google Colab:
!git clone https://github.com/learning-at-home/hivemind && cd hivemind && pip install -e .
!curl -L YOUR_HOSTED_DATA | tar xzf -
!ulimit -n 4096 && python ./hivemind/examples/albert/run_trainer.py \
- --client_mode --initial_peers ONE_OR_MORE_PEERS --averaging_expiration 10 \
- --batch_size_lead 300 --per_device_train_batch_size 4 --gradient_accumulation_steps 1 \
- --logging_first_step --logging_steps 100 --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
- --experiment_prefix EXPERIMENT_NAME_HERE --seed 42
+ --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+ --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
+ --client_mode --averaging_expiration 10 --batch_size_lead 300 --gradient_accumulation_steps 1
```
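
A side note on the `ulimit -n 4096` prefix in the cell above: it raises the shell's soft limit on open file descriptors, which matters since a peer can keep many simultaneous peer-to-peer connections open, each consuming a descriptor. Outside of Colab you can check and raise the limit the same way before launching a trainer (a sketch; the value you can set is capped by your system's hard limit):

```bash
ulimit -n        # print the current soft limit on open file descriptors
ulimit -n 4096   # raise it for the current shell, then start run_trainer.py from this shell
```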
 
### Using IPFS