
Update and simplify the ALBERT example (#437)

1. I think that running several experiments in one DHT is advanced functionality and should not be the focus of this README.

    If so, we can just set a reasonable default for `--experiment_prefix` and drop the flag from all commands in the README for the sake of simplicity.

2. On many Linux distributions (e.g., Ubuntu 20.04), `python` points to Python 2 or is not installed at all. It's better to run the scripts directly as `./run_trainer.py` (`git clone` preserves the executable bit, so there's no need to run `chmod +x`); see the sketch after this list.

3. The log examples are updated to match the actual output (produced with hivemind.Optimizer and without the caller location in the INFO messages).

4. nltk is updated to 3.6.5 (the latest version before the bug, so it does not disappoint dependabot).
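
For context on point 2: running a script directly relies on the interpreter line (shebang) at the top of the file plus the executable bit, which `git clone` preserves. Below is a minimal sketch of that convention, shown only as an illustration; the exact shebang and contents of the repository's scripts may differ.

```python
#!/usr/bin/env python3
# Hypothetical skeleton of an executable example script such as run_trainer.py.
# With this shebang and the executable bit set (preserved by `git clone`),
# `./run_trainer.py` always launches a Python 3 interpreter, regardless of
# what the bare `python` command resolves to on the host.

def main() -> None:
    print("trainer entry point would start here")


if __name__ == "__main__":
    main()
```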
Alexander Borzunov, 3 years ago
commit 247ae25c9c

examples/albert/README.md (+24, -24)

@@ -9,7 +9,7 @@ using `hivemind.CollaborativeOptimizer` to exchange information between peers.
 
 * Install hivemind: `pip install git+https://github.com/learning-at-home/hivemind.git`
 * Dependencies: `pip install -r requirements.txt`
-* Preprocess data: `python tokenize_wikitext103.py`
+* Preprocess data: `./tokenize_wikitext103.py`
 * Upload the data to a publicly available location or ask volunteers to preprocess it locally
 
 ## Running an experiment
@@ -20,18 +20,16 @@ Run the first DHT peer to welcome trainers and record training statistics (e.g.,
 
 - In this example, we use [wandb.ai](https://wandb.ai/site) to plot training metrics. If you're unfamiliar with Weights
   & Biases, here's a [quickstart tutorial](https://docs.wandb.ai/quickstart).
-- Run `python run_training_monitor.py --experiment_prefix YOUR_EXPERIMENT_NAME --wandb_project YOUR_WANDB_PROJECT`
+- Run `./run_training_monitor.py --wandb_project YOUR_WANDB_PROJECT`
 
-  - `YOUR_EXPERIMENT_NAME` must be a unique name of this training run, e.g. `my-albert-v1`. It cannot contain `.`
-    due to naming conventions.
   - `YOUR_WANDB_PROJECT` is a name of wandb project used to track training metrics. Multiple experiments can have the
     same project name.
 
 ```
-$ python run_training_monitor.py --experiment_prefix my-albert-v1 --wandb_project Demo-run
-Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:47] Running a DHT peer. To connect other peers to this one over the Internet,
+$ ./run_training_monitor.py --wandb_project Demo-run
+Oct 14 16:26:36.083 [INFO] Running a DHT peer. To connect other peers to this one over the Internet,
 use --initial_peers /ip4/1.2.3.4/tcp/1337/p2p/XXXX /ip4/1.2.3.4/udp/31337/quic/p2p/XXXX
-Oct 14 16:26:36.083 [INFO] [utils.log_visible_maddrs:50] Full list of visible multiaddresses: ...
+Oct 14 16:26:36.083 [INFO] Full list of visible multiaddresses: ...
 wandb: Currently logged in as: XXX (use `wandb login --relogin` to force relogin)
 wandb: Tracking run with wandb version 0.10.32
 wandb: Syncing run dry-mountain-2
@@ -39,12 +37,12 @@ wandb:  View project at https://wandb.ai/XXX/Demo-run
 wandb:  View run at https://wandb.ai/XXX/Demo-run/runs/YYY
 wandb: Run data is saved locally in /path/to/run/data
 wandb: Run `wandb offline` to turn off syncing.
-Oct 14 16:26:41.064 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
-Oct 14 16:26:44.068 [INFO] [optim.collaborative._fetch_state:448] Found no active peers: None
+Oct 14 16:26:41.064 [INFO] Found no active peers: None
+Oct 14 16:26:44.068 [INFO] Found no active peers: None
 ...
-Oct 14 16:37:37.246 [INFO] [__main__.<module>:209] Step #1  loss = 11.05164
-Oct 14 16:39:37.441 [INFO] [__main__.<module>:209] Step #2  loss = 11.03771
-Oct 14 16:40:37.541 [INFO] [__main__.<module>:209] Step #3  loss = 11.02886
+Oct 14 16:37:37.246 [INFO] Step #1  loss = 11.05164
+Oct 14 16:39:37.441 [INFO] Step #2  loss = 11.03771
+Oct 14 16:40:37.541 [INFO] Step #3  loss = 11.02886
 ```
 
 ### GPU trainers
@@ -57,8 +55,8 @@ To join the collaboration with a GPU trainer,
   (see [default paths](./arguments.py#L117-L134) for reference)
 - Run:
   ```bash
-  python run_trainer.py \
-      --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+  ./run_trainer.py \
+      --initial_peers ONE_OR_MORE_PEERS \
       --logging_first_step --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs
   ```
 
@@ -89,16 +87,18 @@ See the ["Tips and tricks"](#tips-and-tricks) section for more information on se
 As the peer begins training, it will periodically report training logs in the following form:
 
 ```
-... [INFO] [...] my-albert-v1 accumulated 448 samples from 17 peers for step #0. ETA 18.88 sec (refresh in 15.73 sec)
-... [INFO] [...] my-albert-v1 accumulated 4096 samples from 16 peers for step #0. ETA 0.00 sec (refresh in 0.50 sec)
-... [INFO] [optim.collaborative.step:283] Averaged tensors successfully with 17 peers
-... [INFO] [optim.collaborative.step:317] Optimizer step: done!
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:141] Step 1
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:142] Your current contribution: 892 samples
-Oct 14 18:58:03.750 [INFO] [__main__.on_step_end:143] Local loss: 11.023
+Dec 28 00:15:31.482 [INFO] albert accumulated 4056 samples for epoch #0 from 2 peers. ETA 0.75 sec (refresh in 0.50 sec)
+Dec 28 00:15:31.990 [INFO] albert accumulated 4072 samples for epoch #0 from 2 peers. ETA 0.24 sec (refresh in 0.50 sec)
+...
+Dec 28 00:15:32.857 [INFO] Step #1
+Dec 28 00:15:32.857 [INFO] Your current contribution: 2144 samples
+Dec 28 00:15:32.857 [INFO] Performance: 20.924 samples/sec
+Dec 28 00:15:32.857 [INFO] Local loss: 11.06709
+Dec 28 00:15:33.580 [INFO] Averaged gradients with 2 peers
+Dec 28 00:15:38.336 [INFO] Averaged parameters with 2 peers
 ```
 
-__Sanity check:__ a healthy peer will periodically report `Averaged tensors successfully with [N > 1]` peers.
+__Sanity check:__ a healthy peer will periodically report `Averaged gradients/parameters with [N > 1]` peers.
 
 For convenience, you can view (and share!) the learning curves of your collaborative experiments in wandb:
 
@@ -169,8 +169,8 @@ Here's an example of a full trainer script for Google Colab:
 !pip install transformers datasets sentencepiece torch_optimizer==0.1.0
 !git clone https://github.com/learning-at-home/hivemind && cd hivemind && pip install -e .
 !curl -L YOUR_HOSTED_DATA | tar xzf -
-!ulimit -n 4096 && python ./hivemind/examples/albert/run_trainer.py \
-    --experiment_prefix YOUR_EXPERIMENT_NAME --initial_peers ONE_OR_MORE_PEERS \
+!ulimit -n 4096 && ./hivemind/examples/albert/run_trainer.py \
+    --initial_peers ONE_OR_MORE_PEERS \
     --logging_dir ./logs --logging_first_step --output_dir ./outputs --overwrite_output_dir \
     --client_mode --averaging_expiration 10 --batch_size_lead 300 --gradient_accumulation_steps 1
 ```
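
For readers following the README changes above: the training monitor prints the multiaddresses it is reachable at, and every trainer passes one or more of them via `--initial_peers`. The rough sketch below illustrates that handshake using the public `hivemind.DHT` API (`initial_peers`, `start`, `get_visible_maddrs`, `shutdown`); it is an illustration of the idea, not code taken from the example scripts.

```python
import hivemind

# First peer (e.g., the training monitor): starts a fresh DHT and prints the
# multiaddresses that other peers can use to reach it.
monitor_dht = hivemind.DHT(start=True)
visible_maddrs = [str(maddr) for maddr in monitor_dht.get_visible_maddrs()]
print(visible_maddrs)

# A second peer (e.g., a GPU trainer): joins the same DHT by passing one or more
# of those multiaddresses, which is what --initial_peers does in the README commands.
trainer_dht = hivemind.DHT(initial_peers=visible_maddrs, start=True)

trainer_dht.shutdown()
monitor_dht.shutdown()
```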

examples/albert/arguments.py (+1, -1)

@@ -7,7 +7,7 @@ from transformers import TrainingArguments
 @dataclass
 class BaseTrainingArguments:
     experiment_prefix: str = field(
-        metadata={"help": "A unique 'name' of this experiment, used to store metadata on the DHT"}
+        default="albert", metadata={"help": "A unique 'name' of this experiment, used to store metadata on the DHT"}
     )
     initial_peers: List[str] = field(
         default_factory=list,
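
A note on the `arguments.py` change above: in a Python dataclass, every field declared after a field with a default must also have a default (or a `default_factory`), otherwise the class definition raises a `TypeError`. Giving `experiment_prefix` a default is therefore only possible because the later fields, such as `initial_peers`, already provide one. A simplified, self-contained sketch of the pattern (not the full `arguments.py`):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseTrainingArguments:
    # Defaulted field: the README commands can now omit --experiment_prefix.
    experiment_prefix: str = field(
        default="albert",
        metadata={"help": "A unique 'name' of this experiment, used to store metadata on the DHT"},
    )
    # Every field after a defaulted one must also have a default (here, a factory),
    # or Python raises a TypeError when the class is defined.
    initial_peers: List[str] = field(default_factory=list)


args = BaseTrainingArguments()
assert args.experiment_prefix == "albert"
```

Assuming the example keeps turning these dataclasses into command-line flags the same way as before, advanced users can still pass `--experiment_prefix` explicitly to run several experiments in one DHT.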

examples/albert/requirements.txt (+1, -1)

@@ -4,4 +4,4 @@ torch_optimizer==0.1.0
 wandb==0.10.26
 sentencepiece
 requests
-nltk==3.6.2
+nltk==3.6.5