docs: replace mpirun with msrun and add other notices
ChongWei905 committed Sep 12, 2024
1 parent e38f625 commit 1812092
Showing 58 changed files with 194 additions and 175 deletions.
17 changes: 13 additions & 4 deletions README.md
@@ -124,15 +124,24 @@ It is easy to train your model on a standard or customized dataset using `train.py`.

- Distributed Training

-For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore.
+For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore.

```shell
# distributed training
# assume you have 4 GPUs/NPUs
-mpirun -n 4 python train.py --distribute \
+msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```
-> Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

+Notice that if you are launching with msrun on 2 devices, please add `--bind_core=True` to avoid a known performance issue. For example:

+```shell
+msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
+    --log_dir=msrun_log --join=True --cluster_time_out=300 \
+    python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
+```
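In the example above, `--worker_num` and `--local_worker_num` are the total and per-node worker counts, `--master_port` is the scheduler port, `--log_dir` collects the per-worker logs, `--join=True` makes the launcher wait for all workers to finish, and `--cluster_time_out` is the cluster organization timeout in seconds; see the msrun documentation linked below for the authoritative definitions.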

+> For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html

Detailed parameter definitions can be seen in `config.py` and checked by running `python train.py --help`.
@@ -143,7 +152,7 @@ It is easy to train your model on a standard or customized dataset using `train.py`.
You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file.
```shell
-mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
+msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```
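External parameters can also be combined with a preset yaml file. A minimal sketch (an editorial example, not from this commit, assuming command-line values override the preset as the surrounding text suggests):

```shell
# start from the squeezenet preset, but point the run at a local ImageNet copy
msrun --bind_core=True --worker_num 4 python train.py \
    -c configs/squeezenet/squeezenet_1.0_gpu.yaml --data_dir /path/to/imagenet
```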
**Pre-defined Training Strategies:**
16 changes: 13 additions & 3 deletions README_CN.md
@@ -117,15 +117,25 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg'

- Distributed Training

-For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Building on MindSpore's good support for distributed features, users can use `mpirun` for distributed model training.
+For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Building on MindSpore's good support for distributed features, users can use `msrun` for distributed model training.

```shell
# distributed training
# assume you have 4 GPU or NPU cards
-mpirun --allow-run-as-root -n 4 python train.py --distribute \
+msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model densenet121 --dataset imagenet --data_dir ./datasets/imagenet
```

+Note that if you choose msrun as the launcher in a 2-device environment, please add the option `--bind_core=True` to enable core binding and optimize the 2-device performance. Sample code:

```shell
msrun --bind_core=True --worker_num=2--local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```

+> For more guidance, please refer to https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html

The full parameter list and descriptions are defined in `config.py` and can be quickly checked by running `python train.py --help`.

To resume training, please specify `--ckpt_path` and `--ckpt_save_dir`; the script will load the model weights and optimizer state from that path and resume the interrupted training process.
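A minimal sketch of such a resume command (an editorial example, not from this commit; the checkpoint filename is hypothetical):

```shell
# resume an interrupted 4-device run: --ckpt_path loads the saved model weights
# and optimizer state, --ckpt_save_dir is where new checkpoints will be written
msrun --bind_core=True --worker_num 4 python train.py --distribute \
    --model densenet121 --dataset imagenet --data_dir ./datasets/imagenet \
    --ckpt_path /path/to/ckpt/densenet121.ckpt --ckpt_save_dir /path/to/ckpt
```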
@@ -135,7 +145,7 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg'
You can write a yaml file or set external parameters to specify the data, model, optimizer, and other components as well as their hyper-parameters. Below is an example of model training with a preset training strategy (yaml file).

```shell
-mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
+msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```

**Pre-defined Training Strategies**
5 changes: 2 additions & 3 deletions configs/README.md
@@ -59,17 +59,16 @@ Illustration:

#### Training Script Format

-For consistency, it is recommended to provide distributed training commands based on `mpirun -n {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.
+For consistency, it is recommended to provide distributed training commands based on `msrun --bind_core=True --worker_num {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.

```shell
# standalone training on a gpu or ascend device
python train.py --config configs/densenet/densenet_121_gpu.yaml --data_dir /path/to/dataset --distribute False

# distributed training on gpu or ascend devices
-mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet

```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

#### URL and Hyperlink Format
Please use an **absolute path** in hyperlinks and URLs when linking to the target resource in readme files and tables.
5 changes: 2 additions & 3 deletions configs/bit/README.md
@@ -58,11 +58,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/cmt/README.md
@@ -54,11 +54,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/coat/README.md
@@ -48,12 +48,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convit/README.md
@@ -68,11 +68,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnext/README.md
@@ -66,12 +66,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnextv2/README.md
@@ -63,12 +63,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/crossvit/README.md
@@ -62,11 +62,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/densenet/README.md
@@ -80,11 +80,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/dpn/README.md
@@ -69,11 +69,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/edgenext/README.md
@@ -68,11 +68,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/efficientnet/README.md
@@ -78,11 +78,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 64 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/ghostnet/README.md
@@ -63,12 +63,12 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/googlenet/README.md
@@ -65,12 +65,12 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/halonet/README.md
@@ -66,12 +66,12 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/hrnet/README.md
@@ -77,11 +77,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv3/README.md
@@ -66,12 +66,12 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv4/README.md
@@ -62,12 +62,12 @@ It is easy to reproduce the reported results with the pre-defined training recipes.

```shell
# distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
```

-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

