docs: change mpirun with msrun and add other notices (merge into main) #805

Merged · 1 commit · Sep 12, 2024
17 changes: 13 additions & 4 deletions README.md
@@ -124,15 +124,24 @@

- Distributed Training

For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore.
For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore.

```shell
# distributed training
# assume you have 4 GPUs/NPUs
mpirun -n 4 python train.py --distribute \
msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```
> Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Note that if you launch with `msrun` on 2 devices, please add `--bind_core=True` to bind cores and improve performance. For example:

```shell
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```

> For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html

Detailed parameter definitions are given in `config.py` and can be checked by running `python train.py --help`.
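
As a quick illustration, individual hyper-parameters can also be overridden directly on the command line. The sketch below assumes `config.py` exposes `--batch_size` and `--epoch_size` flags; confirm the exact names with `python train.py --help`.

```shell
# Hypothetical sketch: override a couple of hyper-parameters from the command line.
# The flag names --batch_size and --epoch_size are assumptions; check config.py for the exact names.
python train.py --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet \
    --batch_size=64 --epoch_size=10
```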

@@ -143,7 +152,7 @@
You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file.

```shell
mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```

**Pre-defined Training Strategies:**
16 changes: 13 additions & 3 deletions README_CN.md
@@ -117,15 +117,25 @@

- Distributed Training

For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Thanks to MindSpore's good support for distributed features, users can perform distributed model training with `mpirun`.
For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Thanks to MindSpore's good support for distributed features, users can perform distributed model training with `msrun`.

```shell
# distributed training
# assume you have 4 GPUs or NPUs
mpirun --allow-run-as-root -n 4 python train.py --distribute \
msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model densenet121 --dataset imagenet --data_dir ./datasets/imagenet
```

Note that if you choose msrun as the launcher in a 2-device setup, please add the `--bind_core=True` option to enable core binding and improve 2-device performance. Example:

```shell
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```

> For more usage guidance, please refer to https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html

The full list of parameters and their descriptions is defined in `config.py`; run `python train.py --help` to view it quickly.

To resume training, specify the `--ckpt_path` and `--ckpt_save_dir` arguments; the script will load the model weights and optimizer state from the given path and resume the interrupted training process.
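
For example, a minimal sketch of resuming an interrupted run could look like the following; the checkpoint file name is illustrative and should be replaced with an actual checkpoint saved under `--ckpt_save_dir`.

```shell
# Illustrative resume command: load model weights and optimizer state from --ckpt_path,
# then continue training and keep saving new checkpoints under --ckpt_save_dir.
# The checkpoint file name below is hypothetical.
python train.py --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet \
    --ckpt_path=/path/to/ckpt_save_dir/densenet121-100_1250.ckpt \
    --ckpt_save_dir=/path/to/ckpt_save_dir
```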
@@ -135,7 +145,7 @@
You can specify the data, model, optimizer, and other components and their hyper-parameters either by writing a yaml config file or by setting external arguments. Below is an example of training a model with a preset training strategy (yaml file).

```shell
mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```

**Pre-defined Training Strategies**
5 changes: 2 additions & 3 deletions configs/README.md
@@ -59,17 +59,16 @@

#### Training Script Format

For consistency, it is recommended to provide distributed training commands based on `mpirun -n {num_devices} python train.py`, instead of using shell script such as `distrubuted_train.sh`.
For consistency, it is recommended to provide distributed training commands based on `msrun --bind_core=True --worker_num {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.

```shell
# standalone training on a gpu or ascend device
python train.py --config configs/densenet/densenet_121_gpu.yaml --data_dir /path/to/dataset --distribute False

# distributed training on gpu or ascend devices
mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet

```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

#### URL and Hyperlink Format
Please use an **absolute path** in hyperlinks or URLs when linking to the target resource in readme files and tables.
5 changes: 2 additions & 3 deletions configs/bit/README.md
@@ -58,11 +58,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/cmt/README.md
@@ -54,11 +54,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/coat/README.md
@@ -48,12 +48,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convit/README.md
@@ -68,11 +68,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnext/README.md
@@ -66,12 +66,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnextv2/README.md
@@ -63,12 +63,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/crossvit/README.md
@@ -62,11 +62,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/densenet/README.md
@@ -80,11 +80,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/dpn/README.md
@@ -69,11 +69,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/edgenext/README.md
@@ -68,11 +68,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/efficientnet/README.md
@@ -78,11 +78,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 64 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/ghostnet/README.md
@@ -63,12 +63,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/googlenet/README.md
@@ -65,12 +65,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/halonet/README.md
@@ -66,12 +66,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/hrnet/README.md
@@ -77,11 +77,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv3/README.md
@@ -66,12 +66,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv4/README.md
@@ -62,12 +62,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
