docs: change mpirun with msrun and add other notices (merge into main) #805

Merged · 1 commit · Sep 12, 2024
17 changes: 13 additions & 4 deletions README.md
@@ -124,15 +124,24 @@

- Distributed Training

For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore.
For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore.

```shell
# distributed training
# assume you have 4 GPUs/NPUs
mpirun -n 4 python train.py --distribute \
msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```
> Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Note that if you launch with `msrun` on 2 devices, please add `--bind_core=True` to bind cores and improve performance. For example:

```shell
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```

> For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html

Detailed parameter definitions are given in `config.py` and can be checked by running `python train.py --help`.
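
As a quick illustration, individual hyper-parameters can also be overridden directly on the command line. The sketch below assumes `config.py` exposes `--batch_size` and `--epoch_size` flags; confirm the exact names with `python train.py --help`.

```shell
# Hypothetical sketch: override a couple of hyper-parameters from the command line.
# The flag names --batch_size and --epoch_size are assumptions; check config.py for the exact names.
python train.py --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet \
    --batch_size=64 --epoch_size=10
```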

@@ -143,7 +152,7 @@
You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file.

```shell
mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```

**Pre-defined Training Strategies:**
16 changes: 13 additions & 3 deletions README_CN.md
@@ -117,15 +117,25 @@

- Distributed Training

For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Thanks to MindSpore's good support for distributed features, users can perform distributed model training with `mpirun`.
For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Thanks to MindSpore's good support for distributed features, users can perform distributed model training with `msrun`.

```shell
# distributed training
# assume you have 4 GPUs or NPUs
mpirun --allow-run-as-root -n 4 python train.py --distribute \
msrun --bind_core=True --worker_num 4 python train.py --distribute \
--model densenet121 --dataset imagenet --data_dir ./datasets/imagenet
```

Note that if you choose msrun as the launcher in a 2-device setup, please add the `--bind_core=True` option to enable core binding and improve 2-device performance. Example:

```shell
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```

> For more usage guidance, please refer to https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html

The full list of parameters and their descriptions is defined in `config.py`; run `python train.py --help` to view it quickly.

To resume training, specify the `--ckpt_path` and `--ckpt_save_dir` arguments; the script will load the model weights and optimizer state from the given path and resume the interrupted training process.
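
For example, a minimal sketch of resuming an interrupted run could look like the following; the checkpoint file name is illustrative and should be replaced with an actual checkpoint saved under `--ckpt_save_dir`.

```shell
# Illustrative resume command: load model weights and optimizer state from --ckpt_path,
# then continue training and keep saving new checkpoints under --ckpt_save_dir.
# The checkpoint file name below is hypothetical.
python train.py --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet \
    --ckpt_path=/path/to/ckpt_save_dir/densenet121-100_1250.ckpt \
    --ckpt_save_dir=/path/to/ckpt_save_dir
```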
@@ -135,7 +145,7 @@
You can specify the data, model, optimizer, and other components and their hyper-parameters either by writing a yaml config file or by setting external arguments. Below is an example of training a model with a preset training strategy (yaml file).

```shell
mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
```

**Pre-defined Training Strategies**
5 changes: 2 additions & 3 deletions configs/README.md
@@ -59,17 +59,16 @@

#### Training Script Format

For consistency, it is recommended to provide distributed training commands based on `mpirun -n {num_devices} python train.py`, instead of using shell script such as `distrubuted_train.sh`.
For consistency, it is recommended to provide distributed training commands based on `msrun --bind_core=True --worker_num {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.

```shell
# standalone training on a gpu or ascend device
python train.py --config configs/densenet/densenet_121_gpu.yaml --data_dir /path/to/dataset --distribute False

# distributed training on gpu or ascend devices
mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet

```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

#### URL and Hyperlink Format
Please use an **absolute path** in hyperlinks or URLs when linking to the target resource in readme files and tables.
5 changes: 2 additions & 3 deletions configs/bit/README.md
@@ -58,11 +58,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/cmt/README.md
@@ -54,11 +54,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/coat/README.md
@@ -48,12 +48,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convit/README.md
@@ -68,11 +68,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnext/README.md
@@ -66,12 +66,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/convnextv2/README.md
@@ -63,12 +63,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/crossvit/README.md
@@ -62,11 +62,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

5 changes: 2 additions & 3 deletions configs/densenet/README.md
@@ -80,11 +80,10 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/dpn/README.md
@@ -69,11 +69,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/edgenext/README.md
@@ -68,11 +68,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/efficientnet/README.md
@@ -78,11 +78,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 64 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/ghostnet/README.md
@@ -63,12 +63,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/googlenet/README.md
@@ -65,12 +65,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/halonet/README.md
@@ -66,12 +66,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/hrnet/README.md
@@ -77,11 +77,11 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv3/README.md
@@ -66,12 +66,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).

6 changes: 3 additions & 3 deletions configs/inceptionv4/README.md
@@ -62,12 +62,12 @@

```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.

Similarly, you can train the model on multiple GPU devices with the above `msrun` command.

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
