diff --git a/README.md b/README.md index 2bf09d49c..d7ebf826d 100644 --- a/README.md +++ b/README.md @@ -124,15 +124,24 @@ It is easy to train your model on a standard or customized dataset using `train. - Distributed Training - For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore. + For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore. ```shell # distributed training # assume you have 4 GPUs/NPUs - mpirun -n 4 python train.py --distribute \ + msrun --bind_core=True --worker_num 4 python train.py --distribute \ --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet ``` - > Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. + + Note that when launching with `msrun` on 2 devices, you should add `--bind_core=True` to improve performance. For example: + + ```shell + msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \ + --log_dir=msrun_log --join=True --cluster_time_out=300 \ + python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet + ``` + + > For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html Detailed parameter definitions can be seen in `config.py` and checked by running `python train.py --help'. @@ -143,7 +152,7 @@ It is easy to train your model on a standard or customized dataset using `train. You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file. 
```shell - mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml + msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml ``` **Pre-defined Training Strategies:** diff --git a/README_CN.md b/README_CN.md index 7977d4987..b474b09fa 100644 --- a/README_CN.md +++ b/README_CN.md @@ -117,15 +117,25 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg' - 分布式训练 - 对于像ImageNet这样的大型数据集,有必要在多个设备上以分布式模式进行训练。基于MindSpore对分布式相关功能的良好支持,用户可以使用`mpirun`来进行模型的分布式训练。 + 对于像ImageNet这样的大型数据集,有必要在多个设备上以分布式模式进行训练。基于MindSpore对分布式相关功能的良好支持,用户可以使用`msrun`来进行模型的分布式训练。 ```shell # 分布式训练 # 假设你有4张GPU或者NPU卡 - mpirun --allow-run-as-root -n 4 python train.py --distribute \ + msrun --bind_core=True --worker_num 4 python train.py --distribute \ --model densenet121 --dataset imagenet --data_dir ./datasets/imagenet ``` + 注意,如果在两卡环境下选用msrun作为启动方式,请添加配置项 `--bind_core=True` 增加绑核操作以优化两卡性能,范例代码如下: + + ```shell + msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \ + --log_dir=msrun_log --join=True --cluster_time_out=300 \ + python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet + ``` + + > 如需更多操作指导,请参考 https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html + 完整的参数列表及说明在`config.py`中定义,可运行`python train.py --help`快速查看。 如需恢复训练,请指定`--ckpt_path`和`--ckpt_save_dir`参数,脚本将加载路径中的模型权重和优化器状态,并恢复中断的训练进程。 @@ -135,7 +145,7 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg' 您可以编写yaml文件或设置外部参数来指定配置数据、模型、优化器等组件及其超参数。以下是使用预设的训练策略(yaml文件)进行模型训练的示例。 ```shell - mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml + msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml ``` **预定义的训练策略** diff --git a/configs/README.md b/configs/README.md index ecf3bad1a..5e4fd6556 100644 --- a/configs/README.md +++ b/configs/README.md @@ -59,17 +59,16 @@ Illustration: 
#### Training Script Format -For consistency, it is recommended to provide distributed training commands based on `mpirun -n {num_devices} python train.py`, instead of using shell script such as `distrubuted_train.sh`. +For consistency, it is recommended to provide distributed training commands based on `msrun --bind_core=True --worker_num {num_devices} python train.py`, instead of using shell script such as `distrubuted_train.sh`. ```shell # standalone training on a gpu or ascend device python train.py --config configs/densenet/densenet_121_gpu.yaml --data_dir /path/to/dataset --distribute False # distributed training on gpu or ascend divices - mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet + msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet ``` - > If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. #### URL and Hyperlink Format Please use **absolute path** in the hyperlink or url for linking the target resource in the readme file and table. diff --git a/configs/bit/README.md b/configs/bit/README.md index 93613d0a7..bb09f71ab 100644 --- a/configs/bit/README.md +++ b/configs/bit/README.md @@ -58,11 +58,10 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/cmt/README.md b/configs/cmt/README.md index aecb01dbe..4c2dd2fb9 100644 --- a/configs/cmt/README.md +++ b/configs/cmt/README.md @@ -54,11 +54,10 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/coat/README.md b/configs/coat/README.md index fdd9fbb94..cef0f69b7 100644 --- a/configs/coat/README.md +++ b/configs/coat/README.md @@ -48,12 +48,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/convit/README.md b/configs/convit/README.md index 735f635ff..3ac41caca 100644 --- a/configs/convit/README.md +++ b/configs/convit/README.md @@ -68,11 +68,10 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/convnext/README.md b/configs/convnext/README.md index 210b49635..ad64d9bb8 100644 --- a/configs/convnext/README.md +++ b/configs/convnext/README.md @@ -66,12 +66,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/convnextv2/README.md b/configs/convnextv2/README.md index 81076c9fa..4f7dcd38d 100644 --- a/configs/convnextv2/README.md +++ b/configs/convnextv2/README.md @@ -63,12 +63,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/crossvit/README.md b/configs/crossvit/README.md index 9063dc077..1c8c130eb 100644 --- a/configs/crossvit/README.md +++ b/configs/crossvit/README.md @@ -62,11 +62,10 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/densenet/README.md b/configs/densenet/README.md index 98d6d2de8..ffa1cdef4 100644 --- a/configs/densenet/README.md +++ b/configs/densenet/README.md @@ -80,11 +80,10 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/dpn/README.md b/configs/dpn/README.md index bc36c190c..fc742004f 100644 --- a/configs/dpn/README.md +++ b/configs/dpn/README.md @@ -69,11 +69,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distrubted training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/dpn/dpn92_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/edgenext/README.md b/configs/edgenext/README.md index 7550d6144..89c1516b2 100644 --- a/configs/edgenext/README.md +++ b/configs/edgenext/README.md @@ -68,11 +68,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/edgenext/edgenext_small_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/efficientnet/README.md b/configs/efficientnet/README.md index 3f7297249..ed9da8c40 100644 --- a/configs/efficientnet/README.md +++ b/configs/efficientnet/README.md @@ -78,11 +78,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 64 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/efficientnet/efficientnet_b0_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/ghostnet/README.md b/configs/ghostnet/README.md index e6f952da7..2db1f7133 100644 --- a/configs/ghostnet/README.md +++ b/configs/ghostnet/README.md @@ -63,12 +63,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/ghostnet/ghostnet_100_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/googlenet/README.md b/configs/googlenet/README.md index 6f39e1a7d..25b138505 100644 --- a/configs/googlenet/README.md +++ b/configs/googlenet/README.md @@ -65,12 +65,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/googlenet/googlenet_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/halonet/README.md b/configs/halonet/README.md index c2025dafd..6b68dbf26 100644 --- a/configs/halonet/README.md +++ b/configs/halonet/README.md @@ -66,12 +66,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/halonet/halonet_50t_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/hrnet/README.md b/configs/hrnet/README.md index 32980034d..9e7aeb2c6 100644 --- a/configs/hrnet/README.md +++ b/configs/hrnet/README.md @@ -77,11 +77,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/inceptionv3/README.md b/configs/inceptionv3/README.md index 5e41a625e..2ebddbf9c 100644 --- a/configs/inceptionv3/README.md +++ b/configs/inceptionv3/README.md @@ -66,12 +66,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv3/inception_v3_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/inceptionv4/README.md b/configs/inceptionv4/README.md index 1ff4d8d77..c76c5dae0 100644 --- a/configs/inceptionv4/README.md +++ b/configs/inceptionv4/README.md @@ -62,12 +62,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/inceptionv4/inception_v4_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/mixnet/README.md b/configs/mixnet/README.md index 290b166f8..001364d35 100644 --- a/configs/mixnet/README.md +++ b/configs/mixnet/README.md @@ -66,12 +66,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distrubted training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mixnet/mixnet_s_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mixnet/mixnet_s_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/mnasnet/README.md b/configs/mnasnet/README.md index 2fe74c075..4b8183119 100644 --- a/configs/mnasnet/README.md +++ b/configs/mnasnet/README.md @@ -61,12 +61,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mnasnet/mnasnet_0.75_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mnasnet/mnasnet_0.75_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/mobilenetv1/README.md b/configs/mobilenetv1/README.md index 742fae375..778f1403f 100644 --- a/configs/mobilenetv1/README.md +++ b/configs/mobilenetv1/README.md @@ -61,12 +61,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mobilenetv1/mobilenet_v1_0.25_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mobilenetv1/mobilenet_v1_0.25_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/mobilenetv2/README.md b/configs/mobilenetv2/README.md index 24dc13091..ec2c1bdd3 100644 --- a/configs/mobilenetv2/README.md +++ b/configs/mobilenetv2/README.md @@ -63,12 +63,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mobilenetv2/mobilenet_v2_0.75_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mobilenetv2/mobilenet_v2_0.75_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/mobilenetv3/README.md b/configs/mobilenetv3/README.md index f79e79829..0b87493b5 100644 --- a/configs/mobilenetv3/README.md +++ b/configs/mobilenetv3/README.md @@ -63,12 +63,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mobilenetv3/mobilenet_v3_small_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mobilenetv3/mobilenet_v3_small_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/mobilevit/README.md b/configs/mobilevit/README.md index 94343cefe..16579283e 100644 --- a/configs/mobilevit/README.md +++ b/configs/mobilevit/README.md @@ -61,11 +61,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/mobilevit/mobilevit_xx_small_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/mobilevit/mobilevit_xx_small_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/nasnet/README.md b/configs/nasnet/README.md index 82ea0eb4b..3ee1b4a55 100644 --- a/configs/nasnet/README.md +++ b/configs/nasnet/README.md @@ -75,11 +75,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/nasnet/nasnet_a_4x1056_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/nasnet/nasnet_a_4x1056_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/pit/README.md b/configs/pit/README.md index e1b932f1b..d4e509da1 100644 --- a/configs/pit/README.md +++ b/configs/pit/README.md @@ -63,12 +63,12 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/pit/pit_xs_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/pit/pit_xs_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/poolformer/README.md b/configs/poolformer/README.md index 4a9ff4887..678fc0797 100644 --- a/configs/poolformer/README.md +++ b/configs/poolformer/README.md @@ -61,11 +61,11 @@ It is easy to reproduce the reported results with the pre-defined training recip ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/poolformer/poolformer_s12_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/poolformer/poolformer_s12_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/pvt/README.md b/configs/pvt/README.md index c9bda7259..6cf4d334b 100644 --- a/configs/pvt/README.md +++ b/configs/pvt/README.md @@ -62,13 +62,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/pvt/pvt_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/pvt/pvt_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. > If use Ascend 910 devices, need to open SATURATION_MODE via `export MS_ASCEND_CHECK_OVERFLOW_MODE="SATURATION_MODE"` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/pvtv2/README.md b/configs/pvtv2/README.md index 335903859..72928a27e 100644 --- a/configs/pvtv2/README.md +++ b/configs/pvtv2/README.md @@ -67,12 +67,12 @@ Ascend 910 devices, please run ```shell # distrubted training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/pvtv2/pvt_v2_b0_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/pvtv2/pvt_v2_b0_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/regnet/README.md b/configs/regnet/README.md index 1f6921f55..5f14169a6 100644 --- a/configs/regnet/README.md +++ b/configs/regnet/README.md @@ -71,12 +71,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/regnet/regnet_x_800mf_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/regnet/regnet_x_800mf_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/repmlp/README.md b/configs/repmlp/README.md index 88fa95ec1..de1b93beb 100644 --- a/configs/repmlp/README.md +++ b/configs/repmlp/README.md @@ -68,12 +68,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/repmlp/repmlp_t224_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/repmlp/repmlp_t224_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/repvgg/README.md b/configs/repvgg/README.md index 689c2f306..4319fa71c 100644 --- a/configs/repvgg/README.md +++ b/configs/repvgg/README.md @@ -86,12 +86,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/repvgg/repvgg_a1_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/repvgg/repvgg_a1_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/res2net/README.md b/configs/res2net/README.md index 148584c83..637504dc3 100644 --- a/configs/res2net/README.md +++ b/configs/res2net/README.md @@ -68,12 +68,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/res2net/res2net_50_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/res2net/res2net_50_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/resnest/README.md b/configs/resnest/README.md index 1dfe88180..13d270c95 100644 --- a/configs/resnest/README.md +++ b/configs/resnest/README.md @@ -61,12 +61,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/resnest/resnest50_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/resnest/resnest50_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/resnet/README.md b/configs/resnet/README.md index b62df84bd..1f695cc73 100644 --- a/configs/resnet/README.md +++ b/configs/resnet/README.md @@ -66,12 +66,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/resnet/resnet_18_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/resnet/resnet_18_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/resnetv2/README.md b/configs/resnetv2/README.md index d690c4504..1b6dce6cb 100644 --- a/configs/resnetv2/README.md +++ b/configs/resnetv2/README.md @@ -65,12 +65,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/resnetv2/resnetv2_50_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/resnetv2/resnetv2_50_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/resnext/README.md b/configs/resnext/README.md index 2850bf607..e89a9ecd9 100644 --- a/configs/resnext/README.md +++ b/configs/resnext/README.md @@ -69,12 +69,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/resnext/resnext50_32x4d_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/resnext/resnext50_32x4d_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. 
+ +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/rexnet/README.md b/configs/rexnet/README.md index bafbc88db..b64d6b067 100644 --- a/configs/rexnet/README.md +++ b/configs/rexnet/README.md @@ -60,12 +60,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/rexnet/rexnet_x09_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/rexnet/rexnet_x09_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/senet/README.md b/configs/senet/README.md index dd7c963de..6d47c9586 100644 --- a/configs/senet/README.md +++ b/configs/senet/README.md @@ -68,12 +68,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/senet/seresnet50_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/senet/seresnet50_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/shufflenetv1/README.md b/configs/shufflenetv1/README.md index fd43aea50..3cda8b160 100644 --- a/configs/shufflenetv1/README.md +++ b/configs/shufflenetv1/README.md @@ -67,12 +67,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/shufflenetv1/shufflenet_v1_0.5_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/shufflenetv1/shufflenet_v1_0.5_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/shufflenetv2/README.md b/configs/shufflenetv2/README.md index 196f4b03b..32b9b410e 100644 --- a/configs/shufflenetv2/README.md +++ b/configs/shufflenetv2/README.md @@ -80,10 +80,10 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/shufflenetv2/shufflenet_v2_0.5_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/shufflenetv2/shufflenet_v2_0.5_ascend.yaml --data_dir /path/to/imagenet ``` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/sknet/README.md b/configs/sknet/README.md index e2ea9ce2e..ba54585ae 100644 --- a/configs/sknet/README.md +++ b/configs/sknet/README.md @@ -74,10 +74,10 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/sknet/skresnext50_32x4d_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/sknet/skresnext50_32x4d_ascend.yaml --data_dir /path/to/imagenet ``` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/squeezenet/README.md b/configs/squeezenet/README.md index 3a1e15db1..498691ca4 100644 --- a/configs/squeezenet/README.md +++ b/configs/squeezenet/README.md @@ -71,10 +71,10 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/squeezenet/squeezenet_1.0_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/squeezenet/squeezenet_1.0_ascend.yaml --data_dir /path/to/imagenet ``` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/swintransformer/README.md b/configs/swintransformer/README.md index 5c45e3ac6..584fd6fd0 100644 --- a/configs/swintransformer/README.md +++ b/configs/swintransformer/README.md @@ -87,12 +87,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/swintransformer/swin_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/swintransformer/swin_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/swintransformerv2/README.md b/configs/swintransformerv2/README.md index aa97b1a95..5b00f7bb2 100644 --- a/configs/swintransformerv2/README.md +++ b/configs/swintransformerv2/README.md @@ -70,12 +70,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/swintransformerv2/swinv2_tiny_window8_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/swintransformerv2/swinv2_tiny_window8_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. 
For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/vgg/README.md b/configs/vgg/README.md index cde8310c1..8196d9f9c 100644 --- a/configs/vgg/README.md +++ b/configs/vgg/README.md @@ -85,12 +85,10 @@ Ascend 910 devices, please run ```shell # distrubted training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/vgg/vgg16_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/vgg/vgg16_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. - -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/visformer/README.md b/configs/visformer/README.md index 1a1e688bc..e35463f27 100644 --- a/configs/visformer/README.md +++ b/configs/visformer/README.md @@ -70,10 +70,10 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/visformer/visformer_tiny_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/visformer/visformer_tiny_ascend.yaml --data_dir /path/to/imagenet ``` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/vit/README.md b/configs/vit/README.md index 562238afe..75614e1af 100644 --- a/configs/vit/README.md +++ b/configs/vit/README.md @@ -78,12 +78,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/vit/vit_b32_224_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/vit/vit_b32_224_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/volo/README.md b/configs/volo/README.md index 9948f4c09..fef65d7c7 100644 --- a/configs/volo/README.md +++ b/configs/volo/README.md @@ -63,12 +63,11 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/volo/volo_d1_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/volo/volo_d1_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun` -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/configs/xception/README.md b/configs/xception/README.md index efba12395..72622d25a 100644 --- a/configs/xception/README.md +++ b/configs/xception/README.md @@ -64,12 +64,12 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/xception/xception_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/xception/xception_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. + +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). diff --git a/configs/xcit/README.md b/configs/xcit/README.md index b408a02b0..7ff973331 100644 --- a/configs/xcit/README.md +++ b/configs/xcit/README.md @@ -67,11 +67,10 @@ Ascend 910 devices, please run ```shell # distributed training on multiple GPU/Ascend devices -mpirun -n 8 python train.py --config configs/xcit/xcit_tiny_12_p16_ascend.yaml --data_dir /path/to/imagenet +msrun --bind_core=True --worker_num 8 python train.py --config configs/xcit/xcit_tiny_12_p16_ascend.yaml --data_dir /path/to/imagenet ``` -> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. -> Similarly, you can train the model on multiple GPU devices with the above `mpirun` command. +Similarly, you can train the model on multiple GPU devices with the above `msrun` command. For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py). 
diff --git a/docs/en/index.md b/docs/en/index.md index 5bf194b33..4a3282675 100644 --- a/docs/en/index.md +++ b/docs/en/index.md @@ -109,15 +109,24 @@ It is easy to train your model on a standard or customized dataset using `train. - Distributed Training - For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore. + For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore. ```shell # distributed training # assume you have 4 GPUs/NPUs - mpirun -n 4 python train.py --distribute \ + msrun --bind_core=True --worker_num 4 python train.py --distribute \ --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet ``` - > Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`. + + Notice that if you are using msrun startup with 2 devices, please add `--bind_core=True` to improve performance. For example: + + ```shell + msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \ + --log_dir=msrun_log --join=True --cluster_time_out=300 \ + python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet + ``` + + > For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html Detailed parameter definitions can be seen in `config.py` and checked by running `python train.py --help'. @@ -128,7 +137,7 @@ It is easy to train your model on a standard or customized dataset using `train. You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file. 
```shell - mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml + msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml ``` !!! tip "Pre-defined Training Strategies" diff --git a/docs/zh/index.md b/docs/zh/index.md index 75bfa2a89..cc0784e8e 100644 --- a/docs/zh/index.md +++ b/docs/zh/index.md @@ -109,15 +109,25 @@ MindCV是一个基于 [MindSpore](https://www.mindspore.cn/) 开发的,致力 - 分布式训练 - 对于像ImageNet这样的大型数据集,有必要在多个设备上以分布式模式进行训练。基于MindSpore对分布式相关功能的良好支持,用户可以使用`mpirun`来进行模型的分布式训练。 + 对于像ImageNet这样的大型数据集,有必要在多个设备上以分布式模式进行训练。基于MindSpore对分布式相关功能的良好支持,用户可以使用`msrun`来进行模型的分布式训练。 ```shell # 分布式训练 # 假设你有4张GPU或者NPU卡 - mpirun --allow-run-as-root -n 4 python train.py --distribute \ + msrun --bind_core=True --worker_num 4 python train.py --distribute \ --model densenet121 --dataset imagenet --data_dir ./datasets/imagenet ``` + 注意,如果在两卡环境下选用msrun作为启动方式,请添加配置项 `--bind_core=True` 增加绑核操作以优化两卡性能,范例代码如下: + + ```shell + msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \ + --log_dir=msrun_log --join=True --cluster_time_out=300 \ + python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet + ``` + + > 如需更多操作指导,请参考 https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html + 完整的参数列表及说明在`config.py`中定义,可运行`python train.py --help`快速查看。 如需恢复训练,请指定`--ckpt_path`和`--ckpt_save_dir`参数,脚本将加载路径中的模型权重和优化器状态,并恢复中断的训练进程。 @@ -127,7 +137,7 @@ MindCV是一个基于 [MindSpore](https://www.mindspore.cn/) 开发的,致力 您可以编写yaml文件或设置外部参数来指定配置数据、模型、优化器等组件及其超参。以下是使用预设的训练策略(yaml文件)进行模型训练的示例。 ```shell - mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml + msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml ``` !!! 
tip "预定义的训练策略" diff --git a/examples/det/ssd/README.md b/examples/det/ssd/README.md index c4599f450..7f6b5d4be 100644 --- a/examples/det/ssd/README.md +++ b/examples/det/ssd/README.md @@ -69,15 +69,15 @@ Specify the path of the preprocessed dataset at keyword `data_dir` in the config It is highly recommended to use **distributed training** for this SSD implementation. -For distributed training using **OpenMPI's `mpirun`**, simply run +For distributed training using **`msrun`**, simply run ``` cd mindcv # change directory to the root of MindCV repository -mpirun -n [# of devices] python examples/det/ssd/train.py --config [the path to the config file] +msrun --bind_core=True --worker_num [# of devices] python examples/det/ssd/train.py --config [the path to the config file] ``` For example, if train SSD distributively with the `MobileNetV2` configuration on 8 devices, run ``` cd mindcv # change directory to the root of MindCV repository -mpirun -n 8 python examples/det/ssd/train.py --config examples/det/ssd/ssd_mobilenetv2.yaml +msrun --bind_core=True --worker_num 8 python examples/det/ssd/train.py --config examples/det/ssd/ssd_mobilenetv2.yaml ``` For distributed training with [Ascend rank table](https://github.com/mindspore-lab/mindocr/blob/main/docs/en/tutorials/distribute_train.md#12-configure-rank_table_file-for-training), configure `ascend8p.sh` as follows diff --git a/examples/seg/deeplabv3/README.md b/examples/seg/deeplabv3/README.md index 9d4b4309c..c6d4844fa 100644 --- a/examples/seg/deeplabv3/README.md +++ b/examples/seg/deeplabv3/README.md @@ -80,9 +80,9 @@ Specify `deeplabv3` or `deeplabv3plus` at the key word `model` in the config f It is highly recommended to use **distributed training** for this DeepLabV3 and DeepLabV3+ implementation. 
-For distributed training using **OpenMPI's `mpirun`**, simply run +For distributed training using **`msrun`**, simply run ```shell -mpirun -n [# of devices] python examples/seg/deeplabv3/train.py --config [the path to the config file] +msrun --bind_core=True --worker_num [# of devices] python examples/seg/deeplabv3/train.py --config [the path to the config file] ``` For distributed training with [Ascend rank table](https://github.com/mindspore-lab/mindocr/blob/main/docs/en/tutorials/distribute_train.md#12-configure-rank_table_file-for-training), configure `ascend8p.sh` as follows @@ -110,26 +110,26 @@ For single-device training, simply set the keyword ``distributed`` to ``False`` python examples/seg/deeplabv3/train.py --config [the path to the config file] ``` -**Take mpirun command as an example, the training steps are as follow**: +**Take msrun command as an example, the training steps are as follow**: - Step 1: Employ output_stride=16 and fine-tune pretrained resnet101 on *trainaug* dataset. In config file, please specify the path of pretrained backbone checkpoint in keyword `backbone_ckpt_path` and set `output_stride` to `16`. ```shell # for deeplabv3 - mpirun -n 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3_s16_dilated_resnet101.yaml + msrun --bind_core=True --worker_num 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3_s16_dilated_resnet101.yaml # for deeplabv3+ - mpirun -n 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3plus_s16_dilated_resnet101.yaml + msrun --bind_core=True --worker_num 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3plus_s16_dilated_resnet101.yaml ``` - Step 2: Employ output_stride=8, fine-tune model from step 1 on *trainaug* dataset with smaller base learning rate. 
In config file, please specify the path of checkpoint from previous step in `ckpt_path`, set `ckpt_pre_trained` to `True` and set `output_stride` to `8` . ```shell # for deeplabv3 - mpirun -n 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3_s8_dilated_resnet101.yaml + msrun --bind_core=True --worker_num 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3_s8_dilated_resnet101.yaml # for deeplabv3+ - mpirun -n 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3plus_s8_dilated_resnet101.yaml + msrun --bind_core=True --worker_num 8 python examples/seg/deeplabv3/train.py --config examples/seg/deeplabv3/config/deeplabv3plus_s8_dilated_resnet101.yaml ``` > If use Ascend 910 devices, need to open SATURATION_MODE via `export MS_ASCEND_CHECK_OVERFLOW_MODE="SATURATION_MODE"`. diff --git a/scripts/README.md b/scripts/README.md index 957966b66..7b66b2f1a 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -32,7 +32,7 @@ python -m build A simple clean launcher for distributed training on **_Ascend_**. Following [instruction](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.1/parallel/startup_method.html) from Mindspore, -except launching distributed training with `mpirun`, we can also use multiprocess +except launching distributed training with `msrun`, we can also use multiprocess with multi-card networking configuration `rank_table.json` to manually start a process on each card. To get `rank_table.json` on your machine, try the hccl tools from [here](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools).
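The substitution applied throughout this change (`mpirun -n N` becomes `msrun --bind_core=True --worker_num N`) can be sketched as a small dry-run helper. This is only an illustration: `build_msrun_cmd` is a hypothetical function that is not part of MindCV, and setting `--local_worker_num` equal to `--worker_num` assumes all devices sit on a single machine, as in the 2-device example above. It prints the command rather than launching training.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MindCV): prints the msrun command line
# for a given number of devices, assuming single-machine training.
build_msrun_cmd() {
  local workers="$1"   # device count, formerly passed as `mpirun -n`
  shift                # remaining arguments form the training command
  echo "msrun --bind_core=True --worker_num=${workers} --local_worker_num=${workers} $*"
}

# Dry run: print the command instead of starting a training job.
build_msrun_cmd 8 python train.py --config configs/resnet/resnet_18_ascend.yaml --data_dir /path/to/imagenet
```

Printing the command first makes it easy to check the flags (in particular the space between `--worker_num` and `--local_worker_num`) before running it on real devices.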