support online & offline distributed inference #143

Merged
merged 1 commit into mindspore-lab:master on Jun 16, 2023

Conversation

Mark-ZhouWX (Collaborator)

Thank you for your contribution to the MindYOLO repo.
Before submitting this PR, please make sure:

Motivation

  1. Support online & offline distributed inference to increase inference speed.

Test Plan

(How should this PR be tested? Do you require special setup to run the test or repro the fixed bug?)

Related Issues and PRs

closes #142
(Is this PR part of a group of changes? Link the other relevant PRs and Issues here. Use https://help.github.com/en/articles/closing-issues-using-keywords for help on GitHub syntax)

@Mark-ZhouWX added the inside-test (issue filed by an internal developer) and rfc (requirement issue) labels on Jun 15, 2023
@Mark-ZhouWX added this to the mindyolo-0.1 milestone on Jun 15, 2023
@Mark-ZhouWX self-assigned this on Jun 15, 2023
@@ -37,6 +37,19 @@ def __init__(self, logger_name="MindYOLO"):
         self.device_per_servers = 8
         self.formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
 
+    def write(self, msg):
Collaborator:

What does this method do? How is it different from info?

Collaborator (Author):

These two methods (write, flush) redirect print output from third-party libraries (such as the COCO API) into the mindyolo logger system.
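
A minimal sketch of that redirection idea (using a hypothetical RedirectableLogger class, not the repo's actual logger): any object exposing write and flush is file-like, so it can be assigned to sys.stdout to capture plain print output.

import logging
import sys

class RedirectableLogger:
    def __init__(self, logger_name="MindYOLO"):
        self.logger = logging.getLogger(logger_name)
        # Emit to the real stdout so redirected messages still reach the console.
        self.logger.addHandler(logging.StreamHandler(sys.__stdout__))
        self.logger.setLevel(logging.INFO)

    def info(self, msg):
        self.logger.info(msg)

    # write/flush make this object file-like, so it can stand in for sys.stdout
    def write(self, msg):
        msg = msg.strip()
        if msg:
            self.info(msg)

    def flush(self):
        # Nothing is buffered; present only to satisfy the file interface.
        pass

logger = RedirectableLogger()
sys.stdout = logger              # print() from third-party code now goes through the logger
print("captured by the logger")  # handled by write()
sys.stdout = sys.__stdout__      # restore the original stdout afterwards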

@@ -31,7 +31,8 @@ def set_default(args):
     if args.is_parallel:
         init()
         args.rank, args.rank_size, parallel_mode = get_rank(), get_group_size(), ParallelMode.DATA_PARALLEL
-        context.set_auto_parallel_context(device_num=args.rank_size, parallel_mode=parallel_mode, gradients_mean=True)
+        context.set_auto_parallel_context(device_num=args.rank_size, parallel_mode=parallel_mode, gradients_mean=True,
+                                          parameter_broadcast=True)
Collaborator:

Why add this? On the CV side, accuracy dropped after this was added.

Mark-ZhouWX (Collaborator, Author) commented on Jun 16, 2023:

I wanted a hard guarantee here. I ran two simple experiments: 1) compared whether the network's initial parameters are identical with and without this option; 2) compared whether the eval accuracy on yolox-tiny after 25, 50, and 100 training epochs is close with and without it. Both answers were yes, so I will remove it.

@@ -55,7 +56,9 @@ def set_default(args):
         args.config,
     )
     # Directories and Save run settings
-    args.save_dir = os.path.join(args.save_dir, datetime.now().strftime("%Y.%m.%d-%H_%M_%S"))
+    time = get_broadcast_datetime(rank_size=args.rank_size)
Collaborator:

A check is needed here; calling Broadcast on a single card raises an error.

Mark-ZhouWX (Collaborator, Author) commented on Jun 16, 2023:

The check is in place; the single-card path does not go through broadcast. I also tested that initializing the Broadcast operator on a single card is fine; it just cannot be called.
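
A minimal sketch of such a guarded helper, assuming MindSpore's ops.Broadcast (the PR's actual get_broadcast_datetime may differ):

from datetime import datetime

import mindspore as ms
from mindspore import Tensor, ops

def get_broadcast_datetime(rank_size=1, root_rank=0):
    # Return the current time as a list of ints that is identical on every rank,
    # so all ranks build the same save_dir name.
    time = datetime.now()
    time_list = [time.year, time.month, time.day, time.hour,
                 time.minute, time.second, time.microsecond]
    if rank_size <= 1:
        # Single card: never call Broadcast, avoiding the error noted above.
        return time_list

    # Multi-card: broadcast rank 0's timestamp to every other rank.
    broadcast = ops.Broadcast(root_rank)
    shared = broadcast((Tensor(time_list, dtype=ms.int32),))
    return shared[0].asnumpy().tolist()

The save_dir timestamp string can then be formatted from this shared list so every rank writes to the same run directory.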

@@ -79,7 +90,9 @@ def set_default_test(args):
         args.config,
     )
     # Directories and Save run settings
-    args.save_dir = os.path.join(args.save_dir, datetime.now().strftime("%Y.%m.%d-%H:%M:%S"))
+    time = get_broadcast_datetime(rank_size=args.rank_size)
Collaborator:

Same as above.

Collaborator (Author):

Ditto.

class Synchronizer:
    def __init__(self, rank_size=1):
        # this init method should be run only once
        self.all_reduce = AllReduce()
Collaborator:

Initializing the AllReduce operator on a single card may cause problems; the rank_size check could be moved above this.

Mark-ZhouWX (Collaborator, Author) commented on Jun 16, 2023:

I have tested this: initializing the AllReduce operator on a single card is not a problem. In addition, Synchronizer is never constructed on a single card, so allreduce is never called. I have also run the single-card and multi-card cases several times and the program runs correctly.

Collaborator:

Once this is initialized, rank_size is fixed, so the check can be done directly in __init__; then both single-card and multi-card callers can use it without an external check.
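
A minimal sketch of that suggestion, assuming MindSpore's ops.AllReduce (illustrative only, not the merged implementation):

import mindspore as ms
from mindspore import Tensor, ops

class Synchronizer:
    def __init__(self, rank_size=1):
        # this init method should be run only once
        self.rank_size = rank_size
        # Create the communication op only when more than one card is used,
        # so single-card runs never touch AllReduce.
        self.all_reduce = ops.AllReduce() if rank_size > 1 else None

    def __call__(self):
        # Block until every rank reaches this point; a no-op on a single card.
        if self.rank_size <= 1:
            return
        self.all_reduce(Tensor([1], dtype=ms.int32))

With the check inside the class, callers can construct Synchronizer(args.rank_size) and invoke it unconditionally in both single-card and multi-card runs.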

@CaitinZhao merged commit 4680b63 into mindspore-lab:master on Jun 16, 2023
Labels
inside-test (issue filed by an internal developer), rfc (requirement issue)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New Feature] support distributed inference offline and online (eval while train)
3 participants