-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835
Conversation
c09bd44
to
adb6b99
Compare
Pull Request Test Coverage Report for Build 5302424174
💛 - Coveralls |
Maybe, this PR will resolve #1832? |
adb6b99
to
f5a4117
Compare
cc: @lowang-bh |
if client.is_job_running(name, namespace, job_kind): | ||
raise Exception(f"{job_kind} shouldn't be in Running condition") | ||
# Job shouldn't have a Running condition. | ||
if client.is_job_running(name, namespace, job_kind): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what you mean by "pending for a while"? Are you referring to a situation which is in created but not running? If then, job will get into running state after retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant unschedulable pods (gang scheduling).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If before the training-operator updates the job condition from Runnng=false
to Running=true
, this test code gets the job condition and the job condition has Running=false
or doesn't have Running
condition, this test unintended passes.
So, let's imagine the following situation:
Current e2e:
- Test: Deploy job with gang scheduling setting (.runPolicy.schedulingPolicy).
- Operator: Failed to set schedulerName=volcano to the job. Or create an incorrect PodGroup.
- Test: Get the job with
Running=false
or withoutRunning
condition. - Pods: Pods are immediately scheduled to Node and start since the job doesn't have appropriate gang scheduling settings.
- Operator: Update the job condition with
Running=true
. - Test: Succeeded! (Unintended)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This verify_unschedulable_job_e2e
function verifies that gang scheduler integrations work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation.
elif gang_scheduler_name == TEST_GANG_SCHEDULER_NAME_VOLCANO: | ||
return "" | ||
return TEST_GANG_SCHEDULER_NAME_VOLCANO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge In fact, even though we forgot to set volcano
to schedulerName
in the podSpec, e2e passed in #1831.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary and it used to set the scheduler for gang-schedule e2e. I forget to changed it in last pr, sorry.
f5a4117
to
752c266
Compare
…ions in e2e Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
752c266
to
6f47848
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
When enabling the gang-scheduling, we don't check whether jobs have been pending for a while in e2e.
So tests for the gang-scheduling will pass if jobs meet the
Created=true
andRunning=false
conditions for just a moment.I added a check that jobs have been pending for a while.
Also, I fixed a test bug that
volcano
isn't set to jobs as a schedulerName when testing for volcano integration.Note: I faced errors in #1834.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: