This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

PAI training service: EARLY stopped trial job status is "USER_CANCELED" #2215

Closed
chicm-ms opened this issue Mar 21, 2020 · 1 comment

@chicm-ms
Contributor

chicm-ms commented Mar 21, 2020

assessor: median stop
maxTrialNum: 8
trialConcurrency: 8

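This problem has been reproduced multiple times on the PAI IT pipeline. The trial is killed by the assessor (early stopped), but its final status is reported as USER_CANCELED: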

[3/22/2020, 1:42:46 AM] INFO [ 'NNIManager received command from dispatcher: KI, "pVTgC"' ]
[3/22/2020, 1:42:46 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:46 AM] INFO [ 'POST: /stdout/tL9oppaY/pVTgC: body:\n{\n    "tag": "trial",\n    "stdOutputType": "Stdout",\n    "msg": "NNISDK_MEb\'{\\"sequence\\": 17, \\"parameter_id\\": 4, \\"trial_job_id\\": \\"pVTgC\\", \\"value\\": \\"-0.7386407815195938\\", \\"type\\": \\"PERIODICAL\\"}\'"\n}' ]
[3/22/2020, 1:42:46 AM] INFO [ 'NNIManager received command from dispatcher: KI, "pVTgC"' ]
[3/22/2020, 1:42:46 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'POST: /stdout/tL9oppaY/pVTgC: body:\n{\n    "tag": "trial",\n    "stdOutputType": "Stdout",\n    "msg": "NNISDK_MEb\'{\\"sequence\\": 18, \\"parameter_id\\": 4, \\"trial_job_id\\": \\"pVTgC\\", \\"value\\": \\"-0.8328627067207651\\", \\"type\\": \\"PERIODICAL\\"}\'"\n}' ]
[3/22/2020, 1:42:47 AM] INFO [ 'NNIManager received command from dispatcher: KI, "pVTgC"' ]
[3/22/2020, 1:42:47 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'POST: /stdout/tL9oppaY/pVTgC: body:\n{\n    "tag": "trial",\n    "stdOutputType": "Stdout",\n    "msg": "NNISDK_MEb\'{\\"sequence\\": 19, \\"parameter_id\\": 4, \\"trial_job_id\\": \\"pVTgC\\", \\"value\\": \\"-0.9026275312848\\", \\"type\\": \\"PERIODICAL\\"}\'"\n}' ]
[3/22/2020, 1:42:47 AM] INFO [ 'NNIManager received command from dispatcher: KI, "pVTgC"' ]
[3/22/2020, 1:42:47 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'Trial job pVTgC status changed from RUNNING to USER_CANCELED' ]
[3/22/2020, 1:42:47 AM] INFO [ 'Trial job CNDTO status changed from WAITING to RUNNING' ]
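
To make the pattern in the log easier to see, here is a minimal Python sketch that counts `cancelTrialJob` calls per trial and extracts the status transitions. The regular expressions and the `summarize` helper are illustrative assumptions based only on the line format shown above; they are not part of NNI.

```python
import re
from collections import Counter

# Patterns inferred from the log lines above (assumption: the format is stable).
CANCEL_RE = re.compile(r"cancelTrialJob: (\w+)")
TRANSITION_RE = re.compile(r"Trial job (\w+) status changed from (\w+) to (\w+)")


def summarize(log_text):
    """Return (cancel counts per trial id, list of (trial, old, new) transitions)."""
    cancels = Counter(CANCEL_RE.findall(log_text))
    transitions = TRANSITION_RE.findall(log_text)
    return cancels, transitions


# Excerpt of the log above, with the POST /stdout lines omitted.
SAMPLE_LOG = """
[3/22/2020, 1:42:46 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:46 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'cancelTrialJob: pVTgC' ]
[3/22/2020, 1:42:47 AM] INFO [ 'Trial job pVTgC status changed from RUNNING to USER_CANCELED' ]
[3/22/2020, 1:42:47 AM] INFO [ 'Trial job CNDTO status changed from WAITING to RUNNING' ]
"""

cancels, transitions = summarize(SAMPLE_LOG)
for trial_id, count in cancels.items():
    print(f"{trial_id}: cancelTrialJob called {count} time(s)")
for trial_id, old, new in transitions:
    print(f"{trial_id}: {old} -> {new}")
```

In this excerpt the dispatcher-triggered cancel is issued several times for the same trial before the status changes, and the final status is USER_CANCELED rather than EARLY_STOPPED.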

@SparkSnail
Contributor

SparkSnail commented Mar 24, 2020

This issue is caused by a PAI status update error: when a job in PAI is in the 'STOPPING' status, the NNI client does not parse this status correctly and sets the trial to USER_CANCELED.
Fix in #2229
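
For readers who want a concrete picture of the status mapping involved, below is a minimal Python sketch. It is not the patch in #2229 and not NNI's actual code (the NNI manager is written in TypeScript). The function `map_pai_state`, the `is_early_stopped` flag, and the choice to treat 'STOPPING' as a transitional state that keeps the current trial status are all assumptions made for illustration.

```python
from enum import Enum


class TrialStatus(Enum):
    WAITING = "WAITING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    USER_CANCELED = "USER_CANCELED"
    EARLY_STOPPED = "EARLY_STOPPED"


def map_pai_state(pai_state, is_early_stopped, current):
    """Map a PAI job state string to a trial status (hypothetical helper).

    is_early_stopped: assumed flag, set when the cancel came from the assessor.
    current: the trial's status before this update.
    """
    if pai_state == "WAITING":
        return TrialStatus.WAITING
    if pai_state == "RUNNING":
        return TrialStatus.RUNNING
    if pai_state == "SUCCEEDED":
        return TrialStatus.SUCCEEDED
    if pai_state == "FAILED":
        return TrialStatus.FAILED
    if pai_state == "STOPPING":
        # Assumption: the job is still shutting down, so do not finalize yet.
        return current
    if pai_state == "STOPPED":
        # Distinguish an assessor kill from a user-initiated cancel.
        if is_early_stopped:
            return TrialStatus.EARLY_STOPPED
        return TrialStatus.USER_CANCELED
    # Unknown state: leave the trial status unchanged.
    return current


# An early-stopped trial passing through STOPPING and then STOPPED.
status = TrialStatus.RUNNING
for state in ("STOPPING", "STOPPED"):
    status = map_pai_state(state, is_early_stopped=True, current=status)
print(status.value)
```

The point of the sketch is only that the final status should depend on who requested the stop, so an assessor-triggered kill ends as EARLY_STOPPED while a manual cancel still ends as USER_CANCELED.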
