- See details in job-exit-spec.yaml
- This markdown file is generated by update_markdown.py from job-exit-spec.yaml
- See the full doc in the PAI Job Exit Spec User Manual
field | description | required | unique | type | range |
---|---|---|---|---|---|
code | The PAI Job ExitCode | True | True | Integer | begin: -8000, end: 256 |
phrase | The textual phrase representation of this ExitCode | True | True | String | Any |
issuer | Who ultimately issued this ExitCode | False | False | Enum | 1. USER_CONTAINER 2. PAI_OS 3. PAI_RUNTIME 4. PAI_YARN 5. PAI_LAUNCHER |
causer | Who ultimately caused this ExitCode | False | False | Enum | 1. USER_SUBMISSION 2. USER_CONTAINER 3. USER_STOP 4. USER_DELETION 5. USER_RETRY 6. USER_UPGRADE 7. RESOURCE_ALLOCATION_TIMEOUT 8. PAI_HDFS 9. PAI_OS 10. PAI_DOCKER 11. PAI_RUNTIME 12. PAI_YARN 13. PAI_LAUNCHER 14. UNKNOWN |
type | The general type of this ExitCode | False | False | Enum | 1. USER_SUCCESS 2. USER_STOP 3. USER_FAILURE 4. PLATFORM_FAILURE 5. RESOURCE_ALLOCATION_TIMEOUT 6. UNKNOWN_FAILURE |
stage | The user process stage just before this ExitCode was issued | False | False | Enum | 1. SUBMITTING 2. ALLOCATING 3. LAUNCHING 4. RUNNING 5. COMPLETING 6. UNKNOWN |
behavior | The rerun behavior of this ExitCode | False | False | Enum | 1. TRANSIENT_NORMAL 2. TRANSIENT_CONFLICT 3. NON_TRANSIENT 4. UNKNOWN |
reaction | The reaction to this ExitCode, executed automatically by PAI | False | False | Enum | 1. ALWAYS_RETRY 2. ALWAYS_BACKOFF_RETRY 3. RETRY_TO_MAX 4. NEVER_RETRY |
reason | Why this ExitCode is issued | False | False | String | Any |
repro | Specific steps to reproduce this ExitCode | False | False | List<String> | Any |
solution | Optional solutions to resolve this ExitCode if it indicates a failure | False | False | List<String> | Any |
pattern | The pattern PAI uses to detect this ExitCode | False | False | String | Any, such as USER_EXITCODE=X && USER_LOG_PATTERN=Y || OS_Signal=Z |
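To make the schema concrete, here is a minimal sketch of one spec entry, written as a Python dict for consistency with update_markdown.py. The field values are taken from the CONTAINER_KILLED_BY_SIGINT row below; the actual serialization in job-exit-spec.yaml remains the authoritative format.

```python
# Illustrative only: one exit-code entry expressed with the fields
# defined above. job-exit-spec.yaml is the source of truth for the
# actual serialization format.
sigint_entry = {
    "code": 130,                              # Integer in [-8000, 256], unique
    "phrase": "CONTAINER_KILLED_BY_SIGINT",   # String, unique
    "issuer": "PAI_OS",                       # Enum: who issued the code
    "causer": "PAI_OS",                       # Enum: who caused the code
    "type": "PLATFORM_FAILURE",               # Enum: general type
    "stage": "RUNNING",                       # Enum: user process stage
    "behavior": "TRANSIENT_NORMAL",           # Enum: rerun behavior
    "reaction": "ALWAYS_RETRY",               # Enum: PAI's automatic reaction
    "reason": "Container killed by OS Signal: SIGINT",
    "repro": ["Kill container process by SIGINT"],         # List<String>
    "solution": ["Wait for result from next retry",
                 "Contact Cluster Admin"],                 # List<String>
}
```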
- You may need to scroll right to see the full table.
- The code 256 represents all positive exitcodes undefined in this spec; the specific undefined exitcode always overrides it when exposed to the user.
- The code -8000 represents all negative exitcodes undefined in this spec; the specific undefined exitcode always overrides it when exposed to the user, as sketched below.
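The catch-all convention above can be restated in a few lines of Python; `known_codes` here is an assumed set built from the spec entries, not an existing helper:

```python
# Minimal sketch of the 256 / -8000 catch-all convention: an exitcode
# undefined in this spec maps to one of the two catch-all entries for
# classification, while the raw code itself is still exposed to the user.
def classify(raw_code: int, known_codes: set) -> tuple:
    """Return (spec_code, exposed_code) for a raw container exitcode."""
    if raw_code in known_codes:
        return raw_code, raw_code
    spec_code = 256 if raw_code > 0 else -8000  # undefined positive / negative
    return spec_code, raw_code                  # raw code overrides for display

# e.g. classify(-886, known_codes) -> (-8000, -886): classified as
# CONTAINER_UNKNOWN_YARN_EXIT_STATUS, but -886 is what the user sees.
```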
code | phrase | issuer | causer | type | stage | behavior | reaction | reason | repro | solution | pattern |
---|---|---|---|---|---|---|---|---|---|---|---|
154 | CONTAINER_EXIT_CODE_FILE_LOST | PAI_YARN | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exitcode file cannot be found by YARN NM, maybe the node shut down unexpectedly, the disk was cleaned up, or the disk failed | 1. Stop YARN NM 2. Kill container process 3. Delete container exitcode file 4. Start YARN NM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
130 | CONTAINER_KILLED_BY_SIGINT | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGINT | 1. Kill container process by SIGINT | 1. Wait for result from next retry 2. Contact Cluster Admin | |
132 | CONTAINER_KILLED_BY_SIGILL | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGILL | 1. User program executes an illegal, malformed, unknown, or privileged machine instruction | 1. Check container log and fix your program bug | |
134 | CONTAINER_KILLED_BY_SIGABRT | USER_CONTAINER | UNKNOWN | UNKNOWN_FAILURE | RUNNING | UNKNOWN | RETRY_TO_MAX | Container killed by OS Signal: SIGABRT | 1. User program calls abort() in libc | 1. Check container log and find root cause 2. Wait for result from next retry | |
135 | CONTAINER_KILLED_BY_SIGBUS | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGBUS | 1. User program accesses an unaligned memory address | 1. Check container log and fix your program bug | |
136 | CONTAINER_KILLED_BY_SIGFPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGFPE | 1. User program divides by zero | 1. Check container log and fix your program bug | |
137 | CONTAINER_KILLED_BY_SIGKILL | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGKILL | 1. Kill container process by SIGKILL | 1. Wait for result from next retry 2. Contact Cluster Admin | |
139 | CONTAINER_KILLED_BY_SIGSEGV | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGSEGV | 1. User program accesses an illegal memory address | 1. Check container log and fix your program bug | |
141 | CONTAINER_KILLED_BY_SIGPIPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGPIPE | 1. User program writes to a pipe without a process connected to the other end | 1. Check container log and fix your program bug | |
143 | CONTAINER_KILLED_BY_SIGTERM | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGTERM | 1. Kill container process by SIGTERM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
193 | CONTAINER_DOCKER_RUN_FAILED | PAI_RUNTIME | PAI_DOCKER | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by docker run | 1. PAI Runtime calls docker run with an unknown flag | 1. Wait for result from next retry 2. Contact Cluster Admin | |
196 | CONTAINER_OOM_KILLED_BY_DOCKER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by Docker because it exceeded the requested memory | 1. User program uses more memory than it requested | 1. Increase per-task memory request 2. Decrease per-task memory usage, e.g. by increasing the task number | |
198 | CONTAINER_OOD_KILLED_BY_DISKCLEANER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by the disk cleaner because it used the most disk space while total container disk usage on the node exceeded the platform limit | 1. User program uses almost all disk space of the node | 1. Decrease per-task disk space usage, e.g. by increasing the task number | |
255 | CONTAINER_RUNTIME_UNKNOWN_FAILURE | PAI_RUNTIME | UNKNOWN | UNKNOWN_FAILURE | COMPLETING | UNKNOWN | RETRY_TO_MAX | Container failed but the failure cannot be recognized by PAI Runtime | 1. User program directly exits with exitcode 1 | 1. Check container log and find root cause 2. Wait for result from next retry | |
256 | CONTAINER_RUNTIME_EXIT_ABNORMALLY | PAI_RUNTIME | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | PAI Runtime exited abnormally with an undefined exitcode; it may have bugs | 1. PAI Runtime exits with exitcode 1 | 1. Contact PAI Dev to fix PAI Runtime bugs | |
0 | SUCCEEDED | USER_CONTAINER | USER_CONTAINER | USER_SUCCESS | COMPLETING | UNKNOWN | NEVER_RETRY | | 1. User program exits with exitcode 0 | | |
-7100 | CONTAINER_INVALID_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with an invalid exit status, maybe YARN failed to initialize the container environment | 1. Disable write permission for YARN NM to access {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7101 | CONTAINER_NOT_AVAILABLE_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with a not-available exit status, maybe YARN failed to create the container executor process | 1. Disable execute permission for YARN NM to access bash on *nix or winutils.exe on Windows | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7102 | CONTAINER_NODE_DISKS_FAILED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to a bad local disk, maybe no disk space left | 1. Set zero disk space for {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7103 | CONTAINER_PORT_CONFLICT | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to a local port conflict | 1. After the container is allocated and before it is started, stop the container's YARN NM 2. Occupy a port the container requested on the container node 3. Start the container's YARN NM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7110 | CONTAINER_ABORTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container aborted by YARN | 1. Corrupt the container entry in YARN NM state store | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7111 | CONTAINER_NODE_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost because its node was lost, maybe its YARN NM has been down for a long time | 1. Stop the container's YARN NM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7112 | CONTAINER_EXPIRED | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container expired because it was not launched on YARN NM in time, maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token 2. Set amGangAllocationTimeoutSec larger than yarn.resourcemanager.rm.container-allocation.expiry-interval-ms 3. Request more containers in a job than its virtual cluster's currently available resource | 1. Wait for result from next retry 2. Decrease task number 3. Decrease per-task resource request 4. Contact Cluster Admin to increase your virtual cluster quota | |
-7113 | CONTAINER_ABORTED_ON_AM_RESTART | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container aborted by YARN RM during Launcher AM restart, maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token 2. Request more containers in a job than its virtual cluster's currently available resource 3. Kill Launcher AM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7120 | CONTAINER_PREEMPTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container preempted by YARN RM, maybe its virtual cluster's overused resource was reclaimed | 1. Enable virtual cluster bonus token 2. Request more containers in a job than its virtual cluster's currently available resource 3. Use up all other virtual clusters' available resource | 1. Wait for result from next retry 2. Decrease task number 3. Decrease per-task resource request 4. Contact Cluster Admin to increase your virtual cluster quota 5. Contact Cluster Admin to disable your virtual cluster bonus token | |
-7121 | CONTAINER_RUNTIME_VIRTUAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested virtual memory | 1. PAI Runtime uses more virtual memory than its container requested | 1. Increase per-task virtual memory request 2. Contact PAI Dev to decrease PAI Runtime virtual memory usage | |
-7122 | CONTAINER_RUNTIME_PHYSICAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested physical memory | 1. PAI Runtime uses more physical memory than its container requested | 1. Increase per-task physical memory request 2. Contact PAI Dev to decrease PAI Runtime physical memory usage | |
-7123 | CONTAINER_KILLED_BY_AM | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher AM, maybe the allocated container was rejected | 1. Set up a single node cluster 2. Submit a job with two tasks and antiaffinityAllocation enabled 3. Launcher rejects an allocated container whose node has already been allocated another container | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7124 | CONTAINER_KILLED_BY_RM | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM, maybe the container is no longer managed by YARN RM | 1. Delete the container's app entry in YARN RM state store | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7125 | CONTAINER_KILLED_ON_APP_COMPLETION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM because its app has already completed | 1. Stop Launcher AM container's YARN NM 2. Kill the container's app | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7126 | CONTAINER_EXTERNAL_UTILIZATION_SPIKED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN because external utilization spiked | 1. Enable YARN external utilization check 2. Start a raw process to use up almost all memory on the node | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7150 | CONTAINER_NM_LAUNCH_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container failed to launch on YARN NM | 1. After the container is allocated and before it is started, stop the container's YARN NM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7151 | CONTAINER_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM 2. Restart YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7152 | CONTAINER_RM_RESYNC_EXCEEDED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container exceeded after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM 2. Restart YARN RM 3. Wait until AM releases the container 4. Start the container's YARN NM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7153 | CONTAINER_MIGRATE_TASK_REQUESTED | PAI_LAUNCHER | USER_RETRY | USER_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher due to user MigrateTaskRequest | 1. Send MigrateTaskRequest for the container | 1. Wait for result from next retry | |
-7154 | CONTAINER_AGENT_EXPIRED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher because no Launcher Agent heartbeat was received in time | 1. Enable Launcher Agent 2. Bring down the container's node | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7200 | AM_RM_HEARTBEAT_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed to heartbeat with YARN RM due to YarnException, maybe the App is non-compliant | 1. Submit a job with an invalid node label | 1. Check diagnostics and revise your job config | |
-7201 | AM_RM_HEARTBEAT_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to IOException, maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7202 | AM_RM_HEARTBEAT_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to an unknown Exception | 1. AM sends an invalid message to YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7203 | AM_RM_HEARTBEAT_SHUTDOWN_REQUESTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to ShutdownRequest, maybe AM is no longer managed by YARN RM | 1. Set a small AM expiry time 2. Set up a network partition between AM and YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7250 | AM_UNKNOWN_EXCEPTION | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed due to an unknown Exception | 1. Set up a network partition between AM and ZK | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7251 | AM_NON_TRANSIENT_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed due to NonTransientException, maybe the App is non-compliant | 1. Submit a job with an invalid data dir | 1. Check diagnostics and revise your job config | |
-7252 | AM_GANG_ALLOCATION_TIMEOUT | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Launcher AM failed because all the requested resource could not be satisfied in time | 1. Disable virtual cluster bonus token 2. Request more containers in a job than its virtual cluster's currently available resource | 1. Wait for result from next retry 2. Decrease task number 3. Decrease per-task resource request 4. Contact Cluster Admin to increase your virtual cluster quota | |
-7300 | APP_SUBMISSION_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Failed to submit App to YARN RM due to YarnException, maybe the App is non-compliant | 1. Submit a job to an invalid virtual cluster | 1. Check diagnostics and revise your job config | |
-7301 | APP_SUBMISSION_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | SUBMITTING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to submit App to YARN RM due to IOException, maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7302 | APP_SUBMISSION_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | UNKNOWN_FAILURE | SUBMITTING | UNKNOWN | RETRY_TO_MAX | Failed to submit App to YARN RM due to an unknown Exception | 1. Launcher Service sends an invalid message to YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7303 | APP_KILLED_UNEXPECTEDLY | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App killed unexpectedly and directly through YARN RM | 1. Kill the app directly through YARN RM | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7350 | APP_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App lost after Launcher Service resynced with YARN RM | 1. Delete the app entry in YARN RM state store | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7351 | APP_STOP_FRAMEWORK_REQUESTED | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | App stopped by Launcher due to user StopFrameworkRequest | 1. Stop a job | | |
-7352 | APP_AM_DIAGNOSTICS_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to retrieve AMDiagnostics from YARN, maybe the App was cleaned up in YARN | 1. App is in APPLICATION_RETRIEVING_DIAGNOSTICS state 2. Stop Launcher Service 3. Delete the app entry in YARN RM state store 4. Start Launcher Service | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7353 | APP_AM_DIAGNOSTICS_DESERIALIZATION_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to deserialize AMDiagnostics from YARN, maybe it is corrupted, or Launcher AM unexpectedly crashed frequently without generating AMDiagnostics | 1. Set yarn.app.attempt.diagnostics.limit.kc to 1B | 1. Wait for result from next retry 2. Contact Cluster Admin | |
-7400 | TASK_STOPPED_ON_APP_COMPLETION | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Task stopped by Launcher because its app has already completed | 1. Stop a job with a long-running container | | |
-8000 | CONTAINER_UNKNOWN_YARN_EXIT_STATUS | PAI_YARN | UNKNOWN | UNKNOWN_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | Container exited with an unknown exitcode issued from YARN | 1. Change YARN code to make it return container exitcode -886 | 1. Contact PAI Dev to recognize this exitcode | |
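As a closing usage sketch (not part of the spec): the 130 to 143 rows follow the usual shell convention that a process killed by signal N exits with code 128 + N, and the reaction column tells a client what PAI will do automatically. The loader below assumes PyYAML and that job-exit-spec.yaml parses to a list of entries shaped like the dict sketch earlier; adjust it if the real top-level layout differs.

```python
# Hypothetical client-side lookup over the spec table above.
import signal
import yaml

def load_spec(path="job-exit-spec.yaml"):
    """Index spec entries by exitcode. Assumes the file parses to a list."""
    with open(path) as f:
        entries = yaml.safe_load(f)
    return {entry["code"]: entry for entry in entries}

def describe(spec, raw_code):
    """Map a raw container exitcode to a one-line summary."""
    entry = spec.get(raw_code)
    if entry is None:
        # Fall back to the catch-all rows for undefined exitcodes.
        entry = spec[256 if raw_code > 0 else -8000]
    note = ""
    if entry["phrase"].startswith("CONTAINER_KILLED_BY_SIG"):
        # Shell convention: killed-by-signal exitcode = 128 + signal number.
        note = f" (OS signal {signal.Signals(raw_code - 128).name})"
    return f"{raw_code}{note}: {entry['phrase']}, reaction={entry.get('reaction')}"

# e.g. describe(spec, 137) -> "137 (OS signal SIGKILL):
#   CONTAINER_KILLED_BY_SIGKILL, reaction=ALWAYS_RETRY"
```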