# PAI Job Exit Spec

1. See details in job-exit-spec.yaml
2. This markdown file is generated from job-exit-spec.yaml by update_markdown.py
3. See the full doc in the PAI Job Exit Spec User Manual

## Spec Schema

| field | description | required | unique | type | range |
| ----- | ----------- | -------- | ------ | ---- | ----- |
| code | The PAI Job ExitCode | True | True | Integer | begin: -8000<br>end: 256 |
| phrase | The textual phrase representation of this ExitCode | True | True | String | Any |
| issuer | Who, in detail, root issued this ExitCode | False | False | Enum | 1. USER_CONTAINER<br>2. PAI_OS<br>3. PAI_RUNTIME<br>4. PAI_YARN<br>5. PAI_LAUNCHER |
| causer | Who, in detail, root caused this ExitCode | False | False | Enum | 1. USER_SUBMISSION<br>2. USER_CONTAINER<br>3. USER_STOP<br>4. USER_DELETION<br>5. USER_RETRY<br>6. USER_UPGRADE<br>7. RESOURCE_ALLOCATION_TIMEOUT<br>8. PAI_HDFS<br>9. PAI_OS<br>10. PAI_DOCKER<br>11. PAI_RUNTIME<br>12. PAI_YARN<br>13. PAI_LAUNCHER<br>14. UNKNOWN |
| type | The rough type of this ExitCode | False | False | Enum | 1. USER_SUCCESS<br>2. USER_STOP<br>3. USER_FAILURE<br>4. PLATFORM_FAILURE<br>5. RESOURCE_ALLOCATION_TIMEOUT<br>6. UNKNOWN_FAILURE |
| stage | The user process stage just before this ExitCode was issued | False | False | Enum | 1. SUBMITTING<br>2. ALLOCATING<br>3. LAUNCHING<br>4. RUNNING<br>5. COMPLETING<br>6. UNKNOWN |
| behavior | The rerun behavior of this ExitCode | False | False | Enum | 1. TRANSIENT_NORMAL<br>2. TRANSIENT_CONFLICT<br>3. NON_TRANSIENT<br>4. UNKNOWN |
| reaction | The reaction PAI will automatically execute for this ExitCode | False | False | Enum | 1. ALWAYS_RETRY<br>2. ALWAYS_BACKOFF_RETRY<br>3. RETRY_TO_MAX<br>4. NEVER_RETRY |
| reason | Why this ExitCode is issued | False | False | String | Any |
| repro | Specific steps to reproduce this ExitCode | False | False | `List<String>` | Any |
| solution | Optional solutions to resolve this ExitCode if it indicates failure | False | False | `List<String>` | Any |
| pattern | The pattern PAI uses to detect this ExitCode | False | False | String | Any, such as `USER_EXITCODE=X && USER_LOG_PATTERN=Y \|\| OS_Signal=Z` |
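As a concrete reading of the schema, each entry in job-exit-spec.yaml can be modeled as a record with the fields above. Below is a minimal Python sketch; the class and enum names are illustrative only, not part of the spec, and some of the enums are kept as plain strings for brevity:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Issuer(Enum):
    """The five issuer values from the schema table."""
    USER_CONTAINER = "USER_CONTAINER"
    PAI_OS = "PAI_OS"
    PAI_RUNTIME = "PAI_RUNTIME"
    PAI_YARN = "PAI_YARN"
    PAI_LAUNCHER = "PAI_LAUNCHER"


class Reaction(Enum):
    """The four reactions PAI may execute automatically."""
    ALWAYS_RETRY = "ALWAYS_RETRY"
    ALWAYS_BACKOFF_RETRY = "ALWAYS_BACKOFF_RETRY"
    RETRY_TO_MAX = "RETRY_TO_MAX"
    NEVER_RETRY = "NEVER_RETRY"


@dataclass
class ExitSpecEntry:
    # Required and unique; code is an Integer in [-8000, 256].
    code: int
    phrase: str
    # Optional fields; causer/type/stage/behavior are also enums in the
    # spec, modeled here as plain strings to keep the sketch short.
    issuer: Optional[Issuer] = None
    causer: Optional[str] = None
    type: Optional[str] = None
    stage: Optional[str] = None
    behavior: Optional[str] = None
    reaction: Optional[Reaction] = None
    reason: Optional[str] = None
    repro: List[str] = field(default_factory=list)
    solution: List[str] = field(default_factory=list)
    pattern: Optional[str] = None
```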

## Spec Table

1. You may need to scroll right to see the full table.
2. The code 256 represents all positive exitcodes not otherwise defined in this spec; the specific undefined exitcode always overrides it when exposed to the user.
3. The code -8000 represents all negative exitcodes not otherwise defined in this spec; the specific undefined exitcode always overrides it when exposed to the user (see the sketch after this list).
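Notes 2 and 3 describe a lookup-with-fallback rule: an exitcode missing from the spec borrows the metadata of the 256 or -8000 entry, while the raw code itself is still the one shown to the user. A hypothetical Python sketch of that rule, assuming the spec has been loaded into a dict keyed by code:

```python
def resolve_exit_info(raw_code: int, spec: dict) -> dict:
    """Resolve a raw exitcode against the spec (hypothetical helper, not a PAI API).

    Undefined positive codes fall back to the 256 entry and undefined
    negative codes to the -8000 entry, but the specific raw code still
    overrides the placeholder when exposed to the user.
    """
    if raw_code in spec:
        return spec[raw_code]
    info = dict(spec[256] if raw_code > 0 else spec[-8000])
    info["code"] = raw_code  # expose the specific undefined exitcode
    return info
```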
| code | phrase | issuer | causer | type | stage | behavior | reaction | reason | repro | solution | pattern |
| ---- | ------ | ------ | ------ | ---- | ----- | -------- | -------- | ------ | ----- | -------- | ------- |
| 154 | CONTAINER_EXIT_CODE_FILE_LOST | PAI_YARN | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exitcode file cannot be found by YARN NM; maybe the node unexpectedly shut down, the disk was cleaned up, or the disk failed | 1. Stop YARN NM<br>2. Kill container process<br>3. Delete container exitcode file<br>4. Start YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 130 | CONTAINER_KILLED_BY_SIGINT | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGINT | 1. Kill container process by SIGINT | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 132 | CONTAINER_KILLED_BY_SIGILL | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGILL | 1. User program executes an illegal, malformed, unknown, or privileged machine instruction | 1. Check container log and fix your program bug | |
| 134 | CONTAINER_KILLED_BY_SIGABRT | USER_CONTAINER | UNKNOWN | UNKNOWN_FAILURE | RUNNING | UNKNOWN | RETRY_TO_MAX | Container killed by OS Signal: SIGABRT | 1. User program calls abort() in libc | 1. Check container log and find root cause<br>2. Wait for result from next retry | |
| 135 | CONTAINER_KILLED_BY_SIGBUS | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGBUS | 1. User program accesses an unaligned memory address | 1. Check container log and fix your program bug | |
| 136 | CONTAINER_KILLED_BY_SIGFPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGFPE | 1. User program divides by zero | 1. Check container log and fix your program bug | |
| 137 | CONTAINER_KILLED_BY_SIGKILL | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGKILL | 1. Kill container process by SIGKILL | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 139 | CONTAINER_KILLED_BY_SIGSEGV | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGSEGV | 1. User program accesses an illegal memory address | 1. Check container log and fix your program bug | |
| 141 | CONTAINER_KILLED_BY_SIGPIPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGPIPE | 1. User program writes to a pipe without a process connected to the other end | 1. Check container log and fix your program bug | |
| 143 | CONTAINER_KILLED_BY_SIGTERM | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGTERM | 1. Kill container process by SIGTERM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 193 | CONTAINER_DOCKER_RUN_FAILED | PAI_RUNTIME | PAI_DOCKER | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by docker run | 1. PAI Runtime calls docker run with an unknown flag | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 196 | CONTAINER_OOM_KILLED_BY_DOCKER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by Docker because it exceeded its requested memory | 1. User program uses more memory than it requested | 1. Increase per-task memory request<br>2. Decrease per-task memory usage, e.g. by increasing task number | |
| 198 | CONTAINER_OOD_KILLED_BY_DISKCLEANER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by disk cleaner because it used the most disk space while the total disk usage of all containers on the node exceeded the platform limit | 1. User program uses almost all disk space of the node | 1. Decrease per-task disk space usage, e.g. by increasing task number | |
| 255 | CONTAINER_RUNTIME_UNKNOWN_FAILURE | PAI_RUNTIME | UNKNOWN | UNKNOWN_FAILURE | COMPLETING | UNKNOWN | RETRY_TO_MAX | Container failed but the failure cannot be recognized by PAI Runtime | 1. User program directly exits with exitcode 1 | 1. Check container log and find root cause<br>2. Wait for result from next retry | |
| 256 | CONTAINER_RUNTIME_EXIT_ABNORMALLY | PAI_RUNTIME | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | PAI Runtime exited abnormally with an undefined exitcode; it may have bugs | 1. PAI Runtime exits with exitcode 1 | 1. Contact PAI Dev to fix PAI Runtime bugs | |
| 0 | SUCCEEDED | USER_CONTAINER | USER_CONTAINER | USER_SUCCESS | COMPLETING | UNKNOWN | NEVER_RETRY | | 1. User program exits with exitcode 0 | | |
| -7100 | CONTAINER_INVALID_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with an invalid exit status; maybe YARN failed to initialize the container environment | 1. Disable write permission for YARN NM to access {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7101 | CONTAINER_NOT_AVAILABLE_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with a not-available exit status; maybe YARN failed to create the container executor process | 1. Disable execute permission for YARN NM to access bash on *nix or winutils.exe on Windows | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7102 | CONTAINER_NODE_DISKS_FAILED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to bad local disks; maybe no disk space left | 1. Set zero disk space for {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7103 | CONTAINER_PORT_CONFLICT | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to a local port conflict | 1. After the container is allocated and before it starts, stop the container's YARN NM<br>2. Occupy a port requested by the container on its node<br>3. Start the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7110 | CONTAINER_ABORTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container aborted by YARN | 1. Corrupt the container entry in YARN NM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7111 | CONTAINER_NODE_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost because its node was lost; maybe its YARN NM has been down for a long time | 1. Stop the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7112 | CONTAINER_EXPIRED | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container expired because it was not launched on YARN NM in time; maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token<br>2. Set amGangAllocationTimeoutSec larger than yarn.resourcemanager.rm.container-allocation.expiry-interval-ms<br>3. Request more containers in a job than its virtual cluster's currently available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota | |
| -7113 | CONTAINER_ABORTED_ON_AM_RESTART | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container aborted by YARN RM during Launcher AM restart; maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources<br>3. Kill Launcher AM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7120 | CONTAINER_PREEMPTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container preempted by YARN RM; maybe its virtual cluster's overused resources were reclaimed | 1. Enable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources<br>3. Use up all other virtual clusters' available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota<br>5. Contact Cluster Admin to disable your virtual cluster bonus token | |
| -7121 | CONTAINER_RUNTIME_VIRTUAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested virtual memory | 1. PAI Runtime uses more virtual memory than its container requested | 1. Increase per-task virtual memory request<br>2. Contact PAI Dev to decrease PAI Runtime virtual memory usage | |
| -7122 | CONTAINER_RUNTIME_PHYSICAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested physical memory | 1. PAI Runtime uses more physical memory than its container requested | 1. Increase per-task physical memory request<br>2. Contact PAI Dev to decrease PAI Runtime physical memory usage | |
| -7123 | CONTAINER_KILLED_BY_AM | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher AM; maybe the allocated container was rejected | 1. Set up a single-node cluster<br>2. Submit a job with two tasks and antiaffinityAllocation enabled<br>3. Launcher rejects the allocated container whose node has already been allocated another container | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7124 | CONTAINER_KILLED_BY_RM | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM; maybe the container is not managed by YARN RM anymore | 1. Delete the container's app entry in YARN RM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7125 | CONTAINER_KILLED_ON_APP_COMPLETION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM because its app is already completed | 1. Stop Launcher AM container's YARN NM<br>2. Kill the container's app | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7126 | CONTAINER_EXTERNAL_UTILIZATION_SPIKED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN due to an external utilization spike | 1. Enable YARN external utilization check<br>2. Start a raw process that uses up almost all memory on the node | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7150 | CONTAINER_NM_LAUNCH_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container failed to launch on YARN NM | 1. After the container is allocated and before it starts, stop the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7151 | CONTAINER_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM<br>2. Restart YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7152 | CONTAINER_RM_RESYNC_EXCEEDED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container exceeded after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM<br>2. Restart YARN RM<br>3. Wait until AM releases the container<br>4. Start the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7153 | CONTAINER_MIGRATE_TASK_REQUESTED | PAI_LAUNCHER | USER_RETRY | USER_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher due to a user MigrateTaskRequest | 1. Send MigrateTaskRequest for the container | 1. Wait for result from next retry | |
| -7154 | CONTAINER_AGENT_EXPIRED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher because no Launcher Agent heartbeat was received in time | 1. Enable Launcher Agent<br>2. Bring down the container's node | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7200 | AM_RM_HEARTBEAT_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed to heartbeat with YARN RM due to YarnException; maybe the App is non-compliant | 1. Submit a job with an invalid node label | 1. Check diagnostics and revise your job config | |
| -7201 | AM_RM_HEARTBEAT_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to IOException; maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7202 | AM_RM_HEARTBEAT_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to an unknown Exception | 1. AM sends an invalid message to YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7203 | AM_RM_HEARTBEAT_SHUTDOWN_REQUESTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to ShutdownRequest; maybe AM is not managed by YARN RM anymore | 1. Set a small AM expiry time<br>2. Set up a network partition between AM and YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7250 | AM_UNKNOWN_EXCEPTION | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed due to an unknown Exception | 1. Set up a network partition between AM and ZK | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7251 | AM_NON_TRANSIENT_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed due to NonTransientException; maybe the App is non-compliant | 1. Submit a job with an invalid data dir | 1. Check diagnostics and revise your job config | |
| -7252 | AM_GANG_ALLOCATION_TIMEOUT | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Launcher AM failed because all the requested resources could not be satisfied in time | 1. Disable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota | |
| -7300 | APP_SUBMISSION_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Failed to submit App to YARN RM due to YarnException; maybe the App is non-compliant | 1. Submit a job to an invalid virtual cluster | 1. Check diagnostics and revise your job config | |
| -7301 | APP_SUBMISSION_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | SUBMITTING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to submit App to YARN RM due to IOException; maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7302 | APP_SUBMISSION_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | UNKNOWN_FAILURE | SUBMITTING | UNKNOWN | RETRY_TO_MAX | Failed to submit App to YARN RM due to an unknown Exception | 1. Launcher Service sends an invalid message to YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7303 | APP_KILLED_UNEXPECTEDLY | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App killed unexpectedly and directly through YARN RM | 1. Kill the app directly through YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7350 | APP_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App lost after Launcher Service resynced with YARN RM | 1. Delete the app entry in YARN RM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7351 | APP_STOP_FRAMEWORK_REQUESTED | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | App stopped by Launcher due to a user StopFrameworkRequest | 1. Stop a job | | |
| -7352 | APP_AM_DIAGNOSTICS_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to retrieve AMDiagnostics from YARN; maybe the App was cleaned up in YARN | 1. App is in APPLICATION_RETRIEVING_DIAGNOSTICS state<br>2. Stop Launcher Service<br>3. Delete the app entry in YARN RM state store<br>4. Start Launcher Service | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7353 | APP_AM_DIAGNOSTICS_DESERIALIZATION_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to deserialize AMDiagnostics from YARN; maybe it is corrupted, or Launcher AM unexpectedly crashed frequently without generating AMDiagnostics | 1. Set yarn.app.attempt.diagnostics.limit.kc to 1B | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7400 | TASK_STOPPED_ON_APP_COMPLETION | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Task stopped by Launcher because its app is already completed | 1. Stop a job with a long-running container | | |
| -8000 | CONTAINER_UNKNOWN_YARN_EXIT_STATUS | PAI_YARN | UNKNOWN | UNKNOWN_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | Container exited with an unknown exitcode issued from YARN | 1. Change YARN code to make it return container exitcode -886 | 1. Contact PAI Dev to recognize this exitcode | |
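For reference, the positive CONTAINER_KILLED_BY_* codes above follow the common Unix convention that a process terminated by a signal exits with 128 + signal number. A small sketch verifying this against the table (standard Python, nothing PAI-specific):

```python
import signal

def signal_exitcode(sig: signal.Signals) -> int:
    # Shell-style exitcode for a process killed by the given signal.
    return 128 + sig.value

# Matches the CONTAINER_KILLED_BY_* rows above.
assert signal_exitcode(signal.SIGINT) == 130
assert signal_exitcode(signal.SIGABRT) == 134
assert signal_exitcode(signal.SIGKILL) == 137
assert signal_exitcode(signal.SIGSEGV) == 139
assert signal_exitcode(signal.SIGTERM) == 143
```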