# PAI Job Exit Spec

1. See details in job-exit-spec.yaml
2. This markdown file is generated from job-exit-spec.yaml by update_markdown.py
3. See the full doc in the PAI Job Exit Spec User Manual

## Spec Schema

| field | description | required | unique | type | range |
| ----- | ----------- | -------- | ------ | ---- | ----- |
| code | The PAI Job ExitCode | True | True | Integer | begin: -8000<br>end: 256 |
| phrase | The textual phrase representation of this ExitCode | True | True | String | Any |
| issuer | Who, in detail, root issued this ExitCode | False | False | Enum | 1. USER_CONTAINER<br>2. PAI_OS<br>3. PAI_RUNTIME<br>4. PAI_YARN<br>5. PAI_LAUNCHER |
| causer | Who, in detail, root caused this ExitCode | False | False | Enum | 1. USER_SUBMISSION<br>2. USER_CONTAINER<br>3. USER_STOP<br>4. USER_DELETION<br>5. USER_RETRY<br>6. USER_UPGRADE<br>7. RESOURCE_ALLOCATION_TIMEOUT<br>8. PAI_HDFS<br>9. PAI_OS<br>10. PAI_DOCKER<br>11. PAI_RUNTIME<br>12. PAI_YARN<br>13. PAI_LAUNCHER<br>14. UNKNOWN |
| type | The rough type of this ExitCode | False | False | Enum | 1. USER_SUCCESS<br>2. USER_STOP<br>3. USER_FAILURE<br>4. PLATFORM_FAILURE<br>5. RESOURCE_ALLOCATION_TIMEOUT<br>6. UNKNOWN_FAILURE |
| stage | The user process stage just before this ExitCode was issued | False | False | Enum | 1. SUBMITTING<br>2. ALLOCATING<br>3. LAUNCHING<br>4. RUNNING<br>5. COMPLETING<br>6. UNKNOWN |
| behavior | The rerun behavior of this ExitCode | False | False | Enum | 1. TRANSIENT_NORMAL<br>2. TRANSIENT_CONFLICT<br>3. NON_TRANSIENT<br>4. UNKNOWN |
| reaction | The reaction PAI will automatically execute for this ExitCode | False | False | Enum | 1. ALWAYS_RETRY<br>2. ALWAYS_BACKOFF_RETRY<br>3. RETRY_TO_MAX<br>4. NEVER_RETRY |
| reason | Why this ExitCode is issued | False | False | String | Any |
| repro | Specific steps to reproduce this ExitCode | False | False | `List<String>` | Any |
| solution | Optional solutions to resolve this ExitCode if it indicates failure | False | False | `List<String>` | Any |
| pattern | The pattern PAI uses to detect this ExitCode | False | False | String | Any, such as `USER_EXITCODE=X && USER_LOG_PATTERN=Y \|\| OS_Signal=Z` |
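As a concrete reading of the schema, each entry in job-exit-spec.yaml can be modeled as a record with the fields above. Below is a minimal Python sketch; the class and enum names are illustrative only, not part of the spec, and some of the enums are kept as plain strings for brevity:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Issuer(Enum):
    """The five issuer values from the schema table."""
    USER_CONTAINER = "USER_CONTAINER"
    PAI_OS = "PAI_OS"
    PAI_RUNTIME = "PAI_RUNTIME"
    PAI_YARN = "PAI_YARN"
    PAI_LAUNCHER = "PAI_LAUNCHER"


class Reaction(Enum):
    """The four reactions PAI may execute automatically."""
    ALWAYS_RETRY = "ALWAYS_RETRY"
    ALWAYS_BACKOFF_RETRY = "ALWAYS_BACKOFF_RETRY"
    RETRY_TO_MAX = "RETRY_TO_MAX"
    NEVER_RETRY = "NEVER_RETRY"


@dataclass
class ExitSpecEntry:
    # Required and unique; code is an Integer in [-8000, 256].
    code: int
    phrase: str
    # Optional fields; causer/type/stage/behavior are also enums in the
    # spec, modeled here as plain strings to keep the sketch short.
    issuer: Optional[Issuer] = None
    causer: Optional[str] = None
    type: Optional[str] = None
    stage: Optional[str] = None
    behavior: Optional[str] = None
    reaction: Optional[Reaction] = None
    reason: Optional[str] = None
    repro: List[str] = field(default_factory=list)
    solution: List[str] = field(default_factory=list)
    pattern: Optional[str] = None
```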

## Spec Table

1. You may need to scroll right to see the full table.
2. The code 256 represents all positive exitcodes not otherwise defined in this spec; the specific undefined exitcode always overrides it when exposed to the user.
3. The code -8000 represents all negative exitcodes not otherwise defined in this spec; the specific undefined exitcode always overrides it when exposed to the user (see the sketch after this list).
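Notes 2 and 3 describe a lookup-with-fallback rule: an exitcode missing from the spec borrows the metadata of the 256 or -8000 entry, while the raw code itself is still the one shown to the user. A hypothetical Python sketch of that rule, assuming the spec has been loaded into a dict keyed by code:

```python
def resolve_exit_info(raw_code: int, spec: dict) -> dict:
    """Resolve a raw exitcode against the spec (hypothetical helper, not a PAI API).

    Undefined positive codes fall back to the 256 entry and undefined
    negative codes to the -8000 entry, but the specific raw code still
    overrides the placeholder when exposed to the user.
    """
    if raw_code in spec:
        return spec[raw_code]
    info = dict(spec[256] if raw_code > 0 else spec[-8000])
    info["code"] = raw_code  # expose the specific undefined exitcode
    return info
```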
| code | phrase | issuer | causer | type | stage | behavior | reaction | reason | repro | solution | pattern |
| ---- | ------ | ------ | ------ | ---- | ----- | -------- | -------- | ------ | ----- | -------- | ------- |
| 154 | CONTAINER_EXIT_CODE_FILE_LOST | PAI_YARN | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exitcode file cannot be found by YARN NM; maybe the node unexpectedly shut down, the disk was cleaned up, or the disk failed | 1. Stop YARN NM<br>2. Kill container process<br>3. Delete container exitcode file<br>4. Start YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 130 | CONTAINER_KILLED_BY_SIGINT | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGINT | 1. Kill container process by SIGINT | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 132 | CONTAINER_KILLED_BY_SIGILL | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGILL | 1. User program executes an illegal, malformed, unknown, or privileged machine instruction | 1. Check container log and fix your program bug | |
| 134 | CONTAINER_KILLED_BY_SIGABRT | USER_CONTAINER | UNKNOWN | UNKNOWN_FAILURE | RUNNING | UNKNOWN | RETRY_TO_MAX | Container killed by OS Signal: SIGABRT | 1. User program calls abort() in libc | 1. Check container log and find root cause<br>2. Wait for result from next retry | |
| 135 | CONTAINER_KILLED_BY_SIGBUS | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGBUS | 1. User program accesses an unaligned memory address | 1. Check container log and fix your program bug | |
| 136 | CONTAINER_KILLED_BY_SIGFPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGFPE | 1. User program divides by zero | 1. Check container log and fix your program bug | |
| 137 | CONTAINER_KILLED_BY_SIGKILL | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGKILL | 1. Kill container process by SIGKILL | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 139 | CONTAINER_KILLED_BY_SIGSEGV | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGSEGV | 1. User program accesses an illegal memory address | 1. Check container log and fix your program bug | |
| 141 | CONTAINER_KILLED_BY_SIGPIPE | USER_CONTAINER | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by OS Signal: SIGPIPE | 1. User program writes to a pipe without a process connected to the other end | 1. Check container log and fix your program bug | |
| 143 | CONTAINER_KILLED_BY_SIGTERM | PAI_OS | PAI_OS | PLATFORM_FAILURE | RUNNING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by OS Signal: SIGTERM | 1. Kill container process by SIGTERM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 193 | CONTAINER_DOCKER_RUN_FAILED | PAI_RUNTIME | PAI_DOCKER | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by docker run | 1. PAI Runtime calls docker run with an unknown flag | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| 196 | CONTAINER_OOM_KILLED_BY_DOCKER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by Docker because it exceeded its requested memory | 1. User program uses more memory than it requested | 1. Increase per-task memory request<br>2. Decrease per-task memory usage, e.g. by increasing task number | |
| 198 | CONTAINER_OOD_KILLED_BY_DISKCLEANER | PAI_RUNTIME | USER_CONTAINER | USER_FAILURE | RUNNING | NON_TRANSIENT | NEVER_RETRY | Container killed by disk cleaner because it used the most disk space while the total disk usage of all containers on the node exceeded the platform limit | 1. User program uses almost all disk space of the node | 1. Decrease per-task disk space usage, e.g. by increasing task number | |
| 255 | CONTAINER_RUNTIME_UNKNOWN_FAILURE | PAI_RUNTIME | UNKNOWN | UNKNOWN_FAILURE | COMPLETING | UNKNOWN | RETRY_TO_MAX | Container failed but the failure cannot be recognized by PAI Runtime | 1. User program directly exits with exitcode 1 | 1. Check container log and find root cause<br>2. Wait for result from next retry | |
| 256 | CONTAINER_RUNTIME_EXIT_ABNORMALLY | PAI_RUNTIME | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | PAI Runtime exited abnormally with an undefined exitcode; it may have bugs | 1. PAI Runtime exits with exitcode 1 | 1. Contact PAI Dev to fix PAI Runtime bugs | |
| 0 | SUCCEEDED | USER_CONTAINER | USER_CONTAINER | USER_SUCCESS | COMPLETING | UNKNOWN | NEVER_RETRY | | 1. User program exits with exitcode 0 | | |
| -7100 | CONTAINER_INVALID_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with an invalid exit status; maybe YARN failed to initialize the container environment | 1. Disable write permission for YARN NM to access {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7101 | CONTAINER_NOT_AVAILABLE_EXIT_STATUS | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container exited with a not-available exit status; maybe YARN failed to create the container executor process | 1. Disable execute permission for YARN NM to access bash on *nix or winutils.exe on Windows | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7102 | CONTAINER_NODE_DISKS_FAILED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to bad local disks; maybe no disk space left | 1. Set zero disk space for {yarn.nodemanager.local-dirs} | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7103 | CONTAINER_PORT_CONFLICT | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container cannot be launched by YARN due to a local port conflict | 1. After the container is allocated and before it starts, stop the container's YARN NM<br>2. Occupy a port requested by the container on its node<br>3. Start the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7110 | CONTAINER_ABORTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container aborted by YARN | 1. Corrupt the container entry in YARN NM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7111 | CONTAINER_NODE_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost because its node was lost; maybe its YARN NM has been down for a long time | 1. Stop the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7112 | CONTAINER_EXPIRED | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container expired because it was not launched on YARN NM in time; maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token<br>2. Set amGangAllocationTimeoutSec larger than yarn.resourcemanager.rm.container-allocation.expiry-interval-ms<br>3. Request more containers in a job than its virtual cluster's currently available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota | |
| -7113 | CONTAINER_ABORTED_ON_AM_RESTART | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Previously allocated container aborted by YARN RM during Launcher AM restart; maybe other containers could not be allocated in time | 1. Disable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources<br>3. Kill Launcher AM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7120 | CONTAINER_PREEMPTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container preempted by YARN RM; maybe its virtual cluster's overused resources were reclaimed | 1. Enable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources<br>3. Use up all other virtual clusters' available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota<br>5. Contact Cluster Admin to disable your virtual cluster bonus token | |
| -7121 | CONTAINER_RUNTIME_VIRTUAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested virtual memory | 1. PAI Runtime uses more virtual memory than its container requested | 1. Increase per-task virtual memory request<br>2. Contact PAI Dev to decrease PAI Runtime virtual memory usage | |
| -7122 | CONTAINER_RUNTIME_PHYSICAL_MEMORY_EXCEEDED | PAI_LAUNCHER | PAI_RUNTIME | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container killed by YARN because its PAI Runtime exceeded the requested physical memory | 1. PAI Runtime uses more physical memory than its container requested | 1. Increase per-task physical memory request<br>2. Contact PAI Dev to decrease PAI Runtime physical memory usage | |
| -7123 | CONTAINER_KILLED_BY_AM | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher AM; maybe the allocated container was rejected | 1. Set up a single-node cluster<br>2. Submit a job with two tasks and antiaffinityAllocation enabled<br>3. Launcher rejects the allocated container whose node has already been allocated another container | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7124 | CONTAINER_KILLED_BY_RM | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM; maybe the container is not managed by YARN RM anymore | 1. Delete the container's app entry in YARN RM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7125 | CONTAINER_KILLED_ON_APP_COMPLETION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN RM because its app is already completed | 1. Stop Launcher AM container's YARN NM<br>2. Kill the container's app | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7126 | CONTAINER_EXTERNAL_UTILIZATION_SPIKED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by YARN due to an external utilization spike | 1. Enable YARN external utilization check<br>2. Start a raw process that uses up almost all memory on the node | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7150 | CONTAINER_NM_LAUNCH_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | LAUNCHING | TRANSIENT_NORMAL | ALWAYS_RETRY | Container failed to launch on YARN NM | 1. After the container is allocated and before it starts, stop the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7151 | CONTAINER_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container lost after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM<br>2. Restart YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7152 | CONTAINER_RM_RESYNC_EXCEEDED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Container exceeded after Launcher AM resynced with YARN RM | 1. Stop the container's YARN NM<br>2. Restart YARN RM<br>3. Wait until AM releases the container<br>4. Start the container's YARN NM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7153 | CONTAINER_MIGRATE_TASK_REQUESTED | PAI_LAUNCHER | USER_RETRY | USER_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher due to a user MigrateTaskRequest | 1. Send MigrateTaskRequest for the container | 1. Wait for result from next retry | |
| -7154 | CONTAINER_AGENT_EXPIRED | PAI_LAUNCHER | PAI_OS | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Container killed by Launcher because no Launcher Agent heartbeat was received in time | 1. Enable Launcher Agent<br>2. Bring down the container's node | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7200 | AM_RM_HEARTBEAT_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed to heartbeat with YARN RM due to YarnException; maybe the App is non-compliant | 1. Submit a job with an invalid node label | 1. Check diagnostics and revise your job config | |
| -7201 | AM_RM_HEARTBEAT_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to IOException; maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7202 | AM_RM_HEARTBEAT_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to an unknown Exception | 1. AM sends an invalid message to YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7203 | AM_RM_HEARTBEAT_SHUTDOWN_REQUESTED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed to heartbeat with YARN RM due to ShutdownRequest; maybe AM is not managed by YARN RM anymore | 1. Set a small AM expiry time<br>2. Set up a network partition between AM and YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7250 | AM_UNKNOWN_EXCEPTION | PAI_LAUNCHER | PAI_LAUNCHER | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | Launcher AM failed due to an unknown Exception | 1. Set up a network partition between AM and ZK | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7251 | AM_NON_TRANSIENT_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Launcher AM failed due to NonTransientException; maybe the App is non-compliant | 1. Submit a job with an invalid data dir | 1. Check diagnostics and revise your job config | |
| -7252 | AM_GANG_ALLOCATION_TIMEOUT | PAI_LAUNCHER | RESOURCE_ALLOCATION_TIMEOUT | RESOURCE_ALLOCATION_TIMEOUT | ALLOCATING | TRANSIENT_CONFLICT | ALWAYS_BACKOFF_RETRY | Launcher AM failed because all the requested resources could not be satisfied in time | 1. Disable virtual cluster bonus token<br>2. Request more containers in a job than its virtual cluster's currently available resources | 1. Wait for result from next retry<br>2. Decrease task number<br>3. Decrease per-task resource request<br>4. Contact Cluster Admin to increase your virtual cluster quota | |
| -7300 | APP_SUBMISSION_YARN_EXCEPTION | PAI_LAUNCHER | USER_SUBMISSION | USER_FAILURE | SUBMITTING | NON_TRANSIENT | NEVER_RETRY | Failed to submit App to YARN RM due to YarnException; maybe the App is non-compliant | 1. Submit a job to an invalid virtual cluster | 1. Check diagnostics and revise your job config | |
| -7301 | APP_SUBMISSION_IO_EXCEPTION | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | SUBMITTING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to submit App to YARN RM due to IOException; maybe YARN RM is down | 1. Stop YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7302 | APP_SUBMISSION_UNKNOWN_EXCEPTION | PAI_LAUNCHER | UNKNOWN | UNKNOWN_FAILURE | SUBMITTING | UNKNOWN | RETRY_TO_MAX | Failed to submit App to YARN RM due to an unknown Exception | 1. Launcher Service sends an invalid message to YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7303 | APP_KILLED_UNEXPECTEDLY | PAI_LAUNCHER | UNKNOWN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App killed unexpectedly and directly through YARN RM | 1. Kill the app directly through YARN RM | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7350 | APP_RM_RESYNC_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | UNKNOWN | TRANSIENT_NORMAL | ALWAYS_RETRY | App lost after Launcher Service resynced with YARN RM | 1. Delete the app entry in YARN RM state store | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7351 | APP_STOP_FRAMEWORK_REQUESTED | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | App stopped by Launcher due to a user StopFrameworkRequest | 1. Stop a job | | |
| -7352 | APP_AM_DIAGNOSTICS_LOST | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to retrieve AMDiagnostics from YARN; maybe the App was cleaned up in YARN | 1. App is in APPLICATION_RETRIEVING_DIAGNOSTICS state<br>2. Stop Launcher Service<br>3. Delete the app entry in YARN RM state store<br>4. Start Launcher Service | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7353 | APP_AM_DIAGNOSTICS_DESERIALIZATION_FAILED | PAI_LAUNCHER | PAI_YARN | PLATFORM_FAILURE | COMPLETING | TRANSIENT_NORMAL | ALWAYS_RETRY | Failed to deserialize AMDiagnostics from YARN; maybe it is corrupted, or Launcher AM unexpectedly crashed frequently without generating AMDiagnostics | 1. Set yarn.app.attempt.diagnostics.limit.kc to 1B | 1. Wait for result from next retry<br>2. Contact Cluster Admin | |
| -7400 | TASK_STOPPED_ON_APP_COMPLETION | PAI_LAUNCHER | USER_STOP | USER_STOP | UNKNOWN | NON_TRANSIENT | NEVER_RETRY | Task stopped by Launcher because its app is already completed | 1. Stop a job with a long-running container | | |
| -8000 | CONTAINER_UNKNOWN_YARN_EXIT_STATUS | PAI_YARN | UNKNOWN | UNKNOWN_FAILURE | UNKNOWN | UNKNOWN | RETRY_TO_MAX | Container exited with an unknown exitcode issued from YARN | 1. Change YARN code to make it return container exitcode -886 | 1. Contact PAI Dev to recognize this exitcode | |
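For reference, the positive CONTAINER_KILLED_BY_* codes above follow the common Unix convention that a process terminated by a signal exits with 128 + signal number. A small sketch verifying this against the table (standard Python, nothing PAI-specific):

```python
import signal

def signal_exitcode(sig: signal.Signals) -> int:
    # Shell-style exitcode for a process killed by the given signal.
    return 128 + sig.value

# Matches the CONTAINER_KILLED_BY_* rows above.
assert signal_exitcode(signal.SIGINT) == 130
assert signal_exitcode(signal.SIGABRT) == 134
assert signal_exitcode(signal.SIGKILL) == 137
assert signal_exitcode(signal.SIGSEGV) == 139
assert signal_exitcode(signal.SIGTERM) == 143
```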