This repository has been archived by the owner on Mar 14, 2020. It is now read-only.
I noticed an odd thing today. I ran a job that failed when starting the ApplicationMaster:
Application application_1416843883012_0019 failed 2 times due to Error launching appattempt_1416843883012_0019_000002. Got exception: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:209)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:226)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:198)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:108)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
. Failing the application.
The ResourceManager correctly shows the job in the FAILED state. Timberlake, however, says the job is still running, and refreshing the page just resets the duration to 0. After restarting Timberlake the job is no longer shown as running, but it is not included in the list of finished jobs.
Ah yeah, sorry about that. This happens because Timberlake asks the ResourceManager for running jobs and trusts the HistoryServer to have all the finished jobs. Your job got into the list of running jobs, but then it got stuck there because the HistoryServer never learned about it.
A previous version would drop the job from the list of running jobs if the ResourceManager didn't know about it. This led to weird issues where the job would disappear for a few seconds until the HistoryServer picked it up.
Thanks for the report! I'm thinking about how to make this part more reliable.
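For illustration, here is a rough sketch of one way the reconciliation could be made more robust. It is not Timberlake's actual code: the function name `reconcile`, its arguments, and the rule "trust the ResourceManager for FAILED and KILLED jobs" are all assumptions made up for this example. The idea is to keep waiting for the HistoryServer in the normal case, which avoids the disappearing-job problem, but to stop waiting once the ResourceManager reports a state that the HistoryServer may never record.

```python
# Hypothetical sketch, not Timberlake's real implementation.
# The state names are YARN application states as reported by the ResourceManager.

# Assumption: for these states we stop waiting on the HistoryServer,
# since a job that failed before its ApplicationMaster started may never get there.
TRUST_RM_STATES = {"FAILED", "KILLED"}


def reconcile(tracked, rm_states, history_ids):
    """Split tracked job IDs into (still_running, finished).

    tracked:     job IDs currently shown as running
    rm_states:   dict of job ID -> state reported by the ResourceManager
    history_ids: set of job IDs the HistoryServer knows about
    finished maps each job ID to the source that settled it.
    """
    still_running = []
    finished = {}
    for job_id in tracked:
        state = rm_states.get(job_id)  # None if the ResourceManager forgot the job
        if job_id in history_ids:
            # Normal path: the HistoryServer has the finished job.
            finished[job_id] = "history"
        elif state in TRUST_RM_STATES:
            # The case from this report: fall back to the ResourceManager's verdict.
            finished[job_id] = "resourcemanager"
        else:
            # Still running, or finished but not yet picked up by the
            # HistoryServer: keep showing it so it does not vanish.
            still_running.append(job_id)
    return still_running, finished


if __name__ == "__main__":
    running, done = reconcile(
        ["job_1", "job_2", "job_3"],
        {"job_1": "RUNNING", "job_2": "FAILED", "job_3": "FINISHED"},
        {"job_3"},
    )
    print(running, done)
```

One trade-off with this approach: a job that fails after its ApplicationMaster has already run may still show up in the HistoryServer later, so a real fix would probably need to merge the two records rather than pick one.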