[SPARK-19379][CORE] SparkAppHandle.getState not registering FAILED state upon Spark app failure in Local mode #16743
Conversation
Added a new integration suite: the Docker-based test suite for the bug fix in SPARK-12941
Updated pom.xml with the ojdbc jar dependency needed to test the OracleIntegrationSuite related to the SPARK-12941 bug fix
… suite addition Updated DockerJDBCIntegrationSuite.scala with the timeout interval, as it needs to be at least 200 sec for the Oracle instance to be up
Updated OracleDialect.scala for [SPARK-12941] to add a data type mapping for Oracle
…te addition Updated OracleIntegrationSuite.scala for [SPARK-12941] with Oracle Docker test cases
…es and ignore test case Updated the file with style check fixes and ignored the test cases, as the ojdbc jar is not present in Maven
Updated the order of imports
Updated pom.xml to comment out the Oracle ojdbc jar dependency, as the Maven repository does not contain it
Updated OracleIntegrationSuite.scala with the doc and an override method
updated doc
Updated style check fixes
Updated OracleIntegrationSuite.scala for [SPARK-12941] with edits to the comments section
Added comments on why the oracle dependency was commented out
Updated DockerJDBCIntegrationSuite.scala: reverted the timeout and interval parameters to 60 and 1. These would need to be increased for Oracle-based testing.
Updated the comments in DockerJDBCIntegrationSuite.scala about the time interval change required when running the Oracle tests
The setState method of the launcher backend is now invoked for the status updates related to FAILED and FINISHED. The isFailed test explicitly captures the failure scenario and updates the status, and an else if is used to preserve the existing behavior of mapping isFinished conditions to FINISHED.
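In other words, the proposal maps task-level states onto launcher-level app states. A minimal sketch of the kind of code this describes, assuming it sits in LocalSchedulerBackend.statusUpdate (an excerpt, not the merged patch; the reviewers below question this approach):

```scala
// Sketch of the described change inside Spark core (excerpt, not merged code):
// org.apache.spark.scheduler.local.LocalSchedulerBackend
import java.nio.ByteBuffer
import org.apache.spark.TaskState
import org.apache.spark.TaskState.TaskState
import org.apache.spark.launcher.SparkAppHandle

override def statusUpdate(taskId: Long, state: TaskState,
    serializedData: ByteBuffer): Unit = {
  localEndpoint.send(StatusUpdate(taskId, state, serializedData))
  if (TaskState.isFailed(state)) {
    // FAILED or LOST task state: surface the app as FAILED to the launcher
    launcherBackend.setState(SparkAppHandle.State.FAILED)
  } else if (TaskState.isFinished(state)) {
    // else-if, because isFinished also matches FAILED/KILLED/LOST
    launcherBackend.setState(SparkAppHandle.State.FINISHED)
  }
}
```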
Can one of the admins verify this patch?
CC @zsxwing as this was last touched in https://issues.apache.org/jira/browse/SPARK-6602
@thomastechs It looks like it would be more appropriate to call the private method stop(SparkAppHandle.State), so that it ensures closure of the backend (in addition to the callback) rather than just the callback. Also, wondering whether the LAUNCHING task state should be translated and called back as SUBMITTED in SparkAppHandle.State? Wasn't sure.
I don't get the point. You are using a task state to update an application state.
@zsxwing What's the best way to call back and update the SparkAppHandle state upon a FAILED or FINISHED state of the Spark application? It currently lacks handling for FAILED.
Maybe I'm missing something. Why do we need to set it in local mode? There is no public way to get the state.
My use case is end-to-end automated testing in local mode using the programmatic launcher. I have tests where the Spark app is expected to be FINISHED and tests where it is expected to be FAILED.
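For context, a minimal sketch of that kind of test against the launcher API (the app jar path, main class, and expected final state are hypothetical placeholders):

```scala
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LocalModeStateTest {
  def main(args: Array[String]): Unit = {
    val done = new CountDownLatch(1)
    val handle = new SparkLauncher()
      .setMaster("local[2]")
      .setAppResource("/path/to/test-app.jar")   // hypothetical test jar
      .setMainClass("com.example.FailingApp")    // hypothetical app expected to fail
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          if (h.getState.isFinal) done.countDown()
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })
    done.await()
    // The reported bug: in local mode this observes FINISHED even on failure.
    assert(handle.getState == SparkAppHandle.State.FAILED)
  }
}
```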
One point, as discussed: statusUpdate gets called on task status changes. So, if we can identify the point where the job, or that executor (only one executor in local mode, right?), has failed, we can issue a callback at that point. If you could give some insight into this, that fix can be applied.
There's no way to set the app state currently because in client mode (which includes local), Spark lacks an API to do that. So the code guesses that since stop was called, the app finished successfully. In cluster mode, Spark has more info and can propagate the proper status. It might be possible to detect failure in client mode by checking the exit code of the underlying spark-submit process. But the current code here is wrong: a task failing does not mean the application failed.
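A minimal sketch of that exit-code idea using the launcher's child-process mode (the paths and class name are hypothetical placeholders; this is a sketch of the suggestion, not Spark's implementation):

```scala
import org.apache.spark.launcher.SparkLauncher

object ExitCodeCheck {
  def main(args: Array[String]): Unit = {
    // launch() hands back the underlying spark-submit java.lang.Process,
    // whose exit code is visible to the caller even in local/client mode.
    val process = new SparkLauncher()
      .setMaster("local[2]")
      .setAppResource("/path/to/test-app.jar")  // hypothetical
      .setMainClass("com.example.MyApp")        // hypothetical
      .launch()
    val exitCode = process.waitFor()
    // A non-zero exit code would indicate that the application failed.
    println(if (exitCode == 0) "app FINISHED" else "app FAILED")
  }
}
```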
Hi @thomastechs, is this still active, and any opinion on ^? Otherwise, I would like to propose closing this.
Yes, I believe it is still active. The solution stated by Marcelo of detecting the exit code in local mode would be a solution for my purposes of testing, where you want to do end-to-end integration testing in local mode and test for failure/success states of the Spark app in question.
This PR has been silent for months, so unless the owner of the PR replies, it should be closed. If you want to pick up the work, you'll have to open a new PR anyway.
What changes were proposed in this pull request?
As per the bug, the state corresponding to an app failure does not get registered with a callback to the launcher. So the already available isFailed() check in TaskState.scala is used to send an explicit SparkAppHandle.State.FAILED status back to the launcherBackend, which under the hood updates the status when (LOST == state) || (FAILED == state).
The already available TaskState.isFinished() is used to set SparkAppHandle.State.FINISHED; under the hood it treats FINISHED, FAILED, KILLED, and LOST as all being part of the finished states.
Since isFailed() is checked first, isFinished() is checked in an else if condition, so the two scenarios are mutually exclusive.
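For reference, the helper semantics this relies on, paraphrased as a sketch (not a verbatim copy of TaskState.scala):

```scala
// Paraphrased sketch of org.apache.spark.TaskState (not verbatim)
object TaskState extends Enumeration {
  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value
  type TaskState = Value

  // "failed" means exactly FAILED or LOST
  def isFailed(state: TaskState): Boolean = (LOST == state) || (FAILED == state)

  // "finished" is broader: FINISHED, FAILED, KILLED, and LOST all count,
  // which is why isFailed must be checked before isFinished
  def isFinished(state: TaskState): Boolean =
    Set(FINISHED, FAILED, KILLED, LOST).contains(state)
}
```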
How was this patch tested?
This patch was tested manually. The tests included simply submitting local jobs and checking that the states are set correctly, as well as executing some failure-prone code and verifying that the status is set to FAILED. It looks difficult to write a test case that creates an error-prone scenario in code and checks the status.