Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(jvm): make job cancellation hashcode not fail after deserialization #4291

Merged

Conversation

AlexRiedler
Copy link
Contributor

Summary

On JVM JobCancellationException is Serializable, but job field is nullable and Transient:

@JvmField @Transient internal actual val job: Job

as a result job is null after deserialization, so when hashCode() is ran on the exception, it results in null pointer exception. This change is to fix this issue by handling nullability of the job field.

Exception from Production while running Apache Flink:

java.lang.NullPointerException: Cannot invoke "Object.hashCode()" because "this.job" is null
	at kotlinx.coroutines.JobCancellationException.hashCode(Exceptions.kt:65)
	at java.base/java.util.HashMap.hash(Unknown Source)
	at java.base/java.util.HashMap.put(Unknown Source)
	at java.base/java.util.HashSet.add(Unknown Source)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:91)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:93)
	at org.apache.flink.util.SerializedThrowable.<init>(SerializedThrowable.java:62)
	at org.apache.flink.runtime.executiongraph.ErrorInfo.<init>(ErrorInfo.java:72)
	at org.apache.flink.runtime.executiongraph.ErrorInfo.createErrorInfoWithNullableCause(ErrorInfo.java:52)
	at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1154)
	at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:933)
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateStateInternal(DefaultExecutionGraph.java:1396)
	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateState(DefaultExecutionGraph.java:1355)
	at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:763)
	at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:83)
	at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:488)

@qwwdfsad
Copy link
Collaborator

qwwdfsad commented Dec 4, 2024

Thanks! Could you please elaborate -- is this something you encountered in the wild (i.e. you have a domain where you serialize and deserialize JCE consistently) or is it something you just spotted accidentally?

@qwwdfsad qwwdfsad changed the base branch from master to develop December 4, 2024 13:54
@qwwdfsad qwwdfsad changed the base branch from develop to master December 4, 2024 13:54
@qwwdfsad
Copy link
Collaborator

qwwdfsad commented Dec 4, 2024

Could you please also rebase your branch on top of develop (see also: https://github.com/Kotlin/kotlinx.coroutines/blob/master/CONTRIBUTING.md#submitting-prs)?

@AlexRiedler AlexRiedler changed the base branch from master to develop December 4, 2024 13:56
@AlexRiedler AlexRiedler force-pushed the job-cancellation-transient-failure branch from 8b8a0f8 to d24a35c Compare December 4, 2024 13:57
@AlexRiedler
Copy link
Contributor Author

AlexRiedler commented Dec 4, 2024

@qwwdfsad rebased onto develop (missed that stepped in the guide 😅)

Yes this was encountered in the wild while in one of our production Apache Flink Job's.

Since Apache Flink uses a distributed across system mechanism. It will serialize exceptions at job boundaries (that is between user code and framework code) across systems for error tracking and reporting; it happens to use a HashMap as part of this and puts the exception as the key (which in-turn runs hashCode). This however, results in a null pointer exception as a result since job field is Transient in Kotlin Coroutines.

More detail: this appears to happen cause we are capturing the exception in a completable future as a reason to mark the completable future as a failure.

@qwwdfsad qwwdfsad self-requested a review December 4, 2024 17:52
@qwwdfsad
Copy link
Collaborator

qwwdfsad commented Dec 4, 2024

Thank you for the explanation!

I've also filed #4293 and started an internal discussion about this bug -- it seems to be a soundness violation, and it would be nice to at least have a warning here

@qwwdfsad qwwdfsad merged commit e670f62 into Kotlin:develop Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants