Describe the bug
We recently had a customer run into a deadlock. It is really confusing because all of the stuck stack traces still appeared to be RUNNABLE, but were in `Object.wait` in the JVM. From the stack traces it appears that the `Object.wait` is being called by the JVM itself to load the classes and run the static initialization to set it all up.

Please note that this deadlock showed up on 24.04.1, but it appears to be a systemic issue where we have a circular dependency that can cause objects to be loaded on different threads in different orders.
The stack traces I saw which caused the issue were (roughly)

THREAD 1:
- `ObjectInputStream` is deserializing a task to run (which includes `GpuToTimestamp`).
- `GpuToTimestamp` calls into `GpuOverrides` to get the `TimeParserPolicy`. This causes `GpuOverrides` to be loaded.
- `GpuOverrides` calls into `SparkShimsImpl` to get rules to translate ANSI cast statements, but has to wait because another thread is loading `SparkShimsImpl`.

THREAD 2:
- `GpuParquetFileFilterHandler` wants to do predicate push down and calls into `SparkShimsImpl` to do the translation to something Parquet MR can deal with. This causes `SparkShimsImpl` to be loaded.
- `SparkShimsImpl` ends up calling into `Spark340PlusNonDBShims` to set up override rules, and to do this it needs to load `GpuOverrides`, but blocks because another thread is in the middle of loading `GpuOverrides`.

I think the only way to fix this is to decompose these really large classes.
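The interleaving of the two threads described above can be reproduced in isolation. The following is only a minimal sketch in plain Java, not code from the plugin: `Overrides` and `Shims` are hypothetical stand-ins for `GpuOverrides` and `SparkShimsImpl`, and the latch exists purely to force both threads into their static initializers before either one touches the other class.

```java
import java.util.concurrent.CountDownLatch;

class ClassInitDeadlockDemo {
    // Opens once both threads are inside a static initializer.
    static final CountDownLatch bothInitializing = new CountDownLatch(2);

    static void rendezvous() {
        bothInitializing.countDown();
        try {
            bothInitializing.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for GpuOverrides: its initializer needs Shims.
    static class Overrides {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = Shims.VALUE + 1; // waits while another thread initializes Shims
        }
    }

    // Hypothetical stand-in for SparkShimsImpl: its initializer needs Overrides.
    static class Shims {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = Overrides.VALUE + 1; // waits while another thread initializes Overrides
        }
    }

    // Returns true if both threads are still stuck after the join timeouts.
    static boolean runDemo() {
        Thread t1 = new Thread(() -> { int unused = Overrides.VALUE; }, "thread-1");
        Thread t2 = new Thread(() -> { int unused = Shims.VALUE; }, "thread-2");
        // Daemon threads so the JVM can still exit while they are stuck.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(1000);
            t2.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Print the reported thread states; no monitor is held, so no BLOCKED state is expected.
        System.out.println(t1.getName() + " state: " + t1.getState());
        System.out.println(t2.getName() + " state: " + t2.getState());
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads stuck: " + runDemo());
    }
}
```

Each thread has marked its own class as "initialization in progress" and is waiting for the other class to finish initializing, so neither can proceed. Because the wait happens inside the JVM's class initialization rather than on an ordinary Java monitor, the usual deadlock signs (threads reported as blocked on a lock) do not appear, which matches the confusing stack traces described above.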
First, `SparkShimsImpl` is the old way of doing shims, and because it is a large catch-all class it increases the likelihood that we will develop a circular dependency. We should take the time to split it apart into smaller pieces that follow the new shim pattern.

Second, `GpuOverrides` is a giant class where most things are related to the main goal of translating CPU operators to GPU operators, but some of these things should be moved to other classes. 4000+ lines is too much for a single file, and too much for a single class too.
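To illustrate why decomposition helps, here is a hedged sketch in plain Java, not the actual refactoring: the value both sides need is moved into a small leaf class (the hypothetical `PolicyHolder`) that depends on nothing else, so the static initializers of the hypothetical `Overrides` and `Shims` classes no longer reference each other. All names and values are assumptions made for the example.

```java
class DecomposedInitDemo {
    // Hypothetical leaf class: holds only the shared value and has no
    // dependency on Overrides or Shims.
    static class PolicyHolder {
        static final int POLICY = compute();

        private static int compute() {
            return 42; // placeholder value for the sketch
        }
    }

    // Hypothetical stand-in for GpuOverrides: depends only on the leaf class.
    static class Overrides {
        static final int VALUE;
        static {
            pause();
            VALUE = PolicyHolder.POLICY + 1;
        }
    }

    // Hypothetical stand-in for SparkShimsImpl: depends only on the leaf class.
    static class Shims {
        static final int VALUE;
        static {
            pause();
            VALUE = PolicyHolder.POLICY + 2;
        }
    }

    // Widens the window in which both initializers run at the same time.
    static void pause() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if both threads finished class initialization.
    static boolean runDemo() {
        Thread t1 = new Thread(() -> { int unused = Overrides.VALUE; }, "thread-1");
        Thread t2 = new Thread(() -> { int unused = Shims.VALUE; }, "thread-2");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(2000);
            t2.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runDemo());
    }
}
```

If both threads reach `PolicyHolder` at the same time, one of them simply waits for the other to finish initializing it. Since `PolicyHolder` never waits on anything in return, the wait always ends, whatever order the classes are loaded in.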