[BUG] Possible Deadlock when loading objects #12032

Open
revans2 opened this issue Jan 27, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Jan 27, 2025

Describe the bug
We recently had a customer run into a deadlock. It was really confusing because all of the stuck threads still appeared to be RUNNABLE in the stack traces, yet they were sitting in Object.wait inside the JVM. From the stack traces, it appears that Object.wait is being called by the JVM itself while it loads the classes and runs their static initialization.

Please note that this deadlock showed up on 24.04.1, but it appears to be a systemic issue: we have a circular dependency that can cause objects to be loaded on different threads in different orders.
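To make the failure mode concrete, here is a minimal Java sketch of a class-initialization deadlock. It is not taken from the plugin: `A`, `B`, `rendezvous`, and `reproduce` are hypothetical names, and the latch exists only to force the unlucky interleaving every time. Each thread starts initializing one class and then needs the other class, whose initialization is already in progress on the other thread.

```java
import java.util.concurrent.CountDownLatch;

class ClassInitDeadlockDemo {
    // Hypothetical helper state: released once both threads are inside a static initializer.
    static final CountDownLatch bothInitializing = new CountDownLatch(2);

    static void rendezvous() {
        bothInitializing.countDown();
        try {
            bothInitializing.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // A and B stand in for two large classes with a circular static dependency.
    static class A {
        static final int VALUE;
        static {
            rendezvous();          // make sure B's initializer has started too
            VALUE = B.VALUE + 1;   // needs B, which the other thread is initializing
        }
    }

    static class B {
        static final int VALUE;
        static {
            rendezvous();          // make sure A's initializer has started too
            VALUE = A.VALUE + 1;   // needs A, which the other thread is initializing
        }
    }

    /** Returns true if both threads are still stuck after waiting up to timeoutMillis on each. */
    static boolean reproduce(long timeoutMillis) {
        Thread t1 = new Thread(() -> System.out.println(A.VALUE), "init-A");
        Thread t2 = new Thread(() -> System.out.println(B.VALUE), "init-B");
        // Daemon threads so the JVM can still exit even though they never finish.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(timeoutMillis);
            t2.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // The reported state depends on the JVM; the report above saw RUNNABLE.
        System.out.println(t1.getName() + " alive=" + t1.isAlive() + " state=" + t1.getState());
        System.out.println(t2.getName() + " alive=" + t2.isAlive() + " state=" + t2.getState());
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked=" + reproduce(500));
    }
}
```

Neither thread holds a Java-level monitor that a thread dump would flag as a lock cycle. They are waiting on the JVM's per-class initialization state, which is presumably why the stuck threads in the customer's dump did not look blocked.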

The stack traces I saw which caused the issue were (roughly)

THREAD 1:

THREAD 2:

I think the only way to fix this is to decompose these really large classes.

First, SparkShimsImpl is the old way of doing shims, and because it is a large catch-all class, it increases the likelihood that we will develop a circular dependency. We should take the time to split it apart into smaller pieces that follow the new shim pattern.

Second, GpuOverrides is a giant class where most things are related to the main goal of translating CPU operators to GPU operators, but some of these things should be moved to other classes. 4000+ lines is too much for a single file, and too much for a single class too.
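As a rough illustration of why decomposition helps, the sketch below moves shared state into a small leaf class with no outgoing static dependencies. All names (`SharedDefaults`, `ShimLike`, `OverridesLike`, `loadConcurrently`) are hypothetical stand-ins, not the plugin's real classes, and this is one possible shape of the refactoring rather than the planned one. Because both larger classes depend only on the leaf, the static dependency graph is acyclic and the load order across threads no longer matters.

```java
class DecomposedInitSketch {
    // Hypothetical leaf class: it references no other class during initialization,
    // so its initialization can always run to completion.
    static final class SharedDefaults {
        static int base() {
            return 10;
        }
    }

    // Stand-ins for two large classes that previously referenced each other.
    // Each now depends only on the leaf class.
    static final class ShimLike {
        static final int VALUE = SharedDefaults.base() + 1;
    }

    static final class OverridesLike {
        static final int VALUE = SharedDefaults.base() + 2;
    }

    /** Loads both classes on separate threads and returns the two initialized values. */
    static int[] loadConcurrently(long timeoutMillis) {
        int[] out = new int[2];
        Thread t1 = new Thread(() -> out[0] = ShimLike.VALUE, "load-shim");
        Thread t2 = new Thread(() -> out[1] = OverridesLike.VALUE, "load-overrides");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(timeoutMillis);
            t2.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (t1.isAlive() || t2.isAlive()) {
            throw new IllegalStateException("class initialization did not finish in time");
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = loadConcurrently(5000);
        System.out.println(values[0] + " " + values[1]);
    }
}
```

The same idea applies at a larger scale: the smaller and more single-purpose each class is, the easier it is to keep static references pointing in one direction.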

@revans2 added the "? - Needs Triage" and "bug" labels Jan 27, 2025
@gerashegalov
Collaborator

GpuOverrides complexity has been previously raised in #10838

@mattahrens removed the "? - Needs Triage" label Jan 29, 2025

3 participants