
Python REPL connection issues #35

Closed
S-C-H opened this issue Jan 20, 2020 · 2 comments
S-C-H commented Jan 20, 2020

Describe the bug

When performing a grid search (distributed or randomized), the tasks finish but the connection is never returned; eventually the connection is closed with an out-of-memory exception. It does not appear to be a genuine out-of-memory error, since the tasks complete.

To Reproduce
Steps to reproduce the behavior:

Train a large tree-based model across worker nodes in a grid search. The issue seems to occur only when the model takes a long time to construct.
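A minimal sketch of the setup described above, for reference. The class name `DistGridSearchCV`, its module path, and its `(estimator, param_grid, SparkContext)` signature are assumptions about sk-dist's API, and the dataset/parameter values are illustrative only; running it requires a live Spark cluster.

```python
# Illustrative repro sketch only; names and sizes are assumptions, not taken
# from the issue. Mirrors the reported setup: a distributed grid search over
# a tree-based model that is slow to construct on each worker.
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skdist.distribute.search import DistGridSearchCV  # assumed module path

spark = SparkSession.builder.appName("grid-search-repro").getOrCreate()
sc = spark.sparkContext

X, y = make_classification(n_samples=50_000, n_features=100)

# Large forests make each fit slow, which is when the hang was observed
param_grid = {"n_estimators": [500, 1000], "max_depth": [None, 20]}
search = DistGridSearchCV(RandomForestClassifier(), param_grid, sc)
search.fit(X, y)  # tasks finish, but the application appeared to hang afterwards
```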

Expected behavior

Additional context

Seems related to: apache/spark#24898

I replaced sk-dist with joblibspark and trained the same model, which completed (joblib reports the tasks as done), but received the error:

because pyspark py4j is not in pinned thread mode, we could not terminate running spark jobs correctly.
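The error above refers to py4j's pinned thread mode. As a hedged aside (not something the thread confirms fixed this issue): Spark 3.0 introduced an opt-in pinned thread mode, enabled through the `PYSPARK_PIN_THREAD` environment variable, which maps each Python thread to its own JVM thread so that running jobs can be cancelled correctly. A minimal configuration sketch:

```python
# Sketch: enabling py4j pinned thread mode (Spark 3.0+). The variable must be
# set before the JVM starts, i.e. before any SparkSession/SparkContext exists.
import os

os.environ["PYSPARK_PIN_THREAD"] = "true"  # must precede SparkSession creation

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinned-thread-demo").getOrCreate()
```

Equivalently, export the variable in the shell before launching `pyspark` or `spark-submit`.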

@S-C-H S-C-H added the bug Something isn't working label Jan 20, 2020

S-C-H commented Jan 20, 2020

Nevermind - I "think" I have this sorted.

@S-C-H S-C-H closed this as completed Jan 20, 2020
denver1117 (Contributor) commented
Glad it ended up working out. One note here, particularly for grid search: when refit=True (the default), the best estimator is refit on the driver after the best parameter set is chosen. So all of the tasks can finish, but a remaining job still runs on the driver. In the Spark logs the application will appear to "hang", when in reality it's training that final model on the driver.

You'll need to be sure enough driver memory is allocated for that job to succeed. You won't get great error messages if this goes wrong, since the driver will likely just stop the Spark application when it runs out of memory. This behavior should be the same with sk-dist or joblib-spark.
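To make the point above concrete, here is a configuration sketch for sizing driver memory. The value `8g` is an assumed placeholder; choose it based on the final model's footprint. Note that `spark.driver.memory` takes effect only if it is set before the driver JVM starts, so in an already-running session it must instead be passed via `spark-submit --driver-memory` or `spark-defaults.conf`.

```python
# Sketch: allocating generous driver memory so the final refit (refit=True)
# has room to train the best model on the driver. "8g" is an assumed value.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("grid-search")
    .config("spark.driver.memory", "8g")  # only honored before the JVM starts
    .getOrCreate()
)
```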
