
Python REPL connection issues #35

Closed
S-C-H opened this issue Jan 20, 2020 · 2 comments
S-C-H commented Jan 20, 2020

Describe the bug

When performing a grid search (distributed or randomized), the tasks finish but the connection is never returned; eventually the connection is closed with an out-of-memory exception. It does not appear to be a genuine out-of-memory error, since the tasks complete.

To Reproduce
Steps to reproduce the behavior:

Train a large tree-based model across worker nodes in a grid search. The issue seems to occur only when the model takes a long time to construct.
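A minimal sketch of the setup described above, for reference. The class name `DistGridSearchCV`, its module path, and its `(estimator, param_grid, SparkContext)` signature are assumptions about sk-dist's API, and the dataset/parameter values are illustrative only; running it requires a live Spark cluster.

```python
# Illustrative repro sketch only; names and sizes are assumptions, not taken
# from the issue. Mirrors the reported setup: a distributed grid search over
# a tree-based model that is slow to construct on each worker.
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skdist.distribute.search import DistGridSearchCV  # assumed module path

spark = SparkSession.builder.appName("grid-search-repro").getOrCreate()
sc = spark.sparkContext

X, y = make_classification(n_samples=50_000, n_features=100)

# Large forests make each fit slow, which is when the hang was observed
param_grid = {"n_estimators": [500, 1000], "max_depth": [None, 20]}
search = DistGridSearchCV(RandomForestClassifier(), param_grid, sc)
search.fit(X, y)  # tasks finish, but the application appeared to hang afterwards
```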

Expected behavior

Additional context

Seems related to: apache/spark#24898

I replaced sk-dist with joblibspark and trained the same model, which completed (joblib reports the tasks as done), but received the error:

because pyspark py4j is not in pinned thread mode, we could not terminate running spark jobs correctly.
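The error above refers to py4j's pinned thread mode. As a hedged aside (not something the thread confirms fixed this issue): Spark 3.0 introduced an opt-in pinned thread mode, enabled through the `PYSPARK_PIN_THREAD` environment variable, which maps each Python thread to its own JVM thread so that running jobs can be cancelled correctly. A minimal configuration sketch:

```python
# Sketch: enabling py4j pinned thread mode (Spark 3.0+). The variable must be
# set before the JVM starts, i.e. before any SparkSession/SparkContext exists.
import os

os.environ["PYSPARK_PIN_THREAD"] = "true"  # must precede SparkSession creation

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinned-thread-demo").getOrCreate()
```

Equivalently, export the variable in the shell before launching `pyspark` or `spark-submit`.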

@S-C-H S-C-H added the bug Something isn't working label Jan 20, 2020

S-C-H commented Jan 20, 2020

Nevermind - I "think" I have this sorted.

@S-C-H S-C-H closed this as completed Jan 20, 2020
denver1117 (Contributor) commented
Glad it ended up working out. One note here, particularly for grid search: when refit=True (the default), the best estimator is refit on the driver after the best parameter set is chosen. So all of the tasks can finish, but a remaining job still runs on the driver. In the Spark logs the application will appear to "hang", when in reality it's training that final model on the driver.

You'll need to be sure enough driver memory is allocated for that job to succeed. You won't get great error messages if this goes wrong, since the driver will likely just stop the Spark application when it runs out of memory. This behavior should be the same with sk-dist or joblib-spark.
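To make the point above concrete, here is a configuration sketch for sizing driver memory. The value `8g` is an assumed placeholder; choose it based on the final model's footprint. Note that `spark.driver.memory` takes effect only if it is set before the driver JVM starts, so in an already-running session it must instead be passed via `spark-submit --driver-memory` or `spark-defaults.conf`.

```python
# Sketch: allocating generous driver memory so the final refit (refit=True)
# has room to train the best model on the driver. "8g" is an assumed value.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("grid-search")
    .config("spark.driver.memory", "8g")  # only honored before the JVM starts
    .getOrCreate()
)
```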
