Describe the bug
When performing a grid search (likewise a distributed or randomized search), the tasks finish but the call never returns. Eventually the connection is closed with an out-of-memory exception, yet it does not appear to be a genuine out-of-memory error, since the tasks complete.
To Reproduce
Steps to reproduce the behavior:
Train a large tree-based model across worker nodes in a grid search. The issue seems to occur only when the model takes a long time to construct. A minimal sketch of such a workload follows.
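For concreteness, here is a minimal sketch of the kind of workload described. Dataset sizes and parameter values are illustrative, and the DistGridSearchCV call follows the usage pattern from sk-dist's README (estimator, param grid, then the SparkContext):

```python
# Hypothetical repro: a grid search whose individual fits take a while.
from pyspark.sql import SparkSession
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skdist.distribute.search import DistGridSearchCV

spark = SparkSession.builder.appName("skdist-repro").getOrCreate()
sc = spark.sparkContext

# Large enough that each candidate model takes noticeable time to build.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
param_grid = {"n_estimators": [200, 500], "max_depth": [10, 20, None]}

# Candidate fits run as Spark tasks on the workers.
search = DistGridSearchCV(RandomForestClassifier(), param_grid, sc, cv=3)
search.fit(X, y)  # all tasks finish, then the application appears to hang
```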
Expected behavior
Additional context
Seems related to: apache/spark#24898
I replaced sk-dist with joblibspark and trained the same model. Training completed (joblib reports the tasks as done), but I received the error:
because pyspark py4j is not in pinned thread mode, we could not terminate running spark jobs correctly.
Glad it ended up working out. One note here: for grid search in particular, when refit=True (which it is by default), the best estimator is refit on the driver after the best parameter set is chosen. So all of the tasks can finish, but a remaining job still runs on the driver. According to the Spark logs the application will appear to "hang", when in reality it is training that final model on the driver.
You'll need to make sure enough driver memory is allocated for that job to succeed. You won't get great error messages if this doesn't go well, since the driver will likely just stop the Spark application if it runs out of memory. This is the same on sk-dist and joblib-spark.
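Putting the two points above together, here is a sketch of a possible workaround, with assumptions flagged: register_spark() and the "spark" joblib backend follow joblib-spark's documented usage, while the PYSPARK_PIN_THREAD variable and the driver-memory setting are assumptions that depend on your Spark version and deploy mode.

```python
import os

# Assumption: PYSPARK_PIN_THREAD enables py4j pinned-thread mode on Spark 3.0+;
# it must be set before the JVM gateway is launched.
os.environ["PYSPARK_PIN_THREAD"] = "true"

from pyspark.sql import SparkSession
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Caveat: driver memory generally has to be set before the driver JVM starts
# (e.g. via spark-submit --driver-memory 16g); this builder config only takes
# effect when it launches a fresh JVM.
spark = (
    SparkSession.builder
    .appName("joblibspark-gridsearch")
    .config("spark.driver.memory", "16g")
    .getOrCreate()
)

register_spark()  # make the "spark" joblib backend available

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
param_grid = {"n_estimators": [200, 500], "max_depth": [10, 20, None]}

# refit=False skips the final fit on the driver, which is the step that
# otherwise appears to hang and can silently exhaust driver memory.
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, refit=False)

with parallel_backend("spark", n_jobs=8):
    search.fit(X, y)

print(search.best_params_)
```

With refit=False the final model is never trained on the driver; you can refit the winning parameters separately on a machine with enough memory.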