fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance #792
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
@@            Coverage Diff             @@
##           master     #792       +/-   ##
===========================================
- Coverage   31.47%   18.13%   -13.35%
===========================================
  Files         240      184       -56
  Lines        9699     8483     -1216
  Branches      528      528
===========================================
- Hits         3053     1538     -1515
- Misses       6646     6945      +299
Continue to review full report at Codecov.
interesting, all of the regular tests are passing but the pyspark notebooks are all failing with errors like:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 956.0 failed 4 times, most recent failure: Lost task 0.3 in stage 956.0 (TID 12257, 10.139.64.7, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.ml.lightgbm.lightgbmlibConstants

Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/dataframe.py in toPandas(self)
/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
seems to be some pickle issue
oooooh I see the issue. We are doing the init only on getting booster:
However, it seems the init also needs to be done before we call anything that uses lightgbmlibConstants.
Where was it done before then?
We were just doing it at the top of methods before using lightgbmlibConstants:
Note where we use lightgbmlibConstants.C_API_PREDICT_RAW_SCORE and lightgbmlibConstants.C_API_PREDICT_NORMAL
now we are potentially accessing lightgbmlibConstants before the native library has been loaded
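A minimal sketch of the idea under discussion: a single idempotent entry point that both the booster accessor and every method touching lightgbmlibConstants can call first. The names here (NativeLoaderSketch, ensureLoaded) are hypothetical, not the actual MMLSpark code, and the real JNI load call is stubbed out with a counter.

```scala
// Hypothetical sketch: one idempotent entry point for native initialization,
// called both when the booster is first materialized and before any use of
// lightgbmlibConstants. The actual library load is replaced by a counter.
object NativeLoaderSketch {
  @volatile private var loadCount = 0

  def ensureLoaded(): Unit = this.synchronized {
    if (loadCount == 0) {
      // The real code would load the JNI library here, e.g. via
      // System.loadLibrary or MMLSpark's native loader helper.
      loadCount += 1
    }
  }

  def loads: Int = loadCount
}
```

Because the guard is synchronized and checked inside the lock, calling ensureLoaded from many methods (or many threads) still results in at most one load.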
I see, I can try to find a way around this. But first I would like to be sure that you think this PR is a good contribution, and whether it could be the root cause of some memory issues that others and I have been experiencing.
Made some changes to try to guarantee that the native library is loaded any time it is needed.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala
Remove serialization from BoosterHandler, some documentation and minor changes. Since boosterHandler is not serializable, it forces Serializable objects to declare it as transient.
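The pattern described here can be sketched with hypothetical names: the serializable owner keeps only the serializable state (the model string), while the non-serializable handler is marked @transient and made a lazy val so it is rebuilt on first access, for example on each Spark executor after deserialization.

```scala
// Hypothetical sketch of the @transient lazy val pattern.
// HandlerSketch is deliberately NOT Serializable, so any Serializable owner
// must mark it @transient; making it a lazy val means it is reconstructed
// from the serializable model state on first use after deserialization.
class HandlerSketch(val model: String)

class BoosterSketch(val model: String) extends Serializable {
  @transient lazy val handler: HandlerSketch = new HandlerSketch(model)
}
```

After deserialization the transient field is null until first access, at which point the lazy val initializer runs again on that JVM.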
Style forbids implementing a finalize method; Java does not guarantee the order of finalization, so it is dangerous. I wonder how it is guaranteed that the memory will be cleared, though.
See the benchmarking failure in the tests for LightGBMRegressor and Classifier. It seems the trained model differs, but I do not see how this change can affect it, since LightGBMBooster is created with a given trained model.
In the last commit I removed this finalize method because, according to scalastyle and Java recommendations, finalizers should be avoided. But then why does current master have a finalize for LightGBMBooster?
Some issues come to mind; please correct me if any of these reflections are wrong or inaccurate, as I am not very experienced with the JVM:
Thank you very much for your help
As a general practice, yes, but they're perfect for working with native memory. I'd agree with this answer from Stack Overflow: "You can use the finalizer to cleanup as a last resort, but, i would not use solely to manage the actual lifetime for the reasons you actually pointed out in your own question. As such, let the finalizer be the final attempt, but not the first." Oracle also has a doc on the subject.
Yes, it is our job to call the methods to free any memory we have allocated that LightGBM has given us pointers for.
The JVM is completely unaware of the unmanaged memory that LightGBM allocates. The JVM only tracks memory that it allocates itself (java objects)
I don't know this for sure, but it might have to do with the design of the spark apis. If we need to return objects to spark for it to serialize between tasks, we wouldn't be able to free it beforehand. This is also why finalizers are useful.
I'm not very familiar with SWIG; Ilya might have a better answer for you.
I think the general best practice is to implement an explicit close/dispose method. For scalastyle, check the link for examples on how to disable the rule.
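Putting these suggestions together, here is a hedged sketch (hypothetical names; the scalastyle rule id `no.finalize` is an assumption): an idempotent private freeNativeMemory used both by an explicit close() as the primary release path and by finalize() as the last-resort backup.

```scala
// Hypothetical sketch: explicit close() is the primary release path and
// finalize() is only a last-resort backup, as suggested above. The actual
// native delete calls (e.g. lightgbmlib.delete_voidpp(boosterPtr)) are
// stubbed with a counter so the behavior is observable.
class NativeHandleSketch extends AutoCloseable {
  private var freed = false
  var freeCalls = 0 // exposed only so the sketch is observable

  // Idempotent: safe to reach from both close() and finalize().
  private def freeNativeMemory(): Unit = this.synchronized {
    if (!freed) {
      freeCalls += 1 // real code would free the native pointers here
      freed = true
    }
  }

  override def close(): Unit = freeNativeMemory()

  // scalastyle:off no.finalize
  override protected def finalize(): Unit = {
    try freeNativeMemory()
    finally super.finalize()
  }
  // scalastyle:on no.finalize
}
```

Making the free idempotent means a double close, or a close followed by a later finalization, never frees the same native pointer twice.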
Thank you very much @chris-smith-zocdoc and @imatiach-msft. I added back the finalize method, calling a private freeNativeMemory. I made it private for now because at this point I do not know whether there is a need for external calls to this method. At the same time, I did not cover null checks in this scope; I find it a little incoherent to ensure the pointers are not null at delete time but not when we try to use them, although maybe I am unaware of some checks done under the hood. I still get the same test errors, which I find difficult to understand.
Which error are you seeing? Can you post the stack trace?
I mean, test errors, but I do not see why
I was seeing errors in the tests, but now I found the origin: I had wrongly ported a lightgbmlibConstants value.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
ping @mhamilton723 the cognitive tests still seem to be failing, even on other PRs |
Hello @imatiach-msft, any news about the investigation of these issues? Thanks
Really nice work @JoanFM :)
fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (microsoft#792)
* Move transient var to transient lazy val
* Try guaranteeing NativeLibrary is loaded every time it is needed, but only loaded once
* Add freeing of resources, handle all memory in BoosterHandler
* Remove serialization from BoosterHandler, some documentation and minor changes. By boosterHandler not being serializable, it forces Serializable objects to declare it as transient
* Style forbids implementing finalize method; Java does not guarantee order of finalization so it is dangerous. I wonder how it is guaranteed that the memory will be cleared though
* Add finalize method to call free for native C++ allocated memory
* Use the correct predict-normal constant
I am not sure this is a complete fix, and I am not sure about the potential performance implications.
I think these changes could be helpful in solving memory issues that seem related to concurrency in spark such as microsoft/LightGBM#2392.
Please feel free to comment; this is just a proposal and I am not 100% sure that it fixes these issues.