[mmlspark] Memory Corruption #2392
Comments
@imatiach-msft Can you please take a look?

also adding @eisber
@chris-smith-zocdoc could you please also post the code that reproduces this error?
Based on the stack trace it looks like it is going into here: Line 143 in 317b1bf

and then here:

which makes me suspect it has something to do with the Config: Line 403 in 317b1bf

which is initialized from the parameters passed to BoosterCreate: Line 1049 in 317b1bf

Not quite sure though, just a theory. It would be good to see the code to investigate further.
I'm still trying to create a reliable reproduction of this, but the code is essentially:

```python
from mmlspark.lightgbm import LightGBMRanker

ranker = LightGBMRanker()
ranker.setGroupCol('search_id')
ranker.setUseBarrierExecutionMode(True)
model = ranker.fit(df)
```
Yes they should be, I'm using https://github.com/Azure/mmlspark/. The reason I mentioned "parallel within the same JVM" is that the setup Databricks appears to use is standalone mode with all cores given to the executor. So I have 2 executors, with 16 cores each.
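For reference, this is roughly that setup expressed as Spark configuration (illustrative values only; on Databricks the cluster manager fills these in, so the exact keys here are an assumption):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: approximates the setup described above,
// 2 executors with 16 cores each, i.e. up to 16 concurrent tasks per executor JVM.
val spark = SparkSession.builder()
  .appName("lightgbm-ranker")
  .config("spark.executor.instances", "2")
  .config("spark.executor.cores", "16")
  .getOrCreate()
```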
Here's a new error I saw today that might be useful in diagnosing this:
> I'm still trying to create a reliable reproduction of this, but the code is essentially

the check is coming from here:
@imatiach-msft Do you know of a decent-sized, publicly available dataset I can train on?
@chris-smith-zocdoc the ranker tests in mmlspark are on the same dataset as this LightGBM GitHub repository, and unfortunately it isn't that large. There are a lot more datasets for the LightGBM classifier and regressor in mmlspark, but they are mostly small. I did test the classifier on the Higgs dataset, and there have been tests done on the Criteo dataset.
You can find the LightGBM ranker test files, which we use as well, here; note we run the ranker on that test file in our tests.
Thanks for the link @StrikerRUS, I was able to reproduce the issue. Commented on the other issue #2357 (comment)
@chris-smith-zocdoc since the other issue has been fixed, can this issue be closed as well, or are you still seeing crashes?
On the latest master of both LightGBM + mmlspark I am no longer seeing crashes. I'll re-open if they appear again.
@imatiach-msft unfortunately I'm seeing this again. The version of mmlspark I was using did not have this commit merged into it yet. This hid the issue on my cluster because it artificially reduced the parallelism to the number of executors instead of num_cores * num_executors. Once I pulled that commit in, I saw the issue again immediately. This further increases my belief that there is a concurrency bug here. Stack trace from today:
Hello, I am seeing some similar behavior. I would like to know at which stage this happens, @chris-smith-zocdoc. I am not sure about the issue, but it feels like the boosterPtr in mmlspark's LightGBMBooster, and loading the native library potentially more than once in the same JVM, could be a reason. I am not familiar with how the JVM behaves when the same dynamic library is loaded multiple times; could this create problems? In any case, can several Boosters be created? Is it guaranteed that all of them will be freed afterwards? It is important to say that I am working on macOS. Thank you very much
AFAIK this is still an issue. I've moved on to another project and am not actively working on this.

I don't believe loading the native library more than once should be an issue; Runtime.loadLibrary says subsequent calls are just ignored.

Several Boosters are being created in the same process, which I think is triggering the bug. We end up with one per task. If your Spark is configured like mine (on Databricks) then there is one worker per CPU core in the same process.

They should be getting freed here: https://github.com/Azure/mmlspark/blob/9c61053fa126959c962c0707aa543451ef077574/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L278-L281, but this bug is triggered before that would happen.
I think this free statement is only called at training time, not once you are using it for prediction or inference. I also think that the bug might be triggered by different boosterPtrs created in the same process by different threads. I made a PR that I think could help solve this, but I am not 100% sure it does: microsoft/SynapseML#792. Thank you very much for the quick reply.
@JoanFM I've only triggered this bug in training; I haven't really tried other scenarios. Have you triggered it in prediction or inference?

That would definitely be bad, but I haven't seen that in my experiments (it's been a while, so I might not remember clearly). What do you think is causing the current implementation to leak across threads?
Hello @chris-smith-zocdoc, I have triggered it in a test scenario, and not consistently every time. I am not sure what is causing this, but it seems that LightGBM leaves the responsibility for freeing these BoosterHandles to the users of the c_api, and I think that in mmlspark we are potentially requesting it more than once per thread. At this point I think some resources may leak, but why it is crashing is not so clear to me.
Yes, it does look like the model classes (e.g. LightGBMRegressionModel) never free their native memory; they never call the corresponding free. There are two places I think where that would need to happen.
@chris-smith-zocdoc
I think if you add the finalize method it would no longer be necessary to call lightgbmlib.LGBM_BoosterFree(booster) in TrainUtils? In an RAII style, maybe it can be guaranteed that the resource is obtained and freed whenever it is actually cleared?

UPDATE: I think I am wrong in this message; at training time I think no Model is yet created, so TrainUtils can keep the role of freeing that memory.
The best way I've seen so far is to implement both finalize and dispose, which allows both the user and the GC to clean up the native memory:

```java
public void dispose() {
    if (this.ptr != null) {
        LightGBM.free(this.ptr);
        this.ptr = null;
    }
}

protected void finalize() { dispose(); }
```

If we are concerned about multiple threads calling dispose concurrently then we'll also need synchronization. There have been some attempts to make something closer to RAII in Java; try-with-resources might be an option, though I think Spark will get in the way. There is also https://github.com/ThoughtWorksInc/RAII.scala but I've never used it.
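As a rough sketch of that synchronization idea (Scala; the names here are hypothetical, not the actual mmlspark API), an AtomicBoolean makes dispose idempotent so the user, the GC, and multiple threads can all safely race to free the handle:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper: `freeNative` stands in for whatever actually releases
// the native handle (e.g. a call like lightgbmlib.LGBM_BoosterFree in mmlspark).
class NativeHandle(freeNative: () => Unit) {
  private val disposed = new AtomicBoolean(false)

  // Idempotent and thread-safe: only the first caller frees the native memory.
  def dispose(): Unit =
    if (disposed.compareAndSet(false, true)) freeNative()

  // GC fallback; calling dispose() explicitly is still preferred.
  override def finalize(): Unit = dispose()
}
```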
It's still best practice to free it manually and not rely on the GC for this.
The JVM doesn't have RAII, but I linked some alternatives above.
@chris-smith-zocdoc @imatiach-msft @JoanFM |
Hey @guolinke. I think it might be, but unfortunately I am not able to reproduce the same conditions as when I was experiencing the problem.
@guolinke that PR addresses freeing the native memory but I don't think they address the concurrency issues in
It's been 2+ years since the last activity on this issue and it's possible that microsoft/SynapseML#792 fixed the original issue. I'm going to close this.
This issue has been automatically locked since there has not been any recent activity since it was closed. |
Memory Corruption when using Spark Package https://github.com/azure/mmlspark/
Similar to #2363 but I'm getting this while avoiding the usage of LGBM_DatasetCreateFromCSRSpark. To avoid that function I'm forcing my dataset to be a dense vector, which uses this code path.

The crash always occurs when calling the JNI method LGBM_BoosterCreate.
Is LGBM_BoosterCreate thread safe? What about the rest of the JNI methods?

My understanding of how the Spark package works is that we have each Spark worker (thread) calling this method in parallel within the same JVM (per executor). In my setup this is ~16 invocations at roughly the same time. If some part of LightGBM isn't thread safe then this is likely the issue.
Forcing lightgbmlib.LGBM_BoosterCreate to be synchronized seems to have stopped the crashes for now, but I'll report back if they appear again.
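A minimal sketch of what I mean by forcing it to be synchronized (the real call site is inside mmlspark's TrainUtils; the wrapper and argument names here are placeholders, not the actual API):

```scala
// Hypothetical guard: every Spark task in this executor JVM takes the same
// lock, so the native booster-creation call never runs concurrently with itself.
object BoosterCreateGuard {
  private val lock = new Object

  def create[T](boosterCreate: () => T): T =
    lock.synchronized {
      boosterCreate()
    }
}

// Usage (illustrative): wrap the real JNI call, e.g.
// BoosterCreateGuard.create(() => lightgbmlib.LGBM_BoosterCreate(datasetHandle, parameters))
```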
Environment info

Spark cluster described here: #2360

LightGBM version or commit hash: com.microsoft.ml.spark:mmlspark_2.11:0.18.1

I also tested with mmlspark@671b68892ace5967e60c7a064effd42dd5a21ec7 and saw the same crashes.
Error message

The actual error varies; here are a few different ones.

Another version

Another version
Reproducible examples
I haven't found a good way to reproduce this outside of my private dataset yet.