
[mmlspark] Memory Corruption #2392

Closed · chris-smith-zocdoc opened this issue Sep 9, 2019 · 29 comments

@chris-smith-zocdoc (Contributor) commented Sep 9, 2019

Memory Corruption when using Spark Package https://github.com/azure/mmlspark/

Similar to #2363, but I'm hitting this while avoiding the usage of LGBM_DatasetCreateFromCSRSpark. To avoid that function I'm forcing my dataset to be a dense vector, which uses this code path instead.

The crash always occurs when calling the JNI method LGBM_BoosterCreate.

Is LGBM_BoosterCreate thread-safe? What about the rest of the JNI methods?

My understanding of how the Spark package works is that each Spark worker (thread) calls this method in parallel within the same JVM (per executor). In my setup this is ~16 invocations at roughly the same time. If some part of LightGBM isn't thread-safe, then this is likely the issue.

Forcing lightgbmlib.LGBM_BoosterCreate to be synchronized seems to have stopped the crashes for now, but I'll report back if they appear again.
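
For illustration, forcing the call to be synchronized amounts to something like the following (a minimal Scala sketch, not the actual mmlspark change; BoosterCreateLock and the names in the usage comment are hypothetical):

// Funnel every native booster creation through one JVM-wide lock so that
// tasks running in the same executor JVM never enter LGBM_BoosterCreate
// concurrently.
object BoosterCreateLock {
  private val lock = new Object

  def serialized[T](createBooster: => T): T = lock.synchronized {
    createBooster
  }
}

// Hypothetical call site inside the per-task training code:
// val rc = BoosterCreateLock.serialized {
//   lightgbmlib.LGBM_BoosterCreate(trainDatasetHandle, trainParams, boosterOutPtr)
// }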

Environment info

Spark cluster described here #2360

LightGBM version or commit hash:
com.microsoft.ml.spark:mmlspark_2.11:0.18.1

I also tested with mmlspark@671b68892ace5967e60c7a064effd42dd5a21ec7 and saw the same crashes.

Error message

The actual error varies; here are a few different ones:

java: malloc.c:3722: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.

Another version

java: malloc.c:2927: __libc_malloc: Assertion `!victim || chunk_is_mmapped (mem2chunk (victim)) || ar_ptr == arena_for_chunk (mem2chunk (victim))' failed.
*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': malloc(): memory corruption: 0x00007f4bb03d9420 ***

Another version

*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': malloc(): memory corruption: 0x00007fc1455428c0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fd835d067e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fd835d1113e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fd835d13184]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7fd8363b5097]
/local_disk0/tmp/mml-natives2521041663958175051/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x7fb)[0x7fc06602003b]
/local_disk0/tmp/mml-natives2521041663958175051/lib_lightgbm.so(_ZN8LightGBM7Booster25CreateObjectiveAndMetricsEv+0x21)[0x7fc065e9e051]
/local_disk0/tmp/mml-natives2521041663958175051/lib_lightgbm.so(LGBM_BoosterCreate+0x163)[0x7fc065e8fe13]
/local_disk0/tmp/mml-natives7682685279022156827/lib_lightgbm_swig.so(Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1BoosterCreate+0x3f)[0x7fc065bf1e7f]
[0x7fd81f1903c7]
======= Memory map: ========
00400000-00401000 r-xp 00000000 ca:10 5901725                            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
00600000-00601000 r--p 00000000 ca:10 5901725                            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
00601000-00602000 rw-p 00001000 ca:10 5901725                            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
01b9e000-06d5a000 rw-p 00000000 00:00 0                                  [heap]

Reproducible examples

I haven't found a good way to reproduce this outside of my private dataset yet.

@StrikerRUS (Collaborator)

@imatiach-msft Can you please take a look?

StrikerRUS added the bug label Sep 10, 2019
guolinke changed the title from "Memory Corruption" to "[mmlspark] Memory Corruption" Sep 10, 2019
@imatiach-msft (Contributor)

also adding @eisber

@imatiach-msft (Contributor)

@chris-smith-zocdoc could you please also post the code that reproduces this error?
Are the objects passed to create it separate instances (dataset, parameters)?

@imatiach-msft (Contributor)

based on the stack trace it looks like it is going into here:

objective_fun_.reset(ObjectiveFunction::CreateObjectiveFunction(config_.objective,

and then here:
ObjectiveFunction* ObjectiveFunction::CreateObjectiveFunction(const std::string& type, const Config& config) {

which makes me suspect it has something to do with the Config:
Config config_;

which is initialized from the parameters passed to BoosterCreate:
const char* parameters,

Not quite sure though, just a theory. Would be good to see the code to investigate further.

@chris-smith-zocdoc (Contributor, Author)

> @chris-smith-zocdoc could you please also post the code that reproduces this error?

I'm still trying to create a reliable reproduction of this, but the code is essentially:

from mmlspark.lightgbm import LightGBMRanker

ranker = LightGBMRanker()
ranker.setGroupCol('search_id')
ranker.setUseBarrierExecutionMode(True)

# df is the training DataFrame (my private dataset) with a 'search_id' group column
model = ranker.fit(df)

> Are the objects passed to create it separate instances (dataset, parameters)?

Yes they should be, I'm using https://github.com/Azure/mmlspark/

The reason I mentioned "parallel within the same JVM" is that the setup Databricks appears to use is standalone mode with all cores given to the executor.

So I have 2 executors, with 16 cores each.
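
For concreteness, that layout works out roughly like this (a Scala sketch; the numbers come from this comment, not from the real cluster config):

// In standalone mode with all cores given to the executor, every core runs a
// task thread inside the same executor JVM, so all of them share one loaded
// copy of lib_lightgbm.so.
val numExecutors     = 2
val coresPerExecutor = 16
val concurrentBoosterCreatesPerJvm = coresPerExecutor                // ~16 parallel JNI calls per executor
val totalConcurrentTasks           = numExecutors * coresPerExecutor // 32 tasks across the cluster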

@chris-smith-zocdoc (Contributor, Author)

Here's a new error I saw today that might be useful in diagnosing this

Check failed: cache_size_ >= 2 at /__w/1/s/src/treelearner/feature_histogram.hpp, line 676 .

	at com.microsoft.ml.spark.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:30)
	at com.microsoft.ml.spark.lightgbm.TrainUtils$.createBooster(TrainUtils.scala:160)
	at com.microsoft.ml.spark.lightgbm.TrainUtils$.translate(TrainUtils.scala:295)
	at com.microsoft.ml.spark.lightgbm.TrainUtils$.trainLightGBM(TrainUtils.scala:440)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:128)
	at com.microsoft.ml.spark.lightgbm.LightGBMBase$$anonfun$6.apply(LightGBMBase.scala:128)
	at org.apache.spark.rdd.RDDBarrier$$anonfun$mapPartitions$1$$anonfun$apply$1.apply(RDDBarrier.scala:51)
	at org.apache.spark.rdd.RDDBarrier$$anonfun$mapPartitions$1$$anonfun$apply$1.apply(RDDBarrier.scala:51)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
	at org.apache.spark.scheduler.Task.run(Task.scala:112)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@StrikerRUS (Collaborator) commented Sep 11, 2019

For cache_size >= 2, we have received a similar issue recently: #2357.

@imatiach-msft (Contributor)

the check is coming from here:
https://github.com/microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp#L676
would be interesting to see what the value is.

"I'm still trying to create a reliable reproduction of this, but the code is essentially"
bizarre, we have tests covering that basic scenario, for the ranker:
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split2/VerifyLightGBMRanker.scala
and for classifier and regressor with booster:
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split2/VerifyLightGBMRegressor.scala
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala
these tests consistently pass and don't fail with either error, hmm

@chris-smith-zocdoc (Contributor, Author)

@imatiach-msft Do you have a decent-sized, publicly available dataset I can train LightGBMRanker on? I'd like to try to reproduce on a public dataset without any of the custom logic in my codebase.

@imatiach-msft (Contributor)

@chris-smith-zocdoc the ranker tests in mmlspark use the same dataset as this LightGBM GitHub repository, and unfortunately it isn't that large. There are a lot more datasets for the LightGBM classifier and regressor in mmlspark, but they are mostly small. I did test the classifier on the Higgs dataset, and there have been tests done on the Criteo dataset.
I've put the Azure Databricks and HDInsight notebooks for the Higgs dataset in my branch here (note: I will not be checking these in):
https://github.com/imatiach-msft/mmlspark/tree/ilmat/higgs-notebooks/notebooks/higgs
You can get the Higgs csv dataset here:
https://github.com/guolinke/boosting_tree_benchmarks/tree/master/data
Specifically this link:
https://archive.ics.uci.edu/ml/datasets/HIGGS

@imatiach-msft (Contributor) commented Sep 11, 2019

you can find the lightgbm ranker test files here which we use as well:
https://github.com/microsoft/LightGBM/tree/master/examples/lambdarank

note we run the ranker on that test file in our tests
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split2/VerifyLightGBMRanker.scala

@chris-smith-zocdoc (Contributor, Author)

> For cache_size >= 2, we have received a similar issue recently: #2357.

Thanks for the link @StrikerRUS, I was able to reproduce the issue. I commented on the other issue: #2357 (comment)

@imatiach-msft (Contributor)

@chris-smith-zocdoc since the other issue has been fixed, can this issue be closed as well, or are you still seeing crashes?

@chris-smith-zocdoc (Contributor, Author)

> @chris-smith-zocdoc since the other issue has been fixed, can this issue be closed as well, or are you still seeing crashes?

On the latest master of both LightGBM + mmlspark I am no longer seeing crashes. I'll re-open if they appear again

@chris-smith-zocdoc (Contributor, Author)

@imatiach-msft unfortunately I'm seeing this again.

The version of mmlspark I was using did not yet have this commit merged. That hid the issue on my cluster because it artificially reduced the parallelism to the number of executors instead of num_cores * num_executors. Once I pulled that commit in, I saw the issue again immediately.

This further increases my belief that there is a concurrency bug here.

Stack from today

*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': corrupted double-linked list: 0x00000000037433d0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f5d127a67e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x82970)[0x7f5d127b1970]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f5d127b3184]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7f5d12e55097]
/local_disk0/tmp/mml-natives5480771853002411874/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator9CalMaxDCGERKSt6vectorIiSaIiEEPKfiPS1_IdSaIdEE+0x5d)[0x7f453bd38c7d]
/local_disk0/tmp/mml-natives5480771853002411874/lib_lightgbm.so(+0x1bc4ed)[0x7f453bd634ed]
/databricks/python/lib/libgomp.so.1(+0x10e25)[0x7f4542e28e25]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f5d123156ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f5d1283641d]

@JoanFM (Contributor) commented Jan 31, 2020

Hello,

I am seeing some similar behavior; I would like to know what the status of this is, @chris-smith-zocdoc?

I am not sure about the cause, but it feels like boosterPtr in mmlspark's LightGBMBooster, and potentially loading the native library more than once in the same JVM, could be a reason.

I am not familiar with how the JVM behaves when the same dynamic library is loaded multiple times; could this create problems?

In any case, can several Boosters be created? Is it guaranteed that all of them will be freed afterwards?

It is worth noting that I am working on macOS.

Thank you very much

@chris-smith-zocdoc (Contributor, Author)

> I would like to know what the status of this is

AFAIK this is still an issue. I've moved on to another project and am not actively working on this.

> I am not sure about the cause, but it feels like boosterPtr in mmlspark's LightGBMBooster, and potentially loading the native library more than once in the same JVM, could be a reason.

I don't believe loading the native library more than once should be an issue. The Runtime.loadLibrary docs say subsequent calls are just ignored:

> If this method is called more than once with the same library name, the second and subsequent calls are ignored.

> In any case, can several Boosters be created?

Several Boosters are being created in the same process, which I think is triggering the bug. We end up with one per task. If your Spark is configured like mine (on Databricks), then there is one worker per CPU core in the same process.

> Is it guaranteed that all of them will be freed afterwards?

They should be getting freed here: https://github.com/Azure/mmlspark/blob/9c61053fa126959c962c0707aa543451ef077574/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L278-L281

But this bug is triggered before that would happen.

@JoanFM (Contributor) commented Jan 31, 2020

> They should be getting freed here: https://github.com/Azure/mmlspark/blob/9c61053fa126959c962c0707aa543451ef077574/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L278-L281

I think this free statement is only called at training time, not once you are using it for prediction or inference.

I also think that the bug might be triggered by different boosterPtr instances created in the same process by different threads. I opened a PR that I think could help solve this, but I am not 100% sure it does: microsoft/SynapseML#792

Thank you very much for the quick reply

@chris-smith-zocdoc (Contributor, Author)

@JoanFM I've only triggered this bug in training, haven't really tried other scenarios. Have you triggered it in prediction or inference?

> different boosterPtr instances created in the same process by different threads

That would definitely be bad, but I haven't seen that in my experiments (it's been a while, so I might not remember clearly). What do you think is causing the current implementation to leak across threads?

@JoanFM (Contributor) commented Jan 31, 2020

Hello @chris-smith-zocdoc, I have triggered it in a test scenario, and not consistently every time.

I am not sure what is causing this.

But it seems that LightGBM leaves the responsibility for freeing these BoosterHandles to the caller of the c_api, and I think that in mmlspark we are potentially requesting it more than once per thread. At that point I think some resources may leak, but why it is crashing is not so clear to me.

@chris-smith-zocdoc (Contributor, Author)

Yes, it does look like the model classes (e.g. LightGBMRegressionModel) never free their native memory. They never call lightgbmlib.LGBM_BoosterFree(booster).

There are two places, I think, where that would need to happen:

  • at the end of the task
  • when your user program releases the reference to it and it is GC'd (maybe we can use some type of finalizer?)

@imatiach-msft (Contributor)

@chris-smith-zocdoc
"never free their native memory"
yes that's something that I've discussed with @eisber before. I'm not sure what the best way to implement that would be. Perhaps a method on the model to "cleanup" or "close" the native model when the user is done, or add it to the finalizer as you suggested.

@JoanFM (Contributor) commented Jan 31, 2020

I think if you add the finalize method, it would no longer be necessary to call lightgbmlib.LGBM_BoosterFree(booster) in TrainUtils? In an RAII style, maybe it can be guaranteed that the resource is obtained and freed whenever it is actually cleared?

UPDATE: I think I am wrong in this message; at training time no Model has been created yet, so TrainUtils can keep the role of freeing that memory.

@chris-smith-zocdoc (Contributor, Author) commented Feb 1, 2020

> yes that's something that I've discussed with @eisber before. I'm not sure what the best way to implement that would be. Perhaps a method on the model to "cleanup" or "close" the native model when the user is done, or add it to the finalizer as you suggested.

The best way I've seen so far is to implement both finalize and dispose, which allows both the user and the GC to clean up the native memory:

// ptr is the native booster handle; LightGBM.free is a stand-in for the
// actual lightgbmlib.LGBM_BoosterFree call.
public void dispose() {
  if (this.ptr != null) {
    LightGBM.free(this.ptr);
    this.ptr = null; // guard against double-free
  }
}

protected void finalize() { dispose(); }

If we are concerned about multiple threads calling dispose concurrently then we'll also need synchronization.

There have been some attempts to make something closer to RAII in Java; try-with-resources might be an option, though I think Spark will get in the way.

There is also https://github.com/ThoughtWorksInc/RAII.scala but I've never used it.

> I think if you add the finalize method, it would no longer be necessary to call lightgbmlib.LGBM_BoosterFree(booster) in TrainUtils?

It's still best practice to free it manually and not rely on the GC for this.

> In an RAII style, maybe it can be guaranteed that the resource is obtained and freed whenever it is actually cleared?

The JVM doesn't have RAII, but I linked some alternatives above.
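
To make that concrete, here is a minimal Scala sketch of the manual-free approach (NativeBooster, its handle field, and freeNative are hypothetical stand-ins for the real SWIG wrapper and the lightgbmlib.LGBM_BoosterFree call):

// Hypothetical wrapper around the native booster pointer; only the shape of
// the pattern matters here, not the names.
final class NativeBooster(private var handle: Long) {
  // Idempotent and synchronized, so repeated or concurrent dispose calls are safe.
  def freeNative(): Unit = synchronized {
    if (handle != 0L) {
      // the real code would call lightgbmlib.LGBM_BoosterFree(...) here
      handle = 0L
    }
  }
}

// Loan pattern: the native memory is released even if the body throws,
// without waiting for the GC or a finalizer.
def withBooster[T](booster: NativeBooster)(body: NativeBooster => T): T =
  try body(booster)
  finally booster.freeNative()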

StrikerRUS mentioned this issue May 11, 2020

@guolinke (Collaborator) commented Aug 6, 2020

@chris-smith-zocdoc @imatiach-msft @JoanFM
Is this issue solved? Refer to microsoft/SynapseML#792

@JoanFM (Contributor) commented Aug 6, 2020

> @chris-smith-zocdoc @imatiach-msft @JoanFM
> Is this issue solved? Refer to Azure/mmlspark#792

Hey @guolinke. I think it might be, but unfortunately I am not able to reproduce the same conditions as when I was experiencing the problem.

@chris-smith-zocdoc (Contributor, Author)

@guolinke that PR addresses freeing the native memory, but I don't think it addresses the concurrency issues in LGBM_BoosterCreate that this issue was created for. I'm not working with LightGBM at the moment, so I can't easily reproduce.

@jameslamb (Collaborator)

It's been 2+ years since the last activity on this issue and it's possible that microsoft/SynapseML#792 fixed the original issue.

I'm going to close this.

@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023