ONNX models crash when they are used in Colab's T4 GPU runtime #14109

Closed
1 task done
maziyarpanahi opened this issue Dec 25, 2023 · 5 comments

@maziyarpanahi
Member

maziyarpanahi commented Dec 25, 2023

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

@danilojsl

What are you working on?

Downloading and loading ONNX models on GPU devices crashes (at least on a T4 in Colab).

Current Behavior

Crashes with:

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

	at ai.onnxruntime.providers.OrtCUDAProviderOptions.add(Native Method)
	at ai.onnxruntime.providers.OrtCUDAProviderOptions.<init>(OrtCUDAProviderOptions.java:44)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToCUDASessionConfig(OnnxWrapper.scala:152)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToSessionOptionsObject(OnnxWrapper.scala:136)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.com$johnsnowlabs$ml$onnx$OnnxWrapper$$withSafeOnnxModelLoader(OnnxWrapper.scala:90)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:122)
	at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:98)
	at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:75)
	at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readOnnxModel(MPNetEmbeddings.scala:471)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel(MPNetEmbeddings.scala:416)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel$(MPNetEmbeddings.scala:407)
	at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readModel(MPNetEmbeddings.scala:471)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1(MPNetEmbeddings.scala:424)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1$adapted(MPNetEmbeddings.scala:424)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
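
The key part of the trace is the loader message: libonnxruntime_providers_cuda.so could not be loaded because libcublasLt.so.11 was not found. A minimal sketch for checking this directly from a notebook cell is shown below. Only libcublasLt.so.11 comes from the error above; the other library names are assumptions about what a CUDA 11 build typically also links against, and `can_load` is a hypothetical helper, not part of Spark NLP.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) is not found.
        return False


# libcublasLt.so.11 is taken from the error message; the remaining names
# are assumptions and may not match what the provider actually requires.
for name in ["libcublasLt.so.11", "libcublas.so.11", "libcudart.so.11.0", "libcudnn.so.8"]:
    print(f"{name}: {'found' if can_load(name) else 'MISSING'}")
```

If libcublasLt.so.11 is reported as missing, the failure happens at the operating-system loader level, before any model code runs.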

Expected Behavior

It should work as it did before upgrading to a newer version of Spark NLP.

Steps To Reproduce

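!pip install spark-nlp pyspark

import sparknlp
from sparknlp.annotator import MPNetEmbeddings

# The crash only appears with the GPU build (see the follow-up comment below)
spark = sparknlp.start(gpu=True)

embeddings = MPNetEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")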

Spark NLP version and Apache Spark

Spark NLP version 5.2.0
Apache Spark version: 3.5.0

Type of Spark Application

Python Application

Java Version

11

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

@danilojsl
Contributor

danilojsl commented Dec 26, 2023

Hi @maziyarpanahi

I haven't been able to replicate the error. I tried it in Google Colab with a T4, and it works with spark-nlp 5.2.0. Can you take a look at this notebook, try to reproduce the error there, and let me know?
MPNet notebook

@maziyarpanahi
Member Author

Hi @danilojsl

You forgot to load the ONNX GPU build in the start function: spark = sparknlp.start(gpu=True). Once the session is started with the GPU builds of ONNX and TF, the ONNX models fail with that error.

@maziyarpanahi
Member Author

Some extra information: I can use A100 GPUs without any issue, so this must be something with Colab itself. It is either missing a library or has it in a different version (usually an older one; for GPU we usually do something in the Colab setup script to fix those).

@danilojsl Let's find out what's missing and how to fix it; then we can modify the GPU installation for Colab accordingly:
image
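
One way to find out what is missing is to list which versions of the CUDA libraries the system loader knows about and compare them with the `.11` suffix in the error. The sketch below parses the output of `ldconfig -p`; the helper names are hypothetical, and every library stem other than libcublasLt is an assumption about what is worth checking. Libraries that are only reachable through LD_LIBRARY_PATH will not appear in this cache.

```python
import re
import subprocess


def installed_sonames(ldconfig_output: str, stem: str) -> list:
    """Return the sorted sonames for `stem` found in `ldconfig -p` output."""
    pattern = re.compile(
        rf"^[ \t]*({re.escape(stem)}\.so(?:\.\d+)*)[ \t]", re.MULTILINE
    )
    return sorted(set(pattern.findall(ldconfig_output)))


def read_ldconfig_cache() -> str:
    """Return the output of `ldconfig -p`, or '' if it cannot be run."""
    try:
        result = subprocess.run(
            ["ldconfig", "-p"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    cache = read_ldconfig_cache()
    # libcublasLt comes from the error message; the other stems are assumptions.
    for stem in ["libcublasLt", "libcublas", "libcudart", "libcudnn"]:
        found = installed_sonames(cache, stem)
        print(f"{stem}: {', '.join(found) if found else 'not in loader cache'}")
```

If the list shows libcublasLt with a major version other than 11, that would point to a version mismatch rather than a missing CUDA installation.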

maziyarpanahi changed the title from "ONNX crashes on GPU in latest Spark NLP 5.2" to "ONNX crashes on Colab's T4 GPU runtime" Dec 28, 2023
maziyarpanahi changed the title from "ONNX crashes on Colab's T4 GPU runtime" to "ONNX models crashe on Colab's T4 GPU runtime" Dec 28, 2023
maziyarpanahi changed the title from "ONNX models crashe on Colab's T4 GPU runtime" to "ONNX models crash when they are used in Colab's T4 GPU runtime" Dec 28, 2023
@danilojsl
Contributor

danilojsl commented Jul 16, 2024

Hi @maziyarpanahi

This issue no longer occurs with the latest ONNX update in spark-nlp 5.4.0.

@maziyarpanahi
Member Author

Thanks @danilojsl
