NullPointerException using BertEmbeddings #918

Closed
mbluestone opened this issue Jun 3, 2020 · 2 comments · Fixed by #927
@mbluestone

I'm trying to use the package on Databricks to get BioBERT embeddings for PubMed abstracts, but I'm running into a NullPointerException.

Configuration

Spark: 2.4.4
Spark-NLP: 2.5.1
Scala: 2.11

Steps to reproduce

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Assemble raw abstracts into documents, then tokenize
document = DocumentAssembler().setInputCol("abstract").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("tokens")

# BioBERT word embeddings, pooled into one vector per document
word_embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased").setInputCols(["document", "tokens"]).setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
            .setInputCols(["document", "word_embeddings"]) \
            .setOutputCol("sentence_embeddings") \
            .setPoolingStrategy("AVERAGE")

# EmbeddingsFinisher writes to "finished_<inputCol>" by default
finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"])

pipeline = Pipeline().setStages([
                                  document,
                                  tokenizer,
                                  word_embeddings,
                                  sentence_embeddings,
                                  finisher
                                ])

spark_input = pubmed_df.select("id", "abstract")
spark_output = pipeline.fit(spark_input).transform(spark_input)

# Keep only the first sentence embedding per abstract ("if x" also guards
# against empty arrays, which "if x is not None" would not)
parse = F.udf(lambda x: x[0] if x else None, T.ArrayType(T.FloatType()))
spark_output = spark_output.withColumn('embedding', parse(F.col('finished_sentence_embeddings'))).select("id", "embedding")
When I display the spark_output DataFrame I can see some valid results, but as soon as I call .count() or .filter(), or save the DataFrame as a Delta table, I get the error below.
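
(As a sketch of why the failure only shows up on full actions: display() and show() typically materialize just the first rows, while count(), a filter followed by an action, or a Delta write scan every partition, so any row that trips the tokenizer is eventually reached. The Delta path below is hypothetical.)

# show() pulls only a handful of rows, so a bad row deeper in the data
# may never be touched:
spark_output.show(5)

# These force a full scan, so every row goes through the BERT tokenizer
# and a problematic row surfaces the NPE:
spark_output.count()
spark_output.write.format("delta").save("/tmp/embeddings")  # hypothetical path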

Error

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$dfAnnotate$1: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ScalaUDF_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ScalaUDF_1$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:124)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:122)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1074)
	at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:368)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:424)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2136)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:230)
Caused by: java.lang.NullPointerException
	at com.johnsnowlabs.nlp.annotators.tokenizer.wordpiece.BasicTokenizer.isPunctuation(BasicTokenizer.scala:33)
	at com.johnsnowlabs.nlp.annotators.tokenizer.wordpiece.BasicTokenizer.tokenize(BasicTokenizer.scala:94)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$$anonfun$tokenize$1.apply(BertEmbeddings.scala:234)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$$anonfun$tokenize$1.apply(BertEmbeddings.scala:233)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.tokenize(BertEmbeddings.scala:233)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.annotate(BertEmbeddings.scala:251)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)
	... 19 more
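
The trace bottoms out in BasicTokenizer.isPunctuation, which suggests the tokenizer is reaching a null or effectively empty text value. As a stopgap before a fixed release, a minimal sketch of a defensive filter, assuming null or blank abstracts are the trigger (that diagnosis is an assumption, not confirmed here):

import pyspark.sql.functions as F

# Hypothetical workaround: drop null/blank abstracts before the pipeline runs
spark_input = (pubmed_df
               .select("id", "abstract")
               .where(F.col("abstract").isNotNull())
               .where(F.length(F.trim(F.col("abstract"))) > 0))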
@maziyarpanahi
Member

This issue was fixed in the 2.5.2 release. Please feel free to reopen it if the issue still exists.
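
(For anyone on Databricks, upgrading means swapping both the cluster's Maven artifact and the matching Python package; a sketch for a Scala 2.11 cluster as in the configuration above:)

# Cluster library (Maven coordinate), replacing the 2.5.1 artifact:
#   com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.2
# Matching Python package on the driver:
#   pip install spark-nlp==2.5.2

import sparknlp
print(sparknlp.version())  # should print 2.5.2 after the upgrade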

@mbluestone
Author

This issue is fixed, thanks!
