You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm trying to use the package on Databricks to get BioBert embeddings for pubmed abstracts but I'm running into a NullPointerException error.
Configuration
Spark: 2.4.4
Spark-NLP: 2.5.1
Scala: 2.11
Steps to reproduce
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
document = DocumentAssembler().setInputCol("abstract").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("tokens")
word_embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased").setInputCols(["document", "tokens"]).setOutputCol("word_embeddings")
sentence_embeddings = SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
finisher = EmbeddingsFinisher() \
.setInputCols(["sentence_embeddings"])
pipeline = Pipeline().setStages([
document,
tokenizer,
word_embeddings,
sentence_embeddings,
finisher
])
spark_input = pubmed_df.select("id", "abstract")
spark_output = pipeline.fit(spark_input).transform(spark_input)
parse = F.udf(lambda x: x[0] if x is not None else None, T.ArrayType(T.FloatType()))
spark_output = spark_output.withColumn('embedding', parse(F.col('finished_sentence_embeddings'))).select("id", "embedding")
When I display the spark_output dataframe, I can see some valid results, but if I try to do .count() or .filter() or save the dataframe as a delta table, then I get an error.
Error
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$dfAnnotate$1: (array<array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>>) => array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ScalaUDF_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.ScalaUDF_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:124)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:122)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1074)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:368)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:424)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2136)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:230)
Caused by: java.lang.NullPointerException
at com.johnsnowlabs.nlp.annotators.tokenizer.wordpiece.BasicTokenizer.isPunctuation(BasicTokenizer.scala:33)
at com.johnsnowlabs.nlp.annotators.tokenizer.wordpiece.BasicTokenizer.tokenize(BasicTokenizer.scala:94)
at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$$anonfun$tokenize$1.apply(BertEmbeddings.scala:234)
at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$$anonfun$tokenize$1.apply(BertEmbeddings.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.tokenize(BertEmbeddings.scala:233)
at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.annotate(BertEmbeddings.scala:251)
at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)
... 19 more
The text was updated successfully, but these errors were encountered:
I'm trying to use the package on Databricks to get BioBert embeddings for pubmed abstracts but I'm running into a NullPointerException error.
Configuration
Spark: 2.4.4
Spark-NLP: 2.5.1
Scala: 2.11
Steps to reproduce
Error
The text was updated successfully, but these errors were encountered: