Spark NLP Configuration's spark.jsl.settings.storage.cluster_tmp_dir: Databricks DBFS location does not work
Current Behavior
The Databricks DBFS location configured in Spark NLP Configuration's spark.jsl.settings.storage.cluster_tmp_dir is not recognized. Instead, temporary files resolve to an incorrect location carrying the prefix nvirginia-prod/423079709230XXXX/, such as nvirginia-prod/423079709230XXXX/dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/6ade498ff4bf_cdx/EMBEDDINGS_glove_100d/.
Expected Behavior
Documentation - The location to use on a cluster for temporary files, such as unpacking indexes for WordEmbeddings. By default, this location is the value of hadoop.tmp.dir set via Hadoop configuration for Apache Spark. NOTE: S3 is not supported; the location must be local, HDFS, or DBFS.
We expected temporary files to be written to the configured Databricks DBFS path (dbfs:/PATH_TO_STORAGE).
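For context, this is roughly how the setting is supplied from a PySpark session; on Databricks it would normally go into the cluster's Spark config instead, and the DBFS path below is illustrative, not the one from this report:

```python
from pyspark.sql import SparkSession

# Illustrative writable DBFS location (hypothetical path).
tmp_dir = "dbfs:/mnt/my-bucket/tmp/spark_nlp"

spark = (
    SparkSession.builder
    .appName("spark-nlp-cluster-tmp-dir")
    .config("spark.jsl.settings.storage.cluster_tmp_dir", tmp_dir)
    .getOrCreate()
)
```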
Stack Trace -
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-1042743005935531> in <module>
29
30 glove = (
---> 31 WordEmbeddingsModel.load(model_path)
32 .setInputCols(["document", "clean_normal"])
33 .setOutputCol("embeddings")
/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
461 def load(cls, path):
462 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 463 return cls.read().load(path)
464
465
/databricks/spark/python/pyspark/ml/util.py in load(self, path)
411 if not isinstance(path, str):
412 raise TypeError("path should be a string, got type %s" % type(path))
--> 413 java_obj = self._jread.load(path)
414 if not hasattr(self._clazz, "_from_java"):
415 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o417.load.
: java.nio.file.AccessDeniedException: nvirginia-prod/423079709230XXXX/dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/6ade498ff4bf_cdx/EMBEDDINGS_glove_100d/: PUT 0-byte object on nvirginia-prod/423079709230XXXX/dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/6ade498ff4bf_cdx/EMBEDDINGS_glove_100d/: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://audix-prod-root.s3-fips.us-east-1.amazonaws.com nvirginia-prod/423079709230XXXX/dbfs%3A/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/6ade498ff4bf_cdx/EMBEDDINGS_glove_100d/ {} Hadoop 2.7.4, aws-sdk-java/1.11.678 Linux/5.4.0-1116-aws-fips OpenJDK_64-Bit_Server_VM/25.362-b09 java/1.8.0_362 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: 6H4JNNPF9CYXGZDC, Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=, Cloud Provider: AWS, Instance ID: i-00f9a7585d7c77bc9 (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 6H4JNNPF9CYXGZDC; S3 Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=), S3 Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=:AccessDenied
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:248)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:120)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:274)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:333)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:270)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:245)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.createEmptyObject(S3AFileSystem.java:3881)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.createFakeDirectory(S3AFileSystem.java:3853)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerMkdirs(S3AFileSystem.java:3155)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:3088)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$3(DatabricksFileSystemV2.scala:820)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.s3a.S3AExceptionUtils$.convertAWSExceptionToJavaIOException(DatabricksStreamUtils.scala:66)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$2(DatabricksFileSystemV2.scala:818)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$withUserContextRecorded$2(DatabricksFileSystemV2.scala:1013)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:510)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withUserContextRecorded(DatabricksFileSystemV2.scala:986)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$1(DatabricksFileSystemV2.scala:817)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:484)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:504)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:479)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:404)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperationWithResultTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:367)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:510)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.mkdirs(DatabricksFileSystemV2.scala:817)
at com.databricks.backend.daemon.data.client.DatabricksFileSystem.mkdirs(DatabricksFileSystem.scala:198)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:351)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
at com.johnsnowlabs.storage.StorageHelper$.copyIndexToCluster(StorageHelper.scala:104)
at com.johnsnowlabs.storage.StorageHelper$.sendToCluster(StorageHelper.scala:91)
at com.johnsnowlabs.storage.StorageHelper$.load(StorageHelper.scala:50)
at com.johnsnowlabs.storage.HasStorageModel.$anonfun$deserializeStorage$1(HasStorageModel.scala:43)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage(HasStorageModel.scala:42)
at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage$(HasStorageModel.scala:40)
at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel.deserializeStorage(WordEmbeddingsModel.scala:146)
at com.johnsnowlabs.storage.StorageReadable.readStorage(StorageReadable.scala:34)
at com.johnsnowlabs.storage.StorageReadable.readStorage$(StorageReadable.scala:33)
at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel$.readStorage(WordEmbeddingsModel.scala:303)
at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1(StorageReadable.scala:37)
at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1$adapted(StorageReadable.scala:37)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://audix-prod-root.s3-fips.us-east-1.amazonaws.com nvirginia-prod/423079709230XXXX/dbfs%3A/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/6ade498ff4bf_cdx/EMBEDDINGS_glove_100d/ {} Hadoop 2.7.4, aws-sdk-java/1.11.678 Linux/5.4.0-1116-aws-fips OpenJDK_64-Bit_Server_VM/25.362-b09 java/1.8.0_362 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: 6H4JNNPF9CYXGZDC, Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=, Cloud Provider: AWS, Instance ID: i-00f9a7585d7c77bc9 (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 6H4JNNPF9CYXGZDC; S3 Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=), S3 Extended Request ID: pRZVB79TYuCzJPwyxHZKcAzSWeK8KxDKyla8U1/0qUhDrXjHeVB1rzhuJqmyeqMWZyuxlUs5h14=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4926)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4872)
at com.amazonaws.services.s3.AmazonS3Client.access$300(AmazonS3Client.java:390)
at com.amazonaws.services.s3.AmazonS3Client$PutObjectStrategy.invokeServiceCall(AmazonS3Client.java:5806)
at com.amazonaws.services.s3.AmazonS3Client.uploadObject(AmazonS3Client.java:1794)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1754)
at shaded.databricks.org.apache.hadoop.fs.s3a.EnforcingDatabricksS3Client.putObject(EnforcingDatabricksS3Client.scala:69)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.putObjectDirect(S3AFileSystem.java:2064)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$createEmptyObject$15(S3AFileSystem.java:3883)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:118)
... 82 more
Is there an existing issue for this?
Who can help?
Please help
What are you working on?
Databricks 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
Steps To Reproduce
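A minimal reproduction, sketched from the traceback above (the DBFS model path is hypothetical; the input columns and the glove_100d model come from the trace):

```python
from sparknlp.annotator import WordEmbeddingsModel

# Hypothetical path to a saved glove_100d model on DBFS.
model_path = "dbfs:/mnt/models/glove_100d"

# Loading the model triggers StorageHelper.sendToCluster, which copies the
# embeddings index to cluster_tmp_dir -- the point where the mangled
# "nvirginia-prod/.../dbfs:/..." destination appears in the trace.
glove = (
    WordEmbeddingsModel.load(model_path)
    .setInputCols(["document", "clean_normal"])
    .setOutputCol("embeddings")
)
```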
Spark NLP version and Apache Spark
Databricks 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
spark-nlp==5.2.2
Type of Spark Application
Python Application
Java Version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-linux64) (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-linux64) (build 25.362-b09, mixed mode)
Java Home Directory
/usr/lib/jvm/zulu8-ca-amd64/jre/
Setup and installation
spark-nlp==5.2.2
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
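For reference, outside Databricks the matching Scala artifact would be pulled in roughly like this (on Databricks the Maven coordinate and the Python wheel are normally attached through the cluster's Libraries UI instead):

```python
from pyspark.sql import SparkSession

# Pull the Scala artifact that matches the spark-nlp==5.2.2 Python package.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2")
    .getOrCreate()
)
```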
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
Stack Trace - the full traceback is included above.
@maziyarpanahi, thanks for your prompt response; I have updated to the latest release. It seems the root cause of the issue is the extra file system URL prefix, which has been tested/verified via a Databricks notebook. I will submit this PR for the fix. Please let me know if you have any thoughts. Thanks.
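To illustrate the suspected mangling: if the scheme on the configured URI is ignored, the value gets resolved as a relative path under the default filesystem's working directory, which yields exactly the doubled prefix seen in the trace. A minimal sketch, assuming that failure mode (the resolve helper is hypothetical, not Spark NLP's actual code):

```python
from urllib.parse import urlparse

tmp_dir = "dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp"
root = "nvirginia-prod/423079709230XXXX"  # default FS working directory on S3

# Naive join: the URI is treated as a relative path, reproducing the
# broken destination from the stack trace (root + "/" + "dbfs:/...").
broken = f"{root}/{tmp_dir}/"

def resolve(configured: str, default_root: str) -> str:
    # Hypothetical fix: keep an already-absolute URI untouched and only
    # join scheme-less paths onto the default root.
    return configured if urlparse(configured).scheme else f"{default_root}/{configured}"

assert resolve(tmp_dir, root) == tmp_dir  # absolute dbfs: URI stays intact
```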