[Issue#14129] Fix for spark.jsl.settings.storage.cluster_tmp_dir configuration #14132
Conversation
@danilojsl perhaps this part can have logic for different storage layers like S3, DBFS, HDFS, local, etc.?
Could you please have a look and run some tests?
@maziyarpanahi, I have run tests in HDFS, DBFS, S3, and local environments, and the outcomes of these runs are as expected. @danilojsl, please feel free to run more tests in your environments if needed. Also, please approve the three pending CI builds. Thanks.
@maziyarpanahi I also ran several tests and the change is working. Thanks for the contribution @jiamaozheng
@maziyarpanahi, if there are no other concerns, would you please approve, merge, and release this bug fix? Thanks.
Thanks @jiamaozheng |
fixes #14129
Verifications
1. DBFS - AWS Databricks (DBR 9.1 LTS ML)
The Databricks notebook was adapted from spark-nlp-training-and-inference-example.
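For context, a minimal sketch of how the notebook's Spark session might have been configured (hedged: the cluster temp dir value is inferred from the error paths quoted below, and the builder-style configuration is illustrative only, since on Databricks this setting would normally go in the cluster's Spark config):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup for the Databricks notebook.
# The cluster temp dir value is an assumption inferred from the error below;
# the spark-nlp library must also be installed on the cluster (omitted here).
spark = (
    SparkSession.builder
    .config(
        "spark.jsl.settings.storage.cluster_tmp_dir",
        "dbfs:/tmp/spark_nlp/standard",  # assumed DBFS location
    )
    .getOrCreate()
)
```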
Before the fix:
Errors thrown from `glove_embeddings = WordEmbeddingsModel.load("dbfs:/FileStore/pzn_ai/nlp_pretrained_models/glove_100d")`:
java.nio.file.AccessDeniedException: nvirginia-prod/42307XXX92305032/dbfs:/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/: PUT 0-byte object on nvirginia-prod/4230797092305032/dbfs:/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://audix-prod-root.s3-fips.us-east-1.amazonaws.com nvirginia-prod/423079XXX305032/dbfs%3A/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/ {} Hadoop 2.7.4, aws-sdk-java/1.11.678 Linux/5.4.0-1116-aws-fips OpenJDK_64-Bit_Server_VM/25.362-b09 java/1.8.0_362 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: 6T6PP67TRDG77BC3, Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=, Cloud Provider: AWS, Instance ID: i-0855670b7e4a1edf4 (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 6T6PP67TRDG77BC3; S3 Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=), S3 Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=:AccessDenied
After the fix:
Outcome:
The Databricks notebook runs successfully.
Files were created under the configured `spark.jsl.settings.storage.cluster_tmp_dir` location.
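One way to confirm this from the notebook is to list the directory (a hedged sketch, assuming Databricks `dbutils` and the DBFS path inferred above):

```python
# Hypothetical verification on Databricks: list the cluster temp dir contents.
# The path is an assumption based on the error paths quoted earlier.
for f in dbutils.fs.ls("dbfs:/tmp/spark_nlp/standard/"):
    print(f.path, f.size)
```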
Conclusion
The PR resolved #14129
2. S3 - AWS Databricks (Not supported as expected)
The Databricks notebook was adapted from spark-nlp-training-and-inference-example.
The same settings as shown above were used, except that `spark.jsl.settings.storage.cluster_tmp_dir` was set to:
S3:
`s3://audix-prod-1-rs-ephemeral/tmp/personalization_ml/spark_nlp/standard/`
Errors thrown from `ner_model = ner_pipeline.fit(training_data)`:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 19) (10.171.87.166 executor 1): org.apache.spark.SparkException: Failed to fetch s3://audix-prod-1-rs-ephemeral/tmp/personalization_ml/spark_nlp/standard/7ab643ad115f_cdx/EMBEDDINGS_glove_100d during dependency update
DBFS S3 bucket mount:
`dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/standard/`
Errors thrown from `ner_model = ner_pipeline.fit(training_data)`:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 440.0 failed 4 times, most recent failure: Lost task 0.3 in stage 440.0 (TID 36114) (10.171.14.248 executor 286): org.apache.spark.SparkException: Failed to fetch dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/d5e82b9b2bc1_cdx/EMBEDDINGS_glove_100d during dependency update
Conclusion
S3 URIs and DBFS S3 bucket mounts are not supported for the `spark.jsl.settings.storage.cluster_tmp_dir` configuration.
3. Local - Jupyter Notebook (macOS Monterey v12.7.1)
The Jupyter notebooks were adapted from the Databricks notebook.
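For context, a minimal sketch of a local session (hedged: the temp dir value is illustrative only, and the spark-nlp package setup is omitted):

```python
from pyspark.sql import SparkSession

# Hypothetical local session; in local mode the index is not copied to a
# cluster temp dir, so this setting is expected to have no effect.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config(
        "spark.jsl.settings.storage.cluster_tmp_dir",
        "file:///tmp/spark_nlp/standard",  # assumed value, illustration only
    )
    .getOrCreate()
)
```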
Both local Jupyter notebooks with and without the fix run successfully.
Conclusion
The PR doesn't impact `spark.jsl.settings.storage.cluster_tmp_dir` here. As expected, no intermediate files were generated from local runs.
4. HDFS - AWS EMR
AWS EMR cluster creation followed the guidelines from How to create EMR cluster via CLI.
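For context, a minimal sketch of how the EMR run might have been configured (hedged: the cluster temp dir value is inferred from the error paths quoted below, and the spark-nlp package setup is omitted):

```python
from pyspark.sql import SparkSession

# Hypothetical PySpark session on EMR; the cluster temp dir value is an
# assumption inferred from the "Incomplete HDFS URI" error quoted below.
spark = (
    SparkSession.builder
    .config(
        "spark.jsl.settings.storage.cluster_tmp_dir",
        "hdfs:///tmp/sparknlp/standard",  # assumed HDFS location
    )
    .getOrCreate()
)
```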
Before the fix:
"java.io.IOException: Incomplete HDFS URI, no host" Exception thrown from
glove_embeddings = WordEmbeddingsModel.load( 'hdfs:///sparknlp/glove_100d')
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/spark/python/pyspark/ml/util.py", line 332, in load return cls.read().load(path) File "/usr/lib/spark/python/pyspark/ml/util.py", line 282, in load java_obj = self._jread.load(path) File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o108.load. : java.io.IOException: Incomplete HDFS URI, no host: hdfs://ip-172-31-18-38.ec2.internal:8020hdfs:/tmp/sparknlp/standard/05ab51ba5bad_cdx/EMBEDDINGS_glove_100d at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:168) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3364) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3413) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3381) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) at com.johnsnowlabs.storage.StorageHelper$.copyIndexToCluster(StorageHelper.scala:100) at com.johnsnowlabs.storage.StorageHelper$.sendToCluster(StorageHelper.scala:90) at com.johnsnowlabs.storage.StorageHelper$.load(StorageHelper.scala:50) at com.johnsnowlabs.storage.HasStorageModel.$anonfun$deserializeStorage$1(HasStorageModel.scala:43) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage(HasStorageModel.scala:42) at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage$(HasStorageModel.scala:40) at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel.deserializeStorage(WordEmbeddingsModel.scala:147) at com.johnsnowlabs.storage.StorageReadable.readStorage(StorageReadable.scala:34) at com.johnsnowlabs.storage.StorageReadable.readStorage$(StorageReadable.scala:33) at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel$.readStorage(WordEmbeddingsModel.scala:357) at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1(StorageReadable.scala:37) at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1$adapted(StorageReadable.scala:37) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61) at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750)
After the fix:
The PySpark script runs successfully.
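One way to confirm that the index was copied to the configured HDFS temp dir is to list it through Spark's underlying Hadoop FileSystem (a hedged sketch via py4j; the path is the assumed value from the sketch above):

```python
# Hypothetical check of the HDFS cluster temp dir contents via py4j.
hadoop_conf = spark._jsc.hadoopConfiguration()
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
tmp_dir = jvm.org.apache.hadoop.fs.Path("hdfs:///tmp/sparknlp/standard")  # assumed path
for status in fs.listStatus(tmp_dir):
    print(status.getPath().toString())
```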
Conclusion
The PR resolves the `java.io.IOException: Incomplete HDFS URI, no host` issue seen when `spark.jsl.settings.storage.cluster_tmp_dir` points to HDFS, which is very similar to the one observed on DBFS in #14129.