-
Notifications
You must be signed in to change notification settings - Fork 717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056
improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056
Conversation
Hi @anqini First, let me thank you for your contributions. Highly appreciate it. Could you please provide some specs (EMR version, spark-nlp version, etc.) and some steps to reproduce this? I recently had a Webinar in which I used both AWS Glue (4.0) and AWS EMR (6.x) to showcase Spark NLP (E5 embeddings) by using ONNX models. It would be very helpful if you can show us how to reproduce this error via some steps. Thanks again for your contributions |
Hi Maziyar, Thanks for your response. I created an issue with the bug i observed. But this PR is NOT yet resolving it. This is to update code in zipUtil to make testing OnnxWrapper easier. for example, the zip function failed to create a zip file but there is no exception thrown.
there are also corner cases like if target file exists it doesn't throw the error. Let me know if you want to me create an issue for that too. |
Hi @maziyarpanahi , can we take a step to review and merge this PR and then i can send a new PR to resolve the multi-task node issue? Does that sound good to you? Feel free to let me know if you have other plan. thanks |
Hi @anqini Thanks again for your contribution |
Thanks for your reply. The Java heap space exception is kind of interesting. I did observe that the testing process used a lot of memory on my local desktop. |
Hi @anqini , I wanted to express my sincere gratitude for your recent contribution. Your efforts in enhancing the Considering the critical role of This component is integral to various annotators, including all of those in Onnx and some other in Tensorflow such as While the modifications LGFM, I plan to run tests on a few annotators to ensure seamless functionality before final approval. Your collaboration in testing these changes would be immensely helpful. Once again, thank you for your contribution, and I look forward to your continued support. |
@danilojsl Totally understand regression testing on annotator components is important. Thanks for taking over the those tests for this PR. I'm not familiar with the Colab environment setup so lack of knowledge on how to run the same code with package replaced. But I had an informal way to do the regression with a new jar built from SBT.
I did this in the Scala + EMR and it reflected the latest changes. I'm also looking into pyspark code to understand how it work in pyspark. It will be very helpful to have a simpler way to run regression on notebooks if we don't have it yet. I'll try to provide some complementary end-to-end tests and attach the resolve here later. Let me know if you have any comment. |
Hi @anqini Thanks for the details to run tests in EMR. I tested your changes in Google Colab and all passed ✅. |
37b24b9
into
JohnSnowLabs:release/520-release-candidate
Hi team, I'm Angelo (Anqi) Ni, currently an SDE in the software industry in the US. I'm very interested in Spark NLP and hope to have the chance to contribute. This is a small Pull request to add some testing and refactor a piece of code to getting familiar with the system and understand the contribution process better. Very excited to learn from your team and get any feedback!
Description
Motivation and Context
My original motivation was to investigate whether Spark-nlp is fully compatible with EMR on the ONNX and tensorflow related functionality. Since I ran into the "the SparkSession should only be created on driver" issue caused by OnnxWrapper, I suspected there was a bug in the code. But when I dived deep into the code, i found there was no test for OnnxWrapper class so it was hard to run the OnnxWrapper test on both drive node and worker node.
Before reaching my original goal, I'm happy to add some test first like I did in this Pull request. I'm also hoping to collaborate with your team to help Spark-nlp work better on EMR in a long term. I think it very promising since EMR is probably still one of the most popular places people run spark on.
How Has This Been Tested?
Unit test at local laptop.
ZipArchiveUtil is used in the following places
Unit test has been added to OnnxWrapper. Since TensorflowWrapper and TrainingHelper shared the same logic, we skip them for now.
Screenshots (if appropriate):
Types of changes
Checklist: