[SPARKNLP-1037] Adding addFile changes to replace broadcast in all ONNX based annotators #14236
Description
This PR introduces enhancements to the Spark NLP library, focusing on the efficient distribution of ONNX model files across Spark executors. By replacing broadcast variables with Spark's built-in file distribution mechanism, this update aims to improve the performance and scalability of LLMs within distributed cloud environments.
Motivation and Context
The primary motivation behind this update is to address the challenges associated with deploying and scaling LLMs in cloud-based Spark environments. By utilizing Spark's native support for distributing files across executors, we can significantly enhance the scalability and efficiency of LLM annotators. This is particularly crucial for models like Llama-2 and M2M100, which require access to large ONNX files to function correctly.
This improvement ensures that ONNX models are effectively shared across all nodes in a Spark cluster, reducing the overhead associated with model loading and facilitating faster, more scalable annotations. As a result, users can expect improved performance and a smoother experience when processing large datasets or working in resource-intensive cloud environments.
The integration of these changes represents a significant step forward in our ongoing efforts to optimize Spark NLP for LLM processing, reinforcing our commitment to providing robust, scalable NLP solutions for the cloud.
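The distribution pattern described above can be sketched as follows. This is a minimal, hedged illustration of Spark's `addFile`/`SparkFiles` mechanism, not the actual annotator code in this PR; the paths, the app name, and the commented-out ONNX Runtime call are illustrative assumptions.

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object OnnxAddFileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("onnx-addfile-sketch") // illustrative app name
      .getOrCreate()

    // Driver side: register the model file once. Spark copies it to the
    // working directory of every executor, instead of broadcasting the
    // model bytes through the driver's memory.
    spark.sparkContext.addFile("/models/llama2/model.onnx") // hypothetical path

    // Executor side: resolve the local copy by file name and load it
    // lazily inside a task, e.g. once per partition.
    val localPaths = spark.range(0, 4).rdd.mapPartitions { rows =>
      val localPath = SparkFiles.get("model.onnx") // local path on this executor
      // val session = ortEnvironment.createSession(localPath) // hypothetical
      //   ONNX Runtime session creation using the executor-local file
      rows.map(_ => localPath)
    }
    localPaths.collect().foreach(println)
    spark.stop()
  }
}
```

Because each executor reads the model from its local copy, the driver no longer has to hold and serialize the full model as a broadcast variable, which is where the memory and scalability gains for large ONNX files come from.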
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: