SPARKNLP-962: UAEEmbeddings #14199

DevinTDHa
Member

@DevinTDHa commented Mar 8, 2024

Description

This PR adds an Annotator for UAE embeddings. For this, new pooling operations for word embeddings have been added.

Namely, pooling by:

  1. Using a token at a specific index (such as the [CLS] token, or the last token)
  2. Max pooling across the sequence dimension
  3. [CLS] + Mean of the embeddings

These can be set with setPoolingStrategy for the annotator.
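
The three pooling operations listed above can be sketched in plain Python as follows. This is only an illustration, not the code added in this PR: the function names are hypothetical, padding masks are ignored, and the "[CLS] + mean" variant is assumed to average the [CLS] vector with the mean over all token vectors.

```python
from typing import List, Sequence

Vector = List[float]
# One row per token, one column per hidden dimension.
TokenEmbeddings = Sequence[Sequence[float]]


def pool_token_at(embeddings: TokenEmbeddings, index: int) -> Vector:
    """Strategy 1: use one token's vector (0 for [CLS], -1 for the last token)."""
    return list(embeddings[index])


def pool_max(embeddings: TokenEmbeddings) -> Vector:
    """Strategy 2: element-wise maximum across the sequence dimension."""
    return [max(column) for column in zip(*embeddings)]


def pool_mean(embeddings: TokenEmbeddings) -> Vector:
    """Helper: element-wise mean across the sequence dimension."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def pool_cls_plus_mean(embeddings: TokenEmbeddings) -> Vector:
    """Strategy 3 (assumed semantics): average of the [CLS] vector and the mean vector."""
    cls_vector = pool_token_at(embeddings, 0)
    mean_vector = pool_mean(embeddings)
    return [(c + m) / 2.0 for c, m in zip(cls_vector, mean_vector)]


# Toy input: 3 tokens, hidden size 2. The first row plays the role of [CLS].
tokens = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
print(pool_token_at(tokens, 0))    # [1.0, 4.0]
print(pool_token_at(tokens, -1))   # [5.0, 2.0]
print(pool_max(tokens))            # [5.0, 4.0]
print(pool_cls_plus_mean(tokens))  # [2.0, 3.0]
```

Each function reduces a (sequence length x hidden size) matrix to a single vector of hidden size, which is what allows token-level outputs to be used as one sentence embedding.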

Additionally, it fixes a bug when serializing ONNX models that do not have a .onnx_data file (b73dc0b). @prabod, I think you worked on this part; could you review whether the fix looks good? I provided a description in the commit message. Thanks!

How Has This Been Tested?

New tests and old tests are passing.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

- added Scala side
- added Python Side
- Added default values
- Serialization tests
- onnxModelPath is not set for models without an .onnx_data file, so it will be None
- None.get will throw an error, this checks for it first
- make tests lazy
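
The guard described in the two onnxModelPath commit notes above can be illustrated with a small Python analogue of the Scala Option check. This is a sketch under assumptions, not the code from commit b73dc0b: the function name, its parameters, and the copy step are hypothetical, and `None` stands in for an unset Scala `Option`.

```python
import os
import shutil
from typing import Optional


def copy_external_data(onnx_data_path: Optional[str], target_dir: str) -> Optional[str]:
    """Copy the optional external-data file into target_dir, if there is one.

    Models without an external-data file have no path set, so the path is
    checked first instead of being dereferenced unconditionally.
    """
    if onnx_data_path is None:
        # Nothing to copy: the model is fully contained in its main file.
        return None
    destination = os.path.join(target_dir, os.path.basename(onnx_data_path))
    shutil.copyfile(onnx_data_path, destination)
    return destination
```

Calling `copy_external_data(None, some_dir)` returns `None` without raising, which mirrors the behaviour the commit notes describe for models saved without an .onnx_data file.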
@DevinTDHa added the bug-fix and new-feature labels Mar 8, 2024
@DevinTDHa self-assigned this Mar 8, 2024
@maziyarpanahi
Member

Hi @DevinTDHa

Regarding the fix in ONNX serialization, is it related to this issue: #14194 (https://colab.research.google.com/drive/119u6hXoT1PRB9F38InuEV-bm4g1uu9UH?usp=sharing)?

@DevinTDHa
Member Author

> Hi @DevinTDHa
>
> Regarding the fix in ONNX serialization, is it related to this issue: #14194 (https://colab.research.google.com/drive/119u6hXoT1PRB9F38InuEV-bm4g1uu9UH?usp=sharing)?

Hi @maziyarpanahi,

Yes, the fix should prevent the error in the notebook as well.

@maziyarpanahi linked an issue Mar 11, 2024 that may be closed by this pull request
@maziyarpanahi changed the base branch from master to release/533-release-candidate April 5, 2024 15:36
@maziyarpanahi merged commit bf6d21e into JohnSnowLabs:release/533-release-candidate Apr 5, 2024
6 checks passed
Development

Successfully merging this pull request may close these issues.

Onnx models fail when saving transformer