SPARKNLP-828: Raise error when exceeding max input length #13774

@DevinTDHa DevinTDHa commented Apr 27, 2023

Description

This PR adds an exception to the Python API of the transformer-based annotators. Setting the max input length to a value above the model's limit (512, or 4096 for the Longformer) now raises an error. The check already existed on the Scala side but was missing in Python, so the two are now in sync.

This is done by introducing a new mix-in class, HasMaxSentenceLengthLimit:

    class HasMaxSentenceLengthLimit:
        # Default Value, can be overridden
        max_length_limit = 512
        maxSentenceLength = Param(Params._dummy(),
                                  "maxSentenceLength",
                                  "Max sentence length to process",
                                  typeConverter=TypeConverters.toInt)

        def setMaxSentenceLength(self, value):
            """Sets max sentence length to process.

            Note that a maximum limit exists depending on the model. If you are working with long single
            sequences, consider splitting up the input first with another annotator e.g. SentenceDetector.

            Parameters
            ----------
            value : int
                Max sentence length to process
            """
            if value > self.max_length_limit:
                raise ValueError(
                    f"{self.__class__.__name__} models do not support token sequences longer than {self.max_length_limit}.\n"
                    f"Consider splitting up the input first with another annotator e.g. SentenceDetector.")
            return self._set(maxSentenceLength=value)

        def getMaxSentenceLength(self):
            """Gets max sentence of the model.

            Returns
            -------
            int
                Max sentence length to process
            """
            return self.getOrDefault("maxSentenceLength")


    class HasLongMaxSentenceLengthLimit(HasMaxSentenceLengthLimit):
        max_length_limit = 4096

A note regarding this has also been added to the documentation.
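
To show how the check behaves from a caller's point of view, here is a minimal sketch that runs without Spark. `ParamsStub`, `ShortModel`, `LongModel` and `try_set` are hypothetical names used only for this illustration; the stub replaces pyspark's `Params` with a plain dict, so this is not the code merged in this PR.

```python
class ParamsStub:
    """Hypothetical stand-in for pyspark's Params: keeps values in a dict."""

    def __init__(self):
        self._values = {}

    def _set(self, **kwargs):
        self._values.update(kwargs)
        return self  # allow chaining, like the real setters

    def getOrDefault(self, name):
        return self._values[name]


class HasMaxSentenceLengthLimit:
    # Default limit, overridden by subclasses for long-input models
    max_length_limit = 512

    def setMaxSentenceLength(self, value):
        if value > self.max_length_limit:
            raise ValueError(
                f"{self.__class__.__name__} models do not support token "
                f"sequences longer than {self.max_length_limit}.")
        return self._set(maxSentenceLength=value)

    def getMaxSentenceLength(self):
        return self.getOrDefault("maxSentenceLength")


class HasLongMaxSentenceLengthLimit(HasMaxSentenceLengthLimit):
    max_length_limit = 4096


# Hypothetical annotators: one with the default limit, one with the long limit
class ShortModel(ParamsStub, HasMaxSentenceLengthLimit):
    pass


class LongModel(ParamsStub, HasLongMaxSentenceLengthLimit):
    pass


def try_set(model, value):
    """Return 'ok' if the value is accepted, 'rejected' if ValueError is raised."""
    try:
        model.setMaxSentenceLength(value)
        return "ok"
    except ValueError:
        return "rejected"
```

A value equal to the limit is accepted and only values above it are rejected, because the comparison is a strict `>`.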

Motivation and Context

Users have been experiencing issues when trying to translate long texts. This change makes it clear that these models are not intended for very long inputs: users should first split the text into smaller chunks with a Sentence Detector. The Python setter now rejects any value larger than 512 for the maxInputLength parameter, so the problem is reported at configuration time instead of surfacing later as an exception during processing. (The missing check was a bug on the Python side only.)
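
The idea of splitting first can be sketched in plain Python. This is only a rough illustration: `split_into_sentences` and `oversized` are hypothetical helpers, the regex split is a naive substitute for the SentenceDetector annotator, and whitespace-separated words are used as a crude proxy for model tokens.

```python
import re


def split_into_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def oversized(sentences, limit=512):
    """Return the sentences whose whitespace token count exceeds the limit."""
    return [s for s in sentences if len(s.split()) > limit]


# A 600-word run followed by a short sentence
long_text = " ".join(["word"] * 600) + ". Short one."
sentences = split_into_sentences(long_text)
too_long = oversized(sentences, limit=512)
```

Here the first sentence still exceeds the limit after splitting, which is the case where a single long sequence needs further chunking before it reaches the model.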

How Has This Been Tested?

Existing tests have been amended to cover this behaviour. Missing tests were added for some annotators.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

- Python side now also throws an exception if max length exceeds 512
…ators

- Added HasMaxSentenceLengthLimit mix-in to check for valid value for maxSentenceLength
- Appended tests with new test case for this
- Added missing tests for some annotators
@DevinTDHa DevinTDHa changed the title SPARKNLP-828: MarianTransformer - Raise error when exceeding max input length SPARKNLP-828: Raise error when exceeding max input length May 1, 2023
@DevinTDHa DevinTDHa marked this pull request as draft May 2, 2023 08:36
@DevinTDHa DevinTDHa marked this pull request as ready for review May 2, 2023 15:07
@maziyarpanahi maziyarpanahi changed the base branch from master to release/442-release-candidate May 10, 2023 09:45
@maziyarpanahi maziyarpanahi merged commit 7612d98 into JohnSnowLabs:release/442-release-candidate May 10, 2023
Labels
bug-fix DON'T MERGE Do not merge this PR