SPARKNLP-667: Fix indexing issue for custom pattern #13283
Conversation
- Fix for patterns with lookaheads/lookbehinds that have zero-width matches, where indexes would not be calculated correctly
- Resolved some warnings
- Refactored tokenizer tests and added a new index alignment check
```scala
try {
  val strs =
    if (applyPattern) target.split($(splitPattern))
    else target.split($(splitChars).mkString("|"))
  strs.map(str =>
    try {
      IndexedToken(
        str,
        text.start + candidate.start + curPos,
        text.start + candidate.start + curPos + str.length - 1)
    } finally {
      curPos += str.length + 1
    })
} finally {
  curPos -= 1
}
```
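The index drift comes from how `String.split` behaves with zero-width patterns. A minimal sketch (plain Scala, independent of the Spark NLP code above) showing that a lookahead delimiter is not consumed, so the fixed `str.length + 1` advance overshoots:

```scala
object ZeroWidthSplitDemo {
  def main(args: Array[String]): Unit = {
    val text = "one-two-three" // length 13

    // A consuming split: each "-" is removed, so between consecutive
    // tokens exactly one character of the input is skipped.
    val consuming = text.split("-")
    println(consuming.mkString(", ")) // one, two, three

    // A zero-width lookahead split: the match has width 0, so no
    // character is consumed and the "-" stays attached to the next token.
    val lookahead = text.split("(?=-)")
    println(lookahead.mkString(", ")) // one, -two, -three

    // For the zero-width case the token lengths sum to the full string
    // length, so advancing an index by str.length + 1 per token overshoots.
    println(consuming.map(_.length).sum) // 11 (13 minus two consumed "-")
    println(lookahead.map(_.length).sum) // 13
  }
}
```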
I don't know why these try-finally blocks were there initially, as they don't catch any exceptions or manage any resources. I removed them, since we would want to be specific about the exceptions (and it makes maintenance easier if a bug occurs here).
@DevinTDHa Did this part contribute to the reported issue? I think one reason they added this try/finally without any catch is that most documents are messy, and they didn't want to interrupt the whole process with a hard exception, given that it is almost impossible to know which document/sentence/token caused the issue (if the user can fix it at all).
If this part does not contribute to the actual issue, I think we can leave it, but add a log that reports any possible problem (so the user can adapt accordingly).
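A hedged sketch of what such logging could look like (hypothetical helper, not the actual Spark NLP code): wrap the user-provided split in a `Try`, log the failure with some context, and fall back to the unsplit token so one bad pattern does not abort the whole pipeline:

```scala
import scala.util.{Failure, Success, Try}

object SafeSplit {
  // Hypothetical helper: applies a user-provided split pattern and
  // logs a warning instead of propagating the exception.
  def safeSplit(target: String, pattern: String): Array[String] =
    Try(target.split(pattern)) match {
      case Success(parts) => parts
      case Failure(e) =>
        System.err.println(
          s"Warning: could not apply pattern '$pattern' to '$target': ${e.getMessage}")
        Array(target) // keep the token unsplit rather than failing the document
    }

  def main(args: Array[String]): Unit = {
    println(safeSplit("a,b,c", ",").mkString("|")) // a|b|c
    println(safeSplit("a,b,c", "[").mkString("|")) // invalid regex: logs, stays unsplit
  }
}
```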
It was part of the issue, due to the way the indexes were calculated (always adding and subtracting 1).
However, I think that having a log line there would be pretty useful. I will add it in a new commit!
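One way to make the offsets robust to zero-width matches (a sketch under assumptions, not necessarily the fix in this PR) is to locate each piece in the original string instead of accumulating a fixed `+ 1` per token:

```scala
object IndexAlignment {
  // Hypothetical sketch: recover (token, begin, end) offsets by scanning
  // the source string, which works for both consuming and zero-width
  // delimiters because nothing about the gap width is assumed.
  def indexTokens(target: String, pattern: String): Seq[(String, Int, Int)] = {
    var cur = 0
    target.split(pattern).filter(_.nonEmpty).toSeq.map { str =>
      val begin = target.indexOf(str, cur) // actual position, no +1 guess
      cur = begin + str.length
      (str, begin, begin + str.length - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    println(indexTokens("one-two", "-"))     // consuming: (one,0,2), (two,4,6)
    println(indexTokens("one-two", "(?=-)")) // zero-width: (one,0,2), (-two,3,6)
  }
}
```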
In case it is needed, I opened the PR here:
#13291
* Removed duplicated method definition (#13280): removed the duplicated definition of method `setWeightedDistPath` from `ContextSpellCheckerApproach`
* SPARKNLP-703 Fix Finisher outputAnnotatorType Issue (#13282)
  * SPARKNLP-703 Adding control to avoid loading outputAnnotatorType attribute when components don't override it
  * SPARKNLP-703 Adding validation when PipelineModel is part of stages
* SPARKNLP-667: Fix indexing issue for custom pattern (#13283)
  * Fix for patterns with lookaheads/lookbehinds that have zero-width matches, where indexes would not be calculated correctly
  * Resolved some warnings
  * Refactored tokenizer tests and added a new index alignment check
* SPARKNLP-708 Enabling embeddings output in LightPipeline.fullAnnotate (#13284)
* Bump version to 4.2.7
* SPARKNLP-667: Added try-catch block for custom pattern/char (#13291): if a user-provided pattern/char cannot be applied, a message will be logged instead of throwing an exception
* Enable dropInvalid in reading photos
* Disable `assemble an image input` unit test: this unit test fails randomly with either a `javax.imageio.IIOException: Unsupported Image Type` or a bad assert of `annotationImage.height`, which suggests something is happening at the OS/file-system level, as it passes on re-try
* SPARKNLP-713 Modifies Default Values GraphExtraction (#13305)
  * SPARKNLP-713 Modifies default values of explodeEntities and mergeEntities
  * SPARKNLP-713 Refactor GraphFinisher Tests
  * SPARKNLP-713 Adding warning message for empty paths
* Fix links for APIs in Open Source (#13312)
* Update 2022-09-27-finassertion_time_en.md
* Update 2022-08-17-finner_orgs_prods_alias_en_3_2.md
* Update 2022-08-17-legner_orgs_prods_alias_en_3_2.md
* Update fin/leg clf models' benchmark (#13276)
* Release note for 4.5.0 including gif (#13301)
* Databricks installation instructions update (#13261)
  * Updated DB installation steps
* Update 2022-09-27-legassertion_time_en.md
* Input output images (#13310)
* [skip test] Fix links for APIs in Open Source
* SPARKNLP-715 Fix sentence index computation (#13318)
* Update CHANGELOG for 4.2.7 [run doc]
* Update Scala and Python APIs
* Release Spark NLP 4.2.7 on Conda [skip test]

Co-authored-by: David Cecchini <dadachini@hotmail.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
Co-authored-by: Jose J. Martinez <36634572+josejuanmartinez@users.noreply.github.com>
Co-authored-by: Bünyamin Polat <78386903+bunyamin-polat@users.noreply.github.com>
Co-authored-by: rpranab <33893292+rpranab@users.noreply.github.com>
Co-authored-by: pranab <pranab@johnsnowlabs.com>
Co-authored-by: Jiri Dobes <44785730+jdobes-cz@users.noreply.github.com>
Co-authored-by: Lev <agsfer@gmail.com>
Co-authored-by: diatrambitas <JSL.Git2018>
Co-authored-by: github-actions <action@github.com>
Description
This PR fixes an issue in the Tokenizer: when a custom pattern is used with lookaheads/lookbehinds that have zero-width matches, indexes would not be calculated correctly. This can lead to further issues downstream in a pipeline.
Other changes:
How Has This Been Tested?
Existing tests are passing (and were extended with an additional index check), and new tests have been added to cover this scenario.
Types of changes