SPARKNLP-667: Fix indexing issue for custom pattern #13283
Conversation
- Fix for patterns with lookaheads/lookbehinds that have zero-width matches, where indexes would not be calculated correctly
- Resolved some warnings
- Refactored tokenizer tests and added a new index alignment check
```scala
try {
  val strs =
    if (applyPattern) target.split($(splitPattern))
    else target.split($(splitChars).mkString("|"))
  strs.map(str =>
    try {
      IndexedToken(
        str,
        text.start + candidate.start + curPos,
        text.start + candidate.start + curPos + str.length - 1)
    } finally {
      curPos += str.length + 1
    })
} finally {
  curPos -= 1
}
```
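The index drift comes from how `String.split` behaves with zero-width patterns. A minimal sketch (plain Scala, independent of the Spark NLP code above) showing that a lookahead delimiter is not consumed, so the fixed `str.length + 1` advance overshoots:

```scala
object ZeroWidthSplitDemo {
  def main(args: Array[String]): Unit = {
    val text = "one-two-three" // length 13

    // A consuming split: each "-" is removed, so between consecutive
    // tokens exactly one character of the input is skipped.
    val consuming = text.split("-")
    println(consuming.mkString(", ")) // one, two, three

    // A zero-width lookahead split: the match has width 0, so no
    // character is consumed and the "-" stays attached to the next token.
    val lookahead = text.split("(?=-)")
    println(lookahead.mkString(", ")) // one, -two, -three

    // For the zero-width case the token lengths sum to the full string
    // length, so advancing an index by str.length + 1 per token overshoots.
    println(consuming.map(_.length).sum) // 11 (13 minus two consumed "-")
    println(lookahead.map(_.length).sum) // 13
  }
}
```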
I don't know why these try-finally blocks were there initially, as they don't catch any exceptions or manage any resources. I removed them, since we would want to be specific about the exceptions (and it makes maintenance easier if a bug occurs here).
@DevinTDHa Did this part contribute to the reported issue? I think one reason they added this try/finally without any catch is that most documents are messy, and they didn't want to interrupt the whole process with a hard exception, given that it is almost impossible to know which document/sentence/token caused the issue (if the user can fix it at all).
If this part does not contribute to the actual issue, I think we can leave it, but add a log that reports any possible problem (so the user can adapt accordingly).
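A hedged sketch of what such logging could look like (hypothetical helper, not the actual Spark NLP code): wrap the user-provided split in a `Try`, log the failure with some context, and fall back to the unsplit token so one bad pattern does not abort the whole pipeline:

```scala
import scala.util.{Failure, Success, Try}

object SafeSplit {
  // Hypothetical helper: applies a user-provided split pattern and
  // logs a warning instead of propagating the exception.
  def safeSplit(target: String, pattern: String): Array[String] =
    Try(target.split(pattern)) match {
      case Success(parts) => parts
      case Failure(e) =>
        System.err.println(
          s"Warning: could not apply pattern '$pattern' to '$target': ${e.getMessage}")
        Array(target) // keep the token unsplit rather than failing the document
    }

  def main(args: Array[String]): Unit = {
    println(safeSplit("a,b,c", ",").mkString("|")) // a|b|c
    println(safeSplit("a,b,c", "[").mkString("|")) // invalid regex: logs, stays unsplit
  }
}
```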
It was part of the issue, due to the way the indexes were calculated (always adding and subtracting 1).
However, I think that having a log line there would be pretty useful. I will add it in a new commit!
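One way to make the offsets robust to zero-width matches (a sketch under assumptions, not necessarily the fix in this PR) is to locate each piece in the original string instead of accumulating a fixed `+ 1` per token:

```scala
object IndexAlignment {
  // Hypothetical sketch: recover (token, begin, end) offsets by scanning
  // the source string, which works for both consuming and zero-width
  // delimiters because nothing about the gap width is assumed.
  def indexTokens(target: String, pattern: String): Seq[(String, Int, Int)] = {
    var cur = 0
    target.split(pattern).filter(_.nonEmpty).toSeq.map { str =>
      val begin = target.indexOf(str, cur) // actual position, no +1 guess
      cur = begin + str.length
      (str, begin, begin + str.length - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    println(indexTokens("one-two", "-"))     // consuming: (one,0,2), (two,4,6)
    println(indexTokens("one-two", "(?=-)")) // zero-width: (one,0,2), (-two,3,6)
  }
}
```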
In case it is needed, I opened the PR here:
#13291
* Removed duplicated method definition (#13280): removed the duplicated definition of method `setWeightedDistPath` from `ContextSpellCheckerApproach`
* SPARKNLP-703 Fix Finisher outputAnnotatorType Issue (#13282)
  * SPARKNLP-703 Adding control to avoid loading outputAnnotatorType attribute when components don't override it
  * SPARKNLP-703 Adding validation when PipelineModel is part of stages
* SPARKNLP-667: Fix indexing issue for custom pattern (#13283)
  * Fix for patterns with lookaheads/lookbehinds that have zero-width matches, where indexes would not be calculated correctly
  * Resolved some warnings
  * Refactored tokenizer tests and added a new index alignment check
* SPARKNLP-708 Enabling embeddings output in LightPipeline.fullAnnotate (#13284)
* Bump version to 4.2.7
* SPARKNLP-667: Added try-catch block for custom pattern/char (#13291): if a user-provided pattern/char cannot be applied, a message will be logged instead of throwing an exception
* Enable dropInvalid in reading photos
* Disable `assemble an image input` unit test: this unit test fails randomly with either a `javax.imageio.IIOException: Unsupported Image Type` or a bad assert of `annotationImage.height`, which suggests something is happening at the OS/file-system level, as it passes on re-try
* SPARKNLP-713 Modifies Default Values GraphExtraction (#13305)
  * SPARKNLP-713 Modifies default values of explodeEntities and mergeEntities
  * SPARKNLP-713 Refactor GraphFinisher Tests
  * SPARKNLP-713 Adding warning message for empty paths
* Fix links for APIs in Open Source (#13312)
* Update 2022-09-27-finassertion_time_en.md
* Update 2022-08-17-finner_orgs_prods_alias_en_3_2.md
* Update 2022-08-17-legner_orgs_prods_alias_en_3_2.md
* Update fin/leg clf models' benchmark (#13276)
* Release note for 4.5.0 including gif (#13301)
* Databricks installation instructions update (#13261)
  * Updated DB installation steps
* Update 2022-09-27-legassertion_time_en.md
* Input output images (#13310)
* [skip test] Fix links for APIs in Open Source
* SPARKNLP-715 Fix sentence index computation (#13318)
* Update CHANGELOG for 4.2.7 [run doc]
* Update Scala and Python APIs
* Release Spark NLP 4.2.7 on Conda [skip test]

Co-authored-by: David Cecchini <dadachini@hotmail.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
Co-authored-by: Jose J. Martinez <36634572+josejuanmartinez@users.noreply.github.com>
Co-authored-by: Bünyamin Polat <78386903+bunyamin-polat@users.noreply.github.com>
Co-authored-by: rpranab <33893292+rpranab@users.noreply.github.com>
Co-authored-by: pranab <pranab@johnsnowlabs.com>
Co-authored-by: Jiri Dobes <44785730+jdobes-cz@users.noreply.github.com>
Co-authored-by: Lev <agsfer@gmail.com>
Co-authored-by: diatrambitas <JSL.Git2018>
Co-authored-by: github-actions <action@github.com>
Description
This PR fixes an issue in the Tokenizer: when a custom pattern is used with lookaheads/lookbehinds that have zero-width matches, indexes would not be calculated correctly. This can lead to further issues downstream in a pipeline.
Other changes:
How Has This Been Tested?
Existing tests are passing (and were extended with an additional index check), and new tests have been added to cover this scenario.
Types of changes