Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARKNLP-667: Fix indexing issue for custom pattern #13283

Conversation

DevinTDHa
Copy link
Member

Description

This PR fixes an issue for the Tokenizer, when a custom pattern is used with lookahead/-behinds that have 0 width matches, indexes would not be calculated correctly. This can lead to further issues down a pipeline.

Other changes:

  • resolved some warnings
  • refactored tokenizer tests and added new index alignment check

How Has This Been Tested?

Existing tests are passing (and appended with an additional index check), new tests have been added to cover this scenario

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

- fix for patterns with lookahead/behinds that have 0 width matches, indexes would not be calculated correctly
- resolved some warnings
- refactored tokenizer tests and added new index alignment check
Comment on lines -352 to -366
try {
val strs =
if (applyPattern) target.split($(splitPattern))
else target.split($(splitChars).mkString("|"))
strs.map(str =>
try {
IndexedToken(
str,
text.start + candidate.start + curPos,
text.start + candidate.start + curPos + str.length - 1)
} finally {
curPos += str.length + 1
})
} finally {
curPos -= 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why initially there were these try-finally blocks without catching any exceptions or handling any resources. I removed them as we would want to be specific with the exceptions (and easier maintenance if a bug would occur here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DevinTDHa Did this part contribute to this reported issue? I think one reason they added this try/finally without any catch was that most documents are messy and they didn't want to interrupt the whole process by a hard exception given it is almost impossible to know which document/sentence/token was the issue (if it can be fixed by the user at all)

If this part is not contributing to the actual issue I think we can leave it, but have a log that reports any possible issue. (user may want to adapt accordingly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maziyarpanahi

It was part of the issue, due to the way the indexes were calculated (always adding and subtracting 1).

However, I think that having a log line there would be pretty useful. I will add it in a new commit!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case it is needed I openen the PR here:
#13291

@maziyarpanahi maziyarpanahi changed the base branch from master to release/427-release-candidate December 29, 2022 19:17
Comment on lines -352 to -366
try {
val strs =
if (applyPattern) target.split($(splitPattern))
else target.split($(splitChars).mkString("|"))
strs.map(str =>
try {
IndexedToken(
str,
text.start + candidate.start + curPos,
text.start + candidate.start + curPos + str.length - 1)
} finally {
curPos += str.length + 1
})
} finally {
curPos -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DevinTDHa Did this part contribute to this reported issue? I think one reason they added this try/finally without any catch was that most documents are messy and they didn't want to interrupt the whole process by a hard exception given it is almost impossible to know which document/sentence/token was the issue (if it can be fixed by the user at all)

If this part is not contributing to the actual issue I think we can leave it, but have a log that reports any possible issue. (user may want to adapt accordingly).

@maziyarpanahi maziyarpanahi merged commit 943dcc7 into JohnSnowLabs:release/427-release-candidate Dec 30, 2022
maziyarpanahi added a commit that referenced this pull request Jan 12, 2023
* Removed duplicated method definition (#13280)

Removed the duplicated definition of method `setWeightedDistPath` from `ContextSpellCheckerApproach`.

* SPARKNLP-703 Fix Finisher outputAnnotatorType Issue (#13282)

* SPARKNLP-703 adding control to avoid loading outputAnnotatorType attribute when components don't override it

* SPARKNLP-703 Adding validation when PipelineModel is part of stages

* SPARKNLP-667: Fix indexing issue for custom pattern (#13283)

- fix for patterns with lookahead/behinds that have 0 width matches, indexes would not be calculated correctly
- resolved some warnings
- refactored tokenizer tests and added new index alignment check

* SPARKNLP-708 Enabling embeddings output in LightPipeline.fullAnnotate (#13284)

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* Bump version to 4.2.7

* SPARKNLP-667: Added try-catch block for custom pattern/char (#13291)

- if a user provided pattern/char can not be applied, a
  message will be logged instead of throwing an exception

* Enable dropInvalid in reading photos

* disable `assemble an image input` unit test

- this unit test fails randomly for either a `javax.imageio.IIOException: Unsupported Image Type` or bad assert of `annotationImage.height`. Which suggests something is happening on the OS/file system level as if you re-try it will pass

* SPARKNLP-713 Modifies Default Values GraphExtraction (#13305)

* SPARKNLP-713 Modifies default values of explodeEntities and mergeEntities

* SPARKNLP-713 Refactor GraphFinisher Tests

* SPARKNLP-713 Adding warning message for empty paths

* Fix links for APIs in Open Source (#13312)

* Update 2022-09-27-finassertion_time_en.md

* Update 2022-08-17-finner_orgs_prods_alias_en_3_2.md

* Update 2022-08-17-legner_orgs_prods_alias_en_3_2.md

* Update fin/leg clf models' benchmark (#13276)

* relese note for 4.5.0 including gif (#13301)

Co-authored-by: pranab <pranab@johnsnowlabs.com>
Co-authored-by: diatrambitas <JSL.Git2018>

* Databricks installation instructions update. (#13261)

* Databricks installation instructions update.

* updated DB installation steps

Co-authored-by: diatrambitas <JSL.Git2018>

* Update 2022-09-27-legassertion_time_en.md

* Input output images (#13310)

* [skip test] Fix links for APIs in Open Source

Co-authored-by: Jose J. Martinez <36634572+josejuanmartinez@users.noreply.github.com>
Co-authored-by: Bünyamin Polat <78386903+bunyamin-polat@users.noreply.github.com>
Co-authored-by: rpranab <33893292+rpranab@users.noreply.github.com>
Co-authored-by: pranab <pranab@johnsnowlabs.com>
Co-authored-by: Jiri Dobes <44785730+jdobes-cz@users.noreply.github.com>
Co-authored-by: Lev <agsfer@gmail.com>

* SPARKNLP-715 Fix sentence index computation (#13318)

* Update CHANGELOG for 4.2.7 [run doc]

* Update Scala and Python APIs

* Release Spark NLP 4.2.7 on Conda [skip test]

Co-authored-by: David Cecchini <dadachini@hotmail.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Jose J. Martinez <36634572+josejuanmartinez@users.noreply.github.com>
Co-authored-by: Bünyamin Polat <78386903+bunyamin-polat@users.noreply.github.com>
Co-authored-by: rpranab <33893292+rpranab@users.noreply.github.com>
Co-authored-by: pranab <pranab@johnsnowlabs.com>
Co-authored-by: Jiri Dobes <44785730+jdobes-cz@users.noreply.github.com>
Co-authored-by: Lev <agsfer@gmail.com>
Co-authored-by: github-actions <action@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-fix DON'T MERGE Do not merge this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants