SPARKNLP-786 Add support for non-schema NER tags #13642

maziyarpanahi
Member

This PR lets the NerConverter annotator correctly detect NER tags that carry no schema prefix.

Currently, if the NER labels/tags generated by NerDLModel or any XXXForTokenClassification annotator have no schema (IO/IOB/IOB2/etc.), NerConverter fails to extract the full tag into its metadata.

IOB/IOB2 -> I-PER / B-PER
without Schema -> PER

+----------------------------------------------------------------------------------------------+
|result                                                                                        |
+----------------------------------------------------------------------------------------------+
|[PER, PER, O, O, O, LOC, O, O, O, LOC, O, O, O, O, PER, O, O, O, O, LOC]                      |
|[ORG, PER, O, O, O, O, O, O, O]                                                               |
|[ORG, O, ORG, O, O, O, ORG, ORG, O]                                                           |
|[LOC, O]                                                                                      |
|[PER, PER, O, O, O, O, O, O, O, O, O, LOC, O, LOC, O, O, O, O, O, O, O, O, O, O, O, O, O, LOC]|
|[LOC, O, O, O, O, LOC, LOC, O]                                                                |
|[PER, PER, O, O, O, LOC]                                                                      |
|[O, O, O, O, O, MISC, O, O, MISC, O, O, O, O, O, O, O, O, O, O, O, O, O, O]                   |
+----------------------------------------------------------------------------------------------+

+-------------------------------------------+
|result                                     |
+-------------------------------------------+
|[John Lenon, London, Paris, Sarah, London] |
|[Rare Hendrix]                             |
|[EU, German, British lamb]                 |
|[TORONTO]                                  |
|[Barack Obama, Honolulu, Hawaï, États-Unis]|
|[Paris, la France]                         |
|[george washington, washington]            |
|[Camembert, Français]                      |
+-------------------------------------------+
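The two tables above are consistent with a simple rule for schema-less tags: consecutive tokens that share the same non-`O` tag are merged into one chunk. The sketch below illustrates that rule only. The function name `chunks_from_tags` is hypothetical, it is not the NerConverter implementation, and the tokens are an assumption chosen to match the third row.

```python
from itertools import groupby


def chunks_from_tags(tokens, tags, outside="O"):
    """Merge runs of consecutive tokens that share the same non-O tag.

    Hypothetical helper for illustration; not Spark NLP code.
    """
    chunks = []
    for tag, run in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != outside:
            # Join the tokens of this run into a single chunk string.
            chunks.append((" ".join(tok for tok, _ in run), tag))
    return chunks


# Assumed tokens, consistent with the third row of the tables above.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["ORG", "O", "ORG", "O", "O", "O", "ORG", "ORG", "O"]

print(chunks_from_tags(tokens, tags))
# [('EU', 'ORG'), ('German', 'ORG'), ('British lamb', 'ORG')]
```

One consequence of dropping the schema: with no B- prefix to mark where an entity starts, two adjacent entities of the same type cannot be told apart and end up in a single chunk.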

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{entity -> R, sentence -> 0, chunk -> 0, confidence -> 0.9936409}, {entity -> C, sentence -> 0, chunk -> 1, confidence -> 0.9953128}, {entity -> C, sentence -> 0, chunk -> 2, confidence -> 0.99524385}, {entity -> R, sentence -> 0, chunk -> 3, confidence -> 0.790784}, {entity -> C, sentence -> 0, chunk -> 4, confidence -> 0.99527025}]|
|[{entity -> G, sentence -> 0, chunk -> 0, confidence -> 0.6287162}]                                                                                                                                                                                                                                                                             |
|[{entity -> G, sentence -> 0, chunk -> 0, confidence -> 0.9442431}, {entity -> G, sentence -> 0, chunk -> 1, confidence -> 0.71734494}, {entity -> G, sentence -> 0, chunk -> 2, confidence -> 0.8528497}]                                                                                                                                      |
|[{entity -> C, sentence -> 0, chunk -> 0, confidence -> 0.98190045}]                                                                                                                                                                                                                                                                            |
|[{entity -> R, sentence -> 0, chunk -> 0, confidence -> 0.99367046}, {entity -> C, sentence -> 0, chunk -> 1, confidence -> 0.995496}, {entity -> C, sentence -> 0, chunk -> 2, confidence -> 0.9955146}, {entity -> C, sentence -> 0, chunk -> 3, confidence -> 0.9870782}]                                                                    |
|[{entity -> C, sentence -> 0, chunk -> 0, confidence -> 0.99531066}, {entity -> C, sentence -> 0, chunk -> 1, confidence -> 0.8560585}]                                                                                                                                                                                                         |
|[{entity -> R, sentence -> 0, chunk -> 0, confidence -> 0.9946363}, {entity -> C, sentence -> 0, chunk -> 1, confidence -> 0.9790525}]                                                                                                                                                                                                          |
|[{entity -> SC, sentence -> 0, chunk -> 0, confidence -> 0.98350376}, {entity -> SC, sentence -> 0, chunk -> 1, confidence -> 0.9772832}]                                                                                                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

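The issue here is entity values such as G or SC (likewise R and C), which are partial extractions of the schema-less tags ORG and MISC (PER and LOC). The output is consistent with the first two characters of each tag being dropped as if they were a B-/I- prefix. This PR adds support for extracting non-schema NER tags into the metadata.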
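A minimal sketch of the behaviour, assuming the truncation comes from unconditionally stripping a two-character prefix. Both function names are hypothetical and neither is taken from the Spark NLP source; only the B- and I- prefixes mentioned in this description are handled.

```python
def naive_entity(tag: str) -> str:
    """Always drop the first two characters, as if every tag had a B-/I- prefix."""
    return tag[2:]


def entity_of(tag: str) -> str:
    """Drop a B-/I- prefix only when one is actually present."""
    if len(tag) > 2 and tag[1] == "-" and tag[0] in ("B", "I"):
        return tag[2:]
    return tag


tags = ["PER", "LOC", "ORG", "MISC", "B-PER", "I-PER"]

print([naive_entity(t) for t in tags])
# ['R', 'C', 'G', 'SC', 'PER', 'PER']

print([entity_of(t) for t in tags])
# ['PER', 'LOC', 'ORG', 'MISC', 'PER', 'PER']
```

The naive version reproduces the R, C, G and SC values seen in the metadata table, while the prefix-aware version returns the full tag for both schema-less and IOB-style input.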

Signed-off-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
maziyarpanahi added the enhancement and DON'T MERGE labels on Mar 13, 2023
maziyarpanahi self-assigned this on Mar 13, 2023
maziyarpanahi changed the base branch from master to release/432-release-candidate on March 14, 2023, 08:53
maziyarpanahi merged commit 210854e into release/432-release-candidate on Mar 14, 2023
maziyarpanahi added a commit that referenced this pull request May 10, 2023
Signed-off-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>