
TypedDependencyParser returning <no-type> as dep type #2775

Closed
albertoandreottiATgmail opened this issue Apr 19, 2021 · 6 comments · Fixed by #13648


albertoandreottiATgmail commented Apr 19, 2021

Description

TypedDependencyParser is apparently not producing correct outputs, according to experiments in this notebook:

https://colab.research.google.com/drive/1PF8PQfvH1qMmk630rQZST4SJx_EtGGAC?usp=sharing#scrollTo=RysvWpG7hUdk

What I've found out so far:
a) This is not a serialization issue.
b) This is not only happening in 3.0.x.
c) The original code can be found here: https://github.com/shentianxiao/RBGParser/tree/labeling
d) The algorithm uses an internal structure that is sparsely filled, so most likely the training was not enough to cover all cases.

Next action: check whether more training improves the situation.

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

Context

Your Environment

  • Spark NLP version sparknlp.version():
  • Apache NLP version spark.version:
  • Java version java -version:
  • Setup and installation (Pypi, Conda, Maven, etc.):
  • Operating System and version:
  • Link to your project (if any):

albertoandreottiATgmail commented Apr 20, 2021

More info, @danilojsl, @maziyarpanahi, @vkocaman:
a) The problem is not related to different tokenization in the CoNLL training data compared to our tokenizer.
b) The problem is not related to a different POS input to the parser.
c) The problem is not related to OOV, i.e., words never seen during training.
d) Narrowing things down as much as possible, we get:
"He reports that he feels well and denies any problems, or pain." --> fails!
"He reports that he feels well and denies any problems." --> succeeds!

I will check whether the problem is different parses coming out of DependencyParserModel.
Ideas?

@albertoandreottiATgmail

Some additional updates here. I suspect the punctuation is the problem:

"he denies problems or pain" -> works

"he denies problems, or pain" -> fails

This is probably a mismatch in the encoding of the labels between the test and training datasets.
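If that's the hypothesis, the failure pattern fits: one unseen punctuation key poisons label lookup for the entire sentence. A hypothetical Python sketch of this failure mode (the dictionary contents and the all-or-nothing fallback are assumptions for illustration, not Spark NLP's actual internals):

```python
# Hypothetical POS-key dictionary that was trained without a comma entry.
pos_index = {"cpos=PRP": 1, "cpos=VBZ": 2, "cpos=NNS": 3, "cpos=CC": 4, "cpos=NN": 5}

def lookup(tags):
    """Map POS tags to dictionary ids; return None if any tag is unknown,
    mimicking a sentence-level failure that yields <no-type> for every token."""
    ids = [pos_index.get(f"cpos={t}", -1) for t in tags]
    return None if -1 in ids else ids

# "he denies problems or pain" -- all tags known, lookup succeeds
print(lookup(["PRP", "VBZ", "NNS", "CC", "NN"]))        # [1, 2, 3, 4, 5]
# "he denies problems, or pain" -- the comma tag is missing, lookup fails
print(lookup(["PRP", "VBZ", "NNS", ",", "CC", "NN"]))   # None
```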

@albertoandreottiATgmail

More progress on this one. There seems to be a mismatch between the contents of the map that the model uses to represent POS tags and lemmas,

TypedDependencyParserModel.dependencyPipe.getDictionariesSet.getDictionaries

between a model that has just been trained and a model that has been loaded from disk.
So it seems the serialization is dropping some content from those dictionaries.
I haven't looked at the serialization process in detail, but it seems the dictionaries are converted to Strings with a comma as a separator,

{cpos=DT=41,feat=Degree=Pos=55,cpos=CD=31,cpos=''=47,#TO........

so there may be a collision between the actual values and the separator. It was very suspicious that the map was missing the entry cpos=,=46, among some others.
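A minimal Python sketch of this suspected failure mode (an illustration of the separator collision, not the actual Scala serialization code): when entries whose keys contain a comma are joined with commas, the round trip cannot split them back reliably.

```python
def serialize(d):
    # naive scheme resembling the dump above: "key=value" pairs joined by ","
    return ",".join(f"{k}={v}" for k, v in d.items())

def deserialize(s):
    out = {}
    for pair in s.split(","):
        key, _, value = pair.rpartition("=")
        if key and value.isdigit():
            out[key] = int(value)
    return out

trained = {"cpos=DT": 41, "cpos=,": 46, "cpos=CD": 31}
loaded = deserialize(serialize(trained))

# "cpos=," survives serialization as "cpos=,=46", but its comma is then
# mistaken for the pair separator, so the entry never comes back.
print(loaded)  # {'cpos=DT': 41, 'cpos=CD': 31} -- the comma entry is gone
```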
I tried adding back some of the missing content during deserialization:

  private def deserializeDictionaries(dictionariesValues: List[(TObjectIntHashMap[_], Int, Boolean)]): DictionarySet = {

    val dictionarySet = getDictionarySetInstance

    dictionariesValues.zipWithIndex.foreach { case (dictionaryValue, index) =>
      // TODO this is not a fix! - values taken from training
      if (index == 0)
        dictionaryValue._1.asInstanceOf[TObjectIntHashMap[String]].put("cpos=,", 46)

      if (index == 1) {
        dictionaryValue._1.asInstanceOf[TObjectIntHashMap[String]].put("form=,", 39)
        dictionaryValue._1.asInstanceOf[TObjectIntHashMap[String]].put("lemma=,", 58)
      }
      // ... rest of the original deserialization logic elided ...
    }

    dictionarySet
  }

But that was not enough; the problem persisted.
What to do next?

  • Try the problematic sentence on a model that hasn't gone through the serialization process. This means training a new model and trying the problematic sentence on it, without using pretrained(). If the error is gone, it means we have more serialization problems.
  • In general, the model has a lot of commented code, TODOs, and things that are disabled. I would work on enabling more of these things.
  • Investigate discrepancies in the sentence representation between training and inference.
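If the comma separator is indeed the culprit, one possible direction (a sketch of the idea, not a drop-in fix for the Scala code) is to serialize the dictionaries with a format that escapes special characters, e.g. JSON, so keys containing "," or "=" round-trip intact:

```python
import json

# Dictionary entries whose keys contain the old separator characters.
trained = {"cpos=DT": 41, "cpos=,": 46, "form=,": 39, "lemma=,": 58}

# JSON escapes the keys, so nothing is dropped or mis-split on reload.
serialized = json.dumps(trained)
loaded = json.loads(serialized)

print(loaded["cpos=,"])  # 46 -- the punctuation entry survives
```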

@github-actions

This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

@luca-martial

@sillystring13, thanks for reporting. I'm reopening the issue - please describe how you've been able to replicate it (library version, code, issue description).

@luca-martial luca-martial reopened this Feb 21, 2023
@github-actions github-actions bot removed the Stale label Feb 22, 2023
@w2o-hbrashear

Issue: DependencyParserModel.pretrained('dependency_conllu') returns dependency_type=<no-type> on some input with punctuation

Steps to Reproduce

I used the code from the display notebook: https://github.com/JohnSnowLabs/spark-nlp-display/blob/main/tutorials/Spark_NLP_Display.ipynb
You can reproduce this by changing the code in cell 5:

  • text = "he denies problems or pain" works
    {'dependency': [Annotation(dependency, 0, 1, denies, {'head': '2', 'head.begin': '3', 'head.end': '8', 'sentence': '0'}, []), Annotation(dependency, 3, 8, problems, {'head': '3', 'head.begin': '10', 'head.end': '17', 'sentence': '0'}, []), Annotation(dependency, 10, 17, ROOT, {'head': '0', 'head.begin': '-1', 'head.end': '-1', 'sentence': '0'}, []), Annotation(dependency, 19, 20, pain, {'head': '5', 'head.begin': '22', 'head.end': '25', 'sentence': '0'}, []), Annotation(dependency, 22, 25, problems, {'head': '3', 'head.begin': '10', 'head.end': '17', 'sentence': '0'}, [])], 'dependency_type': [Annotation(labeled_dependency, 0, 1, nsubj, {'sentence': '0'}, []), Annotation(labeled_dependency, 3, 8, parataxis, {'sentence': '0'}, []), Annotation(labeled_dependency, 10, 17, root, {'sentence': '0'}, []), Annotation(labeled_dependency, 19, 20, compound, {'sentence': '0'}, []), Annotation(labeled_dependency, 22, 25, amod, {'sentence': '0'}, [])], 'document': [Annotation(document, 0, 25, he denies problems or pain, {}, [])], 'pos': [Annotation(pos, 0, 1, PRP, {'word': 'he', 'sentence': '0'}, []), Annotation(pos, 3, 8, VBZ, {'word': 'denies', 'sentence': '0'}, []), Annotation(pos, 10, 17, NNS, {'word': 'problems', 'sentence': '0'}, []), Annotation(pos, 19, 20, CC, {'word': 'or', 'sentence': '0'}, []), Annotation(pos, 22, 25, NN, {'word': 'pain', 'sentence': '0'}, [])], 'token': [Annotation(token, 0, 1, he, {'sentence': '0'}, []), Annotation(token, 3, 8, denies, {'sentence': '0'}, []), Annotation(token, 10, 17, problems, {'sentence': '0'}, []), Annotation(token, 19, 20, or, {'sentence': '0'}, []), Annotation(token, 22, 25, pain, {'sentence': '0'}, [])]}
  • text = "he denies problems, or pain" fails with the <no-type> label on everything
    {'dependency': [Annotation(dependency, 0, 1, denies, {'head': '2', 'head.begin': '3', 'head.end': '8', 'sentence': '0'}, []), Annotation(dependency, 3, 8, problems, {'head': '3', 'head.begin': '10', 'head.end': '17', 'sentence': '0'}, []), Annotation(dependency, 10, 17, ROOT, {'head': '0', 'head.begin': '-1', 'head.end': '-1', 'sentence': '0'}, []), Annotation(dependency, 18, 18, pain, {'head': '6', 'head.begin': '23', 'head.end': '26', 'sentence': '0'}, []), Annotation(dependency, 20, 21, pain, {'head': '6', 'head.begin': '23', 'head.end': '26', 'sentence': '0'}, []), Annotation(dependency, 23, 26, problems, {'head': '3', 'head.begin': '10', 'head.end': '17', 'sentence': '0'}, [])], 'dependency_type': [Annotation(labeled_dependency, 0, 1, <no-type>, {'sentence': '0'}, []), Annotation(labeled_dependency, 3, 8, <no-type>, {'sentence': '0'}, []), Annotation(labeled_dependency, 10, 17, <no-type>, {'sentence': '0'}, []), Annotation(labeled_dependency, 18, 18, <no-type>, {'sentence': '0'}, []), Annotation(labeled_dependency, 20, 21, <no-type>, {'sentence': '0'}, []), Annotation(labeled_dependency, 23, 26, <no-type>, {'sentence': '0'}, [])], 'document': [Annotation(document, 0, 26, he denies problems, or pain, {}, [])], 'pos': [Annotation(pos, 0, 1, PRP, {'word': 'he', 'sentence': '0'}, []), Annotation(pos, 3, 8, VBZ, {'word': 'denies', 'sentence': '0'}, []), Annotation(pos, 10, 17, NNS, {'word': 'problems', 'sentence': '0'}, []), Annotation(pos, 18, 18, ,, {'word': ',', 'sentence': '0'}, []), Annotation(pos, 20, 21, CC, {'word': 'or', 'sentence': '0'}, []), Annotation(pos, 23, 26, NN, {'word': 'pain', 'sentence': '0'}, [])], 'token': [Annotation(token, 0, 1, he, {'sentence': '0'}, []), Annotation(token, 3, 8, denies, {'sentence': '0'}, []), Annotation(token, 10, 17, problems, {'sentence': '0'}, []), Annotation(token, 18, 18, ,, {'sentence': '0'}, []), Annotation(token, 20, 21, or, {'sentence': '0'}, []), Annotation(token, 23, 26, pain, {'sentence': '0'}, [])]}

Your Environment

  • Spark NLP version sparknlp.version(): 4.2.7
  • Apache NLP version spark.version: 3.3.1
  • Java version java -version: openjdk version "1.8.0_345"; OpenJDK Runtime Environment (Zulu 8.64.0.19-CA-linux64) (build 1.8.0_345-b01); OpenJDK 64-Bit Server VM (Zulu 8.64.0.19-CA-linux64) (build 25.345-b01, mixed mode)
  • Setup and installation (Pypi, Conda, Maven, etc.):
    spark-nlp-display==4.1.0
    johnsnowlabs-for-databricks==4.3.2
    jsl.jar
    dbfs:/FileStore/johnsnowlabs/libs/spark-nlp-jsl-4.3.0.jar
    dbfs:/FileStore/johnsnowlabs/libs/spark_nlp_jsl-4.3.0-py3-none-any.whl
    assembly.jar
    dbfs:/FileStore/johnsnowlabs/libs/spark-ocr-assembly-4.3.0.jar
    dbfs:/FileStore/johnsnowlabs/libs/spark_ocr-4.3.0-py3-none-any.whl
    spark-nlp==4.3.0
    com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0
  • Operating System and version: DataBricks 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12) - Fresh install on GCP from the DB Installer
  • Link to your project (if any):
