Create new API for pubtator3 #177

Open
andrewsu opened this issue Feb 7, 2024 · 5 comments · Fixed by #192
Assignees
Labels
data source Data source pending to create a new API On CI Match https://github.com/biothings/biothings_explorer/labels On Test Match https://github.com/biothings/biothings_explorer/labels

Comments

@andrewsu
Member

andrewsu commented Feb 7, 2024

Website: https://www.ncbi.nlm.nih.gov/research/pubtator3/
FTP: https://www.ncbi.nlm.nih.gov/research/pubtator3/ (We are most interested in relation2pubtator3.gz)

Pubtator3 is the latest iteration of pubtator from Zhiyong Lu's group at NCBI. It includes an analysis of all 35+ million abstracts in PubMed and nearly 6 million full-text articles in the PMC Text Mining subset, resulting in 1.6 billion entity annotations and 33 million extracted relations (8.8 million unique pairs of entities).

Let's try to use the same structure as we did for the semmeddb API, e.g., https://biothings.transltr.io/semmeddb/association/C0040077-STIMULATES-C0076591

{
  "_id": "C0040077-STIMULATES-C0076591",
  "_version": 1,
  "object": {
    "name": "thymidylate synthase-dihydrofolate reductase",
    "novelty": 1,
    "semantic_type_abbreviation": "gngm",
    "semantic_type_name": "Gene or Genome",
    "umls": "C0076591"
  },
  "pmid_count": 1,
  "predicate": "STIMULATES",
  "predication": [
    {
      "object_score": 1000,
      "object_text": "dhfr-ts",
      "pmid": 7479765,
      "predication_id": 107061205,
      "sentence": "Survival and replication of dhfr-ts- in macrophages in vitro were dependent upon thymidine, with parasites differentiating into amastigotes prior to destruction. dhfr-ts- parasites persisted in BALB/c mice for up to 2 months, declining with a half-life of 2-3 days.",
      "sentence_id": 87288544,
      "subject_score": 1000,
      "subject_text": "thymidine"
    }
  ],
  "predication_count": 1,
  "subject": {
    "name": "Thymidine",
    "novelty": 1,
    "semantic_type_abbreviation": "bacs",
    "semantic_type_name": "Biologically Active Substance",
    "umls": "C0040077"
  }
}

NOTE that pubtator 3 also has an API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api, but their usage restrictions mean we should just set up our own...

@andrewsu andrewsu added the data source Data source pending to create a new API label Feb 7, 2024
@ctrl-schaff ctrl-schaff self-assigned this Apr 16, 2024
@ctrl-schaff ctrl-schaff linked a pull request Apr 30, 2024 that will close this issue
@andrewsu
Member Author

We perhaps need a better place to document this best practice, but per the guidelines at https://github.com/biothings/biothings_explorer/blob/main/docs/README-contributing-new-data-source.md, let's add a link to the parser code and a link to an API call with an example record to this issue. @ctrl-schaff can I ask you to handle this please?

@andrewsu andrewsu reopened this Jun 25, 2024
@ctrl-schaff
Contributor

> We perhaps need a better place to document this best practice, but per the guidelines at https://github.com/biothings/biothings_explorer/blob/main/docs/README-contributing-new-data-source.md, let's add a link to the parser code and a link to an API call with an example record to this issue. @ctrl-schaff can I ask you to handle this please?

Sure no problem

@ctrl-schaff
Contributor

For this plugin the parsing code can be found at https://github.com/biothings/pending.api/tree/master/plugins/pubtator3

Generated API call: https://biothings.ci.transltr.io/pubtator3/association/11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534

Generated Result:

{
  "_id": "11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534",
  "_version": 1,
  "object": {
    "identifier": {
      "key": "MESH",
      "value": "D008579"
    },
    "semantic_type_name": "Disease"
  },
  "pmid": 11270550,
  "pmid_count": 1,
  "predicate": "ASSOCIATE",
  "predication_count": 1,
  "subject": {
    "identifier": {
      "key": null,
      "value": "57534"
    },
    "semantic_type_name": "Gene"
  }
}

This plugin is currently deployed on the CI environment so feel free to test it there for more data samples.

Chunlei and I have already discussed modifying this structure so we can eliminate the PMID value from the _id field. The data provided by pubtator contains a fair number of duplicates, which is why I included the PMID in the _id field in the first place: it meant far fewer entries were ignored while parsing. This exposed a discrepancy between our sqlite3 and mongodb merging backends, which I'm currently fixing. I'll modify the structure of this plugin once the fix is ready to test with both backends. If you have any other suggestions or issues with the data structure, please let me know @andrewsu

@andrewsu
Member Author

andrewsu commented Jun 26, 2024

There's a layer of aggregation that needs to be added to the parser. Consider this set of records linking D007037 to D008713: https://biothings.ci.transltr.io/pubtator3/query?q=object.identifier.value:D008713%20AND%20subject.identifier.value:D007037&facets=predicate

There are 433 total records joining these terms: 383 using the cause predicate, 49 using the treat predicate, and 1 using the associate predicate. So these 433 original records in pubtator3 should be collapsed into three records in our API with roughly this structure:

    "hits": [
        {
            "_id": "Chemical|MESH:D008713-CAUSE-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [729631,17161219,20808432,15820614,...],
            "pmid_count": 381,
            "predicate": "CAUSE",
            "predication_count": 383,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        },
        {
            "_id": "Chemical|MESH:D008713-TREAT-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [26214210,26799350,23337033,2552340,...],
            "pmid_count": 49,
            "predicate": "TREAT",
            "predication_count": 49,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        },
        {
            "_id": "Chemical|MESH:D008713-ASSOCIATE-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [37931916],
            "pmid_count": 1,
            "predicate": "ASSOCIATE",
            "predication_count": 1,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        }
    ]

Note that predication_count is the number of original records sharing the same subject.identifier.value - predicate - object.identifier.value triple, pmid is the list of unique PMIDs across those predications, and pmid_count is the length of that list. Let me know if you have any questions!
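
To make the requested collapse concrete, here is a minimal sketch of the grouping step. It is not the plugin's actual parser: `aggregate_relations` and the flat `first`/`second` keys are hypothetical names, the input is assumed to be rows of (pmid, predicate, first entity, second entity) in the column order of the relations file, and the PMIDs in the example are made-up toy values. A real parser would still expand the entity strings into the nested subject/object documents.

```python
from collections import defaultdict


def aggregate_relations(rows):
    """Collapse (pmid, predicate, first, second) rows into one record per triple."""
    grouped = defaultdict(list)
    for pmid, predicate, first, second in rows:
        # Every original row counts as one predication, duplicates included.
        grouped[(first, predicate.upper(), second)].append(int(pmid))

    records = []
    for (first, predicate, second), pmids in grouped.items():
        unique_pmids = sorted(set(pmids))
        records.append({
            "_id": f"{first}-{predicate}-{second}",  # no PMID in the _id
            "first": first,      # hypothetical flat key, not the final schema
            "second": second,    # hypothetical flat key, not the final schema
            "predicate": predicate,
            "pmid": unique_pmids,
            "pmid_count": len(unique_pmids),
            "predication_count": len(pmids),
        })
    return records


# Toy rows: PMID 101 appears twice for the same triple.
rows = [
    (101, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (101, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (102, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (201, "treat", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
]
records = aggregate_relations(rows)
```

With the toy rows, the CAUSE record ends up with a predication_count one higher than its pmid_count, which is the same kind of gap as the 383 vs. 381 in the example above.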

@andrewsu
Member Author

andrewsu commented Jul 2, 2024

... and adding two other tweaks to the parser. As always, let me know if any clarifications are needed...

1. add name field for diseases, chemicals, and genes

All chemicals use MESH IDs. Those can be resolved to names using mychem.info, e.g., https://mychem.info/v1/query?q=umls.mesh:C579720. The name can be drawn from this list of fields in the JSON (in order of priority):

  • chebi.name
  • chembl.pref_name
  • drugbank.name
  • unii.display_name
  • umls.name

Diseases either use MESH (e.g., D015179 which can be searched using https://mydisease.info/v1/query?q=disease_ontology.xrefs.mesh:D015179%20OR%20umls.mesh.preferred:D015179%20OR%20ctd.mesh:D015179%20OR%20mondo.xrefs.mesh:D015179) or OMIM (e.g., 610251 which can be searched by https://mydisease.info/v1/query?q=mondo.xrefs.omim:610251%20OR%20hpo.omim:610251%20OR%20ctd.omim:610251). The name can be drawn from this list of fields in the JSON:

  • mondo.label
  • disease_ontology.name
  • hpo.disease_name
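
As an illustration of the disease lookup, the sketch below builds the two mydisease.info query URLs shown above from an identifier key and value. The function name `disease_query_url` is hypothetical, and the key strings "MESH" and "OMIM" are an assumption about how the parser labels the identifier; the field lists are copied from the example URLs.

```python
from urllib.parse import quote, urlencode

# Fields taken from the example queries above.
MESH_FIELDS = [
    "disease_ontology.xrefs.mesh",
    "umls.mesh.preferred",
    "ctd.mesh",
    "mondo.xrefs.mesh",
]
OMIM_FIELDS = ["mondo.xrefs.omim", "hpo.omim", "ctd.omim"]


def disease_query_url(key, value):
    """Build a mydisease.info query URL for a MESH or OMIM identifier."""
    fields = {"MESH": MESH_FIELDS, "OMIM": OMIM_FIELDS}[key]
    q = " OR ".join(f"{field}:{value}" for field in fields)
    # quote_via=quote encodes spaces as %20; safe=":" leaves colons readable.
    return "https://mydisease.info/v1/query?" + urlencode(
        {"q": q}, safe=":", quote_via=quote
    )


mesh_url = disease_query_url("MESH", "D015179")
omim_url = disease_query_url("OMIM", "610251")
```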

Genes are always specified by the NCBI Gene ID, resolved using https://mygene.info/v3/gene/1017. The name can be drawn from this list:

  • symbol
  • name
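
The name selection for all three entity types could be sketched as a single priority lookup over dotted field paths. This is only an illustration under a few assumptions: `get_path` and `pick_name` are hypothetical helpers, the disease and gene lists are treated as priority-ordered like the chemical list, the input documents are hand-written toy dicts rather than real API responses, and fields that come back as lists are simply skipped.

```python
CHEMICAL_NAME_FIELDS = [
    "chebi.name",
    "chembl.pref_name",
    "drugbank.name",
    "unii.display_name",
    "umls.name",
]
DISEASE_NAME_FIELDS = ["mondo.label", "disease_ontology.name", "hpo.disease_name"]
GENE_NAME_FIELDS = ["symbol", "name"]


def get_path(doc, dotted_path):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def pick_name(doc, fields):
    """Return the first non-empty string found among the fields, in list order."""
    for field in fields:
        value = get_path(doc, field)
        if isinstance(value, str) and value:
            return value
    return None


# Toy documents, not real API output.
toy_chemical = {"drugbank": {"name": "toy-drugbank-name"}, "umls": {"name": "toy-umls-name"}}
toy_gene = {"symbol": "TOY1", "name": "toy gene 1"}

chemical_name = pick_name(toy_chemical, CHEMICAL_NAME_FIELDS)
gene_name = pick_name(toy_gene, GENE_NAME_FIELDS)
missing_name = pick_name({}, DISEASE_NAME_FIELDS)
```

Returning None when nothing matches leaves it to the caller to decide whether to omit the name field or fall back to the identifier.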

2. ignore CorrespondingGene identifiers

Consider this set of relations from the pubtator3 relations file:

$ gzip -cd relation2pubtator3.gz | grep 33847607
33847607        associate       DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524   Disease|MESH:D053713
33847607        associate       Disease|MESH:D053713    Gene|4524
33847607        cause   DNAMutation|RS#:1801133;HGVS:c.677C>T;CorrespondingGene:4524    Disease|MESH:D053713

The corresponding records are here: https://biothings.ci.transltr.io/pubtator3/query?q=pmid:33847607, one of which is pasted below:

{
  "_id": "33847607-DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524-ASSOCIATE-Disease|MESH:D053713",
  "_score": 1,
  "object": [
    {
      "identifier": {
        "key": "RS#",
        "value": "1801131"
      },
      "semantic_type_name": "DNAMutation"
    },
    {
      "identifier": {
        "key": "HGVS",
        "value": "c.1298A>C"
      },
      "semantic_type_name": "DNAMutation"
    },
    {
      "identifier": {
        "key": "CorrespondingGene",
        "value": "4524"
      },
      "semantic_type_name": "DNAMutation"
    }
  ],
  "pmid": 33847607,
  "pmid_count": 1,
  "predicate": "ASSOCIATE",
  "predication_count": 1,
  "subject": {
    "identifier": {
      "key": "MESH",
      "value": "D053713"
    },
    "semantic_type_name": "Disease"
  }
}

The identifier for CorrespondingGene:4524 should be removed, since there is already a separate record linking Gene|4524 to Disease|MESH:D053713. (Spot checking a few other examples, this redundancy appears to be universally true.)
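
A minimal sketch of that filtering step, assuming entity strings have the `Type|key:value;key:value` shape seen in the relations file and that identifier values never contain a semicolon. `parse_entity` and `drop_corresponding_gene` are hypothetical names, not functions from the plugin.

```python
def parse_entity(entity):
    """Split 'Type|key:value;key:value' into the type and (key, value) pairs."""
    semantic_type, _, id_part = entity.partition("|")
    identifiers = []
    for chunk in id_part.split(";"):
        key, sep, value = chunk.partition(":")
        if not sep:
            # Bare identifiers such as 'Gene|4524' carry no key.
            key, value = None, chunk
        identifiers.append((key, value))
    return semantic_type, identifiers


def drop_corresponding_gene(entity):
    """Build identifier documents for an entity, skipping CorrespondingGene."""
    semantic_type, identifiers = parse_entity(entity)
    return [
        {
            "identifier": {"key": key, "value": value},
            "semantic_type_name": semantic_type,
        }
        for key, value in identifiers
        if key != "CorrespondingGene"
    ]


mutation_docs = drop_corresponding_gene(
    "DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524"
)
gene_docs = drop_corresponding_gene("Gene|4524")
```

For the mutation above, only the RS# and HGVS identifiers remain, while the plain Gene|4524 entity passes through unchanged with a null key, as in the current records.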

@newgene newgene added On CI Match https://github.com/biothings/biothings_explorer/labels On Test Match https://github.com/biothings/biothings_explorer/labels labels Jul 3, 2024

3 participants