The scripts in this repo are used to group biomedical records from different sources based on the drug and disease entities mentioned in them.
The method extracts and normalizes drug and disease names from the free text of various documents. It then groups together records that mention the same diseases and drugs.
Records are grouped from three data sources:
- CTgov: a register of clinical trials (https://clinicaltrials.gov/)
- PubMed: a database of biomedical publications (https://pubmed.ncbi.nlm.nih.gov/)
- EMA drug register (https://www.ema.europa.eu/en/medicines)
A sample of 100 CTgov records, 100 PubMed records and 72 EMA records can be found in the data folder.
For details on the bigger experiment and its results, see here.
BERN is a state-of-the-art biomedical named entity recognition tool, which detects entities of types disease, drug, gene, species and mutation. Please refer to the paper by Kim et al. (2019) for further information.
The bernotate.py script creates one JSON file per record, containing all the BERN-detected biomedical entities.
The bernparse.py script parses the JSON files created in Step 1 and stores the detected entities in pickled dataframes. The processing is done in batches of 10,000 JSON files.
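The batching described above can be sketched as follows. This is a minimal illustration, not the code in bernparse.py: the helper name `batched` and the file names are hypothetical, and a small batch size is used so the behaviour is easy to see.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 10_000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical file names standing in for the per-record JSON files.
json_files = [f"record_{i}.json" for i in range(25)]

# Each batch would be parsed and written to its own pickled dataframe.
batch_sizes = [len(batch) for batch in batched(json_files, batch_size=10)]
```

Processing in fixed-size batches keeps memory use bounded, since only one batch of parsed files has to be held at a time.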
The UMLS metathesaurus maps synonymous medical terms to a concept unique identifier (CUI); for example, the disease terms Hb-SS disease, Herrick syndrome and sickle cell anemia (which all refer to the same disease) are mapped to the CUI C0002895. The tool used for mapping is QuickUMLS; please refer to the paper by Soldaini and Goharian (2016) for further information.
The get_cuis.py script processes the pickled dataframes created in Step 2. The disease and drug entities in the dataframes are mapped to CUIs. The results are stored in pickled dataframes.
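To illustrate what mapping synonymous terms to a single CUI means, here is a toy lookup built only from the sickle cell example above. It does not use QuickUMLS or the UMLS data; the table and the function names `normalize` and `lookup_cui` are assumptions made for illustration, and the real tool performs approximate matching against the full Metathesaurus.

```python
from typing import Dict, Optional

# Toy synonym table taken from the example in the text above.
# The real mapping comes from the UMLS data via QuickUMLS.
TOY_CUI_TABLE: Dict[str, str] = {
    "hb-ss disease": "C0002895",
    "herrick syndrome": "C0002895",
    "sickle cell anemia": "C0002895",
}


def normalize(term: str) -> str:
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def lookup_cui(term: str) -> Optional[str]:
    """Return the CUI for a term, or None if the toy table has no entry."""
    return TOY_CUI_TABLE.get(normalize(term))


terms = ["Hb-SS disease", "Herrick syndrome", "sickle cell anemia"]
cuis = {lookup_cui(term) for term in terms}
```

Because all three surface forms resolve to the same identifier, records that use different names for the same disease can later be compared on equal terms.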
NOTE: To run QuickUMLS, you will need to obtain a license from the National Library of Medicine and download the UMLS files. See instructions and links here. The script assumes that a folder called quickUMLS_eng, which contains the UMLS data for English, is located in the data folder.
The script groupings.py groups the records based on their disease and drug CUIs, using a similarity measure and a threshold (defined by the available parameters; see the script).
The results of grouping the sample data using cosine similarity and a distance threshold of 0.4 (i.e. similarity threshold 0.6) can be found in data/groupings; these results are examined in the notebook results_explore.ipynb.
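The relationship between the distance threshold and the similarity threshold (distance = 1 - similarity) can be sketched as below. This is an illustration under stated assumptions, not the logic of groupings.py: each record is assumed to be a set of CUIs treated as a binary vector, records are linked when their cosine distance is at most the threshold, and groups are taken to be the connected components of those links. The record names and all CUIs other than C0002895 are hypothetical placeholders.

```python
import math
from typing import Dict, List, Set


def cosine_similarity(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two CUI sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def group_records(
    records: Dict[str, Set[str]], distance_threshold: float = 0.4
) -> List[Set[str]]:
    """Link records within the distance threshold; return connected components."""
    parent = {record_id: record_id for record_id in records}

    def find(x: str) -> str:
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(records)
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            distance = 1.0 - cosine_similarity(records[left], records[right])
            if distance <= distance_threshold:
                parent[find(left)] = find(right)

    groups: Dict[str, Set[str]] = {}
    for record_id in ids:
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())


# Hypothetical records; only C0002895 comes from the text above.
records = {
    "trial_1": {"C0002895", "CUI_DRUG_X"},
    "paper_1": {"C0002895", "CUI_DRUG_X", "CUI_OTHER"},
    "ema_1": {"CUI_UNRELATED"},
}
groups = group_records(records)
```

Here trial_1 and paper_1 share two of their CUIs, giving a similarity of 2/sqrt(6) (about 0.82) and a distance of about 0.18, so they fall within the 0.4 threshold and end up in one group, while ema_1 shares nothing and stays on its own.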