This repository contains the code to reproduce the results in: Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. arXiv:2007.06400 [cs]. [arXiv] [Code on GitHub] [data-request@DKG] Accepted at LOUHI@EMNLP'20. https://arxiv.org/abs/2007.06400
- GGPONC source files:
- Follow the instructions on the GGPONC website (Access & Download)
- Copy `cpg-corpus-cms.xml` into `src/main/resources`
- PubMed Abstracts from German Case Reports and Case Descriptions
- Install the Entrez API from NCBI or EDirect, the command-line tools for querying the PubMed infrastructure
- Open a terminal and type
esearch -db pubmed -query "Case Reports[Publication Type] AND GER[LA]" | efetch -format xml > allGermanPubMedCaseAbstracts.xml
(This step can take about an hour.)
- Copy the extracted file `allGermanPubMedCaseAbstracts.xml` into `src/main/resources`
- JSYNCC v1.1: follow the instructions at https://github.com/JULIELab/jsyncc or contact Christina Lohr
- 3000PA: no public access
- KRAUTS Corpus (Strötgen et al.)
- WikiWarsDe Corpus (Strötgen et al.)
- You need files from the UMLS.
- You need to register at UTS; afterwards you can download the UMLS files from the U.S. National Library of Medicine (NLM, part of the NIH).
- For our current work, we used the UMLS release 2019AB. You need the following files:
- 2019AB MRCONSO.RRF
- 2019AB MRSTY.RRF (only accessible from the full release zip file)
- Unzip the files.
- More information on the UMLS can be found in the UMLS® Reference Manual.
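To get a feel for the two UMLS files, the following minimal Python sketch reads the pipe-delimited RRF format and collects the German terms of each concept together with its semantic type identifiers. The column positions follow the UMLS Reference Manual (MRCONSO.RRF: CUI at index 0, LAT at 1, STR at 14; MRSTY.RRF: CUI at 0, TUI at 1). The function names are illustrative, and this is not the JuFiT filtering used for the paper.

```python
from collections import defaultdict


def read_rrf(path):
    """Yield the fields of each non-empty row of a pipe-delimited RRF file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                # RRF rows end with a trailing '|', so the last field is empty.
                yield line.split("|")


def german_terms(mrconso_path, mrsty_path):
    """Map each CUI to (sorted German strings, sorted semantic type ids)."""
    types = defaultdict(set)
    for row in read_rrf(mrsty_path):
        types[row[0]].add(row[1])  # CUI -> TUI

    terms = defaultdict(set)
    for row in read_rrf(mrconso_path):
        if row[1] == "GER":  # LAT column: language of the term
            terms[row[0]].add(row[14])  # STR column: the term string

    return {cui: (sorted(strs), sorted(types[cui])) for cui, strs in terms.items()}
```

Calling `german_terms("MRCONSO.RRF", "MRSTY.RRF")` on the unzipped 2019AB files keeps all German strings in memory, which is acceptable for a quick inspection but is not tuned for speed.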
- Java 11 (we prefer OpenJDK)
- Apache Maven (mvn)
- Python 3
- We prefer to use the Eclipse IDE or IntelliJ IDEA
- Configure the project as a Maven project
- In Eclipse: right click on project => Configure => Convert to Maven Project
- Command line:
mvn compile
- Run `mvn compile` before executing
mvn exec:java -Dexec.mainClass="de.hpi.guidelines.reader.GGPOncXMLReader" -Dexec.args="<Path to cpg-corpus-cms.xml>"
or run `GGPOncXMLReader.java` (in package `de.hpi.guidelines.reader`) in Eclipse (Run As => Java Application)
- Wait a minute
- Look into the directory `/output`
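If you prefer to script the two Maven steps above, a minimal Python sketch could look as follows. It only wraps the commands already given; the function names `reader_commands` and `run_reader` are hypothetical helpers and not part of this repository.

```python
import subprocess


def reader_commands(xml_path):
    """Return the two Maven invocations from the steps above as argument lists."""
    compile_cmd = ["mvn", "compile"]
    run_cmd = [
        "mvn",
        "exec:java",
        "-Dexec.mainClass=de.hpi.guidelines.reader.GGPOncXMLReader",
        f"-Dexec.args={xml_path}",
    ]
    return [compile_cmd, run_cmd]


def run_reader(xml_path, project_dir="."):
    """Compile the project, then run the reader; stop at the first failure."""
    for cmd in reader_commands(xml_path):
        # check=True raises CalledProcessError if Maven exits with an error.
        subprocess.run(cmd, cwd=project_dir, check=True)
```

For example, `run_reader("src/main/resources/cpg-corpus-cms.xml")` would compile first and only start the reader if compilation succeeded.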
- We downloaded the PubMed data on February 21, 2020. If you download PubMed data with the esearch commands today, you will receive a larger text corpus than our export. The file `src/main/resources/usedPubMedIds_20200221.txt` contains the list of PubMed identifiers we used on February 21, 2020.
- If you want to recreate the described data set from PubMed, import your extracted XML file and run `src/main/extractPubMedCaseAbstracts.java`. This code selects the PubMed records we used from your newly created download.
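The idea behind this filtering step can be sketched in a few lines of Python. This is not the repository's Java implementation; it assumes that the identifier file holds one PubMed ID per line and that the download is a single `PubmedArticleSet` document as produced by `efetch -format xml`. The function names are illustrative.

```python
import xml.etree.ElementTree as ET


def load_ids(path):
    """Read PubMed identifiers, assuming one identifier per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_articles(xml_in, xml_out, keep_ids):
    """Write a copy of xml_in that keeps only records whose PMID is in keep_ids."""
    tree = ET.parse(xml_in)  # loads the whole file into memory
    root = tree.getroot()    # expected to be <PubmedArticleSet>
    kept = 0
    for record in list(root):
        pmid = record.findtext("MedlineCitation/PMID")
        if pmid in keep_ids:
            kept += 1
        else:
            root.remove(record)
    tree.write(xml_out, encoding="utf-8", xml_declaration=True)
    return kept
```

A call such as `filter_articles("allGermanPubMedCaseAbstracts.xml", "filtered.xml", load_ids("usedPubMedIds_20200221.txt"))` returns the number of records kept; the output file name `filtered.xml` is a placeholder.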
- We worked with JuFit v1.1; the matching jar file is included in this repository.
- If you want to work with the upstream JuFit instead, follow the steps below:
- Download JuFit from https://github.com/JULIELab/jufit
- Create the jar file with Apache Maven by running
mvn clean package
- Run
java -jar JuFiT.jar MRCONSO.RRF MRSTY.RRF GER --grounded > UMLS_dict.txt
- Run the Java code `RequestJuFiT.java` (package `de.julielab.dictionaryhandling`) or the script `extended_script_dictionaries/request-jufit.sh`
- We used a list of gene names compiled from Entrez Gene and UniProt with the approach originating from Wermter et al.
- Code: JULIELab/gene-name-mapping
- The integration of this code into the GGPOnc repository is coming soon.
- To use the JCoRe pipelines, you need one large file, `global_dictionary.txt`
- Run the script `extended_script_dictionaries/createDics.py` to create one large dictionary (before running, adapt the path names in the script file)
- Or run the Java code `CreateLargeDictionary.java` (package `de.julielab.dictionaryhandling`) (before running, adapt the path names in the source file)
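As a rough illustration of what creating one large dictionary involves, the sketch below concatenates several dictionary files and drops blank and duplicate lines. It is an assumption about the general approach, not a description of what `createDics.py` or `CreateLargeDictionary.java` actually do, and it treats every line as an opaque entry.

```python
from pathlib import Path


def merge_dictionaries(input_paths, output_path):
    """Concatenate dictionary files, skipping blank lines and exact duplicates."""
    seen = set()
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                if line.strip() and line not in seen:
                    seen.add(line)
                    out.write(line + "\n")
    return len(seen)  # number of distinct entries written
```

For example, `merge_dictionaries(["UMLS_dict.txt", "gene_names.txt"], "global_dictionary.txt")` would write the merged file; `gene_names.txt` is a hypothetical name for the gene name list.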
- Unpack the `*.zip` files in `jcore-pipelines`; there are 2 pipelines:
- detectUMLSentries
- detectStopwords
- Create the folder `data/files` in the pipeline directories and put the data to be analyzed into `data/files` (subdirectories are not read; be careful with `*.tar` files)
files) - Put the global dictionary file into
jcore-pipelines/detectUMLSentries/resources
- Adapt filename of the dictionary and the stopword dictionary in the following files:
desc/GazetteerAnnotator
Template Descriptor with Configurable ExternalResource.xml
descAll/GazetteerAnnotator
Template Descriptor with Configurable ExternalResource.xml
- Open a terminal and change into one of the pipeline directories
- Start the pipeline with
java -jar ../jcore-pipeline-runner-base-0.4.1-SNAPSHOT-cli-assembly.jar run.xml
- Results: `offsets.tsv` and `data/outData/output-xmi`
- This JCoRe pipeline is derived from the JULIE Lab's own jcore-pipeline-modules (see also https://zenodo.org/record/4066619#.X3sPVS8Rp-U)
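Because the pipelines do not read subdirectories of `data/files`, a corpus stored in nested folders has to be flattened first. The following Python sketch shows one way to do that; the function name and the convention of joining path parts with underscores are assumptions for illustration, not part of the pipelines.

```python
import shutil
from pathlib import Path


def flatten_into(source_dir, target_dir):
    """Copy every file below source_dir directly into target_dir."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        # Encode the relative path in the file name to avoid name clashes.
        flat_name = "_".join(path.relative_to(source).parts)
        shutil.copy2(path, target / flat_name)
        copied.append(flat_name)
    return copied
```

The target directory should lie outside the source directory, for example `flatten_into("my_corpus", "jcore-pipelines/detectUMLSentries/data/files")`, where `my_corpus` is a placeholder for your own data folder.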
- To calculate the inter-annotator agreement between human annotators, follow the instructions of bratiaa
- To calculate precision and recall between automatically created annotations and the human-annotated data, run:
pip install bratutils
python src/main/python/umls_evaluation.py <path to gold annotations> <path to automatic annotations>
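For readers unfamiliar with the metrics, the sketch below computes exact-match precision, recall, and F1 over two sets of annotations. It is a generic illustration, assuming each annotation is represented as a `(document, start, end, label)` tuple; it is not the logic of `umls_evaluation.py` or of bratutils, which may match annotations differently.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (document, start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Precision is the share of automatic annotations that also occur in the gold data, and recall is the share of gold annotations that were found automatically.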