CUTEXT - Cvalue Used To EXtract Terms


Introduction


The heavy use of medical terms has motivated the construction of large terminological resources for English, such as the Unified Medical Language System (UMLS) or the Open Biological and Biomedical Ontology (OBO) ontologies. Purely manual construction of terminological resources is valuable, but it has three drawbacks: it is highly time-consuming; it does not guarantee that the included concepts or terms match the medical language actually used by healthcare professionals in clinical documents; and it requires constant updating and revision as biomedical concepts change and new ones emerge.

CUTEXT is a multilingual medical term extraction tool. It allows extracting terms in texts written in English, Spanish, Galician, and Catalan.

The main characteristics of CUTEXT are the following:

  • It is implemented in Java, so it is multiplatform. It has been tested under Windows and Linux.
  • It is multilingual: It has been tested in English, Spanish, Catalan, and Galician.
  • It can be adapted easily to other languages by simply changing the lexical tag text file configuration.
  • The input documents can be in plain text or in PDF.
  • It can be executed in graphic mode or by console (command line).
  • It supports numerous configuration parameters. The most important are the language, the tagger, the frequency and C-value thresholds, and the input document(s).
  • The output is provided in plain text, in JSON format and/or in BioC.

A more detailed description of the system can be found in the journal of the Sociedad Española para el Procesamiento del Lenguaje Natural.

Prerequisites (Dependency)


CUTEXT requires TreeTagger to be installed on your computer. If you are going to use medical or biomedical texts, it is also convenient, although not necessary, to install GeniaTagger or another tagger specific to this domain.

The path to the TreeTagger "bin" folder must also be included in the PATH variable.

To convert texts from PDF to txt, we use a script that relies on the class "ExtractText" of the Apache pdfbox API, which is packaged in the "pdfbox-app-2.0.5.jar" file inside the "jar_pdf" folder. Therefore, the path to this file must be included in the CLASSPATH variable. Finally, you must also include in the CLASSPATH variable the path to the "es" folder, since CUTEXT is packaged as "es.cnio.bionlp.cutext".
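Before launching CUTEXT, it can help to confirm that the tagger is actually reachable through PATH. The following is a minimal sketch of such a check, not part of CUTEXT itself. The class name PathCheckSketch and the executable name "tree-tagger" are assumptions made for illustration; pass the real executable name of your installation as the first argument if it differs.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CUTEXT): looks for an executable on PATH.
class PathCheckSketch {

    /** Returns the first entry of pathValue that contains a regular file named exeName. */
    static Optional<Path> findOnPath(String pathValue, String exeName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // PATH entries are separated by ':' on Linux and ';' on Windows.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, exeName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries instead of failing the whole check.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "tree-tagger" is an assumed executable name; override it via args[0].
        String exe = args.length > 0 ? args[0] : "tree-tagger";
        Optional<Path> hit = findOnPath(System.getenv("PATH"), exe);
        System.out.println(hit.map(p -> "Found: " + p).orElse(exe + " not found on PATH"));
    }
}
```

The same approach could be applied to the CLASSPATH variable to verify that the pdfbox jar and the "es" folder are listed.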

Directory structure


The CUTEXT directory structure corresponds to a unique package name, es.cnio.bionlp.cutext. This makes the fully qualified class names unique. Therefore, all packages are within that structure:

  • es/cnio/bionlp/cutext/config_files/: includes files with tags, stop-words, and punctuation marks in Spanish, Galician, Catalan, and English.
  • es/cnio/bionlp/cutext/filter/lin/: contains the Java classes that implement the linguistic filter.
  • es/cnio/bionlp/cutext/filter/sta/: contains the Java classes that implement the statistical filter.
  • es/cnio/bionlp/cutext/gui/: contains the Java classes that implement the graphical user interface (GUI).
  • es/cnio/bionlp/cutext/in/: a possible place to put the input file.
  • es/cnio/bionlp/cutext/intern/TT/in/: internal storage of the input file for TreeTagger.
  • es/cnio/bionlp/cutext/intern/TT/out/: internal storage of the output file generated by TreeTagger.
  • es/cnio/bionlp/cutext/intern/TT/x/: internal storage of the intermediate file for TreeTagger.
  • es/cnio/bionlp/cutext/jar_pdf/: contains the "pdfbox-app-2.0.5.jar" file used to convert texts from PDF to txt.
  • es/cnio/bionlp/cutext/main/: includes the main classes of CUTEXT as well as the file cutext.jar.
  • es/cnio/bionlp/cutext/out/fileTextHashTerms/: stores the text output files.
  • es/cnio/bionlp/cutext/out/serHashTerms/: stores serialized objects.
  • es/cnio/bionlp/cutext/postagger/: contains the class that invokes the tagger (TreeTagger).
  • es/cnio/bionlp/cutext/prepro/: contains the classes that preprocess the input corpus.
  • es/cnio/bionlp/cutext/properties/: contains the CUTEXT properties file.
  • es/cnio/bionlp/cutext/stemmer/: contains the classes that allow you to obtain the stem of the words.
  • es/cnio/bionlp/cutext/textmode/: contains the classes that allow CUTEXT to be executed from the terminal.
  • es/cnio/bionlp/cutext/util/: contains utility classes.

Usage


CUTEXT allows its execution in graphic mode or in text mode. In both cases, it is assumed that it will be executed from the "main" folder. If not, change the paths in the properties file "cutext.properties", and include the path of this file as an input parameter when invoking CUTEXT.

To execute CUTEXT in graphic mode:

java es.cnio.bionlp.cutext.main.ExecCutext

To execute CUTEXT in text mode:

java es.cnio.bionlp.cutext.main.ExecCutext -TM [Options] <-inputFile fileName>

Except for the input file, all options have default values, so it is not necessary to include them.

Options:

-TM
	Execute CUTEXT in text mode (TM).
-help
 	Show the line to execute CUTEXT, and the options.
-displayon 
	Show the messages at the standard output. Default TRUE (show).
-postagger 
	POS tagger used to tag the input file. TreeTagger (default) or GeniaTagger.
-language 
	SPANISH (default) or ENGLISH, CATALAN, GALICIAN.
-frecT 
	Frequency Threshold. Default 0.
-cvalueT 
	C-Value Threshold. Default 0.0.
-bioc 
	Create a BioC output. Default false.
-json 
	Create a JSON output. Default false.
-convert 
	If true then convert the input file into lower case. Default true.
-withoutcvalue 
	If true then execute only the linguistic filter. Default false.
-incremental 
	If true then treat each line of the file as an entire corpus. Default false.
-generateTextFile 
	If true then create one text file per hashTerms, from 'a' to 'z'. Also create a raw text file with terms sorted by cvalue. Default false.
-routeHashTerms 
	Folder where you want to store the hash terms.
-routeTextFileHashTerms 
	Folder where you want to store the text file hash terms.
-routeconfigfiles 
	Folder where the config files are stored.
-routeinterntt 
	Temporary folder (TT).
-inputFile 
	The document to use.
-outputFile 
	The file to write the result to.
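
To give an idea of what the -cvalueT threshold is compared against, here is a minimal sketch of the commonly cited C-value measure. For a candidate term a with |a| words and frequency f(a), the score is log2|a| · f(a) when a is not nested in a longer candidate, and log2|a| · (f(a) − average frequency of the longer candidates containing a) otherwise. This is an illustration under that textbook definition, not CUTEXT's own code: the class CValueSketch, its method names, and the sample frequencies are hypothetical, and the exact variant CUTEXT implements may differ (for instance, in how single-word terms are scored).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the C-value measure; not CUTEXT's implementation.
class CValueSketch {

    /** True if the words of inner occur as a contiguous run inside the longer term outer. */
    static boolean isNestedIn(String inner, String outer) {
        List<String> in = Arrays.asList(inner.split(" "));
        List<String> out = Arrays.asList(outer.split(" "));
        return in.size() < out.size() && Collections.indexOfSubList(out, in) >= 0;
    }

    /** C-value of term, given the frequency of every candidate term in the corpus. */
    static double cValue(String term, Map<String, Integer> freq) {
        int length = term.split(" ").length;
        double log2Length = Math.log(length) / Math.log(2);
        double ownFreq = freq.getOrDefault(term, 0);

        int containing = 0;   // number of longer candidates that contain the term
        int containingSum = 0; // summed frequency of those longer candidates
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (isNestedIn(term, e.getKey())) {
                containing++;
                containingSum += e.getValue();
            }
        }
        if (containing == 0) {
            return log2Length * ownFreq;
        }
        return log2Length * (ownFreq - (double) containingSum / containing);
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("basal cell carcinoma", 5);
        freq.put("cell carcinoma", 8);
        for (String term : freq.keySet()) {
            System.out.println(term + "\t" + cValue(term, freq));
        }
    }
}
```

In this sketch, "cell carcinoma" is penalized because part of its frequency comes from occurrences inside "basal cell carcinoma". A term would then be kept only if its frequency and its score reach the -frecT and -cvalueT thresholds; whether CUTEXT uses a strict or non-strict comparison is not stated here.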

Examples


Let's assume an input file "in.txt" in the folder "in". To execute CUTEXT in text mode on it:

java es.cnio.bionlp.cutext.main.ExecCutext -TM -generateTextFile true -inputFile ../in/in.txt

This generates the text files in the folder "out/fileTextHashTerms" and the serialized terms in "out/serHashTerms".

If you also want to obtain output in the BioC and JSON formats, execute CUTEXT with these parameters set to true, as in:

java es.cnio.bionlp.cutext.main.ExecCutext -TM -generateTextFile true -bioc true -json true -inputFile ../in/in.txt
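
If CUTEXT has to be called from another Java program, the command lines above can be assembled and launched with the standard ProcessBuilder class. The following is a rough sketch under the assumption that the "main" folder is used as the working directory and that PATH and CLASSPATH are already set as described in the prerequisites. The class CutextLauncherSketch and its methods are hypothetical names, not part of CUTEXT; only the options shown in the 'Usage' section are used.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper (not part of CUTEXT) that builds and runs the text-mode command.
class CutextLauncherSketch {

    /** Builds the text-mode command line, optionally requesting BioC and JSON output. */
    static List<String> buildCommand(String inputFile, boolean bioc, boolean json) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("es.cnio.bionlp.cutext.main.ExecCutext");
        cmd.add("-TM");
        cmd.add("-generateTextFile");
        cmd.add("true");
        if (bioc) {
            cmd.add("-bioc");
            cmd.add("true");
        }
        if (json) {
            cmd.add("-json");
            cmd.add("true");
        }
        cmd.add("-inputFile");
        cmd.add(inputFile);
        return cmd;
    }

    /** Runs the command from mainDir (assumed to be the "main" folder) and returns the exit code. */
    static int run(File mainDir, List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .directory(mainDir)
                .inheritIO() // forward CUTEXT's messages to this program's console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) to actually launch CUTEXT.
        System.out.println(String.join(" ", buildCommand("../in/in.txt", true, true)));
    }
}
```

Keeping the command construction separate from the launch makes it easy to log or inspect the exact invocation before running it.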

Execution via JAR file


The cutext.jar file makes it possible to execute CUTEXT directly from a terminal such as cmd, terminator, etc. To do this, type the following command line (from the directory where cutext.jar is located):

java -jar cutext.jar [options]

Here, the options are those shown in the 'Usage' section. For example, if we type:

java -jar cutext.jar

CUTEXT will run the graphical interface.

The cutext.jar file is located at src/es/cnio/bionlp/cutext/main/cutext.jar

If we move cutext.jar to another directory, we must change the properties file accordingly. The properties file is located at:

es/cnio/bionlp/cutext/properties/cutext.properties

Contact


If you have any questions, remarks, bug reports, bug fixes or extensions, I will be happy to hear from you.

Jesús Santamaría (jsantamaria@cnio.es)

License


(This is the so-called MIT/X License)

Copyright (c) 2017-2018 Secretaría de Estado para el Avance Digital (SEAD)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
