LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance where needed). The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. This means that the system works qu…


Layout Aware PDF (LAPDF) Extraction

Installation Instructions

This is a Maven project and should be installed by issuing the following commands:

$ git clone https://github.com/SciKnowEngine/lapdftext/
$ cd lapdftext
$ mvn clean install assembly:assembly

This will build the jar archive file: target/lapdftext-1.8.0-SNAPSHOT-jar-with-dependencies.jar

You can then run commands from this jar to perform extraction tasks on PDF files.

Command-line functionality

Executing commands against the assembled jar file takes the form:

java -cp path/to/lapdftext-1.8.0-SNAPSHOT-jar-with-dependencies.jar edu.isi.bmkeg.lapdf.bin.<COMMAND> options

where COMMAND is one of:

  • Blockify - constructs text blocks from PDF files and outputs them as XML-formatted files.
  • BlockifyClassify - runs Blockify and also applies rule-based classification to the blocks.
  • BlockStatistics - reports statistics about each block.
  • ExtractFigureImagesFromFile - extracts images of figures from PDF-based scientific articles.

Details of each command are described in its usage documentation, which is printed when the command is run without options.
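
As a rough sketch of the invocation pattern above, the following shell snippet loops over the four listed commands. If the jar has been built, it runs each command without options so that its usage text is printed; otherwise it only prints the command line it would have run. The `JAR` variable and the `lapdf_cmd` helper are illustrative names that are not part of LA-PDFText, and the jar path assumes the snippet is run from the lapdftext directory after the build step.

```shell
#!/bin/sh
# JAR defaults to the build output named in the installation instructions;
# the variable itself is an assumption made for this sketch.
JAR="${JAR:-target/lapdftext-1.8.0-SNAPSHOT-jar-with-dependencies.jar}"

# Hypothetical helper (not part of LA-PDFText): print the java invocation
# for one command class name, e.g. Blockify.
lapdf_cmd() {
    printf 'java -cp %s edu.isi.bmkeg.lapdf.bin.%s\n' "$JAR" "$1"
}

for cmd in Blockify BlockifyClassify BlockStatistics ExtractFigureImagesFromFile; do
    if [ -f "$JAR" ]; then
        # Jar exists: run the command with no options to print its usage text.
        java -cp "$JAR" "edu.isi.bmkeg.lapdf.bin.$cmd"
    else
        # Jar not built yet: only show the command line that would be run.
        lapdf_cmd "$cmd"
    fi
done
```

Setting `JAR` in the environment before running the snippet points it at a jar stored elsewhere.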
