A Python wrapper for extracting metadata, text, section titles, figures, and references from haptic device research papers. It uses PDFFigures 2.0 to extract figures and figure captions, and GROBID to extract references, section text, and section titles. It also provides a cross-reference function that finds connections between the input papers: which papers cite each other and how many times, and which authors and references they share.
For More Information: https://haptipediaextractor.readthedocs.io/en/latest/
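To make the cross-reference idea concrete, here is a minimal sketch of the kind of comparison it describes. It is not the project's implementation: the `cross_reference` function, the `authors` and `citations` field names, and the sample paper titles are all hypothetical, and the sketch assumes each paper has already been reduced to a set of author names plus a list of cited titles with one entry per in-text citation.

```python
from itertools import combinations


def cross_reference(papers):
    """Compare every pair of papers (hypothetical data layout).

    papers maps a title to {"authors": set of names,
                            "citations": list of cited titles,
                                         one entry per in-text citation}.
    """
    results = {}
    for a, b in combinations(sorted(papers), 2):
        cites_a = papers[a]["citations"]
        cites_b = papers[b]["citations"]
        results[(a, b)] = {
            # Authors that appear on both papers.
            "shared_authors": set(papers[a]["authors"]) & set(papers[b]["authors"]),
            # Works cited by both papers.
            "shared_references": set(cites_a) & set(cites_b),
            # How many times each paper cites the other.
            "a_cites_b": cites_a.count(b),
            "b_cites_a": cites_b.count(a),
        }
    return results


# Made-up sample input for illustration only.
sample = {
    "Paper A": {"authors": {"Ada", "Bo"}, "citations": ["Paper B", "Paper B", "Ref X"]},
    "Paper B": {"authors": {"Bo", "Cy"}, "citations": ["Ref X", "Ref Y"]},
}
connections = cross_reference(sample)
print(connections[("Paper A", "Paper B")])
```

Matching on exact titles keeps the sketch short; real reference lists usually need some normalization before titles can be compared reliably.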
- Set the input and output directories and any other settings in ConfigPaths.py
- Change directory to src and run main.py
- Python 3.5
- subprocess32 package (pip install subprocess32)
- psycopg2 (for connecting to the database)
- Requests library
- Clone the repository to your machine
- Have a GROBID service running in the background and reachable from that machine
GROBID is used to extract metadata, text, and citations from PDF files. It must be running as a service that this tool can reach. (See the GROBID GitHub project for complete installation instructions.)
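Before starting an extraction run, it can help to confirm that the GROBID service is actually reachable. The sketch below is one way to do that and is not part of this project: the helper names `isalive_url` and `is_grobid_alive` are hypothetical, and the base URL `http://localhost:8070` is an assumption based on GROBID's usual default port, so adjust it to wherever your service runs. It calls GROBID's `/api/isalive` health-check endpoint using only the standard library.

```python
import http.client
import urllib.request


def isalive_url(base_url):
    """Build the URL of GROBID's health-check endpoint from the service base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def is_grobid_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a GROBID service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or a malformed reply.
        return False


if __name__ == "__main__":
    # Assumed default address; change it if GROBID runs elsewhere.
    print("GROBID reachable:", is_grobid_alive())
```

Returning `False` instead of raising keeps the check easy to use as a guard at the top of a script.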
PDFFigures 2.0 is used to extract figures, tables, and captions from PDF files. Install it as directed on the pdffigures2 GitHub page. The path to the pdffigures2 binary can be configured in ConfigPaths.py.