KMCS-NII/mapPdfToXml

mapPdfToXml

A tool for extracting layout information from a PDF and embedding it into XML

mapPdfToXml is a tool that extracts document layout information from a PDF and embeds it into an XML document.

This tool generates a new XHTML document from an original XHTML document and a PDF document converted from that XHTML. Each element in the generated XHTML carries layout information extracted from the PDF: page number, position on the page, width, height, font name, font size, and color.

For example, with the XHTML at http://www.w3.org/TR/2002/REC-xhtml1-20020801/:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
:
:
<h2>
 <a name="status" id="status"></a>
 Status of this document
</h2>

<p>
 <em>
  This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
 </em>
</p>
:

and its saved-as-PDF file, the XHTML shown below will be generated. (Some line breaks and indentation have been added for easy comparison.)

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:pdf="http://kmcs.nii.ac.jp/#ns" xml:lang="en">
:
:
<h2>
 <a id="status" name="status"></a>
 <pdf:span pdf:boundaryid="59" pdf:boundarysequence="58" pdf:boundarytype="text" pdf:fontcolor="#000000" pdf:fontfamily="CZKWDF+LiberationSans-Bold" pdf:fontsize="13.5" pdf:height="16.5" pdf:left="60" pdf:page="1" pdf:text="Status of this document" pdf:top="583" pdf:width="166.5">Status of this document</pdf:span>
</h2>

<p>
 <em>
  <pdf:span pdf:boundaryid="60" pdf:boundarysequence="59" pdf:boundarytype="text" pdf:fontcolor="#000000" pdf:fontfamily="TCYTND+LiberationSans-Italic" pdf:fontsize="8.5" pdf:height="11" pdf:left="60" pdf:page="1" pdf:text="This section describes the status of this document at the time of its publication. Other documents may supersede" pdf:top="612.5" pdf:width="486">This section describes the status of this document at the time of its publication. Other documents may supersede </pdf:span>
  <pdf:span pdf:boundaryid="61" pdf:boundarysequence="60" pdf:boundarytype="text" pdf:fontcolor="#000000" pdf:fontfamily="TCYTND+LiberationSans-Italic" pdf:fontsize="8.5" pdf:height="11" pdf:left="60" pdf:page="1" pdf:text="this document. The latest status of this document series is maintained at the W3C." pdf:top="624.5" pdf:width="355">this document. The latest status of this document series is maintained at the W3C.</pdf:span>
 </em>
</p>
:

Installation

Ubuntu 14

  1. Save the line below as /etc/apt/sources.list.d/nii-xml-pdf.list. Administrative privileges are required.

     deb https://raw.githubusercontent.com/KMCS-NII/mapPdfToXml/master/ubuntu/14/packages ./
    
  2. Run the following commands as the root user:

     apt-get update
     apt-get install nii-xml-pdf
    

Enter "y" for the message below:

    WARNING: The following packages cannot be authenticated!
      nii-xml-pdf-kyotocabinet-perl nii-xml-pdf
    Install these packages without verification? [y/N]

CentOS 6, 7

  1. Run the following commands as the root user:

     rpm -i https://raw.githubusercontent.com/KMCS-NII/mapPdfToXml/master/centos/nii-xml-pdf-repo-1-1.noarch.rpm
     yum install --enablerepo=nii-xml-pdf nii-xml-pdf
    

Installing the Perl modules may take several minutes.

Others

  1. Copy sources/ to your system. If Subversion (svn) is available on your system, the command below creates a mapPdfToXml directory under the current directory.

     svn export https://github.com/KMCS-NII/mapPdfToXml/trunk/sources mapPdfToXml
    
  2. Install packages listed below:

  3. Add the absolute path of nii.xml-pdf (created by svn export) to the environment variable PERL5LIB.

     Example: env PERL5LIB=~/programs/mapPdfToXml/nii.xml-pdf/ ~/programs/mapPdfToXml/mapPdfToXml
    

Usage

Prepare an XML and a PDF file and type:

mapPdfToXml path/to/source/xml/(file|directory) path/to/source/pdf/(file|directory) path/to/destination/directory

If the destination directory does not exist, it is created automatically.
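
As a rough sketch, the invocation above could be scripted from Python like this. The helper names build_command and run are hypothetical, the sketch assumes mapPdfToXml is on PATH, and placing the -q and --json options (described under Miscellaneous Options) before the positional arguments is an assumption, not something this README states.

```python
import subprocess
from typing import List


def build_command(xml_path: str, pdf_path: str, dest_dir: str,
                  quiet: bool = False, as_json: bool = False) -> List[str]:
    """Assemble the argument list for one mapPdfToXml run (hypothetical helper)."""
    cmd = ["mapPdfToXml"]
    if quiet:
        cmd.append("-q")       # turn off console log output
    if as_json:
        cmd.append("--json")   # write the result as a .json file
    # Positional arguments: source XML, source PDF, destination directory.
    cmd += [xml_path, pdf_path, dest_dir]
    return cmd


def run(xml_path: str, pdf_path: str, dest_dir: str, **options: bool) -> None:
    """Run the tool and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(xml_path, pdf_path, dest_dir, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.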

Single pair processing mode

When both path/to/source/xml and path/to/source/pdf are paths to files, the generated XML is saved as path/to/destination/directory/pdf_file_name.xml. If a file already exists at that path, it is overwritten.

Directory processing mode

When both path/to/source/xml and path/to/source/pdf are paths to directories, XML and PDF files are paired by their relative path (without extension) and processed recursively.

The generated XML is saved under path/to/destination/directory at the same relative path, as pdf_file_name.xml. If a file already exists at that path, it is overwritten.

For example, an XML file placed at

(path/to/xml/directory)/2015/1/abc.xml

and a PDF file placed at

(path/to/pdf/directory)/2015/1/abc.pdf

are processed as a pair, and the result is saved as

(path/to/destination/directory)/2015/1/abc.xml
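
The pairing rule can be illustrated with a small Python sketch. It is not the tool's own code: the function name pair_by_relative_path is hypothetical, and for brevity it only looks at .xml and .pdf files.

```python
from pathlib import Path
from typing import List, Tuple


def pair_by_relative_path(xml_root: str, pdf_root: str,
                          dest_root: str) -> List[Tuple[Path, Path, Path]]:
    """Return (xml, pdf, destination) triples for files that share the same
    relative path once the extension is removed."""
    xml_base, pdf_base, dest_base = Path(xml_root), Path(pdf_root), Path(dest_root)
    # Index every PDF by its relative path without the extension.
    pdfs = {p.relative_to(pdf_base).with_suffix(""): p
            for p in pdf_base.rglob("*.pdf")}
    triples = []
    for xml in sorted(xml_base.rglob("*.xml")):
        key = xml.relative_to(xml_base).with_suffix("")
        pdf = pdfs.get(key)
        if pdf is None:
            continue  # no PDF at the same relative path, so no pair
        # The result keeps the relative directory and uses an .xml extension.
        triples.append((xml, pdf, dest_base / key.parent / (key.name + ".xml")))
    return triples
```

With the layout from the example above, the only triple ends in (destination)/2015/1/abc.xml.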

Miscellaneous Options

Log output suppression

-q : Turn off console log output.

Output format

--json : Output result as a JSON file (extension is .json).

Input file extension

--xml-extention=xyz : Process XML files whose extension matches .xyz under the directory (path/to/xml/directory).

By default, the tool processes files with the extensions .xml, .xhtml, or .html. If several files share the same relative path and differ only in extension (for example somefile.xml, somefile.xhtml, and somefile.html), the priority is xml > xhtml > html.
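
The default priority rule can be sketched as follows. The function name pick_by_priority is hypothetical, and treating extensions as case-sensitive is an assumption; the sketch only shows the selection logic, not how the tool implements it.

```python
from pathlib import Path
from typing import Dict, Iterable

# Default extensions in priority order: xml > xhtml > html.
PRIORITY = [".xml", ".xhtml", ".html"]


def pick_by_priority(paths: Iterable[str]) -> Dict[str, str]:
    """Map each extension-less relative path to the candidate file with the
    highest-priority extension."""
    chosen: Dict[str, str] = {}
    for p in paths:
        path = Path(p)
        if path.suffix not in PRIORITY:
            continue  # not one of the default input extensions
        key = path.with_suffix("").as_posix()
        current = chosen.get(key)
        # A lower index in PRIORITY means a higher priority.
        if current is None or (PRIORITY.index(path.suffix)
                               < PRIORITY.index(Path(current).suffix)):
            chosen[key] = p
    return chosen
```

Given somefile.html, somefile.xhtml, and somefile.xml, only somefile.xml is selected.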

XML/PDF mapping for directory processing mode

A mapping file makes it possible to process XML/PDF pairs with arbitrary file names. Each line of the mapping file gives the relative paths of one XML/PDF pair:

(path/to/XML_file)(TAB)(path/to/PDF_file)

XML/PDF file paths must include extensions. The generated XML file name is the basename of the PDF with an xml extension.

Specify the mapping file path with the --map-file option.

--map-file=(path of the mapping file)
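
A mapping file in this format could be produced with a short Python sketch like the one below. The function names are hypothetical, and the use of "\n" line endings and UTF-8 encoding is an assumption, since the README does not specify them.

```python
from typing import Iterable, Tuple


def format_mapping(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (xml_path, pdf_path) pairs as TAB-separated lines."""
    lines = []
    for xml_path, pdf_path in pairs:
        # A TAB inside a path would make the line ambiguous.
        if "\t" in xml_path or "\t" in pdf_path:
            raise ValueError("paths must not contain TAB characters")
        lines.append(f"{xml_path}\t{pdf_path}")
    return "\n".join(lines) + "\n"


def write_mapping(pairs: Iterable[Tuple[str, str]], map_file: str) -> None:
    """Write the mapping to map_file, for use with --map-file."""
    with open(map_file, "w", encoding="utf-8", newline="\n") as f:
        f.write(format_mapping(pairs))
```

For a pair such as ("a/one.xhtml", "b/first.pdf"), the rule above means the generated file would be named first.xml.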

Ignorable XML elements

Some XML elements, for instance head, might not have layout information in the PDF. Ignoring these elements improves the accuracy of the mapping between layout information and elements.

For example, to ignore title elements that have a class attribute with the value "foo" and all head elements, specify:

--skip-conditions=title:foo,head

Without the --skip-conditions command line argument, "head" is used as the default value. If you want to process "head" elements, specify an empty string:

--skip-conditions=
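
To make the option syntax concrete, here is a minimal Python sketch of how a value such as title:foo,head could be interpreted. It is an illustration only: the function names are hypothetical, and comparing the whole class attribute value for equality is an assumption about the tool's matching rule.

```python
from typing import List, Optional, Tuple

Condition = Tuple[str, Optional[str]]  # (element name, class value or None)


def parse_skip_conditions(value: str) -> List[Condition]:
    """Split 'title:foo,head' into [('title', 'foo'), ('head', None)]."""
    conditions: List[Condition] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # an empty string yields no conditions at all
        tag, sep, cls = item.partition(":")
        conditions.append((tag, cls if sep else None))
    return conditions


def should_skip(tag: str, class_attr: Optional[str],
                conditions: List[Condition]) -> bool:
    """Return True if an element matches any skip condition."""
    for cond_tag, cond_class in conditions:
        if tag == cond_tag and (cond_class is None or class_attr == cond_class):
            return True
    return False
```

Under this reading, a title element with class "bar" is still processed, while every head element is skipped.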

Result format

  • Most of the generated XML's DOM structure is a copy of the input XML file's DOM structure, with extra layout information extracted from the PDF.
  • This tool estimates mappings between PDF boundaries and XML text. A PDF boundary is a rectangular layout unit with a position and a size.
  • Each part of the text in the input XML that is mapped to a PDF boundary is represented as a pdf:span element. The URI for the namespace "pdf" is http://kmcs.nii.ac.jp/#ns.
  • A pdf:span element has the following attributes.
    • pdf:boundarytype : type of boundary; for now this value is always "text".
    • pdf:boundaryid : serial number of the boundary within the file, one-based.
    • pdf:boundarysequence : serial number of the boundary within the page, zero-based.
    • pdf:page : page number.
    • pdf:text : text.
    • pdf:left : left edge position.
    • pdf:top : top edge position.
    • pdf:width : width.
    • pdf:height : height.
    • pdf:fontcolor : font color in #RRGGBB format.
    • pdf:fontfamily : name of the font or font set. Optional. "+" is used in composite font names such as "EDLXCL+RyuminPro-Light-Identity-H".
    • pdf:fontsize : font size. Optional; the value is sometimes "0".
  • Positions and sizes are given in points (pt).
  • A PDF boundary whose counterpart is not found in the input XML is embedded as a pdf:span element that contains no text.
  • Each part of the text in the input XML whose counterpart is not found in the PDF is represented as a pdf:unmapped element.
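
A consumer of the generated file might read these attributes as in the Python sketch below. It uses only the standard library and the namespace URI given above; the function names and the returned dictionary keys are hypothetical, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

PDF_NS = "http://kmcs.nii.ac.jp/#ns"  # namespace URI of the "pdf" prefix


def _attr(element: ET.Element, name: str) -> Optional[str]:
    """Return the namespaced pdf:<name> attribute, or None if it is absent."""
    return element.get(f"{{{PDF_NS}}}{name}")


def _number(element: ET.Element, name: str) -> Optional[float]:
    """Return pdf:<name> as a float, or None if the attribute is absent."""
    value = _attr(element, name)
    return float(value) if value is not None else None


def read_boundaries(xml_text: str) -> List[Dict[str, object]]:
    """Collect layout data from every pdf:span element in document order."""
    root = ET.fromstring(xml_text)
    boundaries = []
    for span in root.iter(f"{{{PDF_NS}}}span"):
        boundaries.append({
            "id": _attr(span, "boundaryid"),
            "page": _attr(span, "page"),
            "left": _number(span, "left"),      # points
            "top": _number(span, "top"),        # points
            "width": _number(span, "width"),    # points
            "height": _number(span, "height"),  # points
            "font": _attr(span, "fontfamily"),  # optional, may be None
            "text": span.text or "",            # empty for unmatched boundaries
        })
    return boundaries


def read_unmapped(xml_text: str) -> List[str]:
    """Return the text of every pdf:unmapped element."""
    root = ET.fromstring(xml_text)
    return ["".join(e.itertext()) for e in root.iter(f"{{{PDF_NS}}}unmapped")]
```

Optional attributes come back as None rather than raising, which matches the "Optional" notes in the list above.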

Examples

Here is an example using the following input XHTML:

<span>ABCDEFG</span><span>HIJKLMN</span>

  • When an XHTML element and a PDF boundary have exactly the same text, pdf:span is inserted as a child of the XHTML element.

      <span><pdf:span>ABCDEFG</pdf:span></span><span><pdf:span>HIJKLMN</pdf:span></span>
    
  • When a sequence of XHTML elements has exactly the same text as a PDF boundary, pdf:span is inserted as a child of each XHTML element. The width of each pdf:span is proportional to its number of characters.

  • When a sequence of PDF boundaries has exactly the same text as an XML element, multiple pdf:span elements are inserted as children of the XHTML element.

      <span><pdf:span>ABC</pdf:span><pdf:span>DEFG</pdf:span></span><span><pdf:span>HIJKLMN</pdf:span></span>
    
  • Each part of the text in the input XML whose counterpart is not found in the PDF is represented as a pdf:unmapped element.

      <span><pdf:unmapped>ABCD</pdf:unmapped><pdf:span>EFG</pdf:span></span><span><pdf:span>HIJKLMN</pdf:span></span>
    
  • Each PDF boundary whose counterpart is not found in the XML is embedded, just after the previous mapped element, as a pdf:span element that contains no text.

      <span><pdf:span>ABCDEFG</pdf:span></span><pdf:span text="1234567"/><span><pdf:span>HIJKLMN</pdf:span></span>
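
For the case where several XHTML elements share one PDF boundary, the proportional width rule can be worked through in Python. This is a sketch of the arithmetic only: the function name split_width is hypothetical, and the boundary widths used with it are made-up illustration values.

```python
from typing import List


def split_width(boundary_width: float, texts: List[str]) -> List[float]:
    """Divide one boundary's width among text pieces in proportion to the
    number of characters in each piece."""
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        return [0.0 for _ in texts]  # nothing to distribute
    return [boundary_width * len(t) / total_chars for t in texts]
```

For instance, a boundary 140 pt wide that covers both ABCDEFG and HIJKLMN would give each span 70 pt, since both pieces have seven characters.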
    

License

This tool is released under the MIT License.

Copyright (c) 2015 National Institute of Informatics, Japan.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
