iiif2annos

Reads a IIIF manifest, OCRs its images, creates AnnotationLists and adds them to a copy of the manifest.

This tool uses the Tesseract OCR engine. Ensure it is installed and on your $PATH before running ocr.py.
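A quick way to confirm the prerequisite is to look the binary up on $PATH from Python. This is a minimal sketch and not part of ocr.py; the helper name `find_tesseract` is hypothetical, and it assumes the executable is called `tesseract`.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path to the binary if it is on $PATH, else None.

    `find_tesseract` is a hypothetical helper for illustration only.
    """
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on $PATH; install it before running ocr.py")
    else:
        print(f"tesseract found at {path}")
```

If the lookup returns None, install Tesseract (and the language data you intend to pass with --lang) before running the tool.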

usage: ocr.py [-h] [--base-output-uri OUTPUTURI] [--lang LANG] [-c] manifest output

Read a manifest, OCR all the pages then adds the results as annotation lists

positional arguments:
  manifest              URL to Manifest file
  output                Output directory for annotation lists

options:
  -h, --help            show this help message and exit
  --base-output-uri OUTPUTURI
                        Output URI for annotations and annotation list
  --lang LANG           Language to pass to the OCR engine see: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
  -c, --confidence      Include OCR confidence value in text of the annotation?

This should work with both v2 and v3 manifests. For v2 manifests AnnotationLists are created; for v3 manifests AnnotationPages are created.
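One way to tell the two versions apart is to inspect the manifest's `@context`. The sketch below only illustrates the idea and is not necessarily how ocr.py does it; the function name `annotation_container_type` is hypothetical. It relies on the standard IIIF Presentation API context URIs and on `@context` being either a string or a list.

```python
# Standard IIIF Presentation API context URIs.
V2_CONTEXT = "http://iiif.io/api/presentation/2/context.json"
V3_CONTEXT = "http://iiif.io/api/presentation/3/context.json"


def annotation_container_type(manifest: dict) -> str:
    """Return the annotation container type matching the manifest version.

    Hypothetical helper: v2 manifests get "sc:AnnotationList",
    v3 manifests get "AnnotationPage".
    """
    context = manifest.get("@context", [])
    # @context may be a single string or a list of context URIs.
    contexts = [context] if isinstance(context, str) else list(context)
    if V3_CONTEXT in contexts:
        return "AnnotationPage"
    if V2_CONTEXT in contexts:
        return "sc:AnnotationList"
    raise ValueError("No IIIF Presentation v2 or v3 context found in manifest")


if __name__ == "__main__":
    print(annotation_container_type({"@context": V2_CONTEXT}))
    print(annotation_container_type({"@context": V3_CONTEXT}))
```

In v2 a canvas links to its AnnotationList through `otherContent`, while in v3 a canvas links to its AnnotationPage through `annotations`.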

Example

python iiif2annos/ocr.py --lang frk --base-output-uri http://localhost:5500/newspaper https://preview.iiif.io/cookbook/update_newspaper/recipe/0068-newspaper/newspaper_issue_1-manifest.json newspaper

Using these blogs as a guide: