Glossary of terms from the domain of image processing/OCR and how they are used within the OCR-D framework
This section is non-normative.
See Region
From the PAGE-XML content schema documentation
Border of the actual page (if the scanned image contains parts not belonging to the page).
Within OCR-D, font family refers to grouping elements by font similarity. The semantics of a font family are up to the data producer.
Within OCR-D, a glyph is the atomic unit within a word.
See Glyph
See TextLine
Reading order describes the logical sequence of regions within a document.
A region is described by a polygon inside a page.
The semantics or function of a region such as heading, page number, column, table...
See Glyph
A text line is a single row of words within a text region. (Depending on the region's or page's orientation, and the script's writing direction, it can be horizontal or vertical.)
From the PAGE-XML content schema documentation
Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
It contains all living elements (except marginalia) like paragraphs and headings, as well as footnotes, headings, running titles.
It does not contain pagenumber (if not part of running title), marginalia, signature mark, preview words.
A word is a sequence of glyphs within a line which does not contain any word-bounding whitespace. (That is, it includes punctuation and is synonymous with token in NLP.)
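As a minimal sketch of this definition, assuming plain whitespace is the only word boundary, a line can be split into words as follows. The function name `split_words` is made up for illustration and is not part of any OCR-D API.

```python
def split_words(line: str) -> list[str]:
    """Split a text line at whitespace; punctuation stays attached to its word."""
    return line.split()


words = split_words("Hello, world!  It works.")
print(words)
```

Note that "Hello," keeps its comma: in this sense a word is a whitespace-delimited token, not a dictionary entry.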
In the context of OCR-D, ground truth (GT) comprises transcriptions, specific structure descriptions and word lists. These are mainly available in PAGE XML format in combination with the original image. Essential parts of the GT were created manually.
We distinguish different usage scenarios for GT:
With the term reference data, we refer to data that illustrates different stages of an OCR/OLR process on representative materials. They are supposed to support the assessment of commonly encountered difficulties and challenges when running certain analysis operations and are therefore manually annotated at all levels.
Evaluation data are used to quantitatively evaluate the performance of OCR tools and/or algorithms. Parts of these data which correspond to the tool(s) under consideration are guaranteed to be recorded manually.
Many OCR-related tools need to be adapted to the specific domain of the works which are to be processed. This domain adaptation is called training. Data used to guide this process are called training data. It is essential that those parts of these data which are fed to the training algorithm are captured manually.
Binarization means converting all color or grayscale pixels in an image to either black or white.
Controlled term: binarized (comments of a mets:file), preprocessing/optimization/binarization (step in ocrd-tool.json)
See Felix Niklas' interactive demo
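To make the definition of binarization concrete, here is a minimal sketch using a fixed global threshold on a grayscale image stored as nested lists. The threshold of 128 and the function name `binarize` are assumptions for illustration; practical tools typically choose the threshold adaptively rather than using a fixed value.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0..255) to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# Tiny made-up grayscale image: two rows of three pixels.
image = [
    [12, 200, 130],
    [127, 128, 255],
]
binary = binarize(image)
print(binary)
```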
Manipulate an image in such a way that all text lines are straightened and any geometrical distortions have been corrected.
Controlled term: preprocessing/optimization/dewarping
See Matt Zucker's entry on Dewarping.
Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove "salt-and-pepper" noise resulting from Binarization.
Controlled term: preprocessing/optimization/despeckling
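One very simple way to remove "salt-and-pepper" noise can be sketched as follows, assuming a binarized image as nested lists where 0 is black and 255 is white: any black pixel without a black neighbour is turned white. This is an illustrative heuristic only, not the method of any particular OCR-D processor, and the name `despeckle` is made up.

```python
def despeckle(binary):
    """Turn black pixels (0) that have no black 8-neighbour into white (255)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    result = [row[:] for row in binary]  # copy, so the input stays untouched
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue
            has_black_neighbour = any(
                binary[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                result[y][x] = 255
    return result


noisy = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],  # isolated speck
    [255, 255, 255, 0],  # part of a two-pixel stroke
    [255, 255, 255, 0],
]
clean = despeckle(noisy)
```

The speck disappears while the connected stroke survives, which is the intended effect of despeckling on a small scale.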
Rotate an image so that all text lines are horizontal.
Controlled term: preprocessing/optimization/deskewing
Detect the font type(s) used in the document, either before or after an OCR run.
Controlled term: recognition/font-identification
ISSUE: #41
Controlled term: gray_normalized (comments in file), preprocessing/optimization/cropping (step)
Gray normalization is similar to binarization but instead of a purely bitonal image, the output can also contain shades of gray to avoid inadvertently combining glyphs when they are very close together.
Document analysis is the detection of structure on the document level to e.g. create a table of contents.
Detect the reading order of regions.
Detecting the print space in a page, as opposed to the margins. It is a form of region segmentation.
Controlled term: preprocessing/optimization/cropping.
See Cropping
Segmentation means detecting areas within an image.
Specific segmentation algorithms are labelled by the semantics of the regions they detect, not the semantics of the input, i.e. an algorithm that detects regions is called region segmentation.
Segment an image into regions. Also determines whether each region is a text or non-text region (e.g. images).
Controlled term: SEG-REGION (USE), layout/segmentation/region (step)
Determine the type of a detected region.
Segment text regions into textlines.
Controlled term: SEG-LINE (USE), layout/segmentation/line (step)
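As a rough illustration of line segmentation, the sketch below splits a binarized text region into lines by looking for runs of pixel rows that contain ink (a horizontal projection profile). It assumes horizontal, well-separated, deskewed lines and an image given as nested lists with 0 for black; the name `segment_lines` is hypothetical and real segmenters are considerably more robust.

```python
def segment_lines(binary):
    """Return (top, bottom) row ranges, bottom exclusive, of runs of rows with ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(px == 0 for px in row)
        if has_ink and start is None:
            start = y  # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # the current line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the bottom edge
    return lines


W, B = 255, 0
region = [
    [W, W, W],
    [B, W, B],
    [W, B, W],
    [W, W, W],
    [B, B, W],
]
print(segment_lines(region))
```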
See OCR.
Map pixel areas to glyphs and words.
Controlled term: SEG-LINE (USE), layout/segmentation/word (step)
Segment a textline into glyphs.
Controlled term: SEG-GLYPH
See OCR.
Text optimization encompasses the manipulations to the text based on the steps up to and including text recognition. This includes (semi-)automatically correcting recognition errors, orthographical harmonization, fixing segmentation errors etc.
The software repository contains all OCR-D algorithms and tools developed during the project including tests. It will also contain the documentation and installation instructions for deploying a document analysis workflow.
Contains all the ground truth data.
The research data repository may contain the results of all activities during document analysis. At a minimum, it contains the end results of every processed document and its full provenance. The research data repository must be available locally.
Contains all trained (OCR) models for text recognition. The model repository has to be available at least locally. Ideally, a publicly available model repository will be developed.
A workspace is a representation of a document in the local file system. Minimally, it consists of a directory with a copy of the METS file. Additionally, that directory may contain physical data files and sub-directories belonging to the document (required or generated by run-time OCR-D processing), as referenced by the METS via mets:file/mets:FLocat/@href and mets:fileGrp/@USE. Files and sub-directories without such a reference (like log or config files) are not part of the workspace, nor are references to remote locations. They can be added to the workspace by referencing them in the METS via their relative local path names.
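The relation between the METS file and the workspace contents can be sketched with the Python standard library alone. The snippet below lists, per mets:fileGrp/@USE, the locally referenced files and skips remote references. The sample METS, its IDs, group names and paths are invented for illustration, the `"://"` check for remote locations is a simplifying assumption, and OCR-D/core provides its own classes for this purpose, which are not used here.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical minimal METS; all IDs, group names and paths are made up.
METS_XML = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                         xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-REMOTE">
      <mets:file ID="IMG_0002" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/0002.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def local_files(mets_xml):
    """Map each fileGrp USE to the relative local paths it references."""
    root = ET.fromstring(mets_xml)
    files = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        use = grp.get("USE")
        for flocat in grp.iterfind("mets:file/mets:FLocat", NS):
            href = flocat.get("{http://www.w3.org/1999/xlink}href")
            if href is None or "://" in href:
                continue  # remote references are not part of the workspace
            files.setdefault(use, []).append(href)
    return files


print(local_files(METS_XML))
```

Only the file under OCR-D-IMG is reported, since the second entry points to a remote location.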
The OCR-D project divided the various elements of an OCR workflow into six abstract modules.
Manipulating the input images for subsequent layout analysis and text recognition.
Detection of structure within the page.
Recognition of text and post-correction of recognition errors.
Generating data files from aligned ground truth text and images to configure the prediction of text and layout recognition engines.
Storing results of OCR and OLR indefinitely, taking into account versioning, multiple runs, provenance/parametrization and providing access to these saved snapshots in a granular fashion.
Providing measures, algorithms and software to estimate the quality of the individual processes within the OCR-D domain.
Application composed of various servers that can execute processors; can be a desktop computer or workstation, a distributed system comprising a controller and multiple processing servers, or an HPC cluster.
As proposed in OCR-D/spec#173, the OCR-D Web API defines uniform and interdependent services that can be distributed across network components, depending on the use case.
Group of endpoints of the OCR-D Web API; discovery/workspace/processing/workflow/...
Concrete implementation of a subset of OCR-D services, or the network host providing it.
OCR-D Server (implementing at least discovery, workspace and workflow services) executing workflows (a single workflow or multiple workflows simultaneously), distributing tasks to configured processing servers and managing workspace data. Should also manage load balancing.
OCR-D Server (implementing at least discovery and processing services) that can execute one or more (locally installed) processors or evaluators and manages workspace data. Implementers should consider whether a single OCR-D processing server (with page-parallel processing) best fits the use case, or multiple OCR-D processing servers (with document-parallel processing), or even dedicated OCR-D processing servers with GPU/CUDA support.
Software component of a server concerned with network operations; e.g. Python library with request handlers, implementing service discovery and network-capable workspace data management.
Software component of a server or processor concerned with OCR systems modelling; e.g. Python library in OCR-D/core providing classes for all essential functional components (OcrdPage, OcrdMets, Workspace, Resolver, Processor, ProcessorTask, Workflow, WorkflowTask...), including mechanisms for signalling and orchestration of workflows, on top of which components (from processor to controller) can be implemented.
Central software component of the controller, executing workflows, including control structures (in a linear/parallel/incremental way). Also needed in single-host CLI deployments (where it can be based on inter-process communication and file system I/O alone), like ocrd process.
A processor is a tool that implements the uniform OCR-D command-line-interface for run-time data processing. That is, it executes a single workflow step, or a combination of multiple workflow steps, on the workspace (represented by local METS), reading input files for all or requested physical pages of the input fileGrp(s), and writing output files for them into the output fileGrp(s). It may take a number of optional or mandatory parameters.
An evaluator is a tool that implements the uniform OCR-D CLI for run-time quality estimation, assessing an activity's annotation (i.e. a processor's output) with some quality metric to yield a score and applying a given threshold against it to signal full or partial success/failure.
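The score-and-threshold idea behind an evaluator can be sketched as follows. The choice of character error rate (CER) as the quality metric, the default threshold of 0.05, and the function names are assumptions made for this example; the definition above does not prescribe a particular metric.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(ocr_text, gt_text, max_cer=0.05):
    """Return (score, passed): the CER against ground truth and the threshold verdict."""
    cer = levenshtein(ocr_text, gt_text) / max(len(gt_text), 1)
    return cer, cer <= max_cer


score, passed = evaluate("Tbe quick fox", "The quick fox", max_cer=0.1)
print(f"CER={score:.3f} passed={passed}")
```

A workflow could then use the boolean verdict to decide whether to continue, retry with other parameters, or abort.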
Software package/repository providing one or more processors or evaluators, possibly encompassing additional areas of functionality (training, format conversion, creation of GT, visualization)
Modules can comprise multiple methods/activities that are called processors for OCR-D. There were eight module projects (MP) in the second phase of OCR-D (2018-2020).
Messaging service on the basis of Publish/Subscribe architecture (or similar) to coordinate network components, in particular for the distribution of tasks and load balancing, as well as signalling processor/evaluator results.
Combination of activities via concrete processors and evaluators and their parameterization, configured as a sequence or lattice, depending on their success or failure. Implemented in the OCR-D Workflow Runtime Library and serializable in a yet-to-be-specified format (as of 2020/10).
The term Workflow is understood to encompass more features in other contexts, such as manual intervention by the user. In contrast to the terminology in workflow engines like Taverna or digitization frameworks like Kitodo, an OCR-D workflow is a fully automatic process.
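A purely sequential workflow in this sense can be sketched as a list of steps, each pairing a processor with an evaluator, where execution stops at the first failed evaluation. Everything here is hypothetical: the workspace is modelled as a plain dict of file groups, and `run_workflow`, `add_group` and `has_group` are stand-ins, not part of the OCR-D Workflow Runtime Library.

```python
def run_workflow(steps, workspace):
    """Run (name, processor, evaluator) steps in order; stop at the first failure."""
    log = []
    for name, processor, evaluator in steps:
        workspace = processor(workspace)
        ok = evaluator(workspace)
        log.append((name, ok))
        if not ok:
            break  # fully automatic: no manual intervention, just stop
    return workspace, log


def add_group(name):
    """Stand-in processor: adds one output file group to the workspace."""
    def processor(workspace):
        return {**workspace, name: [f"{name}/0001"]}
    return processor


def has_group(name):
    """Stand-in evaluator: succeeds if the given file group is non-empty."""
    return lambda workspace: bool(workspace.get(name))


steps = [
    ("binarize", add_group("BIN"), has_group("BIN")),
    ("segment", add_group("SEG"), has_group("MISSING")),  # fails on purpose
    ("recognize", add_group("OCR"), has_group("OCR")),
]
result, log = run_workflow(steps, {"IMG": ["IMG/0001"]})
print(log)
```

A lattice-shaped workflow would replace the `break` with a branch to an alternative step, for example retrying with different parameters.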