Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. (Source: The PROV Data Model (PROV-DM))
The PROV Data Model (PROV-DM) is used to store all provenance metadata.
All provenance must be stored in files.
Provenance is recorded at the page level as well as at the document level.
All files regarding provenance are stored in the subfolder 'metadata'.
The workflow provenance is stored in PROV-XML.
All activities and entities belonging to the OCR-D workflow share the same namespace.
- Prefix: ocrd
- Namespace: http://www.ocr-d.de
Type | Data Type | Description |
---|---|---|
Entity | ocrd:mets | Filename of METS file |
Entity | ocrd:mets_referencedFile | ID of the file referenced inside METS. |
Entity | ocrd:parameter_file | Content of the parameter file. |
Activity | ocrd:processor | Processor that was executed |
Activity | ocrd:workflow | Workflow that was executed |
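The data types above can be attached to PROV-XML records as `prov:type` values. The following is a minimal sketch using Python's standard library, not a normative serialization: the identifiers `ocrd:mets_0000` and `ocrd:step_0001` and the helper names `q`, `add_typed` and `build_example` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
OCRD_NS = "http://www.ocr-d.de"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Prefixes for namespaces that occur in element or attribute names.
ET.register_namespace("prov", PROV_NS)
ET.register_namespace("xsi", XSI_NS)


def q(ns: str, tag: str) -> str:
    """Return a qualified name in ElementTree's {uri}local notation."""
    return f"{{{ns}}}{tag}"


def add_typed(parent: ET.Element, kind: str, identifier: str, data_type: str) -> ET.Element:
    """Append a prov:entity or prov:activity that carries an ocrd:* prov:type."""
    node = ET.SubElement(parent, q(PROV_NS, kind), {q(PROV_NS, "id"): identifier})
    type_el = ET.SubElement(node, q(PROV_NS, "type"), {q(XSI_NS, "type"): "xsd:QName"})
    type_el.text = data_type
    return node


def build_example() -> ET.Element:
    root = ET.Element(q(PROV_NS, "document"))
    # ElementTree only declares namespaces used in tag or attribute names.
    # The ocrd and xsd prefixes appear only inside values, so they are
    # declared by hand here.
    root.set("xmlns:ocrd", OCRD_NS)
    root.set("xmlns:xsd", XSD_NS)
    add_typed(root, "entity", "ocrd:mets_0000", "ocrd:mets")          # hypothetical ID
    add_typed(root, "activity", "ocrd:step_0001", "ocrd:processor")   # hypothetical ID
    return root


xml_text = ET.tostring(build_example(), encoding="unicode")
print(xml_text)
```

Declaring the `ocrd` prefix explicitly on the root element keeps values such as `ocrd:mets` resolvable for consumers that interpret them as qualified names.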
Only the following information is stored for provenance:
(a) General data
- Workflow engine
- Label including version
- Start date
- End date
(b) Processor data
- Processor
- Label including version, conforming to OCR-D mets:agent/mets:name (e.g.: ocrd-kraken-binarize_Version 0.1.0, ocrd/core 1.0.0)
- Start date
- End date
- Content of METS file before executing the processor
- Content of METS file after executing processor
- ID of the input file(s)
- ID of output file(s)
- Content of parameter.json (optional)
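The items listed under (a) and (b) can be pictured as two plain records. This is an illustrative sketch only: the class and field names are hypothetical, and the check that the end date does not precede the start date is an added assumption, not a requirement stated above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class WorkflowRecord:
    """(a) General data about the workflow engine."""
    label: str        # label including version
    start: datetime
    end: datetime


@dataclass
class ProcessorRecord:
    """(b) Data recorded for a single processor run."""
    label: str                  # conforming to mets:agent/mets:name
    start: datetime
    end: datetime
    mets_before: str            # content of the METS file before the run
    mets_after: str             # content of the METS file after the run
    input_file_ids: List[str]   # mets:file/@ID of the input file(s)
    output_file_ids: List[str]  # mets:file/@ID of the output file(s)
    parameters: Optional[str] = None  # content of parameter.json, if any

    def __post_init__(self) -> None:
        # Assumption for this sketch: a run cannot end before it starts.
        if self.end < self.start:
            raise ValueError("end date precedes start date")
```

Keeping `parameters` optional mirrors the "(optional)" marker on the parameter.json item.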
All files referenced in METS must also be referenced in provenance by their mets:file/@ID.
A file may be linked to its location (URL). The location may be replaced depending on how the file is used:
- local files
- external files
All files not referenced in METS (e.g. parameter.json) must be linked to their content in provenance.
Before ingestion into a repository/LTA at the latest, the entire provenance must be stored in a single file (metadata/ocrd_provenance.xml) so that it is searchable. To this end, all provenance files are merged into one file, which replaces all provenance files stored in the subfolder 'metadata'.
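The merge step could be sketched as follows. This is a hedged illustration rather than the OCR-D implementation: the function name `merge_provenance` is hypothetical, the glob pattern `provenance_*.xml` is an assumption about how the per-workflow files are named, and each input is assumed to be a PROV-XML file with a `prov:document` root.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

PROV_NS = "http://www.w3.org/ns/prov#"
OCRD_NS = "http://www.ocr-d.de"
ET.register_namespace("prov", PROV_NS)


def merge_provenance(metadata_dir: Path) -> Path:
    """Merge all provenance_*.xml files into metadata_dir/ocrd_provenance.xml."""
    target = metadata_dir / "ocrd_provenance.xml"
    merged = ET.Element(f"{{{PROV_NS}}}document")
    # ElementTree drops prefix declarations that are only used inside values,
    # so the ocrd prefix is declared again by hand on the merged root.
    merged.set("xmlns:ocrd", OCRD_NS)

    parts = sorted(metadata_dir.glob("provenance_*.xml"))
    sources = ([target] if target.exists() else []) + parts
    for source in sources:
        # Copy every record (entity, activity, relation) of the source document.
        merged.extend(ET.parse(source).getroot())

    ET.ElementTree(merged).write(target, encoding="utf-8", xml_declaration=True)
    for part in parts:
        part.unlink()  # the merged file replaces the individual files
    return target
```

The sketch concatenates records without deduplicating them; whether identical records from different files should be collapsed is left open here.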
The file structure could look like this after a workflow with 4 steps has been executed.
metadata/
|
+-- mets.xml.'workflowid'_0000
|
+-- mets.xml.'workflowid'_0001
|
+-- mets.xml.'workflowid'_0002
|
+-- mets.xml.'workflowid'_0003
|
+-- mets.xml.'workflowid'_0004
|
+-- ocrd_provenance.xml
|
+-- provenance_'workflowid'.xml (optional)
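The numbering in the listing above suggests one METS snapshot before the first step plus one after each step, so a workflow with 4 steps yields 5 snapshots. A small sketch of that naming scheme, under the assumption that the counter is a zero-padded four-digit step index; the function names and the workflow ID `wf1` are hypothetical:

```python
from typing import List, Tuple


def mets_snapshot_names(workflow_id: str, steps: int) -> List[str]:
    """Return the snapshot file names for a workflow with the given number of steps."""
    if steps < 0:
        raise ValueError("steps must not be negative")
    # Index 0 is the METS file before the first step; index n is the state after step n.
    return [f"mets.xml.{workflow_id}_{i:04d}" for i in range(steps + 1)]


def snapshots_for_step(workflow_id: str, step: int) -> Tuple[str, str]:
    """Return the (before, after) snapshot names for a 1-based step number."""
    if step < 1:
        raise ValueError("step numbers start at 1")
    return (
        f"mets.xml.{workflow_id}_{step - 1:04d}",
        f"mets.xml.{workflow_id}_{step:04d}",
    )


for name in mets_snapshot_names("wf1", 4):
    print(name)
```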
The provenance MAY be stored as a tag directory in the BagIt container, e.g.:
<base directory>/
|
+-- bagit.txt
|
+-- manifest-<algorithm>.txt
|
+-- [additional tag files]
|
+-- data/
| |
| +-- mets.xml
| |
| +-- ...
|
+-- metadata
    |
    +-- mets.xml.'workflowid'_0000
    |
    +-- ...
    |
    +-- mets.xml.'workflowid'_XXXX
    |
    +-- ocrd_provenance.xml