Ground truth and models for 17th c. French prints.
- This repo is no longer updated. Please use the OCR17plus repo, which uses XML files rather than .png/.txt pairs.
For the OCR17plus repo, cf. here.
|-Models
   |-Kraken
   |-Calamari
|-Testing_data
   |-XIX
   |-XVI
   |-XVIII
|-Training_data
   |-72dpi
      |-Print_1
         |-extracted
         |-training_data
         |-README.md
         |-transcription.txt
      |-Print_2
   |-400dpi
   |-400dpi_MUFI
   |-600dpi
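Since the ground truth is stored as .png/.txt pairs, a loader mostly has to match each image with its transcription. The following is a minimal sketch, not the tooling used to build this repo: the function name `iter_pairs` is hypothetical, and it assumes that a transcription sits next to its image and shares its file stem, which may not hold for every folder.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(root: Path, text_suffix: str = ".txt") -> Iterator[Tuple[Path, Path]]:
    """Yield (image, transcription) pairs found anywhere under root."""
    for image in sorted(root.rglob("*.png")):
        # Assumption: same folder, same stem, only the extension differs.
        text = image.with_suffix(text_suffix)
        if text.is_file():
            yield image, text
        # Images without a matching transcription are silently skipped.
```

Calling `list(iter_pairs(Path("Training_data")))` would then return the usable pairs; change `text_suffix` if a folder uses another extension for its transcriptions.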
Most of the training data are taken from literary texts, and especially plays, printed throughout the 17th century. Each print is described in depth in its own folder.
Transcripts are almost diplomatic. Long ſ is maintained (plaiſir and not plaisir). Ligatures which have disappeared (ſt, st, ct) are not kept, unlike those that are maintained in contemporary French (œ, æ).
For certain prints only, Unicode and MUFI ligatures are maintained (folder 400dpi_mufi) for testing purposes. Ground truth is provided both with and without them.
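To illustrate the difference between the two flavours of ground truth, here is a hedged sketch that turns a transcription with ligatures into one without them, while leaving long ſ, œ and æ untouched. It is not the script used for this repo. The names `LIGATURES` and `strip_ligatures` are hypothetical, only two standard Unicode ligatures are listed, and MUFI private-use code points (for instance the ct ligature) have to be supplied by the caller through `extra`.

```python
from typing import Dict, Optional

# Standard Unicode ligatures only; MUFI private-use code points are not listed.
LIGATURES: Dict[str, str] = {
    "\uFB05": "ſt",  # LATIN SMALL LIGATURE LONG S T
    "\uFB06": "st",  # LATIN SMALL LIGATURE ST
}


def strip_ligatures(text: str, extra: Optional[Dict[str, str]] = None) -> str:
    """Replace ligature characters by their component letters.

    Keys of `extra` must be single characters (e.g. MUFI code points).
    Long s (ſ), œ and æ are not in the table, so they are preserved.
    """
    table = dict(LIGATURES)
    if extra:
        table.update(extra)
    return text.translate(str.maketrans(table))
```

An explicit table is used rather than Unicode NFKC normalisation because NFKC would also fold ſ into s, which contradicts the transcription rules above.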
@dataset{simon_gabay_2020_3826894,
author = {Simon Gabay},
title = {OCR17: GT for 17th French prints},
month = may,
year = 2020,
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.3826894},
url = {https://doi.org/10.5281/zenodo.3826894}
}
Please keep me posted if you use this data! simon.gabay[at]unige.ch
This work is licensed under a Creative Commons Attribution 4.0 International Licence.
Special thanks to Thibault Clérice for his magic xslt stylesheets (and many other things)!