RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages.
This repository supports research on post-OCR error correction for low-resource Devanagari languages. The RoundTripOCR technique generates synthetic datasets that replicate real-world OCR errors, which can then be used to train and evaluate error correction models.
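As a rough illustration of the data generation idea, the sketch below assumes that "round trip" means rendering clean text to an image and then running OCR on that image, so that the noisy OCR output can be paired with the original text. The paper describes the actual pipeline. The function `round_trip_pairs` and its `render` and `recognize` parameters are hypothetical names introduced here for illustration and are not part of this repository.

```python
from typing import Any, Callable, Iterable, List, Tuple


def round_trip_pairs(
    sentences: Iterable[str],
    render: Callable[[str], Any],
    recognize: Callable[[Any], str],
) -> List[Tuple[str, str]]:
    """Build (ocr_output, ground_truth) pairs from clean sentences.

    `render` and `recognize` are placeholders (assumptions): plug in any
    text-to-image renderer and any OCR engine of your choice.
    """
    pairs = []
    for sentence in sentences:
        image = render(sentence)      # clean text -> image
        noisy = recognize(image)      # image -> OCR text, possibly with errors
        pairs.append((noisy, sentence))
    return pairs


if __name__ == "__main__":
    # Stand-in callables that only simulate a renderer and a noisy OCR engine.
    fake_render = lambda text: text
    fake_ocr = lambda image: image.replace("a", "o")
    print(round_trip_pairs(["data"], fake_render, fake_ocr))
```

Keeping the renderer and OCR engine as parameters means the same loop can be reused with different fonts or OCR engines without changing the pairing logic.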
Paper: https://aclanthology.org/2024.icon-1.33
The datasets generated using the RoundTripOCR technique are hosted on Hugging Face. Key features:
- Synthetic Data Generation: Mimics OCR errors in low-resource Devanagari scripts.
- Language Diversity: Covers six languages written in the Devanagari script: Marathi, Bodo, Sanskrit, Hindi, Konkani, and Nepali.
- High Quality: Carefully designed to capture typical OCR error patterns for training and evaluation purposes.
- Download the datasets from the provided links on Hugging Face.
- Use them to train, evaluate, or benchmark OCR error correction models.
- Incorporate the data into existing workflows for improving OCR accuracy in Devanagari scripts.
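For the evaluation step, one common way to score a correction model is character error rate (CER): the edit distance between the model output and the ground truth, divided by the ground-truth length. The sketch below is a minimal pure-Python version, not the evaluation script used in the paper. The function names are illustrative, and counting Unicode code points after NFC normalization is an assumption made here.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate over Unicode code points (an assumption)."""
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # The hypothesis is missing the final vowel sign of the reference.
    print(cer("नमस्त", "नमस्ते"))
```

Because Devanagari combines consonants, viramas, and vowel signs into visual clusters, a code-point CER counts a dropped vowel sign as one error. Scoring by grapheme cluster instead would give different numbers, so it is worth stating which unit is used when reporting results.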
If you use the RoundTripOCR datasets in your research or applications, please cite this work appropriately.
For more details, contributions, or support, feel free to reach out!