RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

harshvivek14/RoundTripOCR

RoundTripOCR

RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages.

This repository supports research on post-OCR error correction, with a focus on low-resource languages written in the Devanagari script. The RoundTripOCR technique generates synthetic datasets that replicate real-world OCR errors, which can be used to train and evaluate error correction models.


Paper link - https://aclanthology.org/2024.icon-1.33


Dataset Links

The following datasets have been generated using the RoundTripOCR technique. They are hosted on Hugging Face:

Features

  • Synthetic Data Generation: Mimics OCR errors in text from low-resource languages written in the Devanagari script.
  • Language Diversity: Covers six languages written in Devanagari: Marathi, Bodo, Sanskrit, Hindi, Konkani, and Nepali.
  • High Quality: Designed to capture typical OCR error patterns for training and evaluation.

Usage

  1. Download the datasets from the provided links on Hugging Face.
  2. Use them to train, evaluate, or benchmark OCR error correction models.
  3. Incorporate the data into existing workflows for improving OCR accuracy in Devanagari scripts.
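
As a rough illustration of step 2, the sketch below scores OCR output (or a model's corrected output) against ground-truth text using character error rate (CER). This README does not specify an evaluation metric or a dataset schema, so the choice of CER, the function names `edit_distance` and `cer`, and the sample strings are all assumptions made for illustration.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypotheses, references) -> float:
    """Corpus-level CER: total edits divided by total reference length."""
    total_edits = 0
    total_chars = 0
    for hyp, ref in zip(hypotheses, references):
        # Normalize so equivalent Unicode sequences compare equal.
        hyp = unicodedata.normalize("NFC", hyp)
        ref = unicodedata.normalize("NFC", ref)
        total_edits += edit_distance(hyp, ref)
        total_chars += len(ref)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical pair: the OCR output dropped one vowel sign.
ocr_output = ["भरत"]
ground_truth = ["भारत"]
print(cer(ocr_output, ground_truth))  # 0.25
```

Note that this counts Unicode code points, not grapheme clusters, so a conjunct or a consonant with a vowel sign contributes more than one character to the total. Comparing CER before and after correction on the same test pairs gives a simple measure of how much a model helps.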

Citation

If you use the RoundTripOCR datasets in your research or applications, please cite the paper linked above.


For more details, contributions, or support, feel free to reach out!
