sandwich pdf-renderer is not working. Getting gibberish text from output PDF. #17

Open
sanjay-nit opened this issue Oct 15, 2024 · 4 comments

Comments

@sanjay-nit

sanjay-nit commented Oct 15, 2024

Hi, even with the --pdf-renderer=sandwich option, I get gibberish when I select text in the output PDF and paste it elsewhere.

FYI: I'm using macOS 15.0.1 on an M1 Mac.

Below are the steps I took.

  1. pip install git+https://github.com/ocrmypdf/OCRmyPDF-EasyOCR.git
  2. command: ocrmypdf --pdf-renderer sandwich image.pdf test.pdf

Version info:

easyocr==1.7.2
ocrmypdf==16.5.0
ocrmypdf-easyocr==0.2.1

Below are logs:
image

I'm attaching the PDF file I tested with.
image.pdf
ouput.pdf

I also tried processing an image version of this PDF; that doesn't work either.
command: ocrmypdf --pdf-renderer=sandwich --force-ocr --image-dpi 300 image.jpg test.pdf
image

@jbarlow83
Contributor

The text found in the PDF is precisely what EasyOCR detected in this case, so it must be struggling to make sense of the multiple colors and formatting.

From the debug log, the gibberish is coming directly from EasyOCR:

[2024-10-15 10:36:54,935] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' in-image bbox: 479, 2, 662, 2, 662, 56, 479, 56
[2024-10-15 10:36:54,936] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' PDF bbox: 479, 882, 662, 882, 662, 828, 479, 828
[2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' in-image bbox: 387, 17, 459, 17, 459, 61, 387, 61
[2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' PDF bbox: 387, 867, 459, 867, 459, 823, 387, 823
[2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' in-image bbox: 362, 27, 817, 27, 817, 144, 362, 144
[2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' PDF bbox: 362, 857, 817, 857, 817, 740, 362, 740

That's a fantastic test image. Please report the issue to EasyOCR to see if they will address it.

Regular OCRmyPDF with Tesseract seems to work okay-ish.
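
The debug lines above can also be checked mechanically. The following is a minimal sketch, not the plugin's own code: it parses the quoted `Textline` lines and confirms that each PDF bbox is the in-image bbox with its y values flipped. The page height of 884 is inferred from these six lines only (2 + 882, 56 + 828, and so on), and the helper names `parse_bbox_lines` and `flip_y` are made up for illustration.

```python
import re

# Matches the "Textline '<text>' <kind> bbox: x, y, x, y, ..." part of a debug line.
LOG_LINE = re.compile(
    r"Textline '(?P<text>.*)' (?P<kind>in-image|PDF) bbox: (?P<coords>[\d., ]+)$"
)

SAMPLE = """\
DEBUG -    1  Textline '|+_"' in-image bbox: 479, 2, 662, 2, 662, 56, 479, 56
DEBUG -    1  Textline '|+_"' PDF bbox: 479, 882, 662, 882, 662, 828, 479, 828
DEBUG -    1  Textline ']2' in-image bbox: 387, 17, 459, 17, 459, 61, 387, 61
DEBUG -    1  Textline ']2' PDF bbox: 387, 867, 459, 867, 459, 823, 387, 823
DEBUG -    1  Textline '@_@:' in-image bbox: 362, 27, 817, 27, 817, 144, 362, 144
DEBUG -    1  Textline '@_@:' PDF bbox: 362, 857, 817, 857, 817, 740, 362, 740
"""


def parse_bbox_lines(log_text):
    """Return {text: {"in-image": [...], "PDF": [...]}} for every Textline entry."""
    boxes = {}
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if not match:
            continue
        coords = [float(c) for c in match["coords"].split(",")]
        boxes.setdefault(match["text"], {})[match["kind"]] = coords
    return boxes


def flip_y(coords, page_height):
    """Flip y values in an x, y, x, y, ... list (top-left origin to bottom-left)."""
    return [v if i % 2 == 0 else page_height - v for i, v in enumerate(coords)]


if __name__ == "__main__":
    PAGE_HEIGHT = 884  # assumption: inferred from the log, not read from the PDF
    for text, box in parse_bbox_lines(SAMPLE).items():
        consistent = flip_y(box["in-image"], PAGE_HEIGHT) == box["PDF"]
        print(repr(text), "placement consistent:", consistent)
```

If every line reports a consistent placement, the coordinate mapping is behaving as expected and the odd characters are in the recognized text itself, which matches the diagnosis above.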

@sanjay-nit
Author

@jbarlow83 I'm not sure how this tool uses EasyOCR under the hood, but below is what I got when I ran EasyOCR separately on this image. EasyOCR gives correct results on it.

image
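
For anyone wanting to reproduce a standalone run like the one in the screenshot, here is a rough sketch. The exact script and settings used above are not shown in this thread, so the language list `("en",)`, the file name `image.jpg`, and the helper names `read_text_lines` and `keep_confident` are assumptions for illustration. It relies on `easyocr.Reader(...).readtext(...)`, which by default returns `(bbox, text, confidence)` tuples.

```python
def keep_confident(results, min_confidence=0.0):
    """Reduce EasyOCR-style (bbox, text, confidence) tuples to (text, confidence)."""
    return [(text, conf) for _bbox, text, conf in results if conf >= min_confidence]


def read_text_lines(image_path, languages=("en",), min_confidence=0.0):
    """Run EasyOCR on one image and return the recognized lines."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    reader = easyocr.Reader(list(languages))  # assumption: English only
    return keep_confident(reader.readtext(image_path), min_confidence)


if __name__ == "__main__":
    # "image.jpg" stands in for the image attached to this issue.
    for text, conf in read_text_lines("image.jpg"):
        print(f"{conf:.2f}  {text}")
```

Comparing this output with the `Textline` entries in the plugin's debug log would show whether the two runs really see the same image with the same language settings.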

@sanjay-nit
Author

@jbarlow83
I appreciate your quick closure of this GitHub issue! 😄

@jbarlow83 jbarlow83 reopened this Oct 26, 2024
@alaminkouser

Set language to English.

ocrmypdf.ocr(os.path.join(save_path, "input.pdf"), os.path.join(save_path, "output.pdf"), force_ocr=True, pdf_renderer="sandwich", language="eng")
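
The same suggestion can be tried from the command line, as used in the original report. This is a sketch only: it assembles the reporter's command with `-l eng` added, and whether setting the language actually removes the gibberish is not confirmed in this thread. The helper names `build_command` and `run_ocr` are made up for illustration.

```python
import subprocess


def build_command(input_pdf, output_pdf, language="eng"):
    """Build an ocrmypdf command line with an explicit language."""
    return [
        "ocrmypdf",
        "-l", language,                # explicit language, as suggested above
        "--pdf-renderer", "sandwich",  # renderer from the original report
        "--force-ocr",
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(input_pdf, output_pdf, language), check=True)


if __name__ == "__main__":
    # Prints the command instead of running it, so it can be inspected first.
    print(" ".join(build_command("image.pdf", "test.pdf")))
```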
