sandwich pdf-renderer is not working. Getting gibberish text from output PDF. #17

sanjay-nit · 2024-10-15T07:47:53Z

Hi, even with the --pdf-renderer=sandwich option, I get gibberish text when I select text in the output PDF and paste it elsewhere.

FYI: I'm using macOS 15.0.1 (M1).

Below are the steps I took.

pip install git+https://github.com/ocrmypdf/OCRmyPDF-EasyOCR.git
command: ocrmypdf --pdf-renderer sandwich image.pdf test.pdf

Version info:

easyocr==1.7.2
ocrmypdf==16.5.0
ocrmypdf-easyocr==0.2.1

Below are logs:

I'm attaching the PDF file I tested with.
image.pdf
ouput.pdf

I also tried processing an image version of this PDF; that doesn't work either.
command: ocrmypdf --pdf-renderer=sandwich --force-ocr --image-dpi 300 image.jpg test.pdf
image


jbarlow83 · 2024-10-15T08:43:31Z

The text found in the PDF is precisely what EasyOCR detected in this case, so it must be struggling to make sense of the multiple colors and formatting.

From the debug log, the gibberish is coming directly from EasyOCR:

[2024-10-15 10:36:54,935] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' in-image bbox: 479, 2, 662, 2, 662, 56, 479, 56
[2024-10-15 10:36:54,936] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' PDF bbox: 479, 882, 662, 882, 662, 828, 479, 828
[2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' in-image bbox: 387, 17, 459, 17, 459, 61, 387, 61
[2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' PDF bbox: 387, 867, 459, 867, 459, 823, 387, 823
[2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' in-image bbox: 362, 27, 817, 27, 817, 144, 362, 144
[2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' PDF bbox: 362, 857, 817, 857, 817, 740, 362, 740

That's a fantastic test image. Please report the issue to EasyOCR to see if they will address it.

Regular OCRmyPDF with Tesseract seems to work okay-ish.
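
The paired log lines above show each text line's quadrilateral twice: once in image coordinates (origin at the top-left, y increasing downward) and once in PDF coordinates (origin at the bottom-left, y increasing upward). Below is a minimal sketch of that conversion, assuming a page height of 884 units (inferred from the logged pairs, e.g. 2 and 882) and a 1:1 scale. The function name is hypothetical and this is not the plugin's actual code.

```python
from typing import List

# Assumption: inferred from the log, where image y=2 maps to PDF y=882.
PAGE_HEIGHT = 884


def image_quad_to_pdf(quad: List[int], page_height: int = PAGE_HEIGHT) -> List[int]:
    """Flip the y axis of a flat [x1, y1, x2, y2, ...] quad.

    x values are unchanged; each y becomes page_height - y.
    """
    return [
        value if index % 2 == 0 else page_height - value
        for index, value in enumerate(quad)
    ]


# In-image bbox of the first logged text line.
in_image = [479, 2, 662, 2, 662, 56, 479, 56]
pdf_quad = image_quad_to_pdf(in_image)
print(pdf_quad)  # [479, 882, 662, 882, 662, 828, 479, 828]
```

All three logged pairs fit this flip, which is consistent with the gibberish originating in recognition rather than in text placement.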

sanjay-nit · 2024-10-15T08:51:58Z

@jbarlow83 I'm not sure how this tool uses EasyOCR under the hood, but below is what I got when I ran EasyOCR separately on this image. On its own, EasyOCR gives correct results for this image.
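
For anyone wanting to reproduce this comparison, a minimal sketch of running EasyOCR directly on the image follows. It assumes the `easyocr` package is installed, uses the English model (`"en"`) in CPU mode, and takes `image.jpg` from the command earlier in the thread. The helper names are illustrative, and this is not necessarily how the result in this comment was produced.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (quad corner points, recognized text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def run_easyocr(image_path: str, languages: Sequence[str] = ("en",)) -> List[Detection]:
    """Run EasyOCR on one image and return its raw detections."""
    import easyocr  # imported lazily so the formatter below works without it

    reader = easyocr.Reader(list(languages), gpu=False)
    return reader.readtext(image_path)


def format_detections(detections: Sequence[Detection]) -> List[str]:
    """Render detections as 'confidence<TAB>text' lines for easy comparison."""
    return [f"{confidence:.2f}\t{text}" for _bbox, text, confidence in detections]


# Usage (requires easyocr; model weights are downloaded on first run):
#     for line in format_detections(run_easyocr("image.jpg")):
#         print(line)
```

Comparing these lines against the `Textline` entries in the OCRmyPDF debug log shows whether the two runs recognize different text for the same image.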

sanjay-nit · 2024-10-15T08:56:31Z

> The text found in the PDF is precisely what EasyOCR detected in this case, so it must be struggling to make sense of the multiple colors and formatting.
>
> From the debug log, the gibberish is coming directly from EasyOCR:
>
> [2024-10-15 10:36:54,935] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' in-image bbox: 479, 2, 662, 2, 662, 56, 479, 56
> [2024-10-15 10:36:54,936] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '|+_"' PDF bbox: 479, 882, 662, 882, 662, 828, 479, 828
> [2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' in-image bbox: 387, 17, 459, 17, 459, 61, 387, 61
> [2024-10-15 10:36:54,937] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline ']2' PDF bbox: 387, 867, 459, 867, 459, 823, 387, 823
> [2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' in-image bbox: 362, 27, 817, 27, 817, 144, 362, 144
> [2024-10-15 10:36:54,938] - ocrmypdf_easyocr._pdf -   DEBUG -    1  Textline '@_@:' PDF bbox: 362, 857, 817, 857, 817, 740, 362, 740
>
> That's a fantastic test image. Please report the issue to EasyOCR to see if they will address it.
>
> Regular OCRmyPDF with Tesseract seems to work okay-ish.

@jbarlow83
I appreciate your quick closure of this GitHub issue! 😄

alaminkouser · 2024-11-17T18:49:12Z

Set language to English.

import os
import ocrmypdf

# save_path: directory that holds input.pdf (defined elsewhere)
ocrmypdf.ocr(
    os.path.join(save_path, "input.pdf"),
    os.path.join(save_path, "output.pdf"),
    force_ocr=True,
    pdf_renderer="sandwich",
    language="eng",
)
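
The same options can also be passed on the command line. Here is a small sketch that builds the equivalent invocation, assuming the `-l` language flag together with the flags used earlier in the thread. The helper name and file names are placeholders.

```python
import shlex
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, language: str = "eng") -> List[str]:
    """Build an argument list mirroring the ocrmypdf.ocr() call above."""
    return [
        "ocrmypdf",
        "-l", language,
        "--pdf-renderer", "sandwich",
        "--force-ocr",
        input_pdf,
        output_pdf,
    ]


command = build_ocrmypdf_command("input.pdf", "output.pdf")
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```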

jbarlow83 closed this as completed Oct 15, 2024

jbarlow83 reopened this Oct 26, 2024
