
Default pdftotext to UTF-8: Issue 1582 #103

Closed
wants to merge 1 commit into dev from issue-1582

Conversation

seth-shaw-unlv (Contributor)

GitHub Issue: Islandora/documentation#1582

What does this Pull Request do?

Changes the default output of pdftotext from Latin-1 to UTF-8.

What's new?

  • Added -enc UTF-8 as a parameter to pdftotext.
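
For context, the change makes Hypercube's call equivalent to the following shell invocation (a sketch based on the diff below; Hypercube pipes the PDF in on stdin and reads the text from stdout, and the filenames here are illustrative):

# Extract text from a PDF read on stdin, writing UTF-8 text to stdout
pdftotext -enc UTF-8 - - < document.pdf > extracted.txt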

How should this be tested?

  • Take an existing Islandora 8 site and add a Digital Document node/media with a PDF original file that includes non-Latin characters. I used https://fsi-languages.yojik.eu/languages/PeaceCorps/Arabic-Moroccan/MO_Arabic_Language_Lessons.pdf as my test case.
  • Confirm the text was extracted by looking at the node's extracted text media; without the fix, none of the non-Latin characters will appear.
  • Apply the PR.
  • Run the Extract Text action on the Digital Document node to regenerate the extracted text.
  • Look at the extracted text media again; you should now see the non-Latin characters in your output (a quick command-line check is sketched below).
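
To double-check the encoding from the command line, you can run file on the downloaded extracted-text file (the filename below is illustrative):

# Should report "UTF-8 Unicode text" once extraction uses -enc UTF-8
file "Extracted Text.txt"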

Interested parties

@Islandora/8-x-committers

codecov bot commented Aug 18, 2020

Codecov Report

Merging #103 into dev will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##                dev     #103   +/-   ##
=========================================
  Coverage     94.85%   94.85%           
  Complexity      168      168           
=========================================
  Files             9        9           
  Lines           700      700           
=========================================
  Hits            664      664           
  Misses           36       36           
Impacted Files Coverage Δ Complexity Δ
Hypercube/src/Controller/HypercubeController.php 100.00% <100.00%> (ø) 5.00 <0.00> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f870cb1...97ea158.

@@ -78,7 +78,7 @@ public function get(Request $request)
         $this->log->debug("Got Content-Type:", ['type' => $content_type]);
 
         if ($content_type == 'application/pdf') {
-            $cmd_string = $this->pdftotext_executable . " $args - -";
+            $cmd_string = $this->pdftotext_executable . " $args -enc UTF-8 - -";

jordandukart (Member) commented Aug 19, 2020

Seems to make sense. Reading through pdftotext's docs, any idea why they defaulted to Latin-1?

seth-shaw-unlv (Contributor, Author)

I went back to look and this appears to be a version issue. Older versions default to Latin-1, probably because they predate broad Unicode support, while newer versions default to UTF-8.

Oddly, the man page on a default site says UTF-8 is the default, but running pdftotext with the version flag there shows an older version that still requires the -enc flag.
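
For anyone comparing boxes, the installed version is easy to check (a standard pdftotext flag; output varies by install):

# Print the installed pdftotext/poppler version
pdftotext -v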

Member

Had a peek through the available encodings and what may or may not be available on the OS. I know we've run into situations in PHP where things like LC_ALL and UTF-8 aren't available, but I can't find a reference to that in the pdftotext context, so this gets a 👍 from me.
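
For reference, two quick checks on a given box (standard pdftotext and locale tools, nothing Islandora-specific):

# List the text encodings this pdftotext build supports
pdftotext -listenc
# List installed locales and filter for UTF-8 variants
locale -a | grep -i utf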

mjordan (Contributor) commented Aug 21, 2020

I can't replicate the problem. In other words, all PDFs that I test have the non-Latin characters intact.

I am on the default dev branch:

vagrant@islandora8:/var/www/html/Crayfish$ git status
On branch dev
Your branch is up to date with 'origin/dev'.

nothing to commit, working tree clean

I have tried PDFs that contain Traditional and Simplified Chinese, Hebrew, and even the test PDF linked above. They all work. Here's some extracted Arabic text from the test PDF above (screenshot of rendering its Extracted Text.txt):

[screenshot: arabic_text]

Am I doing something wrong?

mjordan (Contributor) commented Aug 21, 2020

I'll add that when I run file on the extracted text file, it says it's UTF-8:

file "7-Extracted Text.txt" 
7-Extracted Text.txt: UTF-8 Unicode text

mjordan (Contributor) commented Aug 22, 2020

Extracting text from PDFs with non-Latin characters works the same way with this patch applied (non-Latin characters come through OK). Unless the fact that I couldn't get it to fail on the dev branch is a problem, I can merge this at any time.

seth-shaw-unlv (Contributor, Author)

@mjordan when did you build the box? I did the PR based on @dannylamb's pre-built release box. As I mentioned to @jordandukart, this is likely dependent on the pdftotext version on the box where you have Crayfish installed.

mjordan (Contributor) commented Aug 24, 2020

Dev Playbook. Let me try to get it to fail on a release box.

mjordan (Contributor) commented Aug 25, 2020

I just built an 8.x-1.1 vagrant, and PDF text extraction is not working on the Arabic file above - Extracted Text for it never shows up. (This happens with and without the PR applied.) In fact, no other derivatives are created for that file, and when I view its parent node, even though I have the PDFjs display hint selected, it's not rendering in the node. Watchdog is empty. Derivatives are being created for other PDFs and the display hint is working as expected for them as well, so if I had to guess I'd say that PDF file is invalid or corrupted.

OTOH, text extraction on the other PDFs containing non-Latin characters I was testing with earlier is successful in both the dev and 8.x-1.1 VMs without the PR applied. In the 8.x-1.1 box without the PR, here's the extracted text from a PDF created from a LibreOffice word-processing file encoded as Windows-1255 (Hebrew):

[screenshot: hebrew_text]

The text I extracted from the LibreOffice Writer file is in Windows 1255/ISO 8859:

mark@mark-ThinkPad-X1-Carbon-6th:~/Downloads$ file עוֹלָם.txt 
עוֹלָם.txt: ISO-8859 text, with no line terminators

yet the text from the LibreOffice-derived PDF extracted by Islandora is in UTF-8:

mark@mark-ThinkPad-X1-Carbon-6th:~/Downloads$ file encodingtest.txt 
encodingtest.txt: UTF-8 Unicode text

As long as adding the -enc UTF-8 parameter won't have any side effects, I don't see any harm in adding it, but I have been unable to reproduce the problem it would solve while running pdftotext version 0.62.0, which is the version on both the dev and 8.x-1.1 VMs.

Maybe someone else should test?

seth-shaw-unlv (Contributor, Author)

Huh, I just re-tested this by rolling back my VM to before the fix and it still worked... 🤷‍♂️

I guess this PR isn't really necessary. No sense changing something that doesn't need to be changed.

seth-shaw-unlv deleted the issue-1582 branch August 25, 2020 19:10