
Default pdftotext to UTF-8: Issue 1582 #103

Closed
wants to merge 1 commit into dev from issue-1582

Conversation

seth-shaw-unlv (Contributor)

GitHub Issue: Islandora/documentation#1582

What does this Pull Request do?

Changes the default output of pdftotext from Latin-1 to UTF-8.

What's new?

  • Added -enc UTF-8 as a parameter to pdftotext.
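
For context, the change makes Hypercube's call equivalent to the following shell invocation (a sketch based on the diff below; Hypercube pipes the PDF in on stdin and reads the text from stdout, and the filenames here are illustrative):

# Extract text from a PDF read on stdin, writing UTF-8 text to stdout
pdftotext -enc UTF-8 - - < document.pdf > extracted.txt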

How should this be tested?

  • Take an existing Islandora 8 site and add a Digital Document node/media with a PDF original file that includes non-Latin characters. I used https://fsi-languages.yojik.eu/languages/PeaceCorps/Arabic-Moroccan/MO_Arabic_Language_Lessons.pdf as my test case.
  • Confirm the text was extracted by looking at the node's extracted text media; without the fix, none of the non-Latin characters will appear.
  • Apply the PR.
  • Run the Extract Text action on the Digital Document node to regenerate the extracted text.
  • Look at the extracted text media again; you should now see the non-Latin characters in your output (a quick command-line check is sketched below).
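
To double-check the encoding from the command line, you can run file on the downloaded extracted-text file (the filename below is illustrative):

# Should report "UTF-8 Unicode text" once extraction uses -enc UTF-8
file "Extracted Text.txt"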

Interested parties

@Islandora/8-x-committers

codecov bot commented Aug 18, 2020

Codecov Report

Merging #103 into dev will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##                dev     #103   +/-   ##
=========================================
  Coverage     94.85%   94.85%           
  Complexity      168      168           
=========================================
  Files             9        9           
  Lines           700      700           
=========================================
  Hits            664      664           
  Misses           36       36           
Impacted Files Coverage Δ Complexity Δ
Hypercube/src/Controller/HypercubeController.php 100.00% <100.00%> (ø) 5.00 <0.00> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f870cb1...97ea158.

@@ -78,7 +78,7 @@ public function get(Request $request)
         $this->log->debug("Got Content-Type:", ['type' => $content_type]);
 
         if ($content_type == 'application/pdf') {
-            $cmd_string = $this->pdftotext_executable . " $args - -";
+            $cmd_string = $this->pdftotext_executable . " $args -enc UTF-8 - -";

jordandukart (Member) commented Aug 19, 2020

Seems to make sense. Reading through pdftotext's docs, any idea why they defaulted to Latin-1?

seth-shaw-unlv (Contributor, Author)

I went back to look and this appears to be a version issue. Older versions default to Latin-1, probably because they predate broad Unicode support, while newer versions default to UTF-8.

Oddly, the man page on a default site says UTF-8 is the default, but running pdftotext with the version flag there shows an older version that still requires the -enc flag.
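
For anyone comparing boxes, the installed version is easy to check (a standard pdftotext flag; output varies by install):

# Print the installed pdftotext/poppler version
pdftotext -v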

Member

Had a peek through the available encodings and what may or may not be available on the OS. I know we've run into situations in PHP where things like LC_ALL and UTF-8 aren't available, but I can't find a reference to that in the pdftotext context, so this gets a 👍 from me.
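
For reference, two quick checks on a given box (standard pdftotext and locale tools, nothing Islandora-specific):

# List the text encodings this pdftotext build supports
pdftotext -listenc
# List installed locales and filter for UTF-8 variants
locale -a | grep -i utf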

mjordan (Contributor) commented Aug 21, 2020

I can't replicate the problem. In other words, all PDFs that I test have the non-Latin characters intact.

I am on the default dev branch:

vagrant@islandora8:/var/www/html/Crayfish$ git status
On branch dev
Your branch is up to date with 'origin/dev'.

nothing to commit, working tree clean

I have tried PDFs that contain Traditional and Simplified Chinese, Hebrew, and even the test PDF linked above. They all work. Here's some extracted Arabic text from the test PDF above (screenshot of rendering its Extracted Text.txt):

[screenshot: arabic_text]

Am I doing something wrong?

mjordan (Contributor) commented Aug 21, 2020

I'll add that when I run file on the extracted text file, it says it's UTF-8:

file "7-Extracted Text.txt" 
7-Extracted Text.txt: UTF-8 Unicode text

mjordan (Contributor) commented Aug 22, 2020

Extracting text from PDFs with non-Latin characters works the same way with this patch applied (non-Latin characters come through OK). Unless the fact that I couldn't get it to fail on the dev branch is a problem, I can merge this at any time.

seth-shaw-unlv (Contributor, Author)

@mjordan when did you build the box? I did the PR based on @dannylamb's pre-built release box. As I mentioned to @jordandukart, this is likely dependent on the pdftotext version on the box where you have Crayfish installed.

mjordan (Contributor) commented Aug 24, 2020

Dev Playbook. Let me try to get it to fail on a release box.

mjordan (Contributor) commented Aug 25, 2020

I just built an 8.x-1.1 vagrant, and PDF text extraction is not working on the Arabic file above - Extracted Text for it never shows up. (This happens with and without the PR applied.) In fact, no other derivatives are created for that file, and when I view its parent node, even though I have the PDFjs display hint selected, it's not rendering in the node. Watchdog is empty. Derivatives are being created for other PDFs and the display hint is working as expected for them as well, so if I had to guess I'd say that PDF file is invalid or corrupted.

OTOH, text extraction on the other PDFs containing non-Latin characters I was testing with earlier is successful in both the dev and 8.x-1.1 VMs without the PR applied. In the 8.x-1.1 box without the PR, here's the extracted text from a PDF created from a LibreOffice word-processing file encoded as Windows-1255 (Hebrew):

[screenshot: hebrew_text]

The text I extracted from the LibreOffice Writer file is in Windows 1255/ISO 8859:

mark@mark-ThinkPad-X1-Carbon-6th:~/Downloads$ file עוֹלָם.txt 
עוֹלָם.txt: ISO-8859 text, with no line terminators

yet the text from the LibreOffice-derived PDF extracted by Islandora is in UTF-8:

mark@mark-ThinkPad-X1-Carbon-6th:~/Downloads$ file encodingtest.txt 
encodingtest.txt: UTF-8 Unicode text

As long as adding the -enc UTF-8 parameter won't have any side effects, I don't see any harm in adding it, but I have been unable to reproduce the problem it would solve while running pdftotext version 0.62.0, which is the version on both the dev and 8.x-1.1 VMs.

Maybe someone else should test?

seth-shaw-unlv (Contributor, Author)

Huh, I just re-tested this by rolling back my VM to before the fix and it still worked... 🤷‍♂️

I guess this PR isn't really necessary. No sense changing something that doesn't need to be changed.

seth-shaw-unlv deleted the issue-1582 branch August 25, 2020 19:10