Default pdftotext to UTF-8: Issue 1582 #103
Codecov Report
@@            Coverage Diff            @@
##                dev     #103   +/-   ##
=========================================
  Coverage     94.85%   94.85%
  Complexity      168      168
=========================================
  Files             9        9
  Lines           700      700
=========================================
  Hits            664      664
  Misses           36       36
Continue to review full report at Codecov.
@@ -78,7 +78,7 @@ public function get(Request $request)
         $this->log->debug("Got Content-Type:", ['type' => $content_type]);

         if ($content_type == 'application/pdf') {
-            $cmd_string = $this->pdftotext_executable . " $args - -";
+            $cmd_string = $this->pdftotext_executable . " $args -enc UTF-8 - -";
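For illustration only, the changed line can be sketched as a standalone function. This is a hedged variant, not the code in this PR: the name `buildPdftotextCommand`, the check for an existing `-enc` option, and the use of `escapeshellarg()` are all assumptions added for the example.

```php
<?php
// Hypothetical helper, not part of Crayfish: builds the pdftotext command
// line and adds an explicit output encoding unless the caller already
// passed an -enc option in $args.
function buildPdftotextCommand(string $executable, string $args, string $encoding = 'UTF-8'): string
{
    $args = trim($args);
    // Respect an -enc option supplied by the caller.
    if (!preg_match('/(^|\s)-enc(\s|$)/', $args)) {
        $args = trim($args . ' -enc ' . escapeshellarg($encoding));
    }
    // The trailing "- -" is kept from the original line: PDF in, text out.
    return trim($executable . ' ' . $args) . ' - -';
}

echo buildPdftotextCommand('pdftotext', '-layout'), PHP_EOL;
```

The PR itself simply hard-codes `-enc UTF-8` into the string; the extra check here only shows how a caller-supplied encoding could be left alone.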
Seems to make sense. Reading through pdftotext's docs, any idea why they defaulted to Latin1?
I went back to look and this appears to be a version issue. Older versions default to Latin1, probably due to not paying attention to Unicode, while the newer versions default to UTF-8.
Oddly, the man pages for a default site say UTF-8 is the default, but running the version flag on it shows an older version that requires the flag.
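One way to see which pdftotext a given box actually runs is to parse the output of the version flag. The sketch below is an assumption-laden example: the function names are hypothetical, and the regex assumes the banner contains a line of the form `pdftotext version X.Y[.Z]`.

```php
<?php
// Hypothetical helper: pulls the version number out of `pdftotext -v` output.
// Assumes the banner contains "pdftotext version X.Y" or "X.Y.Z".
function parsePdftotextVersion(string $output): ?string
{
    if (preg_match('/pdftotext version (\d+(?:\.\d+)*)/', $output, $m)) {
        return $m[1];
    }
    return null;
}

// Runs the version flag and parses the result. stderr is merged into stdout
// so the banner is captured whichever stream it is written to.
function installedPdftotextVersion(string $executable = 'pdftotext'): ?string
{
    $output = shell_exec(escapeshellcmd($executable) . ' -v 2>&1');
    return is_string($output) ? parsePdftotextVersion($output) : null;
}
```

Comparing the returned version with the one the installed man page documents would expose the kind of mismatch described in this comment.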
Had a peek through the available encodings and what may or may not be available on the OS. I know we've run into things in PHP where things like LC_ALL and UTF-8 aren't available, but I can't seem to find any reference to that in the pdftotext context, so this gets a 👍 from me.
I can't replicate the problem. In other words, all PDFs that I test have the non-Latin characters intact. I am on the default dev branch:
I have tried PDFs that contain Traditional and Simplified Chinese, Hebrew, and even the test PDF linked above. They all work. Here's some extracted Arabic text (screenshot of rendering Extracted Text.txt for that PDF) from the test PDF above:
Am I doing something wrong?
I'll add that when I issue
Extracting text from PDFs with non-Latin characters works the same way with this patch applied (non-Latin characters come through OK). Unless the fact that I couldn't get it to fail in the
@mjordan when did you build the box? I did the PR based on @dannylamb's pre-built release box. As I mentioned to @jordandukart, this is likely dependent on the pdftotext version on the box where you have Crayfish installed.
Dev Playbook. Let me try to get it to fail on a release box.
I just built an 8.x-1.1 vagrant, and PDF text extraction is not working on the Arabic file above: Extracted Text for it never shows up. (This happens with and without the PR applied.) In fact, no other derivatives are created for that file, and when I view its parent node, even though I have the PDFjs display hint selected, it's not rendering in the node. Watchdog is empty. Derivatives are being created for other PDFs and the display hint is working as expected for them as well, so if I had to guess I'd say that PDF file is invalid or corrupted.
OTOH, text extraction on the other PDFs containing non-Latin characters I was testing with earlier is successful in both the dev and 8.x-1.1 VMs without the PR applied. In the 8.x-1.1 box without the PR, here's the extracted text from a PDF created from a LibreOffice word processing file encoded as Windows 1255 (Hebrew):
The text I extracted from the LibreOffice Writer file is in Windows 1255/ISO 8859:
yet the text from the LibreOffice-derived PDF extracted by Islandora is in UTF-8:
As long as adding the Maybe someone else should test?
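The encoding comparison in this comment can also be checked programmatically. The following is a minimal sketch, with a hypothetical function name, that tests whether a string of extracted text is well-formed UTF-8; the sample bytes are illustrative and not taken from the PDFs discussed here.

```php
<?php
// Hypothetical check: true when $text is a well-formed UTF-8 byte sequence.
// With the /u modifier, preg_match() returns false on invalid UTF-8 input.
function isValidUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$utf8Hebrew   = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // "שלום" encoded as UTF-8
$cp1255Hebrew = "\xF9\xEC\xE5\xED";                 // same word as Windows-1255 bytes

var_dump(isValidUtf8($utf8Hebrew));   // bool(true)
var_dump(isValidUtf8($cp1255Hebrew)); // bool(false)
```

Note that pure ASCII text passes this check regardless of which encoding pdftotext was asked for, so it is only informative for files that contain non-Latin characters.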
Huh, I just re-tested this by rolling back my VM to before the fix and it still worked... 🤷‍♂️ I guess this PR isn't really necessary. No sense changing something that doesn't need to be changed.
GitHub Issue: Islandora/documentation#1582
What does this Pull Request do?
Changes the default output of pdftotext from Latin-1 to UTF-8.
What's new?
Adds -enc UTF-8 as a parameter to pdftotext.
How should this be tested?
Interested parties
@Islandora/8-x-committers