Add font fallback + Support for font IDs containing hyphens #614

GreyWyvern · 2023-07-12T20:08:46Z

If part of a text stream is positioned after an incorrect font command, then undecoded, jumbled bytes will appear in the getText() output. This change adds code to check this output for UTF-8 control characters (\x00-\x1f + \x7f) and if they appear, loop through all available fonts to see if we can find one that decodes this output properly. If none is found, the original string is used. Resolves #586.
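The fallback described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the library's PHP code: the `Decoder` type, `looks_undecoded`, and `decode_with_fallback` are hypothetical names, and a font is modelled simply as a function from bytes to text.

```python
import re
from typing import Callable, Dict

# Control characters named in the description: \x00-\x1f plus \x7f.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

# Hypothetical stand-in for a font: a function that decodes raw bytes.
Decoder = Callable[[bytes], str]


def looks_undecoded(text: str) -> bool:
    """Return True if the decoded text still contains control characters."""
    return CONTROL_CHARS.search(text) is not None


def decode_with_fallback(raw: bytes, current: str, fonts: Dict[str, Decoder]) -> str:
    """Decode with the current font; if the result looks wrong, try the others."""
    original = fonts[current](raw)
    if not looks_undecoded(original):
        return original
    for name, decode in fonts.items():
        if name == current:
            continue
        candidate = decode(raw)
        if not looks_undecoded(candidate):
            return candidate
    # No font produced clean output, so keep the original string.
    return original


# Usage: "F1" is the (wrong) current font, "F2" decodes the bytes properly.
fonts = {
    "F1": lambda b: b.decode("latin-1"),
    "F2": lambda b: b.decode("utf-16-be", errors="replace"),
}
print(decode_with_fallback(b"\x00H\x00i", "F1", fonts))  # prints "Hi"
```

Returning the original string when no candidate is clean mirrors the "If none is found, the original string is used" behaviour stated above.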

Also add support for font IDs containing hyphens. Previously these were ignored as invalid. Resolves #145.
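To illustrate why a hyphenated font ID can be missed, here is a minimal sketch in Python. The two patterns are assumptions made for this example and are not the library's actual expressions; `font_id` is a hypothetical helper that extracts the ID from a `/FontID size Tf` command.

```python
import re
from typing import Optional

# Hypothetical patterns for a "/FontID size Tf" command.
STRICT_TF = re.compile(r"/(\w+)\s+([\d.]+)\s+Tf")      # word characters only
HYPHEN_TF = re.compile(r"/([\w\-]+)\s+([\d.]+)\s+Tf")  # also accepts "-"


def font_id(command: str, allow_hyphen: bool = True) -> Optional[str]:
    """Return the font ID from a Tf command, or None if it does not match."""
    pattern = HYPHEN_TF if allow_hyphen else STRICT_TF
    match = pattern.search(command)
    return match.group(1) if match else None


print(font_id("/TT-1 12 Tf", allow_hyphen=False))  # prints None
print(font_id("/TT-1 12 Tf"))                      # prints TT-1
```

With the stricter pattern the whole command fails to match, so the font switch would be skipped and later text decoded with the previous font.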

If a text stream is "decoded" and contains UTF-8 control characters, it probably wasn't decoded using the proper font code page. Add a loop that cycles through all the available fonts to see if there's a better decode choice. Resolves Issue 586. As well, add the ability to parse font IDs containing dashes (-). Resolves Issue 145

k00ni

It's good to see you are still with us!

Just a few remarks/questions.

src/Smalot/PdfParser/PDFObject.php

samples/FontIDHyphen.pdf

samples/ImproperFontFallback.pdf

Add one more test for font-fallback. This addition also resolves smalot#495. Catches situations where a null byte \x00 may not be found by preg_match in a Unicode context. Null bytes in the text string usually mean that a CIDMap-encoded string has been passed through as UTF-8 bytes without being translated by any matching CIDMap pairs.
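A small Python sketch of how such null bytes can arise. It assumes two-byte character codes and uses UTF-16BE as a stand-in for a CIDMap, which is a simplification; `has_null_byte` is a hypothetical helper, and the PHP-specific preg_match behaviour is not reproduced here.

```python
def has_null_byte(text: str) -> bool:
    """Plain substring check that does not depend on a regex engine."""
    return "\x00" in text


# Two-byte codes for "Hi" (UTF-16BE used here in place of a real CIDMap).
raw = "Hi".encode("utf-16-be")

# Passed through as UTF-8 without translation: the high bytes stay as \x00.
passed_through = raw.decode("utf-8")

# Translated with the matching two-byte mapping: no null bytes remain.
translated = raw.decode("utf-16-be")

print(has_null_byte(passed_through))  # prints True
print(has_null_byte(translated))      # prints False
```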

k00ni · 2023-07-23T15:41:08Z

Are you done here @GreyWyvern?

GreyWyvern · 2023-07-30T02:32:09Z

> Are you done here @GreyWyvern?

Yes, sorry. I was on vacation this week. :)

k00ni · 2023-07-31T05:54:57Z

> > Are you done here @GreyWyvern?
>
> Yes, sorry. I was on vacation this week. :)

All good, hope you had a good one.

GreyWyvern added 2 commits July 12, 2023 15:57

Update PDFObjectTest.php

5f1c888

Simplify these tests in case future edits change spacing rules.

k00ni requested changes Jul 13, 2023

src/Smalot/PdfParser/PDFObject.php (outdated, resolved)

src/Smalot/PdfParser/PDFObject.php (outdated, resolved)

samples/FontIDHyphen.pdf (outdated, resolved)

samples/ImproperFontFallback.pdf (outdated, resolved)

k00ni added the "fix" and "de-/encoding issue" labels Jul 13, 2023

This was referenced Jul 13, 2023

text encoding breaks in the middle of the line #586

Closed

Cannot extract text from PDF (internal font naming issue) #145

Closed

GreyWyvern added 2 commits July 13, 2023 13:44

Refactor duplicate code into a function

863bd89

Use single quoted regexp

3a8fbda

Let PCRE handle the conversion rather than PHP. Hopefully fixes PHPStan complaints about null byte.

k00ni reviewed Jul 14, 2023

src/Smalot/PdfParser/PDFObject.php (resolved)

GreyWyvern and others added 6 commits July 14, 2023 10:38

Add @param for $command and ?Page

b993db1

Proper indentation.

981bc39

fixing coding style issues in PDFObject.php

34f1b54

ref: https://cs.symfony.com/doc/rules/function_notation/nullable_type_declaration_for_default_null_value.html

reverted coding style adaptions

dea98d1

Remove test case PDF

d7a378f

Remove the Font ID with hyphen test case PDF as we could not contact the submitter to get permission to use it. Change the unit test to directly test if a Font ID with a hyphen is correctly parsed.

GreyWyvern mentioned this pull request Jul 21, 2023

Various output errors #492

Closed

k00ni linked an issue Jul 23, 2023 that may be closed by this pull request

Various output errors #492

Closed

k00ni approved these changes Jul 26, 2023

k00ni merged commit ce434c1 into smalot:master Jul 31, 2023
