Major Update to PDFObject.php + Ancillary #634
This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions, this update changes the behaviour of `cleanContent()` to the following:

- Hide all (strings)
- Remove all newlines and carriage-returns
- Hide all <<dictionaries>>
- Normalize all whitespace
- Using a list of valid Operators from the PDF Reference, add \r\n back to the remaining text so that there is a single PDF command on every line
- Restore the <<dictionaries>> and (strings)

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of PREG extraction. `getSectionsText()` has been updated to do just this, stepping through the output of `cleanContent()` line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n and \t whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` was found. It also treats `Q` and `q` as a stack of stored states rather than a single stored state.

The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future.

`decodeText()` in **Font.php** now takes into account the current text matrix when considering whether or not to add spaces between words. Instead of `implode()`-ing the result array with a space joiner, rely on the positioning check to add all required spacing.
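The `cleanContent()` steps listed in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `formatStreamSketch()`, the `@@n@@` placeholder tokens and the short operator list are hypothetical and are not taken from the PR, and nested strings or dictionaries are not handled.

```php
<?php
// Illustrative sketch only: formatStreamSketch() and its short operator list
// are hypothetical and are not the code from this PR.
function formatStreamSketch(string $content): string
{
    $hidden = [];
    // Replace a matched span with a placeholder token and remember the original.
    // The token is padded with spaces so it always stands apart from operators;
    // this can leave extra spaces inside a restored dictionary.
    $hide = function (array $m) use (&$hidden): string {
        $token = '@@' . count($hidden) . '@@';
        $hidden[$token] = $m[0];

        return ' ' . $token . ' ';
    };

    // 1. Hide all (strings). Handles escaped parentheses, not nested ones.
    $content = preg_replace_callback('/\((?:\\\\.|[^\\\\()])*\)/s', $hide, $content);
    // 2. Remove all newlines and carriage returns.
    $content = str_replace(["\r", "\n"], ' ', $content);
    // 3. Hide all <<dictionaries>> (non-nested only in this sketch).
    $content = preg_replace_callback('/<<.*?>>/s', $hide, $content);
    // 4. Normalize all whitespace.
    $content = trim(preg_replace('/\s+/', ' ', $content));

    // 5. Put one command on every line by breaking after each known operator.
    $operators = ['BT', 'ET', 'Tf', 'Td', 'TD', 'Tm', 'Tj', 'TJ', 'cm', 'q', 'Q', 'BDC', 'EMC'];
    $lines = [];
    $current = [];
    foreach (explode(' ', $content) as $token) {
        $current[] = $token;
        if (in_array($token, $operators, true)) {
            $lines[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current !== []) {
        $lines[] = implode(' ', $current);
    }
    $content = implode("\r\n", $lines);

    // 6. Restore in reverse order, so dictionaries come back before the
    //    strings that were hidden inside them.
    $restore = array_reverse($hidden, true);

    return str_replace(array_keys($restore), array_values($restore), $content);
}
```

Hiding strings first means an operator name or a line break inside a `(string)` can never be mistaken for a real command, which is the property the line-by-line parsing relies on.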
In `decodeContent()` in **Font.php** add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.
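A minimal sketch of such a BOM check, assuming the `mbstring` extension is available. The helper name `decodeIfUtf16Be()` is hypothetical and this is not the PR's `decodeContent()` code.

```php
<?php
// Hypothetical helper, not the PR's implementation; assumes ext-mbstring.
function decodeIfUtf16Be(string $text): ?string
{
    // A UTF-16BE byte order mark is the two bytes 0xFE 0xFF.
    if (strncmp($text, "\xFE\xFF", 2) !== 0) {
        return null; // no BOM: leave the string to the font's own decoding
    }

    // Drop the BOM and convert the remaining bytes directly to UTF-8.
    return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
}
```

Returning `null` when no BOM is present lets a caller fall back to its existing decoding path.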
Add a unit test for correctly decoding an emdash in Cyrillic text. Use sample PDF from issue smalot#585. User @se-ti allowed use of this file in issue smalot#586 (comment).

In `cleanContent()` once all strings and dictionaries are hidden, do a MIME-type check on the remaining content. If it doesn't register as text/plain, then return an empty string. This prevents non-document-stream data from being passed to `cleanContent()` such as JPEG data in file '12249.pdf' from smalot#458.

Remove whitespace-adding code from **Font.php**. I originally added this code as a "shim" because `decodeText()` did not take into account the current Text Matrix when considering what counted as "words". Now that it does, the previous code of just `implode()`-ing with a space character works.
Modify several code comments to be clearer. Remove the `$key => ` from `decodeText()` in **Font.php** as it's no longer needed. Also, now that `cleanContent()` is ignoring non `text/plain` content, there should be no errant `q` or `Q` commands that cause the stored-state stack to try restoring a state that doesn't exist. Remove the kludgy code that prevented this.
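The stored-state stack for `q` and `Q` could look something like the sketch below. The command format (operator/operands pairs), the state keys and the function name `replayStateSketch()` are all hypothetical. The `$stack !== []` guard covers the unmatched-`Q` case mentioned above; whether such a guard is still wanted is a judgment call.

```php
<?php
// Hypothetical sketch: the command format and state keys are illustrative only.
function replayStateSketch(array $commands): array
{
    $state = ['font' => null, 'cm' => [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]];
    $stack = [];
    $fonts = [];

    foreach ($commands as [$operator, $operands]) {
        switch ($operator) {
            case 'q': // save: push a copy of the current state (arrays copy by value)
                $stack[] = $state;
                break;
            case 'Q': // restore: pop the most recently saved state, if any
                if ($stack !== []) {
                    $state = array_pop($stack);
                }
                break;
            case 'Tf':
                $state['font'] = $operands;
                break;
            case 'cm':
                // Simplification: a real cm concatenates with the current
                // matrix; this sketch just records the operands.
                $state['cm'] = array_map('floatval', explode(' ', $operands));
                break;
            case 'Tj': // record which font is active when text is shown
                $fonts[] = $state['font'];
                break;
        }
    }

    return $fonts;
}
```

With a stack, nested `q ... q ... Q ... Q` pairs each restore the state saved by their own `q`, which a single stored state cannot do.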
Remove unnecessary `$whitespace` line.
Edit: This has been resolved by using a different method to detect whether a content stream is binary and can be safely ignored.
Edit: This has been resolved by checking for a fixed numerical value for memory use above the
@GreyWyvern Thank you very much for all your work. I will see how I manage to give more feedback soon, but I hope that our community has time to comment too. Surprisingly I can't mark this PR as draft.
The correct matrix elements to use for scaling the x-axis are actually the first *column*, so 'a' and 'i', not 'a' and 'b'. My bad! It worked before because almost always the x-axis scaling is equal to the y-axis scaling.
The Fileinfo functions are not installed by default on Windows, so use a different method to determine whether the stream is valid or binary.
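The commit does not say which replacement method was chosen. Purely as an illustration of a check that needs no extension, one could count control bytes in a sample of the stream. The function name, the 1024-byte sample size and the 5% threshold below are all assumptions, not the PR's method.

```php
<?php
// Hypothetical heuristic, NOT the check used in the PR.
function looksBinary(string $content, int $sample = 1024): bool
{
    $chunk = substr($content, 0, $sample);
    if ($chunk === '') {
        return false;
    }

    // Count control bytes other than tab, line feed, form feed and carriage return.
    $control = (int) preg_match_all('/[\x00-\x08\x0B\x0E-\x1F]/', $chunk);

    // Treat the stream as binary when more than 5% of the sample is control bytes.
    return $control / strlen($chunk) > 0.05;
}
```

A ratio is used instead of a single-byte test so that one stray control character in an otherwise textual stream does not discard the whole stream.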
PHP CS Fixer native_function_invocation
Make the cases a little bit more alphabetical. Remove cases/commands that aren't relevant to getting and positioning text.
Can you elaborate a bit about which kind of PDFs are not there yet? We could "reproduce" some missing things by using unit tests instead.
I just meant unknown PDFs "in-the-wild" in general. I run it on my own org's collection of PDFs (~400) for searching and it works fine. Almost all of them worked in v2.7.0 too, but those are PDFs all generated by one entity. There's just no telling, with a change this large, if there might be a PDF out there that works in v2.7.0 but not with this PR. We can't really tell without people actually using it, so I think putting out a release candidate is a very good idea.
Add some documentation comment text to PDFObject.php and fix a comment typo in Font.php. Add a test accounting for text-matrix scaling in Font::decodeText(). Add a test verifying that a string prefixed by a UTF-16BE BOM is decoded directly by Font::decodeContent(). Move "ET in font name" test from testCleanContent() to testGetSectionsText() as that is the function the test uses. Add a test that verifies cleanContent() returns an empty string for binary content. Remove unnecessary variable reset from ET command in Page::getDataTm. Only needed under BT.
Account for the entire font-factor (font-size multiplied by the horizontal scaling factor of the text-matrix) when estimating the width of the current text block. Insert a fix when decoding octal strings in Font::decodeOctal(), check further ahead for escaped backslashes. Remove tests for images in DocumentGeneratorFocusTest.php. These also fail in the current v2.7.0 release and they should be looked at in a separate PR.
Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated. Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR smalot#634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated. Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f. Add new tests for both of these.
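The odd/even backslash rule and the `preg_replace_callback()` approach can be sketched as below. Both function names are hypothetical; this is not the code from `Font::decodeOctal()` or `decodeEntities()`, and the `#xx` entity form is an assumption based on the `#2D` example used in the related PR.

```php
<?php
// Hypothetical sketches, not the PR's implementations.

// Translate an octal code only when it follows an odd number of backslashes.
function decodeOctalSketch(string $text): string
{
    return preg_replace_callback(
        '/(\\\\+)([0-7]{1,3})/',
        function (array $m): string {
            if (strlen($m[1]) % 2 === 0) {
                // Even run: every backslash is itself escaped, digits are literal.
                return $m[0];
            }

            // Odd run: the last backslash introduces the octal code. Preceding
            // backslash pairs are left untouched in this sketch.
            return substr($m[1], 0, -1) . chr(octdec($m[2]) & 0xFF);
        },
        $text
    );
}

// Translate #xx hexadecimal entities, matching a-f in either case.
function decodeHexEntitiesSketch(string $text): string
{
    return preg_replace_callback(
        '/#([0-9a-f]{2})/i',
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $text
    );
}
```

Because the pattern captures the whole run of backslashes, the parity check works for runs of any length rather than a fixed look-behind of two or three.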
Please only mark the functions you added in this PR with

About the images: Just remove test-code then, if you didn't change any image related code.

I thought about the size of this PR and would like to know if there is a reasonable way to split this PR up, so we can start to integrate it iteratively. There is no rush here, but on the other hand there is only feedback if people can try it out in the wild (using release candidates). It is up to you @GreyWyvern, it's just a suggestion.
Will do! These functions are not really something a regular user would use, so
Removed.
Well... the main change of function is in the two functions The BOM-related changes in Font.php, and all the changes in Page.php (except removing lines 403-404) are really the only changes that are essentially entirely unrelated to the meat of this PR. I will revert my change to I think at most I would be able to split this into two PRs, the first of which would be all the additions that don't actually change execution flow (like adding I'm not in a hurry. I'm happy to allow you all the time you need to review :)
Revert the change to Font::decodeOctal() as it's been superseded by PR smalot#640. Add @internal notes to formatContent() and parseDictionary().
The `@internal` tag hides the content that comes _after_ it from the documentation, so adjust these comments as appropriate. See: https://manual.phpdoc.org/HTMLSmartyConverter/HandS/phpDocumentor/tutorial_tags.internal.pkg.html
This test will succeed once PR smalot#640 is merged. It doesn't have anything to do with the current PR, so disable it for now.
In the following I am proposing a few changes in the function headers. As far as I understand the doc, `@internal` should reflect why a function is for internal use only. It was accompanied by general information here in some cases.
Switch to tagging method for `@internal`. Adjust comments.
PHP-CS-Fixer requires spaces between `@` statements I guess.
* Better octal and hex-entity decode

Octal strings can include series of backslashes of arbitrary length. If there is an odd number of backslashes, a following octal code is valid, but if there's an even number, the following octal code should not be translated. Previously PdfParser would only account for two backslashes directly preceding an octal code. A commit from in-progress PR #634 extended this to three which probably covers 99.99% of all cases. This change ups that to 100% in that there could be a string with any number of backslashes in a row, and codes will be correctly translated.

Also update decodeEntities() to use a preg_replace_callback() instead of the bulkier preg_split() + foreach loop. Make sure it matches all hexadecimal digits including a-f.

Add new tests for both of these.

* Use #2D to ensure we're capturing hex letters

* Change order of special string replacement

Move the special string replacement after the unescaping of parentheses so we don't unescape any parentheses we shouldn't. Add more tests to make sure this is working.

* Apply suggestions from code review

Co-authored-by: Konrad Abicht <hi@inspirito.de>

---------

Co-authored-by: Konrad Abicht <hi@inspirito.de>
In some edge cases, the formatContent() method may return a document stream row containing an invalid command. Make sure we just ignore these commands instead of triggering warnings for missing $matches array elements.
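One way to skip such rows is to test the return value of `preg_match()` before touching `$matches`. The sketch below assumes rows separated by `\r\n` with the operator last on each row; the function name, the pattern and the array keys are hypothetical, and the pattern only checks the shape of a row, not whether the operator is a real PDF operator.

```php
<?php
// Hypothetical sketch, not the PR's getSectionsText() code.
function parseRowsSketch(string $formatted): array
{
    $commands = [];
    foreach (explode("\r\n", $formatted) as $row) {
        // Expect optional operands, whitespace, then an operator-like token.
        if (1 !== preg_match('/^(?:(.*)\s)?([A-Za-z\'"][A-Za-z0-9*]*)$/s', trim($row), $matches)) {
            continue; // invalid row: ignore it instead of reading missing $matches keys
        }
        $commands[] = ['operator' => $matches[2], 'operands' => $matches[1]];
    }

    return $commands;
}
```

Checking the match result first means `$matches[1]` and `$matches[2]` are only read when they are guaranteed to exist.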
Re-enable this assertion, now that we have merged smalot#640.
This is a massive rewrite and it's hard to follow 😅
src/Smalot/PdfParser/PDFObject.php
 * @internal
 */
public function formatContent(?string $content): string
Why don't you define it as `private` instead? It'll avoid adding the `@internal`
Well, it's called explicitly in PDFObjectTest.php, but I can work around that with `ReflectionMethod`. Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.
Maybe this can be applied to other `@internal` methods?
The fewer public methods we have the better, because `@internal` is just a label and can't be enforced.
About the testing of private methods: I support the view that private functionality is not to test, because it is the object's obligation. One should only test public methods. In practice this might lead to complicated situations in which it is hard to cover a certain case sometimes.
> Initially I thought it would be useful to be able to use this method publicly to format any old document stream, but that's probably not necessary.
I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context. But we should finalize this one first and then see. Because it is private now, we could extract it from PDFObject and make it available later on, if we want.
> Maybe this can be applied to other `@internal` method?
Almost certainly. I wouldn't want to do it in this PR though. :)
> About the testing of private methods: I support the view that private functionality is not to test, because it is the object's obligation. One should only test public methods. In practice this might lead to complicated situations in which it is hard to cover a certain case sometimes.

Since we can test private methods by making them accessible via `Reflection` (well supported for pretty much exactly this purpose since PHP 5) I don't see why we shouldn't, personally. The more targeted the tests, the easier they are to isolate and fix. More tests!!!

> I was thinking that a method like that should be extracted from PDFObject and moved to a utility class or something. It is useful and might be handy outside of PDFObject context.

Definitely. For instance, I feel like my added function `parseDictionary()` duplicates much of the `protected` parsing functionality from Parser.php that breaks apart PDF 'dictionary' structures, which aren't only used in document streams, but trailer info, etc. In the future, there should probably be one global class or function that parses dictionaries (a fundamental PDF data structure) for all situations.
Make the formatContent() method private to PDFObject so that `@internal` isn't required. Adjust the unit tests with `ReflectionMethod` to account for this.
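Calling a private method from a test via `ReflectionMethod` might look like this. `StreamFormatterStandIn` is a hypothetical stand-in class with a trivial body, used here only because the real `PDFObject` needs constructor arguments; the method signature mirrors the one shown in the review above.

```php
<?php
// Hypothetical stand-in for a class with a private formatContent() method.
class StreamFormatterStandIn
{
    private function formatContent(?string $content): string
    {
        return trim((string) $content);
    }
}

$method = new ReflectionMethod(StreamFormatterStandIn::class, 'formatContent');
if (PHP_VERSION_ID < 80100) {
    // Required before PHP 8.1; from 8.1 onward reflection can invoke
    // private methods without this call.
    $method->setAccessible(true);
}

// Invoke the private method on an instance, passing its arguments.
$result = $method->invoke(new StreamFormatterStandIn(), "  BT ET \n");
```

In a PHPUnit test, `$result` would then be passed to an assertion in the usual way.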
@GreyWyvern A quick follow up: Are you planning any updates on this one in the near future? I won't have that much time in the next weeks/months most likely. I remember your suggestion to "throw further PDFs" at the code. I suggest the following next steps:
This way your work gets out to more people and we can observe, if there are any remaining bugs in your code. I can help organize further steps regarding releases or issues. WDYT?
I think this is a good plan. I have actually found a couple more PDFs that PdfParser has trouble with since I last visited, but related to embedded images rather than text-extraction. Putting out a release candidate will certainly help get the new code more testing. The most important thing to find are PDFs that v2.7.0 parses "correctly" while this new version does not. That's where the most tweaking information will come from. :) I will mark this PR as Ready for review.
@k00ni This is good to me
Big thanks to @GreyWyvern and all commentators.
Okay, here it is! I fully expect this will take quite some time to review and merge and likely require many more commits before it's ready to go; please mark it as a draft if that fits better.
It fully passes all unit tests here (PHP 8.2.7), even a couple that were marked "linux only" from which I removed that criterion. Several existing unit test assertions have been altered simply because of the way the update now parses document stream data differently, and thus generates arrays of commands differently. As examples: `BT` commands (as well as several others) are now stored instead of discarded, and outside of `(strings)` and `<<dictionaries>>` whitespace is normalized.

However, since it parses document stream data in a completely new way, what I'm most interested in is whether or not it causes new errors in PDFs that aren't in the test suite. So I hope quite a few people decide to test drive this.
PDFObject.php

This is a major update to the PDFObject.php file. Where previously PdfParser would attempt to gather document stream data using a series of multiline regular expressions focusing on `BT ... ET` blocks, this update adds a new function `formatContent()` that considers the entire document stream. It takes the following steps: […] `(strings)` […] `<<dictionaries>>` […] `<<dictionaries>>` […] `(strings)` […] as `\n` and `\r` and restore them as well

By using this system, it is then much easier to examine and parse the document stream in a line-by-line manner, instead of multiline PCRE extraction. `getSectionsText()` has been updated to do just this, stepping through the output of `formatContent()` line-by-line and returning an array of only the relevant commands needed to position and display text.

The guts of `getText()` have been moved to `getTextArray()` to reduce code duplication. `getTextArray()` now takes into account both the current graphics position `cm` and the scaling factors of the text matrix `Tm` when adding \n, \t and space " " whitespace for positioning. Positioning is only taken into account at the point of inserting text, rather than whenever a `Tm` or `Td`/`TD` is found. `getTextArray()` now also treats `Q` and `q` as a stack of stored states rather than a single stored state. Both font `Tf` and graphics positioning `cm` are stored.

The presence of `ActualText` `BDC` commands is also taken into account and the contents of the `ActualText` will replace the formerly output text in both content and position. This requires the new `parseDictionary()` method to reliably extract such commands as well as any others PdfParser may take into account in the future.

Font.php
`decodeText()` in Font.php now takes into account the current font size and scale when considering whether or not to define strings of text as "words" that require spaces between them.

In `decodeContent()` in Font.php add a check to see if the string to decode has the UTF-16BE BOM and decode it directly as Unicode if so.

Page.php

In Page.php remove the addition of a "fake" BT command as the content stream now records them.

Add a check to see if there are remaining texts to use from `PDFObject::getTextArray()` before proceeding in `getDataTm()` which prevents "undefined array key" PHP errors.

Also prevent `ET` commands from resetting the font `Tf` as PR #629 did for `BT` commands.

Issues affected
Resolves #219.
Resolves #353.
Resolves #398. Current v2.7.0 fixes text direction, but this PR fixes all spacing issues.
Resolves #464. Removes duplicated text by examining `ActualText` commands.
Resolves #474.
Resolves #508.
Resolves #528. Fixes spacing issues.
Resolves #537.
Resolves #564. Current v2.7.0 fixes text extraction, but this PR fixes spacing issues.
Resolves #568. Fixes spacing issues.
Resolves #575.
Resolves #576.
Resolves #585.
Resolves #608. Fixes headings by taking into account the graphics position `cm`.
Resolves #628.
Resolves #637.
Footnotes
1. This prevents non-document-stream data from being passed to `formatContent()` such as JPEG data in file '12249.pdf' from https://github.com/smalot/pdfparser/issues/458
2. https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf - Appendix A; https://archive.org/download/pdf320002008/PDF32000_2008.pdf - Annex A