-
Notifications
You must be signed in to change notification settings - Fork 541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix to iconv() illegal character error (issue #549) #580
Conversation
@Stasky745 Thank you for this pull request. Could you add a test to proof its working? @LucianoHanna can you check it out if it works for you? (#549) |
caca.txt <?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8", $text); While doing it this way returns the text perfectly: <?php
$text = file_get_contents("caca.txt");
echo iconv("CP1252", "UTF-8//TRANSLIT//IGNORE", $text); Is this useful for you? |
Not really, sorry :) Were your examples meant to be used for a test? I get the point about demonstrating the failing code though. If you need help with writing a test, let me know. The failing Windows tests are not because of your change, I will propably fix them in a separate pull request, so ignore them for now. |
Yeah I don't really know how to write a test. Is there a manual or something? |
No problem. Give us some time and we will bring this forward. |
Sorry for the late response. Besides obvious changes I also merged in master branch. To verify your code change I have to first create a mini-demo which triggers the notice. Your linked file + example code didn't produce the notice on my machine (PHP 8.1.16). I then tried to create an artificial example, but for some reason it always uses UTF-8, so no luck either: pdfparser/tests/PHPUnit/Integration/FontTest.php Lines 460 to 463 in b1242d4
When running this code part (using PHPUnit) locally it should at least mention a notice. This doesn't mean I don't believe your code works. It is about the fact, that each code change must be accompanied by at least one test, which proofs (to some extent) the code works as expected. Otherwise we would risk a situation in which this library is not reliable anymore. Could you provide another example or a hint, how to create a string which triggers this notice? |
Hi @k00ni @Stasky745 ! I've just added my issue to #549 and this PR does fix it! DISCLAIMER: I also mentioned an alternate implementation as I was concerned that |
@eapacheco thank you for reporting this. I would love to merge it in right away, but I can't get my head around the fact, that my test approach doesn't work. Also, changing character-related functionality resulted in issues in the past, therefore I think its important to back such changes with test code. My spare time is to rare currently to look deeper into it, so if anyone has an idea how to test this fix, I would appreciate it! |
I tried also to fix the test, it was tricky. Is it possible, that all the characters used in your test are already included in the encodings CP1252 and UTF8 - so no warning should be thrown and it's not a problem, that there is no error? So here are two unit tests using the provided "caca.txt" file.
The first unit tests the error. The second the expected result (i'm not speaking Spanish? - i hope i got the expected result) Using PHPUnit 9.5 with configuration:
Tested on PHP 8.0.27 and 8.1.14 I have to say, i have expected a ISO-8859-1 or ISO-8859-15 encoding of the provided file, but using this encoding scheme using
-> Also mb_detect_encoding prefers 'iso-8859-1' over 'CP1252' (which is not helping) Since a while, i'm using the library forceutf8 for problems like this and i gave it a try using If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fbett thank you for the feedback.
If i'm looking at their code, there seems more things to do, to convert a (faulty?) string to UTF8. Would it probably better to use a library like this for the utf8 conversion?
Encoding and character sets is a huge field and many libraries struggle when using encoding, multi byte strings etc. Using a library which does this for us might be a good approach, but my argument against is, that it might apply unexpected string transformations and produces broken output. We should rather believe the data coming from the PDF. If this data is broken in the first place, than there is a mess one has to make the best out of it. One way could be: you can read the raw data and transform it afterwards with a library like forceUTF8. Or use another PDF generator which doesn't generate faulty PDFs.
As far as I know there is no best way here, but I am open for suggestions.
How I fixed my test?
The trick was to find a character in CP1252 which has no valid counterpart in UTF-8. In Google I found: https://stackoverflow.com/a/26330256, which mentioned the following bytes:
0x81, 0x8D, 0x8F, 0x90, 0x9D
They can't be assigned in UTF-8, which lead to the iconv-warning when used. This is backed by https://www.ibm.com/docs/en/rational-synergy/7.2.1?topic=uc-text-encoding-illegal-character-detection-tool.
Also, I had to use PHPUnit 9 to see the notice, with PHPUnit 10 there wasn't any. It seems they changed the config in some way. I wll leave this open for a few days to see if there is any feedback. Will merge it soon.
Acknowledgement: This revelation was possible because of feedback and the help of @zozlak. Thank you!
Can we get a new release? Thanks! |
In some PDFs this line crashes with:
This can be seen in issue #549
This change fixes it. According to
iconv
documentation:"If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated."