Missing characters in extracted body #1360

Closed
lucc opened this issue Dec 24, 2018 · 7 comments

@lucc
Collaborator

lucc commented Dec 24, 2018

For some combinations of content transfer encoding and charset, some characters are missing in the extracted body. An example can be seen in the test added in #1359. As that test demonstrates, the problem seems to be somewhere below alot.db.utils.extract_body(). We should try to find more relevant combinations of encoding and charset, add tests, and ultimately fix it :)

Software Versions

  • Python version: 3.7
  • Alot version: 0.8-4-g9ff655ba
@josch
Contributor

josch commented Jan 5, 2019

I was about to provide a test case for the issue I saw but then I found tests.db.utils_test.TestExtractBody.test_simple_utf8_file and the email that it reads tests/static/mail/utf8.eml looks just like the kind that produces the problems in my case. So why is that test case marked as @unittest.expectedFailure?

@lucc
Collaborator Author

lucc commented Jan 6, 2019

It is marked as an expected failure because the bug is known and we have not fixed it yet. I created this test after the bug was reported, in order to help us fix it. See #1359, which I mentioned above.
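For readers unfamiliar with the marker, here is a minimal sketch of how `@unittest.expectedFailure` records a known bug without breaking the test run. The function `extract_text` is a hypothetical stand-in with a deliberately lossy decode, not alot's actual `extract_body`, and the test name is made up for illustration.

```python
import io
import unittest


def extract_text(raw: bytes) -> str:
    # Hypothetical stand-in for the buggy code path (NOT alot's real code):
    # it silently drops every byte that is not valid ASCII.
    return raw.decode("ascii", errors="ignore")


class TestExtractText(unittest.TestCase):
    @unittest.expectedFailure
    def test_utf8_body(self):
        # This assertion fails while the bug exists, and the runner reports
        # it as an expected failure instead of a regular failure.
        expected = "Liebe Grüße"
        self.assertEqual(extract_text(expected.encode("utf-8")), expected)


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExtractText)
    # Write the report to an in-memory stream to keep the output quiet.
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_suite()
```

Once the bug is fixed, the same test is reported as an unexpected success, which signals that the decorator should be removed.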

If you want to provide more test cases that is very much appreciated. So if you have any other combinations of encoding and charset I am very interested.

@josch
Contributor

josch commented Jan 6, 2019

Nope, that is exactly the combination that fails for me as well.

@jljusten
Contributor

jljusten commented Feb 4, 2019

@lucc: Hey. I think you've been the only one to look into this issue, but I guess it's stalled a bit?

We're holding off on packaging 0.8 for Debian due to this issue. If a fix is found, we would probably cherry-pick the patch back onto the 0.8 release.

mjg added a commit to mjg/alot that referenced this issue Feb 4, 2019
176cffc ("refactor alot.db.utils.remove_cte", 2018-12-04) created a few
problems with 8bit quoted-printable e-mails, see pazz#1291 pazz#1360.

This commit restores the old libmagic fallback which did not cause this
problem.
@mjg
Contributor

mjg commented Feb 4, 2019

The problem bisects back to commit 176cffc ("refactor alot.db.utils.remove_cte", 2018-12-04) which claims to refactor and make the code more lenient.

In fact, it does one more thing: it changes the fallback for the case of decoding errors to bp.decode(enc, errors='ignore') rather than helper.try_decode(bp). Whether this is more lenient or not is a matter of definition; in any case it's the cause of the missing characters. I don't know if this change is intentional and maybe improves other things. The "fix" for me is a one-liner, obviously, but may restore behaviour that this commit wanted to change. See PR #1375
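To illustrate why `errors='ignore'` leads to missing characters, here is a minimal sketch. It assumes a body that is really UTF-8 but gets decoded with a mismatched charset (ASCII here, chosen only for the example). The helper `fallback_decode` is hypothetical: it tries a fixed list of candidate encodings and is not alot's `helper.try_decode`, which per the commit message above relies on libmagic.

```python
def lossy_decode(data: bytes, enc: str) -> str:
    # errors='ignore' never raises; undecodable bytes are silently dropped.
    return data.decode(enc, errors="ignore")


def fallback_decode(data: bytes, candidates=("utf-8", "latin-1")) -> str:
    # Hypothetical fallback: try each candidate encoding strictly and
    # return the first result that decodes without error.
    for enc in candidates:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so this is only reached when the caller
    # passes a candidate list without such a catch-all.
    return data.decode("utf-8", errors="replace")


raw = "Grüße aus Köln".encode("utf-8")

print(lossy_decode(raw, "ascii"))  # non-ASCII characters vanish
print(fallback_decode(raw))        # text survives intact
```

The lossy variant produces no error and no replacement marker, so the only visible symptom is text with characters missing, which matches what this issue reports.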

In general, I'd suggest refactoring in one commit and changing behaviour in a separate one...

pazz pushed a commit that referenced this issue Feb 7, 2019
176cffc ("refactor alot.db.utils.remove_cte", 2018-12-04) created a few
problems with 8bit quoted-printable e-mails, see #1291 #1360.

This commit restores the old libmagic fallback which did not cause this
problem.
@pazz
Owner

pazz commented Feb 10, 2019

I guess this can be closed then?

@josch
Contributor

josch commented Feb 10, 2019

Yes, it is fixed by PR #1375.

@pazz pazz closed this as completed Feb 11, 2019
ryneeverett pushed a commit to ryneeverett/alot that referenced this issue Mar 13, 2020
176cffc ("refactor alot.db.utils.remove_cte", 2018-12-04) created a few
problems with 8bit quoted-printable e-mails, see pazz#1291 pazz#1360.

This commit restores the old libmagic fallback which did not cause this
problem.