Work around unicode titles not working with resuming and fix truncation when resuming #436

Pokechu22 · 2022-09-13T19:09:30Z

Before, you would get `UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal`. The `%s` versus `{}` change was needed because otherwise you would get `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)`. There is probably a better way of solving that, but this one does work.

nemobis · 2022-09-13T19:50:30Z

dumpgenerator.py

@@ -2188,7 +2188,7 @@ def resumePreviousDump(config={}, other={}):



Can't we decode already in line 2183?

As in this?

```python
for l in f:
    l = l.decode('utf-8')
    if l == '</mediawiki>':
        # ...
```

I think so, though it would probably require other changes (`u'</mediawiki>'`, and maybe the regex would need to change too; I'm not sure how that works in Python 2). It might also be possible to change `reverse_readline` to decode each line. I'd need to test: the current version of this PR is something I quickly hacked together so that I could resume, but now that I'm not in the middle of a huge download I can try more things.
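
For reference, the decode-before-compare idea discussed here can be sketched as below. This is only an illustration written for Python 3 (where `bytes` and `str` never compare equal, much as the mixed comparison was treated as unequal in the warning quoted above); it is not the code in `dumpgenerator.py`. The helper name `ends_with_closing_tag` and the in-memory sample files are assumptions made for the example.

```python
import io


def ends_with_closing_tag(binary_file, closing_tag=u'</mediawiki>'):
    """Return True if the last non-empty line of a UTF-8 file equals closing_tag.

    Hypothetical helper, for illustration only.
    """
    last = u''
    for raw in binary_file:
        # Decode first, so the comparison below is text versus text.
        line = raw.decode('utf-8').strip()
        if line:
            last = line
    return last == closing_tag


# Dump fragments with a non-ASCII title, stored as UTF-8 bytes (made-up data).
complete = io.BytesIO(
    u'<title>Discussion modèle:Exemple</title>\n</mediawiki>\n'.encode('utf-8'))
interrupted = io.BytesIO(
    u'<title>Discussion modèle:Exemple</title>\n'.encode('utf-8'))
```

Calling `ends_with_closing_tag(complete)` compares two text objects, so a non-ASCII title earlier in the file cannot affect the result of the final comparison.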

Yes, something like that. Ok, please update this PR if you manage, otherwise let's merge it like this and maybe leave a TODO comment to remember to check.

I've changed `reverse_readline`, and also fixed its truncation behavior. I tested using `python2 -u dumpgenerator.py --xml --api https://fr.wikiversity.org/w/api.php --force --namespace 11` (as there are only about 200 pages there, and the namespace is Discussion modèle, which is easy enough to read but still contains unicode) and interrupting with ^C.
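
A reverse line reader that decodes each line could look roughly like the sketch below. It is not the `reverse_readline` from `dumpgenerator.py`: the function name `reverse_lines_decoded`, the `chunk_size` parameter, and the sample data are assumptions for illustration, and it is written for Python 3.

```python
import io
import os


def reverse_lines_decoded(binary_file, chunk_size=4096, encoding='utf-8'):
    """Yield the lines of a binary file from last to first, decoded to text.

    Hypothetical sketch. A file that ends with a newline yields an empty
    string first, because there is an empty "line" after that newline.
    """
    binary_file.seek(0, os.SEEK_END)
    position = binary_file.tell()
    tail = b''  # start of a line whose beginning has not been read yet
    while position > 0:
        read_size = min(chunk_size, position)
        position -= read_size
        binary_file.seek(position)
        chunk = binary_file.read(read_size) + tail
        lines = chunk.split(b'\n')
        # The first piece may be incomplete: keep it for the next iteration.
        tail = lines.pop(0)
        for raw in reversed(lines):
            # Only complete lines are decoded, so a multi-byte character
            # split across two chunks is never decoded in halves.
            yield raw.decode(encoding)
    yield tail.decode(encoding)


# Made-up sample; the small chunk size forces reads that split the "è".
sample = io.BytesIO(
    u'<title>Discussion modèle:Exemple</title>\n</page>\n'.encode('utf-8'))
recovered = list(reverse_lines_decoded(sample, chunk_size=8))
```

Decoding only after a full line has been assembled is the design choice that matters here: decoding raw chunks directly could fail whenever a chunk boundary falls inside a UTF-8 sequence.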

There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function. Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
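
The truncation rule described above (cut back to the last `</page>` tag, and cut nothing if the file already ends with one) can be sketched as follows. This is a simplified illustration, not the PR's implementation: the helper name `truncate_after_last_page` is hypothetical, it reads the whole file into memory rather than scanning from the end, and leaving the file untouched when no `</page>` line is found is an assumption of this sketch.

```python
import os
import tempfile


def truncate_after_last_page(path, closing_tag=b'</page>'):
    """Cut the file right after the last line consisting of closing_tag.

    Returns the number of bytes removed. Hypothetical helper: it loads the
    whole file and leaves it unchanged if the tag never appears.
    """
    with open(path, 'r+b') as f:
        data = f.read()
        offset = 0
        keep = None
        for line in data.splitlines(True):  # True keeps the line endings
            offset += len(line)
            if line.strip() == closing_tag:
                keep = offset  # byte offset just past this </page> line
        if keep is None:
            return 0
        f.truncate(keep)
        return len(data) - keep


# Made-up dump that was interrupted in the middle of the second page.
with tempfile.NamedTemporaryFile(delete=False, suffix='.xml') as tmp:
    tmp.write(u'<page>\n<title>Modèle</title>\n</page>\n'
              u'<page>\n<title>Inter'.encode('utf-8'))
demo_path = tmp.name
removed = truncate_after_last_page(demo_path)
with open(demo_path, 'rb') as f:
    remaining = f.read().decode('utf-8')
os.remove(demo_path)
```

After the call, `remaining` ends with the complete first page, so a resumed run can append the next page without leaving a half-written `<page>` element in the file.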

nemobis reviewed Sep 13, 2022

Pokechu22 added 2 commits September 16, 2022 22:15

Pokechu22 force-pushed the unicode-resume branch from 410080c to 9b2c6e4 Compare September 17, 2022 05:20

Pokechu22 changed the title ~~Work around unicode titles not working with resuming~~ Work around unicode titles not working with resuming and fix truncation when resuming Sep 17, 2022

nemobis merged commit 9808279 into WikiTeam:master Sep 17, 2022

yzqzss mentioned this pull request Jan 3, 2023

Try to keep up with upstream, and other improvements. (Part 1) mediawiki-client-tools/mediawiki-dump-generator#49

Merged

Pokechu22 commented Sep 13, 2022

nemobis Sep 13, 2022

Pokechu22 Sep 13, 2022

nemobis Sep 14, 2022

Pokechu22 Sep 17, 2022

nemobis Sep 17, 2022
