-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work around unicode titles not working with resuming and fix truncation when resuming #436
Conversation
dumpgenerator.py
Outdated
@@ -2188,7 +2188,7 @@ def resumePreviousDump(config={}, other={}): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we decode already in line 2183?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in this?
for l in f:
l = l.decode('utf-8')
if l == '</mediawiki>':
# ...
I think so, though it probably would require other changes (u'</mediawiki>'
, and maybe the regex would need to be changed too - I'm not sure how that works in python 2). It might also be possible to change reverse_readline
to decode each line. I'd need to test; the current version of this PR is something I quickly hacked together to resume, but now that I'm not in the middle of a huge download I can try more things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something like that. Ok, please update this PR if you manage, otherwise let's merge it like this and maybe leave a TODO comment to remember to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed reverse_readline
, and also fixed its truncation behavior. I tested using python2 -u dumpgenerator.py --xml --api https://fr.wikiversity.org/w/api.php --force --namespace 11
(as there are only about 200 pages there, and the namespace is Discussion modèle
which is easy enough to read but still contains unicode) and interrupting with ^C.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function. Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
410080c
to
9b2c6e4
Compare
Before, you would get
UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
. The%s
versus{}
change was needed because otherwise you would getUnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
. There is probably a better way of solving that, but this one does work.