Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around unicode titles not working with resuming and fix truncation when resuming #436

Merged
merged 2 commits into from
Sep 17, 2022

Conversation

Pokechu22
Copy link
Contributor

Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.

dumpgenerator.py Outdated
@@ -2188,7 +2188,7 @@ def resumePreviousDump(config={}, other={}):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we decode already in line 2183?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in this?

            for l in f:
                l = l.decode('utf-8')
                if l == '</mediawiki>':
                # ...

I think so, though it probably would require other changes (u'</mediawiki>', and maybe the regex would need to be changed too - I'm not sure how that works in python 2). It might also be possible to change reverse_readline to decode each line. I'd need to test; the current version of this PR is something I quickly hacked together to resume, but now that I'm not in the middle of a huge download I can try more things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something like that. Ok, please update this PR if you manage, otherwise let's merge it like this and maybe leave a TODO comment to remember to check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed reverse_readline, and also fixed its truncation behavior. I tested using python2 -u dumpgenerator.py --xml --api https://fr.wikiversity.org/w/api.php --force --namespace 11 (as there are only about 200 pages there, and the namespace is Discussion modèle which is easy enough to read but still contains unicode) and interrupting with ^C.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function.

Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
@Pokechu22 Pokechu22 changed the title Work around unicode titles not working with resuming Work around unicode titles not working with resuming and fix truncation when resuming Sep 17, 2022
@nemobis nemobis merged commit 9808279 into WikiTeam:master Sep 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants