The tokenizer incorrectly handles some difficult tag-related markup #40
Comments
Regarding (1), a line from MediaWiki's source:
|
Also, this.
|
So it seems italics/bold can't cross links but can cross templates. I need to figure out exactly which nodes are restrictive.
Hi! There seems to be a case you've missed. Bold (and italics I guess) are implicitly closed when wikitable cells end. E.g. http://wiki.teamliquid.net/starcraft2/index.php?title=2014_WCS_Season_1_Europe/Premier&oldid=687367
gives
|
Hmm... yeah, that's tough because the parser doesn't understand tables yet. I'll need to add that before this is fixable. |
Pulling in a workaround from #80: @earwig suggested passing To get this feature, I had to track the development version on github rather than the released version on PyPI. Here's the line from my
|
Most of this is going to require an overhaul of how parsing is done (I finally have an idea how I'm going to do it, but it'll be a lot of work)... so pushing this back as the main task for v1.0. |
Consider this wikitext:
MediaWiki 1.26 parses this as
which suggests that style markup cannot span across multiple lines. mwparserfromhell does this the hard/old? way:
|
Oh joy. |
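To make the per-line behaviour discussed above concrete, here is a minimal stdlib-only sketch that flags lines whose `''`/`'''` markup is still open at the end of the line. The helper name `unclosed_style_lines` is hypothetical, and the apostrophe handling is deliberately simplified: it is not MediaWiki's actual algorithm and does not reproduce its correction heuristics for odd apostrophe counts.

```python
import re


def unclosed_style_lines(wikitext):
    """Return 1-based numbers of lines that end with italics or bold open.

    Simplified assumption: each line is handled independently, and a run
    of apostrophes toggles italics (2), bold (3 or 4), or both (5+).
    """
    flagged = []
    for number, line in enumerate(wikitext.split("\n"), start=1):
        italic = bold = False
        for run in re.findall(r"'{2,}", line):
            n = len(run)
            if n == 2:
                italic = not italic
            elif n in (3, 4):
                bold = not bold
            else:
                italic = not italic
                bold = not bold
        if italic or bold:
            flagged.append(number)
    return flagged


print(unclosed_style_lines("''foo\nbar''\n''ok''"))
```

Lines reported by a check like this are candidates for markup that MediaWiki would close implicitly at the line break, which is where a tokenizer that lets style tags span lines might disagree with it.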
The attached file is a reduced version of https://en.wikipedia.org/w/index.php?title=Almond&oldid=706024513. I'd like to reduce it more, but any structural change anywhere in the text makes the problem disappear, so I don't know if this is actually an instance of this bug.

The initial table is parsed correctly, subject to point 2 above, i.e. the unclosed <small> and <center> tags are returned as plain text. But everything after the table is returned as plain text too, with the exception of headings and lists. For example:

=== Almond flour and skins === \n[[Almond flour]] is often used as a [[gluten-free]] alternative to wheat flour

Replicating the initial line, like this:

{|
|-
| Production<small>(million tonnes)
|-
| Production<small>(million tonnes)
|-
| {{flag|USA}} || style="text-align:center;"|<center> 1.8
|-

results in the rest of the table not being parsed either:

<table> <tr> <td> Production<small>(million tonnes)\n </td> </tr> |-\n| Production<small>(million tonnes)\n|-\n| {{flag|USA}} || style="text-align:center;"|<center> 1.8\n|-\n| {{flag|Australia}} || style="text-align:center;"|<center> 0.16\n|-\n| {{flag|Spain}} || style="text-align:center;" |<center> 0.15\n|-\n| {{flag|Morocco}} || style="text-align:center;"|<center> 0.1\n|-\n| {{flag|Iran}} || style="text-align:center;"|<center> 0.09\n|-\n!'''World''' !! style="text-align:center;"|<center> '''2.92'''\n </table>
Here's a really weird example from https://fr.wikipedia.org/w/index.php?title=Opposition_p%C3%A9rih%C3%A9lique&oldid=112493222 :
With the template interrupted by the end of the image context, MediaWiki appears to actually invoke the template twice in order to achieve the author's (presumed) intention. |
In reply to #148: when a table is placed in one section of a page, the parser doesn't see templates in the other sections. Could the parser recognize a heading ("== ==") as a secondary end-of-table marker?
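As a rough illustration of the heading idea above, the following stdlib-only sketch splits wikitext at heading lines so that each section could be handed to the parser separately. The names `HEADING_RE` and `split_sections` are hypothetical, and this is a caller-side workaround sketch rather than anything the library does itself. It will also split at heading-like lines inside comments or `<nowiki>` blocks, which a real fix would need to avoid.

```python
import re

# A heading line such as "== Title ==": 2 to 6 equals signs on each side.
HEADING_RE = re.compile(r"^(={2,6})[^\n]*?\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Split wikitext into chunks, each starting at a heading line.

    Text before the first heading becomes its own chunk. Joining the
    chunks reproduces the input exactly.
    """
    starts = [m.start() for m in HEADING_RE.finditer(wikitext)]
    bounds = [0] + [s for s in starts if s != 0] + [len(wikitext)]
    return [wikitext[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


text = "intro {{a}}\n== One ==\n{|\n| cell\n== Two ==\n{{b}}\n"
for chunk in split_sections(text):
    print(repr(chunk))
```

Passing each chunk to `mwparserfromhell.parse` on its own should keep an unclosed table in one section from swallowing the templates in the next, at the cost of losing any construct that legitimately spans a heading.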
Other weird ones with malformed italics in templates:

import mwparserfromhell

mwparserfromhell.parse("{{foo|''bar}} {{foo|bar''}}").filter_templates()
# => ["{{foo|''bar}}", "{{foo|bar''}}"]
mwparserfromhell.parse("{{foo|''bar}} ''...'' {{foo|bar''}}").filter_templates()
# => ["{{foo|bar''}}"]
mwparserfromhell.parse("{{foo|''bar}} ''").filter_templates()
# => []
mwparserfromhell.parse("{{foo|''bar}} ''bar''").filter_templates()
# => []
mwparserfromhell.parse("{{foo|''bar}}").filter_templates()
# => ["{{foo|''bar}}"]
''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
[ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).
1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.