Implement MediaWiki table parsing #81
Conversation
Started adding table parsing support. This is a big commit (ugh) and it should probably be split into multiple smaller ones if possible, but that seems unworkable right now because of all the dependencies. It also breaks the CTokenizer tests (double ugh) because I haven't started table support there. May want to pick this commit apart line by line later, but I need to save my work for now.
Added another stack layer for tokenizing table cells, for styling/correctness of the implementation. Added many test cases.
Started support for parsing table style attributes. I suspect some of this is incorrect, need to add more tests to see.
Support for header cells was mostly in already, just needed minor changes. Added two tests as well.
Started styling attributes for table row and table start. Still not entirely sure about this, definitely need to make changes regarding padding.
Added support for allowing different wiki syntax for replacing the opening and closing tags. Added for table support.
Added comments and tried to keep lines to 80 characters.
Tables and rows use newlines as padding, partly because these characters are pretty important to the integrity of the table. They might need to go in the preceding whitespace of inner tags instead of as trailing padding; not sure.
Changed row recursion handling to make sure the tag is emitted even when hitting recursion limits. Need to test table recursion to make sure that works. Also fixed a bug in which tables were eating the trailing token. Added several tests for rows and trailing tokens with tables.
Removed the `StopIteration()` exception for handling table style and instead call `_handle_table_cell_end()` with a new parameter. Also added some random tests for table openings.
Make sure py tokenizer methods only call methods that have been declared earlier. Not necessary, but it makes the C tokenizer much easier to maintain/write if methods are in the same order.
Padding now included on all wiki table cells. With wiki table cells that include attributes, `wiki_markup` is also included (unchanged).
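To illustrate the design choice being described: storing the cell's original markup, attributes, and padding verbatim lets rendering reproduce the input byte-for-byte. Here is a hedged, stand-alone sketch of that idea; the field names are illustrative, not mwparserfromhell's actual `Tag` API.

```python
# Sketch: keep a cell's original markup and padding verbatim so that
# rendering round-trips the input exactly. Names are illustrative.
from dataclasses import dataclass


@dataclass
class Cell:
    wiki_markup: str   # "|", "!", "||", or "!!" as it appeared in the source
    attrs: str         # raw attribute text, or "" if the cell had none
    padding: str       # whitespace between the markup/attrs and the contents
    contents: str

    def render(self) -> str:
        if self.attrs:
            # Cells with attributes repeat the separator pipe unchanged
            return f"{self.wiki_markup}{self.attrs}|{self.padding}{self.contents}"
        return f"{self.wiki_markup}{self.padding}{self.contents}"


cell = Cell("|", ' style="color: red"', " ", "hello")
print(cell.render())  # | style="color: red"| hello
```

Because the padding and separator are stored rather than normalized, `render()` gives back exactly what was parsed, which is what makes large-scale roundtrip testing possible.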
Fix a problem in which fake table closes caused errors inside cells; changed inline table handling to address this.
Fix problem in which invalid table attributes were being parsed incorrectly. Added tests.
Various changes to avoid returning tuples - working on the C tokenizer made me realize this was a bad idea for compatibility/similarity between the two.
For C compatibility, switch table cell end to return the stack. Now context is kept by using `keep_context` when calling `self._pop()`.
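The mechanism described above can be sketched in miniature: a stack of tokenizer frames where popping a frame can optionally propagate that frame's context back to its parent. This is a simplified illustration of the described design, not the project's actual code.

```python
# Minimal sketch of a tokenizer stack where _pop() can optionally carry
# the popped frame's context back to the parent (keep_context=True).
class Tokenizer:
    def __init__(self):
        self._stacks = []  # each frame is [tokens, context]

    def _push(self, context=0):
        self._stacks.append([[], context])

    def _pop(self, keep_context=False):
        tokens, context = self._stacks.pop()
        if keep_context and self._stacks:
            self._stacks[-1][1] = context  # propagate context to the parent
        return tokens


tok = Tokenizer()
tok._push()
tok._push(context=0b100)            # e.g. a hypothetical table-cell flag
tok._stacks[-1][0].append("cell")
stack = tok._pop(keep_context=True)
print(stack, tok._stacks[-1][1])    # ['cell'] 4
```

Returning the raw stack (rather than a `(stack, context)` tuple) keeps the Python and C implementations structurally parallel, which is the compatibility point the commit is making.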
For the C tokenizer, include `<stdint.h>` and use `uint64_t` instead of `int` for context. Changes to tables mean that context can be larger than 32 bits, and it is possible for `int` to have only 16 bits anyway (though this is very unlikely).
CTokenizer is completely implemented in this commit - it didn't make much sense to me to split it up. All tests passing, memory test shows no leaks on Linux.
Okay! Good work, for starters. I'm actually really impressed that anyone managed to do this. Let's go through some various points:
Will let you know if anything comes up.
In 94a9e32, I fixed a test. Another thing: why are
I've been testing on Python 2.7, so I'm not sure what the problem with 3.4 is. On Monday, July 21, 2014, Ben Kurtovic notifications@github.com wrote:
So I noticed that roundtripping doesn't preserve the padding in some header cells:

```
>>> import mwparserfromhell
>>> text = u'{| \n ! name="foo bar" | test ||color="red"| markup!!foo | time \n|}'
>>> code = mwparserfromhell.parse(text)
>>> code == text
False
>>> print code
{|
! name="foo bar"| test ||color="red"| markup!!foo | time
|}
>>> print text
{|
! name="foo bar" | test ||color="red"| markup!!foo | time
|}
```

Perhaps this needs to be re-ordered? I'm not sure if that would break anything else.

One final thing. I'm trying to understand the tree structure you're generating here, and it doesn't seem right to me. From the same example markup above, we get this:

```
>>> import re
>>> print re.sub("\s+", " ", code.get_tree())
< table > < th name = foo bar /> test < th color = red /> markup < th foo /> time \n </ table >
```

I asked above why
Back now and working on this, here are my thoughts:

**tree structure**

This is the closest transliteration of wiki syntax to HTML table syntax, which is why I was trying to emulate it. But it doesn't make sense with the parse tree to treat them as self-closing tags, because then the table rows and table cells are separate from their contents. I think removing the self-closing bits and just setting the

**Inline cell attributes**

**Travis/Python 3.4**

Today I'll plan on going over the issues in this order:
Self-closing wiki-syntax tags had their wiki syntax and padding in the wrong order; fixed the ordering.
Sorry, I've been dead lately due to a variety of real life things. Wanted to get on this before I leave for college (which is tomorrow...) but it doesn't look like that's going to happen. I'll take a look first thing after stuff gets settled down, which I expect to be within the next two weeks.
Alright, I'm back alive again and trying to work on this. My memory is a bit fuzzy from two months ago, so I'm going to try to figure out what still needs to be done for this:
Okay! I'm reasonably happy with the code (for now), all tests are passing, and there are no obvious examples of stuff that's not covered (with the exception of

Edit: >500,000 pages tested, about 10% with tables, all roundtripped correctly so far. Going to sleep now, will merge tomorrow assuming nothing changes.

Edit 2: Noticed a bug on Natasha Kaplinsky. Will try to fix.

Edit 3: There is definitely a bug in how the code is handling closing pipes when present on the same line as the open. For example:

MediaWiki treats this as a table with a single cell "foo" (the first

Edit 4: The syntax that looks like

Edit 5: Pushed updates; will re-run roundtripping tests.

Edit 6: There's a problem with Rafael Nadal, but after several hours on this one I can't figure it out. I refactored a lot of the table error recovery code, but that doesn't seem to have fixed it. I want to blame #42, but I'm not sure if that's the real cause here.
Just saw your edits; that's very weird. I was testing against syntax on test pages, and the "close on same line as open" behavior was making valid tables, so I guess it's not supposed to. The Nadal page is definitely hitting the max cycles on my machine; maybe #42 is to blame, but I'm not sure. I can take a closer look as well this weekend.
Okay, gonna go ahead and merge this. The remaining issue re: Nadal can be dealt with separately.
Exactly 666 commits!
(#10) These tables are just substitutes for HTML tags, so I used the existing Tag node with a few changes. This is a big pull request, but I'm available to make any changes you would like to see. This also doesn't fix the issues from #55 regarding comments/templates as whitespace, or handle implicitly closing tags from #40. Because the scope of those issues is greater than just tables, I didn't implement them here.
One note: I also changed `context` in the CTokenizer to use `uint64_t` to ensure an exact width, which requires an include of `<stdint.h>`. I'm not sure how well this include works with VS2008 and VS2010, and I currently don't have access to a Windows computer, but it might be necessary in `tokenizer.h` to do something like: