Implement MediaWiki table parsing #81

Merged on Oct 24, 2014 (45 commits)
a8d2983 Started table parsing in PyTokenizer (davidswinegar, Jul 14, 2014)
b7e40d7 Table cells now recurse (davidswinegar, Jul 14, 2014)
a13bc94 Started table cell attribute support (davidswinegar, Jul 15, 2014)
0bba69d Added tests/support for header cells (davidswinegar, Jul 15, 2014)
9f159ec Add table start/row start style attribute support (davidswinegar, Jul 15, 2014)
d356a57 Added closing_wiki_markup support to Tag node (davidswinegar, Jul 15, 2014)
9e4bb0c Clean up and style changes (davidswinegar, Jul 15, 2014)
ec08001 Tables and rows now use newline as padding (davidswinegar, Jul 16, 2014)
f1664a8 Updated row and table handling (davidswinegar, Jul 16, 2014)
842af20 fixed hacky table cell style exception, added tests (davidswinegar, Jul 16, 2014)
ddaa3ec Reorder table tokenizer methods for forward declaration (davidswinegar, Jul 16, 2014)
457b224 Add padding to table cell tags (davidswinegar, Jul 16, 2014)
8b5d6f9 Changes to table close handling (davidswinegar, Jul 16, 2014)
151a73e Fix issue with incorrect table attributes (davidswinegar, Jul 16, 2014)
e6ec5dc Refactor methods to avoid returning tuples (davidswinegar, Jul 17, 2014)
406dd3a All tokenizer end methods return a stack (davidswinegar, Jul 17, 2014)
2d945b3 Use uint64_t for context (davidswinegar, Jul 17, 2014)
0128b1f Implement CTokenizer for tables (davidswinegar, Jul 19, 2014)
94a9e32 Add missing comma to test output. (earwig, Jul 21, 2014)
7bbeb68 Fix ordering of tag representation (davidswinegar, Jul 22, 2014)
64869fe Remove style test (davidswinegar, Jul 22, 2014)
213c105 Table tags are no longer self-closing (davidswinegar, Jul 22, 2014)
1b3e3c3 Change wiki tags to use style separators (davidswinegar, Jul 22, 2014)
c631080 Fix C code to make declarations before statements (davidswinegar, Jul 22, 2014)
8dc70bc Add test coverage (davidswinegar, Jul 22, 2014)
c802b1f Change context to uint64_t (davidswinegar, Jul 25, 2014)
1a4c88e Correctly handle no table endings (davidswinegar, Jul 25, 2014)
e446c51 Adjust table test labels for consistency. (earwig, Oct 19, 2014)
b7c46a6 Add tables to changelog. (earwig, Oct 20, 2014)
bd85805 Add integration tests for token roundtripping. (earwig, Oct 20, 2014)
67c2365 Merge branch 'develop' into feature/tables (earwig, Oct 20, 2014)
7489253 Break at 80 cols for most lines. (earwig, Oct 20, 2014)
92cf8f2 Add a couple more tests involving templates. (earwig, Oct 22, 2014)
c638746 Add a test for tokenizer line 1384. (earwig, Oct 22, 2014)
457355d Remove try/except that is impossible to fail inside of. (earwig, Oct 22, 2014)
5d29bff Remove an incorrect usage of Py_XDECREF(). (earwig, Oct 23, 2014)
504b8ba Add test code for a missing branch of Tag.wiki_markup.setter; cleanup. (earwig, Oct 23, 2014)
913ff59 Cleanup; add a missing test. (earwig, Oct 23, 2014)
e1ebb59 Ensure token list is copied before being fed to the builder. (earwig, Oct 23, 2014)
640005d Tokenizer cleanup; make inline table syntax invalid as it should be. (earwig, Oct 24, 2014)
4d40459 Update table tests to reflect new grammar. (earwig, Oct 24, 2014)
fb26145 Port tokenizer updates to C. (earwig, Oct 24, 2014)
8480381 Credit for table parsing code. [skip ci] (earwig, Oct 24, 2014)
9fc4b90 Refactor a lot of table error recovery code. (earwig, Oct 24, 2014)
a15f617 Minor bugfix. (earwig, Oct 24, 2014)
1 change: 1 addition & 0 deletions CHANGELOG
@@ -2,6 +2,7 @@ v0.4 (unreleased):

 - The parser is now distributed with Windows binaries, fixing an issue that
   prevented Windows users from using the C tokenizer.
+- Added support for parsing wikicode tables (patches by David Winegar).
 - Added a script to test for memory leaks in scripts/memtest.py.
 - Added a script to do releases in scripts/release.sh.
 - skip_style_tags can now be passed to mwparserfromhell.parse() (previously,
1 change: 1 addition & 0 deletions docs/changelog.rst
@@ -9,6 +9,7 @@ Unreleased

 - The parser is now distributed with Windows binaries, fixing an issue that
   prevented Windows users from using the C tokenizer.
+- Added support for parsing wikicode tables (patches by David Winegar).
 - Added a script to test for memory leaks in :file:`scripts/memtest.py`.
 - Added a script to do releases in :file:`scripts/release.sh`.
 - *skip_style_tags* can now be passed to :func:`mwparserfromhell.parse()
2 changes: 1 addition & 1 deletion mwparserfromhell/definitions.py
@@ -52,7 +52,7 @@

 # [mediawiki/core.git]/includes/Sanitizer.php @ 87a0aef762
 SINGLE_ONLY = ["br", "hr", "meta", "link", "img"]
-SINGLE = SINGLE_ONLY + ["li", "dt", "dd"]
+SINGLE = SINGLE_ONLY + ["li", "dt", "dd", "th", "td", "tr"]

 MARKUP_TO_HTML = {
     "#": "li",
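
For context, a minimal sketch of how a lookup against the widened SINGLE list could behave. The two lists are copied from the diff; the helper functions are written for this note and are an assumption about usage, not a copy of the library's code.

```python
# Lists copied from the diff; the helpers below are illustrative assumptions.
SINGLE_ONLY = ["br", "hr", "meta", "link", "img"]
SINGLE = SINGLE_ONLY + ["li", "dt", "dd", "th", "td", "tr"]


def is_single(tag):
    """Return True if *tag* may appear without a closing tag."""
    return tag.lower() in SINGLE


def is_single_only(tag):
    """Return True if *tag* can never take a closing tag."""
    return tag.lower() in SINGLE_ONLY
```

With this change the table row and cell tags count as "may be single" but not "single only", so a cell that is closed implicitly by the next cell is still acceptable.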
61 changes: 54 additions & 7 deletions mwparserfromhell/nodes/tag.py
@@ -35,7 +35,8 @@ class Tag(Node):

     def __init__(self, tag, contents=None, attrs=None, wiki_markup=None,
                  self_closing=False, invalid=False, implicit=False, padding="",
-                 closing_tag=None):
+                 closing_tag=None, wiki_style_separator=None,
+                 closing_wiki_markup=None):
         super(Tag, self).__init__()
         self._tag = tag
         if contents is None and not self_closing:
@@ -52,13 +53,28 @@ def __init__(self, tag, contents=None, attrs=None, wiki_markup=None,
             self._closing_tag = closing_tag
         else:
             self._closing_tag = tag
+        self._wiki_style_separator = wiki_style_separator
+        if closing_wiki_markup is not None:
+            self._closing_wiki_markup = closing_wiki_markup
+        elif wiki_markup and not self_closing:
+            self._closing_wiki_markup = wiki_markup
+        else:
+            self._closing_wiki_markup = None

     def __unicode__(self):
         if self.wiki_markup:
+            if self.attributes:
+                attrs = "".join([str(attr) for attr in self.attributes])
+            else:
+                attrs = ""
+            padding = self.padding or ""
+            separator = self.wiki_style_separator or ""
+            close = self.closing_wiki_markup or ""
             if self.self_closing:
-                return self.wiki_markup
+                return self.wiki_markup + attrs + padding + separator
             else:
-                return self.wiki_markup + str(self.contents) + self.wiki_markup
+                return self.wiki_markup + attrs + padding + separator + \
+                       str(self.contents) + close

         result = ("</" if self.invalid else "<") + str(self.tag)
         if self.attributes:
@@ -73,10 +89,10 @@ def __unicode__(self):
     def __children__(self):
         if not self.wiki_markup:
             yield self.tag
-            for attr in self.attributes:
-                yield attr.name
-                if attr.value is not None:
-                    yield attr.value
+        for attr in self.attributes:
+            yield attr.name
+            if attr.value is not None:
+                yield attr.value
         if self.contents:
             yield self.contents
         if not self.self_closing and not self.wiki_markup and self.closing_tag:
@@ -174,6 +190,27 @@ def closing_tag(self):
         """
         return self._closing_tag

+    @property
+    def wiki_style_separator(self):
+        """The separator between the padding and content in a wiki markup tag.
+
+        Essentially the wiki equivalent of the TagCloseOpen.
+        """
+        return self._wiki_style_separator
+
+    @property
+    def closing_wiki_markup(self):
+        """The wikified version of the closing tag to show instead of HTML.
+
+        If set to a value, this will be displayed instead of the close tag
+        brackets. If tag is :attr:`self_closing` is ``True`` then this is not
+        displayed. If :attr:`wiki_markup` is set and this has not been set, this
+        is set to the value of :attr:`wiki_markup`. If this has been set and
+        :attr:`wiki_markup` is set to a ``False`` value, this is set to
+        ``None``.
+        """
+        return self._closing_wiki_markup
+
     @tag.setter
     def tag(self, value):
         self._tag = self._closing_tag = parse_anything(value)
@@ -185,6 +222,8 @@ def contents(self, value):
     @wiki_markup.setter
     def wiki_markup(self, value):
         self._wiki_markup = str(value) if value else None
+        if not value or not self.closing_wiki_markup:
+            self._closing_wiki_markup = self._wiki_markup

     @self_closing.setter
     def self_closing(self, value):
@@ -212,6 +251,14 @@ def padding(self, value):
     def closing_tag(self, value):
         self._closing_tag = parse_anything(value)

+    @wiki_style_separator.setter
+    def wiki_style_separator(self, value):
+        self._wiki_style_separator = str(value) if value else None
+
+    @closing_wiki_markup.setter
+    def closing_wiki_markup(self, value):
+        self._closing_wiki_markup = str(value) if value else None
+
     def has(self, name):
         """Return whether any attribute in the tag has the given *name*.

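
To illustrate the rendering rules this diff adds to Tag.__unicode__, here is a minimal self-contained sketch. WikiTagSketch is a hypothetical stand-in written for this note, not the library's Tag class: attributes are passed as pre-rendered text, and the sample cell and table values are illustrative rather than output captured from the tokenizer.

```python
class WikiTagSketch:
    """Hypothetical stand-in for the wiki-markup branch of Tag.__unicode__."""

    def __init__(self, wiki_markup, contents="", attrs="", padding="",
                 self_closing=False, wiki_style_separator=None,
                 closing_wiki_markup=None):
        self.wiki_markup = wiki_markup
        self.contents = contents
        self.attrs = attrs  # pre-rendered attribute text (an assumption)
        self.padding = padding
        self.self_closing = self_closing
        self.wiki_style_separator = wiki_style_separator
        # Same defaulting rule as the patched __init__ in the diff.
        if closing_wiki_markup is not None:
            self.closing_wiki_markup = closing_wiki_markup
        elif wiki_markup and not self_closing:
            self.closing_wiki_markup = wiki_markup
        else:
            self.closing_wiki_markup = None

    def __str__(self):
        separator = self.wiki_style_separator or ""
        close = self.closing_wiki_markup or ""
        head = self.wiki_markup + self.attrs + self.padding + separator
        if self.self_closing:
            return head
        return head + str(self.contents) + close


# Bold text: the closing markup defaults to the opening markup.
bold = WikiTagSketch("'''", contents="text")

# A styled cell: "|" opens it, a second "|" separates style from content,
# and an empty string means "no explicit closing markup".
cell = WikiTagSketch("|", contents=" foo", attrs=' style="color:red"',
                     wiki_style_separator="|", closing_wiki_markup="")

# A table whose closing markup differs from its opening markup.
table = WikiTagSketch("{|", contents="\n" + str(cell) + "\n",
                      closing_wiki_markup="|}")
```

The point of the separate closing_wiki_markup is visible in the table case: before this change the wiki-markup branch reused wiki_markup on both sides, which cannot express a `{|` ... `|}` pair.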
7 changes: 6 additions & 1 deletion mwparserfromhell/parser/builder.py
@@ -249,20 +249,24 @@ def _handle_tag(self, token):
         close_tokens = (tokens.TagCloseSelfclose, tokens.TagCloseClose)
         implicit, attrs, contents, closing_tag = False, [], None, None
         wiki_markup, invalid = token.wiki_markup, token.invalid or False
+        wiki_style_separator, closing_wiki_markup = None, wiki_markup
         self._push()
         while self._tokens:
             token = self._tokens.pop()
             if isinstance(token, tokens.TagAttrStart):
                 attrs.append(self._handle_attribute(token))
             elif isinstance(token, tokens.TagCloseOpen):
+                wiki_style_separator = token.wiki_markup
                 padding = token.padding or ""
                 tag = self._pop()
                 self._push()
             elif isinstance(token, tokens.TagOpenClose):
+                closing_wiki_markup = token.wiki_markup
                 contents = self._pop()
                 self._push()
             elif isinstance(token, close_tokens):
                 if isinstance(token, tokens.TagCloseSelfclose):
+                    closing_wiki_markup = token.wiki_markup
                     tag = self._pop()
                     self_closing = True
                     padding = token.padding or ""
@@ -271,7 +275,8 @@ def _handle_tag(self, token):
                     self_closing = False
                     closing_tag = self._pop()
                 return Tag(tag, contents, attrs, wiki_markup, self_closing,
-                           invalid, implicit, padding, closing_tag)
+                           invalid, implicit, padding, closing_tag,
+                           wiki_style_separator, closing_wiki_markup)
             else:
                 self._write(self._handle_token(token))
         raise ParserError("_handle_tag() missed a close token")
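
A rough sketch of which token supplies which piece of wiki markup in the patched _handle_tag(). The Token namedtuple and collect_wiki_markup() are hypothetical simplifications written for this note; the real builder works on token classes from mwparserfromhell.parser.tokens and also collects attributes, padding, and contents. The sample token streams are assumed for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified token: only the two fields this sketch needs.
Token = namedtuple("Token", ["kind", "wiki_markup"])


def collect_wiki_markup(tokens):
    """Return (wiki_markup, wiki_style_separator, closing_wiki_markup)."""
    stream = list(tokens)
    wiki_markup = stream[0].wiki_markup
    # Same starting values as the patched _handle_tag().
    separator, closing = None, wiki_markup
    for token in stream[1:]:
        if token.kind == "TagCloseOpen":
            separator = token.wiki_markup
        elif token.kind == "TagOpenClose":
            closing = token.wiki_markup
        elif token.kind == "TagCloseSelfclose":
            closing = token.wiki_markup
            break
        elif token.kind == "TagCloseClose":
            break
    return wiki_markup, separator, closing


table_tokens = [
    Token("TagOpenOpen", "{|"),
    Token("TagCloseOpen", None),
    Token("TagOpenClose", "|}"),
    Token("TagCloseClose", None),
]
cell_tokens = [
    Token("TagOpenOpen", "|"),
    Token("TagCloseOpen", "|"),
    Token("TagOpenClose", ""),
    Token("TagCloseClose", None),
]
```

Under these assumptions the opening token carries the tag's own markup, TagCloseOpen carries the style separator, and TagOpenClose (or TagCloseSelfclose) carries the closing markup.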
24 changes: 22 additions & 2 deletions mwparserfromhell/parser/contexts.py
@@ -90,6 +90,15 @@
     * :const:`FAIL_ON_RBRACE`
     * :const:`FAIL_ON_EQUALS`

+* :const:`TABLE`
+
+    * :const:`TABLE_OPEN`
+    * :const:`TABLE_CELL_OPEN`
+    * :const:`TABLE_CELL_STYLE`
+    * :const:`TABLE_TD_LINE`
+    * :const:`TABLE_TH_LINE`
+    * :const:`TABLE_CELL_LINE_CONTEXTS`
+
 Global contexts:

 * :const:`GL_HEADING`
@@ -155,15 +164,26 @@
 SAFETY_CHECK = (HAS_TEXT + FAIL_ON_TEXT + FAIL_NEXT + FAIL_ON_LBRACE +
                 FAIL_ON_RBRACE + FAIL_ON_EQUALS)

+TABLE_OPEN = 1 << 30
+TABLE_CELL_OPEN = 1 << 31
+TABLE_CELL_STYLE = 1 << 32
+TABLE_ROW_OPEN = 1 << 33
+TABLE_TD_LINE = 1 << 34
+TABLE_TH_LINE = 1 << 35
+TABLE_CELL_LINE_CONTEXTS = TABLE_TD_LINE + TABLE_TH_LINE + TABLE_CELL_STYLE
+TABLE = (TABLE_OPEN + TABLE_CELL_OPEN + TABLE_CELL_STYLE + TABLE_ROW_OPEN +
+         TABLE_TD_LINE + TABLE_TH_LINE)
+
 # Global contexts:

 GL_HEADING = 1 << 0

 # Aggregate contexts:

-FAIL = TEMPLATE + ARGUMENT + WIKILINK + EXT_LINK_TITLE + HEADING + TAG + STYLE
+FAIL = (TEMPLATE + ARGUMENT + WIKILINK + EXT_LINK_TITLE + HEADING + TAG +
+        STYLE + TABLE)
 UNSAFE = (TEMPLATE_NAME + WIKILINK_TITLE + EXT_LINK_TITLE +
           TEMPLATE_PARAM_KEY + ARGUMENT_NAME + TAG_CLOSE)
-DOUBLE = TEMPLATE_PARAM_KEY + TAG_CLOSE
+DOUBLE = TEMPLATE_PARAM_KEY + TAG_CLOSE + TABLE_ROW_OPEN
 NO_WIKILINKS = TEMPLATE_NAME + ARGUMENT_NAME + WIKILINK_TITLE + EXT_LINK_URI
 NO_EXT_LINKS = TEMPLATE_NAME + ARGUMENT_NAME + WIKILINK_TITLE + EXT_LINK
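
The new contexts are single-bit flags, so aggregates can be built with `+` and membership tested with `&`. A small sketch of that, with the flag values copied from the diff; in_table() and fits_in_32_bits() are hypothetical helpers written for this note. It also shows why the flags from TABLE_CELL_STYLE upward need more than 32 bits, which is consistent with the "Use uint64_t for context" commits in the list above.

```python
# Flag values copied from the diff.
TABLE_OPEN = 1 << 30
TABLE_CELL_OPEN = 1 << 31
TABLE_CELL_STYLE = 1 << 32
TABLE_ROW_OPEN = 1 << 33
TABLE_TD_LINE = 1 << 34
TABLE_TH_LINE = 1 << 35
TABLE_CELL_LINE_CONTEXTS = TABLE_TD_LINE + TABLE_TH_LINE + TABLE_CELL_STYLE
TABLE = (TABLE_OPEN + TABLE_CELL_OPEN + TABLE_CELL_STYLE + TABLE_ROW_OPEN +
         TABLE_TD_LINE + TABLE_TH_LINE)


def in_table(context):
    """Hypothetical helper: is any table-related bit set in *context*?"""
    return bool(context & TABLE)


def fits_in_32_bits(flag):
    """Hypothetical helper: would *flag* fit in an unsigned 32-bit integer?"""
    return flag < (1 << 32)


# A context for "inside a table, inside a cell, on a td line".
context = TABLE_OPEN | TABLE_CELL_OPEN | TABLE_TD_LINE

# Leaving the cell line clears only the per-line bits.
after_newline = context & ~TABLE_CELL_LINE_CONTEXTS
```

Adding flags with `+` is only safe because each flag occupies its own bit; if a flag were ever listed twice in an aggregate, `|` would be the safer operator.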