Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient iteration and replacement of nodes #195

Open
lahwaacz opened this issue Aug 24, 2018 · 1 comment
Open

Efficient iteration and replacement of nodes #195

lahwaacz opened this issue Aug 24, 2018 · 1 comment

Comments

@lahwaacz
Copy link
Contributor

lahwaacz commented Aug 24, 2018

Consider the following problem: we want to replace all templates in a wikicode with their first argument. (This is basically a very simplified template expansion problem (#100) which I'm working on.)

The obvious way to solve this is this code:

def expand(wikicode):
    for template in wikicode.ifilter_templates(recursive=wikicode.RECURSE_OTHERS):
        if template.has(1):
            replacement = template.get(1).value
            expand(replacement)
        else:
            replacement = ""
        wikicode.replace(template, replacement)

However, it is very inefficient if you run it on a moderately complex real-world wiki page like Wireless network configuration on the Arch wiki (output from line_profiler):

Wrote profile results to mwph_benchmark.py.lprof
Timer unit: 1e-06 s

Total time: 1.85382 s
File: /usr/lib/python3.7/site-packages/mwparserfromhell/wikicode.py
Function: replace at line 414

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   414                                               @profile
   415                                               def replace(self, obj, value, recursive=True):
   416                                                   """Replace *obj* with *value*.
   417                                           
   418                                                   *obj* can be either a string, a :class:`.Node`, or another
   419                                                   :class:`.Wikicode` object (as created by :meth:`get_sections`, for
   420                                                   example). If *obj* is a string, we will operate on all instances of
   421                                                   that string within the code, otherwise only on the specific instance
   422                                                   given. *value* can be anything parsable by :func:`.parse_anything`.
   423                                                   If *recursive* is ``True``, we will try to find *obj* within our child
   424                                                   nodes even if it is not a direct descendant of this :class:`.Wikicode`
   425                                                   object. If *obj* is not found, :exc:`ValueError` is raised.
   426                                                   """
   427       226        274.0      1.2      0.0          if isinstance(obj, (Node, Wikicode)):
   428       226    1840916.0   8145.6     99.3              context, index = self._do_strong_search(obj, recursive)
   429       452        765.0      1.7      0.0              for i in range(index.start, index.stop):
   430       226       2874.0     12.7      0.2                  context.nodes.pop(index.start)
   431       226       8987.0     39.8      0.5              context.insert(index.start, value)
   432                                                   else:
   433                                                       for exact, context, index in self._do_weak_search(obj, recursive):
   434                                                           if exact:
   435                                                               for i in range(index.start, index.stop):
   436                                                                   context.nodes.pop(index.start)
   437                                                               context.insert(index.start, value)
   438                                                           else:
   439                                                               self._slice_replace(context, index, str(obj), str(value))

Total time: 1.88649 s
File: mwph_benchmark.py
Function: expand at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def expand(wikicode):
     9       435      24104.0     55.4      1.3      for template in wikicode.ifilter_templates(recursive=wikicode.RECURSE_OTHERS):
    10       226       3036.0     13.4      0.2          if template.has(1):
    11       208       2670.0     12.8      0.1              replacement = template.get(1).value
    12       208        463.0      2.2      0.0              expand(replacement)
    13                                                   else:
    14        18         15.0      0.8      0.0              replacement = ""
    15       226    1856200.0   8213.3     98.4          wikicode.replace(template, replacement)

The problem is that the position of the template in the wikicode is known to the ifilter_templates iterator, but it cannot be reused by the replace method, so it has to iterate from the start...

There is a Wikicode._indexed_ifilter method, but it doesn't immediately help with this problem, because it returns only the top-level index and not a multiindex like (i, j, k) where the returned node is the k-th child of the j-th child of the i-th child of the top-level wikicode.

Can you think of an efficient solution? I don't mind doing the iteration manually, but of course it would be nice to provide a nice interface for it from mwparserfromhell itself at some point.

@lahwaacz
Copy link
Contributor Author

lahwaacz commented Aug 24, 2018

Got some progress with custom ifilter-like functions. First attempt:

def indexed_ifilter(wikicode, recursive=True, matches=None, flags=FLAGS,
                    forcetype=None):
    """Iterate over nodes and their corresponding indices and parents.

    The arguments are interpreted as for :meth:`ifilter`. For each tuple
    ``(parent, i, node)`` yielded by this method, ``parent`` is the direct
    parent wikicode of ``node`` and ``parent.index(node) == i``.
    """
    match = wikicode._build_matcher(matches, flags)
    if recursive:
        restrict = forcetype if recursive == wikicode.RECURSE_OTHERS else None
        def getter(node):
            for parent, ch in wikicode._get_children(node, restrict=restrict, contexts=True, parent=wikicode):
                i = parent.index(ch)
                yield (parent, i, ch)
        inodes = chain(*(getter(n) for n in wikicode.nodes))
    else:
        inodes = ((wikicode, i, node) for i, node in enumerate(wikicode.nodes))
    for parent, i, node in inodes:
        if (not forcetype or isinstance(node, forcetype)) and match(node):
            yield (parent, i, node)

def expand_2(wikicode):
    for parent, i, template in indexed_ifilter(wikicode, forcetype=mwparserfromhell.nodes.template.Template, recursive=wikicode.RECURSE_OTHERS):
        if template.has(1):
            replacement = template.get(1).value
            expand_2(replacement)
        else:
            replacement = ""
        assert parent.get(i) is template, (parent.nodes[i], i, template)
        parent.nodes.pop(i)
        parent.insert(i, replacement)

Which performs like this:

Total time: 0.757134 s
File: mwph_benchmark.py
Function: expand_2 at line 63

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    63                                           @profile
    64                                           def expand_2(wikicode):
    65       435     739851.0   1700.8     97.7      for parent, i, template in indexed_ifilter(wikicode, forcetype=mwparserfromhell.nodes.template.Template, recursive=wikicode.RECURSE_OTHERS):
    66       226       2906.0     12.9      0.4          if template.has(1):
    67       208       2646.0     12.7      0.3              replacement = template.get(1).value
    68       208        441.0      2.1      0.1              expand_2(replacement)
    69                                                   else:
    70        18         14.0      0.8      0.0              replacement = ""
    71       226       1084.0      4.8      0.1          assert parent.get(i) is template, (parent.nodes[i], i, template)
    72       226       2192.0      9.7      0.3          parent.nodes.pop(i)
    73       226       8000.0     35.4      1.1          parent.insert(i, replacement)

And second attempt:

def parented_ifilter(wikicode, recursive=True, matches=None, flags=FLAGS,
                     forcetype=None):
    """Iterate over nodes and their corresponding parents.

    The arguments are interpreted as for :meth:`ifilter`. For each tuple
    ``(parent, node)`` yielded by this method, ``parent`` is the direct
    parent wikicode of ``node``.
    """
    match = wikicode._build_matcher(matches, flags)
    if recursive:
        restrict = forcetype if recursive == wikicode.RECURSE_OTHERS else None
        def getter(node):
            for parent, ch in wikicode._get_children(node, restrict=restrict, contexts=True, parent=wikicode):
                yield (parent, ch)
        inodes = chain(*(getter(n) for n in wikicode.nodes))
    else:
        inodes = ((wikicode, node) for node in wikicode.nodes)
    for parent, node in inodes:
        if (not forcetype or isinstance(node, forcetype)) and match(node):
            yield (parent, node)

def expand_3(wikicode):
    for parent, template in parented_ifilter(wikicode, forcetype=mwparserfromhell.nodes.template.Template, recursive=wikicode.RECURSE_OTHERS):
        if template.has(1):
            replacement = template.get(1).value
            expand_3(replacement)
        else:
            replacement = ""
        parent.replace(template, replacement, recursive=False)

Which is more than 10 times faster than the original: 😉

Total time: 0.178676 s
File: mwph_benchmark.py
Function: expand_3 at line 75

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    75                                           @profile
    76                                           def expand_3(wikicode):
    77       435      23522.0     54.1     13.2      for parent, template in parented_ifilter(wikicode, forcetype=mwparserfromhell.nodes.template.Template, recursive=wikicode.RECURSE_OTHERS):
    78       226       2722.0     12.0      1.5          if template.has(1):
    79       208       2594.0     12.5      1.5              replacement = template.get(1).value
    80       208        447.0      2.1      0.3              expand_3(replacement)
    81                                                   else:
    82        18         14.0      0.8      0.0              replacement = ""
    83       226     149377.0    661.0     83.6          parent.replace(template, replacement, recursive=False)

I put the whole benchmark here for future reference.

@lahwaacz lahwaacz changed the title How to efficiently iterate over and replace nodes? Efficient iteration and replacement of nodes Aug 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant