separate parser into separate library #12

ctindall · 2018-04-03T13:07:05Z

No description provided.

kaushalmodi · 2018-08-09T12:42:58Z

Maybe you and @novoid can collaborate on an Org mode parser in Python?

Ref: https://github.com/novoid/lazyblorg

novoid · 2018-08-09T13:13:49Z

Hi folks,

I just saw the mention, thanks for that.

https://github.com/novoid/lazyblorg/blob/master/lib/orgparser.py is a very naïvely written dumb Org mode parser that parses only a sub-set of Org that I need for my blog: please read https://github.com/novoid/lazyblorg/wiki/FAQs#can-i-use-the-org-mode-parser-in-python-for-other-purposes-as-well

My parser "knows" the current line, the previous line, and the "status" which reflects the Org element if it is a multi-line Org mode syntax element. It writes parsed data into a large python list (list of lists of optional lists, ...) I came up with: https://github.com/novoid/lazyblorg/wiki/Data-Structures#org-mode-element-overview (representation column)
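A minimal sketch of that line-by-line idea, for illustration only. It is not lazyblorg's code: the function name `parse_lines`, the three element kinds, and the nested-list layout are all assumptions made for this example. It tracks the current line, the previous line, and a state for multi-line elements, and appends results to a plain Python list.

```python
import re

HEADING_RE = re.compile(r"^(\*+)\s+(.*)$")


def parse_lines(text):
    """Parse a tiny Org subset line by line into a nested list.

    Hypothetical output layout (not lazyblorg's actual one):
    ['heading', level, title], ['par', [lines]], ['src', [lines]].
    """
    elements = []
    state = None        # kind of the element opened most recently: None, 'par' or 'src'
    previous_line = ""  # the line seen just before the current one

    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(line)
        if state == "src":
            # Inside a source block everything is content until #+END_SRC.
            if stripped.lower() == "#+end_src":
                state = None
            else:
                elements[-1][1].append(line)
        elif stripped.lower().startswith("#+begin_src"):
            elements.append(["src", []])
            state = "src"
        elif match:
            elements.append(["heading", len(match.group(1)), match.group(2)])
            state = None
        elif stripped == "":
            pass  # the empty line is remembered via previous_line below
        elif state == "par" and previous_line.strip() != "":
            # Text directly after paragraph text continues that paragraph.
            elements[-1][1].append(stripped)
        else:
            elements.append(["par", [stripped]])
            state = "par"
        previous_line = line
    return elements
```

Note that the paragraph handling depends on empty lines to separate elements, which is the kind of shortcut that keeps such a parser small but limits the input it accepts.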

I know how to implement proper syntax parsers using syntactical and lexical analysis. I did it multiple times some years ago, but only once using Python, and that was for a rather simple language. However, for lazyblorg, the stupid parser from my proof-of-concept worked so well that I never felt I had to write the "proper one". But it is definitely not a piece of art I'd recommend to others. ;-) Furthermore, for various Org mode elements (like lists) and as a fall-back for any "unknown" syntax element, my parser uses the Python library of Pandoc to convert the elements. In general, if you don't need in-depth knowledge of the content (which lazyblorg does), you might be fine using the pandoc library instead of writing your own parser altogether.

https://github.com/novoid/lazyblorg/issues does not list an issue like this one, but it has been on my internal to-do list for some time: extracting the parser into a separate library.

I don't know org-blog and from a quick view I guess it is still in Alpha status.

Last time I checked, the Python-based Org mode parsers mentioned on https://orgmode.org/worg/org-tools/ did not give a satisfying result IMHO. I needed a deeper level of information on the parsed items than those parsers delivered. Some of them were much more beautifully crafted but implemented only a very small sub-set of Org.

If my very basic programming skills are of any help for a combined org-parser library that fulfills at least the requirements that my orgparser.py meets, I am open to discussion.

ctindall · 2018-08-09T14:04:31Z

I'm in largely the same boat: the parser works well enough for what I'm doing, but doesn't really handle any significant corner cases (for example, lists where different levels use different "bullet" characters). My thought is that splitting it out into a separate library would allow me to think about these things in the abstract a little bit better, and would possibly attract more participation.

As you've noted, there is no batteries-included org-mode parser library for Python, and that seems like a big gap to be filled. I'm probably not qualified to produce such a package on my own, but I plan to give it a shot anyway. I work full time and take night classes, so projects like this usually have to wait until a semester break (after this week) or some vacation time I can devote to it.

@kaushalmodi thanks, good idea.

novoid · 2018-08-09T14:11:57Z

I am in a similar situation time-wise. Some weekends I may find time, most weekends probably not. Until August 24 I don't have time because of high business load.

novoid · 2018-08-09T14:25:13Z

Off the top of my head, as a quick shot:

Proposed High-Level Steps

(waterfall)

1. Analysis and comparison of the two separate parsers (Org syntax sub-set coverage)
2. Definition of a common Python data representation of parsed data
3. OPTIONAL: adaptation of current parsers to generate the new data representation (for testing purposes of the data model)
4. Development of a new parsing concept
5. Implementation of the parsing concept as one commonly shared parsing library
6. Adaptation of org-blog and lazyblorg so that they are able to use the new parsing library

kaushalmodi · 2018-08-09T14:40:37Z

@novoid Also, it would be good to follow test-driven development for this. Start by putting together a test flow, and then test Org files covering all the possible Org syntax. With the parser not yet existing, those tests should fail. Then gradually add lexing for each Org syntax element (bold, italics, etc.).

I started off pretty small with this test Org file for my project. But day by day, as I added more features and fixed more bugs, it grew into what it is today.

novoid · 2018-08-09T15:30:58Z

Hi @kaushalmodi,
I too am a fan of TDD. For various reasons, I started with unit tests for lazyblorg as well. Then it turned out that changes caused too much overhead in the unit tests. Therefore I ended up with a set of unit tests for single functions (one per Python file, no 100% coverage unfortunately) and a set of end-to-end tests, which I built on the following concept:

have a set of Org mode files in a minimal viable blog test configuration
generate the blog result
compare the resulting HTML files with pre-stored expected HTML results
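The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not lazyblorg's test code: `render` is a hypothetical stand-in for the real blog generator, and the directory layout (one `.org` input per stored `.html` file) is assumed for the example.

```python
import difflib
import tempfile
from pathlib import Path


def render(org_text):
    """Hypothetical stand-in for the real generator: headings and paragraphs only."""
    lines = []
    for line in org_text.splitlines():
        if line.startswith("* "):
            lines.append("<h1>" + line[2:] + "</h1>")
        elif line.strip():
            lines.append("<p>" + line.strip() + "</p>")
    return "\n".join(lines) + "\n"


def compare_with_expected(org_dir, expected_dir):
    """Return {org filename: diff lines} for every output that differs from its stored copy."""
    failures = {}
    for org_file in sorted(Path(org_dir).glob("*.org")):
        generated = render(org_file.read_text(encoding="utf-8"))
        expected_file = Path(expected_dir) / (org_file.stem + ".html")
        expected = expected_file.read_text(encoding="utf-8")
        if generated != expected:
            diff = difflib.unified_diff(
                expected.splitlines(),
                generated.splitlines(),
                "expected",
                "generated",
                lineterm="",
            )
            failures[org_file.name] = list(diff)
    return failures


if __name__ == "__main__":
    # Build a throwaway test configuration and run the comparison once.
    with tempfile.TemporaryDirectory() as tmp:
        org_dir = Path(tmp, "orgfiles")
        expected_dir = Path(tmp, "expected")
        org_dir.mkdir()
        expected_dir.mkdir()
        (org_dir / "post.org").write_text("* Hello\nSome text\n", encoding="utf-8")
        (expected_dir / "post.html").write_text("<h1>Hello</h1>\n<p>Some text</p>\n", encoding="utf-8")
        print(compare_with_expected(org_dir, expected_dir) or "all outputs match")
```

Returning the diffs instead of a bare pass/fail makes it easier to see what changed after, say, a dependency upgrade.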

Take a look at https://github.com/novoid/lazyblorg/tree/master/testdata/end_to_end_test

IMHO, this concept is quite good in terms of "I know that most parts work as expected" which I need when, e.g., I upgrade pandoc to a newer version.

Independent of that: https://github.com/novoid/lazyblorg/blob/master/testdata/end_to_end_test/orgfiles/currently_supported_orgmode_syntax.org covers the Org mode syntax my lazyblorg is able to parse.

This is my current approach for lazyblorg in general.

For our hypothetical common parsing library, the game is reset to the beginning ;-)

Btw, your test file would not be parse-able with my parser since I rely on empty lines between different Org mode syntax elements. The biggest design-issue with my current approach of the parser.

kaushalmodi · 2018-08-09T15:34:04Z

> Btw, your test file would not be parse-able with my parser since I rely on empty lines between different Org mode syntax elements. The biggest design-issue with my current approach of the parser.

There you have your first test case:

* Heading A
* Heading B

:)
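Written down as an executable check, that first test case could look like the sketch below. The helper `split_headings` is hypothetical and recognizes only headlines (one or more stars at the start of a line followed by a space); it does not depend on empty lines between elements.

```python
import re

# A headline: one or more stars at column 0, a space, then the title.
HEADING_RE = re.compile(r"^(\*+) (.*)$")


def split_headings(text):
    """Return (level, title) for every headline, with or without blank lines between them."""
    headings = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2).strip()))
    return headings


def test_adjacent_headings():
    assert split_headings("* Heading A\n* Heading B") == [
        (1, "Heading A"),
        (1, "Heading B"),
    ]
```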

novoid · 2018-08-09T15:35:52Z

Burn in hell 😈

ctindall · 2018-08-09T15:47:22Z

Ideally the test cases would rely on the common Python data representation, rather than the HTML. We already have a pretty universal way to turn Org into HTML, which is pandoc.

Unfortunately, there is not really any such thing as a "malformed" org-mode file, in the sense that org-mode does not "reject" or throw errors on any sequence of characters it tries to read as an org-mode file. That's because org-mode itself doesn't really parse the file in any strict compiler-class sense.

Thinking about this, I'm wondering if it doesn't make sense to define some standardized subset of org-mode syntax that has a strict grammar, and then create a minor mode that will prevent people from straying outside that syntax. Without something like this, there is always the possibility that an org-file will seem fine in Emacs but choke out the parser.

kaushalmodi · 2018-08-09T15:50:40Z

There is a draft spec: https://orgmode.org/worg/dev/org-syntax.html

ctindall · 2018-08-09T16:05:43Z

Nice, I wasn't aware of this. A good step would be converting the English-language syntax description into a Yacc file or some other language-neutral representation. Then we could use literally the same input file to generate the Python parser and the elisp parser (and parsers for whatever other languages people want). Unfortunately, this means basically a rewrite of the parser. Fortunately, there appear to be tools that can do that rewrite largely for us: http://www.dabeaz.com/ply/. In a cursory search, I don't see any similar programs that emit elisp, but there appear to be options for other lisps that we can crib from (https://github.com/jech/cl-yacc). This seems like a fun project in itself.

So, I think some realistic options to make progress without biting off a huge project would be:

Converting the Org grammar to Yacc syntax.
Defining a Python data representation of Org-mode data as a Python package. This package wouldn't need to do any parsing itself, just provide the class, but it could then be transformed bit-by-bit into the package we all actually want, which is a pip-installable Python package that lets us introspect any aspect of an org-mode source file.
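As a rough idea of what such a parsing-free data package might start from, here is a small sketch using standard-library dataclasses. The class names (`Document`, `Heading`, `Paragraph`) and their fields are assumptions for illustration; nothing in this thread fixes the actual data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class Heading:
    level: int
    title: str
    tags: List[str] = field(default_factory=list)
    children: List[Union["Heading", Paragraph]] = field(default_factory=list)

    def walk(self) -> Iterator[Union["Heading", Paragraph]]:
        """Yield this heading and everything below it, depth first."""
        yield self
        for child in self.children:
            if isinstance(child, Heading):
                yield from child.walk()
            else:
                yield child


@dataclass
class Document:
    children: List[Union[Heading, Paragraph]] = field(default_factory=list)

    def headings(self) -> Iterator[Heading]:
        """Yield every heading in the document in document order."""
        for child in self.children:
            if isinstance(child, Heading):
                for node in child.walk():
                    if isinstance(node, Heading):
                        yield node


if __name__ == "__main__":
    doc = Document([
        Heading(1, "Intro", children=[Paragraph("Some text"), Heading(2, "Details")]),
        Heading(1, "Outro", tags=["blog"]),
    ])
    for heading in doc.headings():
        print("*" * heading.level, heading.title)
```

Because the classes carry no parsing logic, existing parsers could fill them in as a first step and tests could compare against them directly.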

Both of these things would be useful in themselves, so they make good mini-projects.

novoid · 2018-08-10T06:12:35Z

Wow. I totally agree that those ideas are the way to go when you want to go the extra mile.

Unfortunately, I myself can't spend that much effort/time in a new parser for lazyblorg. 😔

If such a parser ever comes into existence, I probably won't spend the effort of migrating lazyblorg to it either, since that parser must be much more general than mine, which would mean a high workload for the adaptations. That said, having a common parser in Python would of course be a very good idea!

ctindall · 2018-08-12T15:50:49Z

Digging into this a little more today, I'm actually wrong that org-mode doesn't properly parse org-mode files. That was definitely at one point true, but my knowledge of org-mode internals is badly out of date. The parser lives in a file called org-element.el. That takes care of most of the Emacs-side work, since the minor mode can simply call into org-element.el for its parsing.

On the Python side, it probably still makes sense to work on converting this syntax to a formal grammar, since that will be useful to people and projects beyond this one, but it's tempting to simply go for a more or less mechanical port of the existing, debugged, elisp parser to Python. I'd probably introduce fewer bugs doing it that way than trying to derive a correct BNF grammar from a combination of English grammar and the org-element.el source.

kaushalmodi · 2018-08-20T20:27:24Z

Today I came across PyOrgMode. I am not a Python coder, but looking at the test Org files, this seems to be a mature library.

novoid · 2018-08-21T07:48:04Z

@kaushalmodi on https://orgmode.org/worg/org-tools/ you can find a list of Org mode parsers.

kaushalmodi · 2018-08-21T10:41:15Z

@novoid I see, thanks! There are a lot of other cool parsers listed there too.

I mentioned the PyOrgMode library specifically as it seems to have a good foundation to build towards what this GitHub issue is about.

ctindall added the enhancement New feature or request label Apr 3, 2018
