separate parser into separate library #12

Open
ctindall opened this issue Apr 3, 2018 · 17 comments
Labels
enhancement New feature or request

Comments

@ctindall
Owner

ctindall commented Apr 3, 2018

No description provided.

ctindall added the enhancement label Apr 3, 2018
@kaushalmodi

Maybe you and @novoid can collaborate on an Org mode parser in Python?

Ref: https://github.com/novoid/lazyblorg

@novoid

novoid commented Aug 9, 2018

Hi folks,

I just saw the mention; thanks for that.

https://github.com/novoid/lazyblorg/blob/master/lib/orgparser.py is a very naïvely written, dumb Org mode parser that handles only the subset of Org I need for my blog. Please read https://github.com/novoid/lazyblorg/wiki/FAQs#can-i-use-the-org-mode-parser-in-python-for-other-purposes-as-well

My parser "knows" the current line, the previous line, and the "status", which reflects the current Org element when that element is a multi-line syntax element. It writes the parsed data into a large Python list (a list of lists of optional lists, ...) whose layout I came up with myself: https://github.com/novoid/lazyblorg/wiki/Data-Structures#org-mode-element-overview (representation column)
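The line-oriented approach described above can be sketched roughly as follows. This is not the lazyblorg code: the function name `parse_lines`, the element tags (`"heading"`, `"par"`, `"src"`) and the exact list layout are assumptions made up for illustration. It only shows the idea of tracking the current line, the previous line and a state for multi-line elements.

```python
import re

HEADING_RE = re.compile(r"^(\*+)\s+(.*)$")


def parse_lines(lines):
    """Parse Org-like lines into a list of lists, one entry per element."""
    elements = []   # e.g. ["heading", 1, "Title"] or ["par", "text"]
    state = None    # name of the multi-line element currently open
    previous = ""   # the previous line, kept for context

    for line in lines:
        stripped = line.rstrip("\n")
        heading = HEADING_RE.match(stripped)

        if state == "src":
            # Inside a source block: collect lines until the end marker.
            if stripped.lower().startswith("#+end_src"):
                state = None
            else:
                elements[-1][1].append(stripped)
        elif stripped.lower().startswith("#+begin_src"):
            state = "src"
            elements.append(["src", []])
        elif heading:
            elements.append(["heading", len(heading.group(1)), heading.group(2)])
        elif stripped == "":
            pass  # blank lines only separate elements
        elif elements and elements[-1][0] == "par" and previous != "":
            # Continuation of a paragraph started on the previous line.
            elements[-1][1] += " " + stripped
        else:
            elements.append(["par", stripped])

        previous = stripped

    return elements
```

Calling `parse_lines` on a heading, a two-line paragraph and a source block returns three nested lists, one per element. A real parser would need many more states (drawers, tables, lists and so on).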

I know how to implement proper syntax parsers using syntactical and lexical analysis. I did it several times some years ago, but only once in Python, and that was for a rather simple language. However, for lazyblorg, the stupid parser from my proof-of-concept worked so well that I never felt I had to write the "proper one". It is definitely not a piece of art I'd recommend to others. ;-) Furthermore, for various Org mode elements (like lists), and as a fallback for any "unknown" syntax element, my parser uses the Python library of Pandoc to convert the elements. In general, if you don't need in-depth knowledge of the content (lazyblorg does need it), you might be fine using the pandoc library instead of writing your own parser altogether.

https://github.com/novoid/lazyblorg/issues does not list an issue like this one, but extracting the parser into a separate library has been on my internal to-do list for some time.

I don't know org-blog, and from a quick look I guess it is still in alpha status.

Last time I checked, the Python-based Org mode parsers mentioned on https://orgmode.org/worg/org-tools/ did not give satisfying results, IMHO. I needed a deeper level of information on the parsed items than those parsers delivered. Some of them were much more beautifully crafted but implemented only a very small subset of Org.

If my very basic programming skills are of any help for a combined org-parser library that fulfills at least the requirements that my orgparser.py meets, I am open to discussion.

@ctindall
Owner Author

ctindall commented Aug 9, 2018

I'm in largely the same boat: the parser works well enough for what I'm doing, but doesn't really handle any significant corner cases (for example, lists where different levels use different "bullet" characters). My thought is that splitting it out into its own library would let me think about these things in the abstract a little better, and might attract more participation.
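As a rough illustration of the mixed-bullet corner case, the sketch below nests unordered list items by indentation alone, so the bullet character (`-`, `+`, or an indented `*`) does not matter. The function name `parse_list` and the `[text, children]` layout are assumptions for this example, not code from either project, and ordered lists are left out.

```python
import re

# Unordered Org list bullets: "-", "+", or "*" (the star only when indented).
ITEM_RE = re.compile(r"^(\s*)([-+*])\s+(.*)$")


def parse_list(lines):
    """Return nested items as [text, children] pairs, nested by indentation."""
    root = []
    stack = [(-1, root)]  # (indent, children list) of the open ancestors

    for line in lines:
        match = ITEM_RE.match(line)
        if not match:
            continue
        indent, bullet, text = len(match.group(1)), match.group(2), match.group(3)
        if indent == 0 and bullet == "*":
            continue  # a star in column 0 is a heading, not a list item
        # Close every open item that is not a parent of this one.
        while stack[-1][0] >= indent:
            stack.pop()
        item = [text, []]
        stack[-1][1].append(item)
        stack.append((indent, item[1]))

    return root
```

Because only the indentation decides the nesting, a list that switches from `-` to `+` to `*` between levels parses the same way as one that uses `-` throughout.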

As you've noted, there is no batteries-included org-mode parser library for Python, and that seems like a big gap to be filled. I'm probably not qualified to produce such a package on my own, but I plan to give it a shot anyway. I work full time and take night classes, so projects like this usually have to wait until a semester break (after this week) or some vacation time I can devote to it.

@kaushalmodi thanks, good idea.

@novoid

novoid commented Aug 9, 2018

I am in a similar situation time-wise. Some weekends I may find time, most weekends probably not. Until August 24 I have no time at all because of a heavy workload.

@novoid

novoid commented Aug 9, 2018

Off the top of my head, as a quick shot:

Proposed High-Level Steps

(waterfall)

  • Analysis and comparison of the two separate parsers (Org syntax sub-set coverage)
  • Definition of a common Python data representation of parsed data
  • OPTIONAL: adaptation of current parsers to generate the new data representation (for testing purposes of the data model)
  • Development of a new parsing concept
  • Implementation of the parsing concept as one commonly shared parsing library
  • Adaptation of org-blog and lazyblorg so that they are able to use the new parsing library

@kaushalmodi

@novoid It would also be good to follow Test-Driven Development for this. Start by putting together a test flow and test Org files covering all the possible Org syntax. With the parser not yet in existence, those tests should fail. Then gradually add lexing for each piece of Org syntax (bold, italics, etc.).

I started off pretty small with this test Org file for my project. But day by day, as I added more features and fixed more bugs, it grew into what it is today.

@novoid

novoid commented Aug 9, 2018

Hi @kaushalmodi,
I too am a fan of TDD. For various reasons, I started with unit tests for lazyblorg as well. Then it turned out that changes caused too much overhead in the unit tests. Therefore I ended up with a set of unit tests for single functions (one per Python file, no 100% coverage unfortunately) and a set of end-to-end tests that follow this concept:

  1. have a set of Org mode files in a minimal viable blog test configuration
  2. generate the blog result
  3. compare the resulting HTML files with pre-stored expected HTML results
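The comparison in step 3 can be sketched as below. This is a minimal illustration, not the lazyblorg test code: the function name `compare_with_expected` and the directory layout (one directory of generated HTML, one of pre-stored expected HTML) are assumptions.

```python
import difflib
from pathlib import Path


def compare_with_expected(generated_dir, expected_dir):
    """Return {filename: unified diff lines} for every file that differs."""
    generated_dir, expected_dir = Path(generated_dir), Path(expected_dir)
    mismatches = {}
    for expected in sorted(expected_dir.glob("*.html")):
        generated = generated_dir / expected.name
        if not generated.exists():
            mismatches[expected.name] = ["missing generated file"]
            continue
        diff = list(difflib.unified_diff(
            expected.read_text(encoding="utf-8").splitlines(),
            generated.read_text(encoding="utf-8").splitlines(),
            fromfile="expected/" + expected.name,
            tofile="generated/" + expected.name,
            lineterm="",
        ))
        if diff:
            mismatches[expected.name] = diff
    return mismatches
```

An empty result means the generated blog matches the stored expectation. Any entry shows the diff that would need to be reviewed, for example after a pandoc upgrade.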

Take a look at https://github.com/novoid/lazyblorg/tree/master/testdata/end_to_end_test

IMHO, this concept is quite good in terms of "I know that most parts work as expected" which I need when, e.g., I upgrade pandoc to a newer version.

Independent of that: https://github.com/novoid/lazyblorg/blob/master/testdata/end_to_end_test/orgfiles/currently_supported_orgmode_syntax.org covers the Org mode syntax my lazyblorg is able to parse.

This is my current approach for lazyblorg in general.

For our hypothetical common parsing library, the game is reset to the beginning ;-)

Btw, your test file would not be parseable with my parser, since I rely on empty lines between different Org mode syntax elements. That is the biggest design issue with my current parsing approach.

@kaushalmodi

> Btw, your test file would not be parse-able with my parser since I rely on empty lines between different Org mode syntax elements. The biggest design-issue with my current approach of the parser.

There you have your first test case:

* Heading A
* Heading B

:)
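Written down as an executable check, that first test case might look like this. The names `split_headlines` and `test_adjacent_headlines` are made up for the sketch; the point is only that headlines are recognized line by line, with no blank line required between them.

```python
import re

HEADLINE_RE = re.compile(r"^(\*+) +(.*)$")


def split_headlines(text):
    """Return (level, title) for each headline, blank lines or not."""
    headlines = []
    for line in text.splitlines():
        match = HEADLINE_RE.match(line)
        if match:
            headlines.append((len(match.group(1)), match.group(2).strip()))
    return headlines


def test_adjacent_headlines():
    assert split_headlines("* Heading A\n* Heading B\n") == [
        (1, "Heading A"),
        (1, "Heading B"),
    ]
```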

@novoid

novoid commented Aug 9, 2018

Burn in hell 😈

@ctindall
Owner Author

ctindall commented Aug 9, 2018

Ideally the test cases would rely on the common Python data representation, rather than the HTML. We already have a pretty universal way to turn Org into HTML, which is pandoc.

Unfortunately, there is not really such a thing as a "malformed" org-mode file, in the sense that org-mode does not "reject" or throw errors on any sequence of characters it tries to read as an org-mode file. That's because org-mode itself doesn't really parse the file in any strict compiler-class sense.

Thinking about this, I'm wondering if it doesn't make sense to define some standardized subset of org-mode syntax that has a strict grammar, and then create a minor mode that will prevent people from straying outside that syntax. Without something like this, there is always the possibility that an org-file will seem fine in Emacs but choke the parser.
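The "strict subset" idea can be illustrated with a small checker that reports every line outside an agreed set of line shapes. The four patterns below are invented purely for this sketch and are not a proposed standard; a real subset would have to be derived from the Org syntax itself.

```python
import re

# Hypothetical "strict subset": each non-blank line must match one of these.
ALLOWED_LINE_PATTERNS = {
    "headline": re.compile(r"^\*+ \S.*$"),
    "keyword": re.compile(r"^#\+[A-Za-z_]+:.*$"),
    "list item": re.compile(r"^\s*[-+] \S.*$"),
    "plain text": re.compile(r"^[^*#\s].*$"),
}


def find_violations(text):
    """Return (line number, line) for every line outside the subset."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            continue
        if not any(p.match(line) for p in ALLOWED_LINE_PATTERNS.values()):
            violations.append((number, line))
    return violations
```

An editor-side minor mode could surface the same violations while typing, so a file never leaves the subset that the Python parser accepts.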

@kaushalmodi

There is a draft spec: https://orgmode.org/worg/dev/org-syntax.html

@ctindall
Owner Author

ctindall commented Aug 9, 2018

Nice, I wasn't aware of this. A good step would be converting the English syntax description to a YACC file or some other language-neutral representation. Then we could use literally the same input file to generate the Python parser and the elisp parser (and parsers for whatever other languages people want). Unfortunately, this means basically a rewrite of the parser. Fortunately, there appear to be tools that can do that rewrite largely for us: http://www.dabeaz.com/ply/. In a cursory search, I don't see any similar programs that emit elisp, but there appear to be options for other Lisps that we can crib from (https://github.com/jech/cl-yacc). This seems like a fun project in itself.

So, I think some realistic options to make progress without biting off a huge project would be:

  1. Converting the Org grammar to Yacc syntax.
  2. Defining a Python data representation of Org-mode data as a Python package. This package wouldn't need to do any parsing itself, just provide the classes, but it could then be transformed bit by bit into the package we all actually want: a pip-installable Python package that lets us introspect any aspect of an org-mode source file.

Both of these things would be useful in themselves, so they make good mini-projects.
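For the second option, a parsing-free data representation could start out as small as the sketch below. The class names (`Document`, `Headline`, `Paragraph`) and their fields are assumptions chosen for illustration, not an agreed design. They only show how a package could provide the classes first and gain a parser later.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class Headline:
    level: int
    title: str
    tags: List[str] = field(default_factory=list)
    children: List[Union["Headline", Paragraph]] = field(default_factory=list)


@dataclass
class Document:
    children: List[Union[Headline, Paragraph]] = field(default_factory=list)

    def headlines(self):
        """Yield every headline in document order, depth first."""
        pending = list(reversed(self.children))
        while pending:
            node = pending.pop()
            if isinstance(node, Headline):
                yield node
                pending.extend(reversed(node.children))
```

Test cases can then compare parser output against hand-built `Document` objects instead of against rendered HTML.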

@novoid

novoid commented Aug 10, 2018

Wow. I totally agree that those ideas are the way to go when you want to go the extra mile.

Unfortunately, I myself can't spend that much effort/time on a new parser for lazyblorg. 😔

If such a parser ever comes to exist, I probably won't spend the effort of migrating lazyblorg to it either, since that parser would have to be much more general than mine, resulting in a high workload for the adaptations. That said, it would of course be a very good idea to have a common parser in Python!

@ctindall
Owner Author

ctindall commented Aug 12, 2018

Digging into this a little more today, I was actually wrong that org-mode doesn't properly parse org-mode files. That was definitely true at one point, but my knowledge of org-mode internals is badly out of date. The parser lives in a file called org-element.el. That takes care of most of the Emacs-side work, since the minor mode can simply call into org-element.el for its parsing.

On the Python side, it probably still makes sense to work on converting this syntax to a formal grammar, since that will be useful to people and projects beyond this one, but it's tempting to simply go for a more or less mechanical port of the existing, debugged, elisp parser to Python. I'd probably introduce fewer bugs doing it that way than trying to derive a correct BNF grammar from a combination of the English syntax description and the org-element.el source.

@kaushalmodi

Today I came across PyOrgMode. I am not a Python coder, but judging by the test Org files, it seems to be a mature library.

@novoid

novoid commented Aug 21, 2018

@kaushalmodi on https://orgmode.org/worg/org-tools/ you can find a list of Org mode parsers.

@kaushalmodi

kaushalmodi commented Aug 21, 2018

@novoid I see, thanks! There are a lot of other cool parsers listed there too.

I mentioned the PyOrgMode library specifically as it seems to have a good foundation to build towards what this GitHub issue is about.


3 participants