separate parser into separate library #12
Comments
Maybe you and @novoid can collaborate on an Org mode parser in Python? |
Hi folks, I just saw the mention, thanks for that.
https://github.com/novoid/lazyblorg/blob/master/lib/orgparser.py is a very naïvely written, dumb Org mode parser that handles only the subset of Org that I need for my blog. Please read https://github.com/novoid/lazyblorg/wiki/FAQs#can-i-use-the-org-mode-parser-in-python-for-other-purposes-as-well
My parser "knows" the current line, the previous line, and the "status", which reflects the current Org element if it is a multi-line Org mode syntax element. It writes the parsed data into a large Python list (list of lists of optional lists, ...) that I came up with: https://github.com/novoid/lazyblorg/wiki/Data-Structures#org-mode-element-overview (representation column).
I know how to implement proper syntax parsers using syntactical and lexical analysis. I already did it multiple times some years ago, but only once using Python, and that was for a rather simple language. However, for lazyblorg, the stupid parser from my proof of concept worked so well that I never felt I had to write the "proper one". But it is definitely not a piece of art I'd recommend to others. ;-)
Furthermore, for various Org mode elements (like lists) and as a fallback for any "unknown" syntax element, my parser uses the Python library of Pandoc to convert the elements. In general, if you don't need in-depth knowledge of the content (which lazyblorg does), you might be fine using the pandoc library instead of writing your own parser altogether.
https://github.com/novoid/lazyblorg/issues does not list an issue like this one, but extracting the parser into a separate library has been on my internal todo list for some time.
I don't know org-blog, and from a quick look I guess it is still in alpha status.
Last time I checked, the Python-based Org mode parsers mentioned on https://orgmode.org/worg/org-tools/ did not give satisfying results, IMHO. I needed a deeper level of information on the parsed items than those parsers delivered. Some of them were much more beautifully crafted but implemented a very small subset of Org.
If my very basic programming skills are of any help for a combined org-parser library that fulfills at least the requirements that my orgparser.py meets, I am open to discussion. |
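To make the "current line, previous line, status" approach described above concrete, here is a minimal sketch of such a line-oriented parser. It is not the lazyblorg code: the function name `parse_lines`, the element names, and the exact list layout are assumptions made up for illustration, and only headings, paragraphs, and source blocks are handled.

```python
def parse_lines(text):
    """Parse a tiny Org subset into a list of lists (hypothetical layout)."""
    elements = []
    state = "none"   # name of the multi-line element we are inside, if any
    previous = ""    # the previous line, used to decide on continuation
    for line in text.splitlines():
        stripped = line.strip()
        stars = len(line) - len(line.lstrip("*"))
        if state == "src":
            # Inside a source block: collect lines verbatim until the end marker.
            if stripped.lower() == "#+end_src":
                state = "none"
            else:
                elements[-1][1].append(line)
        elif stripped.lower().startswith("#+begin_src"):
            elements.append(["src-block", []])
            state = "src"
        elif stars > 0 and line[stars:stars + 1] == " ":
            # Stars at column 0 followed by a space start a heading.
            elements.append(["heading", stars, line[stars:].strip()])
        elif not stripped:
            pass  # a blank line only separates elements
        elif previous.strip() and elements and elements[-1][0] == "paragraph":
            # Non-blank line directly after a paragraph line continues it.
            elements[-1][1].append(stripped)
        else:
            elements.append(["paragraph", [stripped]])
        previous = line
    return elements


sample = "* Title\n\nfirst line\nsecond line\n\n#+BEGIN_SRC python\nprint(1)\n#+END_SRC\n"
for element in parse_lines(sample):
    print(element)
```

A design like this is easy to extend one element at a time, but every decision is local to one or two lines, which is why look-ahead constructs and nesting tend to be the first things to break.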
I'm in largely the same boat: the parser works well enough for what I'm doing, but it doesn't really handle any significant corner cases (for example, lists where different levels use different "bullet" characters). My thought is that splitting it out into a separate library would allow me to think about these things in the abstract a little better, and would possibly attract more participation. As you've noted, there is no batteries-included org-mode parser library for Python, and that seems like a big gap to be filled. I'm probably not qualified to produce such a package on my own, but I plan to give it a shot anyway. I work full time and take night classes, so projects like this usually have to wait until a semester break (after this week) or some vacation time I can devote to them. @kaushalmodi thanks, good idea. |
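The mixed-bullet corner case mentioned above can be sketched as follows. The idea shown here, deriving the nesting level from indentation rather than from the bullet character, is one possible approach and not what either existing parser does; `LIST_ITEM` and `parse_list` are hypothetical names.

```python
import re

# A list item: optional indentation, a bullet (-, +, *, "1." or "1)"), then text.
LIST_ITEM = re.compile(r"^(?P<indent>\s*)(?P<bullet>[-+]|\*|\d+[.)])\s+(?P<text>.*)$")


def parse_list(lines):
    """Return (level, bullet, text) tuples; level comes from indentation only."""
    items = []
    indents = []  # stack of indentation widths of the currently open levels
    for line in lines:
        match = LIST_ITEM.match(line)
        if match is None:
            continue
        indent = len(match.group("indent"))
        bullet = match.group("bullet")
        if bullet == "*" and indent == 0:
            continue  # "*" at column 0 is a headline, not a list item
        while indents and indent < indents[-1]:
            indents.pop()  # dedent: close deeper levels
        if not indents or indent > indents[-1]:
            indents.append(indent)  # indent: open a new level
        items.append((len(indents), bullet, match.group("text")))
    return items


sample = ["- fruit", "  + apple", "    * red", "  + pear", "1. numbered", "* headline"]
for item in parse_list(sample):
    print(item)
```

Because the bullet is stored but never used to decide nesting, a list that switches from `-` to `+` to `*` between levels needs no special handling.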
I am in a similar situation time-wise. Some weekends I may find time, most weekends probably not. Until August 24 I don't have time because of high business load. |
Off the top of my head, as a quick shot: Proposed High-Level Steps (waterfall)
|
@novoid Also, it would be good to follow test-driven development for this. Start by putting together a test flow, and then test Org files covering all the possible Org syntax. With the parser not yet existing, those tests should fail. Then gradually add lexing for each piece of Org syntax (bold, italics, etc.). I started off pretty small with this test Org file for my project. But day by day, as I added more features and fixed more bugs, it grew to what it is today. |
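A test-first workflow like the one suggested above could start from something as small as the following sketch. The `lex_inline` function and its token format are assumptions invented for this example; in real test-driven development the tests would be written first and `lex_inline` would begin as a failing stub, growing one markup type at a time.

```python
import re
import unittest

# Hypothetical inline lexer covering only *bold* and /italic/.
INLINE = re.compile(r"\*(?P<bold>[^*]+)\*|/(?P<italic>[^/]+)/")


def lex_inline(text):
    """Split text into (kind, content) tokens."""
    tokens = []
    pos = 0
    for match in INLINE.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        kind = match.lastgroup  # name of the group that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


class InlineMarkupTest(unittest.TestCase):
    def test_plain_text(self):
        self.assertEqual(lex_inline("hello"), [("text", "hello")])

    def test_bold(self):
        self.assertEqual(lex_inline("a *b*"), [("text", "a "), ("bold", "b")])

    def test_italic(self):
        self.assertEqual(lex_inline("/c/ d"), [("italic", "c"), ("text", " d")])
```

Saved as a module, the tests can be run with `python -m unittest`. Each newly supported syntax element then gets its own test method before the lexer is touched.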
Hi @kaushalmodi,
Take a look at https://github.com/novoid/lazyblorg/tree/master/testdata/end_to_end_test IMHO, this concept is quite good in terms of "I know that most parts work as expected", which I need when, e.g., I upgrade pandoc to a newer version. Independent of that: https://github.com/novoid/lazyblorg/blob/master/testdata/end_to_end_test/orgfiles/currently_supported_orgmode_syntax.org covers the Org mode syntax that lazyblorg is able to parse. This is my current approach for lazyblorg in general. For our hypothetical common parsing library, the game is reset to the beginning ;-) Btw, your test file would not be parseable with my parser, since I rely on empty lines between different Org mode syntax elements. That is the biggest design issue with my current parser approach. |
There you have your first test case:
:) |
Burn in hell 😈 |
Ideally, the test cases would rely on the common Python data representation rather than on the HTML. We already have a pretty universal way to turn Org into HTML, which is pandoc. Unfortunately, there is not really any such thing as a "malformed" org-mode file, in the sense that org-mode does not "reject" or throw errors on any sequence of characters it tries to read as an org-mode file. That's because org-mode itself doesn't really parse the file in any strict compiler-class sense. Thinking about this, I'm wondering whether it makes sense to define some standardized subset of org-mode syntax that has a strict grammar, and then create a minor mode that prevents people from straying outside that syntax. Without something like this, there is always the possibility that an org file will seem fine in Emacs but choke the parser. |
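As a rough illustration of what checking against a "strict subset" could look like, the sketch below flags two conditions that a strict grammar might forbid. Both rules (blocks must be closed, heading levels must not skip) and the name `check_strict` are invented for this example; nothing in the thread or in Org itself defines such a subset.

```python
import re

BEGIN = re.compile(r"^\s*#\+begin_(\w+)", re.IGNORECASE)
END = re.compile(r"^\s*#\+end_(\w+)", re.IGNORECASE)
HEADING = re.compile(r"^(\*+) ")


def check_strict(text):
    """Return (line number, message) pairs for violations of the assumed rules."""
    problems = []
    open_block = None  # (block name, line number) while inside a block
    level = 0          # depth of the most recent heading
    for lineno, line in enumerate(text.splitlines(), start=1):
        end = END.match(line)
        if open_block is not None:
            # Inside a block everything is literal until the matching end marker.
            if end and end.group(1).lower() == open_block[0]:
                open_block = None
            continue
        begin = BEGIN.match(line)
        if begin:
            open_block = (begin.group(1).lower(), lineno)
            continue
        if end:
            problems.append((lineno, "block end without matching begin"))
            continue
        heading = HEADING.match(line)
        if heading:
            depth = len(heading.group(1))
            if depth > level + 1:
                problems.append((lineno, "heading level skipped"))
            level = depth
    if open_block is not None:
        problems.append((open_block[1], "block never closed"))
    return problems


print(check_strict("* A\n*** C\n#+BEGIN_SRC\nx\n"))
```

A checker of this kind could back either a test suite or an editor-side warning, so that files which look fine in Emacs are flagged before they reach a stricter parser.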
There is a draft spec: https://orgmode.org/worg/dev/org-syntax.html |
Nice, I wasn't aware of this. A good step would be converting the English syntax description to a YACC file or some other language-neutral representation. Then we could use literally the same input file to generate the Python parser and the elisp parser (and parsers for whatever other languages people want). Unfortunately, this basically means a rewrite of the parser. Fortunately, there appear to be tools that can do that rewrite largely for us: http://www.dabeaz.com/ply/. In a cursory search, I don't see any similar programs that emit elisp, but there appear to be options for other lisps that we can crib from (https://github.com/jech/cl-yacc). This seems like a fun project in itself. So, I think some realistic options to make progress without biting off a huge project would be:
Both of these things would be useful in themselves, so they make good mini-projects. |
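To give a feel for turning one rule of the prose spec into something machine-checkable, here is a small sketch for headlines. As I read the draft spec, a headline is laid out roughly as STARS KEYWORD PRIORITY TITLE TAGS. This is a hand-written regular expression, not a YACC or PLY grammar, and the fixed keyword set `TODO|DONE` is an assumption (Org lets users configure keywords). `HEADLINE` and `parse_headline` are hypothetical names.

```python
import re

HEADLINE = re.compile(
    r"^(?P<stars>\*+) "                        # one or more stars, then a space
    r"(?:(?P<keyword>TODO|DONE) )?"            # optional keyword (assumed set)
    r"(?:\[#(?P<priority>[A-Z])\] )?"          # optional priority cookie
    r"(?P<title>.*?)"                          # title, as short as possible
    r"(?:[ \t]+(?P<tags>:[\w@#%:]+:))?[ \t]*$"  # optional trailing :tag:list:
)


def parse_headline(line):
    """Return a dict describing the headline, or None if the line is not one."""
    match = HEADLINE.match(line)
    if match is None:
        return None
    tags = match.group("tags")
    return {
        "level": len(match.group("stars")),
        "keyword": match.group("keyword"),
        "priority": match.group("priority"),
        "title": match.group("title"),
        "tags": tags.strip(":").split(":") if tags else [],
    }


print(parse_headline("** TODO [#A] Write parser :python:org:"))
print(parse_headline("not a headline"))
```

A generated parser would replace the regular expression with grammar rules, but the resulting data structure could stay the same, which keeps tests written against it reusable.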
Wow. I totally agree that those ideas are the way to go when you want to go the extra mile. Unfortunately, I myself can't spend that much effort/time on a new parser for lazyblorg. 😔 If such a parser ever exists, I probably won't spend the effort of migrating lazyblorg to it either, since that parser must be much more general than mine, which would mean a high workload for the adaptations. Despite that, it would of course be a very good idea to have a common parser in Python! |
Digging into this a little more today, I was actually wrong that org-mode doesn't properly parse org-mode files. That was definitely true at one point, but my knowledge of org-mode internals is badly out of date. The parser lives in a file called org-element.el. That takes care of most of the Emacs-side work, since the minor mode can simply call into org-element.el for its parsing. On the Python side, it probably still makes sense to work on converting this syntax to a formal grammar, since that will be useful to people and projects beyond this one, but it's tempting to simply go for a more or less mechanical port of the existing, debugged elisp parser to Python. I'd probably introduce fewer bugs doing it that way than by trying to derive a correct BNF grammar from a combination of the English syntax description and the org-element.el source. |
Today I came across PyOrgMode. I am not a Python coder, but judging by the test Org files, this seems to be a mature library. |
@kaushalmodi on https://orgmode.org/worg/org-tools/ you find a list of Org mode parsers. |
@novoid I see, thanks! There are a lot of other cool parsers listed there too. I mentioned the PyOrgMode library specifically as it seems to have a good foundation to build towards what this GitHub issue is about. |
No description provided.