Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grammar Remake #342

Merged
merged 24 commits into from
Mar 30, 2020
Merged

Grammar Remake #342

merged 24 commits into from
Mar 30, 2020

Conversation

patmagee
Copy link
Member

Motivation

For a long time I have been bothered by the fact that the current Grammar and Parser language is more or less a dead project. While it has served us well, the knowledge of how to use it has largely since left the openWDL community. This generally means that modification of the Grammars is not an active part of the PR or development process, nor is there an easy way to test changes added the specification which is user accessible.

While writing grammars is not something that absolutely everyone knows how to do, I wanted to choose a langauge that was accessible for even a beginner which fit our needs, and had a large amount of supporting documenation. To that end I chose to write the grammars as EBNF's using ANTLR4.

By changing the grammar I hope to encourage more community members to edit it as part of the PR process (potentially opening the way of requiring changes as part of the PR), as well as help facilitate testing of new features as being discussed #338

Why ANTLR

There are a large number of languages for writing grammars out there with varying different syntaxes. I originally started by trying to implement the spec strictly as a BNF. This quickly ran into issues as 1) there are not alot of tools for parser generation fo strict BNF's 2) WDL is already to complext to properly implement without special rules for command and string interpolation.

Antlr4 provided a syntax which was familiar (EBNF), while still providing the ability to create parsers in a 8 different languages while allowing the grammar to be (mostly) agnostic of the target language.

The Changes

  1. I have ONLY implmenented the CURRENT development specificaiton, including all PR's whose voting has previoulsy passed. This means: Clarified Runtime and Added Hints #315 Add min, max to list of standard library methods #304 Struct literal Revisit - Revisit #297 Revert "Clarify type conversions and meanings" #250 Regular expression evaluation should be fixed. #243 Clarify the expression placeholder and string interpolation behaviour #229 Fix: Uncapitalize pair fields in inputs json #222
  2. My intention is to fully deprecate hermes in a future version of wdl (not 2.0), but for now I have moved the parsers and code under a directory named hermes
  3. I have implemented a parser and two different lexers in antlr. The duplication simply has to do with the way that antlr calls actions within the lexer. If you diff both lexers you will find the only difference is this and self references
  4. I have added rudimentary implemenations of the Python3, Python2, Java, and Javascript parsers, with some preliminary tests. The parsers do not implement any custom Visitors or Listeners and I do not try to modify any of the tokens/tree that is returned. There are interfaces and base implementations for eachlanguage of visitors and listeners that anyone can extend to fit their needs
  5. I have added documentation on building and setup for each language. The actual code generated by antlr is NOT included in this repo. Instead the base lexer needed for each language is added. Simply follow the build instructions to create the code

A big shout out to everyone whoe formally edited the hermes files, and for @mlin with their miniwdl implementation in Lark that helped with some decisions

@patmagee patmagee added clean-up Issues that are related to clean-up tasks. enhancement An enhancement to the WDL language. labels Nov 20, 2019
@patmagee patmagee added this to the v2.0 milestone Nov 20, 2019
@orodeh
Copy link
Contributor

orodeh commented Nov 21, 2019

This is great! It would be some much better than the hermes grammar that we have now.

Copy link
Contributor

@rhpvorderman rhpvorderman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Great work! This ANTLR language looks very clean and learnable. This lowers the threshold for contributing to the grammar immensely.

@@ -0,0 +1,28 @@
# WDL Python2 Parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the python3 parser. Something is wrong in the generation script.

# Installing the runtime

```bash
pip install antlr4-python2-runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again python2 instead of python3

As a more general comment: Python2 is going to be deprecated within two months.
It might be better to remove the python2 parser all together.

## Parsers

The grammars themselves cannot be used to parse a WDL as is, they first must be converted into a parser in one of the supported target languages. Currently
Java, Python3, Python2, and JavaScript are supported. For Python and JavaScript you can use the `antlr4` command line tool to generate the required code that is used in the Parsers. For the java target, the easiest way to generate the code is to use the `mvn` plugin defined directly within the `pom.xml`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not merge a PR that combines "python2" and "supported" in one sentence. Python2 will be deprecated in a few days, but it still has a user base. It's much easier to not ship python2 support now than to get rid of it later.

@patmagee
Copy link
Member Author

Wow @rhpvorderman I completely forgot that the deprecation was happening. I will remove python2 support. I have no desire of maintaining deprecated code

@patmagee
Copy link
Member Author

@rhpvorderman I have removed the python 2 code

@geoffjentry
Copy link
Member

Not an official review but this is amazing. I suspect it'll take a long time (perhaps forever) before Cromwell is off of Hermes but this is a huge benefit for WDL in general

@cjllanwarne
Copy link
Contributor

I agree, very awesome indeed!

@patmagee a few minor comments -

  1. I don't think "Cromwell adoption" should necessarily block adoption of this grammar as the openWDL standard. In the same way that miniWDL rolls its own grammar outside of the openWDL repo, so could Cromwell until it catches up.
    • That said, having at least one working engine in this new grammar sooner rather than later would be a big win IMO.
    • The way WDL->WDLOM->WOM works in Cromwell, I don't think it would be an enormous project to use this as an alternative parser (assuming I can still traverse over Java-object parse trees that look reasonably similar to what Hermes generated) - for WDL 1.0 at least. it's not a completely trivial project for us though so we'd probably need to be able to justify it (say because the grammar has moved on from what the Hermes versions can do, for example)
  2. Do you imagine that submitter-proposed changes in this grammar become a new bar to allowing PRs to be one or more of submitted/reviewed/merged?
  3. Is anything in this PR generated code? It seemed like it from a quick scan, and in an ideal world I'd rather leave generated code out of the repo if possible (but, for example, I know that was not possible for hermes!)
  4. I have a whole bunch of test cases for the more (let's say) annoying parts of the grammar to represent in the hermes parser - would you be open to my submission of some test cases for the new grammar prior to final adoption?
  5. How deep do you allow "interpolation-within-string-within-interpolation-within-string-within....." cycles to go?
    • eg String s = "~{"~{"~{"~{"hello!"}"}"}"}"
  6. Do you support WDL 1.0 or development? Or both?
    • I'm assuming that draft-2 is not supported, but let me know if I'm wrong!

@patmagee
Copy link
Member Author

@cjllanwarne Thanks for the comments!

  1. I think its a good idea to keep both hermes and this grammar for at least a few version before making any decisions to deprecate. After testing we may indeed find out that this is actually easier to use! I know from the past that using the parsers with a visitor is actually pretty easy for complex languages
  2. It would be my hope that in teh future the RFC process would eventually require grammar changes before a PR is merged. But, I think keeping with test cases for now at least would be a good stop over
  3. Nope, there is 0 generated code checked in!! For java you use the maven wrapper to build it, for python, make, and for javascript npm!
  4. Bring on the tests!!!!
  5. Theoreticaly it could be infinite. But at the moment I only allow one layer deep using the same quotations. Upon entering a quoted section a flag is set that will prevent entering anotehr quote of the same type
  6. This is just for development right now, but it would not be much work to retrofit to 1.0. Not sure if we should though

@patmagee
Copy link
Member Author

patmagee commented Feb 7, 2020

In think this should be backported to v1.0 so we can adequately test this against existing workflows before we can think of merging. I will get working on making a v1.0

@patmagee
Copy link
Member Author

@orodeh I have added support for 1.0

@patmagee
Copy link
Member Author

@cjllanwarne I have made a number of fixes and Improvements to the grammar based on the feedback from @orodeh as he works through creating a scala library from it. Additionally I have been working on making java bindings from the grammar as well.

At the moment I think the grammar is in a good state and seems to be able to handle even a complex wdl (from testing locally). It would be great of you could review this again so we could potentially get it merged soon!

@orodeh
Copy link
Contributor

orodeh commented Mar 11, 2020

@patmagee I updated to the latest grammar, and it works nicely with my scala code. This is great!

I ran into an issue with // at the beginning of the line. In languages like C/C++/scala/Java it denotes the beginning of a comment line. In this grammar, it looks like the rest of the document gets dropped. Is this intentional?

For example, the following task is parsed correctly:

version 1.0

# These are various expressions
task district {
  Int i = 3
}

However, if you make a mistake and use // instead of #, like this:

version 1.0

// These are various expressions
task district {
  Int i = 3
}

The district task is just dropped from the resulting document.

@orodeh
Copy link
Contributor

orodeh commented Mar 11, 2020

In case anyone is interested, the code is at https://github.com/dnanexus-rnd/wdlTools/tree/develop

@patmagee
Copy link
Member Author

Thanks for reviewing this @orodeh. That issue with the comment is strange. I'll test it locally to see if I can replicate it! I would not expect it to happen

@orodeh
Copy link
Contributor

orodeh commented Mar 11, 2020

One more thing, name access in an object results in the wrong derivation (I think).

version 1.0

# An expression that access a field in an object
task district {
  Int k = x.a
}

x.a is parsed as an identifier, instead of the #at rule.

@patmagee
Copy link
Member Author

@orodeh thanks for the sleuthing!

I pushed two fixes for the issue.

  1. The identifier was actually defined as CompleteIdentifier (DOT CompleteIdentifier)* which was causing that string to be matched as a Token instead of being cauaght by the dot accessor rule
  2. I did not explicity define EOF for the document rule. Apparantly ANTLR's default behaviour is to just throw out all tokens that come after an illegal character within a rule, unless EOF is explicitly added

@orodeh
Copy link
Contributor

orodeh commented Mar 12, 2020

I pulled the updates, and both problems are fixed now.

Thanks!

@patmagee patmagee added the voting active Changes for which voting is active. label Mar 17, 2020
@orodeh
Copy link
Contributor

orodeh commented Mar 17, 2020

I have worked with the grammar, and works very well. If anyone is interested in the Scala abstract syntax it is at AbstractSyntax.scala.

👍

@geoffjentry
Copy link
Member

Couldn't be more supportive of this. Thanks so much 👍

@rhpvorderman
Copy link
Contributor

Thanks @patmagee for tackling this substantial amount of work, and thanks @orodeh for putting it into practice immediately, so it could be thoroughly evaluated.

👍

@aednichols
Copy link
Contributor

This is amazing 👍

@patmagee patmagee merged commit a661a6f into openwdl:master Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clean-up Issues that are related to clean-up tasks. enhancement An enhancement to the WDL language. voting active Changes for which voting is active.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants