Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src/grammar for the canonical grammar of the Rust language #1331

Merged
merged 1 commit into from
Jan 15, 2016

Conversation

nagisa
Copy link
Member

@nagisa nagisa commented Oct 21, 2015

@nagisa nagisa changed the title src/grammar for the canonical gramar of the Rust language src/grammar for the canonical grammar of the Rust language Oct 21, 2015
@steveklabnik
Copy link
Member

👍 😍 💯

@nrc nrc added the T-lang Relevant to the language team, which will review and decide on the RFC. label Oct 22, 2015
@nikomatsakis nikomatsakis self-assigned this Oct 22, 2015
@bstrie
Copy link
Contributor

bstrie commented Oct 27, 2015

The semi-canonical step may be unnecessary. I don't expect there to be very many discrepancies between what we think the grammar is and what the grammar actually is, so we could just switch to the fully canonical stage directly. Even if a discrepancy is discovered after that point, all it would take would be an RFC to determine whether it should result in a change to the grammar or a change to the parser (obviously taking into account how much breakage would result from a change to the parser).

@matklad
Copy link
Member

matklad commented Nov 3, 2015

I'm hugely interested in this, many thanks!

One question that bothers me is how the consistency between libsyntax and src/grammar is to be enforced? I am afraid that if automatic comparison of parsers is not a part of the test suite, then the grammar will never become canonical (or the rust parser will never become conforming, if you prefer this half of the glass).

And if we speak of test suite, it would be ideal if it is implementation independent and included in something like ctrs.

@steveklabnik
Copy link
Member

We have vestigial grammar tests still in the makefiles, I think.

@Aatch
Copy link
Contributor

Aatch commented Nov 6, 2015

I share @matklad's concerns. It'd be awesome to have a canonical grammar, but maintaining consistency seems like a major challenge.

I'm not sure how we could actually go about testing it in a rigorous manner. Since the compiler's parser doesn't have to parse in the same way a generated parser (from the grammar) would, figuring out whether or not they are consistent with each other seems difficult.

That said, less-rigorous testing could be done with a combination of fuzzing and regression tests. The grammar could be used to generate syntactically-correct (if nonsensical) code which is run through the compiler's parser. If the sample fails to parse, we can look at it and decide if it's a "real" failure or not. If it is, we can fix the bug and use the sample (or a reduced version) as the regression test. The downside here is that it doesn't enforce the same parse. The compiler could parse a code fragment without error, but not actually be correct. I can't see this being a problem though.

With all that said, some discussion around testing the grammar and parser would be nice to see in the RFC proper.

@steveklabnik
Copy link
Member

For reference https://github.com/rust-lang/rust/blob/master/src/grammar/check.sh and https://github.com/rust-lang/rust/blob/master/src/grammar/verify.rs and other stuff in src/grammar is how we used to do the testing.

@matklad
Copy link
Member

matklad commented Nov 6, 2015

@steveklabnik I thought those are only lexer tests

@steveklabnik
Copy link
Member

They at least used to be referred to as "grammar tests", though maybe I cleared some of them out and they're not there anymore. We were like, using antlr and such...


This process is sure to become ambiguous over time as syntax is increasingly adjusted (it is harder
to “blame” syntax changes compared to syntax additions), therefore the resolution process of
discrepancies will also depend more on a decision from the Rust team.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be the final decision of the language subteam, I think. Seems like the most logical decision making body for this sort of question. (And it's already what we do in practice.)

@nikomatsakis
Copy link
Contributor

@nagisa I am very positive on this idea. For some time now I have been wanting to have a grammar changes written up more rigorously. I think your plan is pretty coherent -- basically there's going to be a messy period where we have to make ad-hoc judgement calls on what is the 'canonical' grammar for a bit of time, but eventually we'll get to a clean state.

I do think it's important that we add automatic checking of the Rust grammar against the in-tree sources (at least). Ideally we'd maybe even do crater runs or something, but you gotta run before you can walk.

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness? (Also, is one project based on the other?) I know that @rprichard has been filing various issues about discrepancies between what rustc does and what rust-grammar does, so they must have some automated means of testing it.

@nikomatsakis
Copy link
Contributor

cc @rprichard @bleibig @dgrunwald as people who seem likely to have an opinion on this RFC and related matters.

@nagisa
Copy link
Member Author

nagisa commented Nov 6, 2015

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness?

As far as I can see our in-tree LARL grammar is pretty much a copy of this project (sans the changes that have been made to the grammar). I suspect, without much digging, that it was imported from the rust-grammar project itself.

@nagisa
Copy link
Member Author

nagisa commented Nov 6, 2015

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

Answers to similar question are also pretty hand-wavy.

@nikomatsakis
Copy link
Contributor

@nagisa

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

@dgrunwald
Copy link
Contributor

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

Different grammar productions can interact in surprising ways.

For example, the operator as allows invoking a lambda expression without parenthesizing the lambda expression:

fn main() {
    || println!("Hello World") as ()()
}

It's not obvious that this is even possible (and it probably shouldn't be), so I'd be surprised if you can find such code in the wild.

@matklad
Copy link
Member

matklad commented Nov 11, 2015

@nikomatsakis sample inputs will allow to verify that the formal grammar accepts what rustc accepts. But even the trivial grammar can do this, namely, the grammar of all token sequences.

The hard problem is to make sure that the formal grammar does not accept invalid Rust. As I understand @Aatch suggestion was about this direction of verification.

And the second hard problem is to make sure that both the grammar and the rustc produce similar parse trees: you can easily mix up operator precedence, for a start.

@nikomatsakis
Copy link
Contributor

@dgrunwald @matklad yes, ok, right, forgot about @Aatch's question. Nonetheless, I think that comparing the two against all the Rust code we can get our hands on (and yes, comparing the parse trees, which seems eminently doable) provides a fairly high degree of confidence that we are parsing what we want to, despite the theoretical disparities. Certainly generating inputs would let us go further (though we'd presumably be filtering these inputs against the rustc impl, which still means that there may be cases where both do the wrong thing in the same way).

@graydon
Copy link

graydon commented Nov 12, 2015

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

@nikomatsakis
Copy link
Contributor

@graydon

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

It is, of course. =)

@nagisa
Copy link
Member Author

nagisa commented Nov 12, 2015

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

To me it sounds like generating parser is a natural step to take after we have a proper (canonical and verified to be correct) grammar. We need some way to ensure a grammar matches current implementation (and/or becomes the canonical representation of the Rust’s grammar) before we migrate rustc to a parser generator of any sort, don’t we?

However, that sounds like something that would need to be discussed once we have a generator itself ready, right?

@dgrunwald
Copy link
Contributor

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a (partial) syntax tree for code completion, you'll have to puzzle it together from the partially matched productions remaining on the parser's stack. Or (equivalently) introduce a whole bunch of error productions, and you'll always forget some and have cases where nodes aren't integrated into the syntax tree and code completion suffers. This stuff will be much much easier with the existing hand-written parser (since it's LL-style, there's always an obvious choice for connecting the current node to the root of the tree when you encounter an error).

@nikomatsakis
Copy link
Contributor

@dgrunwald Note that we did not necessarily say an LR parser. I was hoping
for LL(k) or maybe at worst GLL. But I agree that the benefits of an
automatic parser do not necessarily outweigh the downsides.

On Thu, Nov 12, 2015 at 12:30 PM, Daniel Grunwald notifications@github.com
wrote:

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written
recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a
(partial) syntax tree for code completion, you'll have to puzzle it
together from the partially matched productions remaining on the parser's
stack. Or (equivalently) introduce a whole bunch of error productions, and
you'll always forget some and have cases where nodes aren't integrated into
the syntax tree and code completion suffers. This stuff will be much much
easier with the existing hand-written parser (since it's LL-style, there's
always an obvious choice for connecting the current node to the root of the
tree when you encounter an error).


Reply to this email directly or view it on GitHub
#1331 (comment).

@dgrunwald
Copy link
Contributor

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify an unambiguous formal grammar that matches rustc's behavior (modulo rustc bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of rust's expression grammar (allowing integer literals, the "as" operator, and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not: e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
    | expr "as" type
    | expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
    | expr "as" type_LT_follow
    | expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such disambiguation rules, and our grammar increases exponentially in size. It becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is also the option taken in the C# language specification: the formal grammar in the appendix is just a rough approximation, and sections like "7.6.4.2 Grammar ambiguities" explain how the parser should handle specific tricky cases.

But if the formal grammar is just an approximation, that makes automated consistency checks rather tricky...

@nikomatsakis
Copy link
Contributor

@dgrunwald

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

We'll have to see how it goes. I'll note though that LALRPOP's macros
include an explicit feature for just this case, but I've not tried using it
"at scale" yet. This is the kind of thing I was thinking of when designing
it, though. (In particular, they let you specify flags that can filter
which productions are used, etc.) I seem to recall that JavaScript uses a
similar mechanism for cases like this in their spec, but I don't recall for
certain.

On Thu, Nov 12, 2015 at 7:37 PM, Daniel Grunwald notifications@github.com
wrote:

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify
an unambiguous formal grammar that matches rustc's behavior (modulo rustc
bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in
English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of
rust's expression grammar (allowing integer literals, the "as" operator,
and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not:
e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to
be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
| expr "as" type
| expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
| expr "as" type_LT_follow
| expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is
also the option taken in the C# language specification: the formal grammar
in the appendix is just a rough approximation, and sections like "7.6.4.2
Grammar ambiguities" explain how the parser should handle specific tricky
cases.

But if the formal grammar is just an approximation, that makes automated
consistency checks rather tricky...


Reply to this email directly or view it on GitHub
#1331 (comment).

@rprichard
Copy link

@dgrunwald

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

FWIW, the parser-lalr grammar does seem to handle this ambiguity correctly using precedence directives rather than duplicating the expression non-terminal. I'm guessing a precedence directive is also the reason this parser-lalr bug isn't caught at compile-time.

@glaebhoerl
Copy link
Contributor

This seems relevant.

@nikomatsakis
Copy link
Contributor

(Nominating for final comment period discussion, since it seems like we've reached a fixed point here in this comment thread.)

@nikomatsakis
Copy link
Contributor

Hear ye, hear ye! This RFC is now entering final comment period.

@nikomatsakis nikomatsakis added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed I-nominated labels Jan 8, 2016
@jarcane
Copy link

jarcane commented Jan 12, 2016

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

@nikomatsakis
Copy link
Contributor

@jarcane

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

It would certainly help, though probably not by THAT much.

@nikomatsakis
Copy link
Contributor

Huzzah! This RFC has been accepted by the language design subteam.

nikomatsakis added a commit to nikomatsakis/rfcs that referenced this pull request Jan 15, 2016
@nikomatsakis nikomatsakis merged commit 44e88d3 into rust-lang:master Jan 15, 2016
@Centril Centril added the A-syntax Syntax related proposals & ideas label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-syntax Syntax related proposals & ideas final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-lang Relevant to the language team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.