Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No tests to express expected failure for templates and other things #42

Closed
sogaiu opened this issue Feb 10, 2023 · 20 comments
Closed

No tests to express expected failure for templates and other things #42

sogaiu opened this issue Feb 10, 2023 · 20 comments

Comments

@sogaiu
Copy link
Owner

sogaiu commented Feb 10, 2023

There was a query about how expected failures might be tested for at the tree-sitter repository.

It got me thinking that we might be able to capture and express some of our intents about what we don't support.

Some possible examples of such things might include:

Additions based on further discussion:


[1] I think we now know what construct this is and ATM it seems reasonable to try to support it.

@sogaiu sogaiu changed the title No tests to express expected failure for templates No tests to express expected failure for templates and other things Feb 10, 2023
@sogaiu
Copy link
Owner Author

sogaiu commented Feb 11, 2023

Below is an example corpus test for {{ }} templates:

================================================================================
Templates with `{{` `}}` Should Fail
================================================================================

(defproject codebrutale/{{name}} "0.1.0-SNAPSHOT"
  :description "FIXME"
  :url "https://github.com/codebrutale/{{name}}"
  :license {:name "Eclipse Public License"
            :url "https://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.9.0"]])

--------------------------------------------------------------------------------

(source
  (list_lit
    (sym_lit
      (sym_name))
    (sym_lit
      (sym_name))
    (ERROR)
    (map_lit
      (map_lit
        (sym_lit
          (sym_name))))
    (str_lit)
    (kwd_lit
      (kwd_name))
    (str_lit)
    (kwd_lit
      (kwd_name))
    (str_lit)
    (kwd_lit
      (kwd_name))
    (map_lit
      (kwd_lit
        (kwd_name))
      (str_lit)
      (kwd_lit
        (kwd_name))
      (str_lit))
    (kwd_lit
      (kwd_name))
    (vec_lit
      (vec_lit
        (sym_lit
          (sym_ns)
          (sym_name))
        (str_lit)))))

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 11, 2023

Note the following quote by the tree-sitter creator:

Yeah, there is intentionally not much that you as a grammar author can do to manipulate the error recovery.

Also, looking at the first comment in this, it seems possible that tweaks to one's grammar might change whether and where there might be ERROR and/or MISSING nodes.

If that is the case, then having too many corpus tests that indicate precise failure patterns might not be so great from a maintenance perspective...

@dannyfreeman
Copy link
Collaborator

I don't think we should test for expected failures, I think we should just let them be failures. Like you point out, we can't really control how things fail. Changes to tree sitter versions and the grammar may change what the failures look like. To me, the important cases to test are what we expect tree sitter to parse correctly. That's all consumers of the parser care about, is that valid clojure produces a usable parse tree. Errors are expected to occur while the library is in use, and those errors are completely ignored most of the time.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 14, 2023

That sounds mostly good.

I want to document that we don't necessarily support some things though, like the items mentioned earlier.

If the testing could have been reliable, that would have been a bit more convenient.

@dannyfreeman
Copy link
Collaborator

There is always the option of some more hand crafted tests. Like a script that parses a file with the grammar, and we assert that it returns a non-zero exit code, but also that it doesn't crash the parsers (like the null byte did). There is no reason it needs to go through the corpus testing.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 14, 2023

Good point.

So perhaps the zero byte [1] thing is something we should have a test for too.

Regarding the nul zero byte thing, I found this discussion:

(I'm posting multiple links though it's the same discussion because GitHub discussions tends to hide things -- not a helpful UI / UX sometimes.)

I'm not sure whether we're supposed to be able to handle them...


[1] At this point, from the perspective of widespread use, I'm not sure whether "nul" or "null" is an appropriate term. I started using "zero byte" at some point to avoid the whole topic, but apparently not as far back as the referenced discussion :) See here for some more on this if interested.

@dannyfreeman
Copy link
Collaborator

I suspect NUL byte came from the days when c-niles did everything in their power to eliminate redundant characters in all their identifiers. I'll try to stick with zero byte from here on out.

Regardless of that one scenario, some simple tests like this might be worth throwing into a script.

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 15, 2023

Ok, I'll update the initial post to include mention of zero bytes so there's a place we can look that fits on most normal screens :)

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 27, 2023

I added some preliminary expected failure testing in 51ca143 (on the dev branch).

There isn't anything yet for ClojureCLR's pipe-delimited symbols, but I did add something for a template file that leads to a parse error and something for a file with a zero byte in it.

The tests can be invoked by bb expect-failures.

Not crazy about the output but at least the testing occurs.

On a related note, I experimented with using TAP in tree-sitter-janet-simple as mentioned at the bottom of #39 (comment) (see the link named "described here" if interested) [1].

Possibly it would be interesting to consider using TAP here too.


[1] I'm avoiding putting the link here because then this issue will show up at the tree-sitter repository unnecessarily :P

@sogaiu
Copy link
Owner Author

sogaiu commented Feb 27, 2023

For the ClojureCLR pipe-delimited symbol testing, it's not necessarily that we get an error parsing, it's that the parsed result may be quite different from correct.

Here is a comment that has an example along with the resulting parse output.

Testing for that would be different from the other two things where there was checking for an exit code of 1 (possibly parsing the output to look for ERROR would be good too, but that's not there ATM).

@dannyfreeman
Copy link
Collaborator

dannyfreeman commented Feb 28, 2023

The failure tests work great so far.

For the CLR test case, it's not a failure, but a strange parse tree that is technically does not produce errors. I can pass this into the JVM clojure repl

'(|System.Collections.Generic.IList`1[System.Int32]|)

And it gives me a valid result. I don't think it should produce an error then. Since that is how the reference implementation behaves, perhaps this should just be described in a corups test??

Edit: here is a PR against dev. Feel free to press merge if you like it #48

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 1, 2023

Thanks for trying things out.

I think that:

|System.Collections.Generic.IList`1[System.Int32]|

is supposed to be a single symbol.

Interacting with clojure via the command line I get:

$ rlwrap clojure
Clojure 1.11.1
user=> '|System.Collections.Generic.IList`1[System.Int32]|
|System.Collections.Generic.IList
1
Syntax error (ClassNotFoundException) compiling at (REPL:0:0).
System.Int32
Syntax error compiling at (REPL:0:0).
Unable to resolve symbol: | in this context
user=>

I think what's happening here is that:

  • |System.Collections.Generic.IList is recognized as a symbol [1]
  • 1 (actually backtick followed by 1) is recognized as a number
  • [System.Int32] is recognized as a vector with a single symbol in it
  • The last | is being interpreted as a symbol

The ClojureCLR docs here have this bit:

Vertical bars are used in pairs to surround the name (or part of the name) of a symbol that has any special characters (from the standard Clojure symbol viewpoint) in it. For example, |A(B)|, A|(|B|)|, and A|(B)| all mean the symbol whose name consists of the four characters A, (, B, and ).

So the parsing doesn't seem correct to me.

I suspect if this is to be handled that doing it via an external scanner might be worth considering. I'm not yet convinced effort should be spent on it at this stage though.


[1] I tried the following too:

$ rlwrap clojure
Clojure 1.11.1
user=> (def |System.Collections.Generic.IList 1)
#'user/|System.Collections.Generic.IList

so that seems to be evidence in favor of |System.Collections.Generic.IList being interpreted as a symbol.

@dannyfreeman
Copy link
Collaborator

dannyfreeman commented Mar 1, 2023

In JVM clojure it is definitely more than one symbol, and it is possible to parse those symbols without errors (executing is another matter). It seems that in clojure CLR it is a single symbol. We can't support both because while scanning we don't know what platform the code is intended to run on. Maybe it's CLR, maybe JVM, maybe it's both, but we don't have that info while parsing.

I think in this case we support the more common platform. Errors are not produced when parsing, so the only downside I can see is for CLR clojure they might get some funny syntax highlighting on these types of symbols. Since the CLR implementation has a different reader, then I think this would be the case for a special grammar that inherits this grammar as a parent and adds custom rules for these symbols. I doubt anyone will think this is enough of a problem to actually write the new grammar. It would be a lot of hassle to use just for the CLR. Based on how clojure CLR is used, it would be difficult to know when to use a CLR grammar since a lot of the code uses the normal .clj extension.

@NoahTheDuke
Copy link

@dmiller might have thoughts here. I know he's currently in the process of rewriting the ClojureCLR in F#, maybe he'd be willing to shed some insights on this stuff for us. If the cljr community can move to relying on a different filetype, we could help fork this for them. Otherwise, I agree we're at an impasse.

@dannyfreeman
Copy link
Collaborator

Yeah, it's would be an involved process. Not only would we need a sub-grammar based on this one, editors would also need some support to use it. It would result in a special tree-sitter-clojure-clr.so dynamic library after all. In Emacs I believe that would be at least a derived major mode.

Unrelated: It is very cool to see ClojureCLR getting rewritten in F#.

@dmiller
Copy link

dmiller commented Mar 1, 2023

I'm not sure of all the questions being asked, but here goes from what I can figure out.
The |...| syntax I would love to see going away in the rewrite, but unfortunately I may need to keep it for backward compatibility.
If tagged literals had been available at the time, I would have used them instead of | ... |, but so it goes.

Regarding |System.Collections.Generic.IList`1[System.Int32]| -- this unfortunately is the actual CLR name for the type.
I didn't know what kind of sweetened syntax to put in for typenames, a la C# or F# -- no experience in the real world at that point -- so i went with the official CLR syntax. Given all the strange characters, I had to introduce | ... | quoting for symbol names. Stolen with some modification from Common Lisp.

How JVM handles it -- well, you can see it is weird.

Does this help?

@dannyfreeman
Copy link
Collaborator

dannyfreeman commented Mar 1, 2023

_It is good to know that this syntax is probably not going away.

One question that I think would be good to know: will clojure CLR be moving to a new file extension .cljr? Right now it looks like a lot of it uses .clj, same as the JVM. Without a distinct file extension I don't think we will be able to provide a separate tree sitter grammar for clojure CLR that would handle these kinds of differences in the reader.

If the plan is to stick with .clj that is fine, the only consequence is that type names like |System.Collections.Generic.IList`1[System.Int32]| might have some strange syntax highlighting in certain editors.

@dmiller
Copy link

dmiller commented Mar 1, 2023

The plan is to move to a new extension .cljr.
The code to do that is in the latest alpha release of ClojureCLR (the current version, not the rewrite).
I will make that transition in all the libraries I maintain by the next release.
Though I'm sure folks will be happy enough to have any support and can live with some strange syntax highlighting.

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 7, 2023

In c86581f [1] I've added a babashka task to demonstrate an expected misparse.

bb show-expected-misparses should produce some output. Not that illustrative at the moment. Probbably it could use some work, but at least something is recorded in a way that can be checked programmatically.


[1] This is on a temporary branch where I'm working on various babashka tasks.

@sogaiu
Copy link
Owner Author

sogaiu commented Mar 12, 2023

Most (if not all) of the testing for these bits will be living in a separate repository, so I'll close this issue for now.

Thanks to all the participants!

@sogaiu sogaiu closed this as completed Mar 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants