file-types section in package.json incomplete #35
I think I'm not sure yet, but I think they may have some syntactic constructs that are not valid in Clojure or ClojureScript:
via; https://github.com/Tensegritics/ClojureDart/blob/main/doc/differences.md#parametrized-types There might be other things that are different too (see the page linked to) -- the one above is just something I became aware of looking through their Update: According to cgrand:
So my earlier statement:
is not correct. However, as can be seen in #46, our grammar wasn't handling everything appropriately. |
On a side note, I know that there is at least one type of thing in ClojureCLR that won't work with tree-sitter-clojure. It's documented here:
Some examples from the aforementioned page:
Note that unlike ClojureDart and ClojureScript, AFAIU, ClojureCLR does NOT use a different file extension to distinguish itself from Clojure. They both use Update: As clarified here, it turns out ClojureCLR can also use |
Here's another list of extensions courtesy of helix editor: file-types = ["clj", "cljs", "cljc", "clje", "cljr", "cljx", "edn", "boot"] I think I hadn't seen
|
I always thought cljr was for clojure CLR :) Does the modified reader for clojure CLR cause the grammar to break? That seems like a feature that would be rarely used (does typical C# code produce assemblies with all those weird characters??). If the parser still works it might be worth not touching at the risk of making the parser even more complicated. |
Apparently you were not alone: helix-editor/helix#3387 (comment) Speaking as a former user, I don't believe that was the case, but perhaps things have changed. Maybe we'll get a response to this. |
I don't have any data on how frequently things are used -- I'm pretty sure this extension to the syntax wasn't added lightly (and I think RH may be aware of this change). Perhaps I should have chosen a different sample from that page. Here's one that's much more likely to appear:
Here's some text from the same page regarding the above example:
Regarding:
I guess I can try but I doubt it will work. Ok, here's what I get for the example:
Seems to not work. AFAIU, there aren't a large number of ClojureCLR users out there (if there are, they are hiding well) and the last I checked, the current maintainer is in the midst of rewriting from scratch using F# (not sure whether he's still doing this). It doesn't seem like an urgent matter and it might be better done in a separate grammar. Perhaps it is possible to handle this sort of thing using an external scanner though. |
An extension grammar might work well for detecting these symbols, but it may be difficult for editors to use when the language seems to be using the same file extensions. How should it know when a file is meant to be executed on the CLR or the JVM? FWIW, that specific symbol |
I didn't have difficulties in practice as I tended to manually manage my REPL connections, but there is a trick you can do with the REPL and reader conditionals to determine what sort of Clojure is sitting at the other end. I agree it would be far less confusing if a different file extension were to be used. Maybe there is still time for that to change. (On a side note, it looks like some support for ClojureCLR was added to inf-clojure last summer: clojure-emacs/inf-clojure#202)
Thanks for checking.
Perhaps not any time soon anyway :) |
For clarity, https://github.com/clojure/clojure-clr/blob/master/Clojure/Clojure/Lib/RT.cs#L3229 |
@IGJoshua Appreciate you taking the time to respond and explaining! |
Re: ClojureDart - cgrand mentioned here that:
So perhaps we can add I suppose finding some collection of sample code and doing some testing first might not be a bad idea. Below is an example of it working:
This is using ahlinc's alpha that has parsing via stdin built in. |
That is super cool! Nice find. |
I think adding the cljd (and cljr probably) extension is a good move. Is there anything stopping us from doing so? I'm not too worried about the impact it would have on other things that might use those extensions. It just seems like a very remote edge case that will never come up. |
Thanks for your comments. I'm thinking to try to find some I'm not so sure about As far as I'm aware, technically, this whole file extension thing only affects the tree-sitter cli (but maybe that's not true) so I'm also not too concerned about anything else that might happen to use the extensions. There is what people might read into it though, so I think making some appropriate statements might be preferable to be less "misleading". [1] This search shows some results here at least. |
@dannyfreeman Maybe we can adapt the babashka script you wrote for clojars fetching to get these things (though I guess we might want to use git?) and make a babashka task for it plus parsing to test (a bit along these lines). |
Looks like
|
It doesn't look like it's worth the trouble to use Instead, the following might be fine:
Much simpler. [1] You apparently need to know some commit sha or branch name from some other means. There doesn't appear to be a way to get that kind of info via the API. There is a |
I don't know yet if the construct is valid but I found something our grammar doesn't seem to handle:
Assuming for the moment that that is valid, I presume the I guess that's not a case I anticipated: tree-sitter-clojure/grammar.js Lines 335 to 342 in 421546c
If we're lucky a fix might be a matter of attending to that last Adding something in the last This brings up the topic of whether that I guess testing these possibilities out might be a next step to consider, though establishing the validity of the code might be a good idea too :) |
Sounds like updating the file list should depend on us making sure we work with clojure dart properly first. I've commented over there with some thoughts |
Below are some numbers about how many of each type of file extension I was able to find recently in "release" jars from Clojars:
It would be interesting to see if @phronmophobic has done any similar tallying of things collected for dewey. For example, I expect that GitHub repositories would be more likely to have |
@sogaiu , not sure exactly what info you're looking for. I did a tally using the suffixes from your list and got the following results:
The local data on my laptop is a few months out of date, but hopefully should be useful as an estimate. If there's other info you're interested in, I can try to look into it. FWIW, all of the clojure repos that dewey tracks on github adds up to only about 15gb of compressed data. |
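For readers who want to reproduce this kind of tally on their own checkout of repositories or unpacked jars, here is a minimal sketch. It is not the code either commenter used. The function name `tally_extensions` and the choice to walk a local directory are assumptions for illustration, and the extension set is simply the Helix list quoted earlier in the thread.

```python
from collections import Counter
from pathlib import Path

# Extensions taken from the Helix list quoted earlier in this thread.
EXTENSIONS = {"clj", "cljs", "cljc", "clje", "cljr", "cljx", "edn", "boot"}


def tally_extensions(root, extensions=EXTENSIONS):
    """Count files under `root` whose suffix (without the dot) is in `extensions`."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lstrip(".")
            if ext in extensions:
                counts[ext] += 1
    return counts
```

Calling `tally_extensions("some/local/dir")` returns a `Counter` keyed by extension, which can be compared across sources such as Clojars jars and git repositories.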
@phronmophobic Thanks a lot for the summary and the numbers. To give some background, the quality of tree-sitter-clojure relies heavily on testing against a large sample of source, partly as a proxy for there not being an official specification (we've found constructs / usages via the testing that we hadn't imagined) but also to act as a check when we change things. When I first started testing against source samples written by others I used git repository content. I believe at the time (this was a few years ago so my recollection may be untrustworthy) I was finding that there were too many cases where source content was off in one way or another (e.g. a missing delimiter or not realizing that what's inside I switched to using "release" jar content from Clojars figuring that the content was more likely to actually run (and hence ought to read / parse). Mostly that seems to have worked out. However, I was only looking at Recently we discovered that That suggested to me that there might be files with other extensions that would be worth looking at. I don't imagine the list of file extensions I wrote about above misses a whole lot [1], but if there are others I'd like to investigate. It looks like dewey has more files in total from your map as well as the 15gb compressed data number -- though maybe that includes There's also the matter of possible bias -- possibly git repositories might be somehow more representative. So maybe it's time to try out the grammar on the data identified by dewey :) In any case, thanks again for sharing your results and the ongoing efforts with dewey. [1] The list is basically a combination of what Cider and Helix use. |
Dewey relies on github's notion of which libraries are "clojure" libraries. There are definitely some false positives and false negatives. The most recent work on dewey includes automating the full pipeline so that dewey releases happen regularly (and cheaply) without any manual steps (except for clicking the release button at the very end). Part of the reason for this is to make it easier to support running more analyses. Depending on what you're looking for, I would be happy to chat about how to use dewey's data or do some ad-hoc analysis using dewey's data pipeline. If there's a simple way to try to parse or benchmark parsing across repos, then I could look into plugging that in. |
Thanks for the further details and offer. This bit of code shows what I'm working on putting in place. Steps are roughly:
The output is a starting point for further investigation. In general I have found it necessary to examine the actual files that have had parsing issues as the tree-sitter cli output does not provide enough detail. If there are more than a small number of cases, I've used clj-kondo linting results to try to create groups of files to try to make examining the results manageable. These last "post-parsing examination" bits make me wonder if it's a good fit for trying to use the data pipeline you've constructed. The files I've had the least amount of experience examining are those that don't have the file extensions
If Is it straight-forward to determine which repositories those might be? I looked at On a side note, is "default" branch info for each repository saved somewhere? With that info, it seems like it might be possible to create urls to the |
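The parse-and-collect-failures step described in the comment above can be sketched roughly as follows. This is not the babashka code linked in the thread. It assumes the `tree-sitter` cli is on the PATH, that it is run from a directory where the grammar can be found, and that `tree-sitter parse --quiet` exits with a nonzero status when a file has parse errors. The function name `find_parse_failures` is hypothetical, and the command is a parameter so it can be swapped out.

```python
import subprocess
from pathlib import Path


def find_parse_failures(root, extensions,
                        parse_cmd=("tree-sitter", "parse", "--quiet")):
    """Return sorted paths under `root` for which the parse command exits nonzero."""
    failures = []
    for path in sorted(Path(root).rglob("*")):
        # Only consider regular files with one of the wanted extensions.
        if not path.is_file() or path.suffix.lstrip(".") not in extensions:
            continue
        # Assumption: a nonzero exit status means the file did not parse cleanly.
        result = subprocess.run([*parse_cmd, str(path)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(path)
    return failures
```

The returned list is only a starting point: as noted above, the files themselves still need to be examined by hand (or grouped using linting results) to understand why parsing failed.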
Yes. I have all the repos stored locally (which is how I found the tallies). When I get a free moment, I can create a list of repos and paths for the uncommon file suffixes.
Yes, that info can be found in the
Yea, it takes a while (a few hours?). Especially if you try to adhere to github's rate-limiting (which I do). There's no explicit rate-limit for non-API requests, but I try to keep it below their API request limit (5,000 requests an hour). |
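To make the arithmetic concrete: staying under 5,000 requests an hour means leaving at least 3600 / 5000 = 0.72 seconds between requests. Below is a minimal throttle sketch under that assumption. It is not dewey's implementation, and the class name `Throttle` and its parameters are hypothetical.

```python
import time


class Throttle:
    """Space calls so that at most `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour=5000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # 0.72 s for 5,000/hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each download keeps the request rate at or below the chosen limit. At 0.72 seconds per request, a few thousand files works out to roughly an hour or more, which is consistent with the "few hours" estimate above.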
Ok, here's a zip of paths for the less common suffixes and the github repos they were found in: |
Wow, thanks so much for the zip file and the explanations! |
After doing a bit of filtering, I ended up with 600 or so URLs for |
You can also download individual files if you know the user, repo, and path. https://github.com/phronmophobic/dewey/blob/main/src/com/phronemophobic/dewey.clj#L167 |
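As an illustration of the tip above, here is one way to build a download URL from a user, repo, branch, and path. It uses the `raw.githubusercontent.com` URL layout. The helper name `raw_file_url` is hypothetical, and this is a sketch, not necessarily what the linked dewey code does.

```python
from urllib.parse import quote


def raw_file_url(user, repo, branch, path):
    """Build a raw.githubusercontent.com URL for one file in a repository."""
    # quote() leaves "/" unescaped by default, so nested paths stay intact.
    parts = [quote(part) for part in (user, repo, branch, path)]
    return "https://raw.githubusercontent.com/" + "/".join(parts)
```

For example, `raw_file_url("phronmophobic", "dewey", "main", "src/com/phronemophobic/dewey.clj")` yields the raw form of the file linked above. The branch has to be known in advance, which is why the default-branch information mentioned earlier matters.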
Indeed. Thanks for the tip! |
Of the 1228 files that matched the file extension criteria, 7 produced parse errors. The tldr is that each of the files had at least one issue that prevented successful parsing. I think none of them should have parsed correctly. It looks like we didn't find anything unexpected among the less common file extension file content...this time. But now we have code to repeat this sort of exercise in the future :) Below are some descriptions about why parsing failed.
- Content that has naked prose (non-Clojure code) and Clojure code in it
- Content that looks like something that is being used as a placeholder / reminder for porting from some non-Clojure language to Clojure
- Unintentionally malformed content (delimiter issue)
- Template content that isn't strictly speaking Clojure, but quite close (uses file extensions often used by files that have "Clojure" in them [1])
- Intentionally malformed content
[1] Perhaps an unfortunate fairly prevalent practice from the perspective of folks doing these kinds of analyses :) |
A note for future reference... As seen earlier, So perhaps |
Since folks seem to keep coming up with additional Clojure flavors, it seems possible the file types may never be complete... I'm going to close this issue for now. If there are particular file types that should be considered for addition, I suggest we make individual issues for them (ofc, multiple file types can be mentioned in a single issue :) ). |
For future reference, Pulsar currently has this list:
|
Currently, clojure-mode.el has the following section: via: https://github.com/clojure-emacs/clojure-mode/blob/3453cd229b412227aaffd1dc2870fa8fa213c5b1/clojure-mode.el#L3221-L3230
The current package.json has the following section: tree-sitter-clojure/package.json, lines 21 to 26 in 262d6d6
Maybe there are some things worth adding.
AFAIU, one thing this information affects is the tree-sitter cli's scanning (as described briefly here). I am not very clear on how this works, but I wonder if there could be issues if other non-Clojure grammars use any of the same file extensions.
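To illustrate the concern about shared extensions, here is a small sketch that checks a set of grammars for overlapping file types. It does not reproduce how the tree-sitter cli actually resolves a file to a grammar. The function name `find_extension_collisions` and the input shape (a mapping from grammar name to its list of file types) are assumptions for illustration.

```python
from collections import defaultdict


def find_extension_collisions(grammars):
    """Map each file extension claimed by more than one grammar to those grammars."""
    claimed = defaultdict(list)
    for name, file_types in grammars.items():
        for ext in file_types:
            claimed[ext].append(name)
    # Keep only extensions that more than one grammar lists.
    return {ext: names for ext, names in claimed.items() if len(names) > 1}
```

Given hypothetical input in which two grammars both list `edn`, the result would contain a single `edn` entry naming both, which is the kind of overlap worth checking before adding new extensions.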