Formatting of grammar.js is unusual #39

Closed
sogaiu opened this issue Jan 29, 2023 · 26 comments
Comments

@sogaiu
Owner

sogaiu commented Jan 29, 2023

The formatting of grammar.js now has portions that look like this:

source: $ =>
repeat(choice($._form,
              $._gap)),

AFAIK, this arrangement is not typical. However, it is the result of trying to achieve:

  • better readability elsewhere inside grammar.js, and
  • having a semi-automated way of arriving at such results

We tried some formatters including prettier and js-beautify, but failed to get them to handle expressing nested calls with indentation that was sufficiently readable to us.

As an example of something that we would prefer but failed to manage with other methods, consider:

meta_lit: $ =>
seq(field('marker', "^"),
    repeat($._gap),
    field('value', choice($.read_cond_lit,
                          $.map_lit,
                          $.str_lit,
                          $.kwd_lit,
                          $.sym_lit))),

We didn't figure out how to get that sort of result from any of the formatters we tried, nor, it seems, did any of the other tree-sitter grammar repositories we checked.

Although JavaScript is being used to express things in grammar.js, perhaps the number of nested calls is somewhat unusual and thus (some?) existing formatters are not likely to have considered this kind of use (i.e. to express a grammar description).

We did have some luck with Emacs' js.el using code like:

(setq-default indent-tabs-mode nil)

(require 'js)

(setq js-indent-level 2)

(setq js-indent-align-list-continuation t)

(setq make-backup-files nil)

(defun tsclj-massage-and-save ()
  "Untabify, indent, and save buffer content."
  (message "Untabifying...")
  (untabify (point-min) (point-max))
  (message "Indenting...")
  ;; recompute bounds: untabify may have changed the buffer size
  (indent-region (point-min) (point-max) nil)
  (message "Saving...")
  (save-buffer))

This allowed us to arrange for nested function calls to appear as we liked, but a side effect was the unusual indentation demonstrated at the beginning of this post. While that is unfortunate, we think the alternative is less desirable.

We may eventually add code for formatting to the repository and a method to invoke it from the command line, perhaps a script like:

#! /bin/sh

# pass grammar.js as first argument to script

# n.b. order of arguments seems to matter below
emacs --batch \
      --load=tsclj-massage.el \
      "$1" \
      --funcall=tsclj-massage-and-save

For invocation from cmd.exe, perhaps something like the following will do:

REM pass grammar.js as first argument to script
set grammarfile=%1

REM n.b. order of arguments seems to matter below
emacs --batch^
      --load=tsclj-massage.el^
      %grammarfile%^
      --funcall=tsclj-massage-and-save
@NoahTheDuke

One thing to note: Emacs is less available/portable than JavaScript formatters.

@sogaiu
Owner Author

sogaiu commented Jan 29, 2023

@NoahTheDuke Thanks for taking a look and commenting.

I think the target audience here is people who might work on the grammar. More generally, perhaps it is for entities that might modify it with the intention of having the results merged in. But maybe there are cases I have overlooked.

I usually use some sort of Linux box but I also end up having to work with Windows. The outlined approach works with those and I have a hard time imagining it's not going to work on macOS or the BSDs -- which historically have been able to run Emacs. Note that this approach doesn't require that anyone use Emacs as an editor interactively -- it's just used to execute a script.

It is true however that the formatting that js.el does as a result of:

(setq js-indent-align-list-continuation t)

has a long and not-well-documented past [1]. In that sense it may be fragile. OTOH, if that stops working in a future Emacs, perhaps we can point at the old version of the code.

I'd like an approach that doesn't depend on an editor, so I'm happy to consider other formatting options. Probably I should also mention that I'm leaning toward removing the use of npm so I'd prefer to choose something that could work along those lines.

Do specific situations come to mind where you think the outlined approach might be an issue?


[1] It may have come from code in javascript.el written by Karl Landstrom and then adapted by Steve Yegge (search for "Karl Landstrom to the rescue" if interested).

@NoahTheDuke

Do specific situations come to mind where you think the outlined approach might be an issue?

Only that most Linux distributions, macOS, and Windows don't come with Emacs installed by default. I personally use neovim (and vim before that), so having to install Emacs merely to format the code is a much bigger step than using the language I'm currently writing to format the code.

Seems like a bad trade to go from relying on the de facto standard tooling (npm) to a specific editor.

On the other hand, I'm not an active dev so maybe this doesn't matter.

@sogaiu
Owner Author

sogaiu commented Jan 29, 2023

To do development for a tree-sitter grammar, in my experience, I've found that what I needed included at least:

  • C compiler
  • Emscripten
  • Node.js
  • Python
  • Rust
  • tree-sitter

These are not all available on any of the platforms I use out-of-the-box [1] -- I don't think there is a platform that has them all by default. So IIUC there is likely installation work involved no matter what your environment is.

Windows is the platform where I found it took the most work to get things working. I tend to use scoop there and it was a matter of scoop install emacs -- somewhat analogous to apt install emacs-gtk or pacman -S emacs or brew install emacs.

Relatively speaking, installing Emacs is, in my experience, less problematic compared to some other pieces (e.g. Emscripten) and further, I have found it to be not much work.

Installation isn't the only factor of course -- updating is also something one needs to consider. As you've probably noticed from the other issues, Emscripten takes much more effort and is cumbersome to get right for tree-sitter grammars.

As I said earlier though, I'm not averse to choosing another non-editor-dependent approach. I just haven't found a suitable one yet.

The option of formatting things manually is also available -- it's what I did for most of the life of this repository. Having an automated way is nice to be able to use, but one does not have to use it all of the time. If someone is asked to change their formatting in a PR, then it seems a good option to provide a way to make that easier.

Regarding:

Seems like a bad trade to go from relying on the de facto standard tooling (npm) to a specific editor.

AFAIU, npm doesn't provide any formatter on its own (and it has other significant problematic issues). I don't see a trade-off being made here. I'm hoping to stop using npm for this project as well.

Thanks for sharing your thoughts by the way. I think the resulting communication can help to explore and spell certain things out.


[1] Note that getting an appropriate version is also relevant and one may need to install a different version of something that is already installed.

@dannyfreeman
Collaborator

I'd agree with Noah that using Emacs for formatting is a barrier for other contributors. The way I see this, we could keep using Emacs or switch to prettier, which is the de-facto standard js formatter. If we use Emacs, we don't need to require users to install it. We can apply the formatting ourselves after other people develop with normal tools. I don't mind doing that at all.

@sogaiu
Owner Author

sogaiu commented Jan 29, 2023

@dannyfreeman

I think there may be a misunderstanding here -- I don't think I ever stated that there is a requirement for installing Emacs. I'm sorry if I gave that impression, it was not intended.

Also, I stated this:

The option of formatting things manually is also available -- it's what I did for most of the life of this repository. Having an automated way is nice to be able to use, but one does not have to use it all of the time. If someone is asked to change their formatting in a PR, then it seems a good option to provide a way to make that easier.

but I'm not necessarily averse to performing some reformatting manually, nor do I think the stated style is particularly difficult to learn.

I personally find reading the other grammars I've looked at quite taxing in spots. When I have to look at many rules I am not familiar with (which is likely to be the case if I come back to look here too), the existing styles really don't help.

There is a bit of oddness to the proposed one here, but it's far easier for me to understand at a glance than the alternatives I have looked at.

Having said all of this, I'm still fine to investigate looking for alternatives that might work better.

I very much doubt that prettier will be a good choice though -- I say this based on having tried it out as well as based on what it claims:

An opinionated code formatter
Has few options

It's fine to be opinionated but it's opining about a different use of JavaScript than how it's being used for tree-sitter grammars as a DSL. Having few options means it's not likely to be that tweakable IIUC. I don't think it's a good match.

I think it's more likely that js-beautify might be made to work, though I didn't have much luck with it when I tried before. I only tried via here though -- perhaps there are more options. Another plus is that npm is not a requirement for it.

@sogaiu
Owner Author

sogaiu commented Jan 30, 2023

Some other formatting-related programs I've looked at include:

  • jsfmt
  • esformatter
  • codepainter

Looked through options and tried some things out but not much luck yet.

I did get some descriptions that might be handy when searching:

  • Parenthesis-aligned indentation (courtesy of Google's JavaScript Style Guide)
  • Alignment of multiline JavaScript function arguments (bug report at JetBrains)
  • AlignAfterOpenBracket (via Clang Format Style Options)
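
To make the "parenthesis-aligned" vocabulary above concrete, here is a rough JavaScript sketch of the idea. It is only an illustration and not what js.el or astyle actually do: the name alignToOpenParen is made up for this example, and the sketch ignores parentheses that appear inside strings, regex literals, and comments.

```javascript
// Re-indent lines so that each continuation line starts in the column just
// after the innermost "(" that is still open at the start of that line.
// Caveat: parentheses inside strings, regexes, and comments are not skipped.
function alignToOpenParen(source) {
  const stack = []; // columns just past each unclosed "("
  const out = [];
  for (const raw of source.split('\n')) {
    const text = raw.trim();
    const indent = stack.length > 0 ? stack[stack.length - 1] : 0;
    out.push(text === '' ? '' : ' '.repeat(indent) + text);
    for (let i = 0; i < text.length; i++) {
      if (text[i] === '(') stack.push(indent + i + 1);
      else if (text[i] === ')') stack.pop();
    }
  }
  return out.join('\n');
}

// A flattened rule (shortened from the meta_lit example in this thread).
const flat = [
  "meta_lit: $ =>",
  "seq(field('marker', \"^\"),",
  "repeat($._gap),",
  "field('value', choice($.read_cond_lit,",
  "$.map_lit))),",
].join('\n');

console.log(alignToOpenParen(flat));
```

Note that the rule body ends up at the same level as the rule name, which is the "unusual" part of the formatting discussed in this issue.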

@sogaiu
Owner Author

sogaiu commented Jan 30, 2023

I've collected links, vocabulary, samples, and other bits and placed them in a gist, as it's a bit much for here perhaps: https://gist.github.com/sogaiu/75411c556eba685ea4dfa6043970cfed

@sogaiu
Owner Author

sogaiu commented Feb 7, 2023

I'm trying out astyle.

It seems to be able to handle the concern mentioned above regarding nested calls. There are some other bits that I haven't tamed yet though.

@sogaiu
Owner Author

sogaiu commented Feb 7, 2023

As part of the reformatting effort, I tried replacing instances of the // construct (regex literal?) with analogous new RegExp() instances.

There are some tweaks that are necessary to do this (e.g. / is no longer escaped, while backslash itself requires more escaping), but I do like the following kind of thing better than what we have now:

const KEYWORD_HEAD =
      RegExp('[^' +
             '\\f\\n\\r\\t ' +
             '(){}' +
             '\\[\\]' + // double-backslashes for re escapes
             '\\\\' +   // double-backslashes for re escapes
             '"' +
             '~^;`,:/' +
             '@' +
             '\\u000B\\u001C\\u001D\\u001E\\u001F' +
             '\\u2028\\u2029\\u1680' +
             '\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2008\\u2009' +
             '\\u200a\\u205f\\u3000' +
             ']');

Here's what we have on 262d6d6 for comparison:

const KEYWORD_HEAD =
/[^\f\n\r\t ()\[\]{}"@~^;`\\,:/\u000B\u001C\u001D\u001E\u001F\u2028\u2029\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2008\u2009\u200a\u205f\u3000]/;
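
As a small check of the escaping tweaks mentioned above, the following sketch compares a regex literal with a RegExp built from string fragments. It uses a shortened character class rather than the full KEYWORD_HEAD, and the names fromLiteral and fromString exist only for this example.

```javascript
// The same character class written as a literal and as a string pattern.
// In the string form every regex backslash is doubled, so a literal
// backslash inside the class becomes four backslashes.
const fromLiteral = /[^\f\n\r\t (){}\[\]"\\]/;

const fromString =
      new RegExp('[^' +
                 '\\f\\n\\r\\t ' +
                 '(){}' +
                 '\\[\\]' + // double-backslashes for re escapes
                 '"' +
                 '\\\\' +   // a literal backslash
                 ']');

// Both should describe the same pattern text.
console.log(fromLiteral.source === fromString.source);

// Spot checks: an ordinary letter is accepted, excluded characters are not.
for (const ch of ['a', '[', '"', '\\', ' ']) {
  console.log(JSON.stringify(ch), fromLiteral.test(ch), fromString.test(ch));
}
```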

@sogaiu
Owner Author

sogaiu commented Feb 7, 2023

By adding the following definition in grammar.js:

function regex(patt) {
  return RegExp(patt);
}

The sample above can be written:

const KEYWORD_HEAD =
      regex('[^' +
            '\\f\\n\\r\\t ' +
            '(){}' +
            '\\[\\]' + // double-backslashes for re escapes
            '\\\\' +   // double-backslashes for re escapes
            '"' +
            '~^;`,:/' +
            '@' +
            '\\u000B\\u001C\\u001D\\u001E\\u001F' +
            '\\u2028\\u2029\\u1680' +
            '\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2008\\u2009' +
            '\\u200a\\u205f\\u3000' +
            ']');

Here's what STRING's definition looks like:

const STRING =
      token(seq('"',
                repeat(regex('[^"\\\\]')),
                repeat(seq("\\",
                           regex('.'),
                           repeat(regex('[^"\\\\]')))),
                '"'));
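
To experiment with the STRING pieces outside of tree-sitter, one option is to collapse them into a single anchored RegExp and try it in plain Node.js. This is an approximation made up for this example (seq becomes concatenation, repeat becomes "*"), not how tree-sitter itself processes token(seq(...)); the name STRING_APPROX is hypothetical.

```javascript
function regex(patt) {
  return new RegExp(patt);
}

// Approximation of the STRING token as one pattern:
// opening quote, plain characters, then any number of
// (backslash + one character + plain characters), closing quote.
const STRING_APPROX =
      regex('^' +
            '"' +
            '[^"\\\\]*' +
            '(?:\\\\.' +
            '[^"\\\\]*)*' +
            '"' +
            '$');

const samples = ['"abc"', '"a\\"b"', '"abc', '"a"b"'];
for (const s of samples) {
  console.log(s, STRING_APPROX.test(s));
}
```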

@sogaiu
Owner Author

sogaiu commented Feb 7, 2023

Maybe I'll start a new issue with these regular expression bits.

@sogaiu
Owner Author

sogaiu commented Feb 11, 2023

As a fairly speculative idea, I tried translating grammar.js to .edn after not having a large amount of success with existing formatters -- yeah, a bit of a jump :P

I don't know if this will get pursued much, but there are other folks who generate grammar.json from something other than grammar.js, so it's not entirely out there. Noting here also that tree-sitter generate does support starting from grammar.json (and skipping the implicit grammar.js input).

For this to be useful I imagine one would need at least:

  • Code that translates from grammar.edn to grammar.json (and likely grammar.js)
  • The translation code needs to be fast enough (at least to grammar.json -- not necessarily to grammar.js)
  • The code needs to be able to handle .edn and produce .json (jet / babashka are obvious pieces that might be useful)

I haven't looked into which part of the tree-sitter cli generates feedback to the user about conflicts (and other things), so if that turns out to be in the process of generating grammar.json (and not when creating parser.c from grammar.json), that might make this idea more work than might be palatable.

As a side note, using this approach (alone) would mean that Node.js is no longer necessary as that is only used by tree-sitter to produce grammar.json from grammar.js.
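
To illustrate why grammar.js only needs Node.js for the step that produces grammar.json, here is a rough sketch with simplified stand-ins for a few of the DSL functions. These stand-ins are assumptions written for this example; the real tree-sitter functions do more (validation, more rule types), and the exact JSON they emit should be checked against an actual grammar.json.

```javascript
// Simplified stand-ins (not the real tree-sitter DSL implementation).
function normalize(value) {
  if (typeof value === 'string') return { type: 'STRING', value };
  if (value instanceof RegExp) return { type: 'PATTERN', value: value.source };
  return value; // assumed to be a rule object already
}
const seq = (...xs) => ({ type: 'SEQ', members: xs.map(normalize) });
const choice = (...xs) => ({ type: 'CHOICE', members: xs.map(normalize) });
const repeat = (x) => ({ type: 'REPEAT', content: normalize(x) });
const field = (name, x) => ({ type: 'FIELD', name, content: normalize(x) });

// `$` hands back a SYMBOL reference for whatever property is asked for.
const $ = new Proxy({}, {
  get: (_target, name) => ({ type: 'SYMBOL', name }),
});

const rules = {
  source: $ =>
    repeat(choice($._form,
                  $._gap)),

  dis_expr: $ =>
    seq(field('marker', "#_"),
        repeat($._gap),
        field('value', $._form)),
};

// Evaluating each rule function yields plain data that can be serialized.
const grammarJson = Object.fromEntries(
  Object.entries(rules).map(([name, fn]) => [name, fn($)]));

console.log(JSON.stringify(grammarJson, null, 2));
```

Since the result is plain data, anything that can produce the same JSON (such as a translator from .edn) could stand in for this step.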

Being able to generate grammar.js from grammar.edn might be important if there are other grammars that want to inherit from it (as tree-sitter-commonlisp does -- though that inherits from a fork, strictly speaking).

Anyway, below is what it could look like (though I did tweak the default formatting for vectors).

(I made up the :_defs key to give a place for "lower" level pieces to live in.)

{:name "clojure"

 ;; a comment
 :extras []

 :conflicts []

 :inline [:_kwd_leading_slash
          :_kwd_just_slash
          :_kwd_qualified
          :_kwd_unqualified
          :_kwd_marker
          :_sym_qualified
          :_sym_unqualified]

 :_tokens
 {:WHITESPACE_CHAR
  [:regex "["
          "\\f\\n\\r\\t, "
          "\\u000B\\u001C\\u001D\\u001E\\u001F"
          "\\u2028\\u2029\\u1680"
          "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2008\\u2009"
          "\\u200a\\u205f\\u3000"
          "]"]

  :WHITESPACE [:token [:repeat1 :WHITESPACE_CHAR]]

  :COMMENT [:token [:regex "(;|#!)"
                           ".*"
                           "\\n?"]]

  :DIGIT [:regex "[0-9]"]

  :ALPHANUMERIC [:regex "[0-9a-zA-Z]"]

  :HEX_DIGIT [:regex "[0-9a-fA-F]"]

  :OCTAL_DIGIT [:regex "[0-7]"]
  
  :HEX_NUMBER [:seq "0"
                    [:regex "[xX]"]
                    [:repeat1 :HEX_DIGIT]
                    [:optional "N"]]

  :OCTAL_NUMBER [:seq "0"
                      [:repeat1 :OCTAL_DIGIT]
                      [:optional "N"]]

  :RADIX_NUMBER [:seq [:repeat1 :DIGIT]
                      [:regex "[rR]"]
                      [:repeat1 :ALPHANUMERIC]]

  :RATIO [:seq [:repeat1 :DIGIT]
               "/"
               [:repeat1 :DIGIT]]

  :DOUBLE [:seq [:repeat1 :DIGIT]
                [:optional [:seq "."
                                 [:repeat :DIGIT]]]
                [:optional [:seq [:regex "[eE]"]
                                 [:optional [:regex "[+-]"]]
                                 [:repeat1 :DIGIT]]]
                [:optional "M"]]

  :INTEGER [:seq [:repeat1 :DIGIT]
                 [:optional [:regex "[MN]"]]]

  :NUMBER [:token [:prec 10
                         [:seq [:optional [:regex "[+-]"]]
                               [:choice :HEX_NUMBER
                                        :OCTAL_NUMBER
                                        :RADIX_NUMBER
                                        :RATIO
                                        :DOUBLE
                                        :INTEGER]]]]

  :NIL [:token "nil"]

  :BOOLEAN [:token [:choice "false"
                            "true"]]

  :KEYWORD_HEAD
  [:regex "[^"
          "\\f\\n\\r\\t "
          "/"
          "()"
          "\\[\\]"
          "{}"
          "\""
          "@~^;`"
          "\\\\"
          ",:"
          "\\u000B\\u001C\\u001D\\u001E\\u001F"
          "\\u2028\\u2029\\u1680"
          "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2008\\u2009"
          "\\u200a\\u205f\\u3000"
          "]"]

  :KEYWORD_BODY [:choice [:regex "[:']"]
                         :KEYWORD_HEAD]

  :KEYWORD_NAMESPACED_BODY
  [:token [:repeat1 [:choice [:regex "[:'/]"]
                             :KEYWORD_HEAD]]]

  :KEYWORD_NO_SIGIL
  [:token [:seq :KEYWORD_HEAD
                [:repeat :KEYWORD_BODY]]]

  :KEYWORD_MARK [:token ":"]

  :AUTO_RESOLVE_MARK [:token "::"]

  :STRING
  [:token [:seq "\""
                [:repeat [:regex "[^"
                                 "\""
                                 "\\\\"
                                 "]"]]
                [:repeat [:seq "\\"
                               [:regex "."]
                               [:repeat [:regex "[^"
                                                "\""
                                                "\\\\"
                                                "]"]]]]
                "\""]]

  :OCTAL_CHAR [:seq "o"
                    [:choice [:seq :DIGIT :DIGIT :DIGIT]
                             [:seq :DIGIT :DIGIT]
                             [:seq :DIGIT]]]

  :NAMED_CHAR [:choice "backspace"
                       "formfeed"
                       "newline"
                       "return"
                       "space"
                       "tab"]

  :UNICODE [:seq "u"
                 :HEX_DIGIT
                 :HEX_DIGIT
                 :HEX_DIGIT
                 :HEX_DIGIT]

  :ANY_CHAR [:regex ".|\\n"]

  :CHARACTER [:token [:seq "\\"
                           [:choice :OCTAL_CHAR
                                    :NAMED_CHAR
                                    :UNICODE
                                    :ANY_CHAR]]]

  :SYMBOL_HEAD [:regex "[^"
                       "\\f\\n\\r\\t "
                       "/"
                       "()"
                       "\\[\\]"
                       "{}"
                       "\""
                       "@~^;`"
                       "\\\\"
                       ",:"
                       "#'"
                       "0-9"
                       "\\u000B\\u001C\\u001D\\u001E\\u001F"
                       "\\u2028\\u2029\\u1680"
                       "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2008"
                       "\\u2009\\u200a\\u205f\\u3000"
                       "]"]

  :NS_DELIMITER [:token "/"]

  :SYMBOL_BODY [:choice :SYMBOL_HEAD
                        [:regex "[:#'0-9]"]]

  :SYMBOL_NAMESPACED_NAME
  [:token [:repeat1 [:choice :SYMBOL_HEAD
                             [:regex "[/:#'0-9]"]]]]

  :SYMBOL
  [:token [:seq :SYMBOL_HEAD
                [:repeat :SYMBOL_BODY]]]
  }

 :rules
 {:source [:repeat [:choice :_form
                            :_gap]]

  :_gap [:choice :_ws
                 :comment
                 :dis_expr]

  :_ws :WHITESPACE

  :comment :COMMENT

  :dis_expr [:seq [:field "marker" "#_"]
                  [:repeat :_gap]
                  [:field "value" :_form]]

  :_form [:choice :num_lit ;; atom-ish
                  :kwd_lit
                  :str_lit
                  :char_lit
                  :nil_lit
                  :bool_lit
                  :sym_lit
                  ;; basic collection-ish
                  :list_lit
                  :map_lit
                  :vec_lit
                  ;; dispatch reader macros
                  :set_lit
                  :anon_fn_lit
                  :regex_lit
                  :read_cond_lit
                  :splicing_read_cond_lit
                  :ns_map_lit
                  :var_quoting_lit
                  :sym_val_lit
                  :evaling_lit
                  :tagged_or_ctor_lit
                  ;; some other reader macros
                  :derefing_lit
                  :quoting_lit
                  :syn_quoting_lit
                  :unquote_splicing_lit
                  :unquoting_lit]

  :num_lit :NUMBER

  :kwd_lit [:choice :_kwd_leading_slash
                    :_kwd_just_slash
                    :_kwd_qualified
                    :_kwd_unqualified]

  :_kwd_leading_slash [:seq [:field "marker" :_kwd_marker]
                            [:field "delimiter" :NS_DELIMITER]
                            [:field "name"
                                    [:alias :KEYWORD_NAMESPACED_BODY
                                            :kwd_name]]]

  :_kwd_just_slash [:seq [:field "marker" :_kwd_marker]
                         [:field "name" [:alias :NS_DELIMITER :kwd_name]]]

  :_kwd_qualified
  [:prec 2
         [:seq [:field "marker" :_kwd_marker]
               [:field "namespace"
                       [:alias :KEYWORD_NO_SIGIL :kwd_ns]]
               [:field "delimiter" :NS_DELIMITER]
               [:field "name"
                       [:alias :KEYWORD_NAMESPACED_BODY :kwd_name]]]]

  :_kwd_unqualified
  [:prec 1
         [:seq [:field "marker" :_kwd_marker]
               [:field "name" [:alias :KEYWORD_NO_SIGIL :kwd_name]]]]

  :_kwd_marker [:choice :KEYWORD_MARK
                        :AUTO_RESOLVE_MARK]

  :str_lit :STRING

  :char_lit :CHARACTER

  :nil_lit :NIL

  :bool_lit :BOOLEAN

  :sym_lit [:seq [:repeat :_metadata_lit]
                 [:choice :_sym_qualified :_sym_unqualified]]

  :_sym_qualified
  [:prec 1 [:seq [:field "namespace" [:alias :SYMBOL :sym_ns]]
                 [:field "delimiter" :NS_DELIMITER]
                 [:field "name" [:alias :SYMBOL_NAMESPACED_NAME :sym_name]]]]

  :_sym_unqualified
  [:field "name"
          [:alias [:choice :NS_DELIMITER
                           :SYMBOL]
                  :sym_name]]

  :_metadata_lit
  [:seq [:choice [:field "meta" :meta_lit]
                 [:field "old_meta" :old_meta_lit]]
        [:optional [:repeat :_gap]]]

  :meta_lit
  [:seq [:field "marker" "^"]
        [:repeat :_gap]
        [:field "value" [:choice :read_cond_lit
                                 :map_lit
                                 :str_lit
                                 :kwd_lit
                                 :sym_lit]]]

  :old_meta_lit
  [:seq [:field "marker" "#^"]
        [:repeat :_gap]
        [:field "value" [:choice :read_cond_lit
                                 :map_lit
                                 :str_lit
                                 :kwd_lit
                                 :sym_lit]]]

  :list_lit [:seq [:repeat :_metadata_lit]
                  :_bare_list_lit]

  :_bare_list_lit [:seq [:field "open" "("]
                        [:repeat [:choice [:field "value" :_form]
                                          :_gap]]
                        [:field "close" ")"]]

  :map_lit [:seq [:repeat :_metadata_lit]
                 :_bare_map_lit]

  :_bare_map_lit [:seq [:field "open" "{"]
                       [:repeat [:choice
                                 [:field "value" :_form]
                                 :_gap]]
                       [:field "close" "}"]]

  :vec_lit [:seq [:repeat :_metadata_lit]
                 :_bare_vec_lit]

  :_bare_vec_lit [:seq [:field "open" "["]
                       [:repeat [:choice [:field "value" :_form]
                                         :_gap]]
                       [:field "close" "]"]]

  :set_lit [:seq [:repeat :_metadata_lit]
                 :_bare_set_lit]

  :_bare_set_lit [:seq [:field "marker" "#"]
                       [:field "open" "{"]
                       [:repeat [:choice [:field "value" :_form]
                                         :_gap]]
                       [:field "close" "}"]]

  :anon_fn_lit [:seq [:repeat :_metadata_lit]
                     [:field "marker" "#"]
                     :_bare_list_lit]

  :regex_lit [:seq [:field "marker" "#"]
                   :STRING]

  :read_cond_lit [:seq [:repeat :_metadata_lit]
                       [:field "marker" "#?"]
                       [:repeat :_ws]
                       :_bare_list_lit]

  :splicing_read_cond_lit [:seq [:repeat :_metadata_lit]
                                [:field "marker" "#?@"]
                                [:repeat :_ws]
                                :_bare_list_lit]

  :auto_res_mark :AUTO_RESOLVE_MARK

  :ns_map_lit [:seq [:repeat :_metadata_lit]
                    [:field "marker" "#"]
                    [:field "prefix" [:choice :auto_res_mark
                                              :kwd_lit]]
                    [:repeat :_gap]
                    :_bare_map_lit]

  :var_quoting_lit [:seq [:repeat :_metadata_lit]
                         [:field "marker" "#'"]
                         [:repeat :_gap]
                         [:field "value" :_form]]

  :sym_val_lit [:seq [:field "marker" "##"]
                     [:repeat :_gap]
                     [:field "value" :sym_lit]]

  :evaling_lit [:seq [:repeat :_metadata_lit]
                     [:field "marker" "#="]
                     [:repeat :_gap]
                     [:field "value" [:choice :list_lit
                                              :read_cond_lit
                                              :sym_lit]]]

  :tagged_or_ctor_lit [:seq [:repeat :_metadata_lit]
                            [:field "marker" "#"]
                            [:repeat :_gap]
                            [:field "tag" :sym_lit]
                            [:repeat :_gap]
                            [:field "value" :_form]]

  :derefing_lit [:seq [:repeat :_metadata_lit]
                      [:field "marker" "@"]
                      [:repeat :_gap]
                      [:field "value" :_form]]

  :quoting_lit [:seq [:repeat :_metadata_lit]
                     [:field "marker" "'"]
                     [:repeat :_gap]
                     [:field "value" :_form]]

  :syn_quoting_lit [:seq [:repeat :_metadata_lit]
                         [:field "marker" "`"]
                         [:repeat :_gap]
                         [:field "value" :_form]]

  :unquote_splicing_lit [:seq [:repeat :_metadata_lit]
                              [:field "marker" "~@"]
                              [:repeat :_gap]
                              [:field "value" :_form]]

  :unquoting_lit [:seq [:repeat :_metadata_lit]
                       [:field "marker" "~"]
                       [:repeat :_gap]
                       [:field "value" :_form]]

  }}
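
For a feel of what translating the vector forms above might involve, here is a rough JavaScript sketch. It assumes the .edn has already been read into nested arrays, with keywords represented by a made-up kw() wrapper; toNode and the tokens map are likewise hypothetical names. The node shapes follow what grammar.json files typically contain, but they should be verified against real tree-sitter output before relying on them.

```javascript
// Stand-in for an EDN keyword such as :_form (hypothetical representation).
const kw = (name) => ({ kw: name });

// Translate one vector-style form into a grammar.json-style node.
// `tokens` holds lower-level definitions that are inlined when referenced.
function toNode(form, tokens = {}) {
  const go = (f) => toNode(f, tokens);
  if (typeof form === 'string') return { type: 'STRING', value: form };
  if (!Array.isArray(form)) {
    return Object.prototype.hasOwnProperty.call(tokens, form.kw)
      ? go(tokens[form.kw])
      : { type: 'SYMBOL', name: form.kw };
  }
  const [head, ...args] = form;
  switch (head.kw) {
    case 'seq': return { type: 'SEQ', members: args.map(go) };
    case 'choice': return { type: 'CHOICE', members: args.map(go) };
    case 'repeat': return { type: 'REPEAT', content: go(args[0]) };
    case 'repeat1': return { type: 'REPEAT1', content: go(args[0]) };
    case 'optional':
      return { type: 'CHOICE', members: [go(args[0]), { type: 'BLANK' }] };
    case 'token': return { type: 'TOKEN', content: go(args[0]) };
    case 'regex': return { type: 'PATTERN', value: args.join('') };
    case 'field': return { type: 'FIELD', name: args[0], content: go(args[1]) };
    case 'prec': return { type: 'PREC', value: args[0], content: go(args[1]) };
    case 'alias':
      return { type: 'ALIAS', content: go(args[0]), named: true, value: args[1].kw };
    default: throw new Error(`unknown form: ${head.kw}`);
  }
}

// Two small pieces taken from the sample above.
const tokens = {
  COMMENT: [kw('token'), [kw('regex'), '(;|#!)', '.*', '\\n?']],
};
const rules = {
  comment: kw('COMMENT'),
  dis_expr: [kw('seq'), [kw('field'), 'marker', '#_'],
                        [kw('repeat'), kw('_gap')],
                        [kw('field'), 'value', kw('_form')]],
};

const out = {};
for (const [name, form] of Object.entries(rules)) {
  out[name] = toNode(form, tokens);
}
console.log(JSON.stringify(out, null, 2));
```

The multi-string [:regex ...] form is handled by joining the fragments, which is one way the line-per-fragment layout in the .edn could be kept without affecting the resulting pattern.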

@NoahTheDuke

NoahTheDuke commented Feb 11, 2023

I love edn and do a similar edn-to-json translation for https://github.com/mtgred/netrunner. However, isn't this just trading one dev tool for another? Instead of node, they now need babashka or jet or what have you. Unless you write it to be usable with clojure alone, which is probably a safe bet to assume for someone working on a clojure grammar 😉

@sogaiu
Owner Author

sogaiu commented Feb 11, 2023

Thanks for that link -- I will check it out. Ah, would you mind providing a hint or two about where in the codebase:

a similar edn-to-json translation

might be? I did some searching for json and edn and at first glance it seems there isn't a lot of overlap of files that contain both.

However, isn't this just trading one dev tool for another?

Yes, I think there could be a trade occurring here -- though depending on how it's done, it might be that there are just more options (e.g. generating a readable grammar.js from grammar.edn might be possible too).

Also, it seems to me that sometimes trades / swaps are worth it. It's not clear yet whether that would be the case here, but if there are issues with a setup that involves Clojure, edn, babashka, etc., how might the chances of borkdude (or other Clojure folks) helping compare to the chances of Node.js (or other non-Clojure-using) folks helping? I have no magical crystal ball but I have my experience to look back at :)

Another consideration one might have is whether Clojure-using folks who might have some interest in contributing would prefer to work in .edn over .json -- at least at first.

Still investigating what the cost of this kind of approach could be though.

@dannyfreeman
Collaborator

Thinking about potential contributions, I would think that whoever wants to contribute, it will be easier to convince a clojure dev to work with javascript tooling than it will be to convince a js developer to work with edn/clj tooling. Many clojure devs probably are working in js ecosystem to some extent already.

@dannyfreeman
Collaborator

dannyfreeman commented Feb 13, 2023

I certainly don't mind working with EDN. Instead of writing a custom tool we could also consider compiling a grammar.cljs to grammar.js

For example:

(js/grammar (clj->js {:name "clojure" ...}))

then use whatever the right cljs configuration magic is needed to set module.exports

Edit:
Using shadow-cljs this is possible. It would be kind of nasty and probably require that we check in our compiled grammar.js output to source control so that people without a JVM installed would still be able to generate the C code. Next step is a quine I think

@NoahTheDuke

Once you're using shadow-cljs, you might as well not change to anything else, as shadow-cljs requires having node installed lol.

After all of this back and forth, I think the grammar.js DSL is perfectly fine. The formatting might be a little wonky but that's most codebases (just look at Clojure's Java 👁️ 👄 👁️ ). It matches what other Tree-Sitter grammars are written in, it's mostly well-documented, and it currently works without anything extra.

@sogaiu
Owner Author

sogaiu commented Feb 14, 2023

I appreciate the work that thheller has done with shadow-cljs and I used it for a number of projects, but my recollection is that I needed to update frequently. Do you know if that has changed?

@NoahTheDuke

I don't, I haven't used it much.

@sogaiu
Owner Author

sogaiu commented Feb 14, 2023

Thanks -- maybe I'll try to catch up on the state of things.

@sogaiu
Owner Author

sogaiu commented Feb 14, 2023

It doesn't look that different to me when looking at: https://github.com/thheller/shadow-cljs/commits/master

Search for "bump".

@sogaiu
Owner Author

sogaiu commented Feb 15, 2023

Thinking about potential contributions, I would think that whoever wants to contribute, it will be easier to convince a clojure dev to work with javascript tooling than it will be to convince a js developer to work with edn/clj tooling. Many clojure devs probably are working in js ecosystem to some extent already.

Yes, and still easier would be to have mostly Clojure-ish tooling for Clojure devs I would think :)

When working with Clojure-ish hosted languages, I think it's not uncommon for devs to try to avoid doing things in the underlying language for the most part, but using what's beneath when necessary (go interop). I'm thinking of the grammar.edn exploration as a bit along those lines.

Incidentally, IIUC, today marks the 3rd year since the first meaningful commit of this project (yay!).

Looking back, I think the reality of the situation for a tree-sitter grammar is that there are bits that change / don't work which are outside of the grammar writer's / maintainer's control [1]. Worse, some of those bits change unexpectedly or in hard to predict ways and broken things don't necessarily get fixed for a long time.

I think sometimes it's worth creating or adopting code to take over some of those bits to reduce churn / breakage (current and potential). If there is such code that we can create / control / maintain, perhaps we'd prefer using certain sorts of languages / tooling.


[1] My sense is that a typical path for a grammar repository is to end up using:

  • npm (not really necessary IMO unless Node.js bindings are needed),
  • node (only needed for creating grammar.json),
  • Emscripten (.wasm and playground),
  • the tree-sitter cli (critically needed for the generate subcommand while other subcommands can be duplicated via other means or are not really used), and
  • C compiler and other typical dev bits

Except for the last set of items, these all have had various problems (at least from my perspective). The first 3 I think most of us have little chance of influencing. With the tree-sitter cli we have some chance, though even if we do, when such changes make it into a released version has been fairly unpredictable. (If there is an unreleased fix for the tree-sitter cli and we want to use it, this likely means using Rust tooling.)

@sogaiu
Owner Author

sogaiu commented Feb 16, 2023

Ok, I managed to make something [1] that takes a grammar.edn and produces a grammar.json.

The grammar.json file was slightly different from the one generated via tree-sitter generate with grammar.js (apart from not really having much indenting), mainly because I didn't include some extra top-level keys if there was nothing specified for them (e.g. precedences, externals, and supertypes). Maybe I should match the behavior. (Should be more or less matched now.)
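As a rough illustration of matching that behavior, here is a minimal sketch that fills in the three top-level keys named above when they are absent. The function name `withDefaultKeys` is hypothetical, and using an empty array as the "nothing specified" value is an assumption for illustration, not something confirmed by this thread.

```javascript
// Top-level keys named in the comment above. Treating an empty array as the
// "nothing specified" value is an assumption made for this sketch.
const OPTIONAL_LIST_KEYS = ["precedences", "externals", "supertypes"];

// Return a copy of the grammar description with any missing optional
// keys added, leaving keys that were explicitly specified untouched.
function withDefaultKeys(grammar) {
  const result = { ...grammar };
  for (const key of OPTIONAL_LIST_KEYS) {
    if (!(key in result)) {
      result[key] = [];
    }
  }
  return result;
}

// Hypothetical minimal grammar description with none of the optional keys.
const minimal = { name: "example", rules: { source: { type: "BLANK" } } };
const filled = withDefaultKeys(minimal);
```

Because the copy is shallow and only missing keys are added, a grammar that already specifies, say, externals passes through unchanged.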

The generated parser.c is the same.

The main thing I didn't account for initially was that order within rules matters (see point 7 here). I blindly "translated" the JavaScript DSL's objects to maps when it might have been better to go for vectors. I've since gone back and fixed this.
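A minimal sketch of the ordering point, assuming the rules are read from a vector-like list of name/definition pairs rather than from an unordered map. The `rulePairs` data and the `rulesFromPairs` helper are hypothetical; the sketch relies only on the fact that JavaScript objects keep insertion order for non-integer string keys and that `JSON.stringify` serializes keys in that order.

```javascript
// Hypothetical input: rules as an ordered list of [name, definition] pairs,
// the shape a vector-based grammar description might be read into.
const rulePairs = [
  ["source", { type: "REPEAT", content: { type: "SYMBOL", name: "_form" } }],
  ["_form", { type: "SYMBOL", name: "sym_lit" }],
  ["sym_lit", { type: "PATTERN", value: "[a-z]+" }],
];

// Build the "rules" object, inserting keys in list order so that the
// serialized JSON lists the rules in the same order as the input.
function rulesFromPairs(pairs) {
  const rules = {};
  for (const [name, definition] of pairs) {
    if (Object.prototype.hasOwnProperty.call(rules, name)) {
      throw new Error(`duplicate rule: ${name}`);
    }
    rules[name] = definition;
  }
  return rules;
}

const grammar = { name: "example", rules: rulesFromPairs(rulePairs) };
const json = JSON.stringify(grammar);
```

Going through an unordered map in between would lose this guarantee, which is why a vector of pairs is the safer intermediate representation.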

I didn't write it in Clojure, but I think it might not be too bad.


[1] The overall flow looks like this.

@sogaiu
Owner Author

sogaiu commented Feb 24, 2023

I'm considering an approach where it's possible to add developer-related items (e.g. using babashka tasks) to the repository and leave (some of?) the existing bits in place perhaps after some "cleaning". The idea is mostly leaning in the direction of "co-existence" of methods rather than outright immediate supplantation.

This will allow experimentation, which, if it works out, may lead to retirement of some of the remaining older bits.

I've been working on and trying something similar in tree-sitter-janet-simple. There, I made the tooling in Janet and it's been working pretty nicely so far [1].

I think a similar thing may be doable with babashka and its task feature. I thought that the mentioned talk was pretty good at motivating why one might want to consider the tasks feature (starting around 2:21).

I think if this approach works out, it might help folks with a Clojure leaning get involved -- which I believe might help with the upkeep / maintenance of the project.

Also, I think it's more fun and I would guess others might find that to be the case too.


[1] Some things I've implemented include:

  • alternative testing method (described here)
  • grammar.jdn used to generate grammar.js / grammar.json
  • tree-sitter cli building so unreleased versions (with fixes / new features) can be used

@sogaiu
Owner Author

sogaiu commented May 8, 2023

Perhaps we've settled on a decent compromise from the perspective of readability. At least I think it's much easier to understand than what I've seen elsewhere :)

I'm going to close this for the moment. It can be reopened later if necessary.

@sogaiu sogaiu closed this as completed May 8, 2023