
Syntax resources

Structural parsing

Tokens

Definitions:

  • macro character: one of ";@^`~()[]{}\'%#

  • terminating macro character: one of ";@^`~()[]{}\

Comment

  • open: /(;|#!)/
  • value: /[^\n\r\f]*/

Whitespace

  • value: /[ \t,\n\r\f]+/

In addition, any other character that Java's Character.isWhitespace treats as whitespace also counts. See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(int).

Number

Basically, a token is a number if it starts with a digit, or with + or - followed by a digit.

See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isDigit(int) for what is considered to be a digit.

  • sign: /[-+]?/
  • first: /\d/
  • rest: (not1 ( whitespace | macro ) )(*)

Ident

  • first: (not1 ( whitespace | macro ) ) | '%'
  • rest: (not1 ( whitespace | terminatingMacro ))(*)

Why does this include %...?
Because: outside of a #() function, %... is just a normal ident.
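The Ident rule can be sketched in Python roughly as follows. This is an illustration, not the reader's actual code: `scan_ident` and the set names are made up here, and the whitespace set is the ASCII-only class from the Whitespace section, without the Character.isWhitespace extension.

```python
# Macro characters as listed under Definitions.
MACRO = set('";@^`~()[]{}\\\'%#')
# Terminating macro characters: the same list without # ' %.
TERMINATING_MACRO = MACRO - set("#'%")
# ASCII-only approximation of the whitespace class (the comma is included).
WHITESPACE = set(' \t,\n\r\f')


def scan_ident(text, pos=0):
    """Return the end index of the ident starting at pos, or None."""
    if pos >= len(text):
        return None
    first = text[pos]
    # first: anything that is not whitespace or a macro character, or '%'
    if first != '%' and (first in WHITESPACE or first in MACRO):
        return None
    end = pos + 1
    # rest: runs until whitespace or a *terminating* macro character
    while (end < len(text)
           and text[end] not in WHITESPACE
           and text[end] not in TERMINATING_MACRO):
        end += 1
    return end
```

Note that `#`, `'` and `%` cannot start an ident (except `%`), but they can appear inside one, because only terminating macro characters end the rest part.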

Character

  • open: \\
  • first: .
  • rest: (not1 ( whitespace | terminatingMacro ) )(*)

String

  • open: "
  • body: /([^\\"]|\\.)*/ -- . includes newlines
  • close: "

This is only approximately correct. How could it go wrong?
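A minimal Python sketch of the structural String token, assuming the three-part rule above is applied as a single regular expression. `scan_string` is a hypothetical helper name; `re.DOTALL` is what lets `.` include newlines.

```python
import re

# open quote, body of (non-backslash, non-quote) characters or backslash pairs,
# close quote. re.DOTALL makes '.' match newlines, as the body rule requires.
STRING_TOKEN = re.compile(r'"((?:[^\\"]|\\.)*)"', re.DOTALL)


def scan_string(text, pos=0):
    """Return the end index of the string token starting at pos, or None."""
    m = STRING_TOKEN.match(text, pos)
    return None if m is None else m.end()
```

At this stage nothing checks whether the escapes inside the body are valid; that is left to the token parser phase.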

Regex

  • open: #"
  • body: /([^\\"]|\\.)*/ -- . includes newlines
  • close: "

Punctuation

  • (
  • )
  • [
  • ]
  • {
  • }
  • @
  • ^
  • '
  • `
  • ~@
  • ~
  • #-dispatches
    • #(
    • #{
    • #^
    • #'
    • #=
    • #_
    • #< -- ??? unreadable reader ???
    • error: # followed by anything else (except for #! and #")

Hierarchical forms

Whitespace, comments and discard forms (#_) can appear in any amount between tokens.

Discard

  • open: #_
  • value: Form

List

  • open: (
  • body: Form(*)
  • close: )

Vector

  • open: [
  • body: Form(*)
  • close: ]

Table

  • open: {
  • body: Form(*)
  • close: }

Quote

  • open: '
  • value: Form

Deref

  • open: @
  • value: Form

Unquote

  • open: ~
  • value: Form

Unquote splicing

  • open: ~@
  • value: Form

Syntax quote

  • open: `
  • value: Form

Function

  • open: #(
  • body: Form(*)
  • close: )

Set

  • open: #{
  • body: Form(*)
  • close: }

Meta

  • open: '^' | '#^'
  • metadata: Form
  • value: Form

Eval

  • open: #=
  • value: Form

Var

  • open: #'
  • value: Form

Unreadable

  • open: #<
  • value: ??????????

Other dispatch

  • open: /#./
  • value: ???????????

Form

 String  |  Number  |  Char  |  Ident   |  Regex     |
 List    |  Vector  |  Set   |  Table   |  Function  |
 Deref   |  Quote   |  Unquote  |  UnquoteSplicing   |
 SyntaxQuote  |  Meta  |  Eval  |  Var

The order in which the alternatives are tried matters in some cases, since a given input might match multiple patterns:

  • Number before Ident
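To see why the order matters, here is a rough Python sketch in which an input such as `+1` satisfies both the Number and the Ident token rules. All names (`scan_number`, `scan_ident`, `next_token`) are hypothetical, and whitespace is approximated by the ASCII-only class.

```python
MACRO = set('";@^`~()[]{}\\\'%#')
TERMINATING_MACRO = MACRO - set("#'%")
WHITESPACE = set(' \t,\n\r\f')  # ASCII-only approximation


def _run(text, start, stop_chars):
    """Advance from start until a character in stop_chars or the end of text."""
    end = start
    while end < len(text) and text[end] not in stop_chars:
        end += 1
    return end


def scan_number(text):
    """Optional sign, one digit, then everything up to whitespace or a macro char."""
    i = 1 if text[:1] in ('+', '-') else 0
    if not text[i:i + 1].isdecimal():
        return None
    return _run(text, i + 1, WHITESPACE | MACRO)


def scan_ident(text):
    """First char is '%' or not whitespace/macro; rest stops at terminating macros."""
    first = text[:1]
    if not first or (first != '%' and first in WHITESPACE | MACRO):
        return None
    return _run(text, 1, WHITESPACE | TERMINATING_MACRO)


def next_token(text):
    """Try Number before Ident; return (kind, lexeme) or None."""
    for kind, scan in (('number', scan_number), ('ident', scan_ident)):
        end = scan(text)
        if end is not None:
            return kind, text[:end]
    return None
```

With Ident tried first, `+1` and `1abc` would both come out as idents.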

Clojure

Form(*)
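A rough Python sketch of the hierarchical phase, assuming the input has already been split into a list of token strings with whitespace and comments removed. It only covers the delimited forms and the single-form prefixes; Meta, Discard, and the unreadable/other dispatches are left out, and every name in it is hypothetical.

```python
# Delimited forms: open token -> (kind, close token)
DELIMITED = {
    '(': ('list', ')'),
    '[': ('vector', ']'),
    '{': ('table', '}'),
    '#(': ('function', ')'),
    '#{': ('set', '}'),
}
# Prefix forms: open token followed by exactly one Form
PREFIX = {
    "'": 'quote', '@': 'deref', '~': 'unquote', '~@': 'unquote-splicing',
    '`': 'syntax-quote', '#=': 'eval', "#'": 'var',
}
CLOSE_TOKENS = {')', ']', '}'}


def read_form(tokens, i=0):
    """Read one Form starting at tokens[i]; return (form, next_index)."""
    tok = tokens[i]
    if tok in DELIMITED:
        kind, close = DELIMITED[tok]
        body = []
        i += 1
        while i < len(tokens) and tokens[i] != close:
            form, i = read_form(tokens, i)
            body.append(form)
        if i >= len(tokens):
            raise ValueError('unclosed ' + tok)
        return (kind, body), i + 1
    if tok in PREFIX:
        if i + 1 >= len(tokens):
            raise ValueError('missing form after ' + tok)
        value, j = read_form(tokens, i + 1)
        return (PREFIX[tok], value), j
    if tok in CLOSE_TOKENS:
        raise ValueError('unexpected ' + tok)
    return tok, i + 1  # atom: string, number, char, ident, regex


def read_all(tokens):
    """Top level: Form(*)."""
    forms, i = [], 0
    while i < len(tokens):
        form, i = read_form(tokens, i)
        forms.append(form)
    return forms
```

For example, `read_all(['(', 'f', '[', '1', '2', ']', ')'])` yields a single list form whose body holds the ident `f` and a vector form.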

Token parsers

Goal of this phase: determine the internal structure of the number, ident, char, string, and regex tokens

String

Syntax

  • escape

    • open: \

    • error: next char matches /[^btnfr\\"0-7u]/

    • value

      • simple

        • /[btnfr\\"]/
      • octal

        • /[0-7]{1,3}/
        • stops when: 3 octal characters parsed, or whitespace hit, or macro character hit
        • error: digit is 8 or 9
        • error: hasn't finished, but encounters character which is not whitespace, octal, or macro
      • unicode

        • /u[0-9a-fA-F]{4}/
        • error: less than four hex characters found
  • /[^\\"]/: plain character (not escaped)

    • what about ?? unprintable chars (actual newline, etc.) ??
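The escape rules above can be sketched in Python along these lines. Assumptions of the sketch: the input is the string body without the surrounding quotes, only ASCII digits are accepted (no Character.digit surprises), and `decode_string_body` is a made-up name.

```python
MACRO = set('";@^`~()[]{}\\\'%#')
WHITESPACE = set(' \t,\n\r\f')  # ASCII-only approximation
SIMPLE = {'b': '\b', 't': '\t', 'n': '\n', 'f': '\f', 'r': '\r',
          '\\': '\\', '"': '"'}
OCTAL = '01234567'
HEX = '0123456789abcdefABCDEF'


def decode_string_body(body):
    """Resolve escapes in a string body; raise ValueError on a bad escape."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != '\\':
            out.append(ch)  # plain character
            continue
        if i >= len(body):
            raise ValueError('dangling backslash')
        esc = body[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc == 'u':
            digits = body[i:i + 4]
            if len(digits) < 4 or any(d not in HEX for d in digits):
                raise ValueError('unicode escape needs four hex characters')
            out.append(chr(int(digits, 16)))
            i += 4
        elif esc in OCTAL:
            digits = esc
            # stop after 3 octal characters, or at whitespace / a macro character
            while len(digits) < 3 and i < len(body):
                nxt = body[i]
                if nxt in WHITESPACE or nxt in MACRO:
                    break
                if nxt not in OCTAL:
                    raise ValueError('invalid octal digit: ' + nxt)
                digits += nxt
                i += 1
            out.append(chr(int(digits, 8)))
        else:
            raise ValueError('unsupported escape: \\' + esc)
    return ''.join(out)
```

So `\7 x` decodes to the character with code 7 followed by ` x`, while `\18` is an error because the 8 arrives before the octal escape has finished.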

Notes

  • macro and whitespace characters have special meaning inside strings: they terminate octal and unicode escape sequences

  • octal and unicode escapes use Java's Character.digit and Character.isDigit, so they appear to accept other forms of digits as well, such as U+FF13 (fullwidth digit three)

       "\uABCD" is the 1 character string "ꯍ"
       // b/c each of ABCD is a digit according to Character.digit(ch, 16)
    

Regex

Syntax

  • real escape: /\\[\\"]/
  • fake escape: /\\[^\\"]/ so-called because both characters get included in output
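A small Python sketch that splits a regex body into the two escape kinds plus plain characters. It only labels the pieces and does not decide what each piece contributes to the output; `split_regex_body` and the labels are hypothetical names.

```python
import re

PART = re.compile(
    r'(?P<real>\\[\\"])'     # backslash followed by a backslash or a quote
    r'|(?P<fake>\\[^\\"])'   # backslash followed by anything else
    r'|(?P<plain>[^\\"])'    # any other single character
)


def split_regex_body(body):
    """Return a list of (label, text) pairs covering the whole body."""
    parts = []
    pos = 0
    while pos < len(body):
        m = PART.match(body, pos)
        if m is None:
            raise ValueError('unexpected character at index %d' % pos)
        parts.append((m.lastgroup, m.group()))
        pos = m.end()
    return parts
```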

Notes

Number

Syntax

  • ratio

    • sign: /[-+]?/
    • numerator: /[0-9]+/
    • slash: /
    • denominator: /[0-9]+/
  • float

    • sign: /[-+]?/
    • int: /[0-9]+/
    • decimal (optional)
      • dot: .
      • int: /[0-9]*/
    • exponent (optional)
      • e: /[eE]/
      • sign: /[+-]?/
      • power: /[0-9]+/
    • suffix
      • /M?/
  • integer

    • sign: /[+-]?/
    • body
      • base16
        • /0[xX]hex+/
        • where hex is /[0-9a-fA-F]/
      • base8 (not sure about this)
        • /0[0-7]+/
        • error: 08
      • base(2-36)
        • /[1-9][0-9]?[rR][0-9a-zA-Z]+/
      • base10
        • /[1-9][0-9]*/
    • bigint suffix: /N?/

Notes

  • apparently, can't apply bigint suffix to base(2-36)
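The three number shapes can be sketched as Python regular expressions. Choices made by this sketch, not stated in the list above: integer is tried first (a plain digit run such as 42 also satisfies the float pattern), a bare 0 is treated as a base-10 integer, hex digits are limited to 0-9 and a-f, and `classify_number` is a made-up name. It only classifies; it does not compute values or check that digits are valid for a given radix.

```python
import re

RATIO = re.compile(r'[-+]?[0-9]+/[0-9]+')
FLOAT = re.compile(r'[-+]?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?M?')
INTEGER = re.compile(
    r'[+-]?(?:'
    r'0[xX][0-9a-fA-F]+'             # base16
    r'|0[0-7]+'                      # base8
    r'|[1-9][0-9]?[rR][0-9a-zA-Z]+'  # base(2-36)
    r'|[1-9][0-9]*'                  # base10
    r'|0'                            # assumption: bare zero is an integer
    r')N?'
)
# A leading zero followed by more digits that is not valid octal, e.g. 08.
LEADING_ZERO = re.compile(r'[+-]?0[0-9]+N?')


def classify_number(token):
    """Return 'integer', 'ratio', 'float', or None if the token is not valid."""
    if INTEGER.fullmatch(token):
        return 'integer'
    if LEADING_ZERO.fullmatch(token):
        return None  # the "error: 08" case from the base8 rule
    if RATIO.fullmatch(token):
        return 'ratio'
    if FLOAT.fullmatch(token):
        return 'float'
    return None
```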

Char

  • open: \

  • value

    • long escape

      • newline
      • space
      • tab
      • backspace
      • formfeed
      • return
    • unicode escape -- not identical to string's unicode escape

      • uXXXX where X is a hex character
      • hex characters defined by Java's Character.digit(<some_int>, 16)
        • includes some surprises!
    • octal escape

      • oX, oXX, or oXXX where X is an octal character
      • octal characters defined by Java's Character.digit(<some_int>, 8)
        • includes surprises!
    • simple character (not escaped)

      • any character, including n, u, \, an actual tab, space, newline
      • what about unprintable characters?
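A rough Python sketch of the Char value rules, taking everything after the opening backslash as input. It assumes the unicode escape is written with a leading u (mirroring the leading o of the octal escape), accepts ASCII digits only, and does no range checks; `parse_char` is a hypothetical name.

```python
NAMED = {'newline': '\n', 'space': ' ', 'tab': '\t',
         'backspace': '\b', 'formfeed': '\f', 'return': '\r'}
HEX = set('0123456789abcdefABCDEF')
OCT = set('01234567')


def parse_char(token):
    """Return the character denoted by token, or raise ValueError."""
    if not token:
        raise ValueError('empty character token')
    if len(token) == 1:
        return token  # simple character, including n, u, o and backslash
    if token in NAMED:
        return NAMED[token]  # long escape
    if token.startswith('u') and len(token) == 5 and set(token[1:]) <= HEX:
        return chr(int(token[1:], 16))  # unicode escape
    if token.startswith('o') and 2 <= len(token) <= 4 and set(token[1:]) <= OCT:
        return chr(int(token[1:], 8))  # octal escape
    raise ValueError('unsupported character: \\' + token)
```

The single-character case has to come first, otherwise a lone `u` or `o` would be mistaken for the start of an escape.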

Ident

Syntax

  • special errors

    • :: anywhere but at the beginning
    • if it matches /([:]?)([^\d/].*/)?(/|[^\d/][^/]*)/, and:
      • $2 =~ /:\/$/ -> error
      • $3 =~ /:$/ -> error
  • value

    • reserved

      • nil
      • true
      • false
    • not reserved

      • type: starts with:

        • :: -- auto keyword
        • : -- keyword
        • else -- symbol
      • namespace (optional)

        • /[^/]+/
        • /
      • name

        • /.+/
  • code used to verify against implementation:

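    ;; Read and evaluate my-string, then report its type, namespace, and name;
    ;; on failure, return the exception message instead.
    (fn [my-string]
      (let [f (juxt type namespace name)]
        (try
          (f (eval (read-string my-string)))
          (catch RuntimeException e
            (.getMessage e)))))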
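The Ident rules above can also be sketched in Python. Assumptions of the sketch: a token that the pattern does not match at all is rejected, the namespace is split off at the first slash (per the /[^/]+/ rule), and `parse_ident` plus the returned labels are made-up names. It has not been checked against the implementation.

```python
import re

# The pattern from the "special errors" rule; groups 2 and 3 are $2 and $3.
IDENT = re.compile(r'([:]?)([^\d/].*/)?(/|[^\d/][^/]*)')
RESERVED = ('nil', 'true', 'false')


def parse_ident(token):
    """Return (type, namespace, name) for an ident token, or raise ValueError."""
    if token in RESERVED:
        return ('reserved', None, token)
    if '::' in token[1:]:
        raise ValueError(':: anywhere but at the beginning')
    m = IDENT.fullmatch(token)
    if m is None:
        # Assumption: tokens the pattern does not match are rejected outright.
        raise ValueError('invalid token: ' + token)
    if (m.group(2) or '').endswith(':/') or m.group(3).endswith(':'):
        raise ValueError('invalid token: ' + token)
    if token.startswith('::'):
        kind, rest = 'auto keyword', token[2:]
    elif token.startswith(':'):
        kind, rest = 'keyword', token[1:]
    else:
        kind, rest = 'symbol', token
    # namespace (optional): everything before the first slash
    namespace, slash, name = rest.partition('/')
    if not (slash and namespace and name):
        namespace, name = None, rest
    return (kind, namespace, name)
```

For instance, `parse_ident(':a/b')` gives `('keyword', 'a', 'b')`, and a lone `/` is a symbol with no namespace.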