Syntax resources

the CCW ANTLR grammar
the Clojure implementation

Structural parsing

Tokens

Definitions:

macro character: one of ";@^`~()[]{}\'%#
terminating macro character: one of ";@^`~()[]{}\

Comment

open: /(;|#!)/
value: /[^\n\r\f]*/

Whitespace

value: /[ \t,\n\r\f]+/

Whitespace also includes every other character that Java's Character.isWhitespace considers to be whitespace. See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(int).
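As a rough illustration of the Comment and Whitespace definitions above, here is a minimal Python sketch that skips over both. The name `skip_trivia` is hypothetical, and the sketch only covers the explicit character set listed here, not the extra characters accepted by Java's Character.isWhitespace.

```python
import re

WHITESPACE = re.compile(r"[ \t,\n\r\f]+")     # note: the comma is whitespace
COMMENT = re.compile(r"(?:;|#!)[^\n\r\f]*")   # runs up to the end of the line


def skip_trivia(text: str, pos: int = 0) -> int:
    """Return the index of the first character at or after pos that is
    neither whitespace nor part of a comment."""
    while True:
        m = WHITESPACE.match(text, pos) or COMMENT.match(text, pos)
        if m is None:
            return pos
        pos = m.end()
```

For example, `skip_trivia("  ; hi\n,x")` lands on the index of `x`.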

Number

Basically, if a token starts with a digit, or with + or - followed by a digit, it's a number.

See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isDigit(int) for what is considered to be a digit.

sign: /[-+]?/
first: /\d/
rest: (not1 ( whitespace | macro ) )(*)

Ident

first: (not1 ( whitespace | macro ) ) | '%'
rest: (not1 ( whitespace | terminatingMacro ))(*)

Why does this include %...?
Because: outside of a #() function, %... is just a normal ident.
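The Ident rule above can be sketched as a single regular expression. This is an approximation in Python, not the reader's actual code: `read_ident` is a hypothetical name, and only the explicitly listed whitespace characters are handled.

```python
import re

WS = " \t,\n\r\f"
TERMINATING = "\";@^`~()[]{}\\"   # terminating macro characters
MACRO = TERMINATING + "'%#"       # all macro characters

# first: anything that is not whitespace or a macro character, or a literal %
first = "(?:[^" + re.escape(WS + MACRO) + "]|%)"
# rest: anything that is not whitespace or a terminating macro character
rest = "[^" + re.escape(WS + TERMINATING) + "]*"
IDENT_TOKEN = re.compile(first + rest)


def read_ident(text, pos=0):
    """Return the ident token starting at pos, or None if there is none."""
    m = IDENT_TOKEN.match(text, pos)
    return m.group() if m else None
```

Note that `'`, `%` and `#` may appear inside an ident, since they are macro characters but not terminating ones.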

Character

open: \\
first: .
rest: (not1 ( whitespace | terminatingMacro ) )(*)

String

open: "
body: /([^\\"]|\\.)*/ -- . includes newlines
close: "

This is only approximately correct. How could it go wrong?

Regex

open: #"
body: /([^\\"]|\\.)*/ -- . includes newlines
close: "
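String and Regex tokens share the same body pattern, so a structural matcher for both can be sketched as below. This is a Python approximation with hypothetical names; it only finds the closing quote and does not validate escapes (a body such as `\q` is accepted here and would be rejected later by the string token parser).

```python
import re

# Body: any run of non-backslash, non-quote characters or backslash pairs.
BODY = r'((?:[^\\"]|\\.)*)'
# DOTALL makes "." match newlines, as the spec requires.
STRING_TOKEN = re.compile('"' + BODY + '"', re.DOTALL)
REGEX_TOKEN = re.compile('#"' + BODY + '"', re.DOTALL)


def read_delimited(pattern, text, pos=0):
    """Return the raw body between the delimiters, or None if unterminated."""
    m = pattern.match(text, pos)
    return m.group(1) if m else None
```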

Punctuation

(
)
[
]
{
}
@
^
'
`
~@
~
#-dispatches
- #(
- #{
- #^
- #'
- #=
- #_
- #< -- ??? unreadable reader ???
- error: # followed by anything else (except for #! and #")
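One way to recognize the punctuation above is to try the two-character tokens before the one-character ones, so that `~@` is not read as `~` followed by `@`. The following Python sketch assumes that approach; `read_punctuation` is a hypothetical name.

```python
# Two-character tokens come first so that the longest match wins.
PUNCTUATION = (
    "#(", "#{", "#^", "#'", "#=", "#_", "#<", "~@",
    "(", ")", "[", "]", "{", "}", "@", "^", "'", "`", "~",
)


def read_punctuation(text, pos=0):
    """Return the punctuation token at pos, or None if there is none.

    Raises ValueError for a # that starts neither a known dispatch,
    a #! comment, nor a #" regex.
    """
    for token in PUNCTUATION:
        if text.startswith(token, pos):
            return token
    if text.startswith("#", pos) and not text.startswith(("#!", '#"'), pos):
        raise ValueError("unknown dispatch: " + text[pos:pos + 2])
    return None
```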

Hierarchical forms

Whitespace, comments, and discard forms (#_) may appear any number of times between tokens.

Discard

open: #_
value: Form

List

open: (
body: Form(*)
close: )

Vector

open: [
body: Form(*)
close: ]

Table

open: {
body: Form(*)
close: }

Quote

open: '
value: Form

Deref

open: @
value: Form

Unquote

open: ~
value: Form

Unquote splicing

open: ~@
value: Form

Syntax quote

open: `
value: Form

Function

open: #(
body: Form(*)
close: )

Set

open: #{
body: Form(*)
close: }

Eval

open: #=
value: Form

Var

open: #'
value: Form

Unreadable

open: #<
value: ??????????

Other dispatch

open: /#./
value: ???????????

Form

 String  |  Number  |  Char  |  Ident   |  Regex     |
 List    |  Vector  |  Set   |  Table   |  Function  |
 Deref   |  Quote   |  Unquote  |  UnquoteSplicing   |
 SyntaxQuote  |  Meta  |  Eval  |  Var

The order in which the alternatives are tried is important in some cases, since a given input might match multiple patterns:

Number before Ident
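To see why the order matters: a token such as `+1` satisfies both the Number and the Ident token patterns. The Python sketch below (hypothetical names, explicit whitespace set only) tries Number first, so `+1` becomes a number while `+` alone is still an ident.

```python
import re

WS = " \t,\n\r\f"
TERMINATING = "\";@^`~()[]{}\\"
MACRO = TERMINATING + "'%#"

NUMBER = re.compile(r"[-+]?\d[^" + re.escape(WS + MACRO) + "]*")
IDENT = re.compile(
    "(?:[^" + re.escape(WS + MACRO) + "]|%)"
    + "[^" + re.escape(WS + TERMINATING) + "]*"
)


def read_atom(text, pos=0):
    """Try Number before Ident; return (kind, token) or None."""
    for kind, pattern in (("number", NUMBER), ("ident", IDENT)):
        m = pattern.match(text, pos)
        if m:
            return kind, m.group()
    return None
```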

Clojure

Form(*)
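The hierarchical forms can be parsed with a small recursive descent over a token list. The sketch below is in Python and makes several assumptions: tokens are already split into strings, whitespace and comments are already removed, every non-punctuation token is treated as an opaque atom, and Meta and the unreadable/other dispatches are left out. All names are hypothetical.

```python
CLOSERS = {"(": ")", "[": "]", "{": "}", "#(": ")", "#{": "}"}
NAMES = {"(": "list", "[": "vector", "{": "table",
         "#(": "function", "#{": "set"}
PREFIXES = {"'": "quote", "@": "deref", "~": "unquote",
            "~@": "unquote-splicing", "`": "syntax-quote",
            "#=": "eval", "#'": "var", "#_": "discard"}


def parse_form(tokens, i):
    """Parse one Form starting at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[i]
    if tok in CLOSERS:
        body, i = parse_forms(tokens, i + 1, CLOSERS[tok])
        return (NAMES[tok], body), i
    if tok in PREFIXES:
        value, i = parse_form(tokens, i + 1)
        while value[0] == "discard":   # skip discards before the real value
            value, i = parse_form(tokens, i)
        return (PREFIXES[tok], value), i
    if tok in (")", "]", "}"):
        raise ValueError("unmatched " + tok)
    return ("atom", tok), i + 1


def parse_forms(tokens, i, closer):
    """Parse Form(*) until `closer` (or end of input when closer is None)."""
    forms = []
    while True:
        if i >= len(tokens):
            if closer is None:
                return forms, i
            raise ValueError("missing " + closer)
        if tokens[i] == closer:
            return forms, i + 1
        form, i = parse_form(tokens, i)
        if form[0] != "discard":       # discard forms leave no trace
            forms.append(form)


def parse_clojure(tokens):
    """Top level: Clojure = Form(*)."""
    forms, _ = parse_forms(tokens, 0, None)
    return forms
```

For example, `parse_clojure(["(", "a", "[", "b", "#_", "c", "]", ")"])` yields one list containing the atom `a` and a vector holding only `b`.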

Token parsers

Goal of this phase: determine the internal structure of the number, ident, char, string, and regex tokens.

String

Syntax

escape
- open: \
- error: next char matches /[^btnfr\\"0-7u]/
- value
  - simple
    - /[btnfr\\"]/
  - octal
    - /[0-7]{1,3}/
    - stops when: 3 octal characters parsed, or whitespace hit, or macro character hit
    - error: digit is 8 or 9
    - error: hasn't finished, but encounters character which is not whitespace, octal, or macro
  - unicode
    - /u[0-9a-fA-F]{4}/
    - error: fewer than four hex characters found
/[^\\"]/: plain character (not escaped)
- what about ?? unprintable chars (actual newline, etc.) ??
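The escape rules above can be sketched as a small decoder. This Python version is an approximation under stated assumptions: it accepts only ASCII digits (the real reader relies on Java's Character.digit), it applies no range checks beyond those listed, and the names `read_digits` and `unescape` are hypothetical.

```python
WS = " \t,\n\r\f"
MACRO = "\";@^`~()[]{}\\'%#"
SIMPLE = {"b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r",
          "\\": "\\", '"': '"'}


def read_digits(body, i, base, max_len, exact):
    """Read up to max_len digits of the given base starting at body[i]."""
    alphabet = "0123456789abcdef"[:base]
    value = count = 0
    while count < max_len and i < len(body):
        ch = body[i]
        if count > 0 and ch in WS + MACRO:
            break                      # whitespace or macro ends the escape
        digit = alphabet.find(ch.lower())
        if digit == -1:
            raise ValueError(f"invalid digit: {ch!r}")
        value = value * base + digit
        i += 1
        count += 1
    if count == 0 or (exact and count != max_len):
        raise ValueError("too few digits in escape")
    return chr(value), i


def unescape(body):
    """Decode the raw body of a string token (the text between the quotes)."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)             # plain character
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        nxt = body[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt == "u":
            decoded, i = read_digits(body, i + 2, 16, 4, True)
            out.append(decoded)
        elif nxt in "01234567":
            decoded, i = read_digits(body, i + 1, 8, 3, False)
            out.append(decoded)
        else:
            raise ValueError("unsupported escape: \\" + nxt)
    return "".join(out)
```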

Notes

macro and whitespace characters have special meaning inside strings: they terminate octal and unicode escape sequences

octal and unicode escapes use Java's Character.digit and Character.isDigit, so they also seem to accept other forms of digits, such as U+FF13 (fullwidth digit three)

   "\uＡＢＣＤ" is the 1-character string "ꯍ"
   // because each of ＡＢＣＤ is a hex digit according to Character.digit(ch, 16)

Regex

Syntax

real escape: /\\[\\"]/
fake escape: /\\[^\\"]/ (so called because both characters are included in the output)

Notes

Number

Syntax

ratio
- sign: /[-+]?/
- numerator: /[0-9]+/
- slash: /
- denominator: /[0-9]+/
float
- sign: /[-+]?/
- int: /[0-9]+/
- decimal (optional)
  - dot: .
  - int: /[0-9]*/
- exponent (optional)
  - e: /[eE]/
  - sign: /[+-]?/
  - power: /[0-9]+/
- suffix
  - /M?/
integer
- sign: /[+-]?/
- body
  - base16
    - /0[xX]hex+/
    - where hex is /[0-9a-fA-F]/
  - base8 (not sure about this)
    - /0[0-7]+/
    - error: 08
  - base(2-36)
    - /[1-9][0-9]?[rR][0-9a-zA-Z]+/
  - base10
    - /0|[1-9][0-9]*/
- bigint suffix: /N?/

Notes

Apparently, the bigint suffix can't be applied to base(2-36) integers, presumably because N is itself matched as one of the digit characters there.
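The three number shapes above can be tried in sequence, as in the following Python sketch. It makes a few assumptions: integer is tried first (otherwise a plain `42` would also match the float pattern), only ASCII digits are accepted, a leading zero followed by a non-octal digit such as `08` is reported as an error, and `parse_number` is a hypothetical name.

```python
import re
from decimal import Decimal
from fractions import Fraction

INTEGER = re.compile(
    r"(?P<sign>[-+]?)"
    r"(?:0[xX](?P<hex>[0-9a-fA-F]+)"
    r"|0(?P<oct>[0-7]+)"
    r"|(?P<radix>[1-9][0-9]?)[rR](?P<digits>[0-9a-zA-Z]+)"
    r"|(?P<dec>0|[1-9][0-9]*)"
    r"|(?P<bad>0[0-9]+))"     # e.g. 08: looks octal but is not
    r"N?"
)
FLOAT = re.compile(
    r"(?P<num>[-+]?[0-9]+(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?)(?P<big>M?)"
)
RATIO = re.compile(r"(?P<num>[-+]?[0-9]+)/(?P<den>[0-9]+)")


def parse_number(token):
    """Return (kind, value) for a number token, or raise ValueError."""
    m = INTEGER.fullmatch(token)
    if m:
        if m["bad"]:
            raise ValueError("invalid number: " + token)
        sign = -1 if m["sign"] == "-" else 1
        if m["hex"]:
            value = int(m["hex"], 16)
        elif m["oct"]:
            value = int(m["oct"], 8)
        elif m["radix"]:
            # int() raises ValueError for a bad base or a bad digit,
            # which also covers a trailing N such as 2r101N.
            value = int(m["digits"], int(m["radix"]))
        else:
            value = int(m["dec"])
        return "integer", sign * value
    m = FLOAT.fullmatch(token)
    if m:
        value = Decimal(m["num"]) if m["big"] else float(m["num"])
        return "float", value
    m = RATIO.fullmatch(token)
    if m:
        return "ratio", Fraction(int(m["num"]), int(m["den"]))
    raise ValueError("invalid number: " + token)
```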

Char

open: \
value
- long escape
  - newline
  - space
  - tab
  - backspace
  - formfeed
  - return
- unicode escape -- not identical to string's unicode escape
  - uXXXX where X is a hex character
  - hex characters defined by Java's Character.digit(<some_int>, 16)
    - includes some surprises!
- octal escape
  - oX, oXX, or oXXX where X is an octal character
  - octal characters defined by Java's Character.digit(<some_int>, 8)
    - includes surprises!
- simple character (not escaped)
  - any character, including n, u, \, an actual tab, space, newline
  - what about unprintable characters?
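The Char cases above can be sketched as a lookup on the text that follows the opening backslash. This Python version is a simplification: it accepts only ASCII hex and octal digits (so none of the Character.digit surprises), performs no range checks, and `parse_char` is a hypothetical name.

```python
import re

NAMED = {"newline": "\n", "space": " ", "tab": "\t",
         "backspace": "\b", "formfeed": "\f", "return": "\r"}


def parse_char(token):
    """Decode a char token; `token` is the text after the opening backslash."""
    if not token:
        raise ValueError("empty character token")
    if len(token) == 1:
        return token                   # simple character, including u, o, n
    if token in NAMED:
        return NAMED[token]            # long escape
    if token.startswith("u") and re.fullmatch(r"[0-9a-fA-F]{4}", token[1:]):
        return chr(int(token[1:], 16))
    if token.startswith("o") and re.fullmatch(r"[0-7]{1,3}", token[1:]):
        return chr(int(token[1:], 8))
    raise ValueError("unsupported character: \\" + token)
```

A single-character token is checked first, which is why `\u`, `\o` and `\n` are plain characters rather than incomplete escapes.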

Ident

Syntax

special errors
- :: anywhere but at the beginning
- if it matches /([:]?)([^\d/].*/)?(/|[^\d/][^/]*)/, and:
  - $2 =~ /:\/$/ -> error
  - $3 =~ /:$/ -> error
value
- reserved
  - nil
  - true
  - false
- not reserved
  - type: starts with:
    - :: -- auto keyword
    - : -- keyword
    - else -- symbol
  - namespace (optional)
    - /[^/]+/
    - /
  - name
    - /.+/
Code used to verify against the implementation:

(fn [my-string] (let [f (juxt type namespace name)] (try (f (eval (read-string my-string))) (catch RuntimeException e (.getMessage e)))))
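The Ident rules above can also be sketched outside Clojure. The Python version below is an approximation: it reuses the regular expression from the special-errors rule, splits namespace from name at the first slash, treats a lone `/` as a name, and returns a (type, namespace, name) triple loosely mirroring the juxt call above. `parse_ident` is a hypothetical name, and Python's `\d` is only assumed to be close enough to the reader's notion of a digit.

```python
import re

IDENT = re.compile(r"(:?)([^\d/].*/)?(/|[^\d/][^/]*)", re.DOTALL)
RESERVED = ("nil", "true", "false")


def parse_ident(token):
    """Return (type, namespace, name) for an ident token, or raise ValueError."""
    m = IDENT.fullmatch(token)
    if m is None:
        raise ValueError("invalid token: " + token)
    if token.find("::", 1) != -1:      # :: anywhere but at the beginning
        raise ValueError("invalid token: " + token)
    if (m.group(2) and m.group(2).endswith(":/")) or m.group(3).endswith(":"):
        raise ValueError("invalid token: " + token)
    if token in RESERVED:
        return "reserved", None, token
    if token.startswith("::"):
        kind, rest = "auto-keyword", token[2:]
    elif token.startswith(":"):
        kind, rest = "keyword", token[1:]
    else:
        kind, rest = "symbol", token
    namespace, slash, name = rest.partition("/")
    if not slash or rest == "/":
        return kind, None, rest        # no namespace part
    return kind, namespace, name
```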
