Definitions:
-
macro character: one of
";@^`~()[]{}\'%#
-
terminating macro character: one of
";@^`~()[]{}\
- open:
/(;|#!)/
- value:
/[^\n\r\f]*/
- value:
/[ \t,\n\r\f]+/
Also, everything else that Java's Character.isWhitespace considers to be whitespace.
See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(int).
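The comment and whitespace tokens above can be sketched as two regexes plus a loop that skips both. This is a minimal Python illustration, not the reader's actual code: `skip_trivia` is a hypothetical helper name, and it uses only the explicit character class from these notes, since Python's own notion of whitespace is not the same as Java's `Character.isWhitespace`.

```python
import re

# Comment token: opener ";" or "#!", then everything up to (not including) a line break.
COMMENT = re.compile(r"(;|#!)[^\n\r\f]*")
# Whitespace token: the explicit class from the notes; note that commas count as whitespace.
# Assumption: the extra characters accepted by Java's Character.isWhitespace are ignored here.
WHITESPACE = re.compile(r"[ \t,\n\r\f]+")


def skip_trivia(text, pos=0):
    """Return the index of the first character that is neither whitespace nor comment."""
    while True:
        m = WHITESPACE.match(text, pos) or COMMENT.match(text, pos)
        if m is None:
            return pos
        pos = m.end()  # both patterns consume at least one character, so this terminates


text = "  , ; hi\n #! x\n(a)"
print(text[skip_trivia(text):])
```

Both patterns always consume at least one character when they match, so the loop cannot stall.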
Basically, if a token starts with a digit, or with + or - followed by a digit, it's a number.
See http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isDigit(int) for what is considered to be a digit.
- sign:
/[-+]?/
- first:
/\d/
- rest:
(not1 ( whitespace | macro ) )(*)
- first:
(not1 ( whitespace | macro ) ) | '%'
- rest:
(not1 ( whitespace | terminatingMacro ))(*)
Why does this include %...? Because outside of a #() function, %... is just a normal ident.
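The first-character rule that separates number tokens from ident tokens can be sketched as follows. This is a rough Python approximation under stated assumptions: `is_digit` and `token_kind` are hypothetical names, and Unicode general category `Nd` is used as a stand-in for Java's `Character.isDigit` (the two can disagree when the Unicode versions differ).

```python
import unicodedata


def is_digit(ch):
    # Stand-in for Java's Character.isDigit: Unicode general category Nd (decimal digit).
    return unicodedata.category(ch) == "Nd"


def token_kind(token):
    """Classify a non-empty token as 'number' or 'ident' by its first one or two characters."""
    if is_digit(token[0]):
        return "number"
    if token[0] in "+-" and len(token) > 1 and is_digit(token[1]):
        return "number"
    return "ident"


for tok in ["42", "-1", "+", "-foo", "%1"]:
    print(tok, token_kind(tok))
```

Checking the number rule first mirrors the ordering constraint in these notes: a token such as `-1` would otherwise also be accepted as an ident.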
- open:
\\
- first:
.
- rest:
(not1 ( whitespace | terminatingMacro ) )(*)
- open:
"
- body:
/([^\\"]|\\.)*/
-- . includes newlines
- close:
"
This is only approximately correct. How could it go wrong?
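As a quick check of the string token pattern, here is a small Python sketch. The names `STRING_TOKEN`, `REGEX_TOKEN` and `read_delimited` are made up for illustration; the same body pattern is reused for a token opened by `#"`. `re.DOTALL` is what makes `.` match newlines, as the body rule requires.

```python
import re

# open: "   body: ([^\\"]|\\.)*   close: "
STRING_TOKEN = re.compile(r'"((?:[^\\"]|\\.)*)"', re.DOTALL)
# Same body, but opened by #"
REGEX_TOKEN = re.compile(r'#"((?:[^\\"]|\\.)*)"', re.DOTALL)


def read_delimited(pattern, text, pos=0):
    """Return (raw body, index just past the closing quote), or raise if there is no match."""
    m = pattern.match(text, pos)
    if m is None:
        raise ValueError("unterminated or malformed token")
    return m.group(1), m.end()


body, end = read_delimited(STRING_TOKEN, '"a\\"b\nc" rest')
print(repr(body), end)
```

The body is returned raw: escapes are only recognized well enough to find the closing quote, and are not decoded at this stage.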
- open:
#"
- body:
/([^\\"]|\\.)*/
-- . includes newlines
- close:
"
(
)
[
]
{
}
@
^
'
`
~@
~
- #-dispatches
#(
#{
#^
#'
#=
#_
#<
-- ??? unreadable reader ???
- error:
# followed by anything else (except for #! and #")
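One way to match the punctuation and #-dispatch tokens listed above is a longest-match-first table. The following Python sketch is only an illustration of that idea; `PUNCTUATION` and `match_punctuation` are hypothetical names, and the error case follows the rule exactly as written in these notes.

```python
# Two-character tokens come first so that "~@" wins over "~" and "#(" is tried before "(".
PUNCTUATION = ["~@", "#(", "#{", "#^", "#'", "#=", "#_", "#<",
               "(", ")", "[", "]", "{", "}", "@", "^", "'", "`", "~"]


def match_punctuation(text, pos=0):
    """Return the punctuation token starting at pos, or None if there is none."""
    for tok in PUNCTUATION:
        if text.startswith(tok, pos):
            return tok
    # "#" followed by anything else is an error, except "#!" (comment) and '#"' (regex),
    # which are handled by other token rules.
    if text.startswith("#", pos) and not text.startswith(("#!", '#"'), pos):
        raise ValueError("unknown # dispatch at index %d" % pos)
    return None


print(match_punctuation("~@xs"), match_punctuation("~x"), match_punctuation("(#{1})", 1))
```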
Whitespace, comments and discard forms (#_) can appear in any amount between tokens.
- open:
#_
- value:
Form
- open:
(
- body:
Form(*)
- close:
)
- open:
[
- body:
Form(*)
- close:
]
- open:
{
- body:
Form(*)
- close:
}
- open:
'
- value:
Form
- open:
@
- value:
Form
- open:
~
- value:
Form
- open:
~@
- value:
Form
- open:
`
- value:
Form
- open:
#(
- body:
Form(*)
- close:
)
- open:
#{
- body:
Form(*)
- close:
}
- open:
'^' | '#^'
- metadata:
Form
- value:
Form
- open:
#=
- value:
Form
- open:
#'
- value:
Form
- open:
#<
- value: ??????????
- open:
/#./
- value: ???????????
String | Number | Char | Ident | Regex |
List | Vector | Set | Table | Function |
Deref | Quote | Unquote | UnquoteSplicing |
SyntaxQuote | Meta | Eval | Var
The order in which they are tried seems to matter in some cases, since a given input might match multiple patterns:
- Number before Ident
Form(*)
Goal of this phase: determine the internal structure of the number, ident, char, string, and regex tokens
Syntax
-
escape
-
open:
\
-
error: next char matches
/[^btnfr\\"0-7u]/
-
value
-
simple
/[btnfr\\"]/
-
octal
/[0-7]{1,3}/
- stops when: 3 octal characters parsed, or whitespace hit, or macro character hit
- error: digit is 8 or 9
- error: hasn't finished, but encounters character which is not whitespace, octal, or macro
-
unicode
/u[0-9a-fA-F]{4}/
- error: fewer than four hex characters found
-
/[^\\"]/ : plain character (not escaped)
- what about unprintable chars (actual newline, etc.)?
Notes
-
macro and whitespace characters have special meaning inside strings: they terminate octal and unicode escape sequences
-
octal and unicode escapes use Java's Character.digit and Character.isDigit, so they seem to work on other forms of digits, such as U+FF13
- "\uABCD" is the 1-character string "ꯍ", because each of A, B, C, D is a digit according to Character.digit(ch, 16)
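The string escape rules above can be sketched as a small decoder. This is a simplified Python illustration, not the reader's implementation: `decode_string_body` is a hypothetical name, only ASCII digits are accepted (so the non-ASCII digit forms mentioned in the notes are deliberately not reproduced), and no range check is applied to octal values.

```python
SIMPLE = {"b": "\b", "t": "\t", "n": "\n", "f": "\f", "r": "\r", "\\": "\\", '"': '"'}
WHITESPACE = " \t,\n\r\f"
MACRO = "\";@^`~()[]{}\\'%#"
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"  # assumption: ASCII hex digits only


def decode_string_body(body):
    """Decode the raw body of a string token (the text between the quotes)."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])  # plain character
            i += 1
            continue
        i += 1
        if i >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i]
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
            i += 1
        elif esc == "u":
            digits = body[i + 1:i + 5]
            if len(digits) < 4 or any(d not in HEX for d in digits):
                raise ValueError("unicode escape needs four hex characters")
            out.append(chr(int(digits, 16)))
            i += 5
        elif esc in OCTAL:
            j = i
            # Stop after 3 digits, or early at a whitespace or macro character.
            while j < len(body) and j - i < 3:
                d = body[j]
                if d in WHITESPACE or d in MACRO:
                    break
                if d not in OCTAL:
                    raise ValueError("invalid octal digit: " + d)
                j += 1
            out.append(chr(int(body[i:j], 8)))
            i = j
        else:
            raise ValueError("unsupported escape: \\" + esc)
    return "".join(out)


print(repr(decode_string_body("a\\tb")), repr(decode_string_body("\\101")))
```

Note how a short octal escape is only legal when a whitespace or macro character (or the end of the body) follows it; any other character in that position is reported as an error rather than treated as plain text.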
Syntax
- real escape:
/\\[\\"]/
- fake escape:
/\\[^\\"]/
so-called because both characters get included in output
Notes
Syntax
-
ratio
- sign:
/[-+]?/
- numerator:
/[0-9]+/
- slash:
/
- denominator:
/[0-9]+/
-
float
- sign:
/[-+]?/
- int:
/[0-9]+/
- decimal (optional)
- dot:
.
- int:
/[0-9]*/
- exponent (optional)
- e:
/[eE]/
- sign:
/[+-]?/
- power:
/[0-9]+/
- suffix
/M?/
-
integer
- sign:
/[+-]?/
- body
- base16
/0[xX]hex+/
- where hex is /[0-9a-fA-F]/
- base8 (not sure about this)
/0[0-7]+/
- error:
08
- base(2-36)
/[1-9][0-9]?[rR][0-9a-zA-Z]+/
- base10
/0|[1-9][0-9]*/
- bigint suffix:
/N?/
Notes
- apparently, can't apply bigint suffix to base(2-36)
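The three number shapes above can be tried out with a small Python sketch. It is an approximation under assumptions: `parse_number` and the pattern names are hypothetical, hex digits are taken to be `[0-9a-fA-F]`, a bare `0` is accepted as base 10, and the try order (integer, then float, then ratio) is a choice made for this sketch. The N and M suffixes are recognized but otherwise ignored.

```python
import re
from fractions import Fraction

INTEGER = re.compile(
    r"([-+]?)"
    r"(?:0[xX]([0-9a-fA-F]+)"                # base16
    r"|0([0-7]+)"                            # base8
    r"|([1-9][0-9]?)[rR]([0-9a-zA-Z]+)"      # base(2-36)
    r"|(0|[1-9][0-9]*)"                      # base10
    r"|(0[0-9]+))"                           # e.g. 08: looks octal but is not
    r"(N?)"
)
FLOAT = re.compile(r"([-+]?[0-9]+(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?)(M?)")
RATIO = re.compile(r"([-+]?)([0-9]+)/([0-9]+)")


def parse_number(token):
    """Return (kind, value) for a number token, or raise ValueError."""
    m = INTEGER.fullmatch(token)
    if m:
        sign, hex_digits, octal, radix, radix_digits, decimal, bad, _suffix = m.groups()
        if bad is not None:
            raise ValueError("invalid number: " + token)
        if hex_digits is not None:
            value = int(hex_digits, 16)
        elif octal is not None:
            value = int(octal, 8)
        elif radix is not None:
            # int() raises ValueError for a base outside 2..36 or a digit not valid in that base.
            value = int(radix_digits, int(radix))
        else:
            value = int(decimal)
        return "integer", -value if sign == "-" else value
    m = FLOAT.fullmatch(token)
    if m:
        return "float", float(m.group(1))
    m = RATIO.fullmatch(token)
    if m:
        sign, numerator, denominator = m.groups()
        return "ratio", Fraction(int(sign + numerator), int(denominator))
    raise ValueError("not a number: " + token)


for tok in ["0x1F", "017", "-42N", "36rN", "1.5e2M", "3/4"]:
    print(tok, parse_number(tok))
```

The `36rN` case illustrates the note about the bigint suffix: in the radix form, a trailing `N` is consumed as a digit rather than as a suffix.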
-
open:
\
-
value
-
long escape
newline
space
tab
backspace
formfeed
return
-
unicode escape -- not identical to string's unicode escape
uXXXX, where X is a hex character
- hex characters defined by Java's Character.digit(<some_int>, 16)
- includes some surprises!
-
octal escape
oX, oXX, or oXXX, where X is an octal character
- octal characters defined by Java's Character.digit(<some_int>, 8)
- includes surprises!
-
simple character (not escaped)
- any character, including n, u, \, an actual tab, space, newline
- what about unprintable characters?
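The four kinds of char value above can be told apart with a short Python sketch. `parse_char` is a hypothetical helper that takes the text after the opening backslash; it assumes ASCII hex and octal digits only (so the "surprises" from `Character.digit` are not reproduced) and applies no range checks.

```python
import re

LONG_ESCAPES = {"newline": "\n", "space": " ", "tab": "\t",
                "backspace": "\b", "formfeed": "\f", "return": "\r"}


def parse_char(body):
    """Interpret the text following the opening backslash of a char token."""
    if len(body) == 1:
        return body  # simple character: "n", "u", "\\" and so on are all just themselves
    if body in LONG_ESCAPES:
        return LONG_ESCAPES[body]
    if re.fullmatch(r"u[0-9a-fA-F]{4}", body):
        return chr(int(body[1:], 16))
    if re.fullmatch(r"o[0-7]{1,3}", body):
        return chr(int(body[1:], 8))
    raise ValueError("unsupported character: \\" + body)


for body in ["a", "u", "newline", "u0041", "o101"]:
    print(body, repr(parse_char(body)))
```

Checking the single-character case first is what lets `\u` and `\n` read as the letters u and n instead of being mistaken for incomplete escapes.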
Syntax
-
special errors
- :: anywhere but at the beginning
- if it matches /([:]?)([^\d/].*/)?(/|[^\d/][^/]*)/, and:
- $2 =~ /:\/$/ -> error
- $3 =~ /:$/ -> error
-
value
-
reserved
nil
true
false
-
not reserved
-
type: starts with:
- :: -- auto keyword
- : -- keyword
- else -- symbol
-
namespace (optional)
/[^/]+/
/
-
name
/.+/
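Putting the ident rules together, a rough Python sketch might look like the following. `parse_ident` is a hypothetical name, the namespace is split at the first slash as the namespace rule suggests, and Python's `\d` is used where the original pattern has `\d`, which may not agree with Java on non-ASCII digits.

```python
import re

IDENT = re.compile(r"(:?)([^\d/].*/)?(/|[^\d/][^/]*)")
RESERVED = ("nil", "true", "false")


def parse_ident(token):
    """Return (type, namespace or None, name) for an ident token, or raise ValueError."""
    if token in RESERVED:
        return ("reserved", None, token)
    m = IDENT.fullmatch(token)
    if m is None:
        raise ValueError("not an ident: " + token)
    _colon, ns_part, name_part = m.groups()
    # Special errors: "::" anywhere but at the beginning, $2 ending in ":/", $3 ending in ":".
    if "::" in token[1:] or (ns_part or "").endswith(":/") or name_part.endswith(":"):
        raise ValueError("invalid ident: " + token)
    if token.startswith("::"):
        kind, rest = "auto keyword", token[2:]
    elif token.startswith(":"):
        kind, rest = "keyword", token[1:]
    else:
        kind, rest = "symbol", token
    ns, slash, name = rest.partition("/")
    if not slash or not ns or not name:
        return (kind, None, rest)  # no namespace, including the bare "/" symbol
    return (kind, ns, name)


for tok in ["nil", "foo", ":a/b", "::x", "/", "clojure.core//"]:
    print(tok, parse_ident(tok))
```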
Code used to verify against the implementation:
(fn [my-string]
  (let [f (juxt type namespace name)]
    (try (f (eval (read-string my-string)))
         (catch RuntimeException e (.getMessage e)))))