diff --git a/spec.md b/spec.md index e5776be..275d136 100644 --- a/spec.md +++ b/spec.md @@ -47,8 +47,6 @@ interact with the environment. ## Contents - - * [Overview](#overview) * [Contents](#contents) * [Lexical elements](#lexical-elements) @@ -58,6 +56,7 @@ interact with the environment. * [Integers](#integers) * [Floating-point numbers](#floating-point-numbers) * [Strings](#strings) + * [Bytes](#bytes) * [Lists](#lists) * [Tuples](#tuples) * [Dictionaries](#dictionaries) @@ -104,7 +103,7 @@ interact with the environment. * [any](#any) * [all](#all) * [bool](#bool) - * [chr](#chr) + * [dict](#dict) * [dir](#dir) * [enumerate](#enumerate) @@ -118,7 +117,7 @@ interact with the environment. * [list](#list) * [max](#max) * [min](#min) - * [ord](#ord) + * [print](#print) * [range](#range) * [repr](#repr) @@ -216,7 +215,7 @@ returns (U+000D), and newlines (U+000A). Within a line, white space has no effect other than to delimit the previous token, but newlines, and spaces at the start of a line, are significant tokens. -*Comments*: A hash character (`#`) appearing outside of a string +*Comments*: A hash character (`#`) appearing outside of a string or bytes literal marks the start of a comment; the comment extends to the end of the line, not including the newline character. Comments are treated like other white space. @@ -273,7 +272,7 @@ x index starts_with arg0 ``` *Literals*: literals are tokens that denote specific values. Starlark -has integer, floating-point, and string literals. +has integer, floating-point, string, and bytes literals. ```text 0 # int @@ -288,6 +287,10 @@ has integer, floating-point, and string literals. 
"hello" 'hello' # string '''hello''' """hello""" # triple-quoted string r'hello' r"hello" # raw string literal + +b"hello" b'hello' # bytes +b'''hello''' b"""hello""" # triple-quoted bytes +rb'hello' br"hello" # raw bytes literal ``` Integer and floating-point literal tokens are defined by the following grammar: @@ -365,15 +368,68 @@ allowing a long string to be split across multiple lines of the source file. def" # "abcdef" ``` -An *octal escape* encodes a single byte using its octal value. +An *octal escape* encodes a single string element using its octal value. It consists of a backslash followed by one, two, or three octal digits [0-7]. -It is error if the value is greater than decimal 255. +Simiarly, a *hexadecimal escape* encodes a single string element using its hexadecimal value. +It consists of `\x` followed by two hexadecimal digits [0-9a-fA-F]. +It is an error if the value of an octal or hexadecimal escape is greater than decimal 127. ```python -'\0' # "\x00" a string containing a single NUL byte +'\0' # "\x00" a string containing a single NUL element '\12' # "\n" octal 12 = decimal 10 '\101-\132' # "A-Z" '\119' # "\t9" = "\11" + "9" + +'\x00' # "\x00" a string containing a single NUL element +'\0A' # "\n" hexadecimal A = decimal 10 +"\x41-\x5A" # "A-Z" +``` + +The bit width of each string element is not defined by this spec, +and for reasons of efficiency and interoperability with the host language, +it varies across implementations. +For example, in the Go and Rust implementations, +each string element is an 8-bit value (a byte) and Unicode text is encoded as UTF-8, +whereas in the Java implementation, +string elements are 16-bit values (Java `char`s) and Unicode text is encoded as UTF-16. +This spec generalizes both approaches by referring to the string encoding as "UTF-K", +where the value of K is implementation-defined. + +A *Unicode escape* denotes the UTF-K encoding of a single Unicode code point. 
+The `\uXXXX` form, with exactly four hexadecimal digits, +denotes a 16-bit code point, and the `\UXXXXXXXX` form, +with exactly eight digits, denotes a 32-bit code point. +It is an error if the value lies in the surrogate range (U+D800 to U+DFFF). + +```python +'\u0041' # "A", an ASCII letter (U+0041) +'\u0414' # "Д", a Cyrillic capital letter (U+0414) +'\u754c' # "界", a Chinese character (U+754C) +'\U0001F600' # "😀", an Emoji (U+1F600) +``` + +The length of the encoding of a single Unicode code point may vary +based on the implementation's string encoding ("K"). + +```python +len("A") # 1 +len("Д") # 2 (UTF-8) or 1 (UTF-16) +len("界") # 3 (UTF-8) or 1 (UTF-16) +len("😀") # 4 (UTF-8) or 2 (UTF-16) +``` + +An *element escape* denotes a single element of a text string. +It consists of `\X`, followed by two hexadecimal digits if K=8, or four if K=16. +Element escapes are needed only to express strings that do not +contain a valid encoding of text, such as the result of a substring +operation that truncates an incomplete UTF-K encoding. +Element escapes are inherently non-portable, +and their use in source code is discouraged. + +```python +"\X41" # 'A', same as '\x41' +"\XFF" # '\XFF', not a valid text string +"😀"[:1] # "\Xf0" (UTF-8) or "\Xd83d" (UTF-16) ``` An ordinary string literal may not contain an unescaped newline, @@ -419,6 +475,41 @@ b" # "a\\\nb" It is an error for a backslash to appear within a string literal other than as part of one of the escapes described above. + +### Bytes literals + +A Starlark bytes literal denotes a bytes value, +and looks like a string literal, in any of its various forms +(single-quoted, double-quoted, triple-quoted, raw) +preceded by the letter `b`. + +```python +b"abc" b'abc' +b"""abc""" b'''abc''' +br"abc" br'abc' +rb"abc" rb'abc' +``` + +A raw bytes literal may be indicated by either a `br` or `rb` prefix. + +Non-escaped text within a bytes literal denotes the UTF-8 encoding of that text. 
+Bytes literals support the same escape sequences as text strings, +with the following differences: + +- Octal and hexadecimal escapes may specify any byte value from + zero (`\000` or `\x00`) to 255 (`\377` or `\xFF`). + +- Element escapes `\X` are not permitted. + +- A Unicode escape `\uXXXX` or `\UXXXXXXXX` denotes the byte + sequence of the UTF-8 encoding of the specified 16- or 32-bit code point. + (As with text strings, the code point value must not lie in the surrogate range.) + +Any valid string literal that, with a `b` prefix, is also a +valid bytes literal is equivalent in the sense that +the bytes value is the UTF-8 encoding of the string value. + + TODO: define indent, outdent, semicolon, newline, eof ## Data types @@ -430,7 +521,8 @@ NoneType # the type of None bool # True or False int # a signed integer of arbitrary magnitude float # an IEEE 754 double-precision floating-point number -string # a byte string +string # a text string, with Unicode encoded as UTF-8 or UTF-16 +bytes # a byte string list # a fixed-length sequence of values tuple # a fixed-length sequence of values, unmodifiable dict # a mapping from values to values @@ -451,7 +543,7 @@ every value has a type string that can be obtained with the expression `type(x)`, and any value may be converted to a string using the expression `str(x)`, or to a Boolean truth value using the expression `bool(x)`. Other operations apply only to certain types. For -example, the indexing operation `a[i]` works only with strings, lists, +example, the indexing operation `a[i]` works only with strings, bytes values, lists, and tuples, and any application-defined types that are _indexable_. The [_value concepts_](#value-concepts) section explains the groupings of types by the operators they support. @@ -639,28 +731,36 @@ float(3) / 2 # 1.5 ### Strings -A string represents an immutable sequence of bytes. +A string is an immutable sequence of elements that encode Unicode text. 
The [type](#type) of a string is `"string"`. -Strings can represent arbitrary binary data, including zero bytes, but -most strings contain text, encoded by convention using UTF-8. +Depending on the implementation, the elements may be 8- or 16-bit values, +and may hold arbitrary values of the element type; +we call the implementation's element width K. +By convention, however, strings are typically used to hold valid encodings +of Unicode code points, encoded using UTF-K. -The built-in `len` function returns the number of bytes in a string. +The built-in `len` function returns the number of elements in a string. Strings may be concatenated with the `+` operator. The substring expression `s[i:j]` returns the substring of `s` from -index `i` up to index `j`. The index expression `s[i]` returns the -1-byte substring `s[i:i+1]`. +index `i` up to index `j`. + +The index expression `s[i]` returns the +1-element substring `s[i:i+1]`. Strings are hashable, and thus may be used as keys in a dictionary. Strings are totally ordered lexicographically, so strings may be compared using operators such as `==` and `<`. +(Beware that the UTF-16 string encoding is not order-preserving.) Strings are _not_ iterable sequences, so they cannot be used as the operand of a `for`-loop, list comprehension, or any other operation than requires -an iterable sequence. +an iterable sequence. One must explicitly call a method of a string value +to obtain an iterable view. Any value may formatted as a string using the `str` or `repr` built-in functions, the `str % tuple` operator, or the `str.format` method. @@ -702,6 +802,41 @@ Strings have several built-in methods: * [`upper`](#string·upper) +### Bytes + +A _bytes_ is an immutable sequence of values in the range 0-255. +The [type](#type) of a bytes is `"bytes"`. + +Unlike a string, a bytes may represent binary data, +such as the contents of an arbitrary file, without loss. 
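As a rough illustration of the "without loss" claim, the following sketch uses Python's `bytes` type as a stand-in, since the Starlark bytes API is still listed as a TODO in this change. The variable names and the choice of sample data are assumptions made for the example only.

```python
# Python stand-in (an assumption): not Starlark's own bytes API.
raw = bytes([0x00, 0xFF, 0xFE, 0x41])  # arbitrary binary data, not valid UTF-8

# Forcing the data through a text string replaces the invalid bytes,
# so the original values cannot be recovered afterwards.
as_text = raw.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

# The bytes value itself keeps every element exactly as given.
print(round_tripped == raw)  # False
print(list(raw))             # [0, 255, 254, 65]
```

The same reasoning is why file contents of unknown format are better held as bytes than as a text string.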
+ +The built-in `len` function returns the number of elements (bytes) in a `bytes`. + +Two bytes values may be concatenated with the `+` operator. + +The slice expression `b[i:j]` returns the subsequence of `b` from +index `i` up to index `j`. The index expression `b[i]` returns the +int value of the ith element. + +Like strings, bytes are hashable, totally ordered, and not iterable, +and are considered True if they are non-empty. + +``` +TODO(https://github.com/bazelbuild/starlark/issues/112) +- methods. Likely the same as string (minus those concerned with text): + elems - iterator over ints + join + {start,end}with + {r,}{find,index,partition,split,strip} + replace +- specify ord, chr? +- hash(bytes) +- support 'bytes in bytes'? +- bytes(...) function +- encode, decode methods? +- can we reduce string iterator methods without loss of generality/efficiency? +``` + ### Lists A list is a mutable sequence of values. @@ -824,7 +959,7 @@ The [type](#type) of a dictionary is `"dict"`. Dictionaries provide constant-time operations to insert an element, to look up the value for a key, or to remove an element. Dictionaries are implemented using hash tables, so keys must be hashable. Hashable -values include `None`, Booleans, numbers, and strings, and tuples +values include `None`, Booleans, numbers, strings, and bytes, and tuples composed from hashable values. Most mutable values, such as lists, and dictionaries, are not hashable, unless they are frozen. Attempting to use a non-hashable value as a key in a dictionary @@ -867,8 +1002,7 @@ len(coins) # 5, existing item was updated A dictionary can also be constructed using a [dictionary comprehension](#comprehension), which evaluates a pair of expressions, the _key_ and the _value_, for every element of another iterable such -as a list. This example builds a mapping from each word to its length -in bytes: +as a list. 
This example builds a mapping from each word to its length: ```python words = ["able", "baker", "charlie"] @@ -1358,7 +1492,7 @@ variable, and calls to some built-in functions such as `print` change the state of the application that embeds the interpreter. Values of some data types, such as `NoneType`, `bool`, `int`, `float`, -and `string`, are _immutable_; they can never change. +`string` and `bytes`, are _immutable_; they can never change. Immutable values have no notion of _identity_: it is impossible for a Starlark program to tell whether two integers, for instance, are represented by the same object; it can tell only whether they are @@ -1431,7 +1565,7 @@ The hash of a value is an unspecified integer chosen so that two equal values have the same hash, in other words, `x == y => hash(x) == hash(y)`. A hashable value has the same hash throughout its lifetime. -Values of the types `NoneType`, `bool`, `int`, `float`, and `string`, +Values of the types `NoneType`, `bool`, `int`, `float`, `string`, and `bytes`, which are all immutable, are hashable. Values of mutable types such as `list` and `dict` are not @@ -1457,13 +1591,13 @@ We can classify different kinds of sequence types based on the operations they support. * `Iterable`: an _iterable_ value lets us process each of its elements in a fixed order. - Examples: `dict`, `list`, `tuple`, but not `string`. + Examples: `dict`, `list`, `tuple`, but not `string` or `bytes`. * `Sequence`: a _sequence of known length_ lets us know how many elements it contains without processing them. - Examples: `dict`, `list`, `tuple`, but not `string`. + Examples: `dict`, `list`, `tuple`, but not `string` or `bytes`. * `Indexable`: an _indexed_ type has a fixed length and provides efficient random access to its elements, which are identified by integer indices. - Examples: `string`, `tuple`, and `list`. + Examples: `string`, `bytes`, `tuple`, and `list`. 
* `SetIndexable`: a _settable indexed type_ additionally allows us to modify the element at a given integer index. Example: `list`. * `Mapping`: a mapping is an association of keys to values. Example: `dict`. @@ -1473,11 +1607,11 @@ least the `Sequence` contract, it's possible for an an application that embeds the Starlark interpreter to define additional data types representing sequences of unknown length that implement only the `Iterable` contract. -Strings are not iterable, though they do support the `len(s)` and +Strings and bytes values are not iterable, though they do support the `len(s)` and `s[i]` operations. Starlark deviates from Python here to avoid a common pitfall in which a string is used by mistake where a list containing a single string was intended, resulting in its interpretation as a sequence -of bytes. +of letters. Most Starlark operators and built-in functions that need a sequence of values will accept any iterable. @@ -1590,7 +1724,7 @@ PrimaryExpr = Operand . Operand = identifier - | int | float | string + | int | float | string | bytes | ListExpr | ListComp | DictExpr | DictComp | '(' [Expression] [,] ')' @@ -1613,14 +1747,14 @@ Lookup of locals and globals may fail if not yet defined. ### Literals -Starlark supports string literals of three different kinds: +Starlark supports literals of four different kinds: ```text -Operand = int | float | string +Operand = int | float | string | bytes ``` Evaluation of a literal yields a value of the given type -(int, float, string, or bytes) with the given value. See [Literals](#lexical elements) for details. 
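The Bytes literals section states that a string literal and the same literal with a `b` prefix are related by UTF-8 encoding. The sketch below checks that relationship in Python, used here only as a stand-in. Python bytes literals accept only ASCII characters and have no `\u` escapes, so the examples stay within that range. The names `pairs` and `all_equivalent` are illustrative.

```python
# Python stand-in (an assumption): not a Starlark implementation.
# Each pair is a string literal and the same literal with a b prefix.
pairs = [
    ("abc", b"abc"),
    ("\101-\132", b"\101-\132"),  # octal escapes, "A-Z"
    ("\x41-\x5A", b"\x41-\x5A"),  # hexadecimal escapes, "A-Z"
]

# The bytes value should equal the UTF-8 encoding of the string value.
all_equivalent = all(data == text.encode("utf-8") for text, data in pairs)
print(all_equivalent)  # True

# In Python, "\xff" is the code point U+00FF, whose UTF-8 form is two bytes,
# so it does not match b"\xff". The spec's limit of 127 on string escapes
# sidesteps this mismatch.
print("\xff".encode("utf-8") == b"\xff")  # False
```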
### Parenthesized expressions @@ -1864,6 +1998,7 @@ bool # False < True int # mathematical float # as defined by IEEE 754, except NaN > +Inf string # lexicographical +bytes # lexicographical tuple # lexicographical list # lexicographical ``` @@ -1915,10 +2050,11 @@ Bitwise operations: Concatenation string + string + bytes + bytes list + list tuple + tuple -Repetition (string/list/tuple) +Repetition (bytes/string/list/tuple) int * sequence sequence * int @@ -1953,7 +2089,7 @@ It is a dynamic error if the second operand is negative. ``` The `+` operator may be applied to non-numeric operands of the same -type, such as two lists, two tuples, or two strings, in which case it +type, such as two lists, two tuples, two strings, or two bytes, in which case it computes the concatenation of the two operands and yields a new value of the same type. @@ -1964,7 +2100,7 @@ the same type. ``` The `*` operator may be applied to an integer _n_ and a value of type -`string`, `list`, or `tuple`, in which case it yields a new value +`string`, `bytes`, `list`, or `tuple`, in which case it yields a new value of the same sequence type consisting of _n_ repetitions of the original sequence. The order of the operands is immaterial. Negative values of _n_ behave like zero. @@ -2250,7 +2386,7 @@ f("n") # 2 ### Index expressions An index expression `a[i]` yields the `i`th element of an _indexable_ -type such as a string, tuple, or list. The index `i` must be an `int` +type such as a string, bytes, tuple, or list. The index `i` must be an `int` value in the range -`n` ≤ `i` < `n`, where `n` is `len(a)`; any other index results in an error. @@ -2292,7 +2428,7 @@ type, such as a tuple or string, or a frozen value of a mutable type. ### Slice expressions A slice expression `a[start:stop:stride]` yields a new value containing a -subsequence of `a`, which must be a string, tuple, or list. +subsequence of `a`, which must be a string, bytes, tuple, or list. 
```text SliceSuffix = '[' [Expression] [':' Test [':' Test]] ']' . @@ -2987,6 +3123,8 @@ s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] `hash(x)` returns an integer hash value for a string x such that `x == y` implies `hash(x) == hash(y)`. + + ### int `int(x[, base])` interprets its argument as an integer. @@ -3177,6 +3315,7 @@ str(1) # '1' str("x") # 'x' str([1, "x"]) # '[1, "x"]' str(0.0) # '0.0' (formatted as if by "%g") +str(b"abc") # 'b"abc"' ``` ### tuple @@ -3488,12 +3627,15 @@ They are interpreted according to Starlark's [indexing conventions](#indexing). ### string·elems `S.elems()` returns an iterable value containing successive -1-byte substrings of S. +1-element substrings of S. ```python 'Hello, 123'.elems() # ["H", "e", "l", "l", "o", ",", " ", "1", "2", "3"] ``` + + + ### string·endswith @@ -3986,11 +4128,11 @@ Tokens: - spaces: newline, eof, indent, outdent. - identifier. -- literals: string, int, float. +- literals: string, bytes, int, float. - plus all quoted tokens such as '+=', 'return'. Notes: - Ambiguity is resolved using operator precedence. - The grammar does not enforce the legal order of params and args, - nor that the first compclause must be a 'for'. + nor that the first CompClause must be a 'for'.
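
The `len` values quoted for UTF-8 and UTF-16 in the Unicode escape examples can be reproduced with a short Python sketch. The helper `element_count` is hypothetical and exists only for this illustration; it counts K-bit elements by encoding the text with Python's standard codecs.

```python
def element_count(text: str, k: int) -> int:
    """Count the K-bit elements needed to encode text as UTF-K (K is 8 or 16)."""
    if k == 8:
        return len(text.encode("utf-8"))
    if k == 16:
        # Each UTF-16 code unit occupies two bytes; "utf-16-le" adds no BOM.
        return len(text.encode("utf-16-le")) // 2
    raise ValueError("K must be 8 or 16")

for s in ["A", "Д", "界", "😀"]:
    print(repr(s), element_count(s, 8), element_count(s, 16))
# 'A' 1 1
# 'Д' 2 1
# '界' 3 1
# '😀' 4 2
```

Code that depends on these counts is only portable across implementations when the text is pure ASCII, where both encodings use one element per character.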