diff --git a/spec.md b/spec.md index 4cd3eba..cd8d291 100644 --- a/spec.md +++ b/spec.md @@ -103,6 +103,7 @@ interact with the environment. * [any](#any) * [all](#all) * [bool](#bool) + * [bytes](#bytes) * [dict](#dict) * [dir](#dir) @@ -129,6 +130,7 @@ interact with the environment. * [type](#type) * [zip](#zip) * [Built-in methods](#built-in-methods) + * [bytes·elems](#bytes·elems) * [dict·clear](#dict·clear) * [dict·get](#dict·get) * [dict·items](#dict·items) @@ -147,10 +149,10 @@ interact with the environment. * [list·remove](#list·remove) * [set·union](#set·union) * [string·capitalize](#string·capitalize) - * [string·codepoint_ords](#string·codepoint_ords) - * [string·codepoints](#string·codepoints) + + * [string·count](#string·count) - * [string·elem_ords](#string·elem_ords) + * [string·elems](#string·elems) * [string·endswith](#string·endswith) * [string·find](#string·find) @@ -810,23 +812,24 @@ The slice expression `b[i:j]` returns the subsequence of `b` from index `i` up to but not including index `j`. The index expression `b[i]` returns the int value of the ith element. -Like strings, bytes are hashable, totally ordered, and not iterable, +The `in` operator may be used to test for the presence of one bytes +as a subsequence of another, or for the presence of a single `int` byte value. + +Like strings, bytes values are hashable, totally ordered, and not iterable, and are considered True if they are non-empty. +A bytes value has these methods: + +* [`elems`](#bytes·elems) ``` TODO(https://github.com/bazelbuild/starlark/issues/112) -- methods. Likely the same as string (minus those concerned with text): - elems - iterator over ints +- more methods: likely the same as string (minus those concerned with text): join {start,end}with {r,}{find,index,partition,split,strip} replace -- specify ord, chr? -- hash(bytes) -- support 'bytes in bytes', 'int in bytes'? -- bytes(...) function -- encode, decode methods? 
-- can we reduce string iterator methods without loss of generality/efficiency? +TODO: encode, decode methods? +TODO: ord, chr. ``` ### Lists @@ -1271,10 +1274,10 @@ Its [type](#type) is `"builtin_function_or_method"`. A built-in function value used in a Boolean context is always considered true. -Many built-in functions are predeclared in the environment -(see [Name Resolution](#name-resolution)), and are thus available to (see [Name Resolution](#name-resolution)). -all Skylark programs. Some built-in functions such as `len` are _universal_, that is, -available to all Skylark programs. +Many built-in functions are predeclared in the environment; +see [Name Resolution](#name-resolution). +Some built-in functions such as `len` are _universal_, that is, +available to all Starlark programs. The host application may predeclare additional built-in functions in the environment of a specific module. @@ -2110,19 +2113,20 @@ these operators. #### Membership tests ```text - any in sequence (list, tuple, dict, string) + any in sequence (list, tuple, dict, string, bytes, range) any not in sequence ``` The `in` operator reports whether its first operand is a member of its -second operand, which must be a list, tuple, dict, or string. +second operand, which must be a list, tuple, dict, string, or bytes. The `not in` operator is its negation. Both return a Boolean. The meaning of membership varies by the type of the second operand: the members of a list or tuple are its elements; the members of a dict are its keys; -the members of a string are all its substrings. +the members of a string or bytes are all its substrings. +Additionally, the members of a bytes include the int values of its (byte) elements. 
```python 1 in [1, 2, 3] # True @@ -2136,6 +2140,9 @@ d = {"one": 1, "two": 2} "nasty" in "dynasty" # True "a" in "banana" # True "f" not in "way" # True + +b"nasty" in b"dynasty" # True +97 in b"abc" # True (97 = 'a') ``` #### String interpolation @@ -2381,7 +2388,7 @@ f("n") # 2 ### Index expressions An index expression `a[i]` yields the `i`th element of an _indexable_ -type such as a string, bytes, tuple, or list. The index `i` must be an `int` +type such as a string, bytes, tuple, list, or range. The index `i` must be an `int` value in the range -`n` ≤ `i` < `n`, where `n` is `len(a)`; any other index results in an error. @@ -2425,7 +2432,8 @@ type, such as a tuple or string, or a frozen value of a mutable type. ### Slice expressions A slice expression `a[start:stop:stride]` yields a new value containing a -subsequence of `a`, which must be a string, bytes, tuple, or list. +subsequence of `a`, which must be an indexable sequence such as a string, +bytes, tuple, list, or range. ```text SliceSuffix = '[' [Expression] ':' [Test] [':' [Test]] ']' @@ -2984,6 +2992,29 @@ If the iterable is empty, it returns `True`. `bool(x)` interprets `x` as a Boolean value---`True` or `False`. With no argument, `bool()` returns `False`. +### bytes + +`bytes(x)` converts its argument to a `bytes`. + +If x is a `bytes`, the result is x. + +If x is a string, the result is a `bytes` whose elements are +the UTF-8 encoding of the string. Each element of the string that is +not part of a valid encoding of a code point is replaced by the +UTF-8 encoding of the replacement character, U+FFFD. + +If x is an iterable sequence of int values, +the result is a `bytes` whose elements are those integers. +It is an error if any element is not in the range 0-255. 
+ +```python +bytes("hello 😃") # b"hello 😃" +bytes(b"hello 😃") # b"hello 😃" +bytes("hello 😃"[:-1]) # b"hello ���" +bytes([65, 66, 67]) # b"ABC" +bytes(65) # error: got int, want string, bytes, or iterable of int +``` + ### dict `dict` creates a dictionary. It accepts up to one positional @@ -3107,11 +3138,14 @@ provided `default` value instead of failing. ### hash -`hash(x)` returns an integer hash of a string x -such that two equal strings have the same hash. +`hash(x)` returns an integer hash of a string or bytes x +such that two equal values have the same hash. In other words `x == y` implies `hash(x) == hash(y)`. +Any other type of argument is an error, even if it is suitable as the key of a dict. + In the interests of reproducibility of Starlark program behavior over time and -across implementations, the specific hash function is the same as that implemented by +across implementations, the specific hash function for bytes is 32-bit FNV-1a, +and the hash function for strings is the same as that implemented by [java.lang.String.hashCode](https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#hashCode), a simple polynomial accumulator over the UTF-16 transcoding of the string: @@ -3119,11 +3153,6 @@ a simple polynomial accumulator over the UTF-16 transcoding of the string: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] ``` -`hash(x)` returns an integer hash value for a string x such that `x == y` -implies `hash(x) == hash(y)`. - - - ### int `int(x[, base])` interprets its argument as an integer. @@ -3326,9 +3355,13 @@ str(1) # '1' str("x") # 'x' str([1, "x"]) # '[1, "x"]' str(0.0) # '0.0' (formatted as if by "%g") -str(b"abc") # 'b"abc"' +str(b"abc") # 'abc' ``` +The string form of a bytes value is the UTF-K decoding of the bytes. +Each byte that is not part of a valid encoding is replaced by the +UTF-K encoding of the replacement character, U+FFFD. + ### tuple `tuple(x)` returns a tuple containing the elements of the iterable x. 
@@ -3367,6 +3400,18 @@ using [dot expressions](#dot-expressions). For example, strings have a `count` method that counts occurrences of a substring; `"banana".count("a")` yields `3`. + +### bytes·elems + +`b.elems()` returns an opaque iterable value containing successive int elements of b. +Its type is `"bytes.elems"`, and its string representation is of the form `b"...".elems()`. + +```python +type(b"ABC".elems()) # "bytes.elems" +b"ABC".elems() # b"ABC".elems() +list(b"ABC".elems()) # [65, 66, 67] +``` + ### dict·get @@ -3637,14 +3682,27 @@ They are interpreted according to Starlark's [indexing conventions](#indexing). ### string·elems -`S.elems()` returns an iterable value containing successive +`S.elems()` returns an opaque iterable value containing successive 1-element substrings of S. +Its type is `"string.elems"`, and its string representation is of the form `"...".elems()`. ```python -'Hello, 123'.elems() # ["H", "e", "l", "l", "o", ",", " ", "1", "2", "3"] +"Hello, 123".elems() # "Hello, 123".elems() +type("Hello, 123".elems()) # "string.elems" +list("Hello, 123".elems()) # ["H", "e", "l", "l", "o", ",", " ", "1", "2", "3"] ``` - +