spec: new 'bytes' data type #112

alandonovan · 2020-10-02T02:45:44Z

This proposal argues for the addition of a new 'bytes' data type to Starlark, defined as an immutable sequence of byte (uint8) values.

Starlark has implementations in Go, Java, and Rust. It is crucial for performance that Starlark use the same string data type as the host language, but this poses a problem, because Java strings are sequences of uint16 values, conventionally holding UTF-16 encoded text, whereas Go's strings are sequences of uint8 values, conventionally holding UTF-8 encoded text.

The solution to this problem is to specify the behavior of Starlark strings in terms of an implementation-defined parameter K, which is 8 for Go and 16 for Java. Nearly all string operations over text can be usefully described this way, enabling portable text handling. (The sole obvious exception is str.strip(chars='abc'), as it treats the chars string as a set rather than an ordered sequence.) [This needs its own proposal. TODO(adonovan)]

This leaves the Java implementation without a way to represent arbitrary byte strings, such as the contents of arbitrary binary files, or fields of protocol buffers. This need would be met by adding a new data type, 'bytes', to the core language:

the literal syntax b"ABC" would denote a byte string, analogous to how "ABC" denotes a UTF-K string.
byte string literals would permit hex escapes \xXX and octal escapes \377 over the range [0-255]. Ordinary strings would permit hex and octal escapes only for the ASCII range [0-127], but would additionally allow \uXXXX and \UXXXXXXXX escapes to denote the UTF-K encoding of a unicode code point specified in hex.
type(bytes) would return "bytes".
len(bytes) would return the number of bytes.
bool(bytes) would be equivalent to len(bytes) > 0.
indexing bytes[int] would yield a 1-element byte string.
the following string methods would have byte-string counterparts: [count endswith find index join lstrip partition replace rfind rindex rpartition rsplit rstrip split startswith strip]. This set excludes methods related to textual concepts such as letter vs number, or upper case vs lower. We should perhaps start with a smaller set.
the elems method would iterate over the 1-byte substrings, and elem_ords would iterate over the numeric byte values.
str(bytes) would decode bytes to UTF-K, replacing invalid sequences by U+FFFD.
bytes(str) (that is, a new predeclared function called "bytes") would encode a text string to UTF-8, replacing invalid code units with the encoding of U+FFFD. (If K=8 and str is a valid encoding, the result is the same as the input.)
repr(bytes) would return a byte string literal such as b"ABC".

The text was updated successfully, but these errors were encountered:

brandjon · 2020-10-02T18:50:57Z

byte string literals

The Python spec calls them "bytes literals", conspicuously omitting the term "string" when referring to them. I suggest we do the same.

byte string literals would permit hex escapes \xXX and octal escapes \377 over the range [0-255].

Bizarrely, Python 3 also permits \400 to \777 but truncates to 8 bits. Agreed that we should prohibit those values.

Ordinary strings would permit hex and octal escapes only for the ASCII range [0-127], but would additionally allow \uXXXX and \UXXXXXXXX escapes to denote the UTF-K encoding of a unicode code point specified in hex.

So that way escape sequences besides \u and \U yield behaviorally equivalent string content across implementations. Sounds good.

indexing bytes[int] would yield a 1-element byte string.

This differs from Python 3.

>>> b'\x00'[0]
0
>>> type(b'\x00'[0])
<class 'int'>

the following string methods would have byte-string counterparts: [count endswith find index join lstrip partition replace rfind rindex rpartition rsplit rstrip split startswith strip]

Note that many of the excluded methods exist on the bytes type in Python 3. Presumably this is to give access to Python 2-style string handling, so perhaps we don't need it.

the elems method would iterate over the 1-byte substrings, and elem_ords would iterate over the numeric byte values.

See above about indexing. If you want the items of elems to be of the same type as the values returned by indexing, then elems would be an iterable of integers. You could conceivably have another method to return the 1-byte subranges. (That could also be acheived through slicing, though you can say the same about string's elems method. I presume the reason we want an iterable is to avoid allocating for each item.)

str(bytes) would decode bytes to UTF-K, replacing invalid sequences by U+FFFD.

bytes(str) (that is, a new predeclared function called "bytes") would encode a text string to UTF-8, replacing invalid code units with the encoding of U+FFFD. (If K=8 and str is a valid encoding, the result is the same as the input.)

Should bytes values also have a decode method, and strings an encode method, as in Python?

Should we support the ability to specify an encoding besides UTF-8? What if Starlark code wants to write a file in an odd encoding? If we don't support that now, will we be able to add it seamlessly in the future simply by extending this design with an encoding param as in Python?

Should we support customizable (or at least multiple) behaviors for invalid text, e.g. "strict" mode (fail fast)? Should strict be the default as in Python 3? (Note that we might support user-defined behaviors by passing callback functions, as opposed to how Python passes names of callbacks registered with codecs.register_error().)

I'm fine with deferring all of these in implementation, so long as they don't conflict with anything we're doing today.

Additional points:

Are bytes objects iterable, or are you required to use something like elems()? Probably best to prohibit implicit iteration for consistency with Starlark strings.
Regarding unicode values (not escape sequences) appearing in literals: As was mentioned on an email thread, "Within a string literal a code point denotes its UTF-k encoding; within a bytes literal each source file byte denotes a string byte." Since Starlark files are specified to be UTF-8 encoded, this would seem to imply that a unicode code point appearing within a bytes literal is treated as its UTF encoding, whereas the same code point appearing within a string literal is transcoded (if necessary) to the implementation's native string encoding.

brandjon · 2020-10-02T19:23:39Z

The solution to this problem is to specify the behavior of Starlark strings in terms of an implementation-defined parameter K, which is 8 for Go and 16 for Java. Nearly all string operations over text can be usefully described this way, enabling portable text handling.

To expand on this, every string operation besides strip() expects strings (or substrings) as arguments, not individual unicode code points or code units. Therefore, they are agnostic to whether those arguments represent one code point or many. It doesn't matter that 🌿 encodes as 4 elements in the Go implementation and 2 elements in the Java one, because mystring.count('\U0001f33f') will return how many of these code points appear in mystring either way.

For strip, a plausible solution is to specify that it interprets its as a set of code points, not string elements. That is, it will decode the given string from UTF-k into unicode code points, reencode each of those code points separately back to UTF-k, then proceed to strip the input text of any occurrences of those separate substrings. Note that this implies strip's processing of its arguments could fail with unicode errors and is subject to the same decisions we're now making about U+FFFD vs strictness in other transcoding methods.

Yet more thoughts:

Python appears to be case insensitive for literal prefixes like b and r. Currently the Java implementation allows r for strings but not R. We should be consistent with ourselves and either only allow lowercase literal prefixes, or else be case insensitive for all prefixes (to be consistent with ourselves and Python).
Python has raw bytes literals, spelled rb or br (again case insensitive).
Python has \N to allow specifying unicode characters by name. This seems like extra complexity we don't need and it can easily be added in the future since we reject illegal escape sequences (under --incompatible_restrict_string_escapes).
Triple-quoted bytes literals are permitted by Python so we should probably allow them as well.
Python disallows non-ASCII code points from appearing inside bytes literals, contrary to what was discussed above. '🌿' is a syntax error.
We should disallow implicit bytes literal concatenation (b'abc' b'def'), as we do for strings. We should allow explicit concatenation (b'abc' + b'def').
In Python, ord() also works on bytes values of length 1.
Python also has hex() and fromhex() methods on bytes.
For strings, the various alphanumeric / whitespace methods will need to be made unicode-aware to match Python 3's behavior. Last I looked, this entailed embedding a database of unicode character traits in the binary.

brandjon · 2020-10-02T20:12:06Z

Another thing: Apparently in Python you can pass non-ASCII numeric code points to the int and float functions:

>>> int('\u0c67')
1

It seems niche to support this, but maybe it's worse to not support it and make a backwards-incompatible change if we regret that?

The int and float functions also accept bytes objects.

brandjon · 2020-10-02T20:21:14Z

Would byte literals support the format operator % as in Python 3?

alandonovan · 2020-10-19T18:01:06Z

The Python spec calls them "bytes literals"... I suggest we do the same.

Agreed.

indexing bytes[int] would yield a 1-element byte string. This differs from Python 3.

Good point. Let's follow Python. But we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x to do for code points. What is the Python3 approach? bytes([x])? The list allocation makes me sad.

If you want the items of elems to be of the same type as the values returned by indexing, then elems would be an iterable of integers.

Good point. Ordinary (UTF-K) strings need methods for iterating over two sequences---elements, and Unicode code points---and both of these usefully come in two flavors: singleton substrings, and numeric codes. Byte strings don't need the code point iterator, but they would want both substring and numeric flavors unless there's a convenient and cheap alternative conversion. [Update: ord is that conversion, from 1-element strings to numbers.]

Should bytes values also have a decode method, and strings an encode method, as in Python?

No. It's redundant wrt bytes(string) and str(bytes).

Should we support the ability to specify an encoding besides UTF-8?

No, I don't think it is important to support odd encodings. Users can add encoding functions to their applications if they want, but the world is moving to UTF-8. (The utf8everywhere.org manifesto is a good read.)

Should we support customizable (or at least multiple) behaviors for invalid text, e.g. "strict" mode (fail fast)?

No. Such things could in principle be added later, but I don't see a compelling need, and I doubt one will arise. I really don't want to add callback functions to these simple interfaces.

Probably best to prohibit implicit iteration for consistency with Starlark strings.

Agreed.

a unicode code point appearing within a bytes literal is treated as its UTF8 encoding, whereas the same code point appearing within a string literal is transcoded (if necessary) to the implementation's native string encoding.

Exactly. The important thing is that for any "abc" that is legal in both "abc" and b"abc" literals, we have the invariant bytes("abc") == b"abc".

For strip, a plausible solution is to specify that it interprets its [cutset] as a set of code points, not string elements.

That seems like the best approach; it's the one taken by Go's strings.Trim (https://golang.org/pkg/strings/#Trim), for example. The implementation could be simpler than what you suggest: instead of materializing the sequence of code points, you can decode the input string and test each code point as you go. The most common cutset is a single ASCII byte, followed by a sequence of ASCII bytes, both of which make the decoding and the testing trivial. Rarely is the cutset non-ASCII or large, so the quadratic implementation is fine for those cases.

Python appears to be case insensitive for literal prefixes like b and r.

I'd vote for lowercase only.

Python has raw bytes literals, spelled rb or br (again case insensitive).

Starlark should too.

Python has \N to allow specifying unicode characters by name. This seems like extra complexity we don't need

Agreed.

Triple-quoted bytes literals are permitted by Python so we should probably allow them as well.

Agreed.

Python disallows non-ASCII code points from appearing inside bytes literals, contrary to what was discussed above. '🌿' is a syntax error.

That's no doubt because Python doesn't mandate a source file encoding, but Starlark does (though Bazel screws things up here).

We should disallow implicit bytes literal concatenation (b'abc' b'def'), as we do for strings. We should allow explicit concatenation (b'abc' + b'def').

Agreed.

In Python, ord() also works on bytes values of length 1.

That's good to know. This provides a means of efficiently mapping from 1-element substrings to numbers, and means we could do away with the numeric flavor of the 1-element substring iterator for both strings and for bytes.

Python also has hex() and fromhex() methods on bytes.

That seems reasonable to add. It can be defined in the language (or rather, it could be if we had '%x"%x, and we should), but not without an enormous amount of wasteful allocation.

For strings, the various alphanumeric / whitespace methods will need to be made unicode-aware to match Python 3's behavior. Last I looked, this entailed embedding a database of unicode character traits in the binary.

Yes. I recently reviewed a change to RE2/J which generated these tables from ICU, instead of from Go's unicode package:
google/re2j@eedfe4b
We should learn what we can from there before coding.

Apparently in Python you can pass non-ASCII numeric code points to the int and float functions:

int('\u0c67') # U+0C67 is TELUGU DIGIT ONE
1
It seems niche to support this

It sure does seem niche, not to mention an intriguing avenue for security attacks. We should definitely not support it.

The int and float functions also accept bytes objects.

Seems like a relic of Python's old string handling, and not something we should support. bytes are not text.

Would byte literals support the format operator % as in Python 3?

No. Again, bytes are not text.

brandjon · 2020-10-19T21:15:02Z

we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x to do for code points. What is the Python3 approach? bytes([x])? The list allocation makes me sad.

1-element slices would work. Of course that still allocates, but that's unavoidable in Java if you're creating a (non-primitive) value, no?

Byte strings don't need the code point iterator, but they would want both substring and numeric flavors unless there's a convenient and cheap alternative conversion. [Update: ord is that conversion, from 1-element strings to numbers.]

Then is the proposal to have bytes.elems() give you an iterable over the 1-element slices, and have no other new methods, relying on ord() for the numeric conversion on an element-by-element basis? (Note that if e is an item in b.elems(), then ord(e) is just e[0]. But ord allows for an ad hoc polymorphism between strings and bytes that wouldn't otherwise exist.)

No. It's redundant wrt bytes(string) and str(bytes).

I imagine other constructs in Starlark could be considered redundant. List += vs extend(), for instance. If we are to leave out a redundant construct, do we have a sense of which spelling -- the constructor-like notation, or the encode/decode methods -- is more commonly used by Python programmers?

No, I don't think it is important to support odd encodings. Users can add encoding functions to their applications if they want, but the world is moving to UTF-8. (The utf8everywhere.org manifesto is a good read.)

That does look like a good read. I can understand having a bias towards UTF-8, but I wonder how much of it is aspirational and how much of it reflects today's reality. Will lack of built-in support for other encodings be problematic to users?

I wonder: What do we do in Bazel when writing things like command line args of an action? Do we just never serialize any non-ASCII text in practice? Probably. But if we handle it correctly, the transcoding differs between windows and unix.

No. Such things could in principle be added later, but I don't see a compelling need, and I doubt one will arise. I really don't want to add callback functions to these simple interfaces.

If we reach a point where callbacks are being requested, I'd agree that we can just tell people to write a library function. But strictness seems like a basic enough feature that we can make it an option (agreed that it doesn't have to be done right now).

As for whether U+FFFE replacement is a sensible default: the argument against strictness is that you don't want to see failures when somebody throws bad data at you, you'd rather just process it and give garbage out for their garbage in? Ordinarily as a rule we prefer to fail fast. I'm undecided here.

Python is strict by default. Apparently strict means erroring out on surrogate code points, but not on U+FFFE and U+FFFF.

What was the root of Python 2's mess with unicode, where the moment the code handled non-ASCII data you'd suddenly see failures that your tests didn't catch? Was that related to the encoding strictness, or only to the loose typing between bytes and strings?

I'd vote for lowercase only.

Yeah, I'd kinda like to ban uppercase as non-conventional and weird.

That's no doubt because Python doesn't mandate a source file encoding, but Starlark does (though Bazel screws things up here).

Hm. According to SO, Java doesn't specify the source code format below the level of unicode code points. Is it in any way problematic to require UTF-8, beyond e.g. breaking the workflows of the masses of people clamoring to write Starlark code in Windows Notepad?

This provides a means of efficiently mapping from 1-element substrings to numbers, and means we could do away with the numeric flavor of the 1-element substring iterator for both strings and for bytes.

So for str, it'd just be s.elems(), s.codepoints(), and s.codepoint_ords()? Sounds like codepoint_ords could also be replaced by a function that operates on a single UTF-8-encoded code point, but in that case we'd be introducing a new predeclared symbol, so we may as well keep the method. How confident are we that this set of primitives -- iterators over elements and code points and str/bytes conversion via constructors -- provides the right set of tools to users?

Sticking to Python's treatment of unicode character metadata, and abandoning Python's loose byte typing w.r.t. int()/float() and its % formatting, sounds good to me.

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. This change removes go.starlark.net.lib.proto.Bytes. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

alandonovan · 2020-12-10T21:50:07Z

If you're interested in this proposal, please take a look at the attached PR for the reference implementation in Go. I have a Java implementation too but it's considerably more complicated due to Bazel's long-standing mis-decoding of UTF-8 files as Latin1.

Suggestions welcome.

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. This change removes go.starlark.net.lib.proto.Bytes. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. This change removes go.starlark.net.lib.proto.Bytes. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

brandjon · 2020-12-22T22:31:55Z

I aggregated the issues raised in this discussion thread so far.

Resolved

They’re called “bytes literals”, not “byte string literals”.
Bytes literals support \xXX up to \xFF and \xxx up to \377. Regular string literals support \xXX up to \x7F and \xxx up to \177. \uXXXX and \UXXXXXXXX are only supported in string literals and denote the encoding of the given code point into UTF-k. We don’t support the \N escape for named unicode code points.
Following Starlark's treatment of strings, bytes values are not iterable, and bytes literals are not implicitly concatenated.
int() and float() casts do not work on bytes values. On string values, they do not treat non-ASCII code points as numeric text (no int('\u0c67')).
Bytes values do not support % formatting.
ord() works on bytes values of length 1, and can be called in a loop over 1-element subsequences to extract the sequence of values. (Mind that int(‘1’) != ord(‘1’).)
Indexing a bytes value returns the integer ([0-255]) at that position. I.e., b[i] == ord(b[i:i+1]).
Raw unicode code points in bytes literals are encoded into the bytes value as UTF-8 (effectively a no-op since hte source representation is UTF-8). Raw unicode code points in string literals are encoded into the string value as UTF-k (only a no-op when k = 8).
We support literal prefixes r, b, rb, and br, but not their uppercase versions. We support all quoting styles (', ", and their triple-quoted variants).
Character trait information (alphanumericness, whitespaceness, etc.) will be based on a unicode database, similar to Python and Go.

Open issues

Iteration API

Exactly what elems(), elem_ords(), etc. methods do we support on strings and bytes?

Future extensions to encoding/decoding

If we support more encodings or more modes (strictness, etc.) in the future, what would that look like and would it interfere with any decisions we’re making now? I think I'm ok with the type names being the principal conversion operator (as opposed to encode()/decode() methods), so that means possibly adding keyword args to the str and bytes functions.

What was the root of Python 2's mess with unicode, where the moment the code handled non-ASCII data you'd suddenly see failures that your tests didn't catch? Was that related to the encoding strictness, or only to the loose typing between bytes and strings?

FWIW, this summed up what I was probably thinking of.

Surrogates

Under what circumstances does unpaired surrogate replacement by U+FFFD occur?

Does it happen when decoding the source text as UTF-8 (which technically shouldn't allow unpaired surrogates as content)? If so, that means unpaired surrogates would be transformed even when appearing inside bytes literals, though the transformation would occur at the source decoding level, not the lexer or parser. Is that within scope of the spec?

Does it happen when reading a \u or \U escape sequence representing a surrogate code point in a string literal? (Of course it wouldn't for a bytes literal.) With k=16, can you write "\uXXXX\uXXXX" to form a two-element string containing a valid surrogate pair, and have it be the exact same string value as if you wrote "\UXXXXXXXX" for the corresponding code point that the pair represents?

Does it happen when you convert between strings and bytes using str() and bytes()? In particular, how does this square with the property you cited:

The important thing is that for any "abc" that is legal in both "abc" and b"abc" literals, we have the invariant bytes("abc") == b"abc".

Note that U+FFFD replacement is the only thing standing in the way of having bytes <-> str conversion be lossless when k = 8.

1-byte strings

we will need an operation for converting a byte value to a 1-byte string (analogous to what chr(x) and "%c" % x to do for code points. What is the Python3 approach? bytes([x])? The list allocation makes me sad.

1-element slices would work. Of course that still allocates, but that's unavoidable in Java if you're creating a (non-primitive) value, no?

Actually, I suppose we can simply preallocate the 256 length-1 bytes values. But I see now we're talking about going from an integer to a bytes value.

Python apparently has bytes(size) -> bytes value with size many 0x00 values. I presume we don’t want to carry over that constructor, but its presence in Python blocks us from doing bytes(octet) -> bytes([octet]). Maybe we could use a keyword parameter, e.g. bytes(octet=x) -> bytes([x])?

Strange values of k

In the spec, do we support k = 7 (UTF-7) or k = 32 (UTF-32)? Or are we explicitly mandating that k is either 8 or 16?

brandjon · 2020-12-23T15:45:17Z

See my comments on the starlark-go PR. I think the unifying theme of the problem with elems/ord is that we need a convenient, orthogonal, and obvious way to construct and deconstruct arbitrary sequences of k-bit elements. The restriction on non-ASCII numeric escapes in strings doesn't do that. The unicode-centric interpretation of ord() doesn't do that.

Instead, what if we make ord() always give the integer corresponding to the element (for both bytes and strings), and allow numeric escapes up to the max value representable in k? I.e., you could do "\xFF" in the Go implementation, and possibly even "\XFFFF" (new escape for new width) in Java. Then it's easy to make any kind of string, including malformed unicode strings, which were already possible but just edge cases. For both strings and bytes we do away with .elem_ords() unless it turns out to be performance critical for something. ord(a[i]) gives the number of the ith element for both strings and bytes.

For strings, we also provide a top-level codepoint_ord() and a string.codepoints() method (with string.codepoint_ords() similarly being omitted as redundant). codepoint_ord() gets the number of the first codepoint in the arg.

Going from integers to string or byte elements is less uniform. We have bytes([x]) and chr(x), though the latter may produce more than one element depending on x and k.

Note that the possible \XNNNN escape is redundant with \uNNNN on k=16 platforms, and illegal on k=8 platforms. Its main purpose is to signal that we're constructing an element, not a UTF-8 sequence. I'm open to other ideas though. (The escape-based approach also falls apart entirely if we are allowed to conceive of k=32 implementations.)

Edit: You may argue that this makes strings too similar to bytes. But if we really wanted them to be so dissimilar, wouldn't we define them as sequences of code points like Python does?

brandjon · 2020-12-23T21:06:33Z

Notes from my discussion with Alan follow.

Corrections from my summary above:

\u and \U are allowed in bytes literals and denote the UTF-8 encoding of the given hex code point.
Bytes can't be on the left-hand side of % formatting, but they can be on the right.

Construction

The constructor syntax we were going to use differs from Python. In Python:

str(bytes_value) gives the repr-style representation of bytes_value as a literal, e.g. str(b'abc') == "b'abc'". In contrast, str(bytes_value, encoding=...) converts bytes_value to a corresponding string.
bytes(string_value) is disallowed, while bytes(string_value, encoding=...) converts strings to bytes.
str.encode() and bytes.decode() convert between string and bytes with a default encoding of utf-8 if not specified.

We should aim for consistency with Python. At the very least, this means disallowing bytes(string_value) and having repr-like behavior for str(bytes_value). I suggest that we support both the constructor-style and encode()/decode()-style conversions, but the encode/decode one is more important since it permits omitting the encoding param in the common case of utf-8.

Unlike Python, we should require that encoding is passed by keyword only, not positionally. This improves user code clarity and helps avoid the situation where we can't introduce subsequent parameters (strict) until this one's implemented.

We omit Python's bytes(size) constructor. (The nullary bytes() constructor should probably still be allowed and produce an empty bytes object, analogous to str() and to other nullary constructors producing "zero" values.)

The bytes.join() method allows for linear-time concatenation of bytes, similar to strings. E.g. b''.join(bytes_builder) where bytes_builder is a list of bytes objects.

We may support bytes(octet=x) as a slightly more efficient synonym for bytes([x]), though we don't have to decide that right away. chr() and codepoint_chr() (see below) already serve that purpose for strings. For bytes, you can construct an arbitrary sequence dynamically (i.e., outside literals) by using bytes([n1, n2, ...]). For strings, you'd have to do ''.join([chr(n) for n in [n1, n2, ...]]).

Strictness

Similarly, we should side with Python's default of failing fast when an encoding error is detected. This behavior can be overridden by the strict parameter, which behaves as in Python (but perhaps omitting some of its possible values to start, e.g. "ignore"). Unlike Python, we make strict keyword-only.

While decoding (utf-8-encoded) Starlark source code, an invalid surrogate code point should result in an error.

Value of k

We specify that k = 8 and k = 16 are the only valid implementations. No k = 32 in particular.

Elements and encodings

Manipulating string elements is a first-class concept regardless of unicode encodings.

We add a \Xnnnn escape (not present in Python) for specifying element values from 0x0000 to 0xFFFF (k=16) or to 0x00FF (k=8). It's a lexical error to go out of range on a k=8 implementation. On a k=16 implementation, \X has the same effect as \u for non-surrogate code points.

It is illegal to use \u or \U to specify surrogate code points. This differs from Python. The rationale is that in Python, unicode escapes create an element whose code point matches the given number, and a surrogate is a valid Python string element (and a valid Starlark string element for k = 16). But in this proposal, unicode escapes simply append the utf-k encoding of the given code point to the string, and a surrogate's code point cannot be properly encoded. If you really want this encoded value anyway, you can construct it with one \X (for k=16 implementations) or several \x (for k=8, or for bytes objects) escapes.

The string.elems() and bytes.elems() methods give you an iterable of the elements -- individual length-1 strings for strings, and octet integers for bytes. In both cases this matches what you'd get from directly iterating over the object in Python (which we don't support). The ord() top-level function takes in a length-1 string or bytes value and returns its ordinal. The chr() top-level function does the inverse operation for strings, taking an integer (in range 2^k) to a 1-element string. Note that all of this is independent of unicode encodings.

For strings we also provide a string.codepoints() method and two toplevel functions, codepoint_ord() and codepoint_chr().

string.codepoints() returns an iterable of successive substrings, each one constituting a single utf-k-encoded code point.
codepoint_ord() takes in a string and returns the code point of the first utf-k-encoded character. (Unlike ord(), it tolerate excess data after the first item.)
codepoint_chr() takes in a code point and returns the (possibly multi-element) utf-k-encoded string representing it.

All three methods could take in a strict parameter to control their tolerance of accepting and yielding surrogates. They never accept or produce values out of the unicode range (U+0000 to U+10FFFF).

This gives us a complete elements-oriented API for dealing with strings and bytes, as well as a separate unicode-oriented API for strings. Note that whereas the format of a bytes object can be customized by the encoding parameter of encode()/decode(), the API always assumes utf-k for strings.

We do need to give some thought to how to effectively handle strict failure, given that Starlark does not have exceptions. It seems undesirable to add look-before-you-leap variants of these methods, or even a keyword param that changes it from a results-producing-call to a safety check. Maybe we can have "strict" mean "returns None on failure".

This change adds initial specification of the bytes data type following length discussion in bazelbuild#112. It also explains the implementation-dependent encoding of text strings, and the \u \U \X escapes. More will follow, but let's get the easy parts out of the way first. Updates bazelbuild#112 Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote, but uses \X to escape raw code units as needed. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \X or \u escapes instead. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote, but uses \X to escape raw code units as needed. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \X or \u escapes instead. Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

This change adds initial specification of the bytes data type following length discussion in bazelbuild#112. It also explains the implementation-dependent encoding of text strings, and the \u \U \X escapes. More will follow, but let's get the easy parts out of the way first. Updates bazelbuild#112 Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

This change adds initial specification of the bytes data type following length discussion in bazelbuild#112. It also explains the implementation-dependent encoding of text strings, and the \u and \U escapes. More will follow, but let's get the easy parts out of the way first. Updates bazelbuild#112 Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a

* spec: bytes data type This change adds initial specification of the bytes data type following length discussion in #112. It also explains the implementation-dependent encoding of text strings, and the \u and \U escapes. More will follow, but let's get the easy parts out of the way first. Updates #112

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

This change defines the semantics of: - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit BE FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates bazelbuild#112 Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

This change defines the semantics of: - str(bytes) -- UTF-k decoding with U+FFFD replacement - bytes(str) -- UTF-k encoding with U+FFFD replacement - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates bazelbuild#112 Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7

This change defines the semantics of: - str(bytes) -- UTF-k decoding with U+FFFD replacement - bytes(str) -- UTF-k encoding with U+FFFD replacement - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates #112 Co-authored-by: Alan Donovan <alan@alandonovan.net>

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222

Bazel's Starlark does not provide access to a string's bytes, only its codepoints, so we are unable to do this escaping in Starlark. So a second pass is needed, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. Fixes bazel-contrib#794

alandonovan mentioned this issue Oct 2, 2020

spec: floating point numbers #113

Closed

alandonovan mentioned this issue Nov 11, 2020

spec: convergence with Go #20

Open

brandjon mentioned this issue Dec 23, 2020

starlark: add 'bytes' data type, for binary strings google/starlark-go#330

Merged

alandonovan mentioned this issue Feb 5, 2021

Clarify string escaping #160

Closed

alandonovan mentioned this issue Feb 5, 2021

spec: bytes data type #161

Merged

alandonovan mentioned this issue Feb 12, 2021

spec: more bytes operations #164

Merged

mahmoudimus mentioned this issue May 16, 2021

[✅] ready for merge: starlark: add 'bytes' data type, for binary strings verygoodsecurity/starlarky#115

Merged

5 tasks

alexeagle mentioned this issue May 3, 2023

[FR]: ord, chr, hex bazel-contrib/bazel-lib#418

Closed

plobsing mentioned this issue Nov 22, 2024

feat: vis-encode all path characters except visible ASCII bazel-contrib/bazel-lib#985

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spec: new 'bytes' data type #112

spec: new 'bytes' data type #112

alandonovan commented Oct 2, 2020 •

edited

Loading

brandjon commented Oct 2, 2020

brandjon commented Oct 2, 2020

brandjon commented Oct 2, 2020 •

edited

Loading

brandjon commented Oct 2, 2020

alandonovan commented Oct 19, 2020

brandjon commented Oct 19, 2020

alandonovan commented Dec 10, 2020

brandjon commented Dec 22, 2020

brandjon commented Dec 23, 2020 •

edited

Loading

brandjon commented Dec 23, 2020 •

edited

Loading

spec: new 'bytes' data type #112

spec: new 'bytes' data type #112

Comments

alandonovan commented Oct 2, 2020 • edited Loading

brandjon commented Oct 2, 2020

brandjon commented Oct 2, 2020

brandjon commented Oct 2, 2020 • edited Loading

brandjon commented Oct 2, 2020

alandonovan commented Oct 19, 2020

brandjon commented Oct 19, 2020

alandonovan commented Dec 10, 2020

brandjon commented Dec 22, 2020

Resolved

Open issues

Iteration API

Future extensions to encoding/decoding

Surrogates

1-byte strings

Strange values of k

brandjon commented Dec 23, 2020 • edited Loading

brandjon commented Dec 23, 2020 • edited Loading

Corrections from my summary above:

Construction

Strictness

Value of k

Elements and encodings

alandonovan commented Oct 2, 2020 •

edited

Loading

brandjon commented Oct 2, 2020 •

edited

Loading

brandjon commented Dec 23, 2020 •

edited

Loading

brandjon commented Dec 23, 2020 •

edited

Loading