spec: new 'bytes' data type #112
Comments
The Python spec calls them "bytes literals", conspicuously omitting the term "string" when referring to them. I suggest we do the same.
Bizarrely, Python 3 also permits
So that way escape sequences besides
This differs from Python 3.
Note that many of the excluded methods exist on the bytes type in Python 3. Presumably this is to give access to Python 2-style string handling, so perhaps we don't need it.
See above about indexing. If you want the items of
Should bytes values also have a Should we support the ability to specify an encoding besides UTF-8? What if Starlark code wants to write a file in an odd encoding? If we don't support that now, will we be able to add it seamlessly in the future simply by extending this design with an Should we support customizable (or at least multiple) behaviors for invalid text, e.g. "strict" mode (fail fast)? Should strict be the default as in Python 3? (Note that we might support user-defined behaviors by passing callback functions, as opposed to how Python passes names of callbacks registered with I'm fine with deferring all of these in implementation, so long as they don't conflict with anything we're doing today. Additional points:
|
To expand on this, every string operation besides strip() expects strings (or substrings) as arguments, not individual unicode code points or code units. Therefore, they are agnostic to whether those arguments represent one code point or many. It doesn't matter that 🌿 encodes as 4 elements in the Go implementation and 2 elements in the Java one, because For Yet more thoughts:
|
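To make the element counts above concrete, here is a small Python check. It is only an illustration: Python strings are sequences of code points, so the K = 8 and K = 16 views are simulated by encoding to UTF-8 and UTF-16 and counting code units. The variable names are made up for this sketch.

```python
# Count the code units that the same text occupies under K = 8 and K = 16.
leaf = "🌿"

# K = 8 (Go-style strings): one element per UTF-8 byte.
utf8_units = len(leaf.encode("utf-8"))

# K = 16 (Java-style strings): one element per UTF-16 code unit.
# "utf-16-le" avoids the byte-order mark; each code unit is 2 bytes.
utf16_units = len(leaf.encode("utf-16-le")) // 2

print(utf8_units, utf16_units)
```

Operations that take whole substrings as arguments never observe this difference, which is the point made above.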
Another thing: Apparently in Python you can pass non-ASCII numeric code points to the
It seems niche to support this, but maybe it's worse to not support it and make a backwards-incompatible change if we regret that? The |
Would byte literals support the format operator |
Agreed.
Good point. Let's follow Python. But we will need an operation for converting a byte value to a 1-byte string (analogous to what
Good point. Ordinary (UTF-K) strings need methods for iterating over two sequences---elements, and Unicode code points---and both of these usefully come in two flavors: singleton substrings, and numeric codes. Byte strings don't need the code point iterator, but they would want both substring and numeric flavors unless there's a convenient and cheap alternative conversion. [Update: ord is that conversion, from 1-element strings to numbers.]
No. It's redundant wrt bytes(string) and str(bytes).
No, I don't think it is important to support odd encodings. Users can add encoding functions to their applications if they want, but the world is moving to UTF-8. (The utf8everywhere.org manifesto is a good read.)
No. Such things could in principle be added later, but I don't see a compelling need, and I doubt one will arise. I really don't want to add callback functions to these simple interfaces.
Agreed.
Exactly. The important thing is that for any "abc" that is legal in both "abc" and b"abc" literals, we have the invariant
That seems like the best approach; it's the one taken by Go's strings.Trim (https://golang.org/pkg/strings/#Trim), for example. The implementation could be simpler than what you suggest: instead of materializing the sequence of code points, you can decode the input string and test each code point as you go. The most common cutset is a single ASCII byte, followed by a sequence of ASCII bytes, both of which make the decoding and the testing trivial. Rarely is the cutset non-ASCII or large, so the quadratic implementation is fine for those cases.
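A rough sketch of the decode-as-you-go idea, written in Python for the left side of a strip only and assuming UTF-8 input (K = 8). The helper names `utf8_len` and `trim_left` are hypothetical, and this is not the code of any Starlark implementation; it just shows that no code point sequence needs to be materialized.

```python
def utf8_len(lead: int) -> int:
    """Return the encoded length implied by a UTF-8 lead byte (1 if not a valid lead)."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # continuation or invalid byte: treat as a single element


def trim_left(data: bytes, cutset: str) -> bytes:
    """Drop leading code points of UTF-8 `data` that occur in `cutset`."""
    i = 0
    while i < len(data):
        n = utf8_len(data[i])
        try:
            code_point = data[i:i + n].decode("utf-8")
        except UnicodeDecodeError:
            break  # invalid text is never a member of the cutset
        # Linear scan of the cutset per code point: quadratic overall,
        # which is acceptable because cutsets are usually tiny.
        if code_point not in cutset:
            break
        i += n
    return data[i:]


trimmed = trim_left("ééabcé".encode("utf-8"), "é")
```

The right-hand side would be symmetric, scanning backwards to the start of each encoded code point.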
I'd vote for lowercase only.
Starlark should too.
Agreed.
Agreed.
That's no doubt because Python doesn't mandate a source file encoding, but Starlark does (though Bazel screws things up here).
Agreed.
That's good to know. This provides a means of efficiently mapping from 1-element substrings to numbers, and means we could do away with the numeric flavor of the 1-element substring iterator for both strings and for bytes.
That seems reasonable to add. It can be defined in the language (or rather, it could be if we had
Yes. I recently reviewed a change to RE2/J which generated these tables from ICU, instead of from Go's unicode package:
It sure does seem niche, not to mention an intriguing avenue for security attacks. We should definitely not support it.
Seems like a relic of Python's old string handling, and not something we should support. bytes are not text.
No. Again, bytes are not text. |
1-element slices would work. Of course that still allocates, but that's unavoidable in Java if you're creating a (non-primitive) value, no?
Then is the proposal to have
I imagine other constructs in Starlark could be considered redundant. List
That does look like a good read. I can understand having a bias towards UTF-8, but I wonder how much of it is aspirational and how much of it reflects today's reality. Will lack of built-in support for other encodings be problematic to users? I wonder: What do we do in Bazel when writing things like command line args of an action? Do we just never serialize any non-ASCII text in practice? Probably. But if we handle it correctly, the transcoding differs between windows and unix.
If we reach a point where callbacks are being requested, I'd agree that we can just tell people to write a library function. But strictness seems like a basic enough feature that we can make it an option (agreed that it doesn't have to be done right now). As for whether U+FFFD replacement is a sensible default: the argument against strictness is that you don't want to see failures when somebody throws bad data at you, you'd rather just process it and give garbage out for their garbage in? Ordinarily as a rule we prefer to fail fast. I'm undecided here. Python is strict by default. Apparently strict means erroring out on surrogate code points, but not on U+FFFE and U+FFFF. What was the root of Python 2's mess with unicode, where the moment the code handled non-ASCII data you'd suddenly see failures that your tests didn't catch? Was that related to the encoding strictness, or only to the loose typing between bytes and strings?
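For reference, the two behaviors being weighed look like this in Python 3, where `errors="strict"` is the default for `bytes.decode` and `errors="replace"` substitutes U+FFFD. This only illustrates Python's behavior; it is not a statement of what Starlark will do, and the variable names are invented for the example.

```python
# Valid UTF-8 for "café " followed by a lone 0xFF byte, which is never valid UTF-8.
raw = b"caf\xc3\xa9 \xff"

# Strict (Python's default): fail fast on the first invalid sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Replace: keep going, substituting U+FFFD for the invalid byte.
replaced = raw.decode("utf-8", errors="replace")

# Re-encoding the replaced text does not give back the original bytes,
# so replacement makes the bytes -> str -> bytes round trip lossy.
roundtrip = replaced.encode("utf-8")
```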
Yeah, I'd kinda like to ban uppercase as non-conventional and weird.
Hm. According to SO, Java doesn't specify the source code format below the level of unicode code points. Is it in any way problematic to require UTF-8, beyond e.g. breaking the workflows of the masses of people clamoring to write Starlark code in Windows Notepad?
So for Sticking to Python's treatment of unicode character metadata, and abandoning Python's loose byte typing w.r.t. |
This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. This change removes go.starlark.net.lib.proto.Bytes. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. This change removes go.starlark.net.lib.proto.Bytes. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
I aggregated the issues raised in this discussion thread so far.
Resolved
Open issues
Iteration API
Exactly what
Future extensions to encoding/decoding
If we support more encodings or more modes (strictness, etc.) in the future, what would that look like and would it interfere with any decisions we’re making now? I think I'm ok with the type names being the principal conversion operator (as opposed to
FWIW, this summed up what I was probably thinking of.
Surrogates
Under what circumstances does unpaired surrogate replacement by U+FFFD occur? Does it happen when decoding the source text as UTF-8 (which technically shouldn't allow unpaired surrogates as content)? If so, that means unpaired surrogates would be transformed even when appearing inside bytes literals, though the transformation would occur at the source decoding level, not the lexer or parser. Is that within scope of the spec? Does it happen when reading a Does it happen when you convert between strings and bytes using
Note that U+FFFD replacement is the only thing standing in the way of having bytes <-> str conversion be lossless when k = 8.
1-byte strings
Actually, I suppose we can simply preallocate the 256 length-1 bytes values. But I see now we're talking about going from an integer to a bytes value. Python apparently has bytes(size) -> bytes value with size many 0x00 values. I presume we don’t want to carry over that constructor, but its presence in Python blocks us from doing bytes(octet) -> bytes([octet]). Maybe we could use a keyword parameter, e.g. bytes(octet=x) -> bytes([x])?
Strange values of k
In the spec, do we support k = 7 (UTF-7) or k = 32 (UTF-32)? Or are we explicitly mandating that k is either 8 or 16? |
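The constructor clash described in the 1-byte strings comment is easy to see in Python 3 itself: an integer argument is a size, and a one-element iterable is how you get a single byte. The variable names are just for illustration.

```python
# bytes(int) allocates that many zero bytes; it does not mean "the byte with this value".
zeros = bytes(3)

# A single byte with a given value has to go through an iterable of ints.
single = bytes([65])

print(zeros, single)
```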
See my comments on the starlark-go PR. I think the unifying theme of the problem with elems/ord is that we need a convenient, orthogonal, and obvious way to construct and deconstruct arbitrary sequences of k-bit elements. The restriction on non-ASCII numeric escapes in strings doesn't do that. The unicode-centric interpretation of Instead, what if we make For strings, we also provide a top-level Going from integers to string or byte elements is less uniform. We have Note that the possible Edit: You may argue that this makes strings too similar to bytes. But if we really wanted them to be so dissimilar, wouldn't we define them as sequences of code points like Python does? |
Notes from my discussion with Alan follow. Corrections from my summary above:
Construction
The constructor syntax we were going to use differs from Python. In Python:
We should aim for consistency with Python. At the very least, this means disallowing Unlike Python, we should require that We omit Python's The We may support
Strictness
Similarly, we should side with Python's default of failing fast when an encoding error is detected. This behavior can be overridden by the While decoding (utf-8-encoded) Starlark source code, an invalid surrogate code point should result in an error.
Value of k
We specify that k = 8 and k = 16 are the only valid implementations. No k = 32 in particular.
Elements and encodings
Manipulating string elements is a first-class concept regardless of unicode encodings. We add a It is illegal to use The For strings we also provide a
All three methods could take in a This gives us a complete elements-oriented API for dealing with strings and bytes, as well as a separate unicode-oriented API for strings. Note that whereas the format of a bytes object can be customized by the We do need to give some thought to how to effectively handle strict failure, given that Starlark does not have exceptions. It seems undesirable to add look-before-you-leap variants of these methods, or even a keyword param that changes it from a results-producing-call to a safety check. Maybe we can have "strict" mean "returns |
This change adds initial specification of the bytes data type following lengthy discussion in bazelbuild#112. It also explains the implementation-dependent encoding of text strings, and the \u \U \X escapes. More will follow, but let's get the easy parts out of the way first. Updates bazelbuild#112 Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote, but uses \X to escape raw code units as needed. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \X or \u escapes instead. Updates bazelbuild/starlark#112 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote, but uses \X to escape raw code units as needed. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \X or \u escapes instead. Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
This change adds initial specification of the bytes data type following lengthy discussion in bazelbuild#112. It also explains the implementation-dependent encoding of text strings, and the \u and \U escapes. More will follow, but let's get the easy parts out of the way first. Updates bazelbuild#112 Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a
* spec: bytes data type This change adds initial specification of the bytes data type following lengthy discussion in #112. It also explains the implementation-dependent encoding of text strings, and the \u and \U escapes. More will follow, but let's get the easy parts out of the way first. Updates #112
This change defines the semantics of: - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit BE FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates bazelbuild#112 Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7
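For readers unfamiliar with the hash named in this commit message, here is a minimal Python sketch of the standard 32-bit FNV-1a algorithm. The function name `fnv1a_32` is hypothetical, and whether a given Starlark implementation's `hash(bytes)` matches this bit for bit (signedness, the "BE" qualifier) is not something this thread confirms.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the state to 32 bits
    return h


empty_hash = fnv1a_32(b"")
a_hash = fnv1a_32(b"a")
```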
This change defines the semantics of: - str(bytes) -- UTF-k decoding with U+FFFD replacement - bytes(str) -- UTF-k encoding with U+FFFD replacement - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates bazelbuild#112 Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7
This change defines the semantics of: - str(bytes) -- UTF-k decoding with U+FFFD replacement - bytes(str) -- UTF-k encoding with U+FFFD replacement - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates #112 Co-authored-by: Alan Donovan <alan@alandonovan.net>
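Python 3's `bytes` happens to behave the same way for the iteration and membership operations listed in this commit message, so it can serve as a quick mental model. This shows Python, not Starlark, and the variable names are invented for the example.

```python
data = b"abc"

# Iterating yields the int value of each byte, like the elems() described above.
elem_values = list(data)

# A bytes operand performs a substring test.
has_substring = b"bc" in data

# An int operand performs an element membership test.
has_element = 98 in data  # 98 is the value of b"b"
missing_element = 100 in data
```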
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222
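The distinction behind the incompatible change can be illustrated in Python, assuming K = 8: a `\u` escape for a code point in the range U+80 to U+FF denotes its two-byte UTF-8 encoding, whereas the single byte with the same numeric value is not valid text on its own. Python string literals themselves follow different rules, so the snippet below only demonstrates the encodings involved.

```python
# U+00E9 as text: its UTF-8 encoding is two bytes, not the single byte 0xE9.
text_encoding = "\u00e9".encode("utf-8")

# The lone byte 0xE9 is what an old-style "\xe9" string literal denoted when K = 8.
lone_byte = b"\xe9"

# That single byte is not valid UTF-8, which is why it now belongs in a bytes literal.
try:
    lone_byte.decode("utf-8")
    lone_byte_is_text = True
except UnicodeDecodeError:
    lone_byte_is_text = False
```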
Bazel's Starlark does not provide access to a string's bytes, only its codepoints, so we are unable to do this escaping in Starlark. So a second pass is needed, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. Fixes bazel-contrib#794
This proposal argues for the addition of a new 'bytes' data type to Starlark, defined as an immutable sequence of byte (uint8) values.
Starlark has implementations in Go, Java, and Rust. It is crucial for performance that Starlark use the same string data type as the host language, but this poses a problem, because Java strings are sequences of uint16 values, conventionally holding UTF-16 encoded text, whereas Go's strings are sequences of uint8 values, conventionally holding UTF-8 encoded text.
The solution to this problem is to specify the behavior of Starlark strings in terms of an implementation-defined parameter K, which is 8 for Go and 16 for Java. Nearly all string operations over text can be usefully described this way, enabling portable text handling. (The sole obvious exception is `str.strip(chars='abc')`, as it treats the chars string as a set rather than an ordered sequence.) [This needs its own proposal. TODO(adonovan)]
This leaves the Java implementation without a way to represent arbitrary byte strings, such as the contents of arbitrary binary files, or fields of protocol buffers. This need would be met by adding a new data type, 'bytes', to the core language:
- `b"ABC"` would denote a byte string, analogous to how "ABC" denotes a UTF-K string.
- Byte string literals would permit hex escapes `\xXX` and octal escapes `\377` over the range [0-255]. Ordinary strings would permit hex and octal escapes only for the ASCII range [0-127], but would additionally allow \uXXXX and \UXXXXXXXX escapes to denote the UTF-K encoding of a unicode code point specified in hex.
- `type(bytes)` would return `"bytes"`.
- `len(bytes)` would return the number of bytes.
- `bool(bytes)` would be equivalent to `len(bytes) > 0`.
- `bytes[int]` would yield a 1-element byte string.
- The `elems` method would iterate over the 1-byte substrings, and `elem_ords` would iterate over the numeric byte values.
- `str(bytes)` would decode bytes to UTF-K, replacing invalid sequences by U+FFFD.
- `bytes(str)` (that is, a new predeclared function called "bytes") would encode a text string to UTF-8, replacing invalid code units with the encoding of U+FFFD. (If K=8 and str is a valid encoding, the result is the same as the input.)
- `repr(bytes)` would return a byte string literal such as `b"ABC"`.
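To make the proposed operations easier to picture, here is a small Python model of the list above as originally written (later comments in the thread revise some of it, for example what `elems` yields). The class `ProposedBytes` and its method `to_str` are hypothetical names for this sketch, Python's `errors="replace"` stands in for the U+FFFD replacement rule, and K = 8 is assumed. It is not the API of any Starlark implementation.

```python
class ProposedBytes:
    """Hypothetical stand-in for the proposed Starlark bytes type (K = 8 assumed)."""

    def __init__(self, data: bytes = b""):
        self._data = bytes(data)  # immutable copy

    def __len__(self) -> int:
        return len(self._data)  # number of bytes

    def __bool__(self) -> bool:
        return len(self._data) > 0

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ProposedBytes) and self._data == other._data

    def __hash__(self) -> int:
        return hash(self._data)

    def __getitem__(self, index: int) -> "ProposedBytes":
        # Indexing yields a 1-element byte string, unlike Python 3,
        # where bytes[int] yields an int.
        return ProposedBytes(bytes([self._data[index]]))

    def elems(self) -> list:
        # The 1-byte substrings, in order.
        return [ProposedBytes(bytes([b])) for b in self._data]

    def elem_ords(self) -> list:
        # The numeric byte values, in order.
        return list(self._data)

    def to_str(self) -> str:
        # Models str(bytes): decode, replacing invalid sequences by U+FFFD.
        return self._data.decode("utf-8", errors="replace")


sample = ProposedBytes(b"ABC")
second = sample[1]
ords = sample.elem_ords()
decoded_invalid = ProposedBytes(b"\xff").to_str()
```

Modelling `bytes(str)` is deliberately left out, because Python's encoder substitutes `?` rather than the encoding of U+FFFD when asked to replace unencodable input.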