-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spec: more bytes operations #164
Conversation
merge upstream
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
Some comments on what is (and what is not) in this PR:
|
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
spec.md
Outdated
@@ -2110,19 +2114,20 @@ these operators. | |||
#### Membership tests | |||
|
|||
```text | |||
any in sequence (list, tuple, dict, string) | |||
any in sequence (list, tuple, dict, string, bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Here and immediately below) Also range objects, and possibly other view-like objects (even an elems()
iterable).
Separately, I wonder if the spec should say anything about which operations an application-defined type may elect to participate in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally the spec would get into the business of exhaustively listing all possible types, instead referring to abstractions like "indexable sequence" and given examples of one or to instances.
I've added range here and in a couple places, but it's a slippery slope.
which operations an application-defined type may elect to participate in.
I think that's entirely up to the implementation, and essentially all operators are fair game.
If x is a `bytes`, the result is x. | ||
|
||
If x is a string, the result is a `bytes` whose elements are | ||
the UTF-8 encoding of the string. Each element of the string that is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's multiple ways to apply U+FFFD replacement to invalid UTF-8 data. Looking at Section 3.9 of the unicode standard (see heading "Constraints on Conversion Process"), it appears the only hard requirement is that no valid sequence of code units encoding a single code point be consumed by the replacement process. But there's flexibility on whether each code unit that is not part of a valid sequence gets its own replacement character, or whether consecutive such units get batched into just one replacement character.
I imagine the algorithm used might vary across string encoding libraries in different Starlark implementation's host languages. Which means we may want to allow for implementation-defined behavior here.
Of course, that's for decoding, not encoding. I suppose bytes(str)
is thought of as applying a composition of decoding the string's elements from UTF-K and re-encoding them to UTF-8, even when K = 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Unicode spec doesn't prescribe it exactly, and this question has a history; see https://hsivonen.fi/broken-utf-8/Unicode. But it's safer for us to prescribe the behavior here, as that very document points out:
Web standards these days tend to avoid implementation-defined behavior due to the painful experience of Web sites developing dependencies on the quirks of particular browsers in areas that old specs considered error situations not worth spelling out precise processing requirements for. Therefore, there has been a long push towards well-defined behavior even in error situations on the Web without debating each particular error case...
I chose this behavior because it's easy to implement, and it's the one required by the Go spec, and if I recall correctly, Java's String.getBytes(utf8) uses '?' as its replacement (!!), and thus isn't compatible with any reading of Unicode, so we'll have to handle the conversion ourselves anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is consistent with practices in Go, it's good enough for me.
In other words `x == y` implies `hash(x) == hash(y)`. | ||
Any other type of argument in an error, even if it is suitable as the key of a dict. | ||
|
||
In the interests of reproducibility of Starlark program behavior over time and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interests of reproducibility
Which is exactly what Python aims to avoid for the sake of thwarting collision DoS attacks... Can't be helped, I suppose. Do we have the option to extend hash()
with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but Starlark prizes determinism higher than DoS protection. This was a material problem when I implemented Herb: execution of BUILD files would vary based on the hash function, causing (for example) different elements to be chosen from a set, causing different results, and in some cases, errors only in one implementation.
Do we have the option to extend hash() with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.
No, we don't have an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, put it this way: If Starlark is ever used by an application that wants DoS protection and values it over determinism, they'll do what they want in the implementation (to the extent that they control their interpreter) and ignore the spec. So it's just a choice of whether we bless it or not, and I don't feel strongly either way.
This change defines the semantics of: - str(bytes) -- UTF-k decoding with U+FFFD replacement - bytes(str) -- UTF-k encoding with U+FFFD replacement - bytes.elems() -- iterable of int values of byte elements - hash(bytes) -- 32-bit FNV-1a hash - bytes in bytes -- substring test - int in bytes -- element membership test Updates bazelbuild#112 Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
spec: more bytes operations
Updates #112