-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schema: Fix naming for PositionEncoding cases #225
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave it to @kritzcreek to stamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd maybe use the term Unicode Scalar Value (USV) instead of UTF-32, as otherwise you get into trouble with things like UTF32-BE vs UTF32-LE.
Shouldn't you also rename UTF8ByteOffsetFromLineStart
to UTF8CodeUnitOffsetFromLineStart
for consistency with the utf-16
measure?
Looks like UTF-16 also has big-endian and little-endian versions. So should we add some clarification there too? |
To discuss this a bit more concretely, the offsets for the BE/LE case are the same. The only thing that differs is the order of interpreting the bytes. In the common case, we'll have UTF-8 encoded text. Say we got offsets from a SCIP indexer implemented in Python. We want to map those to UTF-8 offsets for consistency. In that case, we can convert the UTF-8 text to UTF-32BE or UTF-32LE (whichever we want), and then index that array using the offsets provided in the SCIP index. We then map those offsets back to UTF-8 offsets using a line table. Since the target encoding is determined by us (and independent of the indexer), we can correctly interpret the bytes->scalar values without asking the indexer to convey whether it used BE or LE. So long as the offsets are either in BE or LE, then we're fine. If the SCIP indexer had a byte order mark at the start of the stream, then it could end up measuring offsets that are off by 1 at the start of the line. I think it's fine to mandate that the SCIP indexer not include any byte order mark for measuring offsets. |
Okay, reading the specification for utf-32 a bit closer, it looks like it also explicitly forbids any lone surrogates. In that case USV or UTF-32 is equivalent here.
|
@kritzcreek I've renamed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Used some wrong terminology earlier. Technically, this is a breaking change,
but we literally just landed this change yesterday and didn't cut a release
after that, so it should be OK.
Test plan
n/a