
schema: Fix naming for PositionEncoding cases #225

Merged (2 commits) on Dec 13, 2023

Conversation

varungandhi-src (Contributor)

Used some wrong terminology earlier. Technically, this is a breaking change,
but we literally just landed this change yesterday and didn't cut a release
after that, so it should be OK.

Test plan

n/a

@olafurpg (Member) left a comment


I'll leave it to @kritzcreek to stamp

@kritzcreek (Contributor) left a comment


I'd maybe use the term Unicode Scalar Value (USV) instead of UTF-32, as otherwise you get into trouble with things like UTF32-BE vs UTF32-LE.

Shouldn't you also rename UTF8ByteOffsetFromLineStart to UTF8CodeUnitOffsetFromLineStart for consistency with the utf-16 measure?
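To make the naming distinction concrete, here is a hypothetical Python sketch (not code from this PR) showing how the same column measures differently depending on whether you count UTF-8 code units (bytes), UTF-16 code units, or Unicode scalar values:

```python
# Hypothetical illustration: the column of 'b' measured three ways.
# U+1D11E (MUSICAL SYMBOL G CLEF) is 4 bytes in UTF-8 and a surrogate
# pair (2 code units) in UTF-16, but a single Unicode scalar value.
s = "a\U0001D11Eb"

prefix = s[:2]  # everything before 'b' (Python indexes str by scalar value)
utf8_units = len(prefix.encode("utf-8"))            # 1 + 4 = 5
utf16_units = len(prefix.encode("utf-16-le")) // 2  # 1 + 2 = 3
scalar_values = len(prefix)                         # 2

print(utf8_units, utf16_units, scalar_values)  # 5 3 2
```

All three are "offsets from line start", which is why folding the unit into the case name (ByteOffset vs. CodeUnit) matters.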

@varungandhi-src (Contributor, Author) commented on Dec 13, 2023

> I'd maybe use the term Unicode Scalar Value (USV) instead of UTF-32, as otherwise you get into trouble with things like UTF32-BE vs UTF32-LE.

Looks like UTF-16 also has big-endian and little-endian versions. So should we add some clarification there too?

[Screenshot: CleanShot 2023-12-13 at 17 11 55@2x]

@varungandhi-src (Contributor, Author)

To discuss this a bit more concretely, the offsets for the BE/LE case are the same. The only thing that differs is the order of interpreting the bytes.

In the common case, we'll have UTF-8 encoded text. Say we got offsets from a SCIP indexer implemented in Python. We want to map those to UTF-8 offsets for consistency. In that case, we can convert the UTF-8 text to UTF-32BE or UTF-32LE (whichever we want), and then index that array using the offsets provided in the SCIP index. We then map those offsets back to UTF-8 offsets using a line table. Since the target encoding is determined by us (and independent of the indexer), we can correctly interpret the bytes->scalar values without asking the indexer to convey whether it used BE or LE.

So long as the offsets are measured in either BE or LE, we're fine. If the SCIP indexer included a byte order mark at the start of the stream, it could end up measuring offsets that are off by one on the first line. I think it's fine to mandate that the SCIP indexer exclude any byte order mark when measuring offsets.
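The round trip described above can be sketched as follows (a hypothetical illustration, not code from this repository). Because Python strings are indexed by scalar value, slicing by the indexer-provided offset and re-encoding yields the UTF-8 byte offset without ever caring about byte order:

```python
# Hypothetical sketch: map a scalar-value (UTF-32) column offset within
# a line back to a UTF-8 byte offset from line start. The BE/LE question
# never arises: we only count code points, not interpret raw bytes.

def scalar_offset_to_utf8_byte_offset(line: str, scalar_offset: int) -> int:
    """UTF-8 byte offset of the code point at position scalar_offset."""
    return len(line[:scalar_offset].encode("utf-8"))

# 'a' is 1 byte, 'é' is 2 bytes, '漢' is 3 bytes in UTF-8.
line = "aé漢b"
assert scalar_offset_to_utf8_byte_offset(line, 0) == 0
assert scalar_offset_to_utf8_byte_offset(line, 1) == 1  # after 'a'
assert scalar_offset_to_utf8_byte_offset(line, 2) == 3  # after 'é'
assert scalar_offset_to_utf8_byte_offset(line, 3) == 6  # after '漢'
```

A real implementation would precompute a per-line table rather than re-encode a prefix for every lookup, but the unit conversion is the same.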

@kritzcreek (Contributor)

Okay, reading the UTF-32 specification a bit more closely, it looks like it also explicitly forbids lone surrogates. In that case, USV and UTF-32 are equivalent here.

> As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the ranges 0..D7FF_16 and E000_16..10FFFF_16, that is, Unicode scalar values. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms.

@varungandhi-src (Contributor, Author)

@kritzcreek I've renamed ByteOffset to CodeUnit, but I've avoided the BE/LE discussion since I don't think it is likely to be an issue here (e.g. in a PL with 0-based string indexing, you'd expect index 0 to give the first code unit in whatever encoding, not a byte order mark).

@kritzcreek (Contributor) left a comment


LGTM
