
schema: Fix naming for PositionEncoding cases #225

Merged (2 commits) on Dec 13, 2023

Conversation

varungandhi-src (Contributor)

Used some wrong terminology earlier. Technically, this is a breaking change,
but we literally just landed this change yesterday and didn't cut a release
after that, so it should be OK.

Test plan

n/a

@olafurpg (Member) left a comment


I'll leave it to @kritzcreek to stamp

@kritzcreek (Contributor) left a comment


I'd maybe use the term Unicode Scalar Value (USV) instead of UTF-32, as otherwise you get into trouble with things like UTF32-BE vs UTF32-LE.

Shouldn't you also rename UTF8ByteOffsetFromLineStart to UTF8CodeUnitOffsetFromLineStart for consistency with the utf-16 measure?
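To make the naming distinction concrete, here is a hypothetical Python sketch (not code from this PR) showing how the same column measures differently depending on whether you count UTF-8 code units (bytes), UTF-16 code units, or Unicode scalar values:

```python
# Hypothetical illustration: the column of 'b' measured three ways.
# U+1D11E (MUSICAL SYMBOL G CLEF) is 4 bytes in UTF-8 and a surrogate
# pair (2 code units) in UTF-16, but a single Unicode scalar value.
s = "a\U0001D11Eb"

prefix = s[:2]  # everything before 'b' (Python indexes str by scalar value)
utf8_units = len(prefix.encode("utf-8"))            # 1 + 4 = 5
utf16_units = len(prefix.encode("utf-16-le")) // 2  # 1 + 2 = 3
scalar_values = len(prefix)                         # 2

print(utf8_units, utf16_units, scalar_values)  # 5 3 2
```

All three are "offsets from line start", which is why folding the unit into the case name (ByteOffset vs. CodeUnit) matters.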

@varungandhi-src (Contributor, Author) commented on Dec 13, 2023

> I'd maybe use the term Unicode Scalar Value (USV) instead of UTF-32, as otherwise you get into trouble with things like UTF32-BE vs UTF32-LE.

Looks like UTF-16 also has big-endian and little-endian versions. So should we add some clarification there too?

[Screenshot: CleanShot 2023-12-13 at 17 11 55@2x]

@varungandhi-src (Contributor, Author)

To discuss this a bit more concretely, the offsets for the BE/LE case are the same. The only thing that differs is the order of interpreting the bytes.

In the common case, we'll have UTF-8 encoded text. Say we got offsets from a SCIP indexer implemented in Python. We want to map those to UTF-8 offsets for consistency. In that case, we can convert the UTF-8 text to UTF-32BE or UTF-32LE (whichever we want), and then index that array using the offsets provided in the SCIP index. We then map those offsets back to UTF-8 offsets using a line table. Since the target encoding is determined by us (and independent of the indexer), we can correctly interpret the bytes->scalar values without asking the indexer to convey whether it used BE or LE.

So long as the offsets are measured in either BE or LE, we're fine. If the SCIP indexer included a byte order mark at the start of the stream, it could end up measuring offsets that are off by one on the first line. I think it's fine to mandate that the SCIP indexer exclude any byte order mark when measuring offsets.
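The round trip described above can be sketched as follows (a hypothetical illustration, not code from this repository). Because Python strings are indexed by scalar value, slicing by the indexer-provided offset and re-encoding yields the UTF-8 byte offset without ever caring about byte order:

```python
# Hypothetical sketch: map a scalar-value (UTF-32) column offset within
# a line back to a UTF-8 byte offset from line start. The BE/LE question
# never arises: we only count code points, not interpret raw bytes.

def scalar_offset_to_utf8_byte_offset(line: str, scalar_offset: int) -> int:
    """UTF-8 byte offset of the code point at position scalar_offset."""
    return len(line[:scalar_offset].encode("utf-8"))

# 'a' is 1 byte, 'é' is 2 bytes, '漢' is 3 bytes in UTF-8.
line = "aé漢b"
assert scalar_offset_to_utf8_byte_offset(line, 0) == 0
assert scalar_offset_to_utf8_byte_offset(line, 1) == 1  # after 'a'
assert scalar_offset_to_utf8_byte_offset(line, 2) == 3  # after 'é'
assert scalar_offset_to_utf8_byte_offset(line, 3) == 6  # after '漢'
```

A real implementation would precompute a per-line table rather than re-encode a prefix for every lookup, but the unit conversion is the same.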

@kritzcreek (Contributor)

Okay, reading the UTF-32 specification a bit more closely, it looks like it also explicitly forbids lone surrogates. In that case, USV and UTF-32 are equivalent here.

> As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code points in the ranges 0..D7FF_16 and E000_16..10FFFF_16, that is, Unicode scalar values. This guarantees interoperability with the UTF-16 and UTF-8 encoding forms.

@varungandhi-src (Contributor, Author)

@kritzcreek I've renamed ByteOffset to CodeUnit, but I've avoided the BE/LE discussion since I don't think it is likely to be an issue here (e.g. in a PL with 0-based string indexing, you'd expect index 0 to give the first code unit in whatever encoding, not a byte order mark).

@kritzcreek (Contributor) left a comment


LGTM
