Tracking Issue for CharIndices::offset function #83871

TrolledWoods · 2021-04-05T07:15:23Z

Feature gate: #![feature(char_indices_offset)]

This is a tracking issue for the function CharIndices::offset. It returns the byte position of the next character, or the length of the underlying string if there are no more characters. This is useful for getting ranges over strings you're iterating over.

Public API

let mut chars = "a楽".char_indices();

assert_eq!(chars.offset(), 0);
assert_eq!(chars.next(), Some((0, 'a')));

assert_eq!(chars.offset(), 1);
assert_eq!(chars.next(), Some((1, '楽')));

assert_eq!(chars.offset(), 4);
assert_eq!(chars.next(), None);
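
To make the "ranges over strings" use case concrete, here is a minimal sketch, assuming `offset()` is available on your toolchain (on nightly this needs the feature gate above). The helper name `first_word_range` is hypothetical and exists only for illustration.

```rust
use std::ops::Range;

/// Returns the byte range of the first run of non-whitespace characters.
/// `first_word_range` is a hypothetical helper, not a standard library API.
fn first_word_range(s: &str) -> Option<Range<usize>> {
    let mut chars = s.char_indices();

    // Skip leading whitespace and remember where the word starts.
    let start = loop {
        let (i, c) = chars.next()?;
        if !c.is_whitespace() {
            break i;
        }
    };

    // After each `next()`, `offset()` is the byte position just past the
    // character that was returned, so it can serve as the end of the range.
    let mut end = chars.offset();
    while let Some((_, c)) = chars.next() {
        if c.is_whitespace() {
            break;
        }
        end = chars.offset();
    }
    Some(start..end)
}
```

The returned range can be used to slice the original string directly, and no special case is needed when the word runs to the end of the input.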

Steps / History

Implementation: Added CharIndices::offset function #82585
Final commenting period (FCP)
Stabilization PR

Unresolved Questions

Bad API Tracking Issue for CharIndices::offset function #83871 (comment)

GilRtr · 2021-07-04T21:35:39Z

Is there anything holding this back?

pmetzger · 2021-10-31T01:17:55Z

I just realized I desperately wanted this today. As it stands, assembling a range of characters in a string is rather unpleasant: you can't easily get the byte offset of the next character to use as the end of the range, so constructing a slice when all you have is the last character is awkward.

TrolledWoods · 2022-02-16T16:35:13Z

I realized I sort of forgot about this, sorry about that. Is there anything I need to do to move this along toward stabilization?

pmetzger · 2022-02-16T17:53:37Z

No idea but I'd really love to see it stabilized.

Jay-Madden · 2022-07-12T01:29:44Z

Agreed, this would be very useful

cogsandsquigs · 2022-09-29T01:46:32Z

Yeah, it'd be really nice to have this finally released in the stable builds. Any progress on this?

pmetzger · 2022-10-01T00:59:39Z

@TrolledWoods Maybe you can ask around on the Rust dev fora about how to progress this?

jdahlstrom · 2022-10-04T10:26:08Z

As I learned a while back on u.r.l.o, there's also char::len_utf8 that can be used to construct a slice representing (or ending at) the current char.
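
A minimal sketch of that workaround, using only stable APIs. The helper name `prefix_through` is hypothetical; the point is just that the end of the current char is its start index plus `len_utf8()`.

```rust
/// Returns the prefix of `s` up to and including the first occurrence of `target`.
/// `prefix_through` is a hypothetical helper used only for illustration.
fn prefix_through(s: &str, target: char) -> Option<&str> {
    for (i, c) in s.char_indices() {
        if c == target {
            // The end of the current char is its start plus its UTF-8 length.
            return Some(&s[..i + c.len_utf8()]);
        }
    }
    None
}
```

This covers slices that end at the current char, but it does not give a position once the iterator is exhausted.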

m-ou-se · 2022-11-26T13:47:50Z

@rfcbot merge

rfcbot · 2022-11-26T13:47:51Z

Team member @m-ou-se has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

~~bad API~~ resolved by Tracking Issue for CharIndices::offset function #83871 (comment)
~~better-name~~ resolved by Tracking Issue for CharIndices::offset function #83871 (comment)

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

Amanieu · 2022-12-06T16:14:21Z

In my experience, additional methods on iterator types tend to be hard/unergonomic to use. In this case it would be better to return the offset as part of a tuple in the iterator's item type.

@rfcbot concern bad API

BurntSushi · 2022-12-28T14:39:39Z

Can someone show a small but real world program where this method is used?

Also @Amanieu, I think that suggestion would require building a new iterator type just for this. Probably an offset() method, if the use case makes sense, is the less-bad option.

Amanieu · 2022-12-29T20:38:06Z

> Also @Amanieu, I think that suggestion would require building a new iterator type just for this. Probably an offset() method, if the use case makes sense, is the less-bad option.

I disagree, methods on iterators are very difficult to use ergonomically. Consider Iterator::peekable: it provides useful functionality, but is such a pain to use that I actively avoid it wherever I can. In that specific case the only reason we didn't go with the approach of returning a new iterator that yields (T, Option<&T>) is that the lifetime could not be expressed due to the lack of GATs.

If character offsets are really needed then they should be returned through a proper iterator interface that is compatible with all the existing iterator combinators.

pmetzger · 2023-05-03T14:49:52Z

I have only a big real world program where I needed it, not a small one. I had a hand-written lexical analyzer that needed it. However, in the intervening years, I replaced the lexical analyzer with one generated by re2c now that it produces rust code, so I don't have an example handy any more.

solson · 2023-06-26T00:27:13Z

> In this case it would be better to return the offset as part of a tuple in the iterator's item type.

@Amanieu You seem to have gotten a bit confused here - your suggestion is precisely the CharIndices iterator that already exists. Its whole deal is returning the offset as part of a tuple in its item type. This proposal is to add a method to CharIndices to access the current position (which it always knows) before advancing, or when at the end.

The implementation PR's description describes a few of the problems I just ran into while trying to use CharIndices in a lexer.

> Can someone show a small but real world program where this method is used?

@BurntSushi Here's an excerpt of something I was just working on, followed by how it looks with CharIndices::offset. Realistically, though, the cleanest approach while this method is unstable is to write a near copy of CharIndices with this method added, rather than working around the edge cases with peeking and manual end tracking.

// Workaround without `offset`: peek for the next index and track the input length by hand.
struct Lexer<'a> {
    input: Peekable<CharIndices<'a>>,
    end: usize, // Initialized to the length of the input.
}

impl<'a> Lexer<'a> {
    fn lex(&mut self) -> Token {
        let Some((start, c)) = self.input.next() else {
            return Token {
                kind: Eof,
                start: self.end,
                end: self.end,
            }
        };

        let kind = match c {
            ...
        };

        // `peek` yields `&(usize, char)`, so destructure through the reference.
        let end = self.input.peek().map(|&(pos, _)| pos).unwrap_or(self.end);
        Token { kind, start, end }
    }
}

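// The same lexer with `CharIndices::offset`: no peeking and no separate `end` field.
struct Lexer<'a> {
    input: CharIndices<'a>,
}

impl<'a> Lexer<'a> {
    fn lex(&mut self) -> Token {
        let Some((start, c)) = self.input.next() else {
            return Token {
                kind: Eof,
                start: self.input.offset(),
                end: self.input.offset(),
            }
        };

        let kind = match c {
            ...
        };

        // The iterator has already advanced past the token, so this is its end.
        let end = self.input.offset();
        Token { kind, start, end }
    }
}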
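
For anyone who wants to try the second version end to end, here is a self-contained sketch in the same shape. The `Kind` variants and the word-matching rule are assumptions made up for this example (the excerpts above elide the real match arms), and it assumes `offset()` is available on your toolchain.

```rust
use std::str::CharIndices;

// Hypothetical token kinds, chosen only to keep the example small.
#[derive(Debug, PartialEq)]
enum Kind {
    Word,
    Other,
    Eof,
}

#[derive(Debug, PartialEq)]
struct Token {
    kind: Kind,
    start: usize,
    end: usize,
}

struct Lexer<'a> {
    input: CharIndices<'a>,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { input: src.char_indices() }
    }

    fn lex(&mut self) -> Token {
        let Some((start, c)) = self.input.next() else {
            // At the end, `offset()` is the length of the input.
            let end = self.input.offset();
            return Token { kind: Kind::Eof, start: end, end };
        };

        let kind = if c.is_alphanumeric() {
            // Extend over following alphanumerics by inspecting the remainder,
            // so the char after the word is not consumed.
            while self.input.as_str().starts_with(char::is_alphanumeric) {
                self.input.next();
            }
            Kind::Word
        } else {
            Kind::Other
        };

        // The iterator now sits just past the token.
        Token { kind, start, end: self.input.offset() }
    }
}
```

Calling `lex` repeatedly on `Lexer::new("ab 楽")` yields a word at 0..2, a single-char token at 2..3, a word at 3..6, and then an Eof token at 6..6.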

Amanieu · 2023-06-26T09:33:43Z

Thanks, this gives a better idea of how this is useful in practice. I believe this API is good, but I don't like the naming: offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

jdahlstrom · 2023-06-26T10:57:26Z

> offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

Indeed, this baffled me a bit at first as well. It's supposed to be the index that you get next (with the difference that it returns one-past-the-end when next would return None).

Furthermore, "offset" is a word alien to the current public API; the type and method names use "indices" and the docs additionally talk about "position". If "offset" is just a synonym to "index" or "position" then one of those words should be used instead, preferably "index" because that's what the API already uses. So next_index or next_char_index?

solson · 2023-06-26T11:14:11Z

Personally, I always think of it as "the current position". I've written near copies of CharIndices in the past and this field would always be named something like position or pos. But it would likely make sense to use "index" as mentioned because of the type name and the phrase "byte index" used in docs.

I'm ambivalent on attaching next_ to the name. It's accurate to think of it simply as the current position of the iterator. The reason .next() returns this value (if not at the end) is because the next char always starts at the current position (but extends beyond it, if multibyte).

In either case it's one of those things that is really straightforward if you go-to-definition into the stdlib source and a bit wordier to explain in prose...

andrewhickman · 2023-06-26T15:11:00Z

Another possible name is peek or peek_index for analogy with the Peekable iterator.

eternaleye · 2023-06-26T16:08:50Z

> Another possible name is peek or peek_index for analogy with the Peekable iterator.

It's still valid at the end, though, which makes that analogy misleading.

m-ou-se · 2023-06-26T16:25:48Z

> Thanks, this gives a better idea of how this is useful in practice. I believe this API is good, but I don't like the naming: offset isn't very clear (what offset?). Perhaps next_offset or next_char_offset would be clearer?

The offset of the iterator into the string it iterates over. 'next offset' or 'next char offset' wouldn't make sense for the offset at the end (when .as_str() is empty), when there is no next item.

> the type and method names use "indices" and the docs additionally talk about "position". If "offset" is just a synonym to "index" or "position" then one of those words should be used instead, preferably "index" because that's what the API already uses.

Calling this index() or position() could work, although then it's less clear we're talking about a byte offset rather than counting chars (like .chars().position()). (But perhaps that ship has sailed by calling this type CharIndices instead of CharOffsets.)

Amanieu · 2023-06-26T16:29:22Z

Thinking about this some more, it seems that what you really want here is a CharRanges iterator that returns a Range<usize> for each character. Would that work better for your use case? As I've said before, I'm not really a fan of methods on iterators since they don't work well with typical use cases for them.
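
No `CharRanges` type exists in the standard library; as a rough sketch of what such an iterator might yield, here is a hypothetical `char_ranges` function built from `char_indices` and `len_utf8`.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a `CharRanges` iterator: yields each char
/// together with the byte range it occupies in `s`.
fn char_ranges(s: &str) -> impl Iterator<Item = (Range<usize>, char)> + '_ {
    s.char_indices().map(|(i, c)| (i..i + c.len_utf8(), c))
}
```

Note that this sketch yields nothing once the string is exhausted, so it cannot report the position at the end of the input.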

m-ou-se · 2023-06-26T16:33:05Z

I recently ran into nearly the same use case as mentioned above. (I currently use Chars::as_str() and use .as_ptr() to do pointer math to get the offset.)

In my use case, I don't just want CharRanges, because I also use the offset for error reporting to know where we are in the input, without having to store the location separately when the iterator already knows anyway.

And without an offset/position/index method, I'd still have to special-case the final location where the iterator returns None but I still need its offset in the input (at the end).
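
The pointer-math workaround mentioned above could look roughly like this. The helper name `offset_in` is hypothetical, and the sketch assumes `chars` was created from `original`; otherwise the subtraction is meaningless.

```rust
use std::str::Chars;

/// Byte offset of `chars` within `original`, computed from pointer addresses.
/// Hypothetical helper; assumes `chars` iterates over `original` itself.
fn offset_in(original: &str, chars: &Chars<'_>) -> usize {
    // `as_str()` is the unconsumed remainder, which starts where the
    // iterator currently is inside `original`.
    chars.as_str().as_ptr() as usize - original.as_ptr() as usize
}
```

It works on stable Rust, but the caller has to keep the original string around alongside the iterator.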

m-ou-se · 2023-06-26T16:36:53Z

> As I've said before, I'm not really a fan of methods on iterators since they don't work well with typical use cases for them.

I use Chars::as_str() quite often to get the unconsumed part of the string back for later use. I don't think I've ever had any issues with its API.

solson · 2023-06-26T16:37:01Z

Some similar code only stores the start index for each token, since you can reconstruct ranges later, so ranges aren't necessary. Ranges also are not sufficient because they can't give you the iterator's position when it's at the end, for an Eof token. (It would unhelpfully give you that value slightly too early, with the last char, before you know you're about to hit Eof.)

On a more general note, I find that additional methods on iterators often make the difference between getting frustrated and doing things by hand instead, or feeling like I'm using a beautiful API that's considered the gaps and edge cases. I'm suspicious of iterators that only support next - they're often missing something important.

From another angle if I just "follow the data" this method exposes something CharIndices must know, and it comes down to whether it needlessly obscures it from me or not.

BurntSushi · 2023-06-26T16:40:43Z

I agree. While I think "consider a new iterator" is a good idea in general, if there's a niche use case that can be served by adding a natural method to an existing iterator type, then that seems like a win to me.

jdahlstrom · 2023-06-26T17:22:26Z

> I'm ambivalent on attaching next_ to the name. It's accurate to think of it simply as the current position of the iterator. The reason .next() returns this value (if not at the end) is because the next char always starts at the current position (but extends beyond it, if multibyte).

Yes, it makes sense from the implementer's point of view, but IMO not the user's. And API naming should of course be user-centric. The fact that next() uses the current state is just an implementation detail. From the user's POV the state change happens when next is called, not before.

m-ou-se · 2023-06-26T17:25:06Z

It'd be an issue if we have a .next() method that modifies the state to advance the iterator and also have a .next_offset() method that does not modify the state and just returns the current position.

WhyNotHugo · 2023-11-27T03:23:23Z

I agree that this method is hard to use "in the usual places where you'd use an iterator". This isn't really a problem: this method is not meant to be used in such cases; it's meant to be used in logic that's dealing specifically with the CharIndices type (typically code that's parsing a string).

This type already has an as_str method, which also exposes part of the underlying data. as_str is also "hard to use" in the usual contexts where you'd use a generic iterator. And it also isn't intended to be used in such places.

From what I understand, the current major holdback is the naming. Perhaps offset_of_next is unambiguous enough?

pmetzger · 2023-11-28T13:54:41Z

Delaying a useful API for long periods because of difficulty finding the right name seems unfortunate. Perhaps someone could just post a poll with a few suggestions and be done with it?

TrolledWoods · 2023-11-29T12:53:18Z

Thinking back on this I agree that offset is not a great name. Since CharIndices already uses "Indices" as the term, maybe index is a decent name? I would think that using next as a prefix/suffix might imply mutation, so if we're going to add something extra maybe peek or current might be reasonable? Though as the as_str method already returns the remaining string and not the whole string, I think it's fairly clear what it does even without any extra attachments.

pmetzger · 2023-11-29T15:39:54Z

One can ask what color to paint the bikeshed endlessly. (This doesn't mean having a good name isn't important, just that the conversation may not terminate, having an okayish name and having the thing available is better than waiting forever for perfection, and someone is going to have to make a decision.)

sffc · 2024-02-03T05:11:01Z

A function with the following signature would be useful to me in the code I'm currently writing:

impl<'a> CharIndices<'a> {
    pub fn current(&self) -> (usize, Option<char>) { ... }
}

However, if the implementation of this function would be no better than char_indices.as_str().get(char_indices.offset()), then it seems fine to return the offset by itself, in which case offset() seems like a fine function name.

Currently I am using Peekable<CharIndices> which is just extremely wasteful and unergonomic.
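
For illustration, here is one way the proposed `current` could be written as a free function on top of `offset()` and `as_str()`. This is a sketch under the assumption that `offset()` is available on your toolchain, not a claim about how the standard library would implement it.

```rust
use std::str::CharIndices;

/// Hypothetical free-function version of the `current` method proposed above.
fn current(it: &CharIndices<'_>) -> (usize, Option<char>) {
    // `as_str()` is the unconsumed remainder, so its first char (if any)
    // is the one the next call to `next()` would return.
    (it.offset(), it.as_str().chars().next())
}
```

Unlike `Peekable`, this does not buffer an item; it decodes the next char again on each call.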

apt1002 · 2024-05-23T10:36:07Z

A work-around for finding the current byte index is to subtract chars.as_str().len() from the length of the original string. This idea has the following advantages:

It avoids calling offset().
It avoids parsing the UTF8 bytes twice.
It avoids a special case for the end of the string.
It's clearer what you mean when iterating through a substring.
It even works with a Chars (as opposed to a CharIndices).

The main disadvantage is that you have to separately record the length of the original string.
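
A minimal sketch of this work-around; the helper name `index_from_len` is hypothetical.

```rust
use std::str::Chars;

/// Current byte index of `chars`, given the byte length of the string it
/// was created from. Hypothetical helper for illustration only.
fn index_from_len(original_len: usize, chars: &Chars<'_>) -> usize {
    // Whatever has not been consumed yet is still in `as_str()`.
    original_len - chars.as_str().len()
}
```

The caller passes `s.len()` for the original string `s`, which is the extra piece of state mentioned above.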

PS I agree with calling it index().

joshtriplett · 2024-08-06T15:24:27Z

I do think this needs a better name, something that makes it clear it's the index of the next character.

@rfcbot concern better-name

But I do think we should have it:

@rfcbot reviewed

dtolnay · 2024-08-06T16:44:35Z

IMO offset() is fine as is, and I don't think a better name has been proposed so far in this issue. I agree with the following 3 comments on why putting next into the name would not be better:

Namely:

The offset it returns can be conceived as the current offset of the iterator. The iterator has already been mutated as part of advancing past the character that it most recently returned. The offset of the character already returned is the offset of the previous character from the point of view of the iterator. The caller already has the offset of the previous character because they received it in the iterator's most recent Item. That item is no longer what the iterator refers to, so I don't see ambiguity about whether the iterator's offset is still that previous location.
There is not necessarily any next character in the iterator. Offset doesn't return the offset of a next character, it returns the current position of the iterator.
Having one next that advances the iterator and returns a tuple of offset and char, and a different …next… that returns offset without char, but does not advance the iterator, is misleading.

The difference in mental model between "previous"/"current" vs "previous"/"next" is similar to something we debated previously in rust-lang/rfcs#2570 (comment). For linked list cursor, it did make sense to me that the cursor points in between linked list nodes, so "before"/"after" was a better way to design the API than "before"/"current".

But notice that this is meaningfully different than the char offsets use case! We redesigned linked list cursors from pointing at a node, to pointing at the boundary between two nodes. But offsets already refer to the boundary between two chars! The char range 0..1 refers to a range from the boundary before the initial byte, to the boundary between the initial byte and the next one. When a char indices iterator is at a particular char boundary, there is no "previous"/"next" offset. There is a previous/next char, and there is a current offset.
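
One way to see the "offsets are boundaries" point is to record the iterator's position before each step and once more at the end. This is only a sketch; `positions` is a hypothetical helper and it assumes `offset()` is available on your toolchain.

```rust
/// Collects the iterator's position before each step and once at the end.
/// Hypothetical helper for illustration only.
fn positions(s: &str) -> Vec<usize> {
    let mut it = s.char_indices();
    let mut out = vec![it.offset()];
    while it.next().is_some() {
        out.push(it.offset());
    }
    out
}
```

A string with n chars produces n + 1 positions, and every one of them satisfies `str::is_char_boundary`, including the final one equal to the string's length.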

Amanieu · 2024-08-06T17:04:38Z

David's reasoning above resolves my concern.

@rfcbot resolve bad API

joshtriplett · 2024-08-09T16:37:40Z

Alright, I'm convinced by the description that it's the offset of the current character after having called next. I think that's going to need some clear documentation showing a loop and calling attention to how after next the iterator is already pointing after the returned character, but as a name it works.
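
Something along these lines might serve as that kind of loop example. It is a sketch, not the wording or code of the actual documentation, and `char_slices` is a hypothetical helper.

```rust
/// Pairs each char with the slice that spans it, using the index yielded by
/// `next()` as the start and the `offset()` read right afterwards as the end.
fn char_slices(s: &str) -> Vec<(char, &str)> {
    let mut it = s.char_indices();
    let mut out = Vec::new();
    while let Some((start, c)) = it.next() {
        // The iterator has already moved past `c`, so `offset()` is its end.
        let end = it.offset();
        out.push((c, &s[start..end]));
    }
    out
}
```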

@rfcbot resolved better-name

rfcbot · 2024-08-09T16:37:54Z

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot · 2024-08-19T16:38:34Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

Rollup merge of rust-lang#129276 - eduardosm:stabilize-char_indices_offset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improve the documentation to make it more clear that it returns the offset of the character that will be returned by the next call to `next()`.

pmetzger · 2024-08-23T13:59:11Z

Hooray!

Jay-Madden · 2024-08-24T03:44:43Z

Excellent

error[E0599]: no method named `split_at_checked` found for reference `&str` in the current scope
   --> crates\compiler\src\export.rs:229:53
    |
229 | match s.len().checked_sub(2).and_then(|i| s.split_at_checked(i)) {
    |                                             ^^^^^^^^^^^^^^^^ help: there is a method with a similar name: `split_at`

error[E0658]: use of unstable library feature 'char_indices_offset'
   --> crates\compiler\src\mml\tokenizer.rs:223:23
    |
223 | let i = c.offset();
    |           ^^^^^^
    |
    = note: see issue #83871 <rust-lang/rust#83871> for more information

TrolledWoods added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Apr 5, 2021

rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Nov 26, 2022

rfcbot added the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Aug 9, 2024

rfcbot removed the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Aug 9, 2024

eduardosm mentioned this issue Aug 19, 2024

Stabilize feature char_indices_offset #129276

Merged

apiraino removed the to-announce Announce this issue on triage meeting label Aug 22, 2024

bors closed this as completed in 26672c9 Aug 23, 2024
