Tracking Issue for CharIndices::offset function #83871
Comments
Is there anything holding this back?
I just realized I desperately wanted this today. As it stands, assembling a range of characters in a string is rather unpleasant: you can't easily get the byte offset of the next character to use as the end of the range, so constructing a slice when all you have is the last character is awkward.
I realized I sort of forgot about this, sorry about that. Is there anything I need to do to move this along in stabilization?
No idea but I'd really love to see it stabilized.
Agreed, this would be very useful.
Yeah, it'd be really nice to have this finally released in the stable builds. Any progress on this?
@TrolledWoods Maybe you can ask around on the Rust dev fora about how to progress this?
As I learned a while back on u.r.l.o, there's also char::len_utf8 that can be used to construct a slice representing (or ending at) the current char.
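The `char::len_utf8` approach mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: `slice_through` and its parameters are hypothetical names, not something from the thread, and `start` is assumed to lie on a character boundary (slicing panics otherwise).

```rust
/// Returns the slice of `s` that begins at byte offset `start` and ends just
/// past the first character satisfying `pred`, or `None` if nothing matches.
/// (Hypothetical helper, for illustration only.)
fn slice_through(s: &str, start: usize, pred: impl Fn(char) -> bool) -> Option<&str> {
    for (i, c) in s[start..].char_indices() {
        if pred(c) {
            // `i` is relative to `start`; adding the character's UTF-8 width
            // gives the byte offset one past the current character.
            let end = start + i + c.len_utf8();
            return Some(&s[start..end]);
        }
    }
    None
}
```

This works without any extra iterator method, but it only helps while a current character is in hand; it says nothing about the position once the iterator is exhausted.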
@rfcbot merge
Team member @m-ou-se has proposed to merge this. The next step is review by the rest of the tagged team members: Concerns:
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me.
In my experience, additional methods on iterator types tend to be hard/unergonomic to use. In this case it would be better to return the offset as part of a tuple in the iterator's item type. @rfcbot concern bad API
Can someone show a small but real world program where this method is used? Also @Amanieu, I think that suggestion would require building a new iterator type just for this. Probably an
I disagree, methods on iterators are very difficult to use ergonomically. Consider If character offsets are really needed then they should be returned through a proper iterator interface that is compatible with all the existing iterator combinators.
I have only a big real world program where I needed it, not a small one. I had a hand-written lexical analyzer that needed it. However, in the intervening years, I replaced the lexical analyzer with one generated by re2c now that it produces Rust code, so I don't have an example handy any more.
@Amanieu You seem to have gotten a bit confused here - your suggestion is precisely the The implementation PR's description describes a few of the problems I just ran into while trying to use
@BurntSushi Here's an excerpt of something I was just working on, followed by how it looks with this method:

    // Without `offset`: the lexer has to carry the input length separately.
    use std::iter::Peekable;
    use std::str::CharIndices;

    struct Lexer<'a> {
        input: Peekable<CharIndices<'a>>,
        end: usize, // Initialized to the length of the input.
    }

    impl<'a> Lexer<'a> {
        fn lex(&mut self) -> Token {
            let Some((start, c)) = self.input.next() else {
                return Token {
                    kind: Eof,
                    start: self.end,
                    end: self.end,
                }
            };
            let kind = match c {
                ...
            };
            // The token ends where the next character starts, if there is one.
            let end = self.input.peek().map(|&(pos, _)| pos).unwrap_or(self.end);
            Token { kind, start, end }
        }
    }

    // With `offset`: no `Peekable` wrapper and no separate `end` field.
    use std::str::CharIndices;

    struct Lexer<'a> {
        input: CharIndices<'a>,
    }

    impl<'a> Lexer<'a> {
        fn lex(&mut self) -> Token {
            let Some((start, c)) = self.input.next() else {
                return Token {
                    kind: Eof,
                    start: self.input.offset(),
                    end: self.input.offset(),
                }
            };
            let kind = match c {
                ...
            };
            let end = self.input.offset();
            Token { kind, start, end }
        }
    }
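For readers who want something compilable, here is a minimal, self-contained sketch in the spirit of the second excerpt. The `Kind` variants, the `eat_while` helper, and the whitespace handling are assumptions made for illustration and are not from the original comment; it also assumes a toolchain where `CharIndices::offset` is stable (1.82 or later, to the best of my knowledge).

```rust
use std::str::CharIndices;

#[derive(Debug, PartialEq)]
enum Kind { Word, Number, Other, Eof }

#[derive(Debug, PartialEq)]
struct Token { kind: Kind, start: usize, end: usize }

struct Lexer<'a> { input: CharIndices<'a> }

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { input: src.char_indices() }
    }

    /// Consumes characters for as long as the next one satisfies `pred`.
    fn eat_while(&mut self, pred: impl Fn(char) -> bool) {
        // `as_str` exposes the unconsumed remainder, so we can look ahead
        // without wrapping the iterator in `Peekable`.
        while self.input.as_str().starts_with(|c: char| pred(c)) {
            self.input.next();
        }
    }

    fn lex(&mut self) -> Token {
        self.eat_while(char::is_whitespace);
        let Some((start, c)) = self.input.next() else {
            // Exhausted: `offset` now equals the length of the input.
            let end = self.input.offset();
            return Token { kind: Kind::Eof, start: end, end };
        };
        let kind = if c.is_alphabetic() {
            self.eat_while(char::is_alphabetic);
            Kind::Word
        } else if c.is_ascii_digit() {
            self.eat_while(|c| c.is_ascii_digit());
            Kind::Number
        } else {
            Kind::Other
        };
        // `offset` is the byte position just past the last consumed char.
        Token { kind, start, end: self.input.offset() }
    }
}
```

Calling `lex` repeatedly on `Lexer::new("añ 42+")` should yield spans that can be used directly to slice the source, ending with an empty `Eof` span at the input length.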
Thanks, this gives a better idea of how this is useful in practice. I believe this API is good, but I don't like the naming:
Indeed, this baffled me a bit at first as well. It's supposed to be the index that you get next (with the difference that it returns one-past-the-end when Furthermore, "offset" is a word alien to the current public API; the type and method names use "indices" and the docs additionally talk about "position". If "offset" is just a synonym to "index" or "position" then one of those words should be used instead, preferably "index" because that's what the API already uses. So
Personally, I always think of it as "the current position". I've written near copies of CharIndices in the past and this field would always be named something like I'm ambivalent on attaching In either case it's one of those things that is really straightforward if you go-to-definition into the stdlib source and a bit wordier to explain in prose...
Another possible name is
It's still valid at the end, though, which makes that analogy misleading.
The offset of the iterator into the string it iterates over. 'next offset' or 'next char offset' wouldn't make sense for the offset at the end (when
Calling this
Thinking about this some more, it seems that what you really want here is a
I recently ran into nearly the same use case as mentioned above. (I currently use In my use case, I don't just want And without an offset/position/index method, I'd still have to special-case the final location where the iterator returns None but I still need its offset in the input (at the end).
I use Chars::as_str() quite often to get the unconsumed part of the string back for later use. I don't think I've ever had any issues with its API.
Some similar code only stores the start index for each token, since you can reconstruct ranges later, so ranges aren't necessary. Ranges also are not sufficient because they can't give you the iterator's position when it's at the end, for an Eof token. (It would unhelpfully give you that value slightly too early, with the last char, before you know you're about to hit Eof.) On a more general note, I find that additional methods on iterators often make the difference between getting frustrated and doing things by hand instead, or feeling like I'm using a beautiful API that's considered the gaps and edge cases. I'm suspicious of iterators that only support From another angle if I just "follow the data" this method exposes something CharIndices must know, and it comes down to whether it needlessly obscures it from me or not.
I agree. While I think "consider a new iterator" is a good idea in general, if there's a niche use case that can be served by adding a natural method to an existing iterator type, then that seems like a win to me.
Yes, it makes sense from the implementer's point of view, but IMO not the user's. And API naming should of course be user-centric. The fact that
It'd be an issue if we have a
I agree that this method is hard to use "in the usual places where you'd use an iterator". This isn't really a problem: this method is not meant to be used in such cases; it's meant to be used in logic that's dealing specifically with the This type already has an From what I understand, the current major holdback is the naming. Perhaps
Delaying a useful API for long periods because of difficulty finding the right name seems unfortunate. Perhaps someone could just post a poll with a few suggestions and be done with it?
Thinking back on this I agree that
One can ask what color to paint the bikeshed endlessly. (This doesn't mean having a good name isn't important, just that the conversation may not terminate, having an okayish name and having the thing available is better than waiting forever for perfection, and someone is going to have to make a decision.)
A function with the following signature would be useful to me in the code I'm currently writing:

    impl<'a> CharIndices<'a> {
        pub fn current(&self) -> (usize, Option<char>) { ... }
    }

However, if the implementation of this function would be no better than Currently I am using
A work-around for finding the current byte index is to subtract
The main disadvantage is that you have to separately record the length of the original string. PS I agree with calling it
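The exact expression in the work-around above did not survive in this copy of the thread. One plausible reading, offered purely as an assumption, is to subtract the length of the unconsumed remainder (`CharIndices::as_str`) from the length of the original string. The helper name `current_index` is hypothetical.

```rust
use std::str::CharIndices;

/// Byte index of the character `iter` would yield next, or `total_len` once
/// the iterator is exhausted. `total_len` must be the length of the string
/// the iterator was created from. (Hypothetical helper; assumes the iterator
/// has only been advanced from the front.)
fn current_index(iter: &CharIndices<'_>, total_len: usize) -> usize {
    // Whatever has not been consumed yet sits at the tail of the original
    // string, so the consumed prefix is the difference of the two lengths.
    total_len - iter.as_str().len()
}
```

As the comment notes, the caller has to keep the original length around. This sketch would also give the wrong answer if `next_back` had been called, since the remainder then shrinks from the back as well.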
IMO
Namely:
The difference in mental model between "previous"/"current" vs "previous"/"next" is similar to something we debated previously in rust-lang/rfcs#2570 (comment). For linked list cursor, it did make sense to me that the cursor points in between linked list nodes, so "before"/"after" was a better way to design the API than "before"/"current". But notice that this is meaningfully different than the char offsets use case! We redesigned linked list cursors from pointing at a node, to pointing at the boundary between two nodes. But offsets already refer to the boundary between two chars! The char range 0..1 refers to a range from the boundary before the initial byte, to the boundary between the initial byte and the next one. When a char indices iterator is at a particular char boundary, there is no "previous"/"next" offset. There is a previous/next char, and there is a current offset.
David's reasoning above resolves my concern. @rfcbot resolve bad API
Alright, I'm convinced by the description that it's the offset of the current character after having called next. I think that's going to need some clear documentation showing a loop and calling attention to how after @rfcbot resolved better-name
🔔 This is now entering its final comment period, as per the review above. 🔔
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. This will be merged soon.
Rollup merge of rust-lang#129276 - eduardosm:stabilize-char_indices_offset, r=Amanieu

Stabilize feature `char_indices_offset`

Stabilized API:

```rust
impl CharIndices<'_> {
    pub fn offset(&self) -> usize;
}
```

Tracking issue: rust-lang#83871

Closes rust-lang#83871

I also attempted to improve the documentation to make it clearer that it returns the offset of the character that will be returned by the next call to `next()`.
Hooray!
Excellent
error[E0599]: no method named `split_at_checked` found for reference `&str` in the current scope
   --> crates\compiler\src\export.rs:229:53
    |
229 |         match s.len().checked_sub(2).and_then(|i| s.split_at_checked(i)) {
    |                                                     ^^^^^^^^^^^^^^^^ help: there is a method with a similar name: `split_at`

error[E0658]: use of unstable library feature 'char_indices_offset'
   --> crates\compiler\src\mml\tokenizer.rs:223:23
    |
223 |             let i = c.offset();
    |                       ^^^^^^
    |
    = note: see issue #83871 <rust-lang/rust#83871> for more information
Feature gate: `#![feature(char_indices_offset)]`

This is a tracking issue for the function `CharIndices::offset`. It returns the byte position of the next character, or the length of the underlying string if there are no more characters. This is useful for getting ranges over strings you're iterating over.

Public API
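As a rough illustration of the behaviour described above (not taken from the issue text itself), the sketch below collects the byte range of every character, using `offset` for the end of each range. The function name `char_ranges` is made up for this example, and it assumes a toolchain where `CharIndices::offset` is available without the feature gate (1.82 or later, to the best of my knowledge).

```rust
use std::ops::Range;

/// Collects the byte range occupied by each character of `s`.
/// (Hypothetical helper, for illustration only.)
fn char_ranges(s: &str) -> Vec<Range<usize>> {
    let mut iter = s.char_indices();
    let mut ranges = Vec::new();
    while let Some((start, _)) = iter.next() {
        // After `next()`, `offset()` is the byte position of the following
        // character, or `s.len()` if no characters remain.
        ranges.push(start..iter.offset());
    }
    ranges
}
```

Before the first call to `next()` the offset is the position of the first character, and once the iterator is exhausted it stays at the length of the string, which is the value a lexer would want for an end-of-input span.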
Steps / History
Unresolved Questions