Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crate panics when IRC Tag contain non ascii characters #228

Closed
olback opened this issue Nov 1, 2020 · 7 comments
Closed

Crate panics when IRC Tag contain non ascii characters #228

olback opened this issue Nov 1, 2020 · 7 comments

Comments

@olback
Copy link
Contributor

olback commented Nov 1, 2020

Here's the panic message:

thread 'main' panicked at 'byte index 96 is not a char boundary; it is inside '甲' (bytes 94..97) of `@badge-info=;badges=;client-nonce=5a0d93af08a7aec23672b859e0d1aa88;color=#00FF7F;display-name=甲森;emotes=;flags=;id=746b7296-ce51-4198-9002-5021ba9080d9;mod=0;room-id=71092938;subscriber=0;tmi-sent-ts=1604198943834;turbo=0;user-id=72760962;user-type= :g`[...]', /home/olback/.cargo/registry/src/github.com-1ecc6299db9ec823/twitchchat-0.14.5/src/irc/tag_indices.rs:91:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Version: 0.14.5

cargo --version -v:

cargo 1.47.0 (f3c7e066a 2020-08-28)
release: 1.47.0
commit-hash: f3c7e066ad66e05439cf8eab165a2de580b41aaf
commit-date: 2020-08-28

rustc --version -v:

rustc 1.47.0 (18bf6b4f0 2020-10-07)
binary: rustc
commit-hash: 18bf6b4f01a6feaf7259ba7cdae58031af1b7b39
commit-date: 2020-10-07
host: x86_64-unknown-linux-gnu
release: 1.47.0
LLVM version: 11.0

Side note: Consider adding issue templates to make it easier to report bugs/make feature request 👍

@museun
Copy link
Owner

museun commented Nov 2, 2020

I see, was this from Twitch or were your parsing custom messages?

I should look into issue templates, all I really need is the input data (or sample), the problem and the crate version. Its in the tags which I last added string escaping for. I'll see where its getting the indices wrong. I also assume it was the display-name tag.

@olback
Copy link
Contributor Author

olback commented Nov 2, 2020

Yes, this was from Twitch. I don't really know how a non ascii character even gets into the display-name as it has to be the same as your username, just different capitalization.

When titling the issue I assumed that all tags are parsed the same way, the issue is with display-name though.

Edit: Managed to get the user-id from one of the crashing messages, here is a partial response from the /helix/users endpoint with user in question:

{
  "data": [
    {
      "id": "86293428",
      "login": "yuebing233",
      "display_name": "月饼",
      ...
    }
  ]
}

@olback olback changed the title Crate panics when IRC Tag contains non ascii characters Crate panics when IRC Tag contain non ascii characters Nov 2, 2020
@museun
Copy link
Owner

museun commented Nov 2, 2020

display-name can be localized (e.g. non-ascii scripts). Its a newish feature (perhaps a year now) that allows non-western users to provide a native name.

Looking at the code, I don't know if I can fix this in 0.14.x. There is a fatal flaw here:

for (i, ch) in input.char_indices().skip(1) {
let i = i + 1;

I would need to calculate the byte offset of all further 'chars' which is something I don't really want to do.

I kind of want to get rid of the the whole super-cheap indices approach. Currently, all of the messages, each, use a single allocation and then provide their 'sub-strings' as indices. These indices refer back to the single &str/String. But getting rid of this and just moving to a naive 'struct of indv. &str/String would probably break the semver.

This'd allow me to use utf-8 aware splitting without having to really be considerate of boundaries -- let the std library provide that. I have quite a bit already pushed for the 0.15.x branch (#226). I can take this into consideration, but I'm still looking at providing one last 0.14.x release. I'm going to think about a way of not breaking the semver when changing all of the internal memory representation of the types.

@museun
Copy link
Owner

museun commented Nov 2, 2020

A bit more thinking, I can just change the tags representation -- it already has to allocate a boxed slice:

pub struct TagIndices {
pub(super) map: Box<[(MaybeOwnedIndex, MaybeOwnedIndex)]>,
}

I can just make this a Box<[(Cow<'a, str>, Cow<'a, str>)]> internally and it wouldn't change the public API. This would ensure the tag and its index (now removed) are always using the same character (code point/scalar) boundaries.

I would basically just remove the indices and make

#[derive(Clone, PartialEq)]
pub struct Tags<'a> {
pub(crate) data: &'a MaybeOwned<'a>,
pub(crate) indices: &'a TagIndices,
}
simpler.

I don't think I expose most of this to the user so it shouldn't be breaking.

@olback
Copy link
Contributor Author

olback commented Nov 2, 2020

Looks great.

display-name can be localized (e.g. non-ascii scripts). Its a newish feature (perhaps a year now) that allows non-western users to provide a native name.

Ah, neat.

@museun museun closed this as completed in 793556e Nov 2, 2020
@museun
Copy link
Owner

museun commented Nov 2, 2020

I found a workaround, its not ideal but its transparent for the most part.

I've published 0.14.6 which fixes this problem.

@olback
Copy link
Contributor Author

olback commented Nov 2, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants