Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join: add support for multibyte separators #6736

Merged
merged 4 commits into from
Oct 6, 2024

Conversation

jtracey
Copy link
Contributor

@jtracey jtracey commented Sep 25, 2024

GNU join used to not support any separators over 1 byte in length, and had tests ensuring such separators would return an error. As such, years ago, we removed our support for multibyte unicode characters, and added support for single non-unicode bytes to match their behavior, and got GNU tests passing. Now, GNU join supports multibyte separators when they're valid characters in the current locale's encoding. We're not in a position to add proper locale support yet, but this adds the next best thing, and should get the new GNU tests to pass in CI by assuming UTF-8 encodings. This is just straight up wrong on Windows, which always uses UTF-16, but that was always weird, and the "right" thing on Windows at this point would be to reject all possible separators other than null, which doesn't sound very useful.

The first commit fixes our tests to better reflect the new GNU behavior. The second commit gets multibyte separators working in a way closely resembling what we were already doing. But this bloats the separator enum we were using, and adds a bunch of clones or (de)refs, which profiling showed added a notable hit to performance (though still significantly faster than GNU), all to support a few extra bytes that will likely be rarely used. The third commit therefore changes the enum into a trait, so we can use generics instead of matching, getting us most of that lost performance back (or possibly all, it's hard to get consistent measurements).

@sylvestre
Copy link
Contributor

clippy is complaining on:

 error: this argument is passed by value, but not consumed in the function body
   --> src/uu/join/src/join.rs:366:56
    |
366 |     fn new<Sep: Separator>(string: Vec<u8>, separator: Sep, len_guess: usize) -> Self {
    |                                                        ^^^ help: consider taking a reference instead: `&Sep`
    |
    = help: for further information visit [https://rust-lang.github.io/rust-clippy/master/](https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_value)

Copy link

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

@sylvestre
Copy link
Contributor

Congrats! The gnu test tests/join/join-utf8 is no longer failing!

well done :)

@jtracey jtracey force-pushed the join-multibyte branch 2 times, most recently from 69d71f7 to 849479c Compare September 26, 2024 03:03
Copy link

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!
Congrats! The gnu test tests/timeout/timeout is no longer failing!

@jtracey jtracey force-pushed the join-multibyte branch 2 times, most recently from e15103c to 54b2b0c Compare September 26, 2024 04:12
Copy link

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

@jtracey jtracey marked this pull request as draft September 26, 2024 04:58
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/join/join-utf8 is no longer failing!

@jtracey jtracey force-pushed the join-multibyte branch 6 times, most recently from 3ddebb3 to cfd32e2 Compare September 27, 2024 21:54
@jtracey
Copy link
Contributor Author

jtracey commented Sep 27, 2024

Sorry for all the force pushes, made a series of dumb mistakes by trying to fix things too quickly. The new additional commit adds a test that should have been there already and would have saved me some time (join handles whitespace separators differently than every other kind of separator).

@jtracey jtracey marked this pull request as ready for review September 27, 2024 22:04
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/join/join-utf8 is no longer failing!

Comment on lines 65 to 68
Byte(u8),
Char(Vec<u8>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please document these two, the diff isn't obvious :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

@jtracey
Copy link
Contributor Author

jtracey commented Sep 28, 2024

It occurs to me that even though GNU join doesn't allow multibyte separators with invalid or multiple code points, I don't actually know why they have this restriction. When it was only single byte separators allowed, that made perfect sense, as it simplifies the algorithms and makes them faster. But once you allow any code point, I don't know why you would bother restricting it. I suppose there are some additional performance hacks you could do by assuming UTF-8 and knowing the separator is at most 4 bytes, but GNU supports other encodings, and we're not doing any such hacks. Should we just remove those checks, and allow using any arbitrary byte string as the separator?

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

@sylvestre
Copy link
Contributor

I don't actually know why they have this restriction

maybe ask on their mailing list?

Should we just remove those checks, and allow using any arbitrary byte string as the separator?

Sure but in a different PR
and documented here:
https://github.com/uutils/coreutils/blob/main/docs/src/extensions.md

@sylvestre sylvestre merged commit a51a731 into uutils:main Oct 6, 2024
67 of 68 checks passed
@jtracey jtracey deleted the join-multibyte branch October 10, 2024 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants