Skip to content

Commit

Permalink
Add String#to_utf16 and String.from_utf16 (#5541)
Browse files Browse the repository at this point in the history
  • Loading branch information
asterite authored and RX14 committed Jan 5, 2018
1 parent 5bdd279 commit 50b563c
Show file tree
Hide file tree
Showing 3 changed files with 141 additions and 2 deletions.
47 changes: 47 additions & 0 deletions spec/std/string/utf16_spec.cr
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
require "spec"

describe "String UTF16" do
describe "to_utf16" do
it "in the range U+0000..U+D7FF" do
encoded = "\u{0}hello\u{d7ff}".to_utf16
encoded.should eq(Slice[0_u16, 0x68_u16, 0x65_u16, 0x6c_u16, 0x6c_u16, 0x6f_u16, 0xd7ff_u16])
end

it "in the range U+E000 to U+FFFF" do
encoded = "\u{e000}\u{ffff}".to_utf16
encoded.should eq(Slice[0xe000_u16, 0xffff_u16])
end

it "in the range U+10000..U+10FFFF" do
encoded = "\u{10000}\u{10FFFF}".to_utf16
encoded.should eq(Slice[0xd800_u16, 0xdc00_u16, 0xdbff_u16, 0xdfff_u16])
end

it "in the range U+D800..U+DFFF" do
encoded = "\u{D800}\u{DFFF}".to_utf16
encoded.should eq(Slice[0xFFFD_u16, 0xFFFD_u16])
end
end

describe "from_utf16" do
it "in the range U+0000..U+D7FF" do
input = Slice[0_u16, 0x68_u16, 0x65_u16, 0x6c_u16, 0x6c_u16, 0x6f_u16, 0xd7ff_u16]
String.from_utf16(input).should eq("\u{0}hello\u{d7ff}")
end

it "in the range U+E000 to U+FFFF" do
input = Slice[0xe000_u16, 0xffff_u16]
String.from_utf16(input).should eq("\u{e000}\u{ffff}")
end

it "in the range U+10000..U+10FFFF" do
input = Slice[0xd800_u16, 0xdc00_u16]
String.from_utf16(input).should eq("\u{10000}")
end

it "in the range U+D800..U+DFFF" do
input = Slice[0xdc00_u16, 0xd800_u16]
String.from_utf16(input).should eq("\u{fffd}\u{fffd}")
end
end
end
3 changes: 1 addition & 2 deletions src/string.cr
Original file line number Diff line number Diff line change
Expand Up @@ -4272,5 +4272,4 @@ class String
end
end

require "./string/formatter"
require "./string/builder"
require "./string/*"
93 changes: 93 additions & 0 deletions src/string/utf16.cr
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
class String
# Returns the UTF-16 encoding of the given *string*.
#
# Invalid chars (in the range U+D800..U+DFFF) are encoded with the
# unicode replacement char value `0xfffd`.
#
# ```
# "hi 𐂥".to_utf16 # => Slice[104_u16, 105_u16, 32_u16, 55296_u16, 56485_u16]
# ```
def to_utf16 : Slice(UInt16)
size = 0
each_char do |char|
size += char.ord < 0x10000 ? 1 : 2
end

slice = Slice(UInt16).new(size)

This comment has been minimized.

Copy link
@RX14

RX14 Jan 6, 2018

Contributor

I just realised: we don't make sure that the Slice ends with a null byte. Should we make this guarantee for all slices, or just do it manually in this method?

This comment has been minimized.

Copy link
@asterite

asterite Jan 6, 2018

Author Member

Hmmm... right. But I guess this is only needed for Windows interaction, right? And I get when we get a WCHAR* from Windows, it ends with a null byte. So these two methods aren't useful for Windows interactions.

I'm more and more convinced that these two methods (adapted to suit Windows needs) should be private to the std, maybe in a Crystal::Windows module.

This comment has been minimized.

Copy link
@ysbaddaden

ysbaddaden Jan 7, 2018

Contributor

If we assume anything and they're not generic methods to convert from/to UTF-16 in the general case, then I agree these methods should be somewhat private internal methods —and maybe reflect windows naming.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

I don't think we need to rethink this so much. We simply need another overload to each_utf16_char which takes a Pointer(UInt16) instead of a Slice, and to relax the type restriction on from_utf16. Then simply ensure there's a NUL encoded off the end of the returned Slice (which is easy to do without breaking the current contract).

Just tweaks, not a rethink is required here.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

Actually, almost all windows functions give you the length of the buffer they just wrote, so I actually don't need from_utf16(WCHAR*), just from_utf16(Slice(WCHAR)) as we have now.

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

What about when you pass a string to windows? You need a null at the end, right?

If you are implementing some of this stuff (like File or Dir functions) I'd say let's wait until you do it, using whatever method you need (you can change these methods too if you need to), but I'd like to have them kept internally in the standard library.

This is very simple: we need it now for windows. Just for windows. If someone comes up and says they need it for something else, we can discuss what's the best API and design for that is. There's no need to rush over this functionality as a general thing because it can always be implemented in a shard. But for interfacing with windows we can make the internal API as convenient/ugly/nice as we want without depending on people using it.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

@asterite yes, we need a null on the end, but that's a very small change to this API so I don't think we should gun for removing it right now.

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

I guess for now we can add a append_null boolean flag to the to_utf16 method. I don't know what the default value for that should be. I guess for Windows it's more convenient to append a null, but for other cases (what cases??) I don't know.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

Why not do it all the time? Like String, it makes no difference to the API. The zero byte would be off the end of the returned Slice range and invisible to anyone who doesn't use to_unsafe on the slice.

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

How it's invisible if the last element of the slice will always be zero?

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

Unless you mean allocating space for the zero, but only returning a subslice. Sounds pretty obscure to me.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

@asterite that's exactly what I meant. String does the same thing, I wouldn't call it obscure if it's documented.

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

I guess it's fine then. What about creating a String from WCHAR*? Do Windows always tell you the length? If not, in that case we could probably have String.from_utf16(UInt16*) which would be unsafe and documented as ending in a zero.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

@asterite that was exactly my plan once I found a function that needed it. So far I've found only functions which tell you the buffer length.

This comment has been minimized.

Copy link
@asterite

asterite Jan 7, 2018

Author Member

Cool. Do you want to prepare a PR for this? I'd like to wait until we have all use cases (regarding the windows API) known to just file a single PR. Maybe it's even better to send it together with some windows stuff.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

I'll wait until I need it and have tested it to file a PR. The problem with this commit was I merged it before I tested it properly on windows for my usecase.

This comment has been minimized.

Copy link
@RX14

RX14 Jan 7, 2018

Contributor

All the commits I make start off on feature/windows-spec and I cherry-pick them out into PRs anyway.


i = 0
each_char do |char|
ord = char.ord
if ord <= 0xd800 || (0xe000 <= ord < 0x10000)
# One UInt16 is enough
slice[i] = ord.to_u16
elsif ord >= 0x10000
# Needs surrogate pair
ord -= 0x10000
slice[i] = 0xd800_u16 + ((ord >> 10) & 0x3ff) # Keep top 10 bits
i += 1
slice[i] = 0xdc00_u16 + (ord & 0x3ff) # Keep low 10 bits
else
# Invalid char: use replacement
slice[i] = 0xfffd_u16
end
i += 1
end

slice
end

# Decodes the given *slice* UTF-16 sequence into a String.
#
# Invalid values are encoded using the unicode replacement char with
# codepoint `0xfffd`.
#
# ```
# slice = Slice[104_u16, 105_u16, 32_u16, 55296_u16, 56485_u16]
# String.from_utf16(slice) # => "hi 𐂥"
# ```
def self.from_utf16(slice : Slice(UInt16)) : String
bytesize = 0
size = 0

each_utf16_char(slice) do |char|
bytesize += char.bytesize
size += 1
end

String.new(bytesize) do |buffer|
each_utf16_char(slice) do |char|
char.each_byte do |byte|
buffer.value = byte
buffer += 1
end
end
{bytesize, size}
end
end

# Yields each decoded char in the given slice.
private def self.each_utf16_char(slice : Slice(UInt16))
i = 0
while i < slice.size
byte = slice[i].to_i
if byte < 0xd800 || byte >= 0xe000
# One byte
codepoint = byte
elsif 0xd800 <= byte < 0xdc00 &&
(i + 1) < slice.size &&
0xdc00 <= slice[i + 1] <= 0xdfff
# Surrougate pair
codepoint = ((byte - 0xd800) << 10) + (slice[i + 1] - 0xdc00) + 0x10000
i += 1
else
# Invalid byte
codepoint = 0xfffd
end

yield codepoint.chr

i += 1
end
end
end

0 comments on commit 50b563c

Please sign in to comment.