
Unicode support #3072

Closed
ncannasse opened this issue May 29, 2014 · 44 comments

@ncannasse
Member

We want to add fully tested versions of the following classes to the standard library:

  • haxe.Ucs2
  • haxe.Utf8
  • haxe.Utf16

They will be implemented as abstracts and will follow the String API, plus conversions between them and fromBytes/toBytes.
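
For illustration, a minimal sketch of the shape such an abstract could take, using Ucs2 as the simplest case (the details below are guesses, not part of the proposal; little-endian byte order is assumed arbitrarily):

// Hypothetical sketch only: one possible shape for the proposed haxe.Ucs2,
// stored as an abstract over haxe.io.Bytes with two bytes per code unit.
abstract Ucs2(haxe.io.Bytes) {
    public var length(get, never):Int;

    inline function new(b:haxe.io.Bytes) this = b;

    inline function get_length():Int return this.length >> 1;

    // Mirrors String.charCodeAt; every position is exactly one 16-bit unit.
    public function charCodeAt(index:Int):Null<Int> {
        if (index < 0 || index * 2 + 1 >= this.length) return null;
        return this.get(index * 2) | (this.get(index * 2 + 1) << 8);
    }

    public static inline function fromBytes(b:haxe.io.Bytes):Ucs2 return new Ucs2(b);
    public inline function toBytes():haxe.io.Bytes return this;
}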

@elliott5

Please could the class haxe.Utf32 also be added?

This class would be an array of 32-bit unicode code points. Unlike any of the other 3 classes, each position in the array would be a single unicode character, making character-by-character manipulation of the string very simple.

The usage would be to:

  • Cast the required haxe.Ucs2, haxe.Utf8 or haxe.Utf16 strings to haxe.Utf32;
  • Do whatever character-position based manipulations are required between strings in the haxe.Utf32 form, using simple array-index-for-each-character logic; and
  • Cast the result back to the required storage encoding of haxe.Ucs2, haxe.Utf8 or haxe.Utf16.

As it is so simple, the haxe.Utf32 class might also provide a useful intermediate form when casting between haxe.Ucs2, haxe.Utf8 and haxe.Utf16.
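
As a purely hypothetical usage sketch (none of these classes or casts exist yet; the lines below just illustrate the round trip described above):

// Hypothetical only: assumes the proposed haxe.Utf8 / haxe.Utf32 existed.
var u8 = haxe.Utf8.fromBytes(bytes);   // bytes: some haxe.io.Bytes holding UTF-8 data
var u32:haxe.Utf32 = cast u8;          // one array slot per unicode code point
var first = u32[0];                    // character-by-character access by index
u32[0] = "#".code;                     // simple index-based edits
var back:haxe.Utf8 = cast u32;         // back to the storage encoding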

@deltaluca
Contributor

I haven't seen any discussion on handling unicode normalization before. I'd hate to have a 'unicode string type' say two strings aren't equal just because one uses precomposed characters and the other combining diacritics, or have a string search return no results for similar reasons.

@elliott5 all this means that even utf32 does not really make things any easier. Encoding/decoding is the easy part of unicode.

@elliott5

elliott5 commented Jun 1, 2014

I bow to your greater knowledge @deltaluca. You prompted me to read http://en.wikipedia.org/wiki/Unicode_normalization, which I found very helpful.

I know the Go language much better than I know Haxe. As two of the original three inventors of the Go language (Ken Thompson and Rob Pike) also invented UTF-8, this is a topic close to their hearts. You can read about the Go approach to Unicode normalization at http://blog.golang.org/normalization. In summary, the Go project provides standard library tools which allow programmers to use any of the four Unicode normalization forms.

However, the Go blog references http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC- , which shows that ~99.98% of web HTML page content characters (after discarding markup and doing entity resolution) are already in NFC form (Normalization Form Canonical Composition, where characters are decomposed and then recomposed by canonical equivalence).

So if Haxe were to support just one Unicode normalisation form, NFC would be the obvious choice.

But as @deltaluca points out, even after NFC normalization there is still the problem of how to deal with those diacritics that remain. Go has an extensive library for handling Unicode code points (see http://golang.org/pkg/unicode/ ), at least some of which logic will also need to be present in Haxe libraries.

One note of caution though: having used my little TARDIS Go project to translate the Go "unicode" library into Haxe, I know that the data structures required are quite big, so Haxe may only want to implement a subset of that functionality.

Of course one day, in my dreams, TARDIS Go will provide Haxe programmers with access to all of these Go libraries for free...

I could look at producing a proof-of-concept, if anyone is interested. Since all the hard work has already been done by the Go team, it seems a shame not to re-use it. Of course the automatically generated Haxe code produced will be big and slow, at least until the TARDIS Go project matures; but also free and much faster to generate than rewriting it from scratch.

@mandel59
Contributor

Unifill, the library for Unicode string support I developed, provides three ways to deal with strings.

  1. The unifill.Unifill utility class, providing wrapper methods that treat native strings as UTF-32-encoded strings. It is probably the easiest way to make an application support Unicode, but it might cause a performance regression. Index access is O(n). (See the example below.)
  2. The unifill.InternalEncoding class, providing the necessary methods to deal with native strings. Its methods are intended as operations on UTF encodings without considering the actual encoding of native strings. Index access is O(1) but indexed by code units.
  3. The unifill.UtfX classes (unifill.Utf8, unifill.Utf16, unifill.Utf32). They hold strings encoded as their names suggest. They are like haxe.Utf16 as proposed here, although UtfX isn't an abstract but a class, to make it a subtype of the Utf interface. Utf defines the minimum necessary methods to deal with each encoding. Utf is defined here: https://github.com/mandel59/unifill/blob/master/unifill/Utf.hx When you adopt this solution, you have to convert between native strings and UtfX.

I suppose the first way is suitable for small applications or wherever string manipulation is not heavy, and the second way is good when dealing with native strings heavily. The third way would be usable in situations where the encoding is specified, for instance to decode a binary file containing encoded strings. Each way has pros and cons.
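
A minimal example of the first style (method names as in the Unifill README; double-check the exact signatures):

using unifill.Unifill;

class Main {
    static function main() {
        var s = "🌍x";           // one astral code point, then an ASCII char
        trace(s.uLength());      // 2 code points, whatever the native encoding
        trace(s.uCharAt(0));     // "🌍" as a whole character
        trace(s.uIndexOf("x"));  // 1, counted in code points
    }
}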

@mandel59
Contributor

Haxe's String API is compatible with ActionScript (and JavaScript), but the API is poor due to its lack of variable-length encoding support (i.e. surrogate pair support). Java's String API has been extended to deal with surrogate pairs.
UTF-16 support is very important (see the Unicode Beyond-BMP Top Ten List—2014 Redux). I suppose Haxe should provide methods to support it.
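
A minimal illustration of the problem on targets whose native strings are UTF-16 (js, Java, C#); exact results differ on UTF-8-based targets:

var s = "🌍";             // U+1F30D, outside the BMP
trace(s.length);          // 2 on UTF-16 targets: stored as a surrogate pair
trace(s.charCodeAt(0));   // 55356 (0xD83C), the high surrogate
trace(s.charCodeAt(1));   // 57101 (0xDF0D), the low surrogate
trace(s.charAt(0));       // half a character, not meaningful on its own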

@waneck
Member

waneck commented Oct 1, 2014

I can't see how we could handle this if not by changing ALL haxe.io.Input/haxe.io.Output to take an additional encoding parameter when reading to a String or writing from a String.
The current haxe.Utf8-like API simply has no meaning on some platforms, like Java and C#. These platforms have a very specific encoding in their Strings, and there's no meaning in haxe.Utf8.encode's signature being String->String.

@Simn Simn mentioned this issue Jan 22, 2015
@Simn Simn modified the milestones: 3.3, 3.2 Feb 22, 2015
@Simn
Member

Simn commented Jun 3, 2015

Can someone comment on the approach(es) taken by https://github.com/mandel59/unifill and if we could make use of it?

@deltaluca
Contributor

For reference: http://site.icu-project.org/ and libraries that wrap ICU in other languages: http://site.icu-project.org/related

@ncannasse
Member Author

@Simn broken link

@Simn
Member

Simn commented Jun 3, 2015

edited (no idea how that happened)

@ncannasse
Member Author

I don't think that using Unifill is a good idea: since all calls go through the Utf interface, no inlining is possible, so while it makes the code more generic (for instance you can implement split once for all encodings), in the end it's very slow.

I would instead have a Utf interface that entirely mimics the String API so it can be a drop-in replacement (no using), then an implementation for each encoding. For fixed-width codepoint encodings (UCS2, UTF32), this would benefit from not having to take codepoint width into account.

@Simn
Member

Simn commented Jun 3, 2015

You're complaining about the Utf interface, but then suggest using a Utf interface instead. ?_?

@mpcref

mpcref commented Jun 3, 2015

Hearing all the talk about lacking Unicode support going on at WWX2015, I thought maybe the unorm project might be of help. It implements unicode normalization and can be used as a polyfill for the ES6-proposed String.prototype.normalize().

@elliott5

elliott5 commented Jun 4, 2015

Just to reinforce what I said at my WWX2015 talk last Saturday, the existing body of Unicode-related Go packages could easily be auto-translated to Haxe using TARDISgo, including:

The main issue would be creating a more user-friendly and Haxe-friendly API wrapper for the generated code, as the current autogenerated API is rather ugly; see the normalization example from my talk. However, this is nowhere near as much work as writing the libraries from scratch.

By auto-generating the libraries in this way, the Haxe ecosystem could very quickly have a complete working solution, one which leverages the many man-years of work that the Google Go team have put into making these libraries correct. Of course the generated code is larger and slower than hand-written Haxe would be, so there would be opportunities to re-write key elements of the libraries to improve speed/size in future.

@mandel59
Contributor

mandel59 commented Jun 4, 2015

I've rewritten Unifill to avoid using the Utf interface. Now the UtfX types are abstracts.

@mandel59
Contributor

mandel59 commented Jun 4, 2015

Unifill doesn't support Ucs2, because what we call UCS-2 is actually potentially ill-formed UTF-16 in most cases. (cf. https://simonsapin.github.io/wtf-8/#ill-formed-utf-16)
We should, however, consider how to deal with potentially ill-formed UTF-16 (or UTF-8) strings. Ideally, dirty strings would be separated into different types, but the domain of Haxe is not an ideal world but the real one. While Unifill provides a validate method to assert string encoding validity, that may not be enough.

@Simn
Member

Simn commented Jun 6, 2015

unifill seems to have all the required tools already, so is this just a matter of changing how to interact with it?

@waneck
Member

waneck commented Jun 6, 2015

When we add support for unicode, let us please also include an encoding option for readString or any other way we provide to read from Bytes to a String (we may need to include it in the CFFI as well, by the way). This way we can still be sure that the string is always in a valid encoding internally, and there is no awkward encode function with a type like String->String.
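
A hypothetical sketch of the requested shape (placeholder names, not existing Haxe APIs):

// Placeholder names only; the point is the explicit encoding parameter.
enum Encoding {
    UTF8;
    UTF16;
    RawNative; // whatever encoding the target's native String uses internally
}

interface EncodingAwareInput {
    // decode `len` bytes from the stream using the given encoding
    function readString(len:Int, ?encoding:Encoding):String;
}

interface EncodingAwareOutput {
    // encode the string before writing, instead of a String->String encode step
    function writeString(s:String, ?encoding:Encoding):Void;
}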

@Simn
Member

Simn commented Jul 21, 2015

@ncannasse: Please follow up.

@frabbit
Member

frabbit commented Jun 3, 2017

i just want to update everyone, the current status of unicode can be found here:
https://github.com/frabbit/haxe-1/tree/unicode_vector

i'm planning to get it done in the next weeks, feedback appreciated.

@m22spencer
Contributor

Great job so far!

Couple of notes:

  • ucs2/utf16.fromCharCode(0xD800)
    Why is 0xD800 not accepted? The break test file requires it.
    http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt

  • Mandatory conversion
    Consider C#/Utf16 impl.
    C# String type is Utf16. (Confirm?)

    i18n.Utf16 converts C# String -> Bytes on construction
    and Bytes -> C# String on toNativeString.
    Doesn't this make interop with existing functions very expensive?
    Can each native platform use String directly for that particular i18n repr?
    eg:

    • C#: String == i18n.Utf16
    • js: String == i18n.Ucs2
    • hl: String == i18n.Utf16
    • ...
  • Grapheme clusters
    Even with the unicode-aware representation, another layer of abstraction is needed in order to get the expected results.

    See below (these results are the same for Ucs2/Utf8/Utf16/Utf32):

var s = new haxe.i18n.Utf8("☔︎한🌍");    // 3 characters.

trace(s.charAt(0).toNativeString());   //☔ Hmm.. this umbrella looks different
trace(s.charAt(1).toNativeString());   //  Nothing?
trace(s.charAt(2).toNativeString());   //ᄒ ???

trace(s.substr(2,1).toNativeString()); //
trace(s.substr(2,2).toNativeString()); //하
trace(s.substr(2,3).toNativeString()); //한

var a = new haxe.i18n.Utf8("한");      // 1 character.
var b = new haxe.i18n.Utf8("");      // 1 character.
trace(a.charAt(0).toNativeString());  //
trace(b.charAt(0).toNativeString());  //

@darmie

darmie commented Aug 21, 2017

When are we likely to have this shipped into the official Haxe source?

@frabbit
Member

frabbit commented Aug 21, 2017

I plan to work on this and it should be included in 4.0.

@andyli
Member

andyli commented Oct 9, 2017

@frabbit
We need this. It is blocking proper sys API specification, which is quite necessary for improving/redesigning haxelib.

If there is any problem, please voice out and see if we can help.

@andyli
Member

andyli commented Oct 9, 2017

@ncannasse I think you want to review frabbit's work. Assigning to you.

@frabbit
Member

frabbit commented Oct 9, 2017

@andyli would be nice if i get more reviews, so that i have a clear roadmap on how to improve the current state. Hopefully i find some time next week, but i'm quite busy atm, and on vacation afterwards. Will really come back to this in the first november week.

@frabbit
Member

frabbit commented Oct 9, 2017

Things on my roadmap:

  • Fast Iterators for all encodings
  • Faster UTF16 Impl for native Ucs2 Platforms (java, js, c#, etc)

@frabbit
Member

frabbit commented Nov 1, 2017

I have pushed the implementation for these 2 features:
[x] Fast Iterators for all encodings
[x] Faster UTF16 Impl for native Ucs2 Platforms (java, js, c#, etc)

Would be nice to know what's still missing, any bugs left? @m22spencer

What about unicode normalization? It's not part of the implementation atm.
@ncannasse @andyli

@frabbit
Member

frabbit commented Nov 2, 2017

I wonder what I should do with toLowerCase and toUpperCase; at the moment they are only applied to the ASCII ranges 0x41...0x5A and 0x61...0x7A.
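
In other words, the current behaviour amounts to something like this (a sketch of the effect, not the actual implementation):

class AsciiCase {
    // Only the ASCII letter ranges are mapped; every other code point passes through.
    public static function upper(code:Int):Int {
        return (code >= 0x61 && code <= 0x7A) ? code - 0x20 : code;
    }
    public static function lower(code:Int):Int {
        return (code >= 0x41 && code <= 0x5A) ? code + 0x20 : code;
    }
}

// AsciiCase.upper("a".code) == "A".code, but "é" or "ß" are left unchanged.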

@frabbit
Member

frabbit commented Nov 2, 2017

Some things that i would like to add to std, please let me know if that's fine @ncannasse @Simn

  • Vector + operator or append/concat method to concat two vectors into a new one.
  • Move some BytesData related stuff out of Bytes and into a helper class, BytesDataTools or something, so that Bytes and a new ByteAccess type can share common functions. ByteAccess is an abstract on top of BytesData with length access; it's used for ucs2, utf16 and utf8.
  • BytesBuffer.reset to reset and reuse a buffer
  • BytesBuffer.addBytesData or BytesBuffer.addData to push a full BytesData object into the buffer, without creating a bytes object first.
  • StringBuf.addString or something (Improve StringBuf.add to increase performance #5440)
  • BytesBuffer.addBuffer would be nice for stuff like String|Utf*.split, where I have more than one buffer at the same time.

@Simn
Member

Simn commented Nov 3, 2017

I don't see any reason not to add/change these, but can we please do that after we're done with the first step?

@frabbit
Member

frabbit commented Nov 3, 2017

Of course, the question is what exactly is the first step (i guess a merge) and what's missing to do that.

@R32
Contributor

R32 commented Nov 6, 2017

I feel like we don't need UTF-16; actually, UCS-2 is quite enough. If someone uses a 4-byte UTF-16 character such as "𤭢", just convince them not to use it, because no one knows this weird character. In fact, there are only 2,000 Chinese characters commonly used in China.

@back2dos
Member

back2dos commented Nov 6, 2017

Emojis are outside the BMP. Maybe we can live with that. Just pointing out that not everything that doesn't fit ucs2 is largely unused.

@frabbit
Member

frabbit commented Nov 20, 2017

#6748

@Simn Simn modified the milestones: Release 4.0, Design Apr 17, 2018
@farteryhr

farteryhr commented May 4, 2018

Some general ideas:

Unicode is a BIG pitfall.

When we're talking about "full support", (besides locale-dependent case conversion and composed characters with several kinds of normalization as mentioned above), we might also need to consider digits in regex: https://stackoverflow.com/questions/6998713/scanning-for-unicode-numbers-in-a-string-with-d

It would unavoidably end up being a "locale" library involving Unicode databases that add up to something many times larger than Haxe itself. The code tables will have to be shipped with the application if we want to be consistent or up to date with the latest Unicode specification. Programming languages aren't always consistent with each other on Unicode.

Considering "invalidity", there are not only surrogate pairs, but also unassigned, reserved, unused and non-character code points sometimes described as "should never occur". Which ones should we check for?

My idea is to limit the scope: don't deal with anything beyond "code points and byte representations" (that's already a lot), keep the numbers as-is and let them function (like iterators by code point).
Let word processors, font renderers and web browsers do the rest. But if there are conversion APIs available on a specific target, extern them with a specific encoding. That's enough.

I suggest a Unicode class as an abstract over Array<Int>, or something like String.dumpUnicode, defined as "each element corresponds to a code point" (sketched below). This would essentially make advanced handling of Emojis and Chinese characters (in the SIP) much easier (than any of Utf8, Utf16 and String, if you really want to).

Not specifically Int32, nor named Utf32. I think this is consistent with the core types of Haxe, like Int with unspecified precision (but at least 32 (31?) bits, enough for Unicode 42.0, I predict) and String with unspecified underlying representation (we can't assume what we're operating on at all: bytes, code points or characters? Maybe it's only OK for ASCII).
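
A minimal sketch of that idea (the name CodePoints and the details are placeholders, just to show the shape):

// Each element is one code point, so indexing is per character rather than
// per code unit or per byte.
abstract CodePoints(Array<Int>) from Array<Int> to Array<Int> {
    public var length(get, never):Int;
    inline function get_length():Int return this.length;

    @:arrayAccess inline function get(i:Int):Int return this[i];
    @:arrayAccess inline function set(i:Int, v:Int):Int return this[i] = v;

    // Character-level manipulation becomes plain array work:
    public function reverse():CodePoints {
        var out = this.copy();
        out.reverse();
        return out;
    }
}

// var cp:CodePoints = [0x1F30D, 0x41, 0x42]; // "🌍AB" as code points
// trace(cp.length); // 3, always one slot per code point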

@back2dos
Member

back2dos commented May 4, 2018

I mostly agree, however if you represent it as Array<Int> (or probably rather Vector<Int>) you're going to use 8 bytes per code point on many platforms. This is especially wasteful if you're operating in the ASCII range (which textual formats like JSON and XML tend to do). You might get away with it on the client, but on a server this will severely throttle your maximum throughput.

@farteryhr

farteryhr commented May 4, 2018

Yes, it depends on how carefully you want to process the data. Regarding existing utilities, they're OK and I don't mean to deprecate/delete any of them.

For parsing, just use a code point iterator, returning where we are, what the code point is and how big this char is, and cut out what we need accordingly, all with the native String.
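
Sketched, such an iterator could look like this (assuming a UTF-16 native String, as on js/java/c#; UTF-8 targets would need a different decode step):

typedef CodePointPos = { pos:Int, code:Int, size:Int }

class Utf16CodePointIter {
    var s:String;
    var i:Int = 0;

    public function new(s:String) this.s = s;

    public function hasNext():Bool return i < s.length;

    public function next():CodePointPos {
        var pos = i;
        var c:Int = s.charCodeAt(i++);
        // a high surrogate followed by a low surrogate forms one astral code point
        if (c >= 0xD800 && c <= 0xDBFF && i < s.length) {
            var low:Int = s.charCodeAt(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i++;
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        return { pos: pos, code: c, size: i - pos };
    }
}

// for (cp in new Utf16CodePointIter("a🌍b")) trace(cp);
// yields { pos: 0, code: 0x61, size: 1 }, { pos: 1, code: 0x1F30D, size: 2 }, { pos: 3, code: 0x62, size: 1 }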

For daily usages like finding, splitting and replacing, since the key string and the big string are both encoded the same way on the same runtime, there will be no problem with the native APIs (assuming they're coded with valid utf-8, utf-16 underlying, of course).

Issues like #6897 are essentially what my proposal intends to deal with, when the underlying format makes an unavoidable difference. String.split("") is something special (how much memory does the Array<String> take?).

And, what about a specialized Array<Int24>? funny

@Simn
Member

Simn commented Sep 4, 2018

With #7009 merged, we can close here. #6748 is still open to track further Unicode support in the standard library.
