
Unicode support #3072

Closed
ncannasse opened this issue May 29, 2014 · 44 comments

@ncannasse
Member

We want to add fully tested versions of the following classes to the standard library:

  • haxe.Ucs2
  • haxe.Utf8
  • haxe.Utf16

They will be implemented as abstracts and will follow the String API, plus conversions between them and fromBytes/toBytes.
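
For illustration, a minimal sketch of the shape such an abstract could take, using Ucs2 as the simplest case (the details below are guesses, not part of the proposal; little-endian byte order is assumed arbitrarily):

// Hypothetical sketch only: one possible shape for the proposed haxe.Ucs2,
// stored as an abstract over haxe.io.Bytes with two bytes per code unit.
abstract Ucs2(haxe.io.Bytes) {
    public var length(get, never):Int;

    inline function new(b:haxe.io.Bytes) this = b;

    inline function get_length():Int return this.length >> 1;

    // Mirrors String.charCodeAt; every position is exactly one 16-bit unit.
    public function charCodeAt(index:Int):Null<Int> {
        if (index < 0 || index * 2 + 1 >= this.length) return null;
        return this.get(index * 2) | (this.get(index * 2 + 1) << 8);
    }

    public static inline function fromBytes(b:haxe.io.Bytes):Ucs2 return new Ucs2(b);
    public inline function toBytes():haxe.io.Bytes return this;
}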

@elliott5

Please could the class haxe.Utf32 also be added?

This class would be an array of 32-bit unicode code points. Unlike any of the other 3 classes, each position in the array would be a single unicode character, making character-by-character manipulation of the string very simple.

The usage would be to:

  • Cast the required haxe.Ucs2, haxe.Utf8 or haxe.Utf16 strings to haxe.Utf32;
  • Do whatever character-position based manipulations are required between strings in the haxe.Utf32 form, using simple array-index-for-each-character logic; and
  • Cast the result back to the required storage encoding of haxe.Ucs2, haxe.Utf8 or haxe.Utf16.

As it is so simple, the haxe.Utf32 class might also provide a useful intermediate form when casting between haxe.Ucs2, haxe.Utf8 and haxe.Utf16.
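
As a purely hypothetical usage sketch (none of these classes or casts exist yet; the lines below just illustrate the round trip described above):

// Hypothetical only: assumes the proposed haxe.Utf8 / haxe.Utf32 existed.
var u8 = haxe.Utf8.fromBytes(bytes);   // bytes: some haxe.io.Bytes holding UTF-8 data
var u32:haxe.Utf32 = cast u8;          // one array slot per unicode code point
var first = u32[0];                    // character-by-character access by index
u32[0] = "#".code;                     // simple index-based edits
var back:haxe.Utf8 = cast u32;         // back to the storage encoding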

@deltaluca
Contributor

I haven't seen any discussion on handling unicode normalization before. I'd hate to have a 'unicode string type' say two strings aren't equal just because one uses precomposed characters and the other combining diacritics, or have a string search return no results for similar reasons.

@elliott5 all this means that even utf32 does not really make things any easier. Encoding/decoding is the easy part of unicode.

@elliott5

elliott5 commented Jun 1, 2014

I bow to your greater knowledge @deltaluca. You prompted me to read http://en.wikipedia.org/wiki/Unicode_normalization, which I found very helpful.

I know the Go language much better than I know Haxe. As two of the original three inventors of the Go language (Ken Thompson and Rob Pike) also invented UTF-8, this is a topic close to their hearts. You can read about the Go approach to Unicode normalization at http://blog.golang.org/normalization. In summary, the Go project provides standard library tools which allow programmers to use any of the four Unicode normalization forms.

However, the Go blog references http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC- , which shows that ~99.98% of web HTML page content characters (after discarding markup and doing entity resolution) are already in NFC form (Normalization Form Canonical Composition, where characters are decomposed and then recomposed by canonical equivalence).

So if Haxe were to support just one Unicode normalisation form, NFC would be the obvious choice.

But as @deltaluca points out, even after NFC normalization there is still the problem of how to deal with those diacritics that remain. Go has an extensive library for handling Unicode code points (see http://golang.org/pkg/unicode/ ), at least some of which logic will also need to be present in Haxe libraries.

One note of caution though: having used my little TARDIS Go project to translate the Go "unicode" library into Haxe, I know that the data structures required are quite big, so Haxe may only want to implement a subset of that functionality.

Of course one day, in my dreams, TARDIS Go will provide Haxe programmers with access to all of these Go libraries for free...

I could look at producing a proof-of-concept, if anyone is interested. Since all the hard work has already been done by the Go team, it seems a shame not to re-use it. Of course the automatically generated Haxe code produced will be big and slow, at least until the TARDIS Go project matures; but also free and much faster to generate than rewriting it from scratch.

@mandel59
Contributor

Unifill, the library for Unicode string support I developed, provides three ways to deal with strings.

  1. The unifill.Unifill utility class, providing wrapper methods that treat native strings as UTF-32-encoded strings. It is probably the easiest way to make an application support Unicode, but it might cause a performance regression. Index access is O(n). (See the example below.)
  2. The unifill.InternalEncoding class, providing the necessary methods to deal with native strings. Its methods are intended as operations on UTF encodings without considering the actual encoding of native strings. Index access is O(1) but indexed by code units.
  3. The unifill.UtfX classes (unifill.Utf8, unifill.Utf16, unifill.Utf32). They hold strings encoded as their names suggest. They are like haxe.Utf16 as proposed here, although UtfX isn't an abstract but a class, to make it a subtype of the Utf interface. Utf defines the minimum necessary methods to deal with each encoding. Utf is defined here: https://github.com/mandel59/unifill/blob/master/unifill/Utf.hx When you adopt this solution, you have to convert between native strings and UtfX.

I suppose the first way is suitable for small applications or wherever string manipulation is not heavy, and the second way is good when dealing with native strings heavily. The third way would be usable in situations where the encoding is specified, for instance to decode a binary file containing encoded strings. Each way has pros and cons.
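
A minimal example of the first style (method names as in the Unifill README; double-check the exact signatures):

using unifill.Unifill;

class Main {
    static function main() {
        var s = "🌍x";           // one astral code point, then an ASCII char
        trace(s.uLength());      // 2 code points, whatever the native encoding
        trace(s.uCharAt(0));     // "🌍" as a whole character
        trace(s.uIndexOf("x"));  // 1, counted in code points
    }
}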

@mandel59
Contributor

Haxe's String API is compatible with ActionScript (and JavaScript), but the API is poor due to its lack of variable-length encoding support (i.e. surrogate pair support). Java's String API has been extended to deal with surrogate pairs.
UTF-16 support is very important (see the Unicode Beyond-BMP Top Ten List—2014 Redux). I suppose Haxe should provide methods to support it.
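
A minimal illustration of the problem on targets whose native strings are UTF-16 (js, Java, C#); exact results differ on UTF-8-based targets:

var s = "🌍";             // U+1F30D, outside the BMP
trace(s.length);          // 2 on UTF-16 targets: stored as a surrogate pair
trace(s.charCodeAt(0));   // 55356 (0xD83C), the high surrogate
trace(s.charCodeAt(1));   // 57101 (0xDF0D), the low surrogate
trace(s.charAt(0));       // half a character, not meaningful on its own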

@waneck
Member

waneck commented Oct 1, 2014

I can't see how we could handle this if not by changing ALL haxe.io.Input/haxe.io.Output to take an additional encoding parameter when reading to a String or writing from a String.
The current haxe.Utf8-like API simply has no meaning on some platforms, like Java and C#. These platforms have a very specific encoding in their Strings, and there's no meaning in haxe.Utf8.encode's signature being String->String.

@Simn Simn mentioned this issue Jan 22, 2015
@Simn Simn modified the milestones: 3.3, 3.2 Feb 22, 2015
@Simn
Member

Simn commented Jun 3, 2015

Can someone comment on the approach(es) taken by https://github.com/mandel59/unifill and if we could make use of it?

@deltaluca
Contributor

For reference: http://site.icu-project.org/ and libraries that wrap ICU in other languages: http://site.icu-project.org/related

@ncannasse
Member Author

@Simn broken link

@Simn
Member

Simn commented Jun 3, 2015

edited (no idea how that happened)

@ncannasse
Member Author

I don't think that using Unifill is a good idea: since all calls go through the Utf interface, no inlining is possible, so while it makes the code more generic (for instance you can implement split once for all encodings), in the end it's very slow.

I would instead have a Utf interface that entirely mimics the String API so it can be a drop-in replacement (no using), then an implementation for each encoding. For fixed-width codepoint encodings (UCS2, UTF32), this would benefit from not having to take codepoint width into account.

@Simn
Member

Simn commented Jun 3, 2015

You're complaining about the Utf interface, but then suggest using a Utf interface instead. ?_?

@mpcref

mpcref commented Jun 3, 2015

Hearing all the talk about lacking Unicode support going on at WWX2015, I thought maybe the unorm project might be of help. It implements unicode normalization and can be used as a polyfill for the ES6-proposed String.prototype.normalize().

@elliott5

elliott5 commented Jun 4, 2015

Just to reinforce what I said at my WWX2015 talk last Saturday, the existing body of Unicode-related Go packages could easily be auto-translated to Haxe using TARDISgo, including:

The main issue would be creating a more user-friendly and Haxe-friendly API wrapper for the generated code, as the current autogenerated API is rather ugly; see the normalization example from my talk. However, this is nowhere near as much work as writing the libraries from scratch.

By auto-generating the libraries in this way, the Haxe ecosystem could very quickly have a complete working solution, one which leverages the many man-years of work that the Google Go team have put into making these libraries correct. Of course the generated code is larger and slower than hand-written Haxe would be, so there would be opportunities to re-write key elements of the libraries to improve speed/size in future.

@mandel59
Contributor

mandel59 commented Jun 4, 2015

I've rewritten Unifill to avoid using the Utf interface. Now the UtfX types are abstracts.

@mandel59
Contributor

mandel59 commented Jun 4, 2015

Unifill doesn't support Ucs2, because what we call UCS-2 is actually potentially ill-formed UTF-16 in most cases. (cf. https://simonsapin.github.io/wtf-8/#ill-formed-utf-16)
We should, however, consider how to deal with potentially ill-formed UTF-16 (or UTF-8) strings. Ideally, dirty strings would be separated into different types, but the domain of Haxe is not an ideal world but the real one. While Unifill provides a validate method to assert string encoding validity, that may not be enough.

@Simn
Member

Simn commented Jun 6, 2015

unifill seems to have all the required tools already, so is this just a matter of changing how to interact with it?

@waneck
Member

waneck commented Jun 6, 2015

When we add support for unicode, let us please also include an encoding option for readString or any other way we provide to read from Bytes to a String (we may need to include it in the CFFI as well, by the way). This way we can still be sure that the string is always in a valid encoding internally, and there is no awkward encode function with a type like String->String.
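
A hypothetical sketch of the requested shape (placeholder names, not existing Haxe APIs):

// Placeholder names only; the point is the explicit encoding parameter.
enum Encoding {
    UTF8;
    UTF16;
    RawNative; // whatever encoding the target's native String uses internally
}

interface EncodingAwareInput {
    // decode `len` bytes from the stream using the given encoding
    function readString(len:Int, ?encoding:Encoding):String;
}

interface EncodingAwareOutput {
    // encode the string before writing, instead of a String->String encode step
    function writeString(s:String, ?encoding:Encoding):Void;
}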

@Simn
Member

Simn commented Jul 21, 2015

@ncannasse: Please follow up.

@frabbit
Member

frabbit commented Jun 3, 2017

i just want to update everyone, the current status of unicode can be found here:
https://github.com/frabbit/haxe-1/tree/unicode_vector

i'm planning to get it done in the next weeks, feedback appreciated.

@m22spencer
Contributor

Great job so far!

Couple of notes:

  • ucs2/utf16.fromCharCode(0xD800)
    Why is 0xD800 not accepted? The break test file requires it.
    http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt

  • Mandatory conversion
    Consider C#/Utf16 impl.
    C# String type is Utf16. (Confirm?)

    i18n.Utf16 converts C# String -> Bytes on construction
    and Bytes -> C# String on toNativeString.
    Doesn't this make interop with existing functions very expensive?
    Can each native platform use String directly for that particular i18n repr?
    eg:

    • C#: String == i18n.Utf16
    • js: String == i18n.Ucs2
    • hl: String == i18n.Utf16
    • ...
  • Grapheme clusters
    Even with the unicode-aware representation, another layer of abstraction is needed in order to get the expected results.

    See below (these results are the same for Ucs2/Utf8/Utf16/Utf32):

var s = new haxe.i18n.Utf8("☔︎한🌍");    // 3 characters.

trace(s.charAt(0).toNativeString());   //☔ Hmm.. this umbrella looks different
trace(s.charAt(1).toNativeString());   //  Nothing?
trace(s.charAt(2).toNativeString());   //ᄒ ???

trace(s.substr(2,1).toNativeString()); //
trace(s.substr(2,2).toNativeString()); //하
trace(s.substr(2,3).toNativeString()); //한

var a = new haxe.i18n.Utf8("한");      // 1 character.
var b = new haxe.i18n.Utf8("");      // 1 character.
trace(a.charAt(0).toNativeString());  //
trace(b.charAt(0).toNativeString());  //

@darmie

darmie commented Aug 21, 2017

When are we likely to have this shipped into the official Haxe source?

@frabbit
Member

frabbit commented Aug 21, 2017

I plan to work on this and it should be included in 4.0.

@andyli
Member

andyli commented Oct 9, 2017

@frabbit
We need this. It is blocking proper sys API specification, which is quite necessary for improving/redesigning haxelib.

If there is any problem, please voice out and see if we can help.

@andyli
Member

andyli commented Oct 9, 2017

@ncannasse I think you want to review frabbit's work. Assigning to you.

@frabbit
Member

frabbit commented Oct 9, 2017

@andyli would be nice if i get more reviews, so that i have a clear roadmap on how to improve the current state. Hopefully i find some time next week, but i'm quite busy atm, and on vacation afterwards. Will really come back to this in the first november week.

@frabbit
Member

frabbit commented Oct 9, 2017

Things on my roadmap:

  • Fast Iterators for all encodings
  • Faster UTF16 Impl for native Ucs2 Platforms (java, js, c#, etc)

@frabbit
Member

frabbit commented Nov 1, 2017

I have pushed the implementation for these 2 features:
[x] Fast Iterators for all encodings
[x] Faster UTF16 Impl for native Ucs2 Platforms (java, js, c#, etc)

Would be nice to know what's still missing, any bugs left? @m22spencer

What about unicode normalization? It's not part of the implementation atm.
@ncannasse @andyli

@frabbit
Member

frabbit commented Nov 2, 2017

I wonder what I should do with toLowerCase and toUpperCase; at the moment they are only applied to the ASCII ranges 0x41...0x5A and 0x61...0x7A.
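
In other words, the current behaviour amounts to something like this (a sketch of the effect, not the actual implementation):

class AsciiCase {
    // Only the ASCII letter ranges are mapped; every other code point passes through.
    public static function upper(code:Int):Int {
        return (code >= 0x61 && code <= 0x7A) ? code - 0x20 : code;
    }
    public static function lower(code:Int):Int {
        return (code >= 0x41 && code <= 0x5A) ? code + 0x20 : code;
    }
}

// AsciiCase.upper("a".code) == "A".code, but "é" or "ß" are left unchanged.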

@frabbit
Member

frabbit commented Nov 2, 2017

Some things that i would like to add to std, please let me know if that's fine @ncannasse @Simn

  • Vector + operator or append/concat method to concat two vectors into a new one.
  • Move some BytesData related stuff out of Bytes and into a helper class, BytesDataTools or something, so that Bytes and a new ByteAccess type can share common functions. ByteAccess is an abstract on top of BytesData with length access; it's used for ucs2, utf16 and utf8.
  • BytesBuffer.reset to reset and reuse a buffer
  • BytesBuffer.addBytesData or BytesBuffer.addData to push a full BytesData object into the buffer, without creating a bytes object first.
  • StringBuf.addString or something (Improve StringBuf.add to increase performance #5440)
  • BytesBuffer.addBuffer would be nice for stuff like String|Utf*.split, where I have more than one buffer at the same time.

@Simn
Member

Simn commented Nov 3, 2017

I don't see any reason not to add/change these, but can we please do that after we're done with the first step?

@frabbit
Member

frabbit commented Nov 3, 2017

Of course, the question is what exactly is the first step (i guess a merge) and what's missing to do that.

@R32
Contributor

R32 commented Nov 6, 2017

I feel like we don't need UTF-16; actually, UCS-2 is quite enough. If someone uses a 4-byte UTF-16 character such as "𤭢", just convince them not to use it, because no one knows this weird character. In fact, there are only 2,000 Chinese characters commonly used in China.

@back2dos
Member

back2dos commented Nov 6, 2017

Emojis are outside the BMP. Maybe we can live with that. Just pointing out that not everything that doesn't fit ucs2 is largely unused.

@frabbit
Member

frabbit commented Nov 20, 2017

#6748

@Simn Simn modified the milestones: Release 4.0, Design Apr 17, 2018
@farteryhr

farteryhr commented May 4, 2018

Some general ideas:

Unicode is a BIG pitfall.

When we're talking about "full support", (besides locale-dependent case conversion and composed characters with several kinds of normalization as mentioned above), we might also need to consider digits in regex: https://stackoverflow.com/questions/6998713/scanning-for-unicode-numbers-in-a-string-with-d

It would unavoidably end up being a "locale" library involving Unicode databases that add up to something many times larger than Haxe itself. The code tables will have to be shipped with the application if we want to be consistent or up to date with the latest Unicode specification. Programming languages aren't always consistent with each other on Unicode.

Considering "invalidity", there are not only surrogate pairs, but also unassigned, reserved, unused and non-character code points sometimes described as "should never occur". Which ones should we check for?

My idea is to limit the scope: don't deal with anything beyond "code points and byte representations" (that's already a lot), keep the numbers as-is and let them function (like iterators by code point).
Let word processors, font renderers and web browsers do the rest. But if there are conversion APIs available on a specific target, extern them with a specific encoding. That's enough.

I suggest a Unicode class as an abstract over Array<Int>, or something like String.dumpUnicode, defined as "each element corresponds to a code point" (sketched below). This would essentially make advanced handling of Emojis and Chinese characters (in the SIP) much easier (than any of Utf8, Utf16 and String, if you really want to).

Not specifically Int32, nor named Utf32. I think this is consistent with the core types of Haxe, like Int with unspecified precision (but at least 32 (31?) bits, enough for Unicode 42.0, I predict) and String with unspecified underlying representation (we can't assume what we're operating on at all: bytes, code points or characters? Maybe it's only OK for ASCII).
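
A minimal sketch of that idea (the name CodePoints and the details are placeholders, just to show the shape):

// Each element is one code point, so indexing is per character rather than
// per code unit or per byte.
abstract CodePoints(Array<Int>) from Array<Int> to Array<Int> {
    public var length(get, never):Int;
    inline function get_length():Int return this.length;

    @:arrayAccess inline function get(i:Int):Int return this[i];
    @:arrayAccess inline function set(i:Int, v:Int):Int return this[i] = v;

    // Character-level manipulation becomes plain array work:
    public function reverse():CodePoints {
        var out = this.copy();
        out.reverse();
        return out;
    }
}

// var cp:CodePoints = [0x1F30D, 0x41, 0x42]; // "🌍AB" as code points
// trace(cp.length); // 3, always one slot per code point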

@back2dos
Member

back2dos commented May 4, 2018

I mostly agree, however if you represent it as Array<Int> (or probably rather Vector<Int>) you're going to use 8 bytes per code point on many platforms. This is especially wasteful if you're operating in the ASCII range (which textual formats like JSON and XML tend to do). You might get away with it on the client, but on a server this will severely throttle your maximum throughput.

@farteryhr

farteryhr commented May 4, 2018

Yes, it depends on how carefully you want to process the data. Regarding existing utilities, they're OK and I don't mean to deprecate/delete any of them.

For parsing, just use a code point iterator, returning where we are, what the code point is and how big this char is, and cut out what we need accordingly, all with the native String.
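
Sketched, such an iterator could look like this (assuming a UTF-16 native String, as on js/java/c#; UTF-8 targets would need a different decode step):

typedef CodePointPos = { pos:Int, code:Int, size:Int }

class Utf16CodePointIter {
    var s:String;
    var i:Int = 0;

    public function new(s:String) this.s = s;

    public function hasNext():Bool return i < s.length;

    public function next():CodePointPos {
        var pos = i;
        var c:Int = s.charCodeAt(i++);
        // a high surrogate followed by a low surrogate forms one astral code point
        if (c >= 0xD800 && c <= 0xDBFF && i < s.length) {
            var low:Int = s.charCodeAt(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i++;
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        return { pos: pos, code: c, size: i - pos };
    }
}

// for (cp in new Utf16CodePointIter("a🌍b")) trace(cp);
// yields { pos: 0, code: 0x61, size: 1 }, { pos: 1, code: 0x1F30D, size: 2 }, { pos: 3, code: 0x62, size: 1 }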

For daily usages like finding, splitting and replacing, since the key string and the big string are both encoded the same way on the same runtime, there will be no problem with the native APIs (assuming they're coded with valid utf-8, utf-16 underlying, of course).

Issues like #6897 are essentially what my proposal intends to deal with, when the underlying format makes an unavoidable difference. String.split("") is something special (how much memory does the Array<String> take?).

And, what about a specialized Array<Int24>? funny

@Simn
Member

Simn commented Sep 4, 2018

With #7009 merged, we can close here. #6748 is still open to track further Unicode support in the standard library.
