Unicode support #3072
Comments
Please could the class haxe.Utf32 also be added? This class would be an array of 32-bit unicode code points. Unlike any of the other 3 classes, each position in the array would be a single unicode character, making character-by-character manipulation of the string very simple. The usage would be to:
As it is so simple, the haxe.Utf32 class might also provide a useful intermediate form when casting between haxe.Ucs2, haxe.Utf8 and haxe.Utf16. |
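For illustration, here is a minimal sketch of what such a code-point-array type could look like. The names and details are hypothetical, not the proposed class; it assumes a target whose native String is UTF-8 encoded and reuses haxe.Utf8.iter from the standard library for decoding.

```haxe
// Hypothetical sketch of a Utf32-style type backed by an array of code points.
abstract Utf32(Array<Int>) {
    public var length(get, never):Int;
    inline function get_length():Int return this.length;

    inline function new(codePoints:Array<Int>) this = codePoints;

    // Each array slot holds exactly one code point, so indexing is O(1)
    // and can never split a character.
    public inline function codePointAt(index:Int):Int return this[index];

    // Decode a native string (assumed UTF-8 here) into code points.
    public static function fromNativeString(s:String):Utf32 {
        var out = [];
        haxe.Utf8.iter(s, function(code) out.push(code));
        return new Utf32(out);
    }
}
```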
I haven't seen any discussion on handling Unicode normalization before. I'd hate to have a 'unicode string type' say two strings aren't equal just because one uses precomposed characters and the other combining diacritics, or have a string search return no results for similar reasons. @elliott5 all this means that even utf32 does not really make things any easier. Encoding/decoding is the easy part of Unicode. |
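To make the equality problem concrete, here is a tiny illustration (not tied to any particular proposal) of two strings that render identically but never compare equal without normalization:

```haxe
class Main {
    static function main() {
        var precomposed = "\u00E9"; // "é" as a single code point (NFC)
        var decomposed = "e\u0301"; // "e" + combining acute accent (NFD)
        trace(precomposed == decomposed); // false, although both display as "é"
        // A substring search has the same problem:
        trace("caf\u00E9".indexOf(decomposed)); // -1
    }
}
```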
I bow to your greater knowledge @deltaluca. You prompted me to read http://en.wikipedia.org/wiki/Unicode_normalization, which I found very helpful. I know the Go language much better than I know Haxe. As two of the original three inventors of the Go language (Ken Thompson and Rob Pike) also invented UTF-8, this is a topic close to their hearts. You can read about the Go approach to Unicode normalization at http://blog.golang.org/normalization. In summary, the Go project provides standard library tools which allow programmers to use any of the four Unicode normalization forms. However, the Go blog makes reference to http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC- , showing that ~99.98% of web HTML page content characters (after discarding markup and doing entity resolution) are already in NFC form (Normalization Form Canonical Composition, where characters are decomposed and then recomposed by canonical equivalence). So if Haxe were to support just one Unicode normalization form, NFC would be the obvious choice.

But as @deltaluca points out, even after NFC normalization there is still the problem of how to deal with the diacritics that remain. Go has an extensive library for handling Unicode code points (see http://golang.org/pkg/unicode/ ), at least some of whose logic will also need to be present in Haxe libraries. One note of caution though: having used my little TARDIS Go project to translate the Go "unicode" library into Haxe, I know that the data structures required are quite big, so Haxe may only want to implement a subset of that functionality. Of course one day, in my dreams, TARDIS Go will provide Haxe programmers with access to all of these Go libraries for free...

I could look at producing a proof of concept, if anyone is interested. Since all the hard work has already been done by the Go team, it seems a shame not to re-use it. Of course the automatically generated Haxe code will be big and slow, at least until the TARDIS Go project matures; but it is also free and much faster to produce than rewriting everything from scratch. |
Unifill, the library for Unicode string support I developed, provides three ways to deal with strings.
I suppose the first way is suitable for small applications or whenever string manipulation is not heavy, and the second way is good when dealing with native strings heavily. The third way would be usable in situations where the encoding is specified, for instance to decode a binary file containing encoded strings. Each way has pros and cons. |
Haxe's String API is compatible with ActionScript (and JavaScript), but the API is poor due to the lack of variable-length encoding support (i.e. surrogate pair support). Java's String API is extended to deal with surrogate pairs. |
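For illustration, this is what the missing surrogate-pair support means in practice on UTF-16 based targets such as JavaScript (a sketch, not part of any proposal):

```haxe
class Main {
    static function main() {
        // "𤭢" (U+24B62) lies outside the BMP, so UTF-16 targets store it as a
        // surrogate pair and the classic String API sees two code units.
        var s = "𤭢";
        trace(s.length);        // 2 on JS (code units), not 1 character
        trace(s.charCodeAt(0)); // 55378 (0xD852), a high surrogate, not the code point
        trace(s.charAt(0));     // half of the character, meaningless on its own
    }
}
```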
I can't see how we could handle this if not by changing ALL |
Can someone comment on the approach(es) taken by https://github.com/mandel59/unifill and if we could make use of it? |
For reference: http://site.icu-project.org/ and libraries that wrap ICU in other languages: http://site.icu-project.org/related |
@Simn broken link |
edited (no idea how that happened) |
I don't think that the Unifill using-based approach is a good idea: since all calls go through the Utf interface, no inlining is possible, so while it makes the code more generic (for instance you can implement split for all encodings), in the end it's very slow. I would instead have a Utf interface that entirely mimics the String API so it can be a drop-in replacement (no using), then an implementation for each encoding. For fixed code point encodings (UCS2, UTF32), this would benefit from not having to take code point width into account |
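A rough sketch of that drop-in idea (hypothetical names, not an actual implementation): one abstract per encoding that exposes the String API directly, so calls can be inlined instead of being dispatched through a Utf interface.

```haxe
// Hypothetical UCS-2 abstract over the native String of a UTF-16 target.
abstract Ucs2(String) {
    public var length(get, never):Int;
    inline function get_length():Int return this.length;

    public inline function new(s:String) this = s;

    // Fixed-width encoding: no code point width to take into account.
    public inline function charCodeAt(index:Int):Null<Int> return this.charCodeAt(index);
    public inline function substr(pos:Int, ?len:Int):Ucs2 return new Ucs2(this.substr(pos, len));
}
```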
You're complaining about the Utf interface, but then suggest using a Utf interface instead. ?_? |
Hearing all the lacking-unicode-support-talk going on at WWX2015, I thought maybe the unorm project might be of help. It implements unicode normalization and can be used as a polyfill for the ES6-proposed String.prototype.normalize(). |
Just to reinforce what I said at my WWX2015 talk last Saturday, the existing body of Unicode-related Go packages could easily be auto-translated to Haxe using TARDISgo, including:
The main issue would be creating a more user-friendly and Haxe-friendly API wrapper for the generated code, as the current autogenerated API is rather ugly; see the normalization example from my talk. However, this is nothing like as much work as writing the libraries from scratch. By auto-generating the libraries in this way, the Haxe ecosystem could very quickly have a complete working solution, one which leverages the many man-years of work that the Google Go team has put into making these libraries correct. Of course the generated code is larger and slower than hand-written Haxe would be, so there would be opportunities to rewrite key elements of the libraries to improve speed/size in future. |
I've rewritten Unifill to avoid using the Utf interface. Now the UtfX types are abstracts. |
Unifill doesn't support Ucs2, because what we call UCS-2 is actually potentially ill-formed UTF-16 in most cases. (cf. https://simonsapin.github.io/wtf-8/#ill-formed-utf-16) |
unifill seems to have all the required tools already, so is this just a matter of changing how to interact with it? |
When we add support for unicode, let us please also include an |
@ncannasse: Please follow up. |
I just want to update everyone, the current status of Unicode support can be found here: I'm planning to get it done in the next few weeks, feedback appreciated. |
Great job so far! Couple of notes:
var s = new haxe.i18n.Utf8("☔︎한🌍"); // 3 characters.
trace(s.charAt(0).toNativeString()); //☔ Hmm.. this umbrella looks different
trace(s.charAt(1).toNativeString()); // Nothing?
trace(s.charAt(2).toNativeString()); //ᄒ ???
trace(s.substr(2,1).toNativeString()); //ᄒ
trace(s.substr(2,2).toNativeString()); //하
trace(s.substr(2,3).toNativeString()); //한
var a = new haxe.i18n.Utf8("한"); // 1 character.
var b = new haxe.i18n.Utf8("한"); // 1 character.
trace(a.charAt(0).toNativeString()); //ᄒ
trace(b.charAt(0).toNativeString()); //한 |
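A likely explanation for the difference between a and b above (an assumption based on the output, not something stated in the snippet): the first literal uses decomposed Hangul jamo (NFD) while the second is the precomposed syllable (NFC).

```haxe
class Main {
    static function main() {
        var decomposed = "\u1112\u1161\u11AB"; // NFD: jamo ᄒ + ᅡ + ᆫ
        var precomposed = "\uD55C";            // NFC: the single syllable 한
        trace(decomposed == precomposed); // false, although both render as 한
        trace(decomposed.length);         // 3 code units on UTF-16 targets
        trace(precomposed.length);        // 1
    }
}
```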
When are we likely to have this shipped into the official Haxe source? |
I plan to work on this and it should be included in 4.0. |
@frabbit If there is any problem, please speak up and we'll see if we can help. |
@ncannasse I think you want to review frabbit's work. Assigning to you. |
@andyli It would be nice if I got more reviews, so that I have a clear roadmap on how to improve the current state. Hopefully I'll find some time next week, but I'm quite busy atm, and on vacation afterwards. I will really come back to this in the first week of November. |
Things on my roadmap:
|
I have pushed the implementation for these 2 features: It would be nice to know what's still missing; are there any bugs left? @m22spencer what about Unicode normalization? It's not part of the implementation atm. |
I wonder what I should do with toLowerCase and toUpperCase; at the moment they are only applied to the ASCII ranges 0x41...0x5A and 0x61...0x7A |
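For reference, a minimal sketch of what that ASCII-only mapping amounts to at the code point level (an illustration, not the actual implementation):

```haxe
class AsciiCase {
    // Map only the ASCII letter ranges 0x41..0x5A ('A'..'Z') and 0x61..0x7A ('a'..'z');
    // any other code point is returned unchanged.
    public static inline function toUpper(codePoint:Int):Int {
        return (codePoint >= 0x61 && codePoint <= 0x7A) ? codePoint - 0x20 : codePoint;
    }

    public static inline function toLower(codePoint:Int):Int {
        return (codePoint >= 0x41 && codePoint <= 0x5A) ? codePoint + 0x20 : codePoint;
    }
}
```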
Some things that I would like to add to the std, please let me know if that's fine @ncannasse @Simn
|
I don't see any reason not to add/change these, but can we please do that after we're done with the first step? |
Of course. The question is what exactly the first step is (I guess a merge) and what's missing to do that. |
I feel like we don't need UTF-16; actually UCS-2 is quite enough. If someone uses a 4-byte UTF-16 character such as "𤭢", just convince them not to use that character, because no one knows this weird character. In fact, there are only 2,000 Chinese characters commonly used in China. |
Emojis are outside the BMP. Maybe we can live with that; just pointing out that not everything that doesn't fit in UCS-2 is largely unused. |
Some general ideas: Unicode is a BIG pitfall. When we're talking about "full support" (besides locale-dependent case conversion and composed characters with several kinds of normalization, as mentioned above), we might also need to consider digits in regexes: https://stackoverflow.com/questions/6998713/scanning-for-unicode-numbers-in-a-string-with-d

It unavoidably ends up being a "locale" library involving Unicode databases that add up to something larger than Haxe itself. The code tables will have to be shipped with the application if we want to be consistent or up to date with the latest Unicode specification. Programming languages aren't always consistent on Unicode with each other. Considering "invalidity", there are not only surrogate pairs, but also unassigned, reserved, unused and non-character code points, sometimes described as "should never occur". Which ones should we check for?

My idea is to limit the scope and not deal with anything above "code points and byte representations" (that's already a lot): keep the numbers as-is and let them function (for example via iterators by code point). I suggest a dedicated code point type, not specifically Int32, nor named Utf32. I think this is consistent with the core types of Haxe, like Int with unspecified precision (but at least 32 (31?) bits, enough for Unicode 42.0, I predict) and String with unspecified underlying representation (we can't assume what we're operating on at all: bytes, code points or characters? Maybe that's only OK for ASCII.)
I mostly agree, however if you represent it as |
Yes, it depends on how carefully you want to process the data. Regarding the existing utilities, they're OK and I don't mean to deprecate/delete any of them.

For parsing, just use a code point iterator, returning where we are, what the code point is and how big this char is, and cut out what we need according to them, all with the native APIs.

For daily usages like finding, splitting and replacing, since the key string and the big string are both encoded the same way on the same runtime, there will be no problem with the native APIs (assuming they're backed by valid UTF-8 or UTF-16 underneath, of course). Issues like #6897 are essentially what my proposal intends to deal with, when the underlying format makes an unavoidable difference. String.split("") is something special (how much memory does the |
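A sketch of the "iterate by code point" idea described above: report where we are, what the code point is, and how many code units it spans. It assumes a UTF-16 backed native String (e.g. the JS target); the class name and result shape are hypothetical.

```haxe
class CodePointIterator {
    var s:String;
    var i:Int = 0;

    public function new(s:String) {
        this.s = s;
    }

    public function hasNext():Bool {
        return i < s.length;
    }

    // Usage: for (cp in new CodePointIterator("a𤭢b")) trace(cp);
    public function next():{pos:Int, code:Int, width:Int} {
        var pos = i;
        var c:Int = s.charCodeAt(i);
        // Combine a high/low surrogate pair into a single code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
            var low:Int = s.charCodeAt(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i += 2;
                return { pos: pos, code: ((c - 0xD800) << 10) + (low - 0xDC00) + 0x10000, width: 2 };
            }
        }
        i++;
        return { pos: pos, code: c, width: 1 };
    }
}
```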
We want to add to the standard library fully tested versions of the following classes:
They will be implemented as abstracts and will follow the String API + conversions between them and fromBytes/toBytes.
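A rough sketch of that shape (hypothetical, not the final API): an abstract over haxe.io.Bytes that would follow the String API, plus the fromBytes/toBytes and native-string conversions mentioned above.

```haxe
import haxe.io.Bytes;

// Hypothetical Utf8 abstract backed by Bytes; only the conversion surface is shown.
abstract Utf8(Bytes) {
    inline function new(b:Bytes) this = b;

    public static inline function fromBytes(b:Bytes):Utf8 return new Utf8(b);
    public inline function toBytes():Bytes return this;

    public static inline function fromNativeString(s:String):Utf8 return new Utf8(Bytes.ofString(s));
    public inline function toNativeString():String return this.getString(0, this.length);
}
```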