Add UTF8 (and maybe UTF16) to System.Runtime.InteropServices.CharSet + Marshal class #4257
Comments
Hmm. Now, don't get me wrong, I like UTF8 as much as anyone. Its technical superiority over UTF-16 is obvious, and I'd definitely like to see it replace UTF-16. But there's more than a little exaggeration in that claim as stated! The platform native string type in both Windows and OSX is UTF-16. The platform native string type in both the CLR and the JVM is UTF-16. Take those out... and there's precious little left of "the world"!
@masonwheeler: That may be, but I would argue your perspective is too narrow. You're looking at things from a framework-internal perspective. When using P/Invoke you are typically (or at least often) interacting with libraries hosted outside the .NET framework, often written in plain C. Most of those libraries expect traditional 8-bit ANSI/ASCII text, and in those cases UTF-8 is the only reliable way of getting Unicode content across. And if we're going to start P/Invoking on platforms other than Windows, UTF-8 is de facto the only way to represent Unicode at the system level.
@josteink Ah, you're right. That's a very good point.
+9000 I've worked with internal applications that P/Invoked into custom C DLLs, and we had to maintain tons of custom marshaling code because using UTF-16 was out of the question due to the memory requirements. Automatic UTF-8 marshaling would be so nice to have.
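A minimal sketch of the kind of hand-rolled UTF-8 marshaling code such teams end up maintaining; the library name and entry point below are hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class NativeMethods
{
    // The native side takes a NUL-terminated UTF-8 string; we pass a raw pointer
    // because the built-in string marshaling cannot be told to use UTF-8.
    [DllImport("mylib")]
    private static extern int my_native_set_name(IntPtr utf8Name);

    public static int SetName(string name)
    {
        // Encode to UTF-8 by hand and append the NUL terminator the C side expects.
        byte[] bytes = Encoding.UTF8.GetBytes(name);
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length + 1);
        try
        {
            Marshal.Copy(bytes, 0, buffer, bytes.Length);
            Marshal.WriteByte(buffer, bytes.Length, 0); // NUL terminator
            return my_native_set_name(buffer);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```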
While the name is misleading, on UNIX CoreCLR's ANSI marshaling is UTF8. |
@stephentoub The name is not only misleading, it is incompatible with the behavior on Windows. While this is what Mono has done, it causes problems for people who want to target both Unix and Windows, as evidenced by people having to resort over and over to the two approaches described above. I do not need the extra work in Mono (just like I assume you guys do not want the extra work), but it makes the platform very unpleasant to work with for everyone who has to consider more than one platform. This is a request on behalf of the users of what is a half-baked and broken setup.
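To illustrate the ambiguity being discussed (the library name and entry point are hypothetical): with the declaration below, CoreCLR and Mono on Unix marshal the string as UTF-8, while on Windows it is converted to the active ANSI code page, so the same non-ASCII input reaches the native side in different encodings depending on the OS.

```csharp
using System.Runtime.InteropServices;

internal static class AnsiAmbiguityExample
{
    // CharSet.Ansi means "system ANSI code page" on Windows but UTF-8 on Unix,
    // so the bytes the native function receives differ by platform.
    [DllImport("mylib", CharSet = CharSet.Ansi)]
    internal static extern int my_native_log(string message);
}
```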
cc @yizhang82 - maybe a good candidate for MCG (Marshaling Code Generator)
If this ever happens, would it mean we will no longer be required to use
Supporting UTF-8 marshalling does sound like something we might want to support in the future, most likely in the MCG that @jkotas has mentioned earlier. MCG is the new interop technology we have in .NET Native, and is a great place for experimenting with newer features like this. Implementing marshalling support in the CLR itself is rather painful - you'll need to write code-generation code that emits IL, while in MCG you write code that spits out C#. Eventually we'd like to see all the CLR runtimes (.NET Native being one of them) use the same underlying interop technology so that we don't have to implement things twice and maintain two separate code bases. @jasonwilliams200OK It's unlikely we'll be able to support marshalling directly to C++ std::strings. We don't necessarily want to tie ourselves to a particular layout of a C++ string implementation. I think C++/CLI libraries should provide a good way to marshal between managed strings and C++ strings.
@yizhang82 C++ std::string is happy with C strings, as is C#. Why not just assume that native code and C# will be using char[] or wchar_t[] for string import/export? That said, the UTF-8 marshaler is critical. The Libgit2Sharp project has a fairly good implementation without too many contributors (one, really); it might be worth speaking with them about making their implementation part of the coreclr, or at least a good starting point. /CC @nulltoken
👍 for me. @paulcbetts @phkelley @tclem @dahlbyk AFAIR you all contributed to it. Thoughts?
First-class UTF8 seems reasonable to me. Relevant LG2S classes:
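A minimal sketch of a LibGit2Sharp-style custom UTF-8 marshaler, roughly the kind of class referenced above; a production implementation also has to handle ownership of strings returned by the native side, which this sketch glosses over.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public sealed class Utf8Marshaler : ICustomMarshaler
{
    private static readonly Utf8Marshaler Instance = new Utf8Marshaler();

    // Required by the runtime: looked up by name when the marshaler is used.
    public static ICustomMarshaler GetInstance(string cookie) => Instance;

    public IntPtr MarshalManagedToNative(object managedObj)
    {
        if (managedObj is not string s) return IntPtr.Zero;
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        IntPtr ptr = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        Marshal.WriteByte(ptr, bytes.Length, 0); // NUL terminator
        return ptr;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        if (pNativeData == IntPtr.Zero) return null;
        int len = 0;
        while (Marshal.ReadByte(pNativeData, len) != 0) len++; // find NUL
        byte[] bytes = new byte[len];
        Marshal.Copy(pNativeData, bytes, 0, len);
        return Encoding.UTF8.GetString(bytes);
    }

    public void CleanUpNativeData(IntPtr pNativeData) => Marshal.FreeHGlobal(pNativeData);
    public void CleanUpManagedData(object managedObj) { }
    public int GetNativeDataSize() => -1;
}
```

It would be attached per parameter with `[MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(Utf8Marshaler))]`, which is exactly the per-call ceremony that first-class UTF-8 support would remove.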
^^ This would be great. There's really no reason to assume text is ASCII when marshaling strings; the only two options in modern software today are UTF-16 (i.e., Windows / OS X native style) and UTF-8 (basically every OSS lib). +1 for this feature.
We have the API proposal out for review: https://github.com/dotnet/corefx/issues/7804
CC: @leemgs This seems to be related to the unexplained issue that depends on the LOCALE environment variables (not working if the locale is ko_KR.UTF8).
@migueldeicaza Has anything happened related to this in the past few years? Especially now that .NET is focusing on better Unicode support with stuff like
Nothing has happened. It is funny because this issue was the oldest bug we kept open for Mono - filed by Red Hat sometime around 2002-2003. We tried ECMA at the time, we tried every contact we had, and even now it seems to be a part of the code that nobody wants to touch and is scared of changing. After Xamarin was acquired I had various discussions in person about it. At this point developers have resorted to a spectrum of workarounds, hacks and band-aids, depending on just how badly they need this and how much performance they require. In the end, the existence of those suboptimal workarounds is what prevents this from being considered. The world continues, but it remains unnecessarily complex for newcomers.
We tried to add UTF-8 support to the built-in marshaling in dotnet/coreclr#18186. In the meantime, the following related things happened that made the UTF8 interop story better:
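Presumably among those related items are `UnmanagedType.LPUTF8Str` and the `Marshal` UTF-8 conversion helpers, both of which exist in current .NET. A minimal usage sketch (the library name and entry point are hypothetical):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class Utf8Helpers
{
    // Per-parameter UTF-8 marshaling without writing a custom marshaler.
    [DllImport("mylib")]
    internal static extern int my_native_set_name(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string name);

    // Manual conversions for when you already hold a raw pointer.
    internal static string ReadUtf8(IntPtr p) => Marshal.PtrToStringUTF8(p) ?? string.Empty;
    internal static IntPtr WriteUtf8(string s) => Marshal.StringToCoTaskMemUTF8(s);
}
```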
Those APIs and similar ones are the sort of band-aids I was referring to. They exist in assorted forms across cross-platform projects because of this missing marshaling capability. I looked at the issue and could not figure out which part of it was the blocking one.
There was nothing blocking. It is just a lot of work to get the design proposed here pushed through the system. In addition to implementing it in both the CoreCLR and Mono runtimes for all situations where interop does string marshaling, there is also changing Roslyn, changing ilasm/ildasm and a number of other tools and libraries that operate on IL, etc. The interop source generators will make it much easier. Proposed design
So you're saying that if you use
@Serentty It is a little more nuanced than that. With the source generators for P/Invokes proposal, we will be able to support UTF-8 in a portable and real way that doesn't cost anywhere near as much as the referenced PR (dotnet/coreclr#18186). The goal of source generators for P/Invokes is to enable support of many language and cross-platform features much more cheaply than the high cost of updating the built-in environment. For now, as @migueldeicaza stated, it is "unnecessarily complex", but reducing that complexity is going to be possible with the source generator approach.
I see, thanks.
Now that the LibraryImport generator (i.e., source-generated P/Invokes) has shipped with built-in UTF-8 string marshalling, this scenario is finally covered in a first-class way.
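For comparison, a minimal sketch of what the source-generated approach looks like; the library name and entry point are hypothetical:

```csharp
using System.Runtime.InteropServices;

internal static partial class Native
{
    // The generator emits the UTF-8 conversion code at compile time; no runtime
    // marshaling infrastructure or custom marshaler is involved.
    [LibraryImport("mylib", StringMarshalling = StringMarshalling.Utf8)]
    internal static partial int my_native_set_name(string name);
}
```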
Currently marshaling in .NET has three modes: Ansi, Unicode, and "Auto", which picks a platform-dependent default between the two. The meaning of Unicode is closely associated with Windows' UTF-16.
There is today no convenient and reliable way to marshal UTF-8 strings, and at best we have an ambiguous definition of what Unicode means.
There are enough bits on the metadata tables to add these two values.
People can resort to custom marshalers (slow, cumbersome, everyone has to do it), or manual marshaling, or hope that the platform does the right thing.
Anecdotally: this also happens to be the oldest Mono bug that is still open.
The world has spoken, and UTF-8 is the standard; we should have first-class support for it, both in P/Invoke signatures and in the various helper methods in Marshal.
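For illustration of the request only, a hypothetical sketch of what first-class support could look like at a call site; `CharSet.Utf8` is the value this issue asks to add and does not exist in the CharSet enum today:

```csharp
using System.Runtime.InteropServices;

internal static class ProposedShape
{
    // Hypothetical: CharSet.Utf8 is the proposed addition, not an existing API.
    [DllImport("mylib", CharSet = CharSet.Utf8)]
    internal static extern int my_native_set_name(string name);
}
```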