Add UTF8 (and maybe UTF16) to System.Runtime.InteropServices.CharSet + Marshal class #4257
Comments
Hmm. Now, don't get me wrong, I like UTF8 as much as anyone. Its technical superiority over UTF-16 is obvious, and I'd definitely like to see it replace UTF-16. But there's more than a little exaggeration in that claim as stated! The platform native string type in both Windows and OSX is UTF-16. The platform native string type in both the CLR and the JVM is UTF-16. Take those out... and there's precious little left of "the world"!
@masonwheeler: That may be, but I would argue your perspective is too narrow. You're looking at things from a framework-internal perspective. When using P/Invoke you are typically (or at least often) interacting with libraries hosted outside the .NET framework, often written in plain C. Most of those libraries expect traditional 8-bit ANSI/ASCII text, and in those cases UTF-8 is the only reliable way of getting Unicode content across. And if we're going to start P/Invoking on platforms other than Windows, UTF-8 is de facto the only way to represent Unicode at the system level.
@josteink Ah, you're right. That's a very good point.
+9000 I've worked with internal applications that P/Invoked into custom C DLLs, and we had to maintain tons of custom marshaling code because using UTF-16 was out of the question due to the memory requirements. Automatic UTF-8 marshaling would be so nice to have.
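A minimal sketch of the kind of hand-rolled UTF-8 marshaling code such teams end up maintaining; the library name and entry point below are hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class NativeMethods
{
    // The native side takes a NUL-terminated UTF-8 string; we pass a raw pointer
    // because the built-in string marshaling cannot be told to use UTF-8.
    [DllImport("mylib")]
    private static extern int my_native_set_name(IntPtr utf8Name);

    public static int SetName(string name)
    {
        // Encode to UTF-8 by hand and append the NUL terminator the C side expects.
        byte[] bytes = Encoding.UTF8.GetBytes(name);
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length + 1);
        try
        {
            Marshal.Copy(bytes, 0, buffer, bytes.Length);
            Marshal.WriteByte(buffer, bytes.Length, 0); // NUL terminator
            return my_native_set_name(buffer);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```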
While the name is misleading, on UNIX CoreCLR's ANSI marshaling is UTF8. |
@stephentoub The name is not only misleading, it is incompatible with the behavior on Windows. While this is what Mono has done, it causes problems for people who want to target both Unix and Windows, as evidenced by people having to resort over and over to the two approaches described above. I do not need the extra work in Mono (just like I assume you guys do not want the extra work), but it makes the platform very unpleasant to work with for everyone who has to consider more than one platform. This is a request on behalf of the users of what is a half-baked and broken setup.
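To illustrate the ambiguity being discussed (the library name and entry point are hypothetical): with the declaration below, CoreCLR and Mono on Unix marshal the string as UTF-8, while on Windows it is converted to the active ANSI code page, so the same non-ASCII input reaches the native side in different encodings depending on the OS.

```csharp
using System.Runtime.InteropServices;

internal static class AnsiAmbiguityExample
{
    // CharSet.Ansi means "system ANSI code page" on Windows but UTF-8 on Unix,
    // so the bytes the native function receives differ by platform.
    [DllImport("mylib", CharSet = CharSet.Ansi)]
    internal static extern int my_native_log(string message);
}
```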
cc @yizhang82 - maybe a good candidate for MCG (Marshaling Code Generator)
If this ever happens, would it mean we will no longer be required to use
Supporting UTF-8 marshalling does sound like something we might want to support in the future, most likely in the MCG that @jkotas has mentioned earlier. MCG is the new interop technology we have in .NET Native, and is a great place for experimenting with newer features like this. Implementing marshalling support in the CLR itself is rather painful - you'll need to write code-generation code that emits IL, while in MCG you write code that spits out C#. Eventually we'd like to see all the CLR runtimes (.NET Native being one of them) use the same underlying interop technology so that we don't have to implement things twice and maintain two separate code bases. @jasonwilliams200OK It's unlikely we'll be able to support marshalling directly to C++ std::strings. We don't necessarily want to tie ourselves to a particular layout of a C++ string implementation. I think C++/CLI libraries should provide a good way to marshal between managed strings and C++ strings.
@yizhang82 C++ std::string is happy with C strings, as is C#. Why not just assume that native code and C# will be using char[] or wchar_t[] for string import/export? That said, the UTF-8 marshaler is critical. The Libgit2Sharp project has a fairly good implementation without too many contributors (one, really); it might be worth speaking with them about making their implementation part of the coreclr, or at least a good starting point. /CC @nulltoken
👍 for me. @paulcbetts @phkelley @tclem @dahlbyk AFAIR you all contributed to it. Thoughts?
First-class UTF8 seems reasonable to me. Relevant LG2S classes:
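A minimal sketch of a LibGit2Sharp-style custom UTF-8 marshaler, roughly the kind of class referenced above; a production implementation also has to handle ownership of strings returned by the native side, which this sketch glosses over.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public sealed class Utf8Marshaler : ICustomMarshaler
{
    private static readonly Utf8Marshaler Instance = new Utf8Marshaler();

    // Required by the runtime: looked up by name when the marshaler is used.
    public static ICustomMarshaler GetInstance(string cookie) => Instance;

    public IntPtr MarshalManagedToNative(object managedObj)
    {
        if (managedObj is not string s) return IntPtr.Zero;
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        IntPtr ptr = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        Marshal.WriteByte(ptr, bytes.Length, 0); // NUL terminator
        return ptr;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        if (pNativeData == IntPtr.Zero) return null;
        int len = 0;
        while (Marshal.ReadByte(pNativeData, len) != 0) len++; // find NUL
        byte[] bytes = new byte[len];
        Marshal.Copy(pNativeData, bytes, 0, len);
        return Encoding.UTF8.GetString(bytes);
    }

    public void CleanUpNativeData(IntPtr pNativeData) => Marshal.FreeHGlobal(pNativeData);
    public void CleanUpManagedData(object managedObj) { }
    public int GetNativeDataSize() => -1;
}
```

It would be attached per parameter with `[MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(Utf8Marshaler))]`, which is exactly the per-call ceremony that first-class UTF-8 support would remove.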
^^ This would be great. There's really no reason to assume text is ASCII when marshaling strings; the only two options in modern software today are UTF-16 (i.e., Windows / OS X native style) and UTF-8 (basically every OSS lib). +1 for this feature.
We have the API proposal out for review: https://github.com/dotnet/corefx/issues/7804
CC: @leemgs This seems to be related to the unexplained issue that depends on the LOCALE environment variables (not working if the locale is ko_KR.UTF8).
@migueldeicaza Has anything happened related to this in the past few years? Especially now that .NET is focusing on better Unicode support with stuff like
Nothing has happened. It is funny because this issue was the oldest bug we kept open for Mono - filed by Red Hat sometime around 2002-2003. We tried ECMA at the time, we tried every contact we had, and even now it seems to be a part of the code that nobody wants to touch and is scared of changing. After Xamarin was acquired I had various discussions in person about it. At this point developers have resorted to a spectrum of workarounds, hacks and band-aids, depending on just how badly they need this and how much performance they require. In the end, the existence of those suboptimal workarounds is what prevents this from being considered. The world continues, but it remains unnecessarily complex for newcomers.
We tried to add UTF-8 support to the built-in marshaling in dotnet/coreclr#18186. In the meantime, the following related things happened that made the UTF8 interop story better:
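Presumably among those related items are `UnmanagedType.LPUTF8Str` and the `Marshal` UTF-8 conversion helpers, both of which exist in current .NET. A minimal usage sketch (the library name and entry point are hypothetical):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class Utf8Helpers
{
    // Per-parameter UTF-8 marshaling without writing a custom marshaler.
    [DllImport("mylib")]
    internal static extern int my_native_set_name(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string name);

    // Manual conversions for when you already hold a raw pointer.
    internal static string ReadUtf8(IntPtr p) => Marshal.PtrToStringUTF8(p) ?? string.Empty;
    internal static IntPtr WriteUtf8(string s) => Marshal.StringToCoTaskMemUTF8(s);
}
```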
Those APIs and similar ones are the sort of band-aids I was referring to. They exist in assorted forms across cross-platform projects because of this missing marshaling capability. I looked at the issue and could not figure out which part of it was the blocking one.
There was nothing blocking. It is just a lot of work to get the design proposed here pushed through the system. In addition to implementing it in both the CoreCLR and Mono runtimes for all situations where interop does string marshaling, there is also changing Roslyn, changing ilasm/ildasm and a number of other tools and libraries that operate on IL, etc. The interop source generators will make it much easier. Proposed design
So you're saying that if you use
@Serentty It is a little more nuanced than that. With the source generators for P/Invokes proposal, we will be able to support UTF-8 in a portable and real way that doesn't cost anywhere near as much as the referenced PR (dotnet/coreclr#18186). The goal of source generators for P/Invokes is to enable support of many language and cross-platform features much more cheaply than the high cost of updating the built-in environment. For now, as @migueldeicaza stated, it is "unnecessarily complex", but reducing that complexity is going to be possible with the source generator approach.
I see, thanks.
Now that the LibraryImport generator (i.e., source-generated P/Invokes) has shipped with built-in UTF-8 string marshalling, this scenario is finally covered in a first-class way.
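For comparison, a minimal sketch of what the source-generated approach looks like; the library name and entry point are hypothetical:

```csharp
using System.Runtime.InteropServices;

internal static partial class Native
{
    // The generator emits the UTF-8 conversion code at compile time; no runtime
    // marshaling infrastructure or custom marshaler is involved.
    [LibraryImport("mylib", StringMarshalling = StringMarshalling.Utf8)]
    internal static partial int my_native_set_name(string name);
}
```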
Currently marshaling in .NET has three modes: Ansi, Unicode, and "Auto", which picks a platform-dependent default between the two. The meaning of Unicode is closely associated with Windows' UTF-16.
There is today no convenient and reliable way to marshal UTF-8 strings, and at best we have an ambiguous definition of what Unicode means.
There are enough bits on the metadata tables to add these two values.
People can resort to custom marshalers (slow, cumbersome, everyone has to do it), or manual marshaling, or hope that the platform does the right thing.
Anecdotally: this also happens to be the oldest Mono bug that is still open.
The world has spoken, and UTF-8 is the standard; we should have first-class support for it, both in P/Invoke signatures and in the various helper methods in Marshal.
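For illustration of the request only, a hypothetical sketch of what first-class support could look like at a call site; `CharSet.Utf8` is the value this issue asks to add and does not exist in the CharSet enum today:

```csharp
using System.Runtime.InteropServices;

internal static class ProposedShape
{
    // Hypothetical: CharSet.Utf8 is the proposed addition, not an existing API.
    [DllImport("mylib", CharSet = CharSet.Utf8)]
    internal static extern int my_native_set_name(string name);
}
```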