Add UTF8 to CharSet #17000

stephentoub · 2016-04-17T12:07:15Z

The CharSet enumeration is used to specify how strings should be marshaled:
https://github.com/dotnet/corefx/blob/master/src/System.Runtime/ref/System.Runtime.cs#L3019-L3023

public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
}

Unicode specifies that UTF16 should be used, regardless of platform, but Ansi is interpreted differently based on platform: on Windows it's interpreted to mean the ANSI format, whereas on Unix it's interpreted to mean UTF8. This means that on Windows we lack the ability to specify UTF8 as the marshaling, and more generally we lack the ability to specify UTF8 marshaling regardless of platform, making writing cross-platform managed components more difficult.

We should add a new UTF8 enum value:

public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
    UTF8 = 5,
}

that when used will cause the runtime's marshaling to be done with UTF8, which is the standard for modern services.

[Added-by-Yi]

We should also add a corresponding UnmanagedType enum value for UTF8, for finer control and parity (between UnmanagedType and CharSet):
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices.PInvoke/ref/System.Runtime.InteropServices.PInvoke.cs

public enum UnmanagedType
{
    LPUTF8Str = 0x30
}

And new Marshal helpers while we are at it:

public class PInvokeMarshal
{
    public static string PtrToStringUTF8(System.IntPtr ptr);
    public static string PtrToStringUTF8(System.IntPtr ptr, int len);
    public static System.IntPtr StringToAllocatedMemoryUTF8(string s);
    public static System.IntPtr ZeroFreeMemoryUTF8(System.IntPtr s);
}


migueldeicaza · 2016-04-17T15:00:33Z

Adding some extra information.

These values are embedded in the P/Invoke definitions in the metadata, in the "Flags for ImplMap" [PInvokeAttributes]. There is one unused bit there (0x08) that we could take, and we could use one of the resulting values to mean Utf8, leaving some room for other things in the future as well.
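A rough sketch of that idea, for illustration only. The first five constants follow the character-set values in the ECMA-335 "Flags for ImplMap" table; CharSetUtf8 and WidenedCharSetMask are hypothetical names and values invented here, not part of any specification or runtime.

```csharp
using System;

public static class ImplMapCharSetSketch
{
    // Character-set values from the ECMA-335 "Flags for ImplMap" table.
    public const int CharSetNotSpec = 0x0000;
    public const int CharSetAnsi = 0x0002;
    public const int CharSetUnicode = 0x0004;
    public const int CharSetAuto = 0x0006;
    public const int CharSetMask = 0x0006;

    // Hypothetical: take the unused 0x08 bit and widen the mask to three bits.
    public const int CharSetUtf8 = 0x0008;
    public const int WidenedCharSetMask = 0x000E;

    // Decodes the character-set portion of an ImplMap flags value
    // using the widened (hypothetical) mask.
    public static string Describe(int flags)
    {
        switch (flags & WidenedCharSetMask)
        {
            case CharSetNotSpec: return "NotSpec";
            case CharSetAnsi: return "Ansi";
            case CharSetUnicode: return "Unicode";
            case CharSetAuto: return "Auto";
            case CharSetUtf8: return "Utf8 (hypothetical)";
            default: return "Unassigned";
        }
    }

    public static void Main()
    {
        Console.WriteLine(Describe(0x0102)); // Ansi
        Console.WriteLine(Describe(0x0108)); // Utf8 (hypothetical)
        Console.WriteLine(Describe(0x000A)); // Unassigned
    }
}
```

With three bits there are eight possible values, four of which are already assigned, which is the "room for other things" mentioned above. One consequence of this particular assignment: a reader that still masks with 0x0006 would see 0x0008 as NotSpec.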

yizhang82 · 2016-04-19T00:25:22Z

I think we should also add UnmanagedType=LpUtf8Str (or UTF8String, name TBD). All the capabilities in CharSet should be available in UnmanagedType.

yizhang82 · 2016-04-19T00:25:53Z

The work is the same since we have to create a new ilmarshaler anyway.

yizhang82 · 2016-04-19T00:58:25Z

Added UnmanagedType into @stephentoub original proposal.

yizhang82 · 2016-04-19T18:14:20Z

Added new PInvokeMarshal helpers into the proposal as well.

masonwheeler · 2016-04-19T18:41:32Z

Yes, this has been needed for quite a while now. It definitely looks like a worthwhile proposal.

yizhang82 · 2016-04-19T20:45:19Z

Updated LpUTF8Str to LPUTF8Str according to @weshaggard 's feedback.

tijoytom-zz · 2016-05-10T17:32:03Z

@yizhang82
cc @weshaggard @stephentoub
Do the API names below sound reasonable? I know you named them StringToAllocatedMemoryUTF8, but StringToCoTaskMemUTF8 sounds consistent with what we have now for ANSI and Unicode (aka UTF16).

unsafe public static String PtrToStringUTF8(IntPtr ptr)
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)
unsafe public static IntPtr StringToCoTaskMemUTF8(String s)
unsafe public static void ZeroFreeCoTaskMemUTF8(IntPtr s)

yizhang82 · 2016-05-11T01:03:46Z

@tijoytom Please note that these are the new names we defined for the PInvokeMarshal class, where the naming is consistent and uses AllocatedMemory (instead of CoTaskMem, to avoid windows-ness).

whoisj · 2016-05-11T05:05:29Z

Any chance we can call a duck a "duck" and change Unicode to Utf16? I'd even settle for Ucs2 at this point.

yizhang82 · 2016-05-11T05:31:41Z

@whoisj Unfortunately changing an existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new APIs and enums with better names and slowly deprecate the old ones (LPWStr is another such example).

bendono · 2016-05-11T08:38:50Z

@yizhang82 What about leaving Unicode and adding Utf16, with both having the same value of 3?

masonwheeler · 2016-05-11T09:57:20Z

@bendono That was my thought too. Or possibly even marking Unicode as [Obsolete]. (Can you do that with enum values?)

whoisj · 2016-05-11T15:14:34Z

@whoisj

Unfortunately changing an existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new APIs and enums with better names and slowly deprecate the old ones (LPWStr is another such example).

I guess I do not understand why the enumeration value cannot be overloaded for "correctness" and "back compat". Something along the lines of:

public enum CharSet
{
    Ansi = 2,
    [Obsolete]
    Unicode = 3,
    UTF16 = 3,
    UTF8 = 5,
}
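A minimal sketch of how such aliasing behaves in C#. SketchCharSet is a hypothetical stand-in written for this example; the real CharSet enum has no UTF16 or UTF8 members. It also shows that [Obsolete] can be applied to an individual enum member.

```csharp
using System;

// Hypothetical stand-in for CharSet, used only to show the aliasing idea.
public enum SketchCharSet
{
    Ansi = 2,
    [Obsolete("Use UTF16 instead.")]
    Unicode = 3,
    UTF16 = 3, // same underlying value as Unicode
    UTF8 = 5,
}

public static class AliasDemo
{
    // Returns true because both names map to the underlying value 3.
    public static bool AliasesAreEqual()
    {
#pragma warning disable CS0618 // Unicode is marked obsolete on purpose
        return SketchCharSet.Unicode == SketchCharSet.UTF16;
#pragma warning restore CS0618
    }

    public static void Main()
    {
        Console.WriteLine(AliasesAreEqual());        // True
        Console.WriteLine((int)SketchCharSet.UTF16); // 3
    }
}
```

Without the pragma, every use of SketchCharSet.Unicode produces warning CS0618, which is the source of the warnings-as-errors concern raised in this thread. Note also that when two members share a value, ToString() may return either name, so code should not depend on which one it gets.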

yizhang82 · 2016-05-11T15:47:59Z

@whoisj @masonwheeler Thanks for your suggestions. Yes, this is exactly what I was thinking as well (introduce new API and deprecate old ones). In the CharSet case, the potential concern is that deprecating would introduce warnings in existing code and would break builds that use warningaserror, for something as common as CharSet.Unicode. It is something a simple search/replace could fix, but still potentially a big impact regardless.

whoisj · 2016-05-11T16:20:54Z

@yizhang82 as a customer who would be impacted by such a change in the way you describe, I appreciate your concern.

I propose the following resolution:

Add UTF16 = 3 to the enumeration, and publicly state that Unicode will be deprecated in the future.
In the follow-up release to NetFx 4.7 / 5 / 7, whatever it'll be named, add the [Obsolete] decorator to the Unicode value.
Sometime in the future, when appropriate, remove the Unicode value from the enumeration.

This seems to be a decent compromise.

masonwheeler · 2016-05-11T16:56:26Z

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

tijoytom-zz · 2016-05-12T17:01:05Z

@yizhang82
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)

I realized that, by the semantics of this method for ANSI and UTF16, the 'len' parameter is usually the length in number of characters. For UTF8, the user won't know the number of characters pointed to by ptr. Even if the user somehow finds the number of chars, we still need the number of bytes to do the actual conversion.
Given this, I think it might be better to drop this overload of PtrToStringUTF8.

whoisj · 2016-05-12T18:36:00Z

@tijoytom not every utf8 block is null terminated. Many use-cases will need the len value to be the number of utf8 bytes, but absolutely not the count of characters or code points in the block.

yizhang82 · 2016-05-12T18:45:52Z

@tijoytom The len here is not a count of characters. It is the byte length. Our documentation on PtrToStringAnsi is incorrect. Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

stephentoub · 2016-05-12T18:50:10Z

Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

Wouldn't it be better to instead change the name of the parameter from len to byteLength or something like that?

tijoytom-zz · 2016-05-12T22:04:12Z

Yes, changed it to byteLen
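To make the byte-length versus character-count distinction concrete, here is a small sketch. It assumes the helpers as they exist on the Marshal class in current .NET (StringToCoTaskMemUTF8 and PtrToStringUTF8 with a byteLen parameter), which are not the PInvokeMarshal names proposed earlier in this thread.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8ByteLenDemo
{
    // Copies s to unmanaged memory as UTF-8, then reads it back
    // using the length in bytes (not in chars).
    public static string RoundTrip(string s)
    {
        int byteLen = Encoding.UTF8.GetByteCount(s);
        IntPtr ptr = Marshal.StringToCoTaskMemUTF8(s);
        try
        {
            return Marshal.PtrToStringUTF8(ptr, byteLen);
        }
        finally
        {
            Marshal.FreeCoTaskMem(ptr);
        }
    }

    public static void Main()
    {
        string s = "h\u00E9llo"; // 'é' takes two bytes in UTF-8
        Console.WriteLine(s.Length);                      // 5
        Console.WriteLine(Encoding.UTF8.GetByteCount(s)); // 6
        Console.WriteLine(RoundTrip(s) == s);             // True
    }
}
```

Passing s.Length (5) instead of the byte count (6) here would cut the buffer short, which is why the parameter name matters.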

AlexGhiondea · 2016-11-29T18:49:50Z

@yizhang82 @tijoytom is this something you are currently working on?

jnm2 · 2016-11-30T13:07:53Z

@masonwheeler

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

This topic has come up several times here and in .NET design meetings. Being respectful of warnings-as-errors is a good thing, just as much for the sake of not introducing 1000 new build warnings even without warnings-as-errors. But at the end of the day, if people opt into warnings as errors, they should expect to have their builds broken now and then. They want micromanagement, and they will get it.

I think obsoleting public API immediately is okay. If you don't do that, at least clearly document the value as deprecated in IntelliSense.

Personally I leave it off until the implementation is past primary development. I turn it on during the stage where I don't expect to be making more changes myself besides potential dependency updating.

danmoseley · 2017-01-03T17:12:33Z

@jnm2 the Visual Studio experience is the main concern. I can't speak for the compiler team but I understand they wish to avoid making new warnings appear in the VS error list on upgrade as over the years it has proven to deter people from upgrading their Visual Studio. Historically we have tried to encourage folks to disable specific warnings to get past this but it hasn't been sufficient. I completely agree with the sentiment because there are types we really want to steer folks away from but that's the current policy.

whoisj · 2017-01-03T19:21:27Z

@danmosemsft but CoreFx and VS are decoupled now, no?

yizhang82 · 2017-01-03T19:40:21Z

@whoisj They are decoupled - but VS needs to have a policy decision on which version of the CoreFX meta package is the default in VS projects, and this has a wide impact on developers using VS. At the end of the day, I think there is a trade-off to be made case-by-case. In this case I don't see a strong reason. There will be some confusion, yes, but given that the rest of the .NET ecosystem (char, string, etc) is pretty much UTF-16 based, it is probably not worth impacting anyone that has CharSet.Unicode in their code.

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

danmoseley · 2017-01-03T20:13:56Z

@whoisj Visual Studio does not run on CoreFX, but it does offer a tooling experience for CoreFX which is what matters here. Possibly I misunderstand.

whoisj · 2017-01-03T23:27:55Z

@yizhang82 I think this:

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

Is the right answer. For those coming from non-Windows platforms, seeing Ascii, Unicode, and UTF8 as options is kind of mind boggling. For many, Unicode is UTF8. 😏

luqunl · 2018-02-09T18:11:46Z

assign this to @jeffschwMSFT and @luqunl

brian-armstrong-discord · 2018-06-05T23:13:07Z

I love that this is being looked at!

Wouldn't the correct casing be Utf8 in order to match Ansi? Keeping Ansi but adding UTF8 seems like an inconsistency.

danmoseley · 2018-06-05T23:28:57Z

Re casing, the design guidelines require Utf8. Looking in CoreFXLabs, they consistently use Utf8.

However, there is plenty of use of upper casing, e.g. if you look in the CoreFX public API and in https://apisof.net.

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

stephentoub · 2018-06-06T01:20:04Z

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

With Encoding.UTF8

luqunl · 2018-06-06T17:27:22Z

@brian-armstrong-discord, we are discussing whether it is necessary to add CharSet.UTF8. In most cases (except arrays), users can just add MarshalAs(UnmanagedType.LPUTF8Str) to each appropriate argument/field to use UTF8.

yizhang82 · 2018-06-06T18:10:05Z

@luqun Besides char and char[], CharSet also affects default String/StringBuilder marshaling behavior.

joshfree · 2019-01-18T20:58:23Z

@jeffschwMSFT what's the current status of this approved API? do you need anything from the corefx team to continue making progress here?

/cc @layomia and @GrabYourPitchforks

Thanks

adamsitnik · 2019-06-15T09:31:59Z

Any progress on this? The current behavior is confusing.

jeffschwMSFT · 2019-06-16T02:58:37Z

This is not currently on our .NET Core 3.0 list of final items. We can take a look and see if this would align with 3.1.

cc @jkoritzinsky

lahma0 · 2021-08-20T17:42:15Z

After 5 long years, why has this feature still not been implemented? Is there some technical or policy-related issue that is holding this up or is the can just being kicked down the road as a result of other priorities? It appears that most of the necessary discussions concerning technical matters ended in 2017 yet its implementation was skipped over in .NET Core 3.0, then skipped over in .NET Core 3.1, it was then added as a "5.0 milestone", then several months later removed as a "5.0 milestone", and finally was added as a "Future milestone" (which I have to assume just means "something we are going to ignore and don't want to worry about any time in the foreseeable future"). Not having this functionality implemented unnecessarily complicates both cross-platform development and working with native libraries which use/expect UTF8 strings. UTF8 is the most common character encoding method in use and has been for a long time... Will this ever be implemented?

jkoritzinsky · 2021-08-20T17:49:14Z

@lahma0 due to a number of implementation difficulties, we have generally been exploring other improvements. We have been working on a source-generated interop solution (see #4257 (comment) for a little more information) that should enable solutions similar to this issue without the far reaching changes in the runtime itself that the tentative implementation of this API had.

In the meantime, you can use [MarshalAs(UnmanagedType.LPUTF8Str)] for each string parameter/return value to get identical behavior to what a CharSet.UTF8 API would provide on P/Invokes
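A sketch of that per-parameter workaround. The native library name "libc.so.6" and the choice of strlen are assumptions made for this example; it targets a glibc-based Linux system, and other platforms would need a different library name.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf8PInvokeDemo
{
    // Assumption: glibc is loadable as "libc.so.6" (typical on Linux).
    // The string is marshaled as a null-terminated UTF-8 buffer.
    [DllImport("libc.so.6", EntryPoint = "strlen")]
    private static extern UIntPtr StrLen(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string s);

    // Returns the number of bytes the native side sees before the terminator.
    public static int NativeUtf8Length(string s)
    {
        return checked((int)StrLen(s).ToUInt64());
    }

    public static void Main()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
        {
            Console.WriteLine("Skipped: this sketch assumes glibc on Linux.");
            return;
        }

        // "héllo" is 5 chars but 6 UTF-8 bytes, so strlen reports 6.
        Console.WriteLine(NativeUtf8Length("h\u00E9llo"));
    }
}
```

The attribute has to be repeated on every string parameter and return value, which is the inconvenience a CharSet-level setting was meant to remove.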

lahma0 · 2021-08-20T18:05:28Z

@jkoritzinsky I appreciate your very quick response and I apologize if my comment came across as a little bitter. While implementing these types of things might not look like that large of a hurdle to outsiders (such as myself), I know they tend to be much more nuanced/difficult and can create a cascade of other problems. I read about the plans for the source-generated interop stuff quite a while back but I haven't kept up to date on its progress. I will definitely check out the links you provided. Thanks again.

stephentoub · 2022-03-16T18:46:25Z

@dotnet/interop-contrib, have we decided not to do this? If so, we should close it.

AaronRobinsonMSFT · 2022-03-16T18:48:32Z

@stephentoub The plan was to wait till we get the source generator for DllImport productized and then close these issues. However, I suppose we could just point to the productization issue now.

AaronRobinsonMSFT · 2022-03-16T18:49:40Z

Closing this request in lieu of first-class support we are adding to the DllImport source generator. See #60595

stephentoub assigned yizhang82 Apr 17, 2016

luqunl assigned jeffschwMSFT and luqunl Feb 9, 2018

luqunl removed their assignment Oct 11, 2018

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 5.0 milestone Jan 31, 2020

maryamariyan added the untriaged label Feb 23, 2020

stephentoub removed the untriaged label Feb 25, 2020

jkotas added area-System.Runtime.InteropServices and removed area-System.Runtime labels May 31, 2020

jkotas modified the milestones: 5.0, Future May 31, 2020

jkotas mentioned this issue May 31, 2020

Add UTF8 (and maybe UTF16) to System.Runtime.InteropServices.CharSet + Marshal class #4257

Closed

stephentoub mentioned this issue Sep 29, 2021

Dllimport generator build and test fixes #59658

Merged

AaronRobinsonMSFT closed this as completed Mar 16, 2022

ghost locked as resolved and limited conversation to collaborators Apr 15, 2022
