-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UTF8 to CharSet #17000
Comments
Adding some extra information. When these values are embedded in P/Invoke definitions on the metadata, in the "Flags for ImplMap" [PInvokeAttributes]. There is one unused bit there, that we could take (0x08) and we could take one value of those to mean Utf8, leaving some room for other things in the future as well. |
I think we should also add UnmanagedType=LpUtf8Str (or UTF8String, name TBD). All the capabilities in CharSet should be available in UnmanagedType. |
The work is the same since we have to create a new ilmarshaler anyway. |
Added UnmanagedType into @stephentoub original proposal. |
Added new PInvokeMarshal helpers into the proposal as well. |
Yes, this has been needed for quite a while now. It definitely looks like a worthwhile proposal. |
Updated LpUTF8Str to LPUTF8Str according to @weshaggard 's feedback. |
@yizhang82 unsafe public static String PtrToStringUTF8(IntPtr ptr) |
@tijoytom Please note that these are the new names we defined for PInvokeMarshal class where the naming is consistent and use AllocatedMemory (instead of CoTaskMem, to avoid windows-ness). |
Any chance we can call a duck a "duck" and change |
@whoisj Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example). |
@yizhang82 What about leaving |
@bendono That was my thought too. Or possibly even marking |
I guess I do not understand why the enumeration value cannot be overloaded for "correctness" and "back compat". Something along the lines of: public enum CharSet
{
Ansi = 2,
[Obsolete]
Unicode = 3,
UTF16 = 3,
UTF8 = 5,
} |
@whoisj @masonwheeler Thanks for your suggestions. Yes, this is exactly what I was thinking as well (introduce new API and deprecate old ones). In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode. It is something a simple search/replace could fix, but still potentially a big impact regardless. |
@yizhang82 as a customer who would be impacted by such a change in the way you describe, I appreciate your concern. I propose the following resolution:
This seems to be a decent compromise. |
But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes? |
@yizhang82 Realized that the semantics of this method (ie for ANSI and UTF16) usually 'len' parameter is the length in number of characters , now for UTF8 the user won't know the number of character pointed to by ptr. Now even if the user somehow find the number of chars , we still need to get the nubmer of bytes to do the actual conversion. |
@tijoytom not every uft8 block is null terminated. Many use-cases will need the |
@tijoytom The len here is not count of characters. It is byte len. Our Documentation on PtrToStringAnsi is incorrect. Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen |
Wouldn't it be better to instead change the name of the parameter from |
Yes, changed it to byteLen |
@yizhang82 @tijoytom is this something you are currently working on? |
This topic has come up several times here and in .NET design meetings. Being respectful of warnings-as-errors is a good thing, just as much for the sake of not introducing 1000 new build warnings even without warnings-as-errors. But at the end of the day, if people opt into warnings as errors, they should expect to have their builds broken now and then. They want micromanagement, and they will get it. I think obsoleting public API immediately is okay. If you don't do that, at least clearly document the value as deprecated in IntelliSense. Personally I leave it off until the implementation is past primary development. I turn it on during the stage where I don't expect to be making more changes myself besides potential dependency updating. |
@jnm2 the Visual Studio experience is the main concern. I can't speak for the compiler team but I understand they wish to avoid making new warnings appear in the VS error list on upgrade as over the years it has proven to deter people from upgrading their Visual Studio. Historically we have tried to encourage folks to disable specific warnings to get past this but it hasn't been sufficient. I completely agree with the sentiment because there are types we really want to steer folks away from but that's the current policy. |
@danmosemsft but CoreFx and VS are decoupled now, no? |
@whoisj They are decoupled - but VS needs to have a policy decision on which version of CoreFX meta package as the default in VS projects, and this has wide impact to developers using VS. At the end of the day, I think there is a trade off to be made case-by-case. In this case I don't see a strong reason. There will be some confusion, yes, but given that rest of the .NET escosystem (char, string, etc) is pretty much UTF-16 based, it is probably not worth impacting anyone that has CharSet.Unicode in their code. We could simply add a new value UTF16 that has the same value without deprecating the Unicode value. |
@whoisj Visual Studio does not run on CoreFX, but it does offer a tooling experience for CoreFX which is what matters here. Possibly I misunderstand. |
@yizhang82 I think this:
Is the right answer. For those coming from non-Windows platforms, seeing the options Ascii, Unicode, and UTF8 as options is kind of mind boggling. For many, Unicode is UTF8. 😏 |
assign this to @jeffschwMSFT and @luqunl |
I love that this is being looked at! Wouldn't the correct casing be |
Re casing, The design guidelines require Utf8. Looking in CoreFXLabs, they consistently use Utf8. However there is plenty of use of upper casings eg if you look in CoreFX public API and in https://apisof.net. @stephentoub did you choose upper case to be consistent with UTF8Encoding? |
With Encoding.UTF8 |
@brian-armstrong-discord , We are discussing whether it is necessary to add Charset.UTF8. Most of case(except array), User can just add MarshalAs(UnmanagedType.LPUTF8Str) to each appropriate arguments/fields to use UTF8. |
@luqun Besides char, char[], CharSet also affects default String/StringBuilder marshaling behavior. |
@jeffschwMSFT what's the current status of this approved API? do you need anything from the corefx team to continue making progress here? /cc @layomia and @GrabYourPitchforks Thanks |
Any progress on this? The current behavior is confusing. |
This is not currently on our .NET Core 3.0 list of final items. We can take a look and see if this would align with 3.1. |
After 5 long years, why has this feature still not been implemented? Is there some technical or policy-related issue that is holding this up or is the can just being kicked down the road as a result of other priorities? It appears that most of the necessary discussions concerning technical matters ended in 2017 yet its implementation was skipped over in .NET Core 3.0, then skipped over in .NET Core 3.1, it was then added as a "5.0 milestone", then several months later removed as a "5.0 milestone", and finally was added as a "Future milestone" (which I have to assume just means "something we are going to ignore and don't want to worry about any time in the foreseeable future"). Not having this functionality implemented unnecessarily complicates both cross-platform development and working with native libraries which use/expect UTF8 strings. UTF8 is the most common character encoding method in use and has been for a long time... Will this ever be implemented? |
@lahma0 due to a number of implementation difficulties, we have generally been exploring other improvements. We have been working on a source-generated interop solution (see #4257 (comment) for a little more information) that should enable solutions similar to this issue without the far reaching changes in the runtime itself that the tentative implementation of this API had. In the meantime, you can use |
@jkoritzinsky I appreciate your very quick response and I apologize if my comment came across as a little bitter. While implementing these type of things might not look like that large of a hurdle to outsiders (such as myself), I know they tend to be much more nuanced/difficult and can create a cascade of other problems. I read about the plans for the source-generated interop stuff quite a while back but I haven't kept up to date on its progress. I will definitely check out the links your provided. Thanks again. |
@dotnet/interop-contrib, have we decided not to do this? If so, we should close it. |
@stephentoub The plan was to wait till we get the source generator for DllImport productized and then close these issues. However, I suppose we could just point to the productization issue now. |
Closing this request in lieu of first-class support we are adding to the DllImport source generator. See #60595 |
The CharSet enumeration is used to specify how strings should be marshaled:
https://github.com/dotnet/corefx/blob/master/src/System.Runtime/ref/System.Runtime.cs#L3019-L3023
Unicode
specifies that UTF16 should be used, regardless of platform, butAnsi
is interpreted differently based on platform: on Windows it's interpreted to mean the ANSI format, whereas on Unix it's interpreted to mean UTF8. This means that on Windows we lack the ability to specify UTF8 as the marshaling, and more generally we lack the ability to specify UTF8 marshaling regardless of platform, making writing cross-platform managed components more difficult.We should add a new UTF8 enum value:
that when used will cause the runtime's marshaling to be done with UTF8, which is the standard for modern services.
[Added-by-Yi]
We should also add a new corresponding UnmanagedType enum in UnmanagedType for UTF8 as well, for finer control and parity (between UnmanagedType and CharSet):
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices.PInvoke/ref/System.Runtime.InteropServices.PInvoke.cs
And new Marshal helpers while we are at it:
The text was updated successfully, but these errors were encountered: