Fix #121 #124
case VTPropertyType.VT_LPSTR:
case VTPropertyType.VT_LPWSTR:
    if (value is string str && !String.IsNullOrEmpty(str))
        return str.Trim('\0');
Should it be using Trim or TrimEnd?
In my interpretation of the specs, a full trim sounds better: only embedded nulls are preserved in the presentation. Any thoughts on this? Thanks
The documentation for CodePageString says "the manner in which strings with embedded or additional trailing null characters are presented by the implementation to an application is implementation-specific".
Which sounds like you can take whichever approach you want, and I do like the idea of being able to see as much of the data as possible.
However, for comparison, it does look to me like if I use OpenMcdf to write a user property with leading and trailing nulls, then neither Windows Explorer nor Word will display the rest of the data.
Is anyone worried about performance for the check and trim on every call?
> In my interpretation of specs, it sounds better as a fulltrim. Only embedded nulls are preserved in the presentation. Any thought on this? Thanks
"The string represented by this field SHOULD NOT contain embedded or additional trailing null characters."
So, I think you're safe with TrimEnd.
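To make the difference between the two options concrete, here is a minimal sketch. The class name and the sample string are hypothetical and are not part of OpenMcdf; the sketch only shows how `Trim('\0')` and `TrimEnd('\0')` treat leading, embedded, and trailing nulls.

```csharp
using System;

// Hypothetical demo class, not part of OpenMcdf.
public static class NullTrimDemo
{
    // Assumed sample value: one leading null, one embedded null, two trailing nulls.
    public const string Raw = "\0ab\0cd\0\0";

    // Full trim: strips leading and trailing nulls, keeps the embedded one.
    public static string FullTrim(string s) => s.Trim('\0');

    // End trim: strips trailing nulls only, keeps leading and embedded ones.
    public static string EndTrim(string s) => s.TrimEnd('\0');
}
```

With the full trim the result starts at the first visible character, whereas `TrimEnd` leaves the leading null in place, which is the case where Explorer and Word reportedly show nothing.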
if (String.IsNullOrEmpty(pValue)) //|| String.IsNullOrEmpty(pValue.Trim(new char[] { '\0' })))
{
    bw.Write((uint)0);
Does the zero length string case still need to be null terminated? (The docs at https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 sound unclear to me when they say "SHOULD specify the number of characters in the value field including the terminating NULL character".)
No: if I'm reading it correctly, the current MS-OLEPS states that if the string is zero length, the Characters field shouldn't be present (section 2.5), so the zero-length string IS a valid case, IMHO.
Hmm, after reading the docs for CodePageString rather than Lpstr I was going to agree with you, but then I tried writing a .doc file with a zero length user defined property string using this branch, and when I tried looking at the properties with Windows Explorer, the whole of Explorer crashed :-( (Though the file displays correctly in Word)
There's different verbiage here: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16
That seems to say that if the length is zero, you write nothing.
Characters (variable): If Size is zero, this field MUST be zero bytes in length.
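A minimal sketch of that reading of MS-OLEPS follows. The helper name and signature are hypothetical (this is not OpenMcdf's API), and it assumes a single-byte code page so the terminator is one byte. It illustrates the spec wording only; given the Explorer crash reported above, it should not be taken as the recommended behavior.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of OpenMcdf.
public static class CodePageStringSketch
{
    // Assumes a single-byte code page (one-byte null terminator).
    public static byte[] Write(string value, Encoding encoding)
    {
        using var ms = new MemoryStream();
        using var bw = new BinaryWriter(ms);

        if (string.IsNullOrEmpty(value))
        {
            // Size == 0, so the Characters field is zero bytes long.
            bw.Write(0u);
            bw.Flush();
            return ms.ToArray();
        }

        byte[] data = encoding.GetBytes(value);
        uint size = (uint)data.Length + 1;   // bytes incl. terminator, excl. padding
        bw.Write(size);
        bw.Write(data);
        bw.Write((byte)0);                   // null terminator

        uint padding = (4 - size % 4) % 4;   // zero padding to a multiple of 4
        for (uint i = 0; i < padding; i++)
        {
            bw.Write((byte)0);
        }

        bw.Flush();
        return ms.ToArray();
    }
}
```

Under these assumptions an empty string serializes to just the four-byte Size field, while a non-empty string always occupies a multiple of four bytes after it.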
The behavior potentially depends on whether you're treating the strings as CodePageString, as per https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16, or as Lpstr, as per https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818. Microsoft doesn't make some of this stuff easy :-(
No, some of these specs are really confusing to interpret.
@@ -412,26 +412,86 @@ public override string ReadScalarValue(System.IO.BinaryReader br)
    uint size = br.ReadUInt32();
    data = br.ReadBytes((int)size);
If it actually is possible for size to be 0, then is there any value in just returning an empty string in that case, and skipping the GetEncoding/GetString calls?
// result = result.Substring(0, result.Length - 1);
//}

return result;
I had taken a stab at this a few weeks ago, and ended up with this:
/// VT_LPSTR
public override string ReadScalarValue(BinaryReader br)
{
    uint size = br.ReadUInt32();
    data = br.ReadBytes((int)size);
    while (size > 0 && data[size - 1] == 0)
    {
        --size;
    }
    return Encoding.GetEncoding(codePage).GetString(data, 0, (int)size);
}
I thought this approach would save some allocations, rather than relying on OLEProperty.Value to Trim every time it's called.
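A self-contained variant of the same idea is sketched below, combined with the early return for a zero size suggested earlier in the thread. The class name and the `Encoding` parameter are assumptions made so the snippet runs on its own; it also assumes a single-byte code page, so trailing nulls are trimmed one byte at a time.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical standalone reader, not part of OpenMcdf.
public static class LpstrReadSketch
{
    // Assumes a single-byte code page; a CP_WINUNICODE value would need
    // its trailing nulls trimmed in two-byte units instead.
    public static string Read(BinaryReader br, Encoding encoding)
    {
        uint size = br.ReadUInt32();
        if (size == 0)
        {
            return string.Empty;   // nothing to decode
        }

        byte[] data = br.ReadBytes((int)size);

        // Use the number of bytes actually read, in case the stream is truncated.
        int length = data.Length;
        while (length > 0 && data[length - 1] == 0)
        {
            length--;              // drop trailing nulls before decoding
        }

        return encoding.GetString(data, 0, length);
    }
}
```

Because only the tail is shortened, embedded nulls still reach the caller, and no second string is allocated for the trim.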
//if (addNullTerminator)
dataLength += 2; // null terminator \u+0000

// var mod = dataLength % 4; // pad to multiple of 4 bytes
Don't Unicode strings also have to be padded to 4 bytes?
There's another set of padding code up where this is called from:
WriteScalarValue(bw, (T)this.propertyValue);
uint dataLength = (uint)data.Length;

//if (addNullTerminator)
dataLength += 2; // null terminator \u+0000
MS-OLEPS says this should be the length in characters (not bytes), including the null terminator but not including the padding (if any).
Possibly the write should be bw.Write((uint)dataLength / 2); as on line 533? (Or possibly it could share the writing code rather than duplicating it?)
> MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)
No, MS-OLEPS for CodePageString says this:
"Size (4 bytes): The size in bytes of the Characters field, including the null terminator, but not including padding (if any). If the property set's CodePage property has the value CP_WINUNICODE (0x04B0), then the value MUST be a multiple of 2."
The semantics of the Size field differ between a CodePageString with CP_WINUNICODE and a UnicodeString: the former counts bytes, the latter counts characters.
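The distinction can be sketched as two small calculations. The class and method names are hypothetical, and both helpers assume a non-empty value, since the zero-length case is handled separately in the spec.

```csharp
using System;
using System.Text;

// Hypothetical helpers, not part of OpenMcdf. Both assume a non-empty value.
public static class SizeFieldSketch
{
    // CodePageString.Size under CP_WINUNICODE: counted in bytes,
    // including the two-byte null terminator, excluding padding.
    public static uint CodePageStringSize(string value) =>
        (uint)Encoding.Unicode.GetByteCount(value) + 2;

    // UnicodeString.Length: counted in 16-bit characters,
    // including the null terminator, excluding padding.
    public static uint UnicodeStringLength(string value) =>
        (uint)value.Length + 1;
}
```

For the same text the first value is always twice the second, which is why a divide by two belongs in one write path but not the other.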
//{
bw.Write('\0'); // first byte of null unicode char
bw.Write('\0'); // second byte of null unicode char
//}
So for writing the null terminator + padding, what about this for non-unicode LPSTR ...
byte[] zeroes = [0, 0, 0, 0];
data = Encoding.GetEncoding(codePage).GetBytes(pValue);
uint dataLength = (uint)data.Length + 1; // Add null terminator to length
bw.Write(dataLength);
bw.Write(data);
dataLength = ((4 - (dataLength % 4)) % 4) + 1; // determine padding plus null terminator
bw.Write(zeroes, 0, (int)dataLength);
This way you're not issuing individual writes for each null padding, and it includes the null terminator.
The Unicode LPWSTR case needs only minimal modification:
byte[] zeroes = [0, 0, 0, 0];
data = Encoding.Unicode.GetBytes(pValue);
uint dataLength = (uint)data.Length + 2; // Add 2-byte null terminator
bw.Write(dataLength / 2);
bw.Write(data);
dataLength = ((4 - (dataLength % 4)) % 4) + 2; // Determine padding plus 2-byte null terminator
bw.Write(zeroes, 0, (int)dataLength);
Padding is generalized in the parent method TypedPropertyValue.Write for maintainability.
This PR should fix #121.
I've partially followed the [MS-OLEPS] specification.
I've chosen not to expose any trailing null characters, but embedded nulls are left untouched.
As stated in note 1 to section 2.5, a client application cannot detect trailing nulls after the string terminator.
I'm waiting for thoughts or comments before merging.
Many thanks,
Federico