Fix #121 #124

ironfede · 2024-03-07T22:00:23Z

This PR should fix #121.
I've partially followed the [MS-OLEPS] specification.
I've chosen not to expose any trailing null chars, BUT embedded nulls will be left untouched.

As stated in note 1 to section 2.5:

Windows presents properties with PropertySet VT_LPSTR (0x001E) to applications
as null-terminated string values such that the application cannot reliably detect the presence of
trailing null characters or any characters following the first embedded null character.

In other words, a client application cannot detect trailing nulls after the string terminator.

I'll wait for thoughts or comments before merging.

Many thanks,
Federico

Numpsy · 2024-03-07T23:42:04Z

sources/OpenMcdf.Extensions/OLEProperties/OLEProperty.cs

+                    case VTPropertyType.VT_LPSTR:
+                    case VTPropertyType.VT_LPWSTR:
+                        if (value is string str && !String.IsNullOrEmpty(str))
+                            return str.Trim('\0');


Should it be using Trim or TrimEnd?

In my interpretation of the specs, a full trim sounds better. Only embedded nulls are preserved in the presentation. Any thoughts on this? Thanks

The documentation for CodePageString says

the manner in which strings with embedded or additional trailing null characters are presented by the implementation to an application is implementation-specific

Which sounds like you can take whichever approach you want, and I do like the idea of being able to see as much of the data as possible.

However, for comparison, it does look to me like if I use OpenMcdf to write a user property with leading and trailing nulls then neither Windows Explorer nor Word will display the rest of the data.

Is anyone worried about performance for the check and trim on every call?

In my interpretation of the specs, a full trim sounds better. Only embedded nulls are preserved in the presentation. Any thoughts on this? Thanks

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/9660cb24-953a-4e60-adf2-37cc0e779d19

The string represented by this field SHOULD NOT contain embedded or additional trailing null characters.

So, I think you're safe with TrimEnd
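For reference, here is a minimal sketch of how the two calls differ on a value with leading, embedded, and trailing nulls. The class and method names are hypothetical and only wrap the standard `String.Trim` and `String.TrimEnd` calls; this is not the PR's code.

```csharp
using System;

static class NullTrimDemo
{
    // Trim removes the given character from BOTH ends of the string.
    public static string FullTrim(string s) => s.Trim('\0');

    // TrimEnd removes it from the end only, so leading nulls survive.
    public static string EndTrim(string s) => s.TrimEnd('\0');
}
```

With the input "\0ab\0cd\0\0", FullTrim keeps "ab\0cd" (5 characters) while EndTrim keeps "\0ab\0cd" (6 characters). Neither call touches the embedded null.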

Numpsy · 2024-03-07T23:46:16Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

+
+                if (String.IsNullOrEmpty(pValue)) //|| String.IsNullOrEmpty(pValue.Trim(new char[] { '\0' })))
+                {
+                    bw.Write((uint)0);


Does the zero length string case still need to be null terminated? (The docs at https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 sound unclear to me when they say "SHOULD specify the number of characters in the value field including the terminating NULL character".)

No: if I'm reading it correctly, the current MS-OLEPS states that if the string is zero length, the Characters field shouldn't be present (2.5), so the zero-length string IS a valid case IMHO.

Hmm, after reading the docs for CodePageString rather than Lpstr I was going to agree with you, but then I tried writing a .doc file with a zero length user defined property string using this branch, and when I tried looking at the properties with Windows Explorer, the whole of Explorer crashed :-( (Though the file displays correctly in Word)

There's different verbiage here: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16

That seems to say that if the length is zero, you write nothing.

Characters (variable): If Size is zero, this field MUST be zero bytes in length.

The behavior potentially depends on if you're treating the strings as CodePageString as per https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16 or Lpstr as per https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 - Microsoft don't make some of this stuff easy :-(

No, some of these specs are really confusing to interpret.
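To make the CodePageString reading concrete, here is a rough sketch of a writer that emits only a zero Size for an empty value. The helper name and signature are hypothetical, it assumes a single-byte code page (so a one-byte terminator), and it leaves 4-byte padding to the caller. Given the Explorer crash reported above, output like this would need checking against real consumers before relying on it.

```csharp
using System.IO;
using System.Text;

static class CodePageStringWriter
{
    // Hypothetical helper, not the PR's implementation.
    public static void Write(BinaryWriter bw, string value, Encoding encoding)
    {
        if (string.IsNullOrEmpty(value))
        {
            // Size = 0, so the Characters field is zero bytes long.
            bw.Write((uint)0);
            return;
        }

        byte[] data = encoding.GetBytes(value);
        bw.Write((uint)data.Length + 1); // Size in bytes, including the null terminator
        bw.Write(data);
        bw.Write((byte)0);               // single-byte null terminator (assumption)
    }
}
```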

Numpsy · 2024-03-08T11:18:44Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

@@ -412,26 +412,86 @@ public override string ReadScalarValue(System.IO.BinaryReader br)
                uint size = br.ReadUInt32();
                data = br.ReadBytes((int)size);


If it actually is possible for size to be 0, then is there any value in just returning an empty string in that case, and skipping the GetEncoding/GetString calls?
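A minimal sketch of that shortcut, written as a standalone helper rather than the actual ReadScalarValue override (the class name, method name, and the encoding parameter are assumptions made for illustration):

```csharp
using System.IO;
using System.Text;

static class CodePageStringReader
{
    // Hypothetical standalone version of the read path.
    public static string Read(BinaryReader br, Encoding encoding)
    {
        uint size = br.ReadUInt32();
        if (size == 0)
        {
            // Nothing follows the Size field, so skip the decode entirely.
            return string.Empty;
        }

        byte[] data = br.ReadBytes((int)size);
        return encoding.GetString(data);
    }
}
```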

rmsimpson · 2024-03-08T18:28:52Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

+                //    result = result.Substring(0, result.Length - 1);
+                //}
+
+                return result;


I had taken a stab at this a few weeks ago, and ended up with this:

    /// VT_LPSTR
    public override string ReadScalarValue(BinaryReader br)
    {
        uint size = br.ReadUInt32();
        data = br.ReadBytes((int)size);
        while (size > 0 && data[size - 1] == 0)
        {
            --size;
        }
        return Encoding.GetEncoding(codePage).GetString(data, 0, (int)size);
    }

I thought this approach would save some allocations, rather than relying on OLEProperty.Value to Trim every time it's called.
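One thing that may be worth checking with a byte-level trim: if the property set's code page is CP_WINUNICODE, the data is UTF-16LE, and the last byte of a character such as 'A' (0x41 0x00) is itself zero. The sketch below trims whole code units instead of single bytes. The helper and its unitSize parameter are hypothetical; the caller is assumed to pass 2 for UTF-16 and 1 for single-byte code pages.

```csharp
static class TrailingNullTrimmer
{
    // Returns how many bytes remain after dropping trailing null code units.
    // Any odd trailing byte that does not form a full unit is ignored.
    public static int TrimmedLength(byte[] data, int unitSize)
    {
        int length = data.Length - (data.Length % unitSize);
        while (length >= unitSize && IsNullUnit(data, length - unitSize, unitSize))
        {
            length -= unitSize;
        }
        return length;
    }

    // True when every byte of the code unit starting at offset is zero.
    private static bool IsNullUnit(byte[] data, int offset, int unitSize)
    {
        for (int i = 0; i < unitSize; i++)
        {
            if (data[offset + i] != 0)
            {
                return false;
            }
        }
        return true;
    }
}
```

For the UTF-16LE bytes 41 00 00 00 ("A" plus terminator), a unit size of 2 leaves 2 bytes, whereas a unit size of 1 would leave only the single byte 41.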

rmsimpson · 2024-03-08T18:42:13Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

+                    //if (addNullTerminator)
+                    dataLength += 2;            // null terminator \u+0000
+
+                   // var mod = dataLength % 4;       // pad to multiple of 4 bytes


Don't unicode strings also have to be padded to 4 bytes?

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/9660cb24-953a-4e60-adf2-37cc0e779d19

There's another set of padding code up where this is called from (openmcdf/sources/OpenMcdf.Extensions/OLEProperties/TypedPropertyValue.cs, line 132 in c58985f: WriteScalarValue(bw, (T)this.propertyValue);), so I think the code up there should handle both Lpstr and Lpwstr without needing to have multiple sets of padding code here.

rmsimpson · 2024-03-08T18:45:42Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

+                    uint dataLength = (uint)data.Length;
+
+                    //if (addNullTerminator)
+                    dataLength += 2;            // null terminator \u+0000


MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

Possibly the write should be bw.Write((uint)dataLength / 2); as on line 533? (or possibly it could share the writing code rather than duplicating it?)

MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

No, MS-OLEPS for CodePageString says this:

Size (4 bytes): The size in bytes of the Characters field, including the null terminator, but not including padding (if any). If the property set's CodePage property has the value CP_WINUNICODE (0x04B0), then the value MUST be a multiple of 2.

The semantics of the length field differ between a CodePageString with CP_WINUNICODE (Size, in bytes) and a UnicodeString (length in characters).
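A small sketch of the two calculations side by side, assuming the byte-based Size quoted above for CodePageString and a character-based count for UnicodeString. The helper names and the terminatorBytes parameter are hypothetical.

```csharp
using System.Text;

static class StringLengthFields
{
    // CodePageString: size in BYTES, including the null terminator,
    // excluding padding. terminatorBytes is 2 for CP_WINUNICODE, else 1.
    public static uint CodePageStringSize(string value, Encoding encoding, int terminatorBytes)
        => (uint)(encoding.GetByteCount(value) + terminatorBytes);

    // UnicodeString: length in UTF-16 CHARACTERS, including the null
    // terminator, excluding padding.
    public static uint UnicodeStringLength(string value)
        => (uint)(value.Length + 1);
}
```

For "abc", the CodePageString size under Encoding.Unicode with a 2-byte terminator is 8, while the UnicodeString length is 4.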

rmsimpson · 2024-03-08T19:47:36Z

sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs

+                    //{
+                    bw.Write('\0');                 // first byte of null unicode char
+                    bw.Write('\0');                 // second byte of null unicode char
+                    //}


So for writing the null terminator + padding, what about this for non-unicode LPSTR ...

    byte[] zeroes = [0, 0, 0, 0];
    data = Encoding.GetEncoding(codePage).GetBytes(pValue);
    uint dataLength = (uint)data.Length + 1; // Add null terminator to length
    bw.Write(dataLength);
    bw.Write(data);
    dataLength = ((4 - (dataLength % 4)) % 4) + 1; // determine padding plus null terminator
    bw.Write(zeroes, 0, (int)dataLength);

This way you're not issuing individual writes for each null padding, and it includes the null terminator.

The Unicode LPWSTR case needs only a minimal modification:

    byte[] zeroes = [0, 0, 0, 0];
    data = Encoding.Unicode.GetBytes(pValue);
    uint dataLength = (uint)data.Length + 2; // Add 2-byte null terminator
    bw.Write(dataLength / 2);
    bw.Write(data);
    dataLength = ((4 - (dataLength % 4)) % 4) + 2; // Determine padding plus 2-byte null terminator
    bw.Write(zeroes, 0, (int)dataLength);

Padding is generalized in the parent method TypedPropertyValue.Write for maintainability.
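As a rough illustration of what a shared padding step could look like (this is not the actual TypedPropertyValue.Write code; the helper names and the startPosition parameter are assumptions):

```csharp
using System.IO;

static class PaddingHelper
{
    // Number of zero bytes needed to reach the next multiple of 4.
    public static int PaddingFor(long bytesWritten)
        => (int)((4 - (bytesWritten % 4)) % 4);

    // Pads whatever was written since startPosition, so the scalar writers
    // for Lpstr and Lpwstr do not each need their own padding code.
    public static void PadToFourBytes(BinaryWriter bw, long startPosition)
    {
        long written = bw.BaseStream.Position - startPosition;
        int padding = PaddingFor(written);
        for (int i = 0; i < padding; i++)
        {
            bw.Write((byte)0);
        }
    }
}
```

A caller would record the stream position before writing the value and call PadToFourBytes afterwards, which keeps the padding rule in one place.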

Numpsy · 2024-03-17T20:43:41Z

As I mentioned previously, I was able to use this branch to create a file which made Windows Explorer on my laptop crash when looking at the custom file properties (although Word seems to open the file without problems) - If I run this test

        // Modify some document summary information properties, save to a file, and then validate the expected results
        [TestMethod]
        public void Test_Empty_User_Property()
        {
            using (CompoundFile cf = new CompoundFile("2custom.doc"))
            {
                var dsiStream = cf.RootStorage.GetStream("\u0005DocumentSummaryInformation");
                var co = dsiStream.AsOLEPropertiesContainer();
                var userProperties = co.UserDefinedProperties;

                userProperties.Properties.First(prop => prop.PropertyName == "prop1").Value = "";

                co.Save(dsiStream);
                cf.SaveAs("zero_length_property.doc");
                cf.Close();
            }
        }

Does anyone else see that when looking at the custom properties sheet on the output file, where 'prop1' should be empty now?

Fix #121

485e865

Numpsy reviewed Mar 7, 2024

Numpsy reviewed Mar 8, 2024

rmsimpson reviewed Mar 8, 2024

ironfede merged commit 591eaea into master Apr 17, 2024
2 checks passed

jeremy-visionaid deleted the pr/fix-121-trailing-chars branch November 16, 2024 08:24

Numpsy mentioned this pull request Nov 17, 2024

Simplify null terminator handling in DictionaryProperty.Write #221

Merged
