Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #121 #124

Merged
merged 1 commit into from
Apr 17, 2024
Merged

Fix #121 #124

merged 1 commit into from
Apr 17, 2024

Conversation

ironfede
Copy link
Owner

@ironfede ironfede commented Mar 7, 2024

This pr should fix #121 .
I've partially followed specifications [MS-OLEPS]
I've chosen not to expose any trailing null char BUT embedded nulls will be left untouched.

As stated in note 1 to section 2.5

Windows presents properties with PropertySet VT_LPSTR (0x001E) to applications
as null-terminated string values such that the application cannot reliably detect the presence of
trailing null characters or any characters following the first embedded null character.

client application cannot detect trailing nulls after string terminator.

Waiting for some thoughts or comment before merging.

Many thanks,
Federico

case VTPropertyType.VT_LPSTR:
case VTPropertyType.VT_LPWSTR:
if (value is string str && !String.IsNullOrEmpty(str))
return str.Trim('\0');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be using Trim or TrimEnd?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my interpretation of specs, it sounds better as a fulltrim. Only embedded nulls are preserved in the presentation. Any thought on this? Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for CodePageString says

the manner in which strings with embedded or additional trailing null characters are presented by the implementation to an application is implementation-specific

Which sounds like you can take whichever approach you want, and I do like the idea of being able to see as much of the data as possible.

However, for comparison, it does look to me like if I use OpenMcdf to write a user property with leading and trailing nulls then niether Windows Explorer nor Word will display the rest of the data:
image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anyone worried about performance for the check and trim on every call?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my interpretation of specs, it sounds better as a fulltrim. Only embedded nulls are preserved in the presentation. Any thought on this? Thanks

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/9660cb24-953a-4e60-adf2-37cc0e779d19

The string represented by this field SHOULD NOT contain embedded or additional trailing null characters.

So, I think you're safe with TrimEnd


if (String.IsNullOrEmpty(pValue)) //|| String.IsNullOrEmpty(pValue.Trim(new char[] { '\0' })))
{
bw.Write((uint)0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the zero length string case still need to be null terminated? (the docs at https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 sound unclear to me when it says "SHOULD specify the number of characters in the value field including the terminating NULL character"

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No: if i'm reading the correctly, current ms-oleps states that if string is zero length, characters field shouldn't be present (2.5) so the zero length string IS a valid case imho

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, after reading the docs for CodePageString rather than Lpstr I was going to agree with you, but then I tried writing a .doc file with a zero length user defined property string using this branch, and when I tried looking at the properties with Windows Explorer, the whole of Explorer crashed :-( (Though the file displays correctly in Word)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's different verbiage here: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16

That seems to say that if the length is zero, you write nothing.

Characters (variable): If Size is zero, this field MUST be zero bytes in length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior potentially depends on if you're treating the strings as CodePageString as per https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16 or Lpstr as per https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 - Microsoft don't make some of this stuff easy :-(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, some of these specs are really confusing to interpret.

@@ -412,26 +412,86 @@ public override string ReadScalarValue(System.IO.BinaryReader br)
uint size = br.ReadUInt32();
data = br.ReadBytes((int)size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it actually is possible for size to be 0, then is there any value in just returning an empty string in that case, and skipping the GetEncoding/GetString calls?

// result = result.Substring(0, result.Length - 1);
//}

return result;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had taken a stab at this a few weeks ago, and ended up with this:

/// VT_LPSTR
public override string ReadScalarValue(BinaryReader br)
{
	uint size = br.ReadUInt32();
	data = br.ReadBytes((int)size);
	while (size > 0 && data[size - 1] == 0)
	{
		--size;
	}
	return Encoding.GetEncoding(codePage).GetString(data, 0, (int)size);
}

I thought this approach would save some allocations, rather than relying on OLEProperty.Value to Trim every time it's called.

//if (addNullTerminator)
dataLength += 2; // null terminator \u+0000

// var mod = dataLength % 4; // pad to multiple of 4 bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@Numpsy Numpsy Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another set of padding code up where this is called from (

WriteScalarValue(bw, (T)this.propertyValue);
) - so I think the code up their should handle both Lpstr and Lpwstr without needing to have multiple sets of padding code here.

uint dataLength = (uint)data.Length;

//if (addNullTerminator)
dataLength += 2; // null terminator \u+0000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly the write should be bw.Write((uint)dataLength / 2); as on line 533? (or possibly it could share the writing code rather than duplicating it?)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

No, MS-OLEPS for CodeString says this:

Size (4 bytes): The size in bytes of the Characters field, including the null terminator, but not 
including padding (if any). If the property set's CodePage property has the value CP_WINUNICODE 
(0x04B0), then the value MUST be a multiple of 2.

The semantic of Size field is different between CodePage String with CP_WINUNICODE and Unicode String

//{
bw.Write('\0'); // first byte of null unicode char
bw.Write('\0'); // second byte of null unicode char
//}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for writing the null terminator + padding, what about this for non-unicode LPSTR ...

byte[] zeroes = [0, 0, 0, 0];
data = Encoding.GetEncoding(codePage).GetBytes(pValue);
uint dataLength = (uint)data.Length + 1; // Add null terminator to length
bw.Write(dataLength);
bw.Write(data);

dataLength = ((4 - (dataLength % 4)) % 4) + 1; // determine padding plus null terminator
bw.Write(zeroes, 0, (int)dataLength);

This way you're not issuing individual writes for each null padding, and it includes the null terminator.

For unicode LPWSTR just needs minimal modification:

byte[] zeroes = [0, 0, 0, 0];
data = Encoding.Unicode.GetBytes(pValue);
uint dataLength = (uint)data.Length + 2; // Add 2-byte null terminator
bw.Write(dataLength / 2);
bw.Write(data);
dataLength = ((4 - (dataLength % 4)) % 4) + 2; // Determine padding plus 2-byte null terminator
bw.Write(zeroes, 0, (int)dataLength);

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding is generalized in Parent method TypedPropertyValue.Write for manteinabilty.

@Numpsy
Copy link
Contributor

Numpsy commented Mar 17, 2024

As I mentioned previously, I was able to use this branch to create a file which made Windows Explorer on my laptop crash when looking at the custom file properties (although Word seems to open the file without problems) - If I run this test

        // Modify some document summary information properties, save to a file, and then validate the expected results
        [TestMethod]
        public void Test_Empty_User_Property()
        {
            using (CompoundFile cf = new CompoundFile("2custom.doc"))
            {
                var dsiStream = cf.RootStorage.GetStream("\u0005DocumentSummaryInformation");
                var co = dsiStream.AsOLEPropertiesContainer();
                var userProperties = co.UserDefinedProperties;

                userProperties.Properties.First(prop => prop.PropertyName == "prop1").Value = "";

                co.Save(dsiStream);
                cf.SaveAs("zero_length_property.doc");
                cf.Close();
            }
        }

Does anyone else see that when looking at the custom properties sheet on the output file?
e.g.
image
where 'prop1' should be empty now?

@ironfede ironfede merged commit 591eaea into master Apr 17, 2024
2 checks passed
@jeremy-visionaid jeremy-visionaid deleted the pr/fix-121-trailing-chars branch November 16, 2024 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Null padding of strings is an implementation detail, should not be exposed to the consumer
3 participants