Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #121 #124

Merged
merged 1 commit into from
Apr 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion sources/OpenMcdf.Extensions/OLEProperties/DictionaryEntry.cs
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace OpenMcdf.Extensions.OLEProperties
Expand Down Expand Up @@ -55,7 +56,13 @@ public void Write(BinaryWriter bw)

private string GetName()
{
return Encoding.GetEncoding(this.codePage).GetString(nameBytes);

var result = Encoding.GetEncoding(this.codePage).GetString(nameBytes);

result = result.Trim('\0');

return result;

}


Expand Down
24 changes: 22 additions & 2 deletions sources/OpenMcdf.Extensions/OLEProperties/OLEProperty.cs
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,30 @@ public VTPropertyType VTType
internal set;
}

object value;

public object Value
{
get;
set;
get
{
switch (VTType)
{
case VTPropertyType.VT_LPSTR:
case VTPropertyType.VT_LPWSTR:
if (value is string str && !String.IsNullOrEmpty(str))
return str.Trim('\0');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be using Trim or TrimEnd?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my interpretation of specs, it sounds better as a fulltrim. Only embedded nulls are preserved in the presentation. Any thought on this? Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for CodePageString says

the manner in which strings with embedded or additional trailing null characters are presented by the implementation to an application is implementation-specific

Which sounds like you can take whichever approach you want, and I do like the idea of being able to see as much of the data as possible.

However, for comparison, it does look to me like if I use OpenMcdf to write a user property with leading and trailing nulls then niether Windows Explorer nor Word will display the rest of the data:
image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anyone worried about performance for the check and trim on every call?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my interpretation of specs, it sounds better as a fulltrim. Only embedded nulls are preserved in the presentation. Any thought on this? Thanks

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/9660cb24-953a-4e60-adf2-37cc0e779d19

The string represented by this field SHOULD NOT contain embedded or additional trailing null characters.

So, I think you're safe with TrimEnd

break;
default:
return this.value;

}

return this.value;
}
set
{
this.value = value;
}
}

public override bool Equals(object obj)
Expand Down
116 changes: 92 additions & 24 deletions sources/OpenMcdf.Extensions/OLEProperties/PropertyFactory.cs
Original file line number Diff line number Diff line change
Expand Up @@ -412,26 +412,86 @@ public override string ReadScalarValue(System.IO.BinaryReader br)
uint size = br.ReadUInt32();
data = br.ReadBytes((int)size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it actually is possible for size to be 0, then is there any value in just returning an empty string in that case, and skipping the GetEncoding/GetString calls?


return Encoding.GetEncoding(codePage).GetString(data);
var result = Encoding.GetEncoding(codePage).GetString(data);
//result = result.Trim(new char[] { '\0' });

//if (this.codePage == CodePages.CP_WINUNICODE)
//{
// result = result.Substring(0, result.Length - 2);
//}
//else
//{
// result = result.Substring(0, result.Length - 1);
//}

return result;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had taken a stab at this a few weeks ago, and ended up with this:

/// VT_LPSTR
public override string ReadScalarValue(BinaryReader br)
{
	uint size = br.ReadUInt32();
	data = br.ReadBytes((int)size);
	while (size > 0 && data[size - 1] == 0)
	{
		--size;
	}
	return Encoding.GetEncoding(codePage).GetString(data, 0, (int)size);
}

I thought this approach would save some allocations, rather than relying on OLEProperty.Value to Trim every time it's called.

}

public override void WriteScalarValue(BinaryWriter bw, string pValue)
{
data = Encoding.GetEncoding(codePage).GetBytes(pValue);
uint dataLength = (uint)data.Length;
//bool addNullTerminator = true;

if (String.IsNullOrEmpty(pValue)) //|| String.IsNullOrEmpty(pValue.Trim(new char[] { '\0' })))
{
bw.Write((uint)0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the zero length string case still need to be null terminated? (the docs at https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 sound unclear to me when it says "SHOULD specify the number of characters in the value field including the terminating NULL character"

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No: if i'm reading the correctly, current ms-oleps states that if string is zero length, characters field shouldn't be present (2.5) so the zero length string IS a valid case imho

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, after reading the docs for CodePageString rather than Lpstr I was going to agree with you, but then I tried writing a .doc file with a zero length user defined property string using this branch, and when I tried looking at the properties with Windows Explorer, the whole of Explorer crashed :-( (Though the file displays correctly in Word)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's different verbiage here: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16

That seems to say that if the length is zero, you write nothing.

Characters (variable): If Size is zero, this field MUST be zero bytes in length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior potentially depends on if you're treating the strings as CodePageString as per https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-oleps/a4c32611-5b79-4965-8f50-50639c138e16 or Lpstr as per https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-oshared/fac324c9-ff39-442e-bd18-1a91a723a818 - Microsoft don't make some of this stuff easy :-(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, some of these specs are really confusing to interpret.

}
else if (this.codePage == CodePages.CP_WINUNICODE)
{

// The string data must be null terminated, so add a null byte if there isn't one already
bool addNullTerminator =
dataLength == 0 || data[dataLength - 1] != '\0';
data = Encoding.GetEncoding(codePage).GetBytes(pValue);

//if (data.Length >= 2 && data[data.Length - 2] == '\0' && data[data.Length - 1] == '\0')
// addNullTerminator = false;

uint dataLength = (uint)data.Length;

//if (addNullTerminator)
dataLength += 2; // null terminator \u+0000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly the write should be bw.Write((uint)dataLength / 2); as on line 533? (or possibly it could share the writing code rather than duplicating it?)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSOLEPS says this should be the length in characters (not bytes) including the null terminator but not including the padding (if any)

No, MS-OLEPS for CodeString says this:

Size (4 bytes): The size in bytes of the Characters field, including the null terminator, but not 
including padding (if any). If the property set's CodePage property has the value CP_WINUNICODE 
(0x04B0), then the value MUST be a multiple of 2.

The semantic of Size field is different between CodePage String with CP_WINUNICODE and Unicode String


// var mod = dataLength % 4; // pad to multiple of 4 bytes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@Numpsy Numpsy Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another set of padding code up where this is called from (

WriteScalarValue(bw, (T)this.propertyValue);
) - so I think the code up their should handle both Lpstr and Lpwstr without needing to have multiple sets of padding code here.


bw.Write(dataLength); // datalength of string + null char (unicode)
bw.Write(data); // string


//if (addNullTerminator)
//{
bw.Write('\0'); // first byte of null unicode char
bw.Write('\0'); // second byte of null unicode char
//}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for writing the null terminator + padding, what about this for non-unicode LPSTR ...

byte[] zeroes = [0, 0, 0, 0];
data = Encoding.GetEncoding(codePage).GetBytes(pValue);
uint dataLength = (uint)data.Length + 1; // Add null terminator to length
bw.Write(dataLength);
bw.Write(data);

dataLength = ((4 - (dataLength % 4)) % 4) + 1; // determine padding plus null terminator
bw.Write(zeroes, 0, (int)dataLength);

This way you're not issuing individual writes for each null padding, and it includes the null terminator.

For unicode LPWSTR just needs minimal modification:

byte[] zeroes = [0, 0, 0, 0];
data = Encoding.Unicode.GetBytes(pValue);
uint dataLength = (uint)data.Length + 2; // Add 2-byte null terminator
bw.Write(dataLength / 2);
bw.Write(data);
dataLength = ((4 - (dataLength % 4)) % 4) + 2; // Determine padding plus 2-byte null terminator
bw.Write(zeroes, 0, (int)dataLength);

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding is generalized in Parent method TypedPropertyValue.Write for manteinabilty.


//for (int i = 0; i < (4 - mod); i++) // padding
// bw.Write('\0');
}
else
{
data = Encoding.GetEncoding(codePage).GetBytes(pValue);

//if (data.Length >= 1 && data[data.Length - 1] == '\0')
// addNullTerminator = false;

uint dataLength = (uint)data.Length;

//if (addNullTerminator)
dataLength += 1; // null terminator \u+0000

var mod = dataLength % 4; // pad to multiple of 4 bytes

bw.Write(dataLength); // datalength of string + null char (unicode)
bw.Write(data); // string


//if (addNullTerminator)
//{
bw.Write('\0'); // null terminator'\0'
//}

//for (int i = 0; i < (4 - mod); i++) // padding
// bw.Write('\0');
}

if (addNullTerminator)
dataLength += 1;

bw.Write(dataLength);
bw.Write(data);

if (addNullTerminator)
bw.Write((byte)0);
}
}

Expand All @@ -449,8 +509,11 @@ public VT_LPWSTR_Property(VTPropertyType vType, int codePage, bool isVariant) :
public override string ReadScalarValue(System.IO.BinaryReader br)
{
uint nChars = br.ReadUInt32();
data = br.ReadBytes((int)(nChars * 2)); //WChar
return Encoding.Unicode.GetString(data);
data = br.ReadBytes((int)((nChars * 2) - 2)); //WChar- null terminator
var result = Encoding.Unicode.GetString(data);
//result = result.Trim(new char[] { '\0' });

return result;
}

public override void WriteScalarValue(BinaryWriter bw, string pValue)
Expand All @@ -460,20 +523,25 @@ public override void WriteScalarValue(BinaryWriter bw, string pValue)
// The written data length field is the number of characters (not bytes) and must include a null terminator
// add a null terminator if there isn't one already
var byteLength = data.Length;
bool addNullTerminator =
byteLength == 0 || data[byteLength - 1] != '\0' || data[byteLength - 2] != '\0';

if (addNullTerminator)
byteLength += 2;
//bool addNullTerminator =
// byteLength == 0 || data[byteLength - 1] != '\0' || data[byteLength - 2] != '\0';

//if (addNullTerminator)
byteLength += 2;

bw.Write((uint)byteLength / 2);
bw.Write(data);

if (addNullTerminator)
{
bw.Write((byte)0);
bw.Write((byte)0);
}
//if (addNullTerminator)
//{
bw.Write((byte)0);
bw.Write((byte)0);
//}

//var mod = byteLength % 4; // pad to multiple of 4 bytes
//for (int i = 0; i < (4 - mod); i++) // padding
// bw.Write('\0');
}
}

Expand Down Expand Up @@ -645,7 +713,7 @@ public override void WriteScalarValue(BinaryWriter bw, object pValue)
}
}

#endregion
#endregion

}
}
2 changes: 1 addition & 1 deletion sources/OpenMcdf.Extensions/Properties/AssemblyInfo.cs
Original file line number Diff line number Diff line change
Expand Up @@ -38,4 +38,4 @@
// http://blog.paranoidcoding.com/2016/04/05/deterministic-builds-in-roslyn.html
// and Compilers should be deterministic: same inputs generate same outputs
// https://stackoverflow.com/questions/43019832/auto-versioning-in-visual-studio-2017-net-core/46985624#46985624
[assembly: AssemblyVersion("2.3.1.0")]
[assembly: AssemblyVersion("2.3.2.0")]
Original file line number Diff line number Diff line change
Expand Up @@ -143,14 +143,15 @@ public void Test_DOCUMENT_SUMMARY_INFO_MODIFY()

// The company property should exist but be empty
var companyProperty = co.Properties.First(prop => prop.PropertyName == "PIDDSI_COMPANY");
Assert.AreEqual("\0\0\0\0", companyProperty.Value);
Assert.AreEqual("", companyProperty.Value);

// As a sanity check, check that the value of a property that we don't change remains the same
var formatProperty = co.Properties.First(prop => prop.PropertyName == "PIDDSI_PRESFORMAT");
Assert.AreEqual("A4 Paper (210x297 mm)\0\0\0", formatProperty.Value);
Assert.AreEqual("A4 Paper (210x297 mm)", formatProperty.Value);

// The manager property shouldn't exist, and we'll add it
Assert.IsFalse(co.Properties.Any(prop => prop.PropertyName == "PIDDSI_MANAGER"));

var managerProp = co.NewProperty(VTPropertyType.VT_LPSTR, 0x0000000E, "PIDDSI_MANAGER");
co.AddProperty(managerProp);

Expand All @@ -159,20 +160,21 @@ public void Test_DOCUMENT_SUMMARY_INFO_MODIFY()

co.Save(dsiStream);
cf.SaveAs(@"test_modify_summary.ppt");
cf.Close() ;
}

using (CompoundFile cf = new CompoundFile("test_modify_summary.ppt"))
{
var co = cf.RootStorage.GetStream("\u0005DocumentSummaryInformation").AsOLEPropertiesContainer();

var companyProperty = co.Properties.First(prop => prop.PropertyName == "PIDDSI_COMPANY");
Assert.AreEqual("My Company\0", companyProperty.Value);
Assert.AreEqual("My Company", companyProperty.Value);

var formatProperty = co.Properties.First(prop => prop.PropertyName == "PIDDSI_PRESFORMAT");
Assert.AreEqual("A4 Paper (210x297 mm)\0\0\0", formatProperty.Value);
Assert.AreEqual("A4 Paper (210x297 mm)", formatProperty.Value);

var managerProperty = co.Properties.First(prop => prop.PropertyName == "PIDDSI_MANAGER");
Assert.AreEqual("The Boss\0", managerProperty.Value);
Assert.AreEqual("The Boss", managerProperty.Value);
}
}

Expand Down Expand Up @@ -306,11 +308,11 @@ public void Test_SUMMARY_INFO_MODIFY_LPWSTRING()

var authorProperty = co.Properties.First(prop => prop.PropertyName == "PIDSI_AUTHOR");
Assert.AreEqual(VTPropertyType.VT_LPWSTR, authorProperty.VTType);
Assert.AreEqual("zkyiqpqoroxnbdwhnjfqroxlgylpbgcwuhjfifpkvycugvuecoputqgknnbs\0", authorProperty.Value);
Assert.AreEqual("zkyiqpqoroxnbdwhnjfqroxlgylpbgcwuhjfifpkvycugvuecoputqgknnbs", authorProperty.Value);

var keyWordsProperty = co.Properties.First(prop => prop.PropertyName == "PIDSI_KEYWORDS");
Assert.AreEqual(VTPropertyType.VT_LPWSTR, keyWordsProperty.VTType);
Assert.AreEqual("abcdefghijk\0", keyWordsProperty.Value);
Assert.AreEqual("abcdefghijk", keyWordsProperty.Value);

authorProperty.Value = "ABC";
keyWordsProperty.Value = "";
Expand All @@ -325,11 +327,11 @@ public void Test_SUMMARY_INFO_MODIFY_LPWSTRING()

var authorProperty = co.Properties.First(prop => prop.PropertyName == "PIDSI_AUTHOR");
Assert.AreEqual(VTPropertyType.VT_LPWSTR, authorProperty.VTType);
Assert.AreEqual("ABC\0", authorProperty.Value);
Assert.AreEqual("ABC", authorProperty.Value);

var keyWordsProperty = co.Properties.First(prop => prop.PropertyName == "PIDSI_KEYWORDS");
Assert.AreEqual(VTPropertyType.VT_LPWSTR, keyWordsProperty.VTType);
Assert.AreEqual("\0", keyWordsProperty.Value);
Assert.AreEqual("", keyWordsProperty.Value);
}
}

Expand Down Expand Up @@ -359,16 +361,16 @@ public void Test_Read_Unicode_User_Properties_Dictionary()
Assert.AreEqual((short)1200, propArray[0].Value);

// String properties
Assert.AreEqual("A\0", propArray[1].PropertyName);
Assert.AreEqual("\0", propArray[1].Value);
Assert.AreEqual("AB\0", propArray[2].PropertyName);
Assert.AreEqual("X\0", propArray[2].Value);
Assert.AreEqual("ABC\0", propArray[3].PropertyName);
Assert.AreEqual("XY\0", propArray[3].Value);
Assert.AreEqual("ABCD\0", propArray[4].PropertyName);
Assert.AreEqual("XYZ\0", propArray[4].Value);
Assert.AreEqual("ABCDE\0", propArray[5].PropertyName);
Assert.AreEqual("XYZ!\0", propArray[5].Value);
Assert.AreEqual("A", propArray[1].PropertyName);
Assert.AreEqual("", propArray[1].Value);
Assert.AreEqual("AB", propArray[2].PropertyName);
Assert.AreEqual("X", propArray[2].Value);
Assert.AreEqual("ABC", propArray[3].PropertyName);
Assert.AreEqual("XY", propArray[3].Value);
Assert.AreEqual("ABCD", propArray[4].PropertyName);
Assert.AreEqual("XYZ", propArray[4].Value);
Assert.AreEqual("ABCDE", propArray[5].PropertyName);
Assert.AreEqual("XYZ!", propArray[5].Value);
}
}

Expand Down Expand Up @@ -426,16 +428,16 @@ public void Test_DOCUMENT_SUMMARY_INFO_ADD_CUSTOM()
Assert.AreEqual((short)-535, propArray[0].Value);

// User properties
Assert.AreEqual("StringProperty\0", propArray[1].PropertyName);
Assert.AreEqual("Hello\0", propArray[1].Value);
Assert.AreEqual("StringProperty", propArray[1].PropertyName);
Assert.AreEqual("Hello", propArray[1].Value);
Assert.AreEqual(VTPropertyType.VT_LPSTR, propArray[1].VTType);
Assert.AreEqual("BooleanProperty\0", propArray[2].PropertyName);
Assert.AreEqual("BooleanProperty", propArray[2].PropertyName);
Assert.AreEqual(true, propArray[2].Value);
Assert.AreEqual(VTPropertyType.VT_BOOL, propArray[2].VTType);
Assert.AreEqual("IntegerProperty\0", propArray[3].PropertyName);
Assert.AreEqual("IntegerProperty", propArray[3].PropertyName);
Assert.AreEqual(3456, propArray[3].Value);
Assert.AreEqual(VTPropertyType.VT_I4, propArray[3].VTType);
Assert.AreEqual("DateProperty\0", propArray[4].PropertyName);
Assert.AreEqual("DateProperty", propArray[4].PropertyName);
Assert.AreEqual(testNow, propArray[4].Value);
Assert.AreEqual(VTPropertyType.VT_FILETIME, propArray[4].VTType);
}
Expand Down