Technology and Internals

Tornhoof edited this page Jul 16, 2018 · 8 revisions

Reader/Writer

JsonReader<> and JsonWriter<> are the basic types for reading and writing data. Both work on arrays of the underlying symbol type (byte for UTF8 and char for UTF16). Both types are ref structs so that they can hold a Span field for the data. Both are marked partial, and the actual implementations of the individual Read/Write methods are in:

  1. https://github.com/Tornhoof/SpanJson/blob/master/SpanJson/JsonWriter.Utf16.cs
  2. https://github.com/Tornhoof/SpanJson/blob/master/SpanJson/JsonWriter.Utf8.cs
  3. https://github.com/Tornhoof/SpanJson/blob/master/SpanJson/JsonReader.Utf16.cs
  4. https://github.com/Tornhoof/SpanJson/blob/master/SpanJson/JsonReader.Utf8.cs

The respective Read/Write implementations mostly use Utf8Formatter/Utf8Parser and/or the <type>.TryParse/TryFormat APIs from .NET Core 2.1.
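As a rough illustration of the two points above (a ref struct holding a Span field, and number handling delegated to Utf8Formatter/Utf8Parser), here is a minimal sketch. MiniUtf8Writer and MiniUtf8Reader are hypothetical names invented for this example; they are not SpanJson's JsonWriter<> or JsonReader<>.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical miniature writer. A Span<byte> field is only legal inside a ref struct.
public ref struct MiniUtf8Writer
{
    private Span<byte> _buffer;
    private int _pos;

    public MiniUtf8Writer(Span<byte> buffer)
    {
        _buffer = buffer;
        _pos = 0;
    }

    public void WriteInt32(int value)
    {
        // Utf8Formatter writes the UTF8 digits straight into the remaining buffer.
        if (!Utf8Formatter.TryFormat(value, _buffer.Slice(_pos), out int written))
            throw new InvalidOperationException("Buffer too small.");
        _pos += written;
    }

    // The part of the buffer that has been written so far.
    public ReadOnlySpan<byte> Written => _buffer.Slice(0, _pos);
}

// Hypothetical miniature reader for a single integer token.
public static class MiniUtf8Reader
{
    public static int ReadInt32(ReadOnlySpan<byte> token)
    {
        // Utf8Parser parses directly from UTF8 bytes, no intermediate string is created.
        if (!Utf8Parser.TryParse(token, out int value, out int consumed) || consumed != token.Length)
            throw new FormatException("Not a valid Int32 token.");
        return value;
    }
}
```

A UTF16 variant would look the same with Span<char>, using int.TryFormat and int.TryParse in place of Utf8Formatter and Utf8Parser.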

Formatters

BCL Types

The formatters for the individual types use JsonReader<> and JsonWriter<> directly. The formatters for several built-in types from the base class library (BCL) are generated via a T4 template, including versions for Nullable (where applicable), arrays and lists, each for both UTF8 and UTF16. Each generated formatter has a static Default property for easier access.
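The static Default pattern can be sketched as follows. Both classes and their Serialize signature are hypothetical and only show the shape: one formatter per type, a Nullable variant that delegates to the non-nullable one, and access through Default. The generated SpanJson formatters work on JsonReader<>/JsonWriter<> rather than on a raw Span.

```csharp
using System;
using System.Globalization;

// Hypothetical formatter for int (UTF16 flavour).
public sealed class Int32SketchFormatter
{
    public static Int32SketchFormatter Default { get; } = new Int32SketchFormatter();

    // Writes the value into destination and returns the number of chars written.
    public int Serialize(Span<char> destination, int value)
    {
        if (!value.TryFormat(destination, out int written, default, CultureInfo.InvariantCulture))
            throw new InvalidOperationException("Buffer too small.");
        return written;
    }
}

// Hypothetical Nullable variant: handles null itself, delegates everything else.
public sealed class NullableInt32SketchFormatter
{
    public static NullableInt32SketchFormatter Default { get; } = new NullableInt32SketchFormatter();

    public int Serialize(Span<char> destination, int? value)
    {
        if (value == null)
        {
            "null".AsSpan().CopyTo(destination);
            return 4;
        }
        return Int32SketchFormatter.Default.Serialize(destination, value.Value);
    }
}
```

Because the instance is reached through a static property, callers never need to allocate or look up a formatter for these types.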

Lists/Arrays/Enums/Dictionaries/Enumerables/Nullable

The more specialized formatters build directly on the BCL formatters, and each solves a specific problem, e.g. list serialization/deserialization. The technical concept is identical to that of the formatters for the BCL types, e.g. a static Default property. There is one enum formatter for strings and one for integers.

Dynamic

Dynamic is a special formatter where the actual casting to the concrete type is delayed until the user needs it. Internally, SpanJsonDynamicObject et al. work on either byte arrays or char arrays.
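The idea of delaying the conversion can be sketched like this. LazyNumberSketch is a hypothetical type made up for this example, not one of SpanJson's dynamic types: it only keeps the raw characters of the token and parses them when a concrete type is requested.

```csharp
using System;
using System.Globalization;

// Hypothetical lazily-converted JSON number backed by a char array.
public sealed class LazyNumberSketch
{
    private readonly char[] _raw; // raw token as it appeared in the input

    public LazyNumberSketch(char[] raw) => _raw = raw;

    // Parsing happens only here, when the caller asks for a concrete type.
    public int ToInt32()
    {
        ReadOnlySpan<char> span = _raw;
        return int.Parse(span, NumberStyles.Integer, CultureInfo.InvariantCulture);
    }

    public double ToDouble()
    {
        ReadOnlySpan<char> span = _raw;
        return double.Parse(span, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

    // Casts simply forward to the conversion methods.
    public static explicit operator int(LazyNumberSketch value) => value.ToInt32();
    public static explicit operator double(LazyNumberSketch value) => value.ToDouble();
}
```

With this shape, the same token can be read as `(int)value` by one caller and as `(double)value` by another, and no parsing cost is paid for values that are never accessed.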

Complex Types

The ComplexFormatter is by far the most complex formatter in SpanJson. Expression Trees are used to generate optimized serialization/deserialization methods.

Serialization

The serialization part is straightforward: it writes the object braces and then, for each member (field or property), writes the name (already optimized to include the double quotes and the separator ':') and invokes the appropriate serializer for the member type. If the type to be serialized is neither sealed nor a struct, a runtime check is done against the actual type of the supplied value, and the runtime formatter is invoked if that type is, e.g., a derived type rather than the declared member type.
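A hand-written approximation of that flow, including the runtime type check, might look like the sketch below. Animal, Dog and SerializeSketch are hypothetical, the output goes to a StringBuilder instead of a JsonWriter<>, and string escaping is left out; the real methods are generated with Expression Trees.

```csharp
using System;
using System.Globalization;
using System.Text;

public class Animal { public string Name = ""; }          // not sealed: derived types are possible
public sealed class Dog : Animal { public int Tricks; }

public static class SerializeSketch
{
    // Stand-in for the generated serializer of the declared type Animal.
    public static void WriteAnimal(StringBuilder sb, Animal value)
    {
        if (value == null) { sb.Append("null"); return; }

        // Runtime check: the value may be a derived type with more members.
        if (value.GetType() != typeof(Animal))
        {
            WriteRuntime(sb, value);
            return;
        }

        // The name is emitted as one pre-built piece including quotes and ':'.
        sb.Append('{').Append("\"Name\":").Append('"').Append(value.Name).Append('"').Append('}');
    }

    // Stand-in for the runtime formatter: picks the serializer for the concrete type.
    private static void WriteRuntime(StringBuilder sb, object value)
    {
        if (value is Dog dog)
        {
            sb.Append('{').Append("\"Name\":").Append('"').Append(dog.Name).Append('"')
              .Append(',').Append("\"Tricks\":").Append(dog.Tricks.ToString(CultureInfo.InvariantCulture))
              .Append('}');
            return;
        }
        throw new NotSupportedException(value.GetType().FullName);
    }
}
```

For a sealed class or a struct the check can be omitted entirely, since the declared type is then always the actual type.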

Attribute Name writing on UTF8

Instead of referencing a UTF8 byte array to write the attribute name, the bytes are converted to unsigned integers (ulong, uint, ushort, byte) and written directly, up to a name length of 32. This algorithm is similar to the name reading algorithm below (3. Comparison against integers). The algorithm is not used for UTF16, as the name length would be at most 16 characters, and actually even less, as Expression.Constant for strings is apparently quite a bit more efficient than for arrays.
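The chunking can be sketched as below. NameWriterSketch, Precompute and Write are hypothetical names, and BinaryPrimitives is used here as a portable stand-in since the text does not say which write primitive SpanJson uses. The sketch also loops over the chunks at runtime and does not enforce the length limit, whereas generated code would presumably emit the writes as straight-line code with the integers baked in as constants.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.Text;

public static class NameWriterSketch
{
    // Done once per member: split the quoted name (plus ':') into
    // 8/4/2/1-byte little-endian integers.
    public static (ulong Value, int Size)[] Precompute(string memberName)
    {
        ReadOnlySpan<byte> bytes = Encoding.UTF8.GetBytes("\"" + memberName + "\":");
        var chunks = new List<(ulong Value, int Size)>();
        while (bytes.Length > 0)
        {
            int size = bytes.Length >= 8 ? 8 : bytes.Length >= 4 ? 4 : bytes.Length >= 2 ? 2 : 1;
            ulong value = 0;
            for (int i = size - 1; i >= 0; i--)
                value = (value << 8) | bytes[i]; // first byte ends up least significant
            chunks.Add((value, size));
            bytes = bytes.Slice(size);
        }
        return chunks.ToArray();
    }

    // Done on every serialization: a handful of integer stores instead of a byte copy.
    public static int Write(Span<byte> destination, (ulong Value, int Size)[] chunks)
    {
        int pos = 0;
        foreach (var (value, size) in chunks)
        {
            Span<byte> target = destination.Slice(pos);
            switch (size)
            {
                case 8: BinaryPrimitives.WriteUInt64LittleEndian(target, value); break;
                case 4: BinaryPrimitives.WriteUInt32LittleEndian(target, (uint)value); break;
                case 2: BinaryPrimitives.WriteUInt16LittleEndian(target, (ushort)value); break;
                default: target[0] = (byte)value; break;
            }
            pos += size;
        }
        return pos;
    }
}
```

For the member account_id, the quoted name with separator is 13 bytes long, so it is split into one ulong, one uint and one byte.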

Deserialization

The deserialization part is quite complex, as it needs to support both constructor-based and normal member-assignment-based activation of results. It is a loop in which each iteration reads the JSON attribute name (the member name) and tries to find the matching member in the result type, see Member Name logic below for details. After matching the member, it calls the appropriate deserializer for that member type and assigns the value to the member (or writes it into a temporary variable for constructor assignment). If no member can be found, the corresponding value is skipped.

Runtime

Responsible for invoking the appropriate formatter for the concrete type supplied at runtime.

Resolver

The individual resolvers are necessary to allow for different behaviour during serialization/deserialization, e.g. ignoring nulls or changing the case to CamelCase.

Member Name logic

Finding the appropriate member during deserialization is a complex problem. SpanJson went through three different iterations for this:

1. Naive approach

Compare the JSON attribute name with each member name in every iteration. This is extremely slow: the order of the attributes is probably not the same as in the source code, and already assigned member names are compared again in each iteration. This can be O(n²) complexity, with n being the member count.
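A minimal sketch of the naive lookup, assuming the member names are available as a string array (NaiveLookup and FindMember are hypothetical names):

```csharp
using System;

public static class NaiveLookup
{
    // Returns the index of the matching member, or -1 if the attribute is unknown.
    // Called once per attribute, each call scans up to all n member names.
    public static int FindMember(ReadOnlySpan<char> attributeName, string[] memberNames)
    {
        for (int i = 0; i < memberNames.Length; i++)
        {
            if (attributeName.SequenceEqual(memberNames[i].AsSpan()))
                return i;
        }
        return -1;
    }
}
```

With n attributes in the input and n members to scan per attribute, the total work grows quadratically.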

2. Nested If approach

This approach is similar to the one used in Jil. Optimized if statement ladders for the individual characters/bytes of the member names are generated, e.g. if two members start with the same letter, they are grouped together under that first letter and then evaluated.

// Assume the following members: Hello, World and Wife
if(attributeName[0] == 'H')
{
  if(attributeName[1] == 'e')
  {
     // continue until all characters are matched
  }
}
else if(attributeName[0] == 'W')
{
  if(attributeName[1] == 'i') // Wife
  {
     // continue until all characters are matched
  }
  else if(attributeName[1] == 'o') // World
  {
     // continue until all characters are matched
  }
}

The above method works fairly well, is several times faster than the naive approach, and is fairly easy to debug in Expression Trees because the flow is easy to follow.

Optimization

It's possible to optimize the above by comparing the length of the attribute name first.

if(attributeNameLength == 5 && attributeName[0] == 'H')
{
  if(attributeName[1] == 'e')
  {
     // continue until all characters are matched
  }
}

The comparison against the length before the actual nested ladder gives additional performance.

3. Comparison against integers

A downside of the nested if ladder approach is that each individual character is compared even when, e.g., Hello shares no characters with any other member name. The most recent versions use a variation of the automata approach in Utf8Json. Instead of grouping by single characters, we now group by the longest common starting substring, and only then is a nested if ladder created.

  • Example: Hello_World and Hello_Universe are now grouped by Hello_
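Finding the group key for a set of member names can be sketched as follows. PrefixSketch is a hypothetical helper for illustration; the actual grouping logic lives in SpanJson's MemberComparisonBuilder and may differ.

```csharp
using System;

public static class PrefixSketch
{
    // Longest starting substring shared by all given names (empty if none).
    public static string LongestCommonPrefix(string[] names)
    {
        if (names.Length == 0) return string.Empty;

        string prefix = names[0];
        foreach (string name in names)
        {
            // Shrink the prefix to the part that this name also starts with.
            int i = 0;
            while (i < prefix.Length && i < name.Length && prefix[i] == name[i])
                i++;
            prefix = prefix.Substring(0, i);
        }
        return prefix;
    }
}
```

Members that share a non-empty prefix are compared against that prefix once, and only the differing remainders need further branching.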

Instead of comparing the group as a string, an integer-based key is created and compared. This integer key is just the UTF8/Unicode value converted to an integer (https://github.com/Tornhoof/SpanJson/blob/master/SpanJson/Helpers/MemberComparisonBuilder.cs). At runtime, the key can be read very quickly via Unsafe.ReadUnaligned.

// assume the member names are scope and account_id
var name = reader.ReadUtf16NameSpan();
var length = name.Length;
ref var b = ref MemoryMarshal.GetReference(MemoryMarshal.AsBytes(name));
if (length == 5 && ReadUInt64(ref b, 0) == 31525674139451507UL && ReadUInt16(ref b, 8) == 101)
{
    result.scope = StringUtf16ListFormatter<ExcludeNullsOriginalCaseResolver<char>>.Default.Deserialize(ref reader);
    continue;
}
if (length == 10 && ReadUInt64(ref b, 0) == 31244147623133281UL && ReadUInt64(ref b, 8) == 26740621010927733UL &&
    ReadUInt32(ref b, 16) == 6553705U)
{
    result.account_id = NullableInt32Utf16Formatter<ExcludeNullsOriginalCaseResolver<char>>.Default.Deserialize(ref reader);
    continue;
}

The above source code example is from the development branch of this feature, when it was still offline source code generation and not in-memory expression tree generation (see CodeGen and Full Example).
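The constants in that example can be reproduced with a few lines. The sketch below assumes a little-endian machine; NameKeySketch, ComputeKey and IsScope are hypothetical names, and ReadUInt64/ReadUInt16 are written here as plausible equivalents of the helpers used in the example, based on Unsafe.ReadUnaligned.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class NameKeySketch
{
    // Build time: reinterpret four UTF16 chars (8 bytes) as one ulong key.
    public static ulong ComputeKey(string fourChars)
    {
        if (fourChars.Length != 4) throw new ArgumentException("Expected exactly four characters.");
        return MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(fourChars.AsSpan()));
    }

    // Runtime: unaligned reads at a byte offset. No bounds checks are performed,
    // so the caller must have verified the length beforehand.
    public static ulong ReadUInt64(ref byte b, int offset) =>
        Unsafe.ReadUnaligned<ulong>(ref Unsafe.Add(ref b, offset));

    public static ushort ReadUInt16(ref byte b, int offset) =>
        Unsafe.ReadUnaligned<ushort>(ref Unsafe.Add(ref b, offset));

    // Same comparison as the 'scope' branch of the example.
    public static bool IsScope(ReadOnlySpan<char> name)
    {
        if (name.Length != 5) return false; // the length check keeps the reads in bounds
        ref byte b = ref MemoryMarshal.GetReference(MemoryMarshal.AsBytes(name));
        return ReadUInt64(ref b, 0) == 31525674139451507UL // "scop"
            && ReadUInt16(ref b, 8) == 101;                // 'e'
    }
}
```

On a little-endian machine, ComputeKey("scop") yields 31525674139451507, the first constant in the example, so a five-character name is matched with two integer comparisons instead of five character comparisons.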

Switching from the nested if approach to the integer comparison approach resulted in ~30-50% better deserialization performance, and the code is basically the same for UTF8 and UTF16.