InvalidCastException in mixed-host cluster #5102

Closed
evenbrenden opened this issue Jun 22, 2021 · 45 comments · Fixed by akkadotnet/Hyperion#236

@evenbrenden

Version Information
Version of Akka.NET?

1.4.21

Which Akka.NET Modules?

Akka.Cluster

Describe the bug

As part of a seamless migration from a Windows VM cluster to a K8s cluster, we are running our Akka cluster on a mix of Windows and Linux hosts. The seed nodes are running on Windows hosts. A node running in the K8s cluster is able to establish a connection with the cluster, but crashes on a serialization error:

System.InvalidCastException: Unable to cast object of type 'System.Int32' to type 'System.Int64'.
   at lambda_method132(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at lambda_method130(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at Hyperion.Serializer.Deserialize[T](Stream stream)
   at Akka.Serialization.HyperionSerializer.FromBinary(Byte[] bytes, Type type)
   at Akka.Serialization.Serialization.Deserialize(Byte[] bytes, Int32 serializerId, String manifest)
   at Akka.Remote.MessageSerializer.Deserialize(ExtendedActorSystem system, Payload messageProtocol)
   at Akka.Remote.DefaultMessageDispatcher.Dispatch(IInternalActorRef recipient, Address recipientAddress, Payload message, IActorRef senderOption)
   at Akka.Remote.EndpointReader.<Reading>b__11_0(InboundPayload inbound)
   at lambda_method60(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)

Since the stack trace is clear of any application code, I can't be sure exactly what kind of message this is, but I assume it's a system message (as indicated by the last line in the stack trace). This could of course be a problem with Hyperion, but how do I figure out exactly what is being deserialized?

Environment

Windows Server 2016 Standard (x64)
.NET 5.0.301 Ubuntu 20.04 Docker image

Additional context

Log excerpt:

2021-06-22 07:47:10 ERR AssociationError [akka.tcp://Oddjob@distribution-oddjob-webapi-f446d8c48-nk2ms:1968] -> akka.tcp://Oddjob@maodatest02.felles.ds.nrk.no:1966: Error [Unable to cast object of type 'System.Int32' to type 'System.Int64'.] [   at lambda_method132(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at lambda_method130(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at Hyperion.Serializer.Deserialize[T](Stream stream)
   at Akka.Serialization.HyperionSerializer.FromBinary(Byte[] bytes, Type type)
   at Akka.Serialization.Serialization.Deserialize(Byte[] bytes, Int32 serializerId, String manifest)
   at Akka.Remote.MessageSerializer.Deserialize(ExtendedActorSystem system, Payload messageProtocol)
   at Akka.Remote.DefaultMessageDispatcher.Dispatch(IInternalActorRef recipient, Address recipientAddress, Payload message, IActorRef senderOption)
   at Akka.Remote.EndpointReader.<Reading>b__11_0(InboundPayload inbound)
   at lambda_method60(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)]
2021-06-22 07:47:10 WRN Association with remote system "akka.tcp://Oddjob@maodatest02.felles.ds.nrk.no:1966" has failed; address is now gated for 5000 ms. Reason is: ["System.InvalidCastException: Unable to cast object of type 'System.Int32' to type 'System.Int64'.\n   at lambda_method132(Closure , Stream , DeserializerSession )\n   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)\n   at lambda_method130(Closure , Stream , DeserializerSession )\n   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)\n   at Hyperion.Serializer.Deserialize[T](Stream stream)\n   at Akka.Serialization.HyperionSerializer.FromBinary(Byte[] bytes, Type type)\n   at Akka.Serialization.Serialization.Deserialize(Byte[] bytes, Int32 serializerId, String manifest)\n   at Akka.Remote.MessageSerializer.Deserialize(ExtendedActorSystem system, Payload messageProtocol)\n   at Akka.Remote.DefaultMessageDispatcher.Dispatch(IInternalActorRef recipient, Address recipientAddress, Payload message, IActorRef senderOption)\n   at Akka.Remote.EndpointReader.<Reading>b__11_0(InboundPayload inbound)\n   at lambda_method60(Closure , Object , Action`1 , Action`1 , Action`1 )\n   at Akka.Actor.ReceiveActor.OnReceive(Object message)\n   at Akka.Actor.UntypedActor.Receive(Object message)\n   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)\n   at Akka.Actor.ActorCell.ReceiveMessage(Object message)\n   at Akka.Actor.ActorCell.Invoke(Envelope envelope)\n--- End of stack trace from previous location ---\n   at Akka.Actor.ActorCell.HandleFailed(Failed f)\n   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)"]
2021-06-22 07:47:10 ERR Unable to cast object of type 'System.Int32' to type 'System.Int64'.
System.InvalidCastException: Unable to cast object of type 'System.Int32' to type 'System.Int64'.
   at lambda_method132(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at lambda_method130(Closure , Stream , DeserializerSession )
   at Hyperion.ValueSerializers.ObjectSerializer.ReadValue(Stream stream, DeserializerSession session)
   at Hyperion.Serializer.Deserialize[T](Stream stream)
   at Akka.Serialization.HyperionSerializer.FromBinary(Byte[] bytes, Type type)
   at Akka.Serialization.Serialization.Deserialize(Byte[] bytes, Int32 serializerId, String manifest)
   at Akka.Remote.MessageSerializer.Deserialize(ExtendedActorSystem system, Payload messageProtocol)
   at Akka.Remote.DefaultMessageDispatcher.Dispatch(IInternalActorRef recipient, Address recipientAddress, Payload message, IActorRef senderOption)
   at Akka.Remote.EndpointReader.<Reading>b__11_0(InboundPayload inbound)
   at lambda_method60(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.OnReceive(Object message)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)
@Arkatufus
Contributor

Are you sure that both versions are sharing the same message library to talk to each other?

@evenbrenden
Author

Are you sure that both versions are sharing the same message library to talk to each other?

Yes, Hyperion is at 0.10.1 and Akka.Serialization.Hyperion at 1.4.21 on all deployments.

@Aaronontheweb
Member

@Arkatufus this is a classic issue in other polymorphic serializers, especially JSON - where 32-bit and 64-bit integers get confused on the wire. What I don't understand, though: I thought Hyperion's wire type made this information explicit?

@Arkatufus
Contributor

Arkatufus commented Jun 23, 2021

@Aaronontheweb Yes, it was explicit.

@evenbrenden I mean your messages, the ones that you use in your application. Are all the instances using the same shared library of your messages, or are all of them built from a single source?
Has anyone in your team changed a property or field from long to int or vice versa?

@evenbrenden
Author

I mean your messages, the ones that you use in your application. Are all the instances using the same shared library of your messages, or are all of them built from a single source?

All instances are built from the same revision of a shared library, so there shouldn't be any differences in the message contracts. We also source events from persistence, but any discrepancies would have shown up on other nodes with the same role too, which is not the case. There are nodes with the role in question on both Windows and Linux hosts, but they only fail on the K8s/Linux side. That is why my hypothesis is that there is something about the mix of platforms that causes this.

I'll try to reproduce this with #5105 once it's released.

@Aaronontheweb
Member

@evenbrenden are both platforms running .NET Core? Or is one .NET Framework and the other is .NET Core (we've solved a bunch of "x-plat" issues in this area in the recent past)?

Could you include a small example of what this message looks like too? That would really help us reproduce.

@evenbrenden
Author

Yes, all deployments are built with .NET (Core) SDK 5.0.

The problem is that I can't tell what the message is - there's no sign of any application code in the stack trace and no hints in the logs prior to the crash. I'll try to reproduce with #5105.

@Aaronontheweb
Member

Ah got it! That build should be available in our most recent nightly: https://getakka.net/community/getting-access-to-nightly-builds.html

@evenbrenden
Author

With Akka.Serialization.Hyperion 1.4.22-beta637601761567607228:

WRN Serializer not defined for message with serializer id [-5] and manifest [""]. Transient association error (association remains live). "Failed to deserialize instance of type . Unable to cast object of type 'System.Int32' to type 'System.Int64'."

Not much to go by, but what's interesting is that there's no exception and the node stays up and reachable. The warnings keep coming, maybe 3/second.

@Aaronontheweb
Member

Looks to me like the serializer data is bad too - can't have a -5 serializer id for built-in messages.

What does your akka.actor.serialization-bindings hocon look like?

@evenbrenden
Author

Actually, the ID for Hyperion is consistently set to -5:

akka {
  actor {
    serializers {
      hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
      protobuf-livetovod = "Nrk.Oddjob.LiveToVod.PersistentEvents+ProtobufSerializer, LiveToVod"
      protobuf-mediaset = "Nrk.Oddjob.Core.Dto.Serialization+ProtobufSerializer, Core"
      protobuf-delivery = "Nrk.Oddjob.Core.GuaranteedDelivery+Persistence+ProtobufSerializer, Core"
      protobuf-transcoding = "Nrk.Oddjob.Ps.PsPersistence+ProtobufSerializer, Ps"
    }
    serialization-bindings {
      "System.Object" = hyperion
      "Nrk.Oddjob.LiveToVod.PersistentEvents+IProtoBufSerializable, LiveToVod" = protobuf-livetovod
      "Nrk.Oddjob.Core.Dto.IProtoBufSerializable, Core" = protobuf-mediaset
      "Nrk.Oddjob.Core.GuaranteedDelivery+Persistence+IProtoBufSerializable, Core" = protobuf-delivery
      "Nrk.Oddjob.Ps.PsPersistence+IProtoBufSerializable, Ps" = protobuf-transcoding
    }
    serialization-identifiers {
      "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion" = -5
      "Akka.Serialization.NewtonSoftJsonSerializer, Akka" = 1
      "Nrk.Oddjob.LiveToVod.PersistentEvents+ProtobufSerializer, LiveToVod" = 127
      "Nrk.Oddjob.Core.Dto.Serialization+ProtobufSerializer, Core" = 126
      "Nrk.Oddjob.Core.GuaranteedDelivery+Persistence+ProtobufSerializer, Core" = 128
      "Nrk.Oddjob.Ps.PsPersistence+ProtobufSerializer, Ps" = 129
    }
  }
}

This has been our working config for a while now. @object please chime in here if you know something that I don't.

@evenbrenden
Author

@Aaronontheweb We can't be sure where we got the -5 from - it dates years back. Could this be the cause of the error, and, if so, what should we change it to? I can see from other repos that 13 is used, which is within the reserved space (0, 40).

@Aaronontheweb
Member

Eh, if the -5 is in your configuration and it works then I wouldn't change it - I assumed it was an error but apparently it wasn't.

@Arkatufus can we try to do the following in Hyperion?

  1. Reproduce a simple int32 -> int64 deserialization error;
  2. If there's a casting problem with a specific field for a type being deserialized, can we mention which root type the deserialization operation failed for? Do we not do this already?

@Arkatufus
Contributor

Are you sure it isn't malformed serialized data? I see that the manifest is blank.

@Aaronontheweb
Member

@Arkatufus normal for HyperionSerializer

/// <summary>
/// Completely unique value to identify this implementation of Serializer, used to optimize network traffic
/// </summary>
public override int Identifier => -5;
/// <summary>
/// Returns whether this serializer needs a manifest in the fromBinary method
/// </summary>
public override bool IncludeManifest => false;

@Aaronontheweb
Member

Also, that's where the -5 came from originally. Weird.

@object
Contributor

object commented Jun 29, 2021

Yes, we found the ID for the Hyperion serializer in one of those files (or a sample project). I was puzzled about the reason for using a negative value for the serializer ID, but it worked in our code so we just kept it.

@object
Contributor

object commented Jun 29, 2021

But it would be great to explain more of the internals behind these decisions so they won't look like black magic.

@Arkatufus
Contributor

I can reproduce the error with version-tolerance set to true (the Akka default setting) after changing a property in the serialized object from int to long.

@evenbrenden are you sure that none of your remote messages are modified between the Linux and Windows systems?

@evenbrenden
Author

@evenbrenden are you sure that none of your remote messages are modified between the Linux and Windows systems?

AFAIK there's nothing changing the messages in transit, and both ends are running the same code. But I can't be sure, since I don't know what the message is or where it is coming from. Do you know of any way for me to get more context?

@Aaronontheweb
Member

Going to push a new nightly build that includes this #5115

This should capture some more data about what message type failed to be deserialized

@Aaronontheweb
Member

New nightly has been pushed with this change

@evenbrenden
Author

Thanks @Aaronontheweb and @Arkatufus, that is very helpful. We can now see the type of the offending message:

2021-07-02 08:57:11 WRN Serializer not defined for message with serializer id [-5] and manifest [""]. Transient association error (association remains live). "Failed to deserialize instance of type . Failed to deserialize object of type [Nrk.Oddjob.Core.PubSub.PubSubMessages+MediaSetRemoteFileUpdate] from the stream. Cause: Failed to deserialize object of type [Nrk.Oddjob.Core.Dto.MediaSet+RemoteResult] from the stream. Cause: Unable to cast object of type 'System.Int32' to type 'System.Int64'."

This is the record/struct RemoteResult:

    type RemoteResult = {
          [<ProtoMember(1)>] ResultCode_Removed : int64
          [<ProtoMember(2)>] ResultMessage : string
          [<ProtoMember(3)>] ResultCode : int
    }

RemoteResult is used for two things:

  • Event sourcing (hence the ProtoMembers), where ResultCode_Removed : int64 is a deprecated member.
  • Distributed PubSub, where the node in question is a subscriber, and the message is being deserialized.

Since the actor is up and running and since this is about Hyperion, event sourcing can not be the problem here, so the failure must occur in the context of PubSub. When passing this message, ResultCode_Removed is consistently set to 0L in our application. One hypothesis is that this value is somehow coerced into a 0 : int (int32) somewhere, which fails conversion back to int64 on the receiving end. Does that make sense to follow up on the framework side?

I want to stress the fact that this does not happen with other nodes with the same role (and the same code), only the one whose host platform differs from the others.

@object
Contributor

object commented Jul 2, 2021

Just want to add to what Even wrote: the error occurs during deserialization of MediaSetRemoteFileUpdate, which is a struct that includes a property of type RemoteResult. So the top level message type looks like this:

type MediaSetRemoteFileUpdate = {
    MediaSetId : string
    RemoteState : int
    RemoteResult : RemoteResult
    Timestamp : DateTimeOffset
    // skipped unrelated fields
}

Since RemoteResult is the only type that contains a property of type int64 and the exception is raised when casting int32 to int64, it must be RemoteResult.ResultCode_Removed that causes the problem. It is an obsolete field which is no longer in use, so it is always set to zero. It must be a conversion between 32-bit and 64-bit zeros that causes the exception, but only when Windows and Linux talk to each other during distributed PubSub communication.

@Aaronontheweb
Member

Thanks @object and @evenbrenden - this is helpful. We'll take a look to see how this might be happening inside Hyperion. Endianness is the same for .NET on both platforms, so I wonder what the issue could be...

@object
Contributor

object commented Jul 2, 2021

I looked through the code, and the only place the field ResultCode_Removed is used is where it is set to 0L (it's obsolete, left from earlier days). But the big question is why it needs to be cast at all, since it's the same version of the contract at both ends.

@Arkatufus
Contributor

I really tried to reproduce the problem, but it works just fine as far as I can see.
@evenbrenden and @object, I would be grateful if you can chip in and point out what is missing in the reproduction branch to reproduce the problem.

@rogeralsing
Contributor

rogeralsing commented Jul 7, 2021

@object you could diagnose this by fetching the field-infos from the type on the different node types.
See how it's handled internally: https://github.com/akkadotnet/Hyperion/blob/dev/src/Hyperion/Extensions/ReflectionEx.cs#L26

At first glance, the problem you see here seems odd: Hyperion writes a byte tag for each field type, e.g. a specific byte to signal that the value to read/write is an int64 or int32 respectively.

Some thoughts;

  1. The message types are different. e.g. outdated binary deployed or something
  2. Different sort order by culture or something like that. all fields are sorted by name in the method linked above, payloads are written in that same order.
  3. Unlikely, but maybe some F# specific thing? e.g. emitting some extra field for better byte alignment of types?

Either way, fetching the field-infos the same way as the code above should give you the truth.

If they yield the same result on both platforms, then the issue is in the serializer itself.
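
One way to run the check suggested above is a small console program executed once on each host type. This is only a sketch: `FieldOrderProbe` and the stand-in `RemoteResult` class are hypothetical names, the traversal only approximates what the linked `ReflectionEx` code does, and in practice you would pass the real message type from the shared library instead.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in; replace with the real message type from the shared library.
public sealed class RemoteResult
{
    public long ResultCode_Removed;
    public string ResultMessage;
    public int ResultCode;
}

public static class FieldOrderProbe
{
    // Collects instance fields (public and non-public) up the inheritance chain
    // and returns "name : type" strings sorted by field name with the given comparer.
    public static string[] FieldNames(Type type, StringComparer comparer)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        var fields = Enumerable.Empty<FieldInfo>();
        for (var t = type; t != null; t = t.BaseType)
            fields = fields.Concat(t.GetFields(flags));
        return fields
            .OrderBy(f => f.Name, comparer)
            .Select(f => $"{f.Name} : {f.FieldType.Name}")
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine($"OS: {Environment.OSVersion}, culture: '{CultureInfo.CurrentCulture.Name}'");
        Console.WriteLine("Culture-sensitive order:");
        foreach (var line in FieldNames(typeof(RemoteResult), StringComparer.CurrentCulture))
            Console.WriteLine("  " + line);
        Console.WriteLine("Ordinal order:");
        foreach (var line in FieldNames(typeof(RemoteResult), StringComparer.Ordinal))
            Console.WriteLine("  " + line);
    }
}
```

If the culture-sensitive list differs between the Windows and Linux hosts for the real type, the payload is being written and read in different field orders. Note that the C# stand-in will not show a difference by itself; the real F# record is needed, since its compiler-generated field names are what get sorted.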

@Arkatufus
Contributor

@rogeralsing Different sort order might be the root cause, thank you for pointing that out.

@rogeralsing
Contributor

rogeralsing commented Jul 7, 2021

Field names generated by the F# compiler used here:

Field names ordered on Mac:

ResultCode_Removed@
ResultCode@
ResultMessage@

Same code running in .NET fiddle:

ResultCode@
ResultCode_Removed@
ResultMessage@

Code:

using System;
using System.Linq;

var strings = new[]
{
    "ResultCode@",
    "ResultCode_Removed@",
    "ResultMessage@",
};
var res = strings.OrderBy(s => s);
foreach (var x in res)
{
    Console.WriteLine(x);
}

@object
Contributor

object commented Jul 8, 2021

@rogeralsing If the 64-bit ResultCode_Removed is swapped with the 32-bit ResultCode, then the error is exactly what's going to happen. But what puzzles me is how this could go unnoticed for so long. More or less everything in a mixed environment should be failing. Or is a mixed environment extremely rarely used?
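
To illustrate why swapped field order would surface as exactly this exception, here is a minimal sketch that is independent of Hyperion. It assumes (this is not confirmed by the thread) that the generated reader ends up holding a boxed value that it casts to the declared field type; `UnboxDemo` is a hypothetical name.

```csharp
using System;

public static class UnboxDemo
{
    // Tries to unbox a boxed Int32 as Int64 and returns the exception message,
    // or the converted value as text if no exception was thrown.
    public static string TryUnboxIntAsLong()
    {
        object boxed = 0; // a boxed System.Int32, e.g. the value written for ResultCode
        try
        {
            // Unboxing requires the exact value type, so this is not a widening conversion.
            return ((long)boxed).ToString();
        }
        catch (InvalidCastException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // Should print the same kind of message as the one in the report.
        Console.WriteLine(TryUnboxIntAsLong());

        // An explicit numeric conversion works, because it is not an unbox.
        object boxed = 0;
        Console.WriteLine(Convert.ToInt64(boxed));
    }
}
```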

@rogeralsing
Contributor

rogeralsing commented Jul 8, 2021

The reason is the edge-case with the names here:

"ResultCode@"
"ResultCode_Removed@"
           ^

It will only happen because the sort order of @ and _ differs between platforms.
And it will only happen if you have two fields starting with the same name, where one has the underscore.
[edit] and only in F#, as C# does not add @ to the end of field names

@object
Contributor

object commented Jul 8, 2021

Aha, I guess that explains everything. Even F# came to the picture!

@object
Contributor

object commented Jul 8, 2021

Hmm, but we are not using "@" in our property names. The original data structure (shown above) has the names "ResultCode" and "ResultCode_Removed", so why did you suffix them with the "@" character?

@rogeralsing
Contributor

Those are the reflection field names

@object
Contributor

object commented Jul 8, 2021

Ah I see now.

@object
Contributor

object commented Jul 8, 2021

It's ironic that even though we switched from JSON to Protobuf partly to stop thinking about names on the wire, our Protobuf-friendly data structures are beaten by naming issues. So I guess there are two options to handle this:

  1. A proper fix in Hyperion (don't know how difficult it might be) so field order would match the payload.
  2. A workaround: rename ResultCode_Removed to something like ResultCode@Removed.

Obviously (1) is the ultimate fix, otherwise there's a ticking bomb that will explode next time.

@object
Contributor

object commented Jul 8, 2021

@evenbrenden
Author

Wow, edge case indeed! Checked the hypothesis with a field rename:

    type RemoteResult = {
          [<ProtoMember(1)>] Removed_ResultCode : int64
          [<ProtoMember(2)>] ResultMessage : string
          [<ProtoMember(3)>] ResultCode : int
    }

Can confirm that the warnings are gone. @rogeralsing thanks!

@Arkatufus this sounds like a fundamental problem for any serializer that uses reflection for deserializing fields. Maybe we are better off using a different one?

@Aaronontheweb
Member

More on the subject: https://blog.jetbrains.com/dotnet/2020/05/13/sorting-order-depends-runtime-operating-system/

Might be able to do a StringComparison.Ordinal to fix that?
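
A minimal sketch of that idea, reusing the three field names from the snippet earlier in the thread: passing `StringComparer.Ordinal` to `OrderBy` compares UTF-16 code units and does not consult the platform's culture data. `OrdinalSortDemo` is a hypothetical name, and this is not the actual Hyperion patch.

```csharp
using System;
using System.Linq;

public static class OrdinalSortDemo
{
    // Sorts names by raw UTF-16 code unit values, independent of OS and culture.
    public static string[] SortOrdinal(string[] names) =>
        names.OrderBy(s => s, StringComparer.Ordinal).ToArray();

    public static void Main()
    {
        var names = new[] { "ResultCode_Removed@", "ResultMessage@", "ResultCode@" };
        foreach (var name in SortOrdinal(names))
            Console.WriteLine(name);
        // '@' (U+0040) sorts before '_' (U+005F) by code unit, so this prints:
        // ResultCode@
        // ResultCode_Removed@
        // ResultMessage@
    }
}
```

The trade-off is that ordinal order is not linguistically meaningful, which does not matter here because the order only has to be identical on the writing and reading side.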

@object
Contributor

object commented Jul 8, 2021

@Aaronontheweb yes, this was my thought too.

@object
Contributor

object commented Jul 8, 2021

I am looking at the Hyperion code. Might send a PR soon.

@object
Contributor

object commented Jul 8, 2021

@Aaronontheweb PR is sent.

@Aaronontheweb
Member

We'll do a new release of Hyperion to help resolve this.

@Arkatufus
Contributor

@evenbrenden @object a new Hyperion release has been made that includes this fix: https://github.com/akkadotnet/Hyperion/releases/tag/0.11.0
