-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InvalidCastException in mixed-host cluster #5102
Comments
Are you sure that both version are sharing the same message library to talk to each other? |
Yes, Hyperion is at 0.10.1 and Akka.Serialization.Hyperion at 1.4.21 on all deployments. |
@Arkatufus this is a classic issue in other polymorphic serializers, especially JSON - where 32bit and 64bit integers get confused on the wire. What I don't understand is though: I thought Hyperion's wire type made this information explicit? |
@Aaronontheweb Yes, it was explicit. @evenbrenden I mean your messages, the ones that you use in your application. Are all the instances uses the same shared library of your messages, or are all of them built from a single source. |
All instances are built from the same revision of a shared library, so there shouldn't be any differences in the message contracts. We also source events from persistence, but any discrepancies would have shown up on other nodes with the same role too, which is not the case. There are nodes with the role in question on both Windows and Linux hosts, but they only fail on the K8s/Linux side. Which is why my hypothesis is that there is something about the mix of platforms that a causes this. I'll try to reproduce this with #5105 once it's released. |
@evenbrenden are both platforms running .NET Core? Or is one .NET Framework and the other is .NET Core (we've solved a bunch of "x-plat" issues in this area in the recent past)? Could you include a small example of what this message looks like too? That would really help us reproduce. |
Yes, all deployments are built with .NET (Core) SDK 5.0. The problem is that I can't tell what the message is - there's no sign of any application code in the stack trace and no hints in the logs prior to the crash. I'll try to reproduce with #5105. |
Ah got it! That build should be available in our most recent nightly: https://getakka.net/community/getting-access-to-nightly-builds.html |
With Akka.Serialization.Hyperion 1.4.22-beta637601761567607228:
Not much to go by, but what's interesting is that there's no exception and the node stays up and reachable. The warnings keep coming, maybe 3/second. |
Looks to me like the serializer data is bad too - can't have a What does your |
Actually, the ID for Hyperion is consistently set to -5:
This has been our working config for a while now. @object please chime in here if you know something that I don't. |
@Aaronontheweb We can't be sure where we got the -5 from - it dates years back. Could this the cause of the error, and, if so, what should we change it to? I can see from other repos that 13 is used, which is within the reserved space (0, 40). |
Eh, if the @Arkatufus can we try to do the following in Hyperion?
|
Are you sure it isn't a malformed serialized data? I see that the manifest is blank. |
@Arkatufus normal for akka.net/src/contrib/serializers/Akka.Serialization.Hyperion/HyperionSerializer.cs Lines 92 to 100 in dba9097
|
Also, that's where the |
Yes, we found the Id for Hyperion serializer in one of such files (or sample project). I was puzzled what was the reason for using negative value for serializer Id, but it worked in our code so we just kept it. |
But if would be great to explain more of internals behind these decisions so they won't look like black magic. |
I can reproduce the error with @evenbrenden are you sure that none of your remote messages are modified between the linux and windows system. |
AFAIK there's nothing changing the messages in transit, and both ends are running the same code. But I can't be sure, since I don't know what the message is or where it is coming from. Do you know of any way for me get more context? |
Going to push a new nightly build that includes this #5115 This should capture some more data about what message type failed to be deserialized |
New nightly has been pushed with this change |
Thanks @Aaronontheweb and @Arkatufus, that is very helpful. We can now see the type of the offending message:
This is the record/struct
Since the actor is up and running and since this is about Hyperion, event sourcing can not be the problem here, so the failure must occur in the context of PubSub. When passing this message, I want to stress that fact that this does not happen with other nodes with the same role (and the same code), only the one whose host platform differs from the others. |
Just want to add to what Even wrote: the error occurs during deserialization of MediaSetRemoteFileUpdate which is a stuct that include a property of a type RemoteResult. So the top level message type looks like this:
Since RemoteResult is the only type that contains a property of int64 and the exception is raised when casting int32 as int64, it must be RemoteResult.ResultCode_Removed that causes the problem. It is an obsolete field which is no longer in use, so it is always set to a zero. It must be a conversion between 32 and 64 bit zeros that causes an exception, but only when Windows and Linux talks to each other during distributed PubSub communication. |
Thanks @object and @evenbrenden - this is helpful. We'll take a look to see how this might be happening inside Hyperion. Endianness is the same for .NET on both platforms, so I wonder what the issue could be... |
I looked through the code, and the only place the field ResultCode_Removed is used is when it was set to 0L (it's obsolete, left from earlier days). But the big question is why does it need to cast it because it's same version of the contract at the both ends. |
I really tried to reproduce the problem, it works just fine as far as i can see. |
@object you could diagnose this by fetching the field-infos from the type on the different node types. From a first glance, the problem you see here seems odd, Hyperion writes a byte tag for each field type. e.g. a specific byte to signal that the value to read/write is an int64 or int32 respectively. Some thoughts;
Anyway way, fetching the field-infos the same way as the code above, should give you the truth. If they yield the same result of both platforms, then the issue is in the serializer itself |
@rogeralsing Different sort order might be the root cause, thank you for pointing that out. |
Field names generated by the F# compiler used here: Field names ordered on Mac:
Same code running in .NET fiddle:
Code:
|
@rogeralsing If 64 bit ResultCode_Removed is swapped with 32 bit ResultCode, then the error is exactly what's going to happen. But what puzzles me then how this could be unnoticed for so long? More or less everything in a mixed environment should be failing. Or is mixed environment extremely rarely used? |
The reason is the edge-case with the names here:
It will only happen because the sort order of @ and _ differs between platforms. |
Aha, I guess that explains everything. Even F# came to the picture! |
Hmm, but we are not using "@" in our property names. The original data structure (shown above) have names "ResultCode" and "ResultCode_Removed", so why did you suffix them with "@" character? |
Those are the reflection field names |
Ah I see now. |
It's ironic that even though we switched from JSON to Protobuf partly to stop thinking about names on the wire, our Protobuf-friendly data structures are beaten by naming issues. So I guess there are two options to handle this:
Obviously (1) is the ultimate fix, otherwise there's a ticking bomb that will explode next time. |
Wow, edge case indeed! Checked the hypothesis with a field rename:
Can confirm that the warnings are gone. @rogeralsing thanks! @Arkatufus this sounds like a fundamental problem for any serializer that uses reflection for deserializing fields. Maybe we are better off using a different one? |
Might be able to do a |
@Aaronontheweb yes, this is what my thought too. |
I am looking at Hyperon code. Might send PR soon. |
@Aaronontheweb PR is sent. |
We'll do a new release of Hyperion to help resolve this. |
@evenbrenden @object a new Hyperion release has been made that includes this fix: https://github.com/akkadotnet/Hyperion/releases/tag/0.11.0 |
Version Information
Version of Akka.NET?
1.4.21
Which Akka.NET Modules?
Akka.Cluster
Describe the bug
As part of a seamless migration from a Windows VM cluster to a K8s cluster, we are running our Akka cluster on a mix of Windows and Linux hosts. The seed nodes are running on Windows hosts. The/a node running in the K8s cluster is able to establish a connection with the cluster, but crashes on a serialization error:
Since the stack trace is clear of any application code, I can't be sure exactly what kind of message this is, but I assume it's a system message (as indicated by the last line in the stack trace). This could of course be a problem with Hyperion, but how do I figure out exactly what is being deserialized?
Environment
Windows Server 2016 Standard (x64)
.NET 5.0.301 Ubuntu 20.04 Docker image
Additional context
Log excerpt:
The text was updated successfully, but these errors were encountered: