Update Kryo to 3.0.1 #230
Conversation
Given this is for Java 8, can you configure Travis in this PR so it's tested against Java 8? -- http://docs.travis-ci.com/user/languages/java/
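A minimal sketch of the kind of Travis matrix being asked for (the Scala version and JDK identifiers here are illustrative assumptions, not this repo's actual config):

```yaml
# .travis.yml sketch: build against both JDK 7 and JDK 8
language: scala
scala:
  - 2.11.7
jdk:
  - oraclejdk7
  - oraclejdk8
script: sbt +test
```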
@@ -47,7 +47,7 @@ public void serialize(Object o) throws IOException {
     try {
       st.writeObject(o);
       // Copy from buffer to output stream.
-      outputStream.writeInt(st.numOfWrittenBytes());
+      outputStream.writeLong(st.numOfWrittenBytes());
this is going to write 4 extra bytes for each record, right? Even 4 bytes is more than we need, as we don't expect 4 GB records.
Why not cast this to an int here and save the data (consider serializing Unit with this, which will now take twice as much data as it did before).
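The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration, not chill's actual API: keep the 4-byte length prefix by bounds-checking and casting the long byte count down to an int, instead of widening every record's prefix to 8 bytes.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Illustrative helper (names assumed, not from chill): write a 4-byte
// length prefix, failing loudly if a record ever exceeds Int.MaxValue bytes.
def writeLengthPrefix(out: DataOutputStream, numWrittenBytes: Long): Unit = {
  require(numWrittenBytes >= 0 && numWrittenBytes <= Int.MaxValue,
    s"record length out of int range: $numWrittenBytes")
  out.writeInt(numWrittenBytes.toInt)
}

val buf = new ByteArrayOutputStream()
val out = new DataOutputStream(buf)
writeLengthPrefix(out, 42L)
out.flush()
println(buf.size) // 4-byte prefix, where writeLong would have produced 8
```

The design trade-off is exactly the one raised above: `writeLong` is simpler but doubles the per-record overhead for small payloads such as a serialized `Unit`.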
If this is a legit use case, should probably move this to a varInt encoding, would probably be a win over our usual int in some cases.
This makes me a bit nervous unfortunately. Maybe someone should help me through my anxiety. Here is the issue: storm, spark, scalding and heron all put Kryo on the classpath. Upgrading here means there is a real possibility for diamonds and lots of binary pain if you have two of them in scope at the same time. The easiest way (for us) for that to happen is via summingbird, since that will have Kryo from chill and from storm/heron. How do we get out of this mess?
 * <p>Signed values are further encoded using so-called zig-zag encoding
 * in order to make them "compatible" with variable-length encoding.</p>
 */
final class Varint {
we need tests for this. This is super testable with scalacheck.
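For reference, here is a minimal self-contained sketch of zig-zag plus variable-length int encoding (independent of the actual `Varint` class under review), ending with the kind of round-trip property a scalacheck test would generate inputs for:

```scala
// Zig-zag maps small-magnitude signed ints to small unsigned ints
// (-1 -> 1, 1 -> 2, -2 -> 3, ...) so they varint-encode in few bytes.
def zigZag(n: Int): Int = (n << 1) ^ (n >> 31)
def unZigZag(n: Int): Int = (n >>> 1) ^ -(n & 1)

// Varint: 7 payload bits per byte; the high bit marks "more bytes follow".
def varintBytes(value: Int): Vector[Byte] = {
  var v = value
  val out = Vector.newBuilder[Byte]
  while ((v & ~0x7f) != 0) {
    out += ((v & 0x7f) | 0x80).toByte
    v = v >>> 7
  }
  out += v.toByte
  out.result()
}

// The round-trip property: decode(encode(n)) == n for all ints.
val samples = Seq(0, 1, -1, 63, -64, 300, Int.MaxValue, Int.MinValue)
samples.foreach(n => assert(unZigZag(zigZag(n)) == n))
println(varintBytes(zigZag(-1)).length) // 1 byte instead of a fixed 4
```

With scalacheck this would become `forAll { n: Int => unZigZag(zigZag(n)) == n }`, plus a property that small-magnitude values encode to a single byte.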
Kryo is one of the JAR version trouble spots. To deal with the 2.21 & 2.22 import/link incompatibilities, Spark 1.5 has to ship a custom version of Hive (downgraded against 2.21); ideally Hive (unshaded) and Spark would use the same release, but that would only work if chill used the same version (2.21+). Before a big switch to 3.x, would a smaller upgrade to 2.22 be possible? Maybe even 2.24, with some coordination across Hive and Spark for a consistent version.
The APIs alas changed enough between those versions that the binary incompatibility problems would drive lots of pain everywhere. Going direct to 3 seems better to me than repeating the process multiple times (given the level of coordination pain this will drive). I actually asked in June about some coordination on the spark dev list but got no response there. There is also the likes of Storm too. Unfortunately this bleeds everywhere pretty quickly.
I'm trying to get Spark to upgrade their Kryo to 3. It seems there's a recent issue in Hive too. Would it be better if I try to create a Spark patch for 3.0.1 or 3.0.3?
It seems there was an issue that upgrading Kryo in Spark would have solved, but the upgrade was stopped because of Chill. It would be great if we can coordinate activities.
@ozawa-hi, we really need to get Hive & Spark in sync; it's the sole reason there's a mutant hive-exec JAR for Spark. If Hive and Chill can agree on a version, and it doesn't cause regressions in Spark, then it would sync things up nicely.
I had asked on the spark dev list a few months ago for a point person on that side to coordinate an upgrade with... but got no response. Scalding/Summingbird need to synchronize on the big upgrade too, but I can drive most of that. Do we have a point person that can vet and drive any spark upgrade volunteer?
...I'm not a person who can help with Spark updates, but I could work to keep the Hive code in sync (they've generally been on the cutting edge of Kryo issues).
I'm not a Spark committer, but I am strongly willing to put in all the effort to get this upgrade, because I desperately need it.
I dunno, I've not said no to any of the main spark committers that I'm aware of. We've generally moved slowly due to the wide web of dependencies around this, especially as the version of Kryo here now is right before some major binary version breakage (the versions after it, a long time ago now, had some bugs, so we had to go back). But if someone wants to take over this PR (it's got conflicts now) and bring it up to the latest released version of Kryo, I think we can get the ball rolling and try to get an updated release of chill out soon.
One question: how much coordinated updating is needed? If a version of chill ships with Kryo 3, Spark doesn't have to immediately update until they want to. Same for Hive. One thing that will be trouble is if the Kryo APIs change significantly and Hive moves to Kryo 3.x, as then the ugly hacks needed to keep Spark and Hive consistent (a rebuilt Hive with a different Kryo version, a re-group-ID'd JAR without shading) won't work: upgrading is the only option. That would be an issue independent of Chill; a new version of chill would simply make it possible to get Hive and Spark in sync.
FWIW, I have a branch of Chill using Kryo 2.22; this is the version Hive updated to for bug fixes, and they didn't find any problems worth rolling back for: https://github.com/steveloughran/chill/tree/feature/support-kryo-2.22
So if the versions are not binary compatible, then code cannot depend on the new chill. Or, say, versions of algebird serializers can become a problem. And for us, since we have a monorepo, taking one version of Kryo via our chill dep, another via our Spark deps, and another via our scalding ones is a mess, alas.
So, it may be better to upgrade Kryo in Chill first, then Hive, then Spark.
Hive will update on its own schedule, with its own needs; it looks like Hive 2 has the update done. In Hive 1.2.1 it is already ahead of Chill & Spark: for Spark 1.5.1 we had to patch it to downgrade (the POM file and a couple of source files with changed ...). It's Spark that needs to pick up a changed Kryo via chill; then the resyncing process can begin again. I've just pushed up patch #244, which patches chill to Kryo 2.22 instead; that makes it consistent with Hive's current release. This is a lower-risk update, which means it's more likely to be picked up by Spark in the 1.x line after a chill update; it's clear that a Kryo 3.x update won't happen until Spark 2. (As a colleague who works on Hive once said to me, "Kryo is the guava of serialization".)
Upgrading just to support Hive isn't a big win for most people, and a huge cost in effort. We should just be trying to get onto the latest stable Kryo. We've tried a Kryo 2.22 update; it was lots and lots of pain for little benefit. There was a chill release with it, long since reverted.
I didn't know about that update; I will cancel the pull req. If even a minor release hurts that much, you may as well go to the latest version you are happy with, skipping the intermediate ones.
We've tried Kryo 2.22 but it really didn't solve the issues. We're currently using Kryo 3.0 in our projects and it seems to be usable. Java is the default serdes for Spark, and I don't think too many people are switching to Kryo as it is; we've found 2.22 to be insufficient in resolving the issues and not worth the pain to maintain. Kryo 3.x won't happen before Spark 2.x, but as you've already mentioned, it's a lot of work and it's probably better to start early. I'm not just suggesting an upgrade because there's a newer version out there; I really need to use Kryo because we need performance, and Kryo 2.x just isn't usable.
This updates Kryo to the 3.0.1 shaded jar and resolves the deprecations this upgrade causes. Specifically:

- `org.esotericsoftware.kryo` to `org.esotericsoftware`
- `output.total()` / `input.total()` now return `long` values instead of `int`
- `ObjectInstantiator` is now generic, in order to preserve type information. To retain the original library behavior, `ObjectInstantiator[_]` and `ObjectInstantiator[AnyRef]` were used.

All tests are passing as far as I can tell. (Is there any other manner of testing other than running `sbt +test`?) And, assuming the tests are somewhat exhaustive, this should also allow Java 8 compatibility, which can be tested with the following patch:

This excludes the requisite JavaDoc updates that are required to move to Java 8.
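To illustrate the `ObjectInstantiator` point above, here is a self-contained sketch using a stand-in trait (not the real Kryo/objenesis classes) of why the existential `ObjectInstantiator[_]` and `ObjectInstantiator[AnyRef]` forms recover the old untyped behavior:

```scala
// Stand-in for the now-generic interface; the real one is used by Kryo
// internally and carries a type parameter as of this upgrade.
trait ObjectInstantiator[T] { def newInstance(): T }

// Pre-upgrade call sites carried no type parameter; with the generic
// interface, ObjectInstantiator[AnyRef] recovers that untyped shape.
val anyRefInst: ObjectInstantiator[AnyRef] = new ObjectInstantiator[AnyRef] {
  def newInstance(): AnyRef = new java.util.ArrayList[String]()
}

// Storing instantiators of unknown element type uses the existential form.
val registry: Map[Class[_], ObjectInstantiator[_]] =
  Map(classOf[java.util.ArrayList[_]] -> anyRefInst)

println(anyRefInst.newInstance().getClass.getSimpleName)
```

The existential keeps a heterogeneous registry compilable, while `AnyRef` mirrors the raw-typed usage the 2.x code relied on.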