[Java] Investigate potential performance improvement of compression codec #388
Comments
Bob Tinsman / @bobtins: I am happy to help by profiling code. @emkornfield mentioned airlift as being Java based but still fast, so I checked it out. Its core code uses off-heap access, which could explain its speed. For example, check out the core decompressor code: https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/lz4/Lz4RawDecompressor.java This is similar to Arrow's vector implementations, which allocate an off-heap chunk of memory and then use Unsafe methods to access it.
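As a rough illustration of that off-heap pattern, here is a minimal, self-contained sketch using sun.misc.Unsafe (the class is JDK-internal, so it has to be obtained reflectively; the class name and values below are illustrative, not taken from aircompressor or Arrow):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapReadExample {
    private static final Unsafe UNSAFE;

    static {
        try {
            // sun.misc.Unsafe has no public constructor; fetch the singleton reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // Allocate 16 bytes of off-heap memory, then write and read a long directly.
        long address = UNSAFE.allocateMemory(16);
        try {
            UNSAFE.putLong(address, 0x0123456789ABCDEFL);
            long value = UNSAFE.getLong(address);
            System.out.printf("read 0x%X from off-heap memory%n", value);
        } finally {
            UNSAFE.freeMemory(address); // off-heap memory is not garbage collected
        }
    }
}
```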
Micah Kornfield / @emkornfield:
[~benjamin.wilhelm@knime.com] For this specific Jira, a starting place might be to check in the benchmarks that you made for LZ4 compression. I think there are a few follow-up issues (not asking anyone in particular to work on them; a sketch of item 3 follows this list):

1. Provide JNI bindings that support framed compression.
2. Provide a performant pure-Java decompression for those who don't want to use JNI.
3. Use the existing LZ4 Java bindings for compression.
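For item 3, a minimal sketch of what using the existing lz4-java bindings could look like (this exercises the LZ4 block format, not the frame format from item 1; the data array is a stand-in, not Arrow code):

```java
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4JavaExample {
    public static void main(String[] args) {
        LZ4Factory factory = LZ4Factory.fastestInstance();

        byte[] data = new byte[4096]; // stand-in for an Arrow buffer's contents

        // Compress into a freshly sized array (LZ4 block format).
        LZ4Compressor compressor = factory.fastCompressor();
        byte[] compressed = compressor.compress(data);

        // Block-format decompression needs the original length up front.
        LZ4FastDecompressor decompressor = factory.fastDecompressor();
        byte[] restored = decompressor.decompress(compressed, data.length);

        System.out.printf("original=%d compressed=%d restored=%d%n",
                data.length, compressed.length, restored.length);
    }
}
```

Note that `fastestInstance()` transparently picks a JNI-backed implementation when the native library is available and falls back to pure Java otherwise, which also touches on item 2.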
Samuel Audet: Now, the C++ API doesn't always map very elegantly to Java, but it is tons faster and exposes a lot more functionality. This would be a discussion for another thread, but if the Java API of Arrow were based on JavaCPP, it would let users fall back easily on that API instead of forcing them to start hacking stuff in JNI. Case in point, I would be happy to maintain those presets as part of the Arrow project, just as I'm currently doing in the case of TensorFlow for Java: https://github.com/tensorflow/java/search?q=javacpp Previous discussions with people from Apache Arrow didn't elicit much interest, but in time the need for a tool like Cython in Java will become obvious to all, and JavaCPP already provides that!
Benjamin Wilhelm: However, a significant problem with the Java API was (and is) the missing fast compression using LZ4. The JavaCPP project was the easiest and fastest way to get a very fast LZ4 API for Java (supporting frame compression, as needed). I already implemented […]. Seeing where JavaCPP is used, I think it is a viable project. I could contribute my […].
Samuel Audet: The Python API of Arrow isn't just automatically generated wrappers around the C++ API using Cython, right? It's the same for Java. We can use tools like Cython to make the lives of Python developers easier, so why not do the same for Java developers? We were able to cut the wrapping code in half by rebasing the Java API of TensorFlow on JavaCPP, and performance increased to boot. We could do the same for Arrow!

I'm also publishing builds for Apache TVM, but again, not getting much traction. If you have some ideas as to why most engineers are OK using Cython in the case of Python, but not the equivalent in the case of Java, I would be very much interested in hearing your opinions.
Samuel Audet: As for the non-ByteBuffer API, what you are looking for are the overloads taking Pointer, which is just a fancy wrapper around a long value.
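For context, a minimal sketch of that idea (assuming the org.bytedeco:javacpp artifact is on the classpath; the buffer size and values are arbitrary): a JavaCPP BytePointer owns a native allocation and exposes its raw address as a long.

```java
import org.bytedeco.javacpp.BytePointer;

public class PointerExample {
    public static void main(String[] args) {
        // BytePointer allocates native (off-heap) memory and wraps its address.
        try (BytePointer ptr = new BytePointer(16)) {
            ptr.put(0, (byte) 42);        // write through the native pointer
            long address = ptr.address(); // the long value the Pointer wraps
            System.out.println("byte at 0x" + Long.toHexString(address)
                    + " = " + ptr.get(0));
        } // close() deallocates the native memory
    }
}
```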
In response to the discussion in https://github.com/apache/arrow/pull/8949/files#r588046787
There are some performance penalties in the implementation of the compression codecs (e.g., data copying between heap and off-heap memory). We need to revise the code to improve performance.
We should also provide benchmarks to validate that the performance actually improves.
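To make the copy penalty concrete, here is a minimal, self-contained sketch (using lz4-java and plain NIO buffers as stand-ins for Arrow's codec and buffers; all names are illustrative, not the actual codec code). The slow path round-trips the data through a heap array; the direct path compresses off-heap to off-heap:

```java
import java.nio.ByteBuffer;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public class CodecCopyExample {
    public static void main(String[] args) {
        LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();

        // Stand-in for an Arrow buffer: the data already lives off-heap.
        ByteBuffer source = ByteBuffer.allocateDirect(1 << 20);

        // Slow path: copy the off-heap data onto the heap, then compress there.
        byte[] heapCopy = new byte[source.remaining()];
        source.duplicate().get(heapCopy);                       // off-heap -> heap copy
        byte[] compressedOnHeap = compressor.compress(heapCopy);

        // Direct path: compress between off-heap buffers, no intermediate copy.
        ByteBuffer target = ByteBuffer.allocateDirect(
                compressor.maxCompressedLength(source.remaining()));
        compressor.compress(source.duplicate(), target);
        target.flip();

        System.out.printf("heap path: %d bytes, direct path: %d bytes%n",
                compressedOnHeap.length, target.remaining());
    }
}
```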
Reporter: Liya Fan / @liyafan82
Note: This issue was originally created as ARROW-11901. Please see the migration documentation for further details.