PARQUET-2432: Use ByteBufferAllocator over hardcoded heap allocation #1278

Merged
merged 4 commits into from
Feb 23, 2024

gszadovszky (Contributor) commented Feb 21, 2024

  • Update the BytesInput implementations to rely on a ByteBufferAllocator instance for allocating and releasing ByteBuffer objects.
  • Extend the use of ByteBufferAllocator to replace hardcoded heap allocation (e.g. byte[], ByteBuffer.allocate, etc.).
  • parquet-cli related code, including ParquetRewriter and tests, is not changed in this effort.
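
To illustrate the idea behind the first two points, here is a minimal sketch of replacing a hardcoded heap allocation with an injected allocator. `SketchAllocator` and the other names below are hypothetical stand-ins that only loosely mirror the shape of Parquet's `ByteBufferAllocator`; this is not the code changed in this PR.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for an allocator abstraction (assumption, not the Parquet interface itself).
interface SketchAllocator {
    ByteBuffer allocate(int size);

    void release(ByteBuffer buffer);

    boolean isDirect();
}

// Heap-backed variant: equivalent to the previously hardcoded ByteBuffer.allocate.
final class HeapSketchAllocator implements SketchAllocator {
    public ByteBuffer allocate(int size) {
        return ByteBuffer.allocate(size);
    }

    public void release(ByteBuffer buffer) {
        // Nothing to do: heap buffers are reclaimed by the garbage collector.
    }

    public boolean isDirect() {
        return false;
    }
}

// Off-heap variant: the caller decides where the memory lives.
final class DirectSketchAllocator implements SketchAllocator {
    public ByteBuffer allocate(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    public void release(ByteBuffer buffer) {
        // A pooling implementation could return the buffer to its pool here.
    }

    public boolean isDirect() {
        return true;
    }
}

final class AllocatorSketch {
    // Instead of `new byte[source.length]`, the buffer comes from the supplied allocator.
    static ByteBuffer copyOf(byte[] source, SketchAllocator allocator) {
        ByteBuffer buffer = allocator.allocate(source.length);
        buffer.put(source);
        buffer.flip(); // make the copied bytes readable from position 0
        return buffer;
    }
}
```

With this shape, the same `copyOf` call works for heap and direct memory, and whoever allocated the buffer is also responsible for calling `release` on the same allocator.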

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@gszadovszky
Contributor Author

@wgtmac, if you have some time, could you check this out?

@wgtmac
Member

wgtmac commented Feb 21, 2024

Sure, I will take a look by the end of this week.

wgtmac (Member) left a comment

I didn't check the tests thoroughly, but overall this LGTM.

@@ -207,10 +211,18 @@ public static BytesInput copy(BytesInput bytesInput) throws IOException {
*/
public abstract void writeAllTo(OutputStream out) throws IOException;

/**
* For internal use only. It is expected that the buffer is large enough to fit the content of this {@link BytesInput}
Member

Should we add a comment describing what to expect if the content does not fit into the ByteBuffer?
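
For context on this question, the sketch below shows what plain `java.nio` does when more bytes are written than the target buffer can hold. It only demonstrates standard `ByteBuffer.put` behaviour; `OverflowSketch` and `tryWrite` are hypothetical names, and the method under review may behave differently.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

final class OverflowSketch {
    // Returns true if the content was written, false if the target was too small.
    static boolean tryWrite(byte[] content, ByteBuffer target) {
        try {
            // Bulk put throws BufferOverflowException when content.length > target.remaining();
            // in that case no bytes are transferred.
            target.put(content);
            return true;
        } catch (BufferOverflowException e) {
            return false;
        }
    }
}
```

A Javadoc note along the lines of "throws BufferOverflowException if the buffer is too small" would be one way to document such a contract, assuming the implementation relies on `put` directly.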

* @return a text representation of the memory usage of this structure
*/
public String memUsageString(String prefix) {
return format("%s %s %d slabs, %,d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);
Member

Suggested change
- return format("%s %s %d slabs, %,d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);
+ return format("%s %s %d slabs, %d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);

Contributor Author

I've just copy-pasted this from ConcatenatingByteArrayCollector, but it seems to be intentional. %,d adds grouping separators to the formatted value (e.g. 123,456,789).
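
The difference between the two format specifiers can be checked with a small standalone example. `GroupingFormatSketch` is a hypothetical helper, and the locale is pinned to `Locale.US` here because the grouping character is locale-dependent.

```java
import java.util.Locale;

final class GroupingFormatSketch {
    // The ',' flag inserts locale-specific grouping separators.
    static String grouped(long bytes) {
        return String.format(Locale.US, "%,d bytes", bytes);
    }

    // Without the flag, the digits are printed as-is.
    static String plain(long bytes) {
        return String.format(Locale.US, "%d bytes", bytes);
    }
}
```

For example, `grouped(123456789L)` yields `123,456,789 bytes`, while `plain(123456789L)` yields `123456789 bytes`.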

import java.nio.ByteBuffer;

/**
* A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuse it at the
Member

Suggested change
- * A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuse it at the
+ * A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuses it at the
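
As a rough illustration of the "keep one buffer and reuse it" idea in this Javadoc, here is a sketch under stated assumptions: the cached buffer is reused when it is large enough, it grows otherwise, and only one buffer may be outstanding at a time. `SingleBufferReuseSketch` is a hypothetical class; the actual implementation in the PR may follow different rules.

```java
import java.nio.ByteBuffer;

final class SingleBufferReuseSketch {
    private ByteBuffer cached; // the single buffer kept between calls
    private boolean inUse;     // true while a caller holds the buffer

    ByteBuffer allocate(int size) {
        if (inUse) {
            throw new IllegalStateException("The previous buffer has not been released");
        }
        if (cached == null || cached.capacity() < size) {
            // Assumption: grow by replacing the cached buffer with a larger one.
            cached = ByteBuffer.allocate(size);
        }
        cached.clear();     // position = 0, limit = capacity
        cached.limit(size); // expose exactly the requested number of bytes
        inUse = true;
        return cached;
    }

    void release(ByteBuffer buffer) {
        if (buffer != cached) {
            throw new IllegalArgumentException("Buffer was not allocated by this instance");
        }
        inUse = false; // keep the buffer so the next allocate call can reuse it
    }
}
```

The trade-off is fewer allocations in exchange for a strict allocate/release discipline on the caller's side.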

this.allocator = allocator;
this.toRelease = toRelease;
void setReleaser(ByteBufferReleaser releaser) {
this.releaser = releaser;
Member

Should we check if the passed releaser is null?

Contributor Author

This is internal (both the method and the class are package private). I wouldn't do additional checks.
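
For reference, the check discussed in this thread would be a one-liner if it were wanted. The sketch below uses `Runnable` as a stand-in for the releaser type, and `ReleaserHolderSketch` is a hypothetical class; it is not code from this PR.

```java
import java.util.Objects;

final class ReleaserHolderSketch {
    private Runnable releaser; // stand-in for the releaser type (assumption)

    void setReleaser(Runnable releaser) {
        // Fails fast with a NullPointerException carrying the given message.
        this.releaser = Objects.requireNonNull(releaser, "releaser must not be null");
    }

    boolean hasReleaser() {
        return releaser != null;
    }
}
```

For package-private code with a handful of call sites, skipping the check, as argued above, is a reasonable choice; the failure would just surface later, at first use.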

int footerSignatureLength = AesCipher.NONCE_LENGTH + AesCipher.GCM_TAG_LENGTH;
byte[] serializedFooter = new byte[combinedFooterLength - footerSignatureLength];
System.arraycopy(footerAndSignature, 0, serializedFooter, 0, serializedFooter.length);
// Resetting to the beginning of the footer
from.reset();
Member

Should we check from.markSupported() before calling reset() and mark()?
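
A sketch of what such a guard could look like with plain `java.io` streams follows. `MarkResetSketch`, `peek`, and `ensureMarkable` are hypothetical names, and wrapping in `BufferedInputStream` is one possible fallback rather than what the PR does. Checked `IOException`s are rethrown as `UncheckedIOException` only to keep the example compact.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class MarkResetSketch {
    // Reads up to `length` bytes and rewinds, so the caller can read them again.
    static byte[] peek(InputStream in, int length) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream does not support mark/reset");
        }
        try {
            in.mark(length); // remember the current position
            byte[] head = new byte[length];
            int read = 0;
            while (read < length) {
                int n = in.read(head, read, length - read);
                if (n < 0) {
                    break; // end of stream reached early
                }
                read += n;
            }
            in.reset(); // back to the marked position
            return Arrays.copyOf(head, read);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wraps streams that cannot rewind in a BufferedInputStream, which can.
    static InputStream ensureMarkable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }
}
```

Without the guard, calling `reset()` on a stream that does not support it throws an `IOException` in the default `InputStream` implementation, so the explicit check mainly improves the error message.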

allocator);
}

@Deprecated
Member

The argument list grows longer now. Should we use an options class instead to avoid frequent deprecation?

Contributor Author

On the one hand, I completely agree. On the other hand, ParquetFileWriter should be an internal class. It is unfortunate that it is public. I would not create yet another parameters builder for ParquetFileWriter.
I'll think about a solution somewhere in between.
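
To make the trade-off in this thread concrete, here is a minimal sketch of the options-class approach the reviewer suggested. Every name and default value in `WriterOptionsSketch` is hypothetical and chosen for illustration only; none of it is part of ParquetFileWriter.

```java
final class WriterOptionsSketch {
    final long rowGroupSize;
    final int maxPaddingSize;
    final boolean useDirectBuffers;

    private WriterOptionsSketch(Builder builder) {
        this.rowGroupSize = builder.rowGroupSize;
        this.maxPaddingSize = builder.maxPaddingSize;
        this.useDirectBuffers = builder.useDirectBuffers;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Arbitrary illustrative defaults (assumptions).
        private long rowGroupSize = 128L * 1024 * 1024;
        private int maxPaddingSize = 8 * 1024 * 1024;
        private boolean useDirectBuffers = false;

        Builder withRowGroupSize(long rowGroupSize) {
            this.rowGroupSize = rowGroupSize;
            return this;
        }

        Builder withMaxPaddingSize(int maxPaddingSize) {
            this.maxPaddingSize = maxPaddingSize;
            return this;
        }

        Builder withUseDirectBuffers(boolean useDirectBuffers) {
            this.useDirectBuffers = useDirectBuffers;
            return this;
        }

        WriterOptionsSketch build() {
            return new WriterOptionsSketch(this);
        }
    }
}
```

A constructor taking such an object can gain new settings without a new overload and a new `@Deprecated` annotation each time, at the cost of one more public type to maintain, which is the concern raised in the reply above.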

@gszadovszky
Contributor Author

Thank you, @wgtmac
