Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: ensure Iceberg layouts own the SeekableChannelsProvider #6371

Merged
merged 4 commits into from
Nov 15, 2024

Conversation

devinrsmith
Copy link
Member

@devinrsmith devinrsmith commented Nov 12, 2024

This greatly improves the efficiency of Iceberg reading. Previously, it was creating a SeekableChannelsProvider per URI, and now only one is created once per layout (/ Table).

To aid in this update, objects that were previously created on-demand in IcebergBaseLayout are now created once upon construction. To enable this, it was noted that only the URI scheme is relevant for discrimination, and not actually the full URI to the data files. Thus, we can use the URI scheme as provided via org.apache.iceberg.Table#location to do any up-front loading.

The various interfaces that take a URI have been update to take a URI scheme instead. While this change could technically have been made in a non-breaking fashion by delegating existing URI methods to URI scheme methods, the existence of the URI methods encourages the wrong mental model and is easy to misuse, so they have been removed.

One of the ParquetTableLocationKey constructors has been deprecated, marked for removal. A more appropriate constructor has been added.

BREAKING CHANGE:

  • SeekableChannelsProviderLoader.fromServiceLoader has been removed, replaced with SeekableChannelsProviderLoader.load.
  • DataInstructionsProviderLoader.fromServiceLoader has been removed, replaced with DataInstructionsProviderLoader.load.
  • SeekableChannelsProviderPlugin methods have been changed, now use a String for the URI scheme instead of a URI.
  • DataInstructionsProviderPlugin.createInstructions method has been changed, now uses a String for the URI scheme instead of a URI.
  • IcebergTableParquetLocationKey has added a new SeekableChannelsProvider parameter.

@devinrsmith devinrsmith added this to the Triage milestone Nov 12, 2024
@devinrsmith devinrsmith self-assigned this Nov 12, 2024
Copy link
Contributor

@malhotrashivam malhotrashivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the change, we should get this in before the release.
Minor comments.

@@ -54,4 +54,23 @@ public SeekableChannelsProvider fromServiceLoader(@NotNull final URI uri,
}
throw new UnsupportedOperationException("No plugin found for uri: " + uri);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove the fromServiceLoader(uri, specialInstructions) path now, especially since we have a breaking release coming up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because one part of a release is breaking does not mean that we should feel free to break anything; there should be some level of nuance involved involved in each breaking change.

I did a quick scan of DHE usages; they have one test usage of fromServiceLoader afaict.

@devinrsmith devinrsmith changed the title fix: ensure Iceberg layouts own the SeekableChannelsProvider fix!: ensure Iceberg layouts own the SeekableChannelsProvider Nov 13, 2024
@devinrsmith devinrsmith modified the milestones: Triage, 0.37.0 Nov 13, 2024
malhotrashivam
malhotrashivam previously approved these changes Nov 13, 2024
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changeset is fine, modulo incredibly minor quibbles. I guess the real question is, what do we give up? It seems to me like having access to the full URI is potentially more powerful, and that we're preventing "hybrid" Iceberg catalogs with more than one URI scheme in use (which is (1) something I kind of want to build, and (2) not something that would totally surprise me in the wild).

@malhotrashivam
Copy link
Contributor

malhotrashivam commented Nov 14, 2024

Changeset is fine, modulo incredibly minor quibbles. I guess the real question is, what do we give up? It seems to me like having access to the full URI is potentially more powerful, and that we're preventing "hybrid" Iceberg catalogs with more than one URI scheme in use (which is (1) something I kind of want to build, and (2) not something that would totally surprise me in the wild).

Good catch regarding URIs with different schemes, we can either keep a set of channel providers in the Base Layout class or, for now, add an assertion that all the location keys have same URI scheme as the table location.

@devinrsmith
Copy link
Member Author

I've added an explicit URI check; it's a good point. That said, I'm almost certain that we don't support multiple URI schemes today given the pain points of configuration (not necessarily speaking of DH).

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@devinrsmith devinrsmith enabled auto-merge (squash) November 15, 2024 06:22
@devinrsmith devinrsmith merged commit 38a3aa8 into deephaven:main Nov 15, 2024
17 checks passed
@devinrsmith devinrsmith deleted the iceberg-base-layout branch November 15, 2024 08:35
@github-actions github-actions bot locked and limited conversation to collaborators Nov 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants