-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length #18642
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dominik Zalewski.
|
1 similar comment
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dominik Zalewski.
|
@@ -454,7 +454,8 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin( | |||
ImmutableMap.Builder<JdbcColumnHandle, JdbcColumnHandle> newLeftColumnsBuilder = ImmutableMap.builder(); | |||
for (JdbcColumnHandle column : jdbcClient.getColumns(session, leftHandle)) { | |||
newLeftColumnsBuilder.put(column, JdbcColumnHandle.builderFrom(column) | |||
.setColumnName(column.getColumnName() + "_" + nextSyntheticColumnId) | |||
.setColumnName(column.getOriginalColumnName() + "_" + session.getAlwaysIncreasingNumber()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure whether we can re-use the nextSyntheticColumnId, as it seems to be using the max() function. If anyone knows whether 'nextSyntheticColumnId' has the property that it always increases monothonicaly, we could reuse it. However, keep in mind that the same change would need to be applied for 'aggregation' (here the change is only for 'join'), so I'd be better to come up with a generic solution.
columnName, | ||
jdbcTypeHandle, | ||
columnType, | ||
nullable, | ||
comment); | ||
handle.originalColumnName = originalColumnName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originalColumnName
is not added as it should be as final
field as this creates a waterfall of changes in other places in the code that don't use the builder, but a direct constructor call. Since this solution proposal might be dumped and started from scratch after the review, I didn't want to invest that time yet.
Map.of(columns.get(2), "name1"), | ||
Map.of(columns.get(3), "name2")); | ||
Map.of(columns.get(2), columns.get(2)),//TODO:"name1"), | ||
Map.of(columns.get(3), columns.get(3))); //TODO:"name2")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be fixed if the 'general idea' is approved by this review.
79cdb09
to
9526323
Compare
047f181
to
b1a2d84
Compare
3c0bafc
to
aa84720
Compare
|
||
import io.trino.spi.connector.ConnectorSession; | ||
|
||
public interface ColumnWithAliasFormatter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This interface has one implementation. Drop it as it's not needed. Default implementation should get OptionalInt maximumLength
and act on it accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping interface and leaving only a single class. The interface has been introduced only due to the fact that on the previous round of review, that on your request was done via videocall, you specifically requested to break that PR into multiple commits, where one is derived logically from the other. As a result I introduced a step-by-step 4 commit process as if I was guiding you through a refactor session. The natural thing was to introduce an interface first and then swap the implementation later (see commits history to understand better).
It also felt awkward to me that in the end we are left off with only a single implementation of an interface. Maybe the result of that is because there was a miscommunication between two of us WRT how this should be done. I'd appreciate your feedback (if possible more elaborate) on this, so that I can understand better what went wrong here. This will allow me to avoid doing unnecessary work in the future.
With the OptionalInt parameter, I'm withholding this change for now, as I still don't know how to code against that in the integration test, so that change is coupled with the other bulky work I need to do anyway.
implements ColumnWithAliasFormatter | ||
{ | ||
@Override | ||
public JdbcColumnHandle format(JdbcColumnHandle column, int nextSyntheticColumnId, ConnectorSession session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we put ConnectorSession first. Rearrange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I did check the number of occurrences:
Which shows that ConnectorSession is only used as a NON-first parameter in less than 5% of cases.
Can you let me know how does it improve readability or in general management of the source code in trino if I align with 95% of cases? My concern is that the remaining 5% of the cases will not disappear, so the 0.3% of cases my change is throwing in (or not, if I make a change like you requested), seems like a marginal value.
Having said that, I can make that change, but the point of this exercise is unclear to me. It would be clearer if you requested that I change all the other occurrences by moving the ConnectorSession to the first param and possibly write a CI/CD plugin that would enforce that.
Can you justify/back up your request with numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We default to the majority case. There is no science/numbers/rules behind it - just a preference to keep it consistent if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keeping consistency has benefits. e.g. avoids easy mistakes when refactoring to use delegate and argument order is switched.
It's not 100% consistent only because there's no easy tools to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hashhar, @wendigo - thank you for responding. You might find my questions awkward, that is because I'm trying to learn by example, and given the fact that I asked for a specific rule/document/number that would back up your request, and I did not receive it, that learning will be harder than I expected.
The reason why I'm digging on this one (even though this discussion seems pointless on the first sight), is it looks to me like I have a sample of TWO examples and it seems like one contradicts the other. I know this is 'statistically unstable' as the sample is too small, but still I need to somehow learn/make progress.
So the other counter-example is in this PR: #18718 where both of you commented and it seems like it's the other way round. That is, a change was proposed to standardise and use a builder instead of a constructor call, but that comment:
"LGTM % how do we prevent new usages of the constructor from appearing?"
by @hashhar, is what I believe stopped the PR from being merged, even though it had a green build and an approval, what I believed were the only things necessary for a merge to master to occur.
So with that long comment I wanted to say that I do not understand why:
- in this PR you are asking to 'standardise' even though there is no change that this will ever be standardised
- in the other PR I did try to standardise, and you left me in vain with the PR claiming that it could never be standardised
I appreciate your patience on reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this PR it's just about "code-style".
the other PR will affect correctness since if a caller doesn't use the builder then the length limit value won't be set hence there's correctness which comes before style and consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this PR you are asking to 'standardise' even though there is no change that this will ever be standardised
Also because it's hard to ensure things are always standardized doesn't mean we should actively avoid standardizing where possible or when we notice it.
You can also submit a follow-up PR after this which standardises the other occurrences as well. It'll be a welcome change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I'm going to have to be the one that parks this discussion and does the change despite the fact that it does not look like there is a right-wrong kind of approach you both could defend.
I think we came pretty far from what we wanted to achieve and the result of carrying on is that:
a) @wendigo has even lost interest in commenting here
b) I seem to be making a conclusion that there is a rule that says:
- the maintainer is always right
- even if the maintainer is not right, look at pt 1.
which is probably exactly the opposite of why the review is being done in the first place.
Let me do that change and use your energy in the other place where I have a bigger problem, where I don't know how to pass a config value to a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be clearer if you requested that I change all the other occurrences by moving the ConnectorSession to the first param and possibly write a CI/CD plugin that would enforce that.
This is what we'd do in a perfect world. I don't think however that it can be done in a reasonable amount of time (i.e. Less time than it takes to send a manual PR for aligning argument order) from past experiences we have writing errorprone or modernizer checks.
implements ColumnWithAliasFormatter | ||
{ | ||
@Override | ||
public JdbcColumnHandle format(JdbcColumnHandle column, int nextSyntheticColumnId, ConnectorSession session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead two distinct implementations let's have one with the signature format(ConnectorSession session, JdbcColumnHandle column, OptionalInt maximumLength, int nextColumnId)
.
if maximumLength is empty return legacy value, otherwise: apply logic as in the new implementation.
{ | ||
this.jdbcClient = requireNonNull(jdbcClient, "jdbcClient is null"); | ||
this.precalculateStatisticsForPushdown = precalculateStatisticsForPushdown; | ||
this.jdbcQueryEventListeners = ImmutableSet.copyOf(requireNonNull(jdbcQueryEventListeners, "queryEventListeners is null")); | ||
this.aliasFormatter = aliasFormatter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.aliasFormatter = aliasFormatter; | |
this.aliasFormatter = requireNonNull(aliasFormatter, "aliasFormatter is null"); |
@@ -449,22 +455,19 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin( | |||
if (!leftHandle.getAuthorization().equals(rightHandle.getAuthorization())) { | |||
return Optional.empty(); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated change. revert
newRightColumnsBuilder.put(column, JdbcColumnHandle.builderFrom(column) | ||
.setColumnName(column.getColumnName() + "_" + nextSyntheticColumnId) | ||
.build()); | ||
newRightColumnsBuilder.put(column, aliasFormatter.format(column, nextSyntheticColumnId, session)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you should pass maximum identifier length as OptionalInt acquired from jdbcClient method with the default implementation:
newRightColumnsBuilder.put(column, aliasFormatter.format(session, column, jdbcClient.getMaximumColumnIdentifierLength(), nextSyntheticColumnId));
plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadataFactory.java
Outdated
Show resolved
Hide resolved
@@ -100,6 +102,11 @@ public int getDomainCompactionThreshold() | |||
return domainCompactionThreshold; | |||
} | |||
|
|||
public int getColumnAliasMaxChars() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed. This shouldn't be user configurable property but something that is exposed by the implementations extending JdbcClient. Move this information to the JdbcClient with the default implementation returning OptionalInt.empty()
(meaning no identifier length limit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendigo, I don't think you got to the clue of the problem. You have provided me with the test case example here:
https://github.com/trinodb/trino/pull/18858/files#diff-dcd58a1db0b844e3264313bafcbdd64e74bdfc8b6c864aa776fe361f9c13888e
The exact code from what you provided, I'm referring to is:
@Test
public void testJoinPushdownWithLongIdentifiers()
{
String maximumColumnIdentifier = "a".repeat(ORACLE_MAXIMUM_IDENTIFIER_LENGTH);
// Verify that the ORACLE_MAXIMUM_IDENTIFIER_LENGTH is correct for the Oracle version
// used in the test.
oracleServer.execute("SELECT 1 AS " + maximumColumnIdentifier + " FROM DUAL");
assertThatThrownBy(() -> {
oracleServer.execute("SELECT 1 AS " + maximumColumnIdentifier + "a FROM DUAL");
}).hasMessageContaining("ORA-00972: identifier is too long");
try (TestTable table = new TestTable(onRemoteDatabase(), "join_long_", "(%s decimal(19, 0))".formatted(maximumColumnIdentifier))) {
assertThat(query(joinPushdownEnabled(getSession()), """
SELECT c.custkey, o.%s, n.nationkey
FROM %s o JOIN customer c ON c.custkey = o.%s
JOIN nation n ON c.nationkey = n.nationkey""".formatted(maximumColumnIdentifier, table.getName(), maximumColumnIdentifier)))
.isFullyPushedDown();
}
}
This code is the equivalent of the test that I already have, which is in: TestOracleConnectorTest#testPushdownJoinWithLongNameSucceedsWhenConfiguredCorrectly
and is the 'happy path' testing.
I didn't need to introduce the config property nor the method getColumnAliasMaxChars
in order to code that test, ie. test the happy path. So whatever time you devoted to providing the code sample above, you could have used elsewhere in the project. Please don't take it personally. I'm only writing the above as on our meeting with @hubertsk you have said explicitly that you don't have time to leave detailed justification against the comments you are making in the PR. As your time as a technical leader is valuable, by saying the above, I wanted to give you clues where you can save your precious time. Maybe you wasted it because I somehow mislead you with what kind of help I needed. If that was the case, I apologize.
To go back to the technical part of this discussion. The real help that I'm asking for is how to code that test TestOracleConnectorTest#testPushdownJoinWithLongNameFailsWhenMaxAllowedCharsIsMisconfigured
(ie. a non-happy path) without adding the config property nor getColumnAliasMaxChars
method (like you requested). If you could provide the equivalent in the branch you have provided your code samples already, that would be awesome. I am otherwise currently stuck on exactly that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a need to provide a negative test here. As I understand positive test (happy path) is good enough here to provide the regression coverage. Notice that test should fail without product code changes you provide. We don't need to prove that we are able to break our code.
Can you please explain me what is the point of negative test here? What do I miss?
IsMisconfigured
The best way to avoid misconfiguration is disallow users to configure. Which was requested with #18642 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kokosing - thank you. Reverted the change with adding a config property and removed the negative test.
.split(column.getColumnName()) | ||
.iterator() | ||
.next(); | ||
String formatString = "%s_%0" + sequentialNumberLength + "d"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of a format you can just concatenate string with padded nextSyntheticColumnId as follows:
String columnName = originalColumnNameTruncated + "_" + padStart(Integer.toString(nextSyntheticColumnId), sequentialNumberLength, '0'));
padStart comes from Guava's Strings class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like any programming, you can do this in 1000 different ways. Can you explain better what is the advantage of using the approach you proposed? Is there a disadvantage of using what I proposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.format
generates a lot of garbage and is less performant so we avoid it on the hot paths (see: #15258 #15316 trinodb/tpch#34)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution that I've proposed is free of these problems and inlines nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for describing the original idea you had in your head. Now I understand that the reason you wanted me to change the String.format() to something else is due to (the hope of) performance optimization.
Today you have actually asked me about the onboarding doc (presumably irrelevant discussion in this context). So by accident I came back to the onboarding document, where I found this quote:
"It’s okay not to know everything and ask for basic stuff. We’ve all been there, so remember that there are no silly questions."
Under the umbrella of this encouragement to have child-like curiosity, let me carry on this discussion as there are some things that just don't seem to fit in my understanding.
You are writing that the change you are proposing stems from the fact that we want to avoid it on the hot paths. So this reminds me of my math teacher from my High School that once reminded me of the truth table for implication:
p | q | p → q
-- | -- | --
T | T | T
T | F | F
F | T | T
F | F | T
where the bottom two rows mean that if you start with a false assumption, you can deduce anything depending on your liking and it will always be true. So this got me to thinking. If your assumption that we are on a hot/critical path here is wrong (false), then you will always arrive at the conclusion that we should change it.
So I started reading the code at hand with the question in my head: are we really on a critical/hot path?
In the method call of the path the code in this PR is modifying, there are already three other places where we use String.format() explicitly:
- io.trino.plugin.jdbc.DefaultQueryBuilder#formatAssignments
- io.trino.plugin.jdbc.DefaultQueryBuilder#prepareJoinQuery
- io.trino.plugin.jdbc.DefaultQueryBuilder#formatJoinCondition
and we call formatAssignments twice. So this means there are already 4 calls in the product to String.format() running on production as we speak.
So this got me thinking. If it was a hot path, and the performance would be impacted by String.format(), wouldn't it already be reported by the customers?
I also did check who is the author of the code mentioned above. I was very surprised that it's the very same person who made the comment not to use String.format() in the first place.
There is also this famous quote from one of my favourite authors: Donald Erwin Knuth, The Art of Computer Programming:
“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”
Having said all of that, I just wanted to say that the reference you've provided that was suppose to justify your request, does not look like is applicable here. From all the analysis I have provided above, is there anything that you would not agree with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because existing code is imperfect doesn't mean future code should be too.
The good thing to do here would be to send follow-up PR which fixes the existing String.format usages.
And yes nobody is perfect, we're all learning - git blame
isn't intended to be used to blame someone. I'm sure I've written a lot of stupid and complex code as well - doesn't mean I don't think it should be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO "premature optimization is the root of all evil" doesn't mean that we shouldn't to the obvious things that we already know that we can do better.
Replying to your ownership
comment this class was created as a copy during: #10059. I'm not the sole author of this code but I can always improve it: #18885
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed String.format() to guava's pad method.
...in/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/TestColumnWithAliasFormatterFixed.java
Outdated
Show resolved
Hide resolved
@Override | ||
public void configure() | ||
{ | ||
bind(ColumnWithAliasFormatter.class).to(ColumnWithAliasFormatterFixed.class).in(Singleton.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably be extracted to JdbcClientModule
as an optional binding with a default binding to ColumnWithAliasFormatterFixed
.
That way other connectors can just plug in a new class instead of having to write their own modules. Since the module isn't installed in multiple places this feels a little bit like boilerplate when extracted to separate module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that, but bear in mind that this would mean that I have a boilerplate Module in the test only, so that I can use the @guice annotation. This is actually not that intrusive, as I can make the test class inherit from AbstractModule and only add a configure() method to it. However, this would mean that the binding code would not be shared between production code and the test code, but will be two separate codebases. If that is the policy here and this is what you want to go with, I'm fine. Don't have a strong opinion here. Let me know what's your decision here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does test code need a guice derived class? The ColumnWithAliasFormatter
can be directly tested by calling it's public method instead of holding state in the class.
For the integration test you don't need to bind anything anyway since the query runner already has the TestingOraclePlugin (or something to that effect) installed which installs all the original modules anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand your concern correctly. That's one of the main issues that the DI framework solves, ie. to be able to get an instance of a class without explicitly calling a constructor (which might have many dependent objects used in the constructor, if not now, then maybe in the future). That's also why people invented annotations like @guice so that you don't have to explicitly instantiate guice 'container' - saves you a lot of boilerplate.
Maybe it comes from the fact that you don't use unit tests in many places but mostly integration tests? I've seen in a few places that the codebase has 'new' calls in the test. This is generally considered a bad practice in the industry, so I believed we don't want to carry on that way.
If my explanations are not enough, perhaps we have a call about this on monday? If so, let me know what time suits you on slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using DI in test just to create a single class is overkill. I'd agree for a class which needs other classes to be injected. We do use Guice to create those in tests but not for a class which has no dependencies.
What does DI buy us vs new
for a class without any other injected classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scheduled a call with you on Tuesday 2023-09-05 to explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we agreed to keep existing module setup in production code (since thinking about using this in other connectors is out of scope for this PR).
In tests we'll switch to new
on either final
field in test class or initializing in @BeforeClass
method.
plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java
Outdated
Show resolved
Hide resolved
plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java
Outdated
Show resolved
Hide resolved
testPushdownJoinWithLongName(31); | ||
} | ||
catch (AssertionError ex) { | ||
assertThat(ex.getCause().getCause().getMessage()).isEqualTo("ORA-00972: identifier is too long\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use assertThatThrownBy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've seen that example in the other places of the codebase, and I actually started with the assertThatThrownBy. But I couldn't figure out how to apply this in that specific instance. Is it OK if you provide a snippet of how this should be done here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trino/testing/trino-testing/src/main/java/io/trino/testing/AbstractDistributedEngineOnlyQueries.java
Line 324 in 398b646
assertThatThrownBy(() -> query("SELECT row_number FROM n")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hubertsk. I've seen this one already. My question is how to make the exact assert I'm doing in the 'catch' clause with the 'assertThatThrownBy'. Can you provide a code snippet that does the exact equivalent of the code in this PR but uses assertThatThrownBy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That negative test is now removed, so there's no need for that method call.
@Test | ||
public void testPushdownJoinWithLongNameSucceedsWhenConfiguredCorrectly() | ||
{ | ||
testPushdownJoinWithLongName(30); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is prone to breakage. e.g. in future when we upgrade to Oracle with 128 char limits how would one find what value to adjust?
You can look at BaseConnectorTest#testCreateSchemaWithLongName
for some ideas. The other benefit being that then the test can be run against all connectors and we can catch such silent failures in other connectors too.
(This can be left as a follow-up, if you do decide to skip this comment then please create a GitHub issue + add a TODO comment linking to it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this test is fine, as it formats the alias over 30 characters which (we believe) is below the limit for any database, no matter if it's Oracle 11.x or Postgres.
I think what you were referring to is the second test that uses the constant 31, and that specifically tries to break it for Oracle 11.x. If we upgraded to 12.x and the limit by doing that changes to 128, the the test will fail, as we would not get the error message from the SQL engine. So the engineer doing the Oracle upgrade will notice that he needs to fix the constant in the test.
If I got you wrong and you were actually talking about this specific test, then it is the case that if we upgrade to 12.x or extend to any other database, the test will still be passing and that is OK. We are fine with alias being squashed to 30 characters even if the actual DB engine allows the alias to have 128 characters. As a reminder we did agree on one of the daily sessions what was the acceptance criteria and 'out of scope' for this task, which I copy here for reference from our internal ticket:
Acceptance Criteria
- Solution fixes the bug for Oracle pushdown JOIN
Out of Scope
- Integration test for other databases than Oracle
- If Alias field restriction has same bug on aggregation, it won’t be fixed
- If other databases have restriction on the alias field that result in same bug, this wont' be fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dominikzalewski please take a look at BaseConnectorTest#testCreateSchemaWithLongName
.
I believe @hashhar was referring to the way the actual limit (30
) is being passed.
testPushdownJoinWithLongName(30);
In the code above it's not clear what 30
stands for. The test @hashhar referred to solves similar problem in an elegant way.
trino/testing/trino-testing/src/main/java/io/trino/testing/BaseConnectorTest.java
Lines 3326 to 3335 in 398b646
public void testCreateSchemaWithLongName() | |
{ | |
skipTestUnless(hasBehavior(SUPPORTS_CREATE_SCHEMA)); | |
String baseSchemaName = "test_create_" + randomNameSuffix(); | |
int maxLength = maxSchemaNameLength() | |
// Assume 2^16 is enough for most use cases. Add a bit more to ensure 2^16 isn't actual limit. | |
.orElse(65536 + 5); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in addition to magic constant the main problem is that you have two tests which have two constants which can go out of sync.
It also makes it harder to make the test generic because we need to do it in future anyway. As an example if you strictly consider only Oracle to be in scope an even smaller solution would be to add a different implementation of DefaultQueryBuilder
for Oracle instead of trying to add a generic formatter class. But that would make making this solution generic harder since everything will need to be reworked anyway.
So since we already know the long term goal the short term solutions should iteratively drive us towards those long term goals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hashhar, thanks for the explanations. However, I do find it hard to skin out exactly what your asking for. There's to many 'ifs' for me here. Can you lay out in plain the changes/steps you want me to apply to this piece of code as if you were doing it? You can make the decisions instead of describing conditionals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at
trino/testing/trino-testing/src/main/java/io/trino/testing/BaseConnectorTest.java
Lines 3326 to 3335 in 398b646
public void testCreateSchemaWithLongName() | |
{ | |
skipTestUnless(hasBehavior(SUPPORTS_CREATE_SCHEMA)); | |
String baseSchemaName = "test_create_" + randomNameSuffix(); | |
int maxLength = maxSchemaNameLength() | |
// Assume 2^16 is enough for most use cases. Add a bit more to ensure 2^16 isn't actual limit. | |
.orElse(65536 + 5); | |
(There are benefits other than consistency as I've already said since as a follow up we need to do similar fix in other connectors and this change makes it possible to reuse the test instead of write it again for all of the more than 10 JDBC based connectors we have.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is now simplified after removing the negative test. If there's something that makes it unreadable, please propose the exact changes you want me to do and I will copy-paste them to the PR.
4542683
to
33f1766
Compare
33f1766
to
4346680
Compare
public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30; | ||
public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code lives in trino-base-jdbc
but the values are specific to Oracle.
Also consider adding a explanatory comment about why ORIGINAL_COLUMN_NAME_LENGTH = DEFAULT_COLUMN_ALIAS_LENGTH - 6
. Where does the magic value 6 come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code lives in trino-base-jdbc but the values are specific to Oracle.
It's ok to ignore for now since as follow-up we'll make this general for all connectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also consider adding a explanatory comment about why ORIGINAL_COLUMN_NAME_LENGTH = DEFAULT_COLUMN_ALIAS_LENGTH - 6. Where does the magic value 6 come from?
This still applies.
.split(column.getColumnName()) | ||
.iterator() | ||
.next(); | ||
String formatString = "%s_%0" + sequentialNumberLength + "d"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is formatString used?
@Test | ||
public void testPushdownJoinWithLongNameSucceeds() | ||
{ | ||
tryCleanupTemporaryTable(); | ||
try { | ||
String baseColumnName = "test_pushdown_" + randomNameSuffix(); | ||
String validColumnName = baseColumnName + "z".repeat(MAX_CHARS_COLUMN_ALIAS - baseColumnName.length()); | ||
|
||
assertUpdate(format(""" | ||
CREATE TABLE orders_1 as | ||
SELECT orderkey as %s, | ||
custkey, | ||
orderstatus, | ||
totalprice, | ||
orderdate, | ||
orderpriority, | ||
clerk, | ||
shippriority, | ||
comment | ||
FROM orders | ||
""", validColumnName), "VALUES 15000"); | ||
|
||
Session session = Session.builder(getSession()) | ||
.setCatalogSessionProperty("oracle", JOIN_PUSHDOWN_ENABLED, "true") | ||
.build(); | ||
assertQuery(session, | ||
format(""" | ||
SELECT c.custkey, o.%s, n.nationkey | ||
FROM orders_1 o JOIN customer c ON c.custkey = o.custkey | ||
JOIN nation n ON c.nationkey = n.nationkey | ||
""", validColumnName), | ||
""" | ||
SELECT c.custkey, o.orderkey, n.nationkey | ||
FROM orders o JOIN customer c ON c.custkey = o.custkey | ||
JOIN nation n ON c.nationkey = n.nationkey | ||
"""); | ||
} | ||
finally { | ||
tryCleanupTemporaryTable(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Test | |
public void testPushdownJoinWithLongNameSucceeds() | |
{ | |
tryCleanupTemporaryTable(); | |
try { | |
String baseColumnName = "test_pushdown_" + randomNameSuffix(); | |
String validColumnName = baseColumnName + "z".repeat(MAX_CHARS_COLUMN_ALIAS - baseColumnName.length()); | |
assertUpdate(format(""" | |
CREATE TABLE orders_1 as | |
SELECT orderkey as %s, | |
custkey, | |
orderstatus, | |
totalprice, | |
orderdate, | |
orderpriority, | |
clerk, | |
shippriority, | |
comment | |
FROM orders | |
""", validColumnName), "VALUES 15000"); | |
Session session = Session.builder(getSession()) | |
.setCatalogSessionProperty("oracle", JOIN_PUSHDOWN_ENABLED, "true") | |
.build(); | |
assertQuery(session, | |
format(""" | |
SELECT c.custkey, o.%s, n.nationkey | |
FROM orders_1 o JOIN customer c ON c.custkey = o.custkey | |
JOIN nation n ON c.nationkey = n.nationkey | |
""", validColumnName), | |
""" | |
SELECT c.custkey, o.orderkey, n.nationkey | |
FROM orders o JOIN customer c ON c.custkey = o.custkey | |
JOIN nation n ON c.nationkey = n.nationkey | |
"""); | |
} | |
finally { | |
tryCleanupTemporaryTable(); | |
} | |
} | |
@Test | |
public void testPushdownJoinWithLongNameSucceeds() | |
{ | |
String maximumLengthColumnIdentifier = "z".repeat(MAX_CHARS_COLUMN_ALIAS); | |
try (TestTable table = new TestTable(getQueryRunner()::execute, "long_identifier", "(%s bigint)".formatted(maximumLengthColumnIdentifier))) { | |
assertThat(query(joinPushdownEnabled(getSession()), """ | |
SELECT r.name, t.%s, n.name | |
FROM %s t JOIN region r ON r.regionkey = t.%s | |
JOIN nation n ON r.regionkey = n.regionkey""".formatted(maximumLengthColumnIdentifier, table.getName(), maximumLengthColumnIdentifier))) | |
.isFullyPushedDown(); | |
} | |
} | |
This seems to be sufficient to test.
Try to avoid creating data in tests since it can be slow, specially if using larger tables like orders. nation and region are best if your test doesn't care about size of tables.
Note: we need to verify that the query was indeed pushed down because otherwise it's possible that the connector would generate a plain SELECT with no synthetic ids and hence our test will pass.
Joins cannot always be pushed down even if join pushdown is enabled (e.g. if the join condition or some other operation in the query prevents pushdown).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also while this test by itself is sufficient to prove that the aliasing logic works it doesn't prove that we chose an optimal value for the max identifier length.
IMO we should have another test to verify the length we chose:
- Try create a column with length = max_length - must succeed
- Try create a column with length = max_length + 1 - must fail in expected manner
@kokosing the reason why i believe this is useful is because for example we can use 30 as the max globally but it un-necessarily loses information in connectors which can actually support longer identifiers making reading explain plans harder than needed.
public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30; | ||
public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24; | ||
|
||
public JdbcColumnHandle format(ConnectorSession session, JdbcColumnHandle column, int nextSyntheticColumnId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConnectorSession seems unused.
.iterator() | ||
.next(); | ||
String formatString = "%s_%0" + sequentialNumberLength + "d"; | ||
String columnName = originalColumnNameTruncated + "_" + padStart(Integer.toString(nextSyntheticColumnId), sequentialNumberLength, '0'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we pad the next synthetic id part, it leads to identifiers with long numbers in them. Is this required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what happens when the nextSyntheticColumnId
is "large" and causes the generated column name to exceed DEFAULT_COLUMN_ALIAS_LENGTH
?
@Guice(modules = ColumnWithAliasFormatterModule.class) | ||
public class TestColumnWithAliasFormatter | ||
{ | ||
@Inject | ||
private ColumnWithAliasFormatter actor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Guice(modules = ColumnWithAliasFormatterModule.class) | |
public class TestColumnWithAliasFormatter | |
{ | |
@Inject | |
private ColumnWithAliasFormatter actor; | |
public class TestColumnWithAliasFormatter | |
{ | |
private final ColumnWithAliasFormatter columnAliasFormatter = new ColumnWithAliasFormatter(); |
equivalent.
public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30; | ||
public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24; | ||
|
||
public JdbcColumnHandle format(ConnectorSession session, JdbcColumnHandle column, int nextSyntheticColumnId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe getAliasedColumnHandle
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe name the class SyntheticColumnHandleProvider/Builder/etc.
and the method getSyntheticColumnHandle
.
Superseded by #18924 |
@wendigo, @hashhar
How to Reproduce
The following enables join pushdown that otherwise would not happen [1], [2].
The following query
should trigger the breakpoint.
In ‘Threads & Variables’ view you should be able to observe the ‘… AS COLUMN_X_Y’ aliases for columns that have as many underscores with a follow up number, as the number of joins in the query.
Proposed solution
Divide the number of characters available for the alias (N) into two:
where N - 1 = P + M. Use the additional character '_' to visually divide 'Original Column Name Prefix' from 'Uniqueness Magic Number'.
Example
Assume N = 8, P = 4, M = 3
Query:
The aliases would then become: