Multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length #18642

dominikzalewski · 2023-08-11T11:22:02Z

How to Reproduce

Start ‘OracleQueryRunner’ in debug mode.
Setup a breakpoint in DefaultJdbcMetadata.applyJoin() method where the code is calling jdbcClient.implementJoin().
Start ‘Trino’ console client.

trino> use oracle.trino_test;

The following enables join pushdown that otherwise would not happen [1], [2].

trino> set session oracle.join_pushdown_enabled=true;

The following query

trino> SELECT c.custkey, o.orderkey, n.nationkey
      FROM orders o JOIN customer c ON c.custkey = o.custkey
      JOIN nation n ON c.nationkey = n.nationkey;

should trigger the breakpoint.

In ‘Threads & Variables’ view you should be able to observe the ‘… AS COLUMN_X_Y’ aliases for columns that have as many underscores with a follow up number, as the number of joins in the query.

Proposed solution

Divide the number of characters available for the alias (N) into two:

‘Column Name Prefix’ of length P
‘Uniqueness Magic Number’ of length M

where N - 1 = P + M. Use the additional character '_' to visually divide 'Original Column Name Prefix' from 'Uniqueness Magic Number'.

Example

Assume N = 8, P = 4, M = 3

Query:

SELECT c.custkey, o.orderkey
  FROM orders o JOIN customer c
  ON c.custkey = o.custkey

The aliases would then become:

custkey => cust_001
orderkey => orde_002

cla-bot · 2023-08-11T11:22:04Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dominik Zalewski.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot · 2023-08-11T11:30:20Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dominik Zalewski.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

dominikzalewski · 2023-08-11T11:32:43Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java

@@ -454,7 +454,8 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin(
        ImmutableMap.Builder<JdbcColumnHandle, JdbcColumnHandle> newLeftColumnsBuilder = ImmutableMap.builder();
        for (JdbcColumnHandle column : jdbcClient.getColumns(session, leftHandle)) {
            newLeftColumnsBuilder.put(column, JdbcColumnHandle.builderFrom(column)
-                    .setColumnName(column.getColumnName() + "_" + nextSyntheticColumnId)
+                    .setColumnName(column.getOriginalColumnName() + "_" + session.getAlwaysIncreasingNumber())


I wasn't sure whether we can re-use the nextSyntheticColumnId, as it seems to be using the max() function. If anyone knows whether 'nextSyntheticColumnId' has the property that it always increases monothonicaly, we could reuse it. However, keep in mind that the same change would need to be applied for 'aggregation' (here the change is only for 'join'), so I'd be better to come up with a generic solution.

dominikzalewski · 2023-08-11T11:34:22Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/JdbcColumnHandle.java

                    columnName,
                    jdbcTypeHandle,
                    columnType,
                    nullable,
                    comment);
+            handle.originalColumnName = originalColumnName;


originalColumnName is not added as it should be as final field as this creates a waterfall of changes in other places in the code that don't use the builder, but a direct constructor call. Since this solution proposal might be dumped and started from scratch after the review, I didn't want to invest that time yet.

dominikzalewski · 2023-08-11T11:34:44Z

plugin/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/TestDefaultJdbcQueryBuilder.java

-                Map.of(columns.get(2), "name1"),
-                Map.of(columns.get(3), "name2"));
+                Map.of(columns.get(2), columns.get(2)),//TODO:"name1"),
+                Map.of(columns.get(3), columns.get(3))); //TODO:"name2"));


Will be fixed if the 'general idea' is approved by this review.

wendigo · 2023-08-29T08:14:56Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+
+import io.trino.spi.connector.ConnectorSession;
+
+public interface ColumnWithAliasFormatter


This interface has one implementation. Drop it as it's not needed. Default implementation should get OptionalInt maximumLength and act on it accordingly

Dropping interface and leaving only a single class. The interface has been introduced only due to the fact that on the previous round of review, that on your request was done via videocall, you specifically requested to break that PR into multiple commits, where one is derived logically from the other. As a result I introduced a step-by-step 4 commit process as if I was guiding you through a refactor session. The natural thing was to introduce an interface first and then swap the implementation later (see commits history to understand better).

It also felt awkward to me that in the end we are left off with only a single implementation of an interface. Maybe the result of that is because there was a miscommunication between two of us WRT how this should be done. I'd appreciate your feedback (if possible more elaborate) on this, so that I can understand better what went wrong here. This will allow me to avoid doing unnecessary work in the future.

With the OptionalInt parameter, I'm withholding this change for now, as I still don't know how to code against that in the integration test, so that change is coupled with the other bulky work I need to do anyway.

wendigo · 2023-08-29T08:15:17Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatterLegacy.java

+        implements ColumnWithAliasFormatter
+{
+    @Override
+    public JdbcColumnHandle format(JdbcColumnHandle column, int nextSyntheticColumnId, ConnectorSession session)


Usually we put ConnectorSession first. Rearrange.

Yes, you are right. I did check the number of occurrences:

Which shows that ConnectorSession is only used as a NON-first parameter in less than 5% of cases.

Can you let me know how does it improve readability or in general management of the source code in trino if I align with 95% of cases? My concern is that the remaining 5% of the cases will not disappear, so the 0.3% of cases my change is throwing in (or not, if I make a change like you requested), seems like a marginal value.

Having said that, I can make that change, but the point of this exercise is unclear to me. It would be clearer if you requested that I change all the other occurrences by moving the ConnectorSession to the first param and possibly write a CI/CD plugin that would enforce that.

Can you justify/back up your request with numbers?

We default to the majority case. There is no science/numbers/rules behind it - just a preference to keep it consistent if possible.

keeping consistency has benefits. e.g. avoids easy mistakes when refactoring to use delegate and argument order is switched.

It's not 100% consistent only because there's no easy tools to do so.

@hashhar, @wendigo - thank you for responding. You might find my questions awkward, that is because I'm trying to learn by example, and given the fact that I asked for a specific rule/document/number that would back up your request, and I did not receive it, that learning will be harder than I expected.

The reason why I'm digging on this one (even though this discussion seems pointless on the first sight), is it looks to me like I have a sample of TWO examples and it seems like one contradicts the other. I know this is 'statistically unstable' as the sample is too small, but still I need to somehow learn/make progress.

So the other counter-example is in this PR: #18718 where both of you commented and it seems like it's the other way round. That is, a change was proposed to standardise and use a builder instead of a constructor call, but that comment:

"LGTM % how do we prevent new usages of the constructor from appearing?"

by @hashhar, is what I believe stopped the PR from being merged, even though it had a green build and an approval, what I believed were the only things necessary for a merge to master to occur.

So with that long comment I wanted to say that I do not understand why:

in this PR you are asking to 'standardise' even though there is no change that this will ever be standardised

in the other PR I did try to standardise, and you left me in vain with the PR claiming that it could never be standardised

I appreciate your patience on reading this.

in this PR it's just about "code-style".

the other PR will affect correctness since if a caller doesn't use the builder then the length limit value won't be set hence there's correctness which comes before style and consistency.

in this PR you are asking to 'standardise' even though there is no change that this will ever be standardised

Also because it's hard to ensure things are always standardized doesn't mean we should actively avoid standardizing where possible or when we notice it.

You can also submit a follow-up PR after this which standardises the other occurrences as well. It'll be a welcome change.

I'm afraid I'm going to have to be the one that parks this discussion and does the change despite the fact that it does not look like there is a right-wrong kind of approach you both could defend.

I think we came pretty far from what we wanted to achieve and the result of carrying on is that:
a) @wendigo has even lost interest in commenting here
b) I seem to be making a conclusion that there is a rule that says:

the maintainer is always right

even if the maintainer is not right, look at pt 1.
which is probably exactly the opposite of why the review is being done in the first place.

Let me do that change and use your energy in the other place where I have a bigger problem, where I don't know how to pass a config value to a test.

It would be clearer if you requested that I change all the other occurrences by moving the ConnectorSession to the first param and possibly write a CI/CD plugin that would enforce that.

This is what we'd do in a perfect world. I don't think however that it can be done in a reasonable amount of time (i.e. Less time than it takes to send a manual PR for aligning argument order) from past experiences we have writing errorprone or modernizer checks.

wendigo · 2023-08-29T09:06:26Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatterLegacy.java

+        implements ColumnWithAliasFormatter
+{
+    @Override
+    public JdbcColumnHandle format(JdbcColumnHandle column, int nextSyntheticColumnId, ConnectorSession session)


Instead two distinct implementations let's have one with the signature format(ConnectorSession session, JdbcColumnHandle column, OptionalInt maximumLength, int nextColumnId).

if maximumLength is empty return legacy value, otherwise: apply logic as in the new implementation.

wendigo · 2023-08-29T09:07:10Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java

    {
        this.jdbcClient = requireNonNull(jdbcClient, "jdbcClient is null");
        this.precalculateStatisticsForPushdown = precalculateStatisticsForPushdown;
        this.jdbcQueryEventListeners = ImmutableSet.copyOf(requireNonNull(jdbcQueryEventListeners, "queryEventListeners is null"));
+        this.aliasFormatter = aliasFormatter;


Suggested change

this.aliasFormatter = aliasFormatter;

this.aliasFormatter = requireNonNull(aliasFormatter, "aliasFormatter is null");

wendigo · 2023-08-29T09:07:20Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java

@@ -449,22 +455,19 @@ public Optional<JoinApplicationResult<ConnectorTableHandle>> applyJoin(
        if (!leftHandle.getAuthorization().equals(rightHandle.getAuthorization())) {
            return Optional.empty();
        }
+


unrelated change. revert

wendigo · 2023-08-29T09:08:41Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadata.java

-            newRightColumnsBuilder.put(column, JdbcColumnHandle.builderFrom(column)
-                    .setColumnName(column.getColumnName() + "_" + nextSyntheticColumnId)
-                    .build());
+            newRightColumnsBuilder.put(column, aliasFormatter.format(column, nextSyntheticColumnId, session));


here you should pass maximum identifier length as OptionalInt acquired from jdbcClient method with the default implementation:

newRightColumnsBuilder.put(column, aliasFormatter.format(session, column, jdbcClient.getMaximumColumnIdentifierLength(), nextSyntheticColumnId));

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/DefaultJdbcMetadataFactory.java

wendigo · 2023-08-29T09:10:34Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/JdbcMetadataConfig.java

@@ -100,6 +102,11 @@ public int getDomainCompactionThreshold()
        return domainCompactionThreshold;
    }

+    public int getColumnAliasMaxChars()


This is not needed. This shouldn't be user configurable property but something that is exposed by the implementations extending JdbcClient. Move this information to the JdbcClient with the default implementation returning OptionalInt.empty() (meaning no identifier length limit)

@wendigo, I don't think you got to the clue of the problem. You have provided me with the test case example here:
https://github.com/trinodb/trino/pull/18858/files#diff-dcd58a1db0b844e3264313bafcbdd64e74bdfc8b6c864aa776fe361f9c13888e

The exact code from what you provided, I'm referring to is:

@Test public void testJoinPushdownWithLongIdentifiers() { String maximumColumnIdentifier = "a".repeat(ORACLE_MAXIMUM_IDENTIFIER_LENGTH); // Verify that the ORACLE_MAXIMUM_IDENTIFIER_LENGTH is correct for the Oracle version // used in the test. oracleServer.execute("SELECT 1 AS " + maximumColumnIdentifier + " FROM DUAL"); assertThatThrownBy(() -> { oracleServer.execute("SELECT 1 AS " + maximumColumnIdentifier + "a FROM DUAL"); }).hasMessageContaining("ORA-00972: identifier is too long"); try (TestTable table = new TestTable(onRemoteDatabase(), "join_long_", "(%s decimal(19, 0))".formatted(maximumColumnIdentifier))) { assertThat(query(joinPushdownEnabled(getSession()), """ SELECT c.custkey, o.%s, n.nationkey FROM %s o JOIN customer c ON c.custkey = o.%s JOIN nation n ON c.nationkey = n.nationkey""".formatted(maximumColumnIdentifier, table.getName(), maximumColumnIdentifier))) .isFullyPushedDown(); } }

This code is the equivalent of the test that I already have, which is in: TestOracleConnectorTest#testPushdownJoinWithLongNameSucceedsWhenConfiguredCorrectly and is the 'happy path' testing.

I didn't need to introduce the config property nor the method getColumnAliasMaxChars in order to code that test, ie. test the happy path. So whatever time you devoted to providing the code sample above, you could have used elsewhere in the project. Please don't take it personally. I'm only writing the above as on our meeting with @hubertsk you have said explicitly that you don't have time to leave detailed justification against the comments you are making in the PR. As your time as a technical leader is valuable, by saying the above, I wanted to give you clues where you can save your precious time. Maybe you wasted it because I somehow mislead you with what kind of help I needed. If that was the case, I apologize.

To go back to the technical part of this discussion. The real help that I'm asking for is how to code that test TestOracleConnectorTest#testPushdownJoinWithLongNameFailsWhenMaxAllowedCharsIsMisconfigured (ie. a non-happy path) without adding the config property nor getColumnAliasMaxChars method (like you requested). If you could provide the equivalent in the branch you have provided your code samples already, that would be awesome. I am otherwise currently stuck on exactly that issue.

I don't think there is a need to provide a negative test here. As I understand positive test (happy path) is good enough here to provide the regression coverage. Notice that test should fail without product code changes you provide. We don't need to prove that we are able to break our code.

Can you please explain me what is the point of negative test here? What do I miss?

IsMisconfigured

The best way to avoid misconfiguration is disallow users to configure. Which was requested with #18642 (comment)

@kokosing - thank you. Reverted the change with adding a config property and removed the negative test.

wendigo · 2023-08-29T09:13:44Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatterFixed.java

+                .split(column.getColumnName())
+                .iterator()
+                .next();
+        String formatString = "%s_%0" + sequentialNumberLength + "d";


instead of a format you can just concatenate string with padded nextSyntheticColumnId as follows:

String columnName = originalColumnNameTruncated + "_" + padStart(Integer.toString(nextSyntheticColumnId), sequentialNumberLength, '0'));

padStart comes from Guava's Strings class

Like any programming, you can do this in 1000 different ways. Can you explain better what is the advantage of using the approach you proposed? Is there a disadvantage of using what I proposed?

String.format generates a lot of garbage and is less performant so we avoid it on the hot paths (see: #15258 #15316 trinodb/tpch#34)

The solution that I've proposed is free of these problems and inlines nicely.

Thank you for describing the original idea you had in your head. Now I understand that the reason you wanted me to change the String.format() to something else is due to (the hope of) performance optimization.

Today you have actually asked me about the onboarding doc (presumably irrelevant discussion in this context). So by accident I came back to the onboarding document, where I found this quote:

"It’s okay not to know everything and ask for basic stuff. We’ve all been there, so remember that there are no silly questions."

Under the umbrella of this encouragement to have child-like curiosity, let me carry on this discussion as there are some things that just don't seem to fit in my understanding.

You are writing that the change you are proposing stems from the fact that we want to avoid it on the hot paths. So this reminds me of my math teacher from my High School that once reminded me of the truth table for implication:

p | q | p → q -- | -- | -- T | T | T T | F | F F | T | T F | F | T

where the bottom two rows mean that if you start with a false assumption, you can deduce anything depending on your liking and it will always be true. So this got me to thinking. If your assumption that we are on a hot/critical path here is wrong (false), then you will always arrive at the conclusion that we should change it.

So I started reading the code at hand with the question in my head: are we really on a critical/hot path?
In the method call of the path the code in this PR is modifying, there are already three other places where we use String.format() explicitly:

io.trino.plugin.jdbc.DefaultQueryBuilder#formatAssignments

io.trino.plugin.jdbc.DefaultQueryBuilder#prepareJoinQuery

io.trino.plugin.jdbc.DefaultQueryBuilder#formatJoinCondition

and we call formatAssignments twice. So this means there are already 4 calls in the product to String.format() running on production as we speak.

So this got me thinking. If it was a hot path, and the performance would be impacted by String.format(), wouldn't it already be reported by the customers?

I also did check who is the author of the code mentioned above. I was very surprised that it's the very same person who made the comment not to use String.format() in the first place.

There is also this famous quote from one of my favourite authors: Donald Erwin Knuth, The Art of Computer Programming:

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”

Having said all of that, I just wanted to say that the reference you've provided that was suppose to justify your request, does not look like is applicable here. From all the analysis I have provided above, is there anything that you would not agree with?

because existing code is imperfect doesn't mean future code should be too.

The good thing to do here would be to send follow-up PR which fixes the existing String.format usages.

And yes nobody is perfect, we're all learning - git blame isn't intended to be used to blame someone. I'm sure I've written a lot of stupid and complex code as well - doesn't mean I don't think it should be improved.

IMO "premature optimization is the root of all evil" doesn't mean that we shouldn't to the obvious things that we already know that we can do better.

Replying to your ownership comment this class was created as a copy during: #10059. I'm not the sole author of this code but I can always improve it: #18885

Changed String.format() to guava's pad method.

...in/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/TestColumnWithAliasFormatterFixed.java

hashhar · 2023-08-29T12:42:17Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatterModule.java

+    @Override
+    public void configure()
+    {
+        bind(ColumnWithAliasFormatter.class).to(ColumnWithAliasFormatterFixed.class).in(Singleton.class);


should probably be extracted to JdbcClientModule as an optional binding with a default binding to ColumnWithAliasFormatterFixed.

That way other connectors can just plug in a new class instead of having to write their own modules. Since the module isn't installed in multiple places this feels a little bit like boilerplate when extracted to separate module.

I can do that, but bear in mind that this would mean that I have a boilerplate Module in the test only, so that I can use the @guice annotation. This is actually not that intrusive, as I can make the test class inherit from AbstractModule and only add a configure() method to it. However, this would mean that the binding code would not be shared between production code and the test code, but will be two separate codebases. If that is the policy here and this is what you want to go with, I'm fine. Don't have a strong opinion here. Let me know what's your decision here.

why does test code need a guice derived class? The ColumnWithAliasFormatter can be directly tested by calling it's public method instead of holding state in the class.

For the integration test you don't need to bind anything anyway since the query runner already has the TestingOraclePlugin (or something to that effect) installed which installs all the original modules anyway.

I'm not sure if I understand your concern correctly. That's one of the main issues that the DI framework solves, ie. to be able to get an instance of a class without explicitly calling a constructor (which might have many dependent objects used in the constructor, if not now, then maybe in the future). That's also why people invented annotations like @guice so that you don't have to explicitly instantiate guice 'container' - saves you a lot of boilerplate.

Maybe it comes from the fact that you don't use unit tests in many places but mostly integration tests? I've seen in a few places that the codebase has 'new' calls in the test. This is generally considered a bad practice in the industry, so I believed we don't want to carry on that way.

If my explanations are not enough, perhaps we have a call about this on monday? If so, let me know what time suits you on slack.

Using DI in test just to create a single class is overkill. I'd agree for a class which needs other classes to be injected. We do use Guice to create those in tests but not for a class which has no dependencies.

What does DI buy us vs new for a class without any other injected classes?

Scheduled a call with you on Tuesday 2023-09-05 to explain.

So we agreed to keep existing module setup in production code (since thinking about using this in other connectors is out of scope for this PR).

In tests we'll switch to new on either final field in test class or initializing in @BeforeClass method.

plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java

hashhar · 2023-08-29T12:44:44Z

plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java

+            testPushdownJoinWithLongName(31);
+        }
+        catch (AssertionError ex) {
+            assertThat(ex.getCause().getCause().getMessage()).isEqualTo("ORA-00972: identifier is too long\n");


you can use assertThatThrownBy

Yes, I've seen that example in the other places of the codebase, and I actually started with the assertThatThrownBy. But I couldn't figure out how to apply this in that specific instance. Is it OK if you provide a snippet of how this should be done here?

trino/testing/trino-testing/src/main/java/io/trino/testing/AbstractDistributedEngineOnlyQueries.java

Line 324 in 398b646

assertThatThrownBy(() -> query("SELECT row_number FROM n"))

Thanks @hubertsk. I've seen this one already. My question is how to make the exact assert I'm doing in the 'catch' clause with the 'assertThatThrownBy'. Can you provide a code snippet that does the exact equivalent of the code in this PR but uses assertThatThrownBy?

Withdrawing my request. Was able to finally figure it out. The issue was that I was looking for a method too narrowly (starting with 'has'). I was not familiar with that API, but luckily I came across the method I needed. Thanks for your patience with this @hubertsk and @hashhar.

That negative test is now removed, so there's no need for that method call.

hashhar · 2023-08-29T12:48:22Z

plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java

+    @Test
+    public void testPushdownJoinWithLongNameSucceedsWhenConfiguredCorrectly()
+    {
+        testPushdownJoinWithLongName(30);


this is prone to breakage. e.g. in future when we upgrade to Oracle with 128 char limits how would one find what value to adjust?

You can look at BaseConnectorTest#testCreateSchemaWithLongName for some ideas. The other benefit being that then the test can be run against all connectors and we can catch such silent failures in other connectors too.

(This can be left as a follow-up, if you do decide to skip this comment then please create a GitHub issue + add a TODO comment linking to it)

Actually this test is fine, as it formats the alias over 30 characters which (we believe) is below the limit for any database, no matter if it's Oracle 11.x or Postgres.

I think what you were referring to is the second test that uses the constant 31, and that specifically tries to break it for Oracle 11.x. If we upgraded to 12.x and the limit by doing that changes to 128, the the test will fail, as we would not get the error message from the SQL engine. So the engineer doing the Oracle upgrade will notice that he needs to fix the constant in the test.

If I got you wrong and you were actually talking about this specific test, then it is the case that if we upgrade to 12.x or extend to any other database, the test will still be passing and that is OK. We are fine with alias being squashed to 30 characters even if the actual DB engine allows the alias to have 128 characters. As a reminder we did agree on one of the daily sessions what was the acceptance criteria and 'out of scope' for this task, which I copy here for reference from our internal ticket:

Acceptance Criteria

Solution fixes the bug for Oracle pushdown JOIN

Out of Scope

Integration test for other databases than Oracle

If Alias field restriction has same bug on aggregation, it won’t be fixed

If other databases have restriction on the alias field that result in same bug, this wont' be fixed

@dominikzalewski please take a look at BaseConnectorTest#testCreateSchemaWithLongName.
I believe @hashhar was referring to the way the actual limit (30) is being passed.

testPushdownJoinWithLongName(30);

In the code above it's not clear what 30 stands for. The test @hashhar referred to solves similar problem in an elegant way.

trino/testing/trino-testing/src/main/java/io/trino/testing/BaseConnectorTest.java

Lines 3326 to 3335 in 398b646

public void testCreateSchemaWithLongName()

{

skipTestUnless(hasBehavior(SUPPORTS_CREATE_SCHEMA));

String baseSchemaName = "test_create_" + randomNameSuffix();

int maxLength = maxSchemaNameLength()

// Assume 2^16 is enough for most use cases. Add a bit more to ensure 2^16 isn't actual limit.

.orElse(65536 + 5);

in addition to magic constant the main problem is that you have two tests which have two constants which can go out of sync.

It also makes it harder to make the test generic because we need to do it in future anyway. As an example if you strictly consider only Oracle to be in scope an even smaller solution would be to add a different implementation of DefaultQueryBuilder for Oracle instead of trying to add a generic formatter class. But that would make making this solution generic harder since everything will need to be reworked anyway.

So since we already know the long term goal the short term solutions should iteratively drive us towards those long term goals.

@hashhar, thanks for the explanations. However, I do find it hard to skin out exactly what your asking for. There's to many 'ifs' for me here. Can you lay out in plain the changes/steps you want me to apply to this piece of code as if you were doing it? You can make the decisions instead of describing conditionals.

@hubertsk, @hashhar. I tried to guess what your ask is. I have used a constant and a constant + 1, however I have a feeling that we are not totally in sync about this specific request. Do you think it's possible that we discuss that on-line?

look at

trino/testing/trino-testing/src/main/java/io/trino/testing/BaseConnectorTest.java

Lines 3326 to 3335 in 398b646

public void testCreateSchemaWithLongName()

{

skipTestUnless(hasBehavior(SUPPORTS_CREATE_SCHEMA));

String baseSchemaName = "test_create_" + randomNameSuffix();

int maxLength = maxSchemaNameLength()

// Assume 2^16 is enough for most use cases. Add a bit more to ensure 2^16 isn't actual limit.

.orElse(65536 + 5);

and try to model similarly if for nothing then at-least consitency's sake (makes future readers of the code's job easier since knowledge is transferable and things are familiar).

(There are benefits other than consistency as I've already said since as a follow up we need to do similar fix in other connectors and this change makes it possible to reuse the test instead of write it again for all of the more than 10 JDBC based connectors we have.)

This code is now simplified after removing the negative test. If there's something that makes it unreadable, please propose the exact changes you want me to do and I will copy-paste them to the PR.

This reverts commit 8fce268.

hashhar · 2023-09-05T05:14:32Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+    public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30;
+    public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24;


the code lives in trino-base-jdbc but the values are specific to Oracle.

Also consider adding a explanatory comment about why ORIGINAL_COLUMN_NAME_LENGTH = DEFAULT_COLUMN_ALIAS_LENGTH - 6. Where does the magic value 6 come from?

the code lives in trino-base-jdbc but the values are specific to Oracle.

It's ok to ignore for now since as follow-up we'll make this general for all connectors.

Also consider adding a explanatory comment about why ORIGINAL_COLUMN_NAME_LENGTH = DEFAULT_COLUMN_ALIAS_LENGTH - 6. Where does the magic value 6 come from?

This still applies.

hashhar · 2023-09-05T05:14:57Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+                .split(column.getColumnName())
+                .iterator()
+                .next();
+        String formatString = "%s_%0" + sequentialNumberLength + "d";


is formatString used?

hashhar · 2023-09-05T05:19:56Z

plugin/trino-oracle/src/test/java/io/trino/plugin/oracle/TestOracleConnectorTest.java

+    @Test
+    public void testPushdownJoinWithLongNameSucceeds()
+    {
+        tryCleanupTemporaryTable();
+        try {
+            String baseColumnName = "test_pushdown_" + randomNameSuffix();
+            String validColumnName = baseColumnName + "z".repeat(MAX_CHARS_COLUMN_ALIAS - baseColumnName.length());
+
+            assertUpdate(format("""
+                    CREATE TABLE orders_1 as
+                        SELECT orderkey as %s,
+                            custkey,
+                            orderstatus,
+                            totalprice,
+                            orderdate,
+                            orderpriority,
+                            clerk,
+                            shippriority,
+                            comment
+                        FROM orders
+                    """, validColumnName), "VALUES 15000");
+
+            Session session = Session.builder(getSession())
+                    .setCatalogSessionProperty("oracle", JOIN_PUSHDOWN_ENABLED, "true")
+                    .build();
+            assertQuery(session,
+                    format("""
+                            SELECT c.custkey, o.%s, n.nationkey
+                                 FROM orders_1 o JOIN customer c ON c.custkey = o.custkey
+                                 JOIN nation n ON c.nationkey = n.nationkey
+                            """, validColumnName),
+                    """
+                                SELECT c.custkey, o.orderkey, n.nationkey
+                                FROM orders o JOIN customer c ON c.custkey = o.custkey
+                                JOIN nation n ON c.nationkey = n.nationkey
+                            """);
+        }
+        finally {
+            tryCleanupTemporaryTable();
+        }
+    }


Suggested change

@Test

public void testPushdownJoinWithLongNameSucceeds()

{

tryCleanupTemporaryTable();

try {

String baseColumnName = "test_pushdown_" + randomNameSuffix();

String validColumnName = baseColumnName + "z".repeat(MAX_CHARS_COLUMN_ALIAS - baseColumnName.length());

assertUpdate(format("""

CREATE TABLE orders_1 as

SELECT orderkey as %s,

custkey,

orderstatus,

totalprice,

orderdate,

orderpriority,

clerk,

shippriority,

comment

FROM orders

""", validColumnName), "VALUES 15000");

Session session = Session.builder(getSession())

.setCatalogSessionProperty("oracle", JOIN_PUSHDOWN_ENABLED, "true")

.build();

assertQuery(session,

format("""

SELECT c.custkey, o.%s, n.nationkey

FROM orders_1 o JOIN customer c ON c.custkey = o.custkey

JOIN nation n ON c.nationkey = n.nationkey

""", validColumnName),

"""

SELECT c.custkey, o.orderkey, n.nationkey

FROM orders o JOIN customer c ON c.custkey = o.custkey

JOIN nation n ON c.nationkey = n.nationkey

""");

}

finally {

tryCleanupTemporaryTable();

}

}

@Test

public void testPushdownJoinWithLongNameSucceeds()

{

String maximumLengthColumnIdentifier = "z".repeat(MAX_CHARS_COLUMN_ALIAS);

try (TestTable table = new TestTable(getQueryRunner()::execute, "long_identifier", "(%s bigint)".formatted(maximumLengthColumnIdentifier))) {

assertThat(query(joinPushdownEnabled(getSession()), """

SELECT r.name, t.%s, n.name

FROM %s t JOIN region r ON r.regionkey = t.%s

JOIN nation n ON r.regionkey = n.regionkey""".formatted(maximumLengthColumnIdentifier, table.getName(), maximumLengthColumnIdentifier)))

.isFullyPushedDown();

}

}

This seems to be sufficient to test.
Try to avoid creating data in tests since it can be slow, specially if using larger tables like orders. nation and region are best if your test doesn't care about size of tables.

Note: we need to verify that the query was indeed pushed down because otherwise it's possible that the connector would generate a plain SELECT with no synthetic ids and hence our test will pass.

Joins cannot always be pushed down even if join pushdown is enabled (e.g. if the join condition or some other operation in the query prevents pushdown).

Also while this test by itself is sufficient to prove that the aliasing logic works it doesn't prove that we chose an optimal value for the max identifier length.

IMO we should have another test to verify the length we chose:

Try create a column with length = max_length - must succeed

Try create a column with length = max_length + 1 - must fail in expected manner

@kokosing the reason why i believe this is useful is because for example we can use 30 as the max globally but it un-necessarily loses information in connectors which can actually support longer identifiers making reading explain plans harder than needed.

hashhar · 2023-09-05T05:27:23Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+    public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30;
+    public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24;
+
+    public JdbcColumnHandle format(ConnectorSession session, JdbcColumnHandle column, int nextSyntheticColumnId)


ConnectorSession seems unused.

hashhar · 2023-09-05T05:32:50Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+                .iterator()
+                .next();
+        String formatString = "%s_%0" + sequentialNumberLength + "d";
+        String columnName = originalColumnNameTruncated + "_" + padStart(Integer.toString(nextSyntheticColumnId), sequentialNumberLength, '0');


why do we pad the next synthetic id part, it leads to identifiers with long numbers in them. Is this required?

Also what happens when the nextSyntheticColumnId is "large" and causes the generated column name to exceed DEFAULT_COLUMN_ALIAS_LENGTH?

hashhar · 2023-09-05T06:09:21Z

plugin/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/TestColumnWithAliasFormatter.java

+@Guice(modules = ColumnWithAliasFormatterModule.class)
+public class TestColumnWithAliasFormatter
+{
+    @Inject
+    private ColumnWithAliasFormatter actor;


Suggested change

@Guice(modules = ColumnWithAliasFormatterModule.class)

public class TestColumnWithAliasFormatter

{

@Inject

private ColumnWithAliasFormatter actor;

public class TestColumnWithAliasFormatter

{

private final ColumnWithAliasFormatter columnAliasFormatter = new ColumnWithAliasFormatter();

equivalent.

hashhar · 2023-09-05T06:11:43Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/ColumnWithAliasFormatter.java

+    public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30;
+    public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24;
+
+    public JdbcColumnHandle format(ConnectorSession session, JdbcColumnHandle column, int nextSyntheticColumnId)


maybe getAliasedColumnHandle?

Or maybe name the class SyntheticColumnHandleProvider/Builder/etc. and the method getSyntheticColumnHandle.

hashhar · 2023-09-12T06:08:17Z

Superseded by #18924

dominikzalewski commented Aug 11, 2023

View reviewed changes

dominikzalewski requested review from wendigo and hashhar August 11, 2023 12:02

cla-bot bot added the cla-signed label Aug 16, 2023

dominikzalewski force-pushed the dominikzalewski/pushdown_alias_column_names_exceed_limit branch 18 times, most recently from 79cdb09 to 9526323 Compare August 23, 2023 15:14

hashhar mentioned this pull request Aug 24, 2023

Refactor to always use JdbcColumnHandle.Builder instead of a constructor #18718

Closed

dominikzalewski force-pushed the dominikzalewski/pushdown_alias_column_names_exceed_limit branch 2 times, most recently from 047f181 to b1a2d84 Compare August 28, 2023 16:15

dominikzalewski self-assigned this Aug 28, 2023

dominikzalewski changed the title ~~[WIP] multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length~~ Multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length Aug 28, 2023

dominikzalewski marked this pull request as ready for review August 28, 2023 16:21

dominikzalewski added 2 commits August 29, 2023 10:17

Refactor to use column formatter

e36d9f5

Add configuration property for max characters on alias column

8fce268

dominikzalewski force-pushed the dominikzalewski/pushdown_alias_column_names_exceed_limit branch 2 times, most recently from 3c0bafc to aa84720 Compare August 29, 2023 08:44

wendigo reviewed Aug 29, 2023

View reviewed changes

hashhar reviewed Aug 29, 2023

View reviewed changes

dominikzalewski force-pushed the dominikzalewski/pushdown_alias_column_names_exceed_limit branch 3 times, most recently from 4542683 to 33f1766 Compare September 4, 2023 12:35

dominikzalewski added 4 commits September 4, 2023 15:45

Bugfix with fixed formatter

25ea98e

Integration test

cf29b3c

Code Review - non-disputable simple fixes

356a829

Revert "Add configuration property for max characters on alias column"

4346680

This reverts commit 8fce268.

dominikzalewski force-pushed the dominikzalewski/pushdown_alias_column_names_exceed_limit branch from 33f1766 to 4346680 Compare September 4, 2023 14:07

hashhar reviewed Sep 5, 2023

View reviewed changes

hashhar closed this Sep 12, 2023


		import io.trino.spi.connector.ConnectorSession;

		public interface ColumnWithAliasFormatter

	this.aliasFormatter = aliasFormatter;
	this.aliasFormatter = requireNonNull(aliasFormatter, "aliasFormatter is null");

	public void testCreateSchemaWithLongName()
	{
	skipTestUnless(hasBehavior(SUPPORTS_CREATE_SCHEMA));

	String baseSchemaName = "test_create_" + randomNameSuffix();

	int maxLength = maxSchemaNameLength()
	// Assume 2^16 is enough for most use cases. Add a bit more to ensure 2^16 isn't actual limit.
	.orElse(65536 + 5);

		public static final int DEFAULT_COLUMN_ALIAS_LENGTH = 30;
		public static final int ORIGINAL_COLUMN_NAME_LENGTH = 24;

Multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length #18642

Multi-join pushdown appends recursively _<number> to column names so that they exceed maximum alias length #18642

Conversation

dominikzalewski commented Aug 11, 2023 • edited Loading

How to Reproduce

Proposed solution

cla-bot bot commented Aug 11, 2023

cla-bot bot commented Aug 11, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wendigo Aug 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hashhar Aug 31, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hubertsk Aug 31, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hashhar commented Sep 12, 2023

dominikzalewski commented Aug 11, 2023 •

edited

Loading

wendigo Aug 30, 2023 •

edited

Loading

hashhar Aug 31, 2023 •

edited

Loading

hubertsk Aug 31, 2023 •

edited

Loading