Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allows lots of table scans cases where keys cannot easily be extracted #7155

Merged
merged 9 commits into from
Mar 9, 2021

Conversation

AlanConfluent
Copy link
Member

@AlanConfluent AlanConfluent commented Mar 4, 2021

fixes #4484

Description

This allows many cases when table scans are enabled. Prior to this, a comparison must have been of the form KEY=12345 or possibly KEY=1 + 2 where there was a column on one side and a resolvable expression on the other. When table scans are enabled, now these can be done:

  • Both sides are column references (e.g. KEY = COL)
  • Neither side is a column reference (e.g. 5 > 3, KEY + 1 = COL + 1 )
  • One side is a column reference and the other is an unresolveable expression (e.g. KEY = COL + 10)

Resolves #4484

Testing done

RQTT tests and unit tests

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@AlanConfluent AlanConfluent requested a review from a team as a code owner March 4, 2021 02:08
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice quick win in usability

@@ -345,6 +363,28 @@ private void setTableScanOrElseThrow(final Supplier<KsqlException> exceptionSupp
}
}

private final class NonColumnRefValidator extends TraversalExpressionVisitor<Object> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final class NonColumnRefValidator extends TraversalExpressionVisitor<Object> {
private final class HasColumnRef extends TraversalExpressionVisitor<Object> {

It was really tough to grok nonColumnRefUnresolvable - it looks like what this does is answer "did we visit a column reference?"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right that this is a more descriptive name. I had originally thought it might do other checks, but I'll go with HasColumnRef.

"statements": [
"CREATE STREAM INPUT (ID INTEGER KEY, IGNORED INT) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE TABLE AGGREGATE AS SELECT ID, COUNT(1) AS COUNT FROM INPUT GROUP BY ID;",
"SELECT * FROM AGGREGATE WHERE ID = 20 - 10;",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SELECT * FROM AGGREGATE WHERE ID = 20 - 10;

at some point, this should result in a Key lookup right? What's the status of #4484 after we merge this PR? We might want to consider renaming that one (or opening a new one). I think it's important to consider that alongside this PR because we'd probably want a single-pass strategy (naively, we could just use the GenericExpressionResolver to try to resolve the non-key side of the equation and if it throws we assume it's not resolvable).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should result in a key lookup. It should fall into the case of non column ref needs to be resolved.

I think #4484 should be resolved after this.

I was debating whether to do it that way, and I'm not opposed to doing it like that. The current code is a common pattern where there's a validation and an extraction pass. In this case, it makes dealing with the consequences a bit easier (we can just declare table scan required with the other similar checks and avoid extraction). Doing some of this in extraction would just slightly complicate that code a bit. Or did you mean merging Validation and Extraction more broadly to a single pass?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ignore that original comment, I hadn't properly understood the code on my first pass. Originally I had thought that we would table scan for ID = 20 - 10 but now I'm seeing that we don't. I think the current implementation is fine.

@AlanConfluent AlanConfluent changed the title feat: Allows lots of non simple key=literal cases when table scans are enabled feat: Allows lots of table scans cases where keys cannot easily be extracted Mar 4, 2021
Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a pass, overall LGTM!

ksqlConfig.getBoolean(KsqlConfig.KSQL_QUERY_PULL_INTERPRETER_ENABLED)
).resolve(other);

if (obj instanceof Integer || obj instanceof Long) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow float/double to be casted to INT as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GenericExpressionResolver is in strict mode which I don't believe allows this. We don't do a lot of conversions elsewhere automatically, such as with the key values (and would be a bit harder to generify), so I think this if fine for consistency.

final UnqualifiedColumnReferenceExp column = getColumnRefSide(node);
final Expression other = getNonColumnRefSide(node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm wondering if we can merge these two conditions into a single one, like:

first = getFirstColumnRefSide(); // if both are column ref, return left; if non are column ref, return null;
if (first != null) {
    second = getOtherSide();
    secondHasColumnRef = new ... ; secondHasColumnRef.process(second);
    if (second isColumnRef || secondHasColumnRef.hasColumnRef)
        setTableScan();
} else {
    setTableScan();    // i.e. for `WHERE 100 = 100` or `100 = 101`, we would always do table scan
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I reoriented the logic similar to this and allowed getFirstColumnRefSide to return null so that it can be checked for, allowing me to get rid of the other method.

@@ -295,7 +297,23 @@ public Void visitComparisonExpression(
final ComparisonExpression node,
final Object context
) {
if (!isSingleColumnReference(node)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just occurred to me: if the WHERE clause is just 100 = 101 then we would still do a table scan instead of immediately return empty values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unfortunately the case today. I have a ticket #6973 which generally covers this kind of thing. You can simplify key ranges or even see that none exist at all, or that 100 = 101 can be simplified without any row information. Feel free to either add some cases to that ticket or even create another, so that we don't forget all of the optimization ideas we have.

@AlanConfluent AlanConfluent merged commit 71becea into master Mar 9, 2021
@AlanConfluent AlanConfluent deleted the allow_non_col_comparison branch March 9, 2021 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support calculated predicates in pull queries
3 participants