-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Allows lots of table scans cases where keys cannot easily be extracted #7155
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice quick win in usability
ksqldb-engine/src/main/java/io/confluent/ksql/engine/generic/GenericRecordFactory.java
Show resolved
Hide resolved
@@ -345,6 +363,28 @@ private void setTableScanOrElseThrow(final Supplier<KsqlException> exceptionSupp | |||
} | |||
} | |||
|
|||
private final class NonColumnRefValidator extends TraversalExpressionVisitor<Object> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private final class NonColumnRefValidator extends TraversalExpressionVisitor<Object> { | |
private final class HasColumnRef extends TraversalExpressionVisitor<Object> { |
It was really tough to grok nonColumnRefUnresolvable
- it looks like what this does is answer "did we visit a column reference?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right that this is a more descriptive name. I had originally thought it might do other checks, but I'll go with HasColumnRef
.
"statements": [ | ||
"CREATE STREAM INPUT (ID INTEGER KEY, IGNORED INT) WITH (kafka_topic='test_topic', value_format='JSON');", | ||
"CREATE TABLE AGGREGATE AS SELECT ID, COUNT(1) AS COUNT FROM INPUT GROUP BY ID;", | ||
"SELECT * FROM AGGREGATE WHERE ID = 20 - 10;", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT * FROM AGGREGATE WHERE ID = 20 - 10;
at some point, this should result in a Key lookup right? What's the status of #4484 after we merge this PR? We might want to consider renaming that one (or opening a new one). I think it's important to consider that alongside this PR because we'd probably want a single-pass strategy (naively, we could just use the GenericExpressionResolver to try to resolve the non-key side of the equation and if it throws we assume it's not resolvable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should result in a key lookup. It should fall into the case of non column ref needs to be resolved.
I think #4484 should be resolved after this.
I was debating whether to do it that way, and I'm not opposed to doing it like that. The current code is a common pattern where there's a validation and an extraction pass. In this case, it makes dealing with the consequences a bit easier (we can just declare table scan required with the other similar checks and avoid extraction). Doing some of this in extraction would just slightly complicate that code a bit. Or did you mean merging Validation and Extraction more broadly to a single pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ignore that original comment, I hadn't properly understood the code on my first pass. Originally I had thought that we would table scan for ID = 20 - 10
but now I'm seeing that we don't. I think the current implementation is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a pass, overall LGTM!
ksqlConfig.getBoolean(KsqlConfig.KSQL_QUERY_PULL_INTERPRETER_ENABLED) | ||
).resolve(other); | ||
|
||
if (obj instanceof Integer || obj instanceof Long) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we allow float/double to be casted to INT as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GenericExpressionResolver is in strict mode which I don't believe allows this. We don't do a lot of conversions elsewhere automatically, such as with the key values (and would be a bit harder to generify), so I think this if fine for consistency.
final UnqualifiedColumnReferenceExp column = getColumnRefSide(node); | ||
final Expression other = getNonColumnRefSide(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm wondering if we can merge these two conditions into a single one, like:
first = getFirstColumnRefSide(); // if both are column ref, return left; if non are column ref, return null;
if (first != null) {
second = getOtherSide();
secondHasColumnRef = new ... ; secondHasColumnRef.process(second);
if (second isColumnRef || secondHasColumnRef.hasColumnRef)
setTableScan();
} else {
setTableScan(); // i.e. for `WHERE 100 = 100` or `100 = 101`, we would always do table scan
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I reoriented the logic similar to this and allowed getFirstColumnRefSide to return null so that it can be checked for, allowing me to get rid of the other method.
@@ -295,7 +297,23 @@ public Void visitComparisonExpression( | |||
final ComparisonExpression node, | |||
final Object context | |||
) { | |||
if (!isSingleColumnReference(node)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just occurred to me: if the WHERE
clause is just 100 = 101
then we would still do a table scan instead of immediately return empty values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's unfortunately the case today. I have a ticket #6973 which generally covers this kind of thing. You can simplify key ranges or even see that none exist at all, or that 100 = 101
can be simplified without any row information. Feel free to either add some cases to that ticket or even create another, so that we don't forget all of the optimization ideas we have.
fixes #4484
Description
This allows many cases when table scans are enabled. Prior to this, a comparison must have been of the form
KEY=12345
or possiblyKEY=1 + 2
where there was a column on one side and a resolvable expression on the other. When table scans are enabled, now these can be done:KEY = COL
)5 > 3
,KEY + 1 = COL + 1
)KEY = COL + 10
)Resolves #4484
Testing done
RQTT tests and unit tests
Reviewer checklist