Fix GROUP BY semantics for keys with any names #4898

big-andy-coates · 2020-03-26T00:09:35Z

If grouping by a single column, e.g. GROUP BY B, then the schema of the result should have a column named B, not ROWKEY.

If grouping by something other than a single column, then we should generate a unique column name, e.g. KSQL_COL_0.

Also, note we'll need a slight change in semantics:

An old-style GROUP BY on a single column might look like:

-- input schema: ROWKEY => B, C
CREATE TABLE X AS SELECT B, COUNT(*) AS COUNT FROM Y GROUP BY B;
-- output schema: ROWKEY => B, COUNT

Moving that same persistent query to the new world, where any key name goes, we run into a problem:

-- input schema: A => B, C
CREATE TABLE X AS SELECT B, COUNT(*) AS COUNT FROM Y GROUP BY B;
-- output schema: B => B, COUNT.  <= Duplicate column B!!!!

Hence, in the new world, the above statement will be rejected. This seems fine to me, as the data for column B is already in the key! If the user wants the data in the value as well, they can just add an alias.
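As a rough illustration of the alias workaround, assuming the semantics proposed in this issue: the alias name `B_VALUE` is made up for the example, and the exact behaviour may differ between ksqlDB versions.

```sql
-- input schema: A => B, C
-- B_VALUE is a hypothetical alias; any name that does not clash with the key column B should do.
CREATE TABLE X AS SELECT B AS B_VALUE, COUNT(*) AS COUNT FROM Y GROUP BY B;
-- expected output schema under the proposed semantics: B => B_VALUE, COUNT
```

The key column keeps the name B, while the value copy gets a distinct name, so the duplicate-column rejection should no longer apply.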


chore: add GROUP BY support for any key names

fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0 (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.

Co-authored-by: Big Andy Coates <andy@confluent.io>

big-andy-coates self-assigned this Mar 26, 2020

big-andy-coates mentioned this issue Mar 26, 2020

chore: add GROUP BY support for any key names #4899

Merged

2 tasks

big-andy-coates closed this as completed in #4899 Mar 27, 2020
