Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GROUP BY semantics for keys with any names #4898

Closed
big-andy-coates opened this issue Mar 26, 2020 · 0 comments · Fixed by #4899
Closed

Fix GROUP BY semantics for keys with any names #4898

big-andy-coates opened this issue Mar 26, 2020 · 0 comments · Fixed by #4899
Assignees

Comments

@big-andy-coates
Copy link
Contributor

If grouping by a single column, e.g. GROUP BY B, then the schema of the result should have a column named B, not ROWKEY.

If grouping by something other than a single column, then we should generate a unique column name, e.g. KSQL_COL_0.

Also, note we'll need a slight change in semantics:

An old style GROUP BY a single column might look like:

-- input schema: ROWKEY => B, C
CREATE TABLE X AS SELECT B, COUNT() AS COUNT FROM Y GROUP BY B;
-- output schema: ROWKEY => B, COUNT

Moving that same persistent query to the new world of any key name goes and we run into a problem:

-- input schema: A => B, C
CREATE TABLE X AS SELECT B, COUNT() AS COUNT FROM Y GROUP BY B;
-- output schema: B => B, COUNT.  <= Duplicate column B!!!!

Hence, in the new world, the above statement will be rejected. This seems fine to me as the data for column B is already in the key! If the user wants the data in the value they can just add an aliases.

@big-andy-coates big-andy-coates self-assigned this Mar 26, 2020
big-andy-coates added a commit to big-andy-coates/ksql that referenced this issue Mar 26, 2020
fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
agavra pushed a commit to big-andy-coates/ksql that referenced this issue Mar 26, 2020
fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
big-andy-coates added a commit that referenced this issue Mar 27, 2020
* chore: add GROUP BY support for any key names

fixes: #4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.

Co-authored-by: Big Andy Coates <andy@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant