
Docs for primitive key support #4478

Merged
merged 80 commits into from
Feb 7, 2020

Conversation

big-andy-coates
Contributor

@big-andy-coates big-andy-coates commented Feb 7, 2020

Description

Fixes: #4143

This PR updates the docs to reflect the new primitive key functionality and some code changes to update DataGen.

  • ksqlDB now supports the following primitive key types: `INT`, `BIGINT`, and `DOUBLE`, as well as the existing `STRING` type.

    The key type can be defined in the CREATE TABLE or CREATE STREAM statement by including a column definition for `ROWKEY` in the form `ROWKEY <primitive-key-type> KEY,`, for example:

    `CREATE TABLE USERS (ROWKEY BIGINT KEY, NAME STRING, RATING DOUBLE) WITH (kafka_topic='users', VALUE_FORMAT='json');`
  • ksqlDB currently requires the name of the key column to be `ROWKEY`. Support for arbitrary key names is tracked by #3536 (Primitive Keys: allow key names other than ROWKEY).

  • ksqlDB currently requires keys to use the `KAFKA` format. Support for additional formats is tracked by https://github.com/confluentinc/ksql/projects/3.

  • Schema inference currently only works with `STRING` keys. Support for additional key types is tracked by #4462 (Retrieve key schemas from the schema registry). (Schema inference is where ksqlDB infers the schema of a CREATE TABLE or CREATE STREAM statement from the schema registered in the Schema Registry, as opposed to the user supplying the set of columns in the statement.)

  • Apache Kafka Connect can be configured to output keys in the `KAFKA` format by using a converter, e.g. `"key.converter": "org.apache.kafka.connect.converters.IntegerConverter"`. Details of which converter to use for which key type can be found in the `Connect Converter` column here: https://docs.confluent.io/current/ksql/docs/developer-guide/serialization.html#kafka.

  • @rmoff has written an introductory blog about primitive keys: https://rmoff.net/2020/02/07/primitive-keys-in-ksqldb/
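The Connect converter setting described in the bullets above can be sketched as a worker configuration fragment. Only the `key.converter` line comes from the description; the value-converter lines are illustrative assumptions:

```properties
# Serialize record keys as 32-bit integers, matching ksqlDB's KAFKA key format.
key.converter=org.apache.kafka.connect.converters.IntegerConverter
# Values serialized as JSON (illustrative; any value converter works here).
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
```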

BREAKING CHANGE: existing queries that perform a PARTITION BY or GROUP BY on a single column of one of the above supported primitive key types will now set the key to the appropriate type, rather than to a `STRING` as previously.
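To illustrate the breaking change above, a minimal sketch (stream and column names hypothetical): repartitioning by an `INT` column now yields an `INT` key where it previously yielded a `STRING` key:

```sql
-- clicks has an INT userId column in its value.
-- Before this change the resulting ROWKEY was a STRING;
-- it is now an INT, serialized in the KAFKA format.
CREATE STREAM clicks_by_user AS
  SELECT * FROM clicks PARTITION BY userId;
```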

Testing done

Ran through the docker and non-docker examples, ensuring statements work, updating statements and expected output where necessary.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@big-andy-coates big-andy-coates requested review from JimGalasyn and a team as code owners February 7, 2020 16:06
Contributor Author

@big-andy-coates big-andy-coates left a comment


Some reviewing notes....

big-andy-coates and others added 2 commits February 7, 2020 17:10
Co-Authored-By: Jim Galasyn <jim.galasyn@confluent.io>
big-andy-coates and others added 24 commits February 7, 2020 18:11
@big-andy-coates big-andy-coates merged commit ddf09d7 into confluentinc:master Feb 7, 2020
@big-andy-coates big-andy-coates deleted the prim_key_docs branch February 7, 2020 18:25
Contributor

@agavra agavra left a comment


LGTM

-- stream with INT userId stored in the value:
CREATE STREAM clicks (userId INT, url STRING) WITH(kafka_topic='clickstream', value_format='json');

-- table with BIGINT userId stored in the key:
CREATE TABLE users (ROWKEY BIGINT KEY, fullName STRING) WITH(kafka_topic='users', value_format='json');
Contributor


I think it would be good to have the example above explicitly name userId as the KEY - I know this isn't necessary for the example, but I think it's good to have an example of a stream with a key declared using the KEY syntax.

Contributor Author


I've deliberately not added the WITH KEY bit as it's not required. The example only has what's needed. Adding additional stuff is just noise, IMHO, and can lead to confusion.

Plus I intend to drop the whole WITH KEY thing soon.

So do you mind if we leave this as it is?
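For context, the legacy `WITH KEY` form being discussed would look roughly like the following (a sketch based on the earlier clickstream example; this syntax is slated for removal per the comment above):

```sql
-- Legacy hint that the value column userId duplicates the message key:
CREATE STREAM clicks (userId INT, url STRING)
  WITH (kafka_topic='clickstream', value_format='json', key='userId');
```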

@@ -398,14 +398,16 @@ message key by setting the `KEY` property of the `WITH` clause.
Example:

```sql
-CREATE TABLE users (registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid VARCHAR)
+CREATE TABLE users (rowkey INT KEY, registertime BIGINT, gender VARCHAR, regionid VARCHAR, userid INT)
Contributor


All the examples have added `ROWKEY <TYPE> KEY` - is this now a requirement, or is it just illustrative?

Contributor Author


It's not currently required. If you don't add one, KSQL currently defaults to `ROWKEY STRING KEY`. However, we should encourage people to be explicit rather than relying on the implicit default.

I've not documented this yet as it's a bit meh. Ideally, if you don't supply the key column, it should mean there is no key column. We can update the docs when that's the case.
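A sketch of the default described above (table and topic names hypothetical): omitting the key column currently behaves as if a `STRING` key had been declared explicitly:

```sql
-- These two statements are currently equivalent:
CREATE TABLE t1 (NAME STRING) WITH (kafka_topic='t', value_format='json');
CREATE TABLE t2 (ROWKEY STRING KEY, NAME STRING) WITH (kafka_topic='t', value_format='json');
```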

@big-andy-coates
Contributor Author

Comments above addressed in #4551

Successfully merging this pull request may close these issues.

Primitive Keys: Update docs for primitive keys
3 participants