Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: do not allow duplicates in schemas by default #4697

Closed

Conversation

big-andy-coates
Copy link
Contributor

Description

KsqlDB does not support data sources with duplicate column names in the key and value. (Mainly due to the fact that it copies the key columns into the value while processing with Kafka Streams).

It therefore makes sense, to me at least, that our LogicalSchema rejects duplicates by default, but allows them if explicitly told to do so. Doesn't change anything functionally, but does mean code is more explicit about when it does and does not allow duplicates.

Thoughts? Is this an improvement or just noise?

Testing done

Usual.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

KsqlDB does not support data sources with duplicate column names in the key and value.  (Mainly due to the fact that it copies the key columns into the value while processing with Kafka Streams).

It therefore makes sense, to me at least, that our `LogicalSchema` rejects duplicates by default, but allows them if explicitly told to do so.  Doesn't change anything functionally, but does mean code is more explicit about when it does and does not allow duplicates.
@big-andy-coates big-andy-coates requested a review from a team as a code owner March 3, 2020 21:02
@agavra
Copy link
Contributor

agavra commented Mar 4, 2020

KsqlDB does not support data sources with duplicate column names in the key and value.

How does it fail currently?

@big-andy-coates
Copy link
Contributor Author

KsqlDB does not support data sources with duplicate column names in the key and value.

How does it fail currently?

Because key columns must be named ROWKEY and ROWKEY is a system column name, and you're not allowed to have a persistent query with a a system column name in the value schema.

However, I'm dropping the requirement that key columns must be called ROWKEY ...

@agavra
Copy link
Contributor

agavra commented Mar 4, 2020

ah yes... that makes sense. In that case I think this PR is good - i'll take a look at the code itself in a bit

*
* @return self.
*/
public Builder allowDuplicates() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline, I think having this makes things a little more confusing. If we wanted to make it strongly typed, I think we should have a different class that allows duplicates but I think that's overkill

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, cool, my thinking was going down the same route.

@big-andy-coates big-andy-coates deleted the schema_duplicates branch March 11, 2020 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants