Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify timezone for DateTime / DateTime64 #349

Closed
aadant opened this issue Oct 26, 2023 · 17 comments
Closed

Specify timezone for DateTime / DateTime64 #349

aadant opened this issue Oct 26, 2023 · 17 comments
Assignees
Labels
enhancement New feature or request GA-1 All the issues that are issues in release(Scheduled Dec 2023) high-priority qa-verified label to mark issues that were verified by QA

Comments

@aadant
Copy link
Collaborator

aadant commented Oct 26, 2023

Specify the default timezone for the CH server :

clickhouse.datetime.timezone=UTC

This setting would control the timezone for DateTime* data types :

`ts` DateTime64(6, 'UTC')
@aadant
Copy link
Collaborator Author

aadant commented Oct 26, 2023

This is related to the source setting, ideally both are the same for usual migration / replication :

database.connectionTimeZone: "America/Chicago"

@IlyaTsoi
Copy link
Contributor

IlyaTsoi commented Oct 27, 2023

@aadant, I'm trying to solve this. I think that would be better if it didn't solve on a column data type level, but on INSERT query level. We can use UTC by default within sink-connector and use toDateTime*() within INSERT query with indicating time zone:

INSERT INTO table(`date_column`) VALUES (toDateTime64('2023-04-29 06:36:28.46277', 6, 'UTC'))

Or always use integer for DateTime types within INSERT query. This is from Clickhouse docs:

When inserting datetime as an integer, it is treated as an appropriately scaled Unix Timestamp (UTC)

So, it would be nice if ZonedTimestampConverter() returned an integer instead of a string.

@aadant
Copy link
Collaborator Author

aadant commented Oct 28, 2023

@IlyaTsoi sure if clickhouse.server.default_timezone is not specified, it falls back to the server time zone. However, in order to :

  • emulate the DateTime behavior of MySQL (insensitive to TZ change)
  • source and target servers timezone difference
  • source table representing UTC DateTime replicating to a non UTC server

we need this override.

Currently the sink-connector casts all DateTimes to UTC and it is influenced by database.connectionTimeZone.

UTC -> UTC works
TZ -> same TZ works if database.connectionTimeZone=TZ(@subkanthi tested that, I need to double check)

Ideally we need to override at the column level (overrides are not implemented yet, except at the global level)
see #323

@aadant
Copy link
Collaborator Author

aadant commented Oct 28, 2023

Actually the difficulty is that DateTime In MySQL is like String representing Date and Time, while DateTime* in CH behave like MySQL Timestamp. In addition and unlike MySQL, CH can stick the TZ at the column level.

https://dev.mysql.com/doc/refman/8.0/en/datetime.html

@IlyaTsoi
Copy link
Contributor

CH can stick the TZ at the column level.

Right, but it uses TZ column metadata when calculates inserted data. When you execute SELECT query from DateTime* column it returns DateTime* in UTC. And ClickHouse stores DateTime* in UTC.

CREATE TABLE test_datetime (`id` Int8, `value` DateTime64(6, 'Europe/Moscow')) ENGINE = MergeTree ORDER BY value;
INSERT INTO test_datetime(id, value)
VALUES 
	(1,1546300800000000) -- will be inserted as UTC
	,(2,'2019-01-01 00:00:00'); -- will be recalculated considering TZ metadata column value
SELECT
	id
	,value AS `as stored`
	,toTimezone(value, 'Europe/Paris') AS `to 'Europe/Paris' TZ`
	,toString(value) AS `affected by column TZ metadata`
FROM test_datetime
ORDER BY id;
id as stored to 'Europe/Paris' TZ affected by column TZ metadata
1 2019-01-01 00:00:00 2019-01-01 01:00:00.000 +0100 2019-01-01 03:00:00.000000
2 2018-12-31 21:00:00 2018-12-31 22:00:00.000 +0100 2019-01-01 00:00:00.000000

This is from docs:

Internally, stores data as a number of ‘ticks’ since epoch start (1970-01-01 00:00:00 UTC) as Int64. The tick resolution is determined by the precision parameter. Additionally, the DateTime64 type can store time zone that is the same for the entire column, that affects how the values of the DateTime64 type values are displayed in text format and how the values specified as strings are parsed (‘2020-01-01 05:00:01.000’).

And now there is a problem, when a inserted data in String type is recalculated to a wrong value. I have checked all the methods of DebeziumConverter class and defined, that only ZonedTimestampConverter() returns a String value. It also should return a number to avoid the differences of INSERT behavior.
The basic logic as I see it - if source data doesn't have any TZ information - one is considered like UTC (or we can consider database.connectionTimeZone to recalculate it to UTC). If source data has TZ information - it's recalculated to UTC and then inserted in ClickHouse. We don't need to control a column metadata, let it be the same with a Clickhouse server, it doesn't matter.
Partly I have already implemented it here.
Maybe we are telling about different issues? Then sorry, i misunderstood)

@aadant
Copy link
Collaborator Author

aadant commented Oct 29, 2023

@IlyaTsoi I agree with what you say. In the general case, we should be able to migrate from a MySQL server with the same time zone without issue.

Suppose the MySQL database stores DateTimes from a different TZ (like UTC in a America/Chicago database), how do you migrate them accurately ?

The MySQL database could also contain DateTimes from different TZ in which case you need an "override" per database / table / column basis.

What I propose here is an optional variable, by default the sink-connector will assume that the timezone is the target CH server timezone. This is an attempt to override it globally before it is possible to override it more granularly.

@IlyaTsoi
Copy link
Contributor

@aadant OK, I did catch the main idea. I just want that all DateTime data was inserted as UTC to a Clickhouse. So we should calculate UTC from an optional parameter for database/table/column. And one more thing - we shoul use long instead of String type within an INSERT query for DateTime data. I would be happy if you considered my wishes). Thanks for your job!

@subkanthi subkanthi added the enhancement New feature or request label Nov 2, 2023
@IlyaTsoi
Copy link
Contributor

IlyaTsoi commented Nov 9, 2023

@aadant, @subkanthi look,please here. I have implemented the similar behavior. Long story short, I added a string config parameter source.datetime.timezone (default - UTC). It makes Clickhouse to recalculate DateTime values if this parameter value differs from UTC. Its value is appended within input() of an INSERT query for DateTime* types. I have had to edit the *timestamp*.convert() methods to make them return strings. I can make a PR if my solution is reasonable for this issue.

@aadant
Copy link
Collaborator Author

aadant commented Nov 10, 2023

@IlyaTsoi I reviewed the code, looks good to me for DateTime (up to second) but you should also support DateTime([1-6])

For MySQL Timestamp([1-6]), it also needs to be tested depending how Debezium pre-processed them.

@subkanthi I noticed this problem with current 2023-11-09 this is a promising fix

@IlyaTsoi
Copy link
Contributor

@aadant thanks for review. I fixed

@aadant
Copy link
Collaborator Author

aadant commented Nov 12, 2023

@IlyaTsoi : did you try to replicate the data_types schema (UTC to UTC, non UTC to same non UTC, non UTC to DateTime with UTC) ?

@IlyaTsoi
Copy link
Contributor

IlyaTsoi commented Nov 12, 2023

@aadant, of course, I haven’t tested it enough and not for all cases. I added a small unit test and it shows the expected result. I ran it for different configuration values of ch-server-TZ and connection-TZ. Additionally, I migrated a real 10G dataset from MSSQL to Clickhouse without any errors. Sorry, I didn't catch what do you mean by, for example, replication from UTC to non-UTC? I do nothing on Clickhouse side, only tell it to consider incoming data like one with TZ offset. Except ZonedTimestamp - it is inserted without changes.

@aadant
Copy link
Collaborator Author

aadant commented Nov 13, 2023

@IlyaTsoi : I created #374 and started to review there. It looks promising as it solved my particular case.
Then I tried to play with the source timezone and got an issue (see the exception). I think the design is good, I would recommend to write to String / Nullable(String) instead of DateTime64. This way CH will do the conversion on the server side.

To test this, we have a data_types schema that @subkanthi can test with different TZ (UTC to UTC, America/Chicago to America/Chicago)

@subkanthi
Copy link
Collaborator

@IlyaTsoi , there is a Integration test here - https://github.com/Altinity/clickhouse-sink-connector/blob/develop/sink-connector-lightweight/src/test/java/com/altinity/clickhouse/debezium/embedded/ddl/parser/ClickHouseDebeziumEmbeddedDDLCreateTableIT.java , we can extend this by setting the connectionTimeZone configuration in debezium and setting the timezone in Clickhouse by modifying init_clickhouse.sql

@subkanthi
Copy link
Collaborator

subkanthi commented Dec 1, 2023

Questions:
Will this only affect autocreate table feature in sink connector and we dont have to worry about tables that already exist?

@aadant
Copy link
Collaborator Author

aadant commented Dec 1, 2023

Yes

@subkanthi subkanthi added the GA-1 All the issues that are issues in release(Scheduled Dec 2023) label Dec 18, 2023
@Selfeer
Copy link
Collaborator

Selfeer commented Feb 8, 2024

The issue was manually verified by the Altinity QA team and marked as qa-verified.

Build used for testing: altinityinfra/clickhouse-sink-connector:443-14a14ba5c18fb3bc9f57c48d9a5079d4d6fe15e8-lt

We've verified that modifying the value of the clickhouse.datetime.timezone config forcefully overrides the default timezone for ClickHouse tables with timezone.

The timezones used for testing were taken from the ClickHouse system.time_zones table.

@Selfeer Selfeer added the qa-verified label to mark issues that were verified by QA label Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request GA-1 All the issues that are issues in release(Scheduled Dec 2023) high-priority qa-verified label to mark issues that were verified by QA
Projects
None yet
Development

No branches or pull requests

4 participants