Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

insert_deduplication_token setting for INSERT statement #1

Merged
merged 7 commits into from
Jan 29, 2022

Conversation

devcrafter
Copy link
Owner

The setting allows a user to provide own deduplication semantic in Replicated*MergeTree
If provided, it's used instead of data digest to generate block ID
So, for example, by providing a unique value for the setting in each INSERT statement,
user can avoid the same inserted data being deduplicated

Issue: ClickHouse#7461

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

insert_deduplication_token setting for INSERT statement (only for Replicated*MergeTree tables)
The setting allows a user to provide own deduplication semantic in Replicated*MergeTree
For example, by providing a unique value for the setting in each INSERT statement,
the user can avoid the same inserted data being deduplicated

Detailed description / Documentation draft:
...

By adding documentation, you'll allow users to try your new feature immediately, not when someone else will have time to document it later. Documentation is necessary for all features that affect user experience in any way. You can add brief documentation draft above, or add documentation right into your patch as Markdown files in docs folder.

If you are doing this for the first time, it's recommended to read the lightweight Contributing to ClickHouse Documentation guide first.

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/

@devcrafter devcrafter force-pushed the deduplication_token_7461 branch 2 times, most recently from 1af0bfa to 77a388d Compare November 22, 2021 09:03
@devcrafter
Copy link
Owner Author

I'd consider limiting the size of insert_deduplication_token. What could be a reasonable number? I guess 1 KB should be fine ...

@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from 77a388d to 52ee695 Compare November 22, 2021 10:20
@alexey-milovidov
Copy link

Not necessary, it is limited implicitly nevertheless by memory limit and other limits.

@devcrafter devcrafter force-pushed the deduplication_token_7461 branch 2 times, most recently from 3a71303 to c7d987e Compare November 28, 2021 18:53
@devcrafter
Copy link
Owner Author

@devcrafter devcrafter force-pushed the deduplication_token_7461 branch 2 times, most recently from 2e4b96a to 4d94eb4 Compare December 1, 2021 20:43
devcrafter pushed a commit that referenced this pull request Dec 1, 2021
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from 4d94eb4 to ec8407a Compare December 1, 2021 20:54
@devcrafter
Copy link
Owner Author

@alesapin @alexey-milovidov Please let me know if you have any comments

@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from d142860 to f50f1fd Compare December 6, 2021 18:20
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from f50f1fd to 0f1ac1b Compare December 19, 2021 13:09
The setting allows a user to provide own deduplication semantic in Replicated*MergeTree
If provided, it's used instead of data digest to generate block ID
So, for example, by providing a unique value for the setting in each INSERT statement,
user can avoid the same inserted data being deduplicated

Inserting data within the same INSERT statement are split into blocks
according to the *insert_block_size* settings
(max_insert_block_size, min_insert_block_size_rows, min_insert_block_size_bytes).
Each block with the same INSERT statement will get an ordinal number.
The ordinal number is added to insert_deduplication_token to get block dedup token
i.e. <token>_0, <token>_1, ... Deduplication is done per block
So, to guarantee deduplication for two same INSERT queries,
dedup token and number of blocks to have to be the same

Issue: ClickHouse#7461
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from 0f1ac1b to 100ee92 Compare December 19, 2021 13:17
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch 2 times, most recently from 68f4659 to fe0695e Compare January 5, 2022 01:23
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from fe0695e to 0857a8d Compare January 9, 2022 18:06
+ sync replica before checks in replica test
+ use different table names for multi block tests
@devcrafter devcrafter force-pushed the deduplication_token_7461 branch from b3de677 to 5199365 Compare January 10, 2022 11:12
@devcrafter devcrafter merged commit f6684db into master Jan 29, 2022
devcrafter pushed a commit that referenced this pull request Dec 9, 2022
…reated

ASAN report:

    Code: 586. DB::ErrnoException: Cannot create file: /src/.clickhouse_history, errno: 2, strerror: No such file or directory. (CANNOT_CREATE_FILE)
    =================================================================
    ==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x6240000208f0 at pc 0x000030d22ade bp 0x7ffff2ff3f70 sp 0x7ffff2ff3f68
    READ of size 8 at 0x6240000208f0 thread T2
        #0 0x30d22add in DB::ProcessList::insert() build_docker/../src/Interpreters/ProcessList.cpp:89:36
        #1 0x31411018 in DB::executeQueryImpl() build_docker/../src/Interpreters/executeQuery.cpp:516:60
        ClickHouse#2 0x3140e1ab in DB::executeQuery() build_docker/../src/Interpreters/executeQuery.cpp:1083:30
        ClickHouse#3 0x3364391e in DB::LocalConnection::sendQuery() build_docker/../src/Client/LocalConnection.cpp:119:21
        ClickHouse#4 0x3367bab0 in DB::Suggest::fetch() build_docker/../src/Client/Suggest.cpp:141:16
        ClickHouse#5 0x336820eb in void DB::Suggest::load<DB::LocalConnection>()::'lambda'()::operator()() const build_docker/../src/Client/Suggest.cpp:118:17

    0x6240000208f0 is located 2032 bytes inside of 7056-byte region [0x624000020100,0x624000021c90)
    freed by thread T0 here:

        #0 0xe381ef2 in operator delete(void*, unsigned long) (/wrk/clickhouse-asan+0xe381ef2) (BuildId: 6ea6d1a5d2d5a164f60f0fd8230936305bc8d9d0)
        #1 0x335509fe in DB::ClientBase::~ClientBase() build_docker/../src/Client/ClientBase.cpp:293:25
        ClickHouse#2 0x1f809bd5 in mainEntryClickHouseLocal(int, char**) build_docker/../programs/local/LocalServer.cpp:804:5
        ClickHouse#3 0xe3856ad in main build_docker/../programs/main.cpp:482:12
        ClickHouse#4 0x7ffff7dc0082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

    previously allocated by thread T0 here:
        #0 0xe38128d in operator new(unsigned long) (/wrk/clickhouse-asan+0xe38128d) (BuildId: 6ea6d1a5d2d5a164f60f0fd8230936305bc8d9d0)
        #1 0x2f34a7f3 in std::__1::__unique_if<DB::ContextSharedPart>::__unique_single std::__1::make_unique[abi:v15003]<DB::ContextSharedPart>() build_docker/../contrib/libcxx/include/__memory/unique_ptr.h:714:28
        ClickHouse#2 0x2f34a7f3 in DB::Context::createShared() build_docker/../src/Interpreters/Context.cpp:603:32
        ClickHouse#3 0x1f7f901d in DB::LocalServer::processConfig() build_docker/../programs/local/LocalServer.cpp:535:22
        ClickHouse#4 0x1f7f4d92 in DB::LocalServer::main() build_docker/../programs/local/LocalServer.cpp:419:5
        ClickHouse#5 0x3af24ffe in Poco::Util::Application::run() build_docker/../contrib/poco/Util/src/Application.cpp:334:8
        ClickHouse#6 0x1f809bca in mainEntryClickHouseLocal(int, char**) build_docker/../programs/local/LocalServer.cpp:803:20
        ClickHouse#7 0xe3856ad in main build_docker/../programs/main.cpp:482:12
        ClickHouse#8 0x7ffff7dc0082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

    Thread T2 created by T0 here:
        #0 0xe32fedc in pthread_create (/wrk/clickhouse-asan+0xe32fedc) (BuildId: 6ea6d1a5d2d5a164f60f0fd8230936305bc8d9d0)
        #1 0x336806df in std::__1::__libcpp_thread_create[abi:v15003](unsigned long*, void* (*)(void*), void*) build_docker/../contrib/libcxx/include/__threading_support:376:10
        ClickHouse#2 0x336806df in std::__1::thread::thread<void DB::Suggest::load<DB::LocalConnection>()::'lambda'(), void>() build_docker/../contrib/libcxx/include/thread:311:16
        ClickHouse#3 0x3367ff5b in void DB::Suggest::load<DB::LocalConnection>(std::__1::shared_ptr<DB::Context const>, DB::ConnectionParameters const&, int) build_docker/../src/Client/Suggest.cpp:110:22
        ClickHouse#4 0x3357fee9 in DB::ClientBase::runInteractive() build_docker/../src/Client/ClientBase.cpp:2066:22
        ClickHouse#5 0x1f7f5264 in DB::LocalServer::main() build_docker/../programs/local/LocalServer.cpp
        ClickHouse#6 0x3af24ffe in Poco::Util::Application::run() build_docker/../contrib/poco/Util/src/Application.cpp:334:8
        ClickHouse#7 0x1f809bca in mainEntryClickHouseLocal(int, char**) build_docker/../programs/local/LocalServer.cpp:803:20
        ClickHouse#8 0xe3856ad in main build_docker/../programs/main.cpp:482:12
        ClickHouse#9 0x7ffff7dc0082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@what-the-diff
Copy link

what-the-diff bot commented Jan 25, 2023

  • Added a new setting insert_deduplication_token
  • Changed IMergeTreeDataPart::getZeroLevelPartBlockID to accept dedup token as an argument and use it instead of data digest if provided
  • Replaced calls to getZeroLevelPartBlockID with the ones that pass empty string or context's settings value for insert_deduplication_token depending on whether we need block id based on data digest or not (MergeTreeSink, MergeTreeData)
  • Updated tests: 02124 - added test case for multiple blocks in one query; 02119 - updated expected result after adding support for dedup tokens into INSERT SELECT queries
  • Added tests for deduplication token on MergeTree and ReplicatedMergeTree tables.
  • Fixed bug with incorrect deduplication of blocks by data digest when insert_deduplication_token is set to non-empty value ([experiment] apt-get upgrade in docker/server/Dockerfile ClickHouse/ClickHouse#12124).

devcrafter pushed a commit that referenced this pull request May 15, 2023
Reordering of the static variables should be enough to fix this:

    [ RUN      ] IOResourceStaticResourceManager.Prioritization
    =================================================================
    ==13==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f7d8d621970 at pc 0x5636b80dcbb2 bp 0x7f7c48e47dd0 sp 0x7f7c48e47dc8
    READ of size 8 at 0x7f7d8d621970 thread T3975 (ThreadPool)
        #0 0x5636b80dcbb1 in IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_1::operator()(long) const build_docker/./src/IO/Resource/tests/gtest_resource_manager_static.cpp:78:13
        #1 0x5636b80dcbb1 in IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0::operator()() const build_docker/./src/IO/Resource/tests/gtest_resource_manager_static.cpp:95:21
        ClickHouse#2 0x5636b80dcbb1 in decltype(std::declval<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&>()()) std::__1::__invoke[abi:v15000]<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23
        ClickHouse#3 0x5636b80dcbb1 in decltype(auto) std::__1::__apply_tuple_impl[abi:v15000]<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&, std::__1::tuple<>&>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&, std::__1::tuple<>&, std::__1::__tuple_indices<>) build_docker/./contrib/llvm-project/libcxx/include/tuple:1789:1
        ClickHouse#4 0x5636b80dcbb1 in decltype(auto) std::__1::apply[abi:v15000]<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&, std::__1::tuple<>&>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&, std::__1::tuple<>&) build_docker/./contrib/llvm-project/libcxx/include/tuple:1798:1
        ClickHouse#5 0x5636b80dcbb1 in ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'()::operator()() build_docker/./src/Common/ThreadPool.h:228:13
        ClickHouse#6 0x5636b80dcbb1 in decltype(std::declval<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>()()) std::__1::__invoke[abi:v15000]<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'()&>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23
        ClickHouse#7 0x5636b80dcbb1 in void std::__1::__invoke_void_return_wrapper<void, true>::__call<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'()&>(ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'()&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:479:9
        ClickHouse#8 0x5636b80dcbb1 in std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'(), void ()>::operator()[abi:v15000]() build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:235:12
        ClickHouse#9 0x5636b80dcbb1 in void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0>(IOResourceStaticResourceManager_Prioritization_Test::TestBody()::$_0&&)::'lambda'(), void ()>>(std::__1::__function::__policy_storage const*) build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:716:16
        ClickHouse#10 0x5636ea219310 in std::__1::__function::__policy_func<void ()>::operator()[abi:v15000]() const build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:848:16
        ClickHouse#11 0x5636ea219310 in std::__1::function<void ()>::operator()() const build_docker/./contrib/llvm-project/libcxx/include/__functional/function.h:1187:12
        ClickHouse#12 0x5636ea219310 in ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) build_docker/./src/Common/ThreadPool.cpp:416:13
        ClickHouse#13 0x5636ea2258ac in void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()::operator()() const build_docker/./src/Common/ThreadPool.cpp:180:73
        ClickHouse#14 0x5636ea2258ac in decltype(std::declval<void>()()) std::__1::__invoke[abi:v15000]<void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(void&&) build_docker/./contrib/llvm-project/libcxx/include/__functional/invoke.h:394:23
        ClickHouse#15 0x5636ea2258ac in void std::__1::__thread_execute[abi:v15000]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(std::__1::tuple<void, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>&, std::__1::__tuple_indices<>) build_docker/./contrib/llvm-project/libcxx/include/thread:284:5
        ClickHouse#16 0x5636ea2258ac in void* std::__1::__thread_proxy[abi:v15000]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*) build_docker/./contrib/llvm-project/libcxx/include/thread:295:5
        ClickHouse#17 0x7f7d8fcf1608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 7b4536f41cdaa5888408e82d0836e33dcf436466)
        ClickHouse#18 0x7f7d8fc16132 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f132) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)

    Address 0x7f7d8d621970 is located in stack of thread T0 at offset 368 in frame
        #0 0x5636b80d9bef in IOResourceStaticResourceManager_Prioritization_Test::TestBody() build_docker/./src/IO/Resource/tests/gtest_resource_manager_static.cpp:46

      This frame has 10 object(s):
        [32, 36) 'requests_per_thead' (line 48)
        [48, 208) 't' (line 49)
        [272, 296) 'ref.tmp' (line 51)
        [336, 352) 'last_priority' (line 74)
        [368, 376) 'check' (line 75) <== Memory access at offset 368 is inside this variable
        [400, 424) 'name' (line 83)
        [464, 512) 'ref.tmp13' (line 87)
        [544, 560) 'c' (line 101)
        [576, 600) 'ref.tmp32' (line 101)
        [640, 664) 'ref.tmp41' (line 102)

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
devcrafter pushed a commit that referenced this pull request Jun 12, 2023
Fixed convertFieldToType case of converting Date and Date32 to DateTime64 Field
devcrafter pushed a commit that referenced this pull request Jun 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants