TiFlash may fail to start when deployed under disagg arch #9282

Closed
JaySon-Huang opened this issue Aug 1, 2024 · 1 comment · Fixed by #9283
Labels
affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. component/storage severity/major type/bug The issue is confirmed as a bug.

Comments

@JaySon-Huang
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

TiFlash write node failed to start when deployed under disagg arch

[2024/08/01 06:33:12.636 +00:00] [ERROR] [Exception.cpp:96] ["Code: 49, e.displayText() = DB::Exception: Restore position from BlobStat failed, the space/subspace is already being used, offset=0x15373 blob_id=551 page_id=0x020102000000018204A17C03 entry=PageEntry{file: 551, offset: 0x15373, size: 15, checksum: 0x12D46C1C0D5B51FE, tag: 0, field_offsets: [], checkpoint_info: invalid}, e.what() = DB::Exception, Stack trace:
  0xaaaabb76614c    StackTrace::StackTrace() [tiflash+34955596]
                    dbms/src/Common/StackTrace.cpp:23
  0xaaaac1a699b0    DB::Exception::Exception<unsigned long&, unsigned long&, DB::UniversalPageId const&, DB::PS::V3::PageEntryV3&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned long&, unsigned long&, DB::UniversalPageId const&, DB::PS::V3::PageEntryV3&) [tiflash+138779056]
                    dbms/src/Common/Exception.h:46
  0xaaaac1a68060    DB::PS::V3::PageDirectoryFactory<DB::PS::V3::universal::FactoryTrait>::restoreBlobStats(std::__1::unique_ptr<DB::PS::V3::PageDirectory<DB::PS::V3::universal::PageDirectoryTrait>, std::__1::default_delete<DB::PS::V3::PageDirectory<DB::PS::V3::universal::PageDirectoryTrait>>> const&) [tiflash+138772576]
                    dbms/src/Storages/Page/V3/PageDirectoryFactory.cpp:170
  0xaaaac1a67880    DB::PS::V3::PageDirectoryFactory<DB::PS::V3::universal::FactoryTrait>::createFromReader(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::shared_ptr<DB::PS::V3::WALStoreReader>, std::__1::unique_ptr<DB::PS::V3::WALStore, std::__1::default_delete<DB::PS::V3::WALStore>>) [tiflash+138770560]
                    dbms/src/Storages/Page/V3/PageDirectoryFactory.cpp:70
  0xaaaac1a674d0    DB::PS::V3::PageDirectoryFactory<DB::PS::V3::universal::FactoryTrait>::create(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::shared_ptr<DB::FileProvider>&, std::__1::shared_ptr<DB::PSDiskDelegator>&, DB::PS::V3::WALConfig const&) [tiflash+138769616]
                    dbms/src/Storages/Page/V3/PageDirectoryFactory.cpp:45
  0xaaaac1aade88    DB::UniversalPageStorage::restore() [tiflash+139058824]
                    dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp:89
  0xaaaac1ac1eb0    DB::UniversalPageStorageService::create(DB::Context&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::shared_ptr<DB::PSDiskDelegator>, DB::PageStorageConfig const&) [tiflash+139140784]
                    dbms/src/Storages/Page/V3/Universal/UniversalPageStorageService.cpp:57
  0xaaaac0be9c98    DB::Context::initializeWriteNodePageStorageIfNeed(DB::PathPool const&) [tiflash+123575448]
                    dbms/src/Interpreters/Context.cpp:1927
  0xaaaabb7dc2a8    DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&) [tiflash+35439272]
                    dbms/src/Server/Server.cpp:1362
  0xaaaac285630c    Poco::Util::Application::run() [tiflash+153379596]
                    contrib/poco/Util/src/Application.cpp:335
  0xaaaabb7d6dac    DB::Server::run() [tiflash+35417516]
                    dbms/src/Server/Server.cpp:263
  0xaaaabb7e2f48    mainEntryClickHouseServer(int, char**) [tiflash+35467080]
                    dbms/src/Server/Server.cpp:1947
  0xaaaabb73bc34    main [tiflash+34782260]
                    dbms/src/Server/main.cpp:173
  0xffff90ad73fc    __libc_start_call_main [libc.so.6+160764]
                    ./csu/../sysdeps/nptl/libc_start_call_main.h:58
  0xffff90ad74cc    __libc_start_main_impl [libc.so.6+160972]
                    ./csu/../csu/libc-start.c:392
  0xaaaabb73aeb0    _start [tiflash+34778800]"] [source="void DB::Context::initializeWriteNodePageStorageIfNeed(const PathPool &)"] [thread_id=1]

4. What is your TiFlash version? (Required)

v7.5.2

@JaySon-Huang JaySon-Huang added type/bug The issue is confirmed as a bug. component/storage labels Aug 1, 2024
@JaySon-Huang
Contributor Author

Dumping the entries from the PageStorage WAL shows that two entries cause this issue:

[2024/08/01 15:41:28.755 +08:00] [INFO] [PageDirectoryFactory.cpp:225] ["{type:VAR_ENT, page_id:0x020102000000009CA4430A08, ori_id:0x.0, version:108969363.0, entry:PageEntry{file: 551, offset: 0x15376, size: 0, checksum: 0x0, tag: 0, field_offsets: [], checkpoint_info: invalid}, being_ref_count:1}"] [thread_id=1]
[2024/08/01 15:41:50.199 +08:00] [INFO] [PageDirectoryFactory.cpp:225] ["{type:PUT    , page_id:0x020102000000018204A17C03, ori_id:0x.0, version:127726467.0, entry:PageEntry{file: 551, offset: 0x15373, size: 15, checksum: 0x12D46C1C0D5B51FE, tag: 0, field_offsets: [], checkpoint_info: invalid}, being_ref_count:1}"] [thread_id=1]

The first entry is a page with data.size == 0, placed at blob_id=551, offset=0x15376. Later, another page with data.size == 15 is placed at blob_id=551, offset=0x15373. The second page occupies the byte range [0x15373, 0x15382), which covers the offset of the first page. Our restore code cannot handle this situation, where an "empty page" sits inside the range of another page.
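The overlap can be checked mechanically on a list of dumped entries. The following is a minimal sketch, not TiFlash code: `DumpedEntry`, `emptyEntryCovered`, `findCoveredEmptyEntries`, and `checkIssueEntries` are hypothetical names introduced here, and the check only assumes that a conflict means an empty entry's offset falls strictly inside the byte range of a non-empty entry in the same blob file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record holding only the fields needed from a dumped WAL entry.
struct DumpedEntry
{
    uint64_t blob_id;
    uint64_t offset;
    uint64_t size;
};

// True when `empty` has size 0 and its offset lies strictly inside the byte
// range [other.offset, other.offset + other.size) of the same blob file.
bool emptyEntryCovered(const DumpedEntry & empty, const DumpedEntry & other)
{
    return empty.size == 0 && other.size > 0 && empty.blob_id == other.blob_id
        && empty.offset > other.offset && empty.offset < other.offset + other.size;
}

// Return index pairs (empty entry, covering entry) for every conflict found.
std::vector<std::pair<std::size_t, std::size_t>> findCoveredEmptyEntries(const std::vector<DumpedEntry> & entries)
{
    std::vector<std::pair<std::size_t, std::size_t>> conflicts;
    for (std::size_t i = 0; i < entries.size(); ++i)
        for (std::size_t j = 0; j < entries.size(); ++j)
            if (i != j && emptyEntryCovered(entries[i], entries[j]))
                conflicts.emplace_back(i, j);
    return conflicts;
}

// The two entries from the WAL dump: 0x15376 lies inside [0x15373, 0x15382).
void checkIssueEntries()
{
    const std::vector<DumpedEntry> dump{{551, 0x15376, 0}, {551, 0x15373, 15}};
    const auto conflicts = findCoveredEmptyEntries(dump);
    assert(conflicts.size() == 1);
    assert(conflicts[0].first == 0 && conflicts[0].second == 1);
}
```

The pairwise scan is quadratic, which is fine for inspecting a handful of suspicious entries; a full WAL dump would need sorting by (blob_id, offset) first.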

A minimal unit test that reproduces it:

TEST_F(PageDirectoryTest, EmptyPage)
{
    {
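        PageEntriesEdit edit;
        // Page 9: an empty page (size == 0) whose offset lies inside page 10's range.
        edit.put(buildV3Id(TEST_NAMESPACE_ID, 9), PageEntryV3{.file_id = 551, .size = 0, .offset = 0x15376});
        // Page 10: 15 bytes starting at 0x15373, covering the offset of page 9.
        edit.put(buildV3Id(TEST_NAMESPACE_ID, 10), PageEntryV3{.file_id = 551, .size = 15, .offset = 0x15373});
        dir->apply(std::move(edit));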
    }

    auto s0 = dir->createSnapshot();
    auto edit = dir->dumpSnapshotToEdit(s0);
    auto restore_from_edit = [](const PageEntriesEdit & edit, BlobStats & stats) {
        auto deseri_edit = u128::Serializer::deserializeFrom(u128::Serializer::serializeTo(edit), nullptr);
        auto provider = DB::tests::TiFlashTestEnv::getDefaultFileProvider();
        auto path = getTemporaryPath();
        PSDiskDelegatorPtr delegator = std::make_shared<DB::tests::MockDiskDelegatorSingle>(path);
        PageDirectoryFactory<u128::FactoryTrait> factory;
        auto d
            = factory.setBlobStats(stats).createFromEditForTest(getCurrentTestName(), provider, delegator, deseri_edit);
        return d;
    };

    {
        auto path = getTemporaryPath();
        PSDiskDelegatorPtr delegator = std::make_shared<DB::tests::MockDiskDelegatorSingle>(path);
        auto config = BlobConfig{};
        BlobStats stats(log, delegator, config);
        {
            std::lock_guard lock(stats.lock_stats);
            stats.createStatNotChecking(551, BLOBFILE_LIMIT_SIZE, lock);
        }
        auto restored_dir = restore_from_edit(edit, stats);
        auto snap = restored_dir->createSnapshot();
        getNormalPageIdU64(restored_dir, 9, snap);
        getNormalPageIdU64(restored_dir, 10, snap);
    }
}
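
One way such a restore can fail is easier to see on a toy model. The sketch below is not TiFlash's SpaceMap and not the fix merged in #9283. `ToySpaceMap`, `Entry`, and `restoreAll` are hypothetical, and the model assumes that marking a range as used succeeds only when the range fits inside a single free block, and that marking a zero-length range still splits the free block at that offset. Under those assumptions, restoring the empty page first leaves a boundary at 0x15376, so the 15-byte page at 0x15373 no longer fits in one free block. Skipping empty entries during restore avoids the split.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Toy free-space map: key = start offset of a free block, value = its length.
class ToySpaceMap
{
public:
    explicit ToySpaceMap(uint64_t capacity) { free_blocks[0] = capacity; }

    // Mark [offset, offset + size) as used. Succeeds only if the whole range
    // lies inside one free block. A zero-length range still splits the block.
    bool markUsed(uint64_t offset, uint64_t size)
    {
        auto it = free_blocks.upper_bound(offset);
        if (it == free_blocks.begin())
            return false; // no free block starts at or before `offset`
        it = std::prev(it);
        const uint64_t start = it->first;
        const uint64_t end = start + it->second;
        if (offset + size > end)
            return false; // range crosses the end of this free block
        free_blocks.erase(it);
        if (offset > start)
            free_blocks[start] = offset - start; // free part before the range
        if (offset + size < end)
            free_blocks[offset + size] = end - (offset + size); // free part after
        return true;
    }

private:
    std::map<uint64_t, uint64_t> free_blocks;
};

struct Entry
{
    uint64_t offset;
    uint64_t size;
};

// Restore entries in order; optionally skip entries that occupy no bytes.
bool restoreAll(ToySpaceMap & smap, const std::vector<Entry> & entries, bool skip_empty)
{
    for (const auto & e : entries)
    {
        if (skip_empty && e.size == 0)
            continue;
        if (!smap.markUsed(e.offset, e.size))
            return false;
    }
    return true;
}
```

With the two entries from this issue, `restoreAll` returns false when empty entries are marked and true when they are skipped. Whether the real implementation splits free space the same way is an assumption of this model.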

@JaySon-Huang JaySon-Huang added affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. and removed may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 may-affects-8.1 labels Aug 5, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in #9283 Aug 5, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in dc20fe9 Aug 5, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 5, 2024
)

close #9282

PageStorage: Fix empty page cause TiFlash failed to start

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

Co-authored-by: JaySon <tshent@qq.com>
Co-authored-by: JaySon-Huang <tshent@qq.com>