Fix memory leak issue with AWS API usage #6160

jepett0 · 2024-07-02T08:43:44Z

A memory leak can be observed in the production and reproduced with a ydbd binary compiled with the --sanitize=leak flag and a ydb import s3 call. Detailed reproduction steps are described in the issue.

The exact reason for the leak is unknown to me. However, it seems to be caused by the indeterministic order of static variable destruction. Here is a relevant line from the docs of AWS C++ SDK:

The SDK for C++ and its dependencies use C++ static objects, and the order of static object destruction is not determined by the C++ standard. To avoid memory issues caused by the nondeterministic order of static variable destruction, do not wrap the calls to Aws::InitAPI and Aws::ShutdownAPI into another static object.

At first glance, it does not seem like TApiInitializer objects have static storage duration in the current code. However, if you change the TApiInitializer object to be a Singleton itself and run the same script as described in the issue, then you will get the exact same leak sanitizer report. It gave me an idea that the underlying issue might be fixed in the same way as a static TApiInitializer would be.

I have done some research on the topic of memory issues in AWS SDK and that is what I have found:

A question asking if multiple InitAPI / ShutdownAPI calls are ok. (The answer is "no".)
A comment elaborating a bit more on the possible problems of a static RAII wrapper of InitAPI / ShutdownAPI calls.
A blame of Apache Arrow code that contains changes that have happened between version 12 and 13. It is important to note that the exact same leak was fixed by some change in the Arrow code which happened between versions 12 and 13 and I believe this PR did it. To quote the fix:

When statically linking error with AWS it was possible to have a crash on shutdown/exit. Now that should no longer be possible. S3 can only be initialized and finalized once. S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.

Proposed fix

So the fix that I propose in this PR is simple: call InitAPI somewhere in the beginning of the main function and call ShutdownAPI somewhere close to the end of the main function. Specifically, make TApiInitializer a data member of TKikimrRunner.

Unit tests have to call InitAPI and ShutdownAPI explicitly before each test run. It must be done once per process, so a unit test hook is a good choice because individual test cases are run sequentially in a for loop in the same process by default.

jepett0 · 2024-07-02T17:03:02Z

ydb/core/wrappers/s3_storage_config.cpp

@@ -1,81 +1,15 @@
 #include "s3_storage.h"
 #include "s3_storage_config.h"

-#include <contrib/libs/aws-sdk-cpp/aws-cpp-sdk-core/include/aws/core/internal/AWSHttpResourceClient.h>


These includes were not used, I believe

jepett0 · 2024-07-02T17:09:33Z

ydb/core/wrappers/s3_storage_config.cpp

-
-struct TApiInitializer {
-    TApiInitializer() {
-        Options.httpOptions.initAndCleanupCurl = false;


I suppose there is no need to have a separate TCurlInitializer (which calls curl_global_init and cleanup), because if you enable the httpOptions.initAndCleanupCurl option, then curl_global_init and cleanup would be called from AWS SDK

jepett0 · 2024-07-02T17:11:38Z

ydb/core/wrappers/s3_storage_config.cpp

-        Options.httpOptions.initAndCleanupCurl = false;
-        InitAPI(Options);
-
-        Internal::CleanupEC2MetadataClient(); // speeds up config construction


This function is from Internal namespace. I don't think we should call it. Moreover, TApiInitializer would be initialized only once per ydbd application start, so its performance should not be an issue.

jepett0 · 2024-07-02T17:13:40Z

ydb/core/wrappers/s3_wrapper_ut.cpp

        Runtime.Reset();
+        S3Mock.Reset();


Reset in reverse order of initialization. It does not really matter

jepett0 · 2024-07-02T17:16:00Z

ydb/services/ydb/backup_ut/ya.make

I decided to move BackupRestore tests from ydb_import_ut.cpp, because its ya.make builds a huge binary with a lot of tests. None of these tests need Y_TEST_HOOK_BEFORE_RUN, except mine, so I decided to separate them.

Implement AWS API guard as a YDB GlobalObject instead of calling InitAPI / ShutdownAPI multiple times during the program run.

github-actions · 2024-08-07T12:16:10Z

⚪ 2024-08-07 12:16:09 UTC Pre-commit check for d7c414d has started.
⚪ 2024-08-07 12:19:02 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-07 12:26:08 UTC Build successful.

github-actions · 2024-08-07T12:16:20Z

⚪ 2024-08-07 12:16:20 UTC Pre-commit check for d7c414d has started.
⚪ 2024-08-07 12:19:15 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-07 13:52:11 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
15250	13899	0	4	1336	11

⚪ 2024-08-07 13:54:38 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟡 2024-08-07 14:05:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
18 (only retried tests)	4	0	2	4	8

⚪ 2024-08-07 14:07:14 UTC Failed tests rerun (try 3) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-07 14:15:11 UTC Tests successful.

Test history | Ya make output

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
14 (only retried tests)	6	0	0	0	8

🟢 2024-08-07 14:16:27 UTC Build successful.
🟢 2024-08-07 14:17:01 UTC ydbd size 8.1 GiB changed* by -11.2 KiB, which is <= 0 Bytes vs main: OK

ydbd size dash	main: `2b5467c`	merge: `d7c414d`	diff	diff %
ydbd size	8 672 546 288 Bytes	8 672 534 832 Bytes	-11.2 KiB	-0.000%
ydbd stripped size	471 982 856 Bytes	471 983 752 Bytes	+896 Bytes	+0.000%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

github-actions · 2024-08-07T12:16:22Z

⚪ 2024-08-07 12:16:21 UTC Pre-commit check for d7c414d has started.
⚪ 2024-08-07 12:19:14 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-07 14:29:38 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS	PASSED	ERRORS	FAILED	SKIPPED	MUTED^?
10913	10861	0	10	25	17

🟢 2024-08-07 14:30:52 UTC Build successful.
🟢 2024-08-07 14:31:22 UTC ydbd size 5.4 GiB changed* by +5.0 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash	main: `2b5467c`	merge: `d7c414d`	diff	diff %
ydbd size	5 831 510 096 Bytes	5 831 515 176 Bytes	+5.0 KiB	+0.000%
ydbd stripped size	1 464 849 872 Bytes	1 464 853 008 Bytes	+3.1 KiB	+0.000%

^{*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation}

jepett0 · 2024-08-07T12:47:21Z

@CyberROFL has pointed out the following potential problems:

KIKIMR-11055. Data race in Aws::S3::S3Client. See old thread sanitizer warnings here.
KIKIMR-12129. Potential data race in setenv, getenv. Fixed by replacing them with a call to CleanupEC2MetadataClient.

I believe that the fix proposed in this PR should avoid both problems.

Reasons why data races in Aws::S3::S3Client (KIKIMR-11055) are most probably avoided

There are 6 data races caught in KIKIMR-11055:

2 in curl_slist_append
in curl_easy_init
in Curl_open
in Curl_initinfo
in Curl_saferealloc

As of this PR, curl is managed by AWS SDK and I believe the authors are competent enough and the user base is large enough for the data races mentioned above to be eliminated.

Reasons why potential data races in setenv, getenv calls (KIKIMR-12129) are avoided

This PR does not introduce any new calls to them.

Thread sanitizer tests

I have run functional tests in ydb/services/ydb/backup_ut multiple times under TSAN and got the following results:

Test name	Runs in total	Runs with at least one data race
BackupRestore::Basic	10	0
BackupRestore::RestoreTablePartitioningSettings	10	2
BackupRestore::RestoreIndexTablePartitioningSettings	10	2
BackupRestoreS3::Basic	12	1
BackupRestoreS3::RestoreTablePartitioningSettings	10	1
BackupRestoreS3::RestoreIndexTablePartitioningSettings	10	0

The caught data races are the following:

Data races in AWS SDK stack were not observed.

Comparison with the base branch code

I have tested the code of the base commit of this PR from the main brach under TSAN and the results are the following:

Test name	Runs in total	Data races
BackupRestore::Basic	1	0
BackupRestore::RestoreTablePartitioningSettings	1	1
BackupRestore::RestoreIndexTablePartitioningSettings	1	2

Caught data races:

TBucketQuoter::FillBucket
TMemorizableControlWrapper::Update
AtomicSet

Summary

Data races caught in 62 runs of the code in this PR have all been caught in 3 runs of the base branch code. This makes me certain that I have not introduced any new data races.

jepett0 · 2024-08-07T13:28:26Z

ydb/core/driver_lib/run/run.cpp

@@ -1653,6 +1653,8 @@ TIntrusivePtr<TServiceInitializersList> TKikimrRunner::CreateServiceInitializers
        sil->AddServiceInitializer(new TGraphServiceInitializer(runConfig));
    }

+    sil->AddServiceInitializer(new TAwsApiInitializer(*this));


No dedicated service mask, always enabled. This would help us to avoid misconfiguration by users.

This reverts commit ff0fd33.

…ydb-platform#8698)

…m#6160)" (ydb-platform#8698)" This reverts commit 2091af7.

ydb-platform deleted a comment from github-actions bot Jul 2, 2024

This comment was marked as outdated.

Sign in to view

jepett0 marked this pull request as ready for review July 2, 2024 17:00

jepett0 requested review from CyberROFL, Enjection and ijon July 2, 2024 17:01

jepett0 commented Jul 2, 2024

View reviewed changes

ijon previously approved these changes Jul 2, 2024

View reviewed changes

This was referenced Jul 4, 2024

ImportFromS3Leak.static.1 #6150

Closed

Fix Aws::InitAPI memory leak #6039

Closed

jepett0 added 2 commits August 7, 2024 07:17

Fix memory leak issue with AWS API usage

e92053a

Implement AWS API guard as a YDB GlobalObject instead of calling InitAPI / ShutdownAPI multiple times during the program run.

Initialize AWS API always

3ca7872

jepett0 dismissed ijon’s stale review via 3ca7872 August 7, 2024 07:25

jepett0 force-pushed the ImportFromS3Leak.main.1 branch from 35b0937 to 3ca7872 Compare August 7, 2024 07:25

This comment was marked as outdated.

Sign in to view

fix ya.make to enable testing under thread sanitizer

667b6c4

jepett0 requested a review from ijon August 7, 2024 13:06

jepett0 commented Aug 7, 2024

View reviewed changes

ijon approved these changes Aug 7, 2024

View reviewed changes

jepett0 merged commit ff0fd33 into ydb-platform:main Aug 7, 2024
10 of 12 checks passed

This was referenced Aug 8, 2024

YQ revert local grpc peer value #7555

Merged

YQ kqprun pass actor system config #7693

Merged

jepett0 added a commit to jepett0/ydb that referenced this pull request Sep 3, 2024

Revert "Fix memory leak issue with AWS API usage (ydb-platform#6160)"

79283df

This reverts commit ff0fd33.

jepett0 added a commit that referenced this pull request Sep 3, 2024

Revert "Fix memory leak issue with AWS API usage (#6160)" (#8698)

2091af7

KirillKurdyukov pushed a commit to KirillKurdyukov/ydb that referenced this pull request Sep 4, 2024

Revert "Fix memory leak issue with AWS API usage (ydb-platform#6160)" (…

2bd3180

…ydb-platform#8698)

GrigoriyPA mentioned this pull request Sep 5, 2024

Fixed unit test ManyColumnShards #8777

Merged

jepett0 mentioned this pull request Sep 5, 2024

Restore the commit fixing a memory leak in AWS SDK #8823

Merged

jepett0 added a commit to jepett0/ydb that referenced this pull request Sep 5, 2024

Revert "Revert "Fix memory leak issue with AWS API usage (ydb-platfor…

33b866c

…m#6160)" (ydb-platform#8698)" This reverts commit 2091af7.

This was referenced Sep 9, 2024

Fix CLI max partitions default #8944

Merged

Fix reading without consumer from federation #9136

Merged

StekPerepolnen mentioned this pull request Sep 17, 2024

doc hc api images source #9376

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix memory leak issue with AWS API usage #6160

Fix memory leak issue with AWS API usage #6160

jepett0 commented Jul 2, 2024 •

edited

Loading

This comment was marked as outdated.

This comment was marked as outdated.

jepett0 Jul 2, 2024

jepett0 Jul 2, 2024

jepett0 Jul 2, 2024

jepett0 Jul 2, 2024

jepett0 Jul 2, 2024

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

github-actions bot commented Aug 7, 2024 •

edited

Loading

github-actions bot commented Aug 7, 2024 •

edited

Loading

github-actions bot commented Aug 7, 2024 •

edited

Loading

jepett0 commented Aug 7, 2024 •

edited

Loading

jepett0 Aug 7, 2024 •

edited

Loading

Fix memory leak issue with AWS API usage #6160

Fix memory leak issue with AWS API usage #6160

Conversation

jepett0 commented Jul 2, 2024 • edited Loading

Proposed fix

This comment was marked as outdated.

This comment was marked as outdated.

jepett0 Jul 2, 2024

Choose a reason for hiding this comment

jepett0 Jul 2, 2024

Choose a reason for hiding this comment

jepett0 Jul 2, 2024

Choose a reason for hiding this comment

jepett0 Jul 2, 2024

Choose a reason for hiding this comment

jepett0 Jul 2, 2024

Choose a reason for hiding this comment

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

github-actions bot commented Aug 7, 2024 • edited Loading

github-actions bot commented Aug 7, 2024 • edited Loading

github-actions bot commented Aug 7, 2024 • edited Loading

jepett0 commented Aug 7, 2024 • edited Loading

Reasons why data races in Aws::S3::S3Client (KIKIMR-11055) are most probably avoided

Reasons why potential data races in setenv, getenv calls (KIKIMR-12129) are avoided

Thread sanitizer tests

Comparison with the base branch code

Summary

jepett0 Aug 7, 2024 • edited Loading

Choose a reason for hiding this comment

jepett0 commented Jul 2, 2024 •

edited

Loading

github-actions bot commented Aug 7, 2024 •

edited

Loading

github-actions bot commented Aug 7, 2024 •

edited

Loading

github-actions bot commented Aug 7, 2024 •

edited

Loading

jepett0 commented Aug 7, 2024 •

edited

Loading

jepett0 Aug 7, 2024 •

edited

Loading