Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak issue with AWS API usage #6160

Merged
merged 3 commits into from
Aug 7, 2024

Conversation

jepett0
Copy link
Collaborator

@jepett0 jepett0 commented Jul 2, 2024

#5737

A memory leak can be observed in the production and reproduced with a ydbd binary compiled with the --sanitize=leak flag and a ydb import s3 call. Detailed reproduction steps are described in the issue.

The exact reason for the leak is unknown to me. However, it seems to be caused by the indeterministic order of static variable destruction. Here is a relevant line from the docs of AWS C++ SDK:

The SDK for C++ and its dependencies use C++ static objects, and the order of static object destruction is not determined by the C++ standard. To avoid memory issues caused by the nondeterministic order of static variable destruction, do not wrap the calls to Aws::InitAPI and Aws::ShutdownAPI into another static object.

At first glance, it does not seem like TApiInitializer objects have static storage duration in the current code. However, if you change the TApiInitializer object to be a Singleton itself and run the same script as described in the issue, then you will get the exact same leak sanitizer report. It gave me an idea that the underlying issue might be fixed in the same way as a static TApiInitializer would be.

I have done some research on the topic of memory issues in AWS SDK and that is what I have found:

  • A question asking if multiple InitAPI / ShutdownAPI calls are ok. (The answer is "no".)
  • A comment elaborating a bit more on the possible problems of a static RAII wrapper of InitAPI / ShutdownAPI calls.
  • A blame of Apache Arrow code that contains changes that have happened between version 12 and 13. It is important to note that the exact same leak was fixed by some change in the Arrow code which happened between versions 12 and 13 and I believe this PR did it. To quote the fix:

When statically linking error with AWS it was possible to have a crash on shutdown/exit. Now that should no longer be possible. S3 can only be initialized and finalized once. S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.

Proposed fix

So the fix that I propose in this PR is simple: call InitAPI somewhere in the beginning of the main function and call ShutdownAPI somewhere close to the end of the main function. Specifically, make TApiInitializer a data member of TKikimrRunner.

Unit tests have to call InitAPI and ShutdownAPI explicitly before each test run. It must be done once per process, so a unit test hook is a good choice because individual test cases are run sequentially in a for loop in the same process by default.

@ydb-platform ydb-platform deleted a comment from github-actions bot Jul 2, 2024
@ydb-platform ydb-platform deleted a comment from github-actions bot Jul 2, 2024
@ydb-platform ydb-platform deleted a comment from github-actions bot Jul 2, 2024

This comment was marked as outdated.

This comment was marked as outdated.

@jepett0 jepett0 marked this pull request as ready for review July 2, 2024 17:00
@jepett0 jepett0 requested review from CyberROFL, Enjection and ijon July 2, 2024 17:01
@@ -1,81 +1,15 @@
#include "s3_storage.h"
#include "s3_storage_config.h"

#include <contrib/libs/aws-sdk-cpp/aws-cpp-sdk-core/include/aws/core/internal/AWSHttpResourceClient.h>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These includes were not used, I believe


struct TApiInitializer {
TApiInitializer() {
Options.httpOptions.initAndCleanupCurl = false;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there is no need to have a separate TCurlInitializer (which calls curl_global_init and cleanup), because if you enable the httpOptions.initAndCleanupCurl option, then curl_global_init and cleanup would be called from AWS SDK

Options.httpOptions.initAndCleanupCurl = false;
InitAPI(Options);

Internal::CleanupEC2MetadataClient(); // speeds up config construction
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is from Internal namespace. I don't think we should call it. Moreover, TApiInitializer would be initialized only once per ydbd application start, so its performance should not be an issue.

Runtime.Reset();
S3Mock.Reset();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reset in reverse order of initialization. It does not really matter

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to move BackupRestore tests from ydb_import_ut.cpp, because its ya.make builds a huge binary with a lot of tests. None of these tests need Y_TEST_HOOK_BEFORE_RUN, except mine, so I decided to separate them.

ijon
ijon previously approved these changes Jul 2, 2024
jepett0 added 2 commits August 7, 2024 07:17
Implement AWS API guard as a YDB GlobalObject instead of calling InitAPI / ShutdownAPI multiple times during the program run.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Copy link

github-actions bot commented Aug 7, 2024

2024-08-07 12:16:09 UTC Pre-commit check for d7c414d has started.
2024-08-07 12:19:02 UTC Check linux-x86_64-release-clang14 is running...
🟢 2024-08-07 12:26:08 UTC Build successful.

Copy link

github-actions bot commented Aug 7, 2024

2024-08-07 12:16:20 UTC Pre-commit check for d7c414d has started.
2024-08-07 12:19:15 UTC Check linux-x86_64-relwithdebinfo is running...
🟡 2024-08-07 13:52:11 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
15250 13899 0 4 1336 11

2024-08-07 13:54:38 UTC Failed tests rerun (try 2) linux-x86_64-relwithdebinfo is running...
🟡 2024-08-07 14:05:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
18 (only retried tests) 4 0 2 4 8

2024-08-07 14:07:14 UTC Failed tests rerun (try 3) linux-x86_64-relwithdebinfo is running...
🟢 2024-08-07 14:15:11 UTC Tests successful.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
14 (only retried tests) 6 0 0 0 8

🟢 2024-08-07 14:16:27 UTC Build successful.
🟢 2024-08-07 14:17:01 UTC ydbd size 8.1 GiB changed* by -11.2 KiB, which is <= 0 Bytes vs main: OK

ydbd size dash main: 2b5467c merge: d7c414d diff diff %
ydbd size 8 672 546 288 Bytes 8 672 534 832 Bytes -11.2 KiB -0.000%
ydbd stripped size 471 982 856 Bytes 471 983 752 Bytes +896 Bytes +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

Copy link

github-actions bot commented Aug 7, 2024

2024-08-07 12:16:21 UTC Pre-commit check for d7c414d has started.
2024-08-07 12:19:14 UTC Check linux-x86_64-release-asan is running...
🔴 2024-08-07 14:29:38 UTC Some tests failed, follow the links below.

Test history | Ya make output

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
10913 10861 0 10 25 17

🟢 2024-08-07 14:30:52 UTC Build successful.
🟢 2024-08-07 14:31:22 UTC ydbd size 5.4 GiB changed* by +5.0 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 2b5467c merge: d7c414d diff diff %
ydbd size 5 831 510 096 Bytes 5 831 515 176 Bytes +5.0 KiB +0.000%
ydbd stripped size 1 464 849 872 Bytes 1 464 853 008 Bytes +3.1 KiB +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@jepett0
Copy link
Collaborator Author

jepett0 commented Aug 7, 2024

@CyberROFL has pointed out the following potential problems:

I believe that the fix proposed in this PR should avoid both problems.

Reasons why data races in Aws::S3::S3Client (KIKIMR-11055) are most probably avoided

There are 6 data races caught in KIKIMR-11055:

  • 2 in curl_slist_append
  • in curl_easy_init
  • in Curl_open
  • in Curl_initinfo
  • in Curl_saferealloc

As of this PR, curl is managed by AWS SDK and I believe the authors are competent enough and the user base is large enough for the data races mentioned above to be eliminated.

Reasons why potential data races in setenv, getenv calls (KIKIMR-12129) are avoided

This PR does not introduce any new calls to them.

Thread sanitizer tests

I have run functional tests in ydb/services/ydb/backup_ut multiple times under TSAN and got the following results:

Test name Runs in total Runs with at least one data race
BackupRestore::Basic 10 0
BackupRestore::RestoreTablePartitioningSettings 10 2
BackupRestore::RestoreIndexTablePartitioningSettings 10 2
BackupRestoreS3::Basic 12 1
BackupRestoreS3::RestoreTablePartitioningSettings 10 1
BackupRestoreS3::RestoreIndexTablePartitioningSettings 10 0

The caught data races are the following:

Data races in AWS SDK stack were not observed.

Comparison with the base branch code

I have tested the code of the base commit of this PR from the main brach under TSAN and the results are the following:

Test name Runs in total Data races
BackupRestore::Basic 1 0
BackupRestore::RestoreTablePartitioningSettings 1 1
BackupRestore::RestoreIndexTablePartitioningSettings 1 2

Caught data races:

  • TBucketQuoter::FillBucket
  • TMemorizableControlWrapper::Update
  • AtomicSet

Summary

Data races caught in 62 runs of the code in this PR have all been caught in 3 runs of the base branch code. This makes me certain that I have not introduced any new data races.

@jepett0 jepett0 requested a review from ijon August 7, 2024 13:06
@@ -1653,6 +1653,8 @@ TIntrusivePtr<TServiceInitializersList> TKikimrRunner::CreateServiceInitializers
sil->AddServiceInitializer(new TGraphServiceInitializer(runConfig));
}

sil->AddServiceInitializer(new TAwsApiInitializer(*this));
Copy link
Collaborator Author

@jepett0 jepett0 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No dedicated service mask, always enabled. This would help us to avoid misconfiguration by users.

@jepett0 jepett0 merged commit ff0fd33 into ydb-platform:main Aug 7, 2024
10 of 12 checks passed
jepett0 added a commit to jepett0/ydb that referenced this pull request Sep 3, 2024
jepett0 added a commit to jepett0/ydb that referenced this pull request Sep 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants