Add fuzzer for async data cache #10244
✅ Deploy Preview for meta-velox canceled.
@zacw7 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D58715904
Summary: Introduce a basic fuzzer for the async data cache. Each iteration involves:
1. Creating a set of data files of varying sizes.
2. Setting up the async data cache with an SSD using a specified configuration.
3. Performing parallel random reads from these data files.
In the initial setup, most of the parameters are defined as gflags and we'll decide later which parameters should be fuzzed during the tests.
Pull Request resolved: facebookincubator#10244
Differential Revision: D58715904
Pulled By: zacw7
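To make the first and third steps of the summary concrete, here is a minimal sketch of how one iteration might choose file sizes and aligned read offsets. It is not the code from this PR: `ReadPlan`, `pickFileSizes`, and `pickRead` are hypothetical names, the random read length is an assumption, and the parameters only mirror the intent of the `min_file_bytes`, `max_file_bytes`, and `offset_interval_bytes` flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical description of one read: which file, where, and how many bytes.
struct ReadPlan {
  size_t fileIndex;
  uint64_t offset;
  uint64_t length;
};

// Picks a size in [minBytes, maxBytes] for each of numFiles files.
std::vector<uint64_t> pickFileSizes(
    std::mt19937_64& rng,
    size_t numFiles,
    uint64_t minBytes,
    uint64_t maxBytes) {
  assert(minBytes <= maxBytes);
  std::uniform_int_distribution<uint64_t> dist(minBytes, maxBytes);
  std::vector<uint64_t> sizes(numFiles);
  for (auto& size : sizes) {
    size = dist(rng);
  }
  return sizes;
}

// Picks a random read whose offset is a multiple of intervalBytes and whose
// range stays inside the chosen file. The random length is an assumption.
ReadPlan pickRead(
    std::mt19937_64& rng,
    const std::vector<uint64_t>& sizes,
    uint64_t intervalBytes) {
  assert(!sizes.empty() && intervalBytes > 0);
  std::uniform_int_distribution<size_t> fileDist(0, sizes.size() - 1);
  const size_t fileIndex = fileDist(rng);
  const uint64_t fileSize = sizes[fileIndex];
  assert(fileSize > 0);
  // Number of aligned start positions that fall inside the file.
  const uint64_t numSlots = (fileSize + intervalBytes - 1) / intervalBytes;
  std::uniform_int_distribution<uint64_t> slotDist(0, numSlots - 1);
  const uint64_t offset = slotDist(rng) * intervalBytes;
  std::uniform_int_distribution<uint64_t> lengthDist(1, fileSize - offset);
  return {fileIndex, offset, lengthDist(rng)};
}
```

Seeding the generator explicitly keeps a failing iteration reproducible, which is the usual reason a fuzzer logs its seed.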
Can you add documentation for running the async data cache fuzzer? An example to follow:
    return executor_.get();
  }

  void initializeDataFiles();
Can you add comments for these functions explaining what each initialization does? Same for the rest of the init functions.
I would prefer function names to be as succinct as possible, so how about initDataFiles?
This seems to be the naming convention in all cache-related tests. Examples:
void initializeCache(
initializeSsdFile(
static void initializeContents(int64_t sequence, memory::Allocation& alloc) {
I would say let's keep it consistent here.
DEFINE_int32(
    max_num_reads,
    100,
    "Max number of reads to be performed per thread.");

DEFINE_int32(num_threads, 16, "Number of threads to read.");

DEFINE_int32(num_files, 8, "Number of data files to be created.");

DEFINE_uint64(
    offset_interval_bytes,
    8 << 20,
    "The offset bytes to be aligned at for cache reads.");

DEFINE_uint64(
    min_file_bytes,
    32 << 20,
    "Minimum file size in bytes of the data files to be created.");

DEFINE_uint64(
    max_file_bytes,
    64 << 20,
    "Maximum file size in bytes of the data files to be created.");

DEFINE_int32(num_files_in_group, 3, "Number of files to be grouped together.");

DEFINE_int64(memory_cache_bytes, 16 << 20, "Memory cache size in bytes.");

DEFINE_uint64(ssd_cache_bytes, 128 << 20, "Ssd cache size in bytes.");

DEFINE_int32(num_shards, 4, "Number of shards of SSD cache.");

DEFINE_uint64(
    ssd_checkpoint_interval_bytes,
    64 << 20,
I don't think we should expose these as fuzzer parameters. They should be randomizable; for now, if we want to keep it simple, we can make them constants.
Thanks for pointing that out. I've discussed with @xiaoxmeng and we'll decide later which parameters should be 1) randomized by fuzzer; 2) kept as configurable parameters; 3) fixed as constants.
So I don't have a strong preference on how we should define them for now in this initial PR. Let's see if @xiaoxmeng has some different opinion.
We can randomize this later if it is zero.
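One way to read the "randomize if zero" idea is sketched below, under stated assumptions: `resolveOrRandomize` is a hypothetical helper, not part of this PR or of gflags, and the range bounds are placeholders a fuzzer author would pick per parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Returns flagValue unchanged when it is non-zero, so a user-supplied setting
// wins. A zero value is treated as "unset" and replaced by a draw from
// [lo, hi]. Hypothetical helper for illustration only.
uint64_t resolveOrRandomize(
    uint64_t flagValue,
    uint64_t lo,
    uint64_t hi,
    std::mt19937_64& rng) {
  assert(lo <= hi);
  if (flagValue != 0) {
    return flagValue;
  }
  return std::uniform_int_distribution<uint64_t>(lo, hi)(rng);
}
```

With this shape, a flag such as the checkpoint interval could default to zero and still be pinned from the command line when reproducing a failure. The trade-off is that zero can no longer be passed as a literal value.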
@zacw7 thanks for adding the cache fuzzer % comments.
  cache_ = AsyncDataCache::create(allocator_, std::move(ssdCache), {});
}

void AsyncDataCacheFuzzer::initializeInputs() {
How about having each read thread run the following loop:
1. Pick a file.
2. Create a cache buffer input.
3. Enqueue randomly selected read offsets.
4. Call load on the cache buffer input.
5. Randomly read from a subset, or all, of the streams enqueued in step 3.
6. For each selected stream, read from start to end and verify the bytes read.
Thanks!
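The proposed loop can be sketched without any Velox types, as below. This is an illustration under assumptions, not the PR's implementation: files are in-memory vectors, "load" is a plain copy standing in for the cache buffer input, and `expectedByte`, `makeFile`, `Region`, and `readLoop` are hypothetical names. File contents are derived from the offset so that any range can be verified without a reference copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Deterministic content: the byte stored at a given offset of a given file.
inline uint8_t expectedByte(size_t fileIndex, uint64_t offset) {
  return static_cast<uint8_t>((offset * 31 + fileIndex * 17) & 0xff);
}

// Hypothetical in-memory stand-in for a data file on disk.
std::vector<uint8_t> makeFile(size_t fileIndex, size_t size) {
  std::vector<uint8_t> data(size);
  for (size_t i = 0; i < size; ++i) {
    data[i] = expectedByte(fileIndex, i);
  }
  return data;
}

struct Region {
  size_t offset;
  size_t length;
};

// One reader's loop. Returns the number of bytes verified, or -1 on the first
// mismatch. Each thread would call this with its own seed.
int64_t readLoop(
    const std::vector<std::vector<uint8_t>>& files,
    int numReads,
    uint64_t seed) {
  assert(!files.empty());
  std::mt19937_64 rng(seed);
  std::bernoulli_distribution coin(0.5);
  int64_t verified = 0;
  for (int i = 0; i < numReads; ++i) {
    // Step 1: pick a file.
    const size_t fileIndex =
        std::uniform_int_distribution<size_t>(0, files.size() - 1)(rng);
    const auto& file = files[fileIndex];
    assert(!file.empty());
    // Steps 2-3: enqueue a few randomly selected regions.
    std::vector<Region> regions;
    const int numRegions = std::uniform_int_distribution<int>(1, 4)(rng);
    for (int r = 0; r < numRegions; ++r) {
      const size_t offset =
          std::uniform_int_distribution<size_t>(0, file.size() - 1)(rng);
      const size_t length =
          std::uniform_int_distribution<size_t>(1, file.size() - offset)(rng);
      regions.push_back({offset, length});
    }
    // Step 4: "load" by copying each region into its own stream buffer.
    std::vector<std::vector<uint8_t>> streams;
    for (const auto& region : regions) {
      streams.emplace_back(
          file.begin() + region.offset,
          file.begin() + region.offset + region.length);
    }
    // Steps 5-6: read a random subset of streams end to end and verify.
    for (size_t s = 0; s < streams.size(); ++s) {
      if (!coin(rng)) {
        continue;
      }
      for (size_t b = 0; b < streams[s].size(); ++b) {
        if (streams[s][b] != expectedByte(fileIndex, regions[s].offset + b)) {
          return -1;
        }
        ++verified;
      }
    }
  }
  return verified;
}
```

Because the files are only read and each call owns its generator, several threads can run `readLoop` over the same `files` vector concurrently without extra locking.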
@zacw7 thanks for the update % minors!
void CacheFuzzer::initializeCache() {
  // We have up to 20 threads and 16 threads are used for reading so
  // there are some threads left over for SSD background write.
  executor_ = std::make_unique<folly::IOThreadPoolExecutor>(20);
I think we should separate them:
readerExecutor_ -> CPU executor: 64
prefetchExecutor_ -> IO executor passed to the buffered input: 4
ssdExecutor_ -> IO executor passed to the SSD cache for SSD staging writes: 4?
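The motivation for separate executors is that one class of work (for example, readers) cannot occupy every thread and starve SSD writes or prefetch. The sketch below illustrates that shape with a minimal standard-library pool. `SimplePool` and `runOnSeparatePools` are hypothetical stand-ins, not folly's executor API, and the pool sizes are simply the ones suggested in the comment above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size thread pool; a stand-in for a real executor.
class SimplePool {
 public:
  explicit SimplePool(size_t numThreads) {
    assert(numThreads > 0);
    for (size_t i = 0; i < numThreads; ++i) {
      workers_.emplace_back([this] { run(); });
    }
  }

  // Lets workers drain the queued tasks, then joins them.
  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) {
      worker.join();
    }
  }

  SimplePool(const SimplePool&) = delete;
  SimplePool& operator=(const SimplePool&) = delete;

  void add(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) {
          return; // Only reached once stopping_ is set and the queue is empty.
        }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_{false};
  std::vector<std::thread> workers_;
};

// Submits numTasks tasks to each of three independent pools and returns how
// many tasks ran in total.
int runOnSeparatePools(int numTasks) {
  std::atomic<int> done{0};
  {
    SimplePool reader(64), prefetch(4), ssd(4);
    for (int i = 0; i < numTasks; ++i) {
      reader.add([&done] { ++done; });
      prefetch.add([&done] { ++done; });
      ssd.add([&done] { ++done; });
    }
  } // Pool destructors drain their queues and join their workers here.
  return done.load();
}
```

Each pool has its own queue and workers, so a backlog on the reader pool does not delay tasks already handed to the prefetch or SSD pools.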
@zacw7 few more comments.
velox/exec/fuzzer/CacheFuzzer.cpp
Outdated
for (auto i = 0; i < FLAGS_num_source_files; ++i) {
  // Initialize buffered input.
  auto readFile = fs_->openFileForRead(fileNames_[i]);
  groupIds_.emplace_back(
Do we need the file group support? Thanks!
Probably not. Let me remove it.
@zacw7 LGTM. Thanks % minors!
Summary: Introduce a basic fuzzer for the async data cache. Each iteration involves: 1. Creating a set of data files of varying sizes. 2. Setting up the async data cache with an SSD using a specified configuration. 3. Performing parallel random reads from these data files. In the initial setup, most of the parameters are defined as gflags and we'll decide later which parameters should be fuzzed during the tests. Pull Request resolved: facebookincubator#10244 Reviewed By: xiaoxmeng Differential Revision: D58715904 Pulled By: zacw7
@zacw7 thanks for the update!
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.