ARROW-14548: [C++] Add madvise random support for memory mapped file #11588
Conversation
```cpp
Status MemoryMappedFile::WillNeed(const std::vector<ReadRange>& ranges) {
  std::vector<MemoryRegion> regions(ranges.size());
  RETURN_NOT_OK(ReadRangesToMemoryRegions(ranges, memory_map_, regions));
  return ::arrow::internal::MemoryAdviseWillNeed(regions);
}
```
Previously we already had a `WillNeed` API in `MemoryMappedFile` to advise the OS about the needed ranges, so I added an `AdviseRandom` API in the same style to indicate a random access pattern. Initially I wanted to keep this API consistent with `WillNeed` and simply call it `Random`, but I think that may be slightly confusing as well, so I have named it `AdviseRandom` for now. Let me know if you have other naming suggestions for this API.
```cpp
      return IOErrorFromErrno(err, "posix_madvise failed");
    }
  }
}
```
I extracted this piece of code from the `MemoryAdviseWillNeed` API so that it can be reused to provide other advice to the OS.
Thanks for the PR.
Thanks @niyue ! Test code:

```c
// test.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>

const unsigned seed = 42;
const int N = 4;

// randomly read at most 1/N of the pages
int test_random_read(const char *p, size_t sz)
{
    srand(seed);
    int sum = 0;
    for (size_t i = 0; i < sz/4096/N; ++i) {
        double r = (double)rand() / RAND_MAX;
        r *= (sz - 2);
        sum += p[(size_t)r];
    }
    return sum;
}

int main(int argc, char *argv[])
{
    // test.bin is filled with 1G random data
    int fd = open("./test.bin", O_RDONLY);
    if (fd < 0) abort();
    struct stat statbuf;
    if (fstat(fd, &statbuf) < 0) abort();
    const size_t sz = statbuf.st_size;
    char *p = mmap(NULL, sz, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) abort();
    printf("%d pages mapped\n", (int)(sz/4096));
    // on my test box (ubuntu 20.04, linux 5.8, 16G ram, x86_64):
    // - with madvise, 22% (1/N) of the file is in the page cache when the program finishes
    // - without madvise, 98% of the file is in the page cache
    if (argc == 1) {
        printf("with madvise(random)\n");
        if (posix_madvise(p, sz, POSIX_MADV_RANDOM) != 0) abort();
    } else {
        printf("without madvise\n");
    }
    int sum = test_random_read(p, sz);
    munmap(p, sz);
    return sum;
}
```

Test steps:

```shell
# create 1G test file
$ dd if=/dev/urandom of=test.bin bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 15.5135 s, 69.2 MB/s
$ sudo sync
# make sure test.bin is not in page cache
$ sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 0         | 0.000   |
+----------+----------------+------------+-----------+---------+
# evaluate test program with madvise(random)
$ gcc -O3 test.c && ./a.out
262144 pages mapped
with madvise(random)
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 57943     | 22.104  |
+----------+----------------+------------+-----------+---------+
# make sure test.bin is not in page cache
$ sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 0         | 0.000   |
+----------+----------------+------------+-----------+---------+
# evaluate test program without madvise
$ gcc -O3 test.c && ./a.out nomadvise
262144 pages mapped
without madvise
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 258625    | 98.658  |
+----------+----------------+------------+-----------+---------+
```
What if we simply always pass `POSIX_MADV_RANDOM`? (The OS should do fine on its own for sequential access patterns, AFAIK.)
Note however that the entire point of `WillNeed` is to let the kernel prefetch the given ranges.
Instead we may add an option to the IPC reader to disable advisory hints, for use cases like this one (binary search in a large memory-mapped IPC file).
Does WILLNEED play nicely with RANDOM? It seems like we would always want to specify both for IPC. I haven't done much testing of this myself. Typically when I work with IPC data it is in-memory already.
I don't think you can combine several values; POSIX specifies the advice argument as a single value, not a bitmask of flags.
I haven't done enough testing with different madvise values yet, but I assume this would likely affect non-random access patterns' throughput negatively.
I suppose this saves memory, but since it's in the OS cache, does that actually lead to memory pressure?
Probably not. AFAIU, @niyue's complaint is more about the additional I/O (which would happen in the background, but may still hamper performance of other concurrent I/Os).
Correct. In my test, it is more about unnecessary IO. You can tell from @cyb70289's test as well: 800MB of unnecessary IO was performed; if the disk has 200MB/s of read bandwidth, that fully occupies the disk for 4 seconds, making other disk IO impossible. Although I haven't done any serious memory-related testing for this advice, I suspect a program using mmap + a random access pattern + the default IO advice could affect other applications' performance as well. Such a program essentially reads lots of never-used data into the page cache, and the OS may have to evict existing data from the page cache to hold this never-read data, which makes other applications perform worse in the end.
That makes sense and I agree it could hamper performance. One thing we do not have very well tested and benchmarked yet is multi-query performance. Most of our work is on making a single query run as fast as possible, but once there are multiple queries running at the same time, issues like this will presumably start to surface. I think we can get away with always applying RANDOM, but I'm not certain for memory-mapped files.

Details: with most of our readers (soon to be all, I hope) we don't really need the OS to do prefetching for us. Even if it were helpful, it isn't something we can rely on, because remote filesystems (e.g. S3, GCS) would suffer. We typically know exactly what ranges of data we want to access and have a pretty good idea (e.g. batch readahead) when we need to start loading the data. That's why I was really hoping we could do a combination of RANDOM (please don't prefetch for me) and WILLNEED (please prefetch this specifically, starting now).

For regular files, it probably isn't that big of a deal. We should probably always apply RANDOM. We don't really need prefetching because we are doing asynchronous reads on the I/O thread pool and we kick those off early. We do our own plugging and batching of requests with ReadRangeCache. We are manually duplicating what the OS is doing (since we can't rely on it for remote filesystems) and it works ok as far as I can tell.

So I suspect the issue is really just with memory-mapped files. Because ReadAt is a no-op as far as the OS is concerned, it won't know that it needs to load those pages in. We do that today with WILLNEED. My original assumption was that prefetching wasn't applied to mmap'd files, but it appears that is false. So is there any way we can disable prefetching but still inform the OS that it needs to load a specific range of pages into RAM?
The whole point of our madvise usage is to let the kernel prefetch data. If we decide we don't need prefetching, then we should just remove the calls to madvise (which is a system call, so it is not without overhead).
Also:
That sounds contradictory, so I'm not sure how to resolve it.
The problem seems to be the default prefetch. The fadvise documentation here is actually more complete than the madvise documentation but I don't know for certain that the parameters mean the same thing:
For Linux, the parameter of interest is read_ahead_kb, which is the default readahead referred to in the advice documentation. On my system it is set to 128 (and I think this is the default). While WILLNEED and DONTNEED are specific prefetching instructions (and only apply to a range of data), the advice values RANDOM, SEQUENTIAL, and NORMAL are file-wide instructions that control the default behavior. Since we only need to call it once per file, the overhead shouldn't be too great.
Another way to phrase it would be: "can we modify the file-wide default and still provide instructions for specific regions?"
Similar discussion: elastic/elasticsearch#27748
Just some tests, FYI. Looks like we can advise both. I modified the test code to only read the higher half of the mapped memory (0.5G ~ 1G). With both
Is it? I'm skeptical that the kernel prefetches an entire 1GB file if you don't ask it to.
That's only for
What would that change? The file-wide default is probably not to prefetch. It doesn't seem to me that you are addressing the original complaint.
I'd like to reboot the discussion and stop discussing flag combinations without regard for the original issue. Here is the complaint:
So, to sum it up:
@niyue Am I right?
@pitrou Sorry, probably I didn't make it clear enough; this is not my concern. The array in my case is not very small (typically 0.5~1 million words), and two consecutive binary-search accesses usually won't fall into the prefetched page range, which makes the OS prefetching useless while the disk does lots of unnecessary IO. We have thousands of such IPC files to binary search for each request, so the amount of wasted IO is non-trivial. Additionally, some of the storage we test (from a big cloud vendor) has limited bandwidth, which makes this behavior worse. I submitted this PR because, after reading some documentation, I realized I need to advise the OS not to prefetch data that is NOT requested by the user program, since I know up front that the prefetching is useless. So far I've found this doesn't matter too much if the storage has good bandwidth, but it can help a lot if the storage bandwidth is limited. I haven't tested multiple concurrent programs yet, but I suspect that even on fast storage this could help when multiple programs are running, since the system page cache can be utilized better. More generally, I think people may run into this issue when doing random access into an mmapped array of small elements (e.g. small strings, integers, date times, most of the primitive types) in an IPC file read from bandwidth-limited storage.
@niyue To me, it seems that you're saying I misunderstood your use case, while you're repeating exactly what I understood. To make things clear:
Is that right?
For context, it is probably worth pointing out that @niyue recently added #11486, which gets around the classic "IPC reader reads the entire file even if you only want a few columns" issue. However, I agree with @pitrou. It sounds like you are not just limiting which columns you are accessing but also accessing very few rows. In that case the problem is likely the fact that the record batch file reader loads the entire array via

And, in that case, the solution is what Antoine suggested: we should provide an option in MemoryMappedFile to prevent calls to madvise.
This is what I considered different. To make things simpler, we can assume there is only one array in the record batch, so we don't have to discuss whether multiple arrays are read in this case. The user program did only access a tiny bit of the record batch (one element of the array); since it is mmapped, I expected only 1 page to be loaded, as that is the minimum IO required by the OS (or is that an incorrect expectation?). However, I found simply calling
What I think is different is that the user program doesn't want to ignore most of the remaining data; according to the API it calls, it tries to read 1 page (1 array element, actually) and uses the content of this page. From my point of view, this doesn't ignore any remaining data: it does ignore the remaining
@westonpace I didn't realize this previously. Do you mean

I will do more Arrow code reading to see how it works. Previously I briefly did some code searching on this part, and I thought the
Thanks for the pointer, and I found this
I debugged the program previously and confirmed
I want to add to this conversation that the Linux kernel also supports setting the madvise flags for a memory region from another process (see `process_madvise(2)`).
So I think the work I'm doing on ARROW-14577 / #11616 will help here. I am adding two new methods to RecordBatchFileReader. Once those are in, we can remove the automatic WILLNEED advice from MemoryMappedFile. Instead

So if you want to randomly hop through a file (e.g. for a binary search), simply use

I'm still a few days away from finishing, so any feedback on the approach now would be appreciated.
I think the default should still be to prefetch automatically, though. That seems to be the dominant use case (if you load a batch, you probably want to use it in its entirety or a large part of it).
That seems reasonable. What if we add a new optional parameter to
Wow, I completely forgot to answer this, sorry.
Hmm, that doesn't sound like an ideal solution, as that would only work for memory-mapped files.
I happened to revisit this PR a week ago.
@westonpace I briefly went through PR 11616 last time but didn't find the two new methods; do we still plan to add such APIs?
@pitrou I don't have a whole picture of these APIs; could you please shed some more light on this part? What would a possible ideal solution look like? I am still interested in making this work better, and it would be great if you could give some more advice so I can give it a try later.
Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍
Add an advise-random memory API to MemoryMappedFile so that clients can advise the OS about memory usage when using a random access pattern.