
ARROW-14548: [C++] Add madvise random support for memory mapped file #11588

Closed · wants to merge 1 commit

Conversation

@niyue (Contributor) commented Nov 2, 2021

Add a random-access memory advice API to MemoryMappedFile so that a client can advise the OS about its memory usage when it uses a random access pattern.

github-actions bot commented Nov 2, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.


Status MemoryMappedFile::WillNeed(const std::vector<ReadRange>& ranges) {
  std::vector<MemoryRegion> regions(ranges.size());
  RETURN_NOT_OK(ReadRangesToMemoryRegions(ranges, memory_map_, regions));
  return ::arrow::internal::MemoryAdviseWillNeed(regions);
}

@niyue (Contributor, author) commented on this diff:

We already have a WillNeed API on MemoryMappedFile to advise the OS about needed ranges, so I added an AdviseRandom API in the same style to indicate a random access pattern. Initially I wanted to keep this API consistent with WillNeed and simply call it Random, but that seemed slightly confusing too, so I've named it AdviseRandom for now. Let me know if you have other naming suggestions for this API.
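
For reference, a minimal sketch of what the added method could look like, mirroring the WillNeed implementation shown above (the MemoryAdviseRandom helper name is an assumption; the actual diff in this PR may differ):

Status MemoryMappedFile::AdviseRandom(const std::vector<ReadRange>& ranges) {
  std::vector<MemoryRegion> regions(ranges.size());
  RETURN_NOT_OK(ReadRangesToMemoryRegions(ranges, memory_map_, regions));
  // Same plumbing as WillNeed, but advising POSIX_MADV_RANDOM instead of
  // POSIX_MADV_WILLNEED.
  return ::arrow::internal::MemoryAdviseRandom(regions);
}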

      return IOErrorFromErrno(err, "posix_madvise failed");
    }
  }
}

@niyue (Contributor, author) commented on this diff:

I extracted this piece of code from the MemoryAdviseWillNeed API so that it can be reused to provide other advice to the OS.
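
A rough sketch of the extracted helper's shape (illustrative; the exact names, the MemoryRegion layout with addr/size members, and the error handling are assumptions based on the visible diff, and page-alignment handling is elided):

Status MemoryAdvise(const std::vector<MemoryRegion>& regions, int advice) {
  for (const auto& region : regions) {
    // posix_madvise returns an errno value directly (it does not set errno).
    int err = posix_madvise(region.addr, region.size, advice);
    if (err != 0) {
      return IOErrorFromErrno(err, "posix_madvise failed");
    }
  }
  return Status::OK();
}

Status MemoryAdviseWillNeed(const std::vector<MemoryRegion>& regions) {
  return MemoryAdvise(regions, POSIX_MADV_WILLNEED);
}

Status MemoryAdviseRandom(const std::vector<MemoryRegion>& regions) {
  return MemoryAdvise(regions, POSIX_MADV_RANDOM);
}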

@niyue force-pushed the feature/mmap-advise-random branch from 5c4521b to 3ec1b9c on November 2, 2021.
@cyb70289 (Contributor) commented Nov 3, 2021

Thanks for the PR.
I'm not sure exposing madvise options is beneficial. If we expose RANDOM, then why not SEQUENTIAL?
I'd prefer not to expose these access-pattern options and just let the OS do its job.
WILLNEED is a useful option IMO; I suppose the OS won't initiate a read without that hint.

cc @pitrou

@niyue (Contributor, author) commented Nov 3, 2021

> I'm not sure exposing madvise options is beneficial. If we expose RANDOM, then why not SEQUENTIAL?
> I'd prefer not to expose these access-pattern options and just let the OS do its job.
> WILLNEED is a useful option IMO; I suppose the OS won't initiate a read without that hint.

I think someone else may well need SEQUENTIAL too. In my test, when the access pattern is random (binary searching an array in a memory-mapped Arrow IPC file, in my case), I found that the OS (Linux) prefetches data, lots of IO is wasted (90% in my test), and the page cache fills with never-used data.

To visualize what I found, I wrote a program that accesses an array in an mmapped record batch at binary-exponential indexes (arr[1], arr[2], arr[4], arr[8], ...). Before applying the random advice, the page cache for this file looks like this:

[screenshot: page cache before madvise(random)]

After applying the random advice, the page cache for this file looks like this:

[screenshot: page cache after madvise(random)]

This is probably not a problem for fast storage (SSD), but IO becomes the bottleneck when storage bandwidth is limited.
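
A rough stand-in for that experiment (not the original program: it probes the raw file directly instead of going through the Arrow IPC reader, and test.arrow is a placeholder name). Run it with and without an argument and compare page-cache occupancy with a tool like pcstat:

// probe.cc: touch exponentially spaced offsets of an mmapped file,
// optionally advising POSIX_MADV_RANDOM first.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
  int fd = open("./test.arrow", O_RDONLY);  // placeholder file name
  if (fd < 0) abort();
  struct stat st;
  if (fstat(fd, &st) < 0) abort();
  const size_t sz = static_cast<size_t>(st.st_size);

  void* m = mmap(nullptr, sz, PROT_READ, MAP_SHARED, fd, 0);
  if (m == MAP_FAILED) abort();
  const char* p = static_cast<const char*>(m);

  if (argc > 1) {
    // Ask the kernel not to read ahead around each access.
    if (posix_madvise(m, sz, POSIX_MADV_RANDOM) != 0) abort();
  }

  // Offsets 1, 2, 4, 8, ... stand in for the binary-exponential accesses.
  int sum = 0;
  for (size_t i = 1; i < sz; i *= 2) sum += p[i];

  munmap(m, sz);
  return sum;  // use the data so the reads aren't optimized away
}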

@cyb70289 (Contributor) commented Nov 3, 2021

Thanks @niyue!
I did a similar test, reading randomly at most 1/4 of the pages of a 1G memory-mapped file. Without madvise, almost all 1G of data is in the page cache when the program finishes.
With madvise(random), only 22% is in the page cache, as expected.
Though I'm still not sure of the usefulness of tuning this option manually; in practice, I think it's pretty hard to get right.

Test code:

// test.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>

const unsigned seed = 42;
const int N = 4;

// random read at most 1/N pages
int test_random_read(const char *p, size_t sz)
{
    srand(seed);

    int sum = 0;
    for (size_t i = 0; i < sz/4096/N; ++i) {
        double r = (double)rand() / RAND_MAX;
        r *= (sz - 2);
        sum += p[(size_t)r];
    }
    return sum;
}

int main(int argc, char *argv[])
{
    // test.bin is filled with 1G of random data
    int fd = open("./test.bin", O_RDONLY);
    if (fd < 0) abort();

    struct stat statbuf;
    if (fstat(fd, &statbuf) < 0) abort();
    const size_t sz = statbuf.st_size;

    char *p = mmap(NULL, sz, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) abort();
    printf("%d pages mapped\n", (int)(sz/4096));

    // in my test box (ubuntu20.04, linux 5.8, 16G ram, x86_64)
    // - with madvise, 22% (1/N) file is in page cache when program finishes
    // - without madvise, 98% file is in page cache
    if (argc == 1) {
        printf("with madvise(random)\n");
        if (posix_madvise(p, sz, POSIX_MADV_RANDOM) != 0) abort();
    } else {
        printf("without madvise\n");
    }

    int sum = test_random_read(p, sz);

    munmap(p, sz);
    return sum;
}

Test steps:

# create 1G test file
$ dd if=/dev/urandom of=test.bin bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 15.5135 s, 69.2 MB/s
$ sudo sync

# make sure test.bin is not in page cache
$ sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 0         |   0.000 |
+----------+----------------+------------+-----------+---------+

# evaluate test program with madvise(random)
$ gcc -O3 test.c && ./a.out
262144 pages mapped
with madvise(random)
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 57943     |  22.104 |
+----------+----------------+------------+-----------+---------+

# make sure test.bin is not in page cache
$ sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 0         |   0.000 |
+----------+----------------+------------+-----------+---------+

# evaluate test program without madvise
$ gcc -O3 test.c && ./a.out nomadvise
262144 pages mapped
without madvise
$ pcstat test.bin
+----------+----------------+------------+-----------+---------+
| Name     | Size (bytes)   | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| test.bin | 1073741824     | 262144     | 258625    |  98.658 |
+----------+----------------+------------+-----------+---------+

@pitrou (Member) commented Nov 3, 2021

What if we simply always pass POSIX_MADV_RANDOM? MemoryAdviseWillNeed is used in situations where the access pattern is not sequential.

(the OS should do fine on its own for sequential access patterns, AFAIK)

cc @lidavidm @westonpace

@pitrou (Member) commented Nov 3, 2021

Note however that the entire point of MemoryAdviseWillNeed is to induce prefetching before the data is actually read (to hide read latencies). If the OS doesn't actually prefetch anything, then the call is useless. It's not obvious how POSIX_MADV_RANDOM is handled by the Linux kernel here.

@pitrou (Member) commented Nov 3, 2021

Instead we may add an option to the IPC reader to disable advisory hints, for use cases like this one (binary search in a large memory-mapped IPC file).

@westonpace (Member) commented:

Does WILLNEED play nicely with RANDOM? It seems like we would always want to specify both for IPC. I haven't done much testing of this myself; typically when I work with IPC data, it is already in memory.

@pitrou (Member) commented Nov 3, 2021

> Does WILLNEED play nicely with RANDOM?

I don't think you can combine several values. POSIX says:

> The advice to be applied to the memory range is specified by the advice parameter and may be one of the following values: [...]

@niyue (Contributor, author) commented Nov 3, 2021

> What if we simply always pass POSIX_MADV_RANDOM?

I haven't done enough testing with different madvise options yet, but I assume this would likely hurt the throughput of non-random access patterns.

@lidavidm (Member) commented Nov 3, 2021

I suppose this saves memory, but since it's in the OS cache, does that actually lead to memory pressure?

@pitrou (Member) commented Nov 3, 2021

Probably not. AFAIU, @niyue's complaint is more about the additional I/O (which happens in the background, but may still hamper the performance of other concurrent I/O).

@niyue force-pushed the feature/mmap-advise-random branch from 3ec1b9c to bc2d1c9 on November 3, 2021.
@niyue (Contributor, author) commented Nov 3, 2021

> Probably not. AFAIU, @niyue's complaint is more about the additional I/O (which happens in the background, but may still hamper the performance of other concurrent I/O).

Correct. In my test it is mostly about unnecessary IO. You can tell from @cyb70289's test as well: 800MB of unnecessary IO was performed. If the disk has 200MB/s read bandwidth, that fully occupies the disk for 4 seconds, making other disk IO impossible during that time.

Although I haven't done any serious memory-related testing for this madvise option, I suspect a program combining mmap, a random access pattern, and the default IO advice could hurt other applications' performance as well. Such a program essentially reads lots of never-used data into the page cache, and the OS may have to evict existing page cache entries to hold data that is never read, which makes other applications perform worse in the end.

@westonpace (Member) commented:

That makes sense and I agree it could hamper performance. One thing we have not tested and benchmarked very well yet is multi-query performance. Most of our work is on making a single query run as fast as possible, but once multiple queries are running at the same time, issues like this will presumably start to surface.

I think we can get away with always applying RANDOM but I'm not certain for memory mapped files. Details:

So with most of our readers (soon to be all I hope) we don't really need the OS to do prefetching for us. Even if it was helpful it isn't something we can rely on because remote filesystems (e.g. S3, GCS) will suffer. We typically know exactly what ranges of data we want to access and have a pretty good idea (e.g. batch readahead) when we need to start loading the data.

That's why I was really hoping we could do a combination of RANDOM (please don't prefetch for me) and WILLNEED (please prefetch this specifically starting now). For regular files, it probably isn't that big of a deal. We should probably always apply RANDOM. We don't really need prefetching because we are doing asynchronous reads on the I/O thread pool and we kick those off early. We do our own plugging and batching of requests with ReadRangeCache. We are manually duplicating what the OS is doing (since we can't rely on it for remote filesystems) and it works ok as far as I can tell.

So then I suspect the issue is really just with memory-mapped files. Because ReadAt is a no-op as far as the OS is concerned, it won't know that it needs to load those pages. We do that today with WILLNEED. My original assumption was that prefetching wasn't applied to mmap'd files, but that appears to be false. So is there any way we can disable prefetching but still inform the OS that it needs to load a specific range of pages into RAM?

@pitrou (Member) commented Nov 3, 2021

The whole point of our madvise usage is to let the kernel prefetch data. If we decide we don't need prefetching, then we should just remove calls to madvise (which is a system call, so is not without overhead).

@pitrou (Member) commented Nov 3, 2021

Also:

> So is there any way we can disable prefetching but still inform the OS that it needs to load a specific range of pages into RAM?

That sounds contradictory, so I'm not sure how to resolve it.

@westonpace (Member) commented:

> The whole point of our madvise usage is to let the kernel prefetch data. If we decide we don't need prefetching, then we should just remove calls to madvise (which is a system call, so is not without overhead).

The problem seems to be the default prefetch. The fadvise documentation here is actually more complete than the madvise documentation, though I don't know for certain that the parameters mean the same thing:

> Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).

For Linux, the parameter of interest is read_ahead_kb (/sys/block/<dev>/queue/read_ahead_kb), which is the default readahead referred to in the advice documentation. On my system it is set to 128 (and I think this is the default).

While WILLNEED and DONTNEED are specific prefetching instructions (and apply only to a range of data), the RANDOM, SEQUENTIAL, and NORMAL advice values are file-wide instructions that control the default behavior.

Since we only need to call it once per file the overhead shouldn't be too great.

> That sounds contradictory, so I'm not sure how to resolve it.

Another way to phrase it would be: can we modify the file-wide default and still provide instructions for specific regions?
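
In madvise terms, the combination being asked about would look something like the sketch below (illustrative only: whether the kernel still honors WILLNEED prefetching after RANDOM has been advised is exactly the open question here, and posix_madvise requires page-aligned addresses):

#include <sys/mman.h>
#include <cstddef>

// base/length describe the whole mapping; off/len a page-aligned range the
// reader is about to access (hypothetical variables for illustration).
int AdviseRandomThenWillNeed(void* base, size_t length, size_t off, size_t len) {
  // Mapping-wide default: don't read ahead around faults.
  int err = posix_madvise(base, length, POSIX_MADV_RANDOM);
  if (err != 0) return err;
  // Range-specific: prefetch exactly this region now.
  return posix_madvise(static_cast<char*>(base) + off, len, POSIX_MADV_WILLNEED);
}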

@westonpace (Member) commented:
Similar discussion: elastic/elasticsearch#27748

@cyb70289 (Contributor) commented Nov 4, 2021

Just some tests, FYI.

It looks like we can advise both WILLNEED and RANDOM on the same memory region (order doesn't matter, per my test).

I modified the test code to read only the upper half of the mapped memory (0.5G ~ 1G). With both WILLNEED and RANDOM advised, besides the random pages scattered in 0.5G ~ 1G, the 0 ~ 1280K region at the very beginning of the mapping is also in the page cache. No other useless pages end up in the OS page cache.
It looks like WILLNEED does 10 readaheads from the starting address on my test box. Of course, this behaviour is not documented and may be brittle.

@pitrou (Member) commented Nov 4, 2021

> The problem seems to be the default prefetch.

Is it? I'm skeptical that the kernel prefetches an entire 1GB file if you don't ask it to.

> While WILLNEED and DONTNEED are specific prefetching instructions (and apply only to a range of data), the RANDOM, SEQUENTIAL, and NORMAL advice values are file-wide instructions that control the default behavior.

That's only for posix_fadvise, not posix_madvise. And it's Linux-specific...

> Another way to phrase it would be: can we modify the file-wide default and still provide instructions for specific regions?

What would that change? The file-wide default is probably not to prefetch.

It doesn't seem to me that you are addressing the original complaint.

@pitrou (Member) commented Nov 4, 2021

I'd like to reboot the discussion and stop discussing flag combinations without regard for the original issue.

Here is the complaint:

> In my test, when the access pattern is random (binary searching an array in a memory-mapped Arrow IPC file, in my case), I found that the OS (Linux) prefetches data, lots of IO is wasted (90% in my test), and the page cache fills with never-used data.

So, to sum it up:

  • the Arrow IPC layer issues madvise calls for record batches that are read by the user, so that the OS prefetches them in the background
  • here, the user doesn't want the record batches to be prefetched, because they are only doing very sparse reads and ignoring most of the remaining data

@niyue Am I right?

@niyue (Contributor, author) commented Nov 4, 2021

@pitrou Sorry, I probably didn't make it clear enough; that is not my concern.
My use case is pretty similar to the Elasticsearch discussion @westonpace mentioned above. In my case, I store a list of sorted terms in a string array in an Arrow IPC file, then mmap the file and binary search the sorted string array. Each term is a short string (usually less than 10 bytes), so when the program reads a term, only 1 page (the minimum unit of IO) needs to be loaded, since the program requests fewer than 10 bytes from the OS per access. However, from what I observe, macOS 11.6 prefetches 24 pages (even though only 1 page is requested by the program), and Linux (Alpine Linux running in a Docker container) prefetches 16 pages (even though only 1 page is requested).

The array in my case is not very small (typically 0.5~1 million words), so two binary-search accesses usually won't fall into the prefetched page range, which makes the OS prefetching useless and causes lots of unnecessary disk IO. We have thousands of such IPC files to binary search for each request, so the amount of wasted IO is non-trivial. Additionally, some of the storage we test on (from a big cloud vendor) has limited bandwidth, which makes this behavior even worse.

I submitted this PR because, after reading some documentation, I realized I need to advise the OS not to prefetch data that is NOT requested by the user program, since I know up front that the prefetching is useless.

So far I find this doesn't matter much if the storage has good bandwidth, but it helps a lot when storage bandwidth is limited. I haven't tested multiple concurrent programs yet, but I suspect that even on fast storage this could help when multiple programs are running, since the system page cache is utilized better.

More generally, I think people may run into this issue whenever they do random access over an mmapped array of small elements (e.g. short strings, integers, date-times, most of the primitive types) in an IPC file read from bandwidth-limited storage.
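
For concreteness, the access pattern is roughly the following (an illustrative sketch assuming a single sorted string column in batch 0; error handling abbreviated). Each probe touches only a few bytes, yet by default the OS may fault in many surrounding pages:

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>

arrow::Result<int64_t> BinarySearchTerm(const std::string& path,
                                        const std::string& target) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
                                       path, arrow::io::FileMode::READ));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file));
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
  auto terms = std::static_pointer_cast<arrow::StringArray>(batch->column(0));

  // Lower-bound binary search over the sorted terms.
  int64_t lo = 0, hi = terms->length();
  while (lo < hi) {
    const int64_t mid = lo + (hi - lo) / 2;
    if (terms->GetString(mid) < target) {  // each probe faults in ~1 page
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}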

@pitrou (Member) commented Nov 4, 2021

@niyue To me, it seems that you're saying I misunderstood your use case, while you're repeating exactly what I understood.

To make things clear:

  • you're asking the IPC reader to read some (one or several) record batches
  • you actually access only a tiny bit of the record batches' data you just asked to read

Is that right?

@westonpace (Member) commented:

For context, it is probably worth pointing out that @niyue recently added #11486, which gets around the classic "IPC reader reads the entire file even if you only want a few columns" issue.

However, I agree with @pitrou. It sounds like you are not just limiting which columns you access but also accessing very few rows. In that case the problem is likely that the record batch file reader loads the entire array via ArrayLoader => GetBuffer => ReadBuffer => RandomAccessFile::ReadAt(entire-buffer-range).

And, in MemoryMappedFile::ReadAt, we call ::arrow::internal::MemoryAdviseWillNeed on the entire range accessed.

In that case, the solution is what Antoine suggested: we should provide an option in MemoryMappedFile to prevent calls to madvise.

@niyue (Contributor, author) commented Nov 4, 2021

@pitrou

> you actually access only a tiny bit of the record batches' data you just asked to read

This is what I considered different. To make things simpler, assume there is only one array in the record batch, so we don't have to discuss whether multiple arrays are read. The user program did access only a tiny bit of the record batch (one element of the array); since the file is mmapped, I expected only 1 page to be loaded, as that is the minimum IO required by the OS (or is that an incorrect expectation?). However, I found that simply calling array[i] to read one value causes multiple pages to be loaded by the OS, which is unexpected.

> because they are only doing very sparse reads and ignoring most of the remaining data

What I think is different is that the user program doesn't intend to ignore most of the remaining data: according to the API it calls, it reads 1 page (1 array element, actually) and uses the content of that page. In my view this doesn't ignore any remaining data. It does ignore the rest of the 4KB page beyond the one array element, but that is expected and not what I complained about here. The prefetched pages are my complaint: the prefetching happens internally and automatically, so I didn't expect it ("why does reading mmapped array[i] lead to multiple pages of IO?" was my question when troubleshooting), and I would like a way to prevent it in this case.

@niyue (Contributor, author) commented Nov 4, 2021

> in MemoryMappedFile::ReadAt we call ::arrow::internal::MemoryAdviseWillNeed on the entire range accessed

@westonpace I didn't realize this previously. Do you mean MemoryMappedFile::WillNeed is called internally by Arrow during this process, which causes the OS prefetching issue?

I will read more of the Arrow code to see how it works. Previously I briefly searched the code for this part, and I thought the WillNeed API had to be called explicitly by the user program to indicate which part of the data is needed.

@niyue (Contributor, author) commented Nov 4, 2021

@westonpace

> in MemoryMappedFile::ReadAt we call ::arrow::internal::MemoryAdviseWillNeed on the entire range accessed

Thanks for the pointer. I found that MemoryAdviseWillNeed is called inside MemoryMappedFile::ReadAt, and it is probably what causes the problem I ran into:

 RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
      {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));

I debugged the program previously and confirmed that MemoryMappedFile::ReadAt didn't prefetch the array data after being called: the real IO only happens when the user program accesses the array value. So I thought mmap worked but the OS didn't behave as expected. I hadn't considered that ReadAt advises the OS to prefetch this range and thereby changes the OS's behavior. Let me do some more investigation to see how it works. Thanks so much.

@edponce (Contributor) commented Nov 5, 2021

I want to add to this conversation that the Linux kernel also supports setting madvise flags for a memory region from another process (see process_madvise). So if performing many small random accesses is not a common pattern, maybe this is a viable solution for the client application, although I'm not sure the memory address can be extracted from IPC.

@westonpace (Member) commented:

So I think the work I'm doing on ARROW-14577 / #11616 will help here. I am adding two new methods to RecordBatchFileReader: WillNeedMetadata(vector<int> indices) and WillNeedBatches(vector<int> indices). The first tells the reader to preload the metadata but not the data blocks. The second tells the reader to preload both the metadata and the data blocks.

Once those are in, we can remove the automatic WILLNEED advice from MemoryMappedFile. Instead, WillNeedBatches will call WILLNEED on the data ranges before loading them, and WillNeedMetadata will not do this (nor will the default).

So if you want to hop randomly through a file (e.g. for a binary search), simply use WillNeedMetadata. If you want to load entire blocks of data efficiently, call WillNeedBatches.

I'm still a few days away from finishing so any feedback on the approach now would be appreciated.
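
If it lands as described, usage might look something like this (hypothetical; the method names and signatures are from the in-progress ARROW-14577 / #11616 work and could still change):

// Binary-search use case: preload only the metadata for these batches;
// data pages are then faulted in lazily, one probe at a time.
RETURN_NOT_OK(reader->WillNeedMetadata({0, 1, 2}));

// Scan use case: also prefetch the data blocks before reading them.
RETURN_NOT_OK(reader->WillNeedBatches({0, 1, 2}));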

@pitrou (Member) commented Nov 5, 2021

I think the default should still be to prefetch automatically, though. That seems to be the dominant use case (if you load a batch, you probably want to use it in its entirety or a large part of it).

@westonpace (Member) commented:

> I think the default should still be to prefetch automatically, though. That seems to be the dominant use case (if you load a batch, you probably want to use it in its entirety or a large part of it).

That seems reasonable. What if we add a new optional parameter to arrow::io::MemoryMappedFile::Open, bool auto_willneed = true? Then whether the IPC reader does madvise or not depends on how the file was created (and this is completely independent of whether the IPC reader uses a ReadRangeCache).
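
The floated signature would presumably look like this (hypothetical; this parameter was never merged):

// static Result<std::shared_ptr<MemoryMappedFile>> Open(
//     const std::string& path, FileMode::type mode, bool auto_willneed = true);

// A reader doing sparse random probes would opt out of the automatic advice:
ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
                                     path, arrow::io::FileMode::READ,
                                     /*auto_willneed=*/false));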

@pitrou (Member) commented May 4, 2022

Wow, I completely forgot to answer this, sorry.

> That seems reasonable. What if we add a new optional parameter to arrow::io::MemoryMappedFile::Open, bool auto_willneed = true? Then whether the IPC reader does madvise or not depends on how the file was created (and this is completely independent of whether the IPC reader uses a ReadRangeCache).

Hmm, that doesn't sound like an ideal solution, as it would only work for memory-mapped files.

@niyue (Contributor, author) commented May 7, 2022

I happened to revisit this PR a week ago.

> So I think the work I'm doing on ARROW-14577 / #11616 will help here. I am adding two new methods to RecordBatchFileReader: WillNeedMetadata(vector<int> indices) and WillNeedBatches(vector<int> indices).

@westonpace I briefly went through PR #11616 last time but didn't find the two new methods; do we still plan to add such APIs?

> that doesn't sound like an ideal solution, as it would only work for memory-mapped files

@pitrou I don't have the whole picture of these APIs. Could you shed some more light on this part: what would a possible ideal solution look like? I'm still interested in making this work better, and it would be great if you could give some more advice; I can give it a try later.

@amol- (Member) commented Mar 30, 2023

Closing because it has been untouched for a while. In case it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023