[GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile to avoid OOM #7861

ccat3z · 2024-11-08T03:10:42Z

What changes were proposed in this pull request?

This PR fixes #7860 by adding MmapFileStream, which extends arrow::io::InputStream. When MmapFileStream reads the next range of data, it calls madvise with MADV_DONTNEED to release the memory of the range it has already read.
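
To illustrate the idea, here is a minimal sketch of a sequential reader over an mmap'ed file that releases already-consumed pages with madvise(MADV_DONTNEED). It is not the MmapFileStream class from this PR: the class name SequentialMmapReader and its members are hypothetical, it has no Arrow dependency, and it assumes a POSIX system with a non-empty input file.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical illustration only; not the MmapFileStream class from this PR.
class SequentialMmapReader {
 public:
  explicit SequentialMmapReader(const std::string& path) {
    fd_ = ::open(path.c_str(), O_RDONLY);
    if (fd_ < 0) {
      throw std::runtime_error("open failed");
    }
    struct stat st {};
    if (::fstat(fd_, &st) != 0 || st.st_size == 0) {
      ::close(fd_);
      throw std::runtime_error("fstat failed or file is empty");
    }
    size_ = static_cast<size_t>(st.st_size);
    void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
    if (p == MAP_FAILED) {
      ::close(fd_);
      throw std::runtime_error("mmap failed");
    }
    data_ = static_cast<uint8_t*>(p);
    pageSize_ = static_cast<size_t>(::sysconf(_SC_PAGESIZE));
  }

  ~SequentialMmapReader() {
    ::munmap(data_, size_);
    ::close(fd_);
  }

  SequentialMmapReader(const SequentialMmapReader&) = delete;
  SequentialMmapReader& operator=(const SequentialMmapReader&) = delete;

  // Copies up to nbytes into out and returns the number of bytes copied
  // (0 at end of file). Pages that lie fully behind the read position are
  // handed back to the kernel afterwards.
  size_t read(void* out, size_t nbytes) {
    const size_t n = std::min(nbytes, size_ - pos_);
    std::memcpy(out, data_ + pos_, n);
    pos_ += n;
    releaseConsumedPages();
    return n;
  }

  // Number of bytes at the start of the mapping that have been released.
  size_t released() const {
    return retain_;
  }

 private:
  void releaseConsumedPages() {
    assert(retain_ % pageSize_ == 0);
    // Round the consumed span down to whole pages.
    const size_t purge = (pos_ - retain_) & ~(pageSize_ - 1);
    if (purge > 0) {
      // Advisory call; a failure is ignored in this sketch.
      ::madvise(data_ + retain_, purge, MADV_DONTNEED);
      retain_ += purge;
    }
  }

  int fd_ = -1;
  uint8_t* data_ = nullptr;
  size_t size_ = 0;
  size_t pos_ = 0;
  size_t retain_ = 0;
  size_t pageSize_ = 0;
};
```

The mapping stays valid after madvise, so only the resident pages are dropped; a later access to a released page of this read-only file mapping would simply fault the data back in from the file.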

How was this patch tested?

// Generate 10 partitions, each partition has about 10GB random data.
def gen(scale: Int, parts: Int) = {
  sc.parallelize(1 to (1024*1024), numSlices = 1000)
    .map(x => (x % 1000, randStr(scale * parts)))
    .repartition(parts)
    .toDF("a", "b")
    .save./* ... */
}

// Trigger shuffle spill by `repartition(50)`.
def test(parts: Int = 50) = {
  spark.read./* ... */.repartition(parts)
    .filter(expr("a < 0*rand()")) // avoid pushdown repartition
}

# Executor Memory Config
spark.executor.memory=512M
spark.yarn.executor.memoryOverhead=512M
spark.gluten.memory.offHeap.size.in.bytes=1610612736

Test Result:

impl	avg time to merge spills (s)	avg total spilled size of each task (MB)
read (arrow ReadableFile)	10.58706836156	9935.920098495480
mmap (open required range by MemoryMappedFile)	6.602059312420000	9935.920098495480
madv (this pr)	6.73993204562	9935.920098495480
mmap (replace madv with munmap in this PR)	6.55791399852	9935.920098495480

The munmap patch used in the test above:

diff --git a/cpp/core/shuffle/Utils.cc b/cpp/core/shuffle/Utils.cc
index 1ceb777f1..742c53c90 100644
--- a/cpp/core/shuffle/Utils.cc
+++ b/cpp/core/shuffle/Utils.cc
@@ -243,9 +243,9 @@ void MmapFileStream::advance(int64_t length) {
 
   auto purgeLength = (pos_ - posRetain_) & pageMask;
   if (purgeLength > 0) {
-    int ret = madvise(data_ + posRetain_, purgeLength, MADV_DONTNEED);
+    int ret = munmap(data_ + posRetain_, purgeLength);
     if (ret != 0) {
-      LOG(WARNING) << "fadvise failed " << ::arrow::internal::ErrnoMessage(errno);
+      LOG(WARNING) << "munmap failed " << ::arrow::internal::ErrnoMessage(errno);
     }
     posRetain_ += purgeLength;
   }
@@ -269,7 +269,7 @@ void MmapFileStream::willNeed(int64_t length) {
 
 arrow::Status MmapFileStream::Close() {
   if (data_ != nullptr) {
-    int result = munmap(data_, size_);
+    int result = munmap(data_ + posRetain_, size_ - posRetain_);
     if (result != 0) {
       LOG(WARNING) << "munmap failed";
     }
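
The purge length in the diff above is the consumed span rounded down to whole pages. A small worked sketch of that arithmetic follows; the free function purgeLength is hypothetical, and it assumes that pageMask equals ~(pageSize - 1) with a power-of-two page size, which the diff context does not show.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the expression in the diff. Assumes
// pageMask == ~(pageSize - 1) and that pageSize is a power of two.
inline int64_t purgeLength(int64_t pos, int64_t posRetain, int64_t pageSize) {
  assert(pageSize > 0 && (pageSize & (pageSize - 1)) == 0);
  assert(pos >= posRetain);
  const int64_t pageMask = ~(pageSize - 1);
  // Clearing the low bits rounds the consumed span down to whole pages.
  return (pos - posRetain) & pageMask;
}
```

With 4096-byte pages, a read position of 10000 and nothing released yet gives a purge length of 8192; the remaining 1808 bytes stay mapped until the position crosses the next page boundary. In the munmap variant the released prefix is already unmapped, which is why Close in that patch unmaps only the range from posRetain_ to size_.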

github-actions · 2024-11-08T03:10:59Z

#7860

ccat3z · 2024-11-08T04:03:41Z

cc @kecookier

kecookier · 2024-11-09T02:30:23Z

/Benchmark Velox

ccat3z · 2024-11-09T03:07:39Z

/Benchmark Velox

ccat3z · 2024-11-09T03:15:47Z

/Benchmark Velox

zhztheplayer

@ccat3z Do you see #7860 fixed with this approach?

I am triggering a benchmark manually.

cc @marin-ma @FelixYBW

zhztheplayer · 2024-11-11T07:36:09Z

cpp/core/shuffle/Spill.cc

@@ -73,7 +73,7 @@ void Spill::insertPayload(

 void Spill::openSpillFile() {
  if (!is_) {
-    GLUTEN_ASSIGN_OR_THROW(is_, arrow::io::MemoryMappedFile::Open(spillFile_, arrow::io::FileMode::READ));
+    GLUTEN_ASSIGN_OR_THROW(is_, arrow::io::ReadableFile::Open(spillFile_));


Is the API implemented with buffered read?

Not sure whether https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/buffered.h may help here.

Spill merge doesn't need buffering.

marin-ma · 2024-11-11T07:40:27Z

I am triggering a benchmark manually.

@zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

zhztheplayer · 2024-11-11T07:49:20Z

I am triggering a benchmark manually.

@zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

Thought we always rely on Spark-controlled spill in shuffle. Does Jenkins CI always have enough memory for all shuffle data?

FelixYBW · 2024-11-12T21:34:13Z

@zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

Is it because the spill will be triggered on other operators in the pipeline? Like a sort + shuffle. Will the sort be triggered or shuffle?

FelixYBW · 2024-11-13T23:13:19Z

@zhztheplayer @marin-ma can we create a query and config to test it?

ccat3z · 2024-11-18T03:38:39Z

@FelixYBW @zhztheplayer I added MmapFileStream in this PR. MmapFileStream calls madvise with MADV_DONTNEED to release previously read memory when reading the next range of data. The test approach and results have been updated in the PR description.

marin-ma · 2024-11-18T04:00:10Z

cpp/core/shuffle/Utils.cc

+  auto fstream = std::shared_ptr<MmapFileStream>(new MmapFileStream());
+  fstream->fd_ = std::move(fd);
+  fstream->data_ = static_cast<uint8_t*>(result);
+  fstream->size_ = size;


Can we use std::make_shared and pass the arguments through the constructor?
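
A minimal sketch of what this suggestion could look like. MappedStream and OwnedFd are hypothetical stand-ins (OwnedFd only imitates a move-only descriptor wrapper and does not close anything); they are not the Arrow or Gluten types. Note that std::make_shared needs an accessible constructor, so the constructor is public here.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical move-only descriptor wrapper; a placeholder that owns nothing.
class OwnedFd {
 public:
  explicit OwnedFd(int fd) : fd_(fd) {}
  OwnedFd(OwnedFd&& other) noexcept : fd_(std::exchange(other.fd_, -1)) {}
  OwnedFd(const OwnedFd&) = delete;
  OwnedFd& operator=(const OwnedFd&) = delete;

  int get() const {
    return fd_;
  }

 private:
  int fd_;
};

// Hypothetical stream over a mapped region. All state is set through the
// constructor instead of being assigned field by field after construction.
class MappedStream {
 public:
  MappedStream(OwnedFd fd, uint8_t* data, int64_t size)
      : fd_(std::move(fd)), data_(data), size_(size) {}

  // Factory: one allocation for object and control block, arguments forwarded.
  static std::shared_ptr<MappedStream> open(OwnedFd fd, uint8_t* data, int64_t size) {
    return std::make_shared<MappedStream>(std::move(fd), data, size);
  }

  int fd() const {
    return fd_.get();
  }

  const uint8_t* data() const {
    return data_;
  }

  int64_t size() const {
    return size_;
  }

 private:
  OwnedFd fd_;
  uint8_t* data_;
  int64_t size_;
};
```

Passing everything through the constructor means the object is never observable in a half-initialized state, which the field-by-field assignment in the reviewed snippet allows.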

marin-ma · 2024-11-18T04:01:26Z

cpp/core/shuffle/Utils.h

@@ -72,4 +72,34 @@ arrow::Result<std::shared_ptr<arrow::RecordBatch>> makeUncompressedRecordBatch(

 std::shared_ptr<arrow::Buffer> zeroLengthNullBuffer();

+class MmapFileStream : public arrow::io::InputStream {


Could you please add some comments to explain the usage/functionality for this class?

marin-ma

Some minor comments. Thanks!

marin-ma · 2024-11-18T12:14:04Z

cpp/core/shuffle/Utils.h

+// to prefetch and release memory timely.
+class MmapFileStream : public arrow::io::InputStream {
+ public:
+  MmapFileStream(arrow::internal::FileDescriptor fd, uint8_t* data, int64_t size)


Please separate the declaration and definition. And add a blank line between two member functions.

marin-ma · 2024-11-18T12:14:18Z

cpp/core/shuffle/Utils.h

+  arrow::Status Close() override;
+  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override;
+  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override;
+  bool closed() const override {


marin-ma · 2024-11-18T12:14:25Z

cpp/core/shuffle/Utils.h

+  };
+
+ private:
+  arrow::Result<int64_t> actualReadSize(int64_t nbytes) {


FelixYBW · 2024-11-18T20:35:33Z

cpp/core/shuffle/Utils.h

@@ -72,4 +72,37 @@ arrow::Result<std::shared_ptr<arrow::RecordBatch>> makeUncompressedRecordBatch(

 std::shared_ptr<arrow::Buffer> zeroLengthNullBuffer();

+// MmapFileStream is used to optimize sequential file reading. It uses madvise
+// to prefetch and release memory timely.
+class MmapFileStream : public arrow::io::InputStream {


You may want to contribute MmapFileStream to Apache Arrow in the future.

FelixYBW · 2024-11-18T23:20:49Z

Thank you. Looks like a good solution!

marin-ma

LGTM. Thanks!

zhztheplayer · 2024-11-19T09:21:36Z

Is it because the spill will be triggered on other operators in the pipeline? Like a sort + shuffle. Will the sort be triggered or shuffle?

So far, a spill is triggered on whichever components hold more memory, whether they are Velox operators or shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

FelixYBW · 2024-11-19T21:54:48Z

Is it because the spill will be triggered on other operators in the pipeline? Like a sort + shuffle. Will the sort be triggered or shuffle?

So far, a spill is triggered on whichever components hold more memory, whether they are Velox operators or shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

So now, once spill is called, every operator's spill is triggered, right?

zhztheplayer · 2024-11-20T00:47:31Z

Is it because the spill will be triggered on other operators in the pipeline? Like a sort + shuffle. Will the sort be triggered or shuffle?

So far, a spill is triggered on whichever components hold more memory, whether they are Velox operators or shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

So now, once spill is called, every operator's spill is triggered, right?

We pass a target spill size to Velox API so usually the spill call stops when enough memory space is reclaimed. So a portion of the operators can be omitted in the procedure.
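
A toy model of the stop-at-target behaviour described here, under stated assumptions: the Spillable struct and spillUntil function are hypothetical, this is not the Velox or Gluten Spiller API, ordering by held memory is borrowed from the earlier comment in this thread, and each spill is assumed to free everything the component holds. It does not model any priority setting or how the shuffle writer's spill is invoked.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of something that can spill.
struct Spillable {
  std::string name;
  int64_t heldBytes;
};

// Asks components to spill, largest holder first, and stops as soon as the
// target has been reclaimed. Returns the names of the components that were
// asked to spill; the others are skipped.
inline std::vector<std::string> spillUntil(std::vector<Spillable>& components, int64_t targetBytes) {
  std::sort(components.begin(), components.end(), [](const Spillable& a, const Spillable& b) {
    return a.heldBytes > b.heldBytes;
  });
  std::vector<std::string> spilled;
  int64_t reclaimed = 0;
  for (auto& component : components) {
    if (reclaimed >= targetBytes) {
      break;  // Enough memory reclaimed; remaining components are omitted.
    }
    if (component.heldBytes == 0) {
      continue;
    }
    // Simplification: a spill releases everything the component holds.
    reclaimed += component.heldBytes;
    component.heldBytes = 0;
    spilled.push_back(component.name);
  }
  return spilled;
}
```

For example, with components holding 300, 500 and 100 bytes and a target of 600, the two largest are spilled and the smallest is left untouched.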

FelixYBW · 2024-11-21T06:18:53Z

We pass a target spill size to Velox API so usually the spill call stops when enough memory space is reclaimed. So a portion of the operators can be omitted in the procedure.

Will it still call the shuffle writer's spill anyway?

FelixYBW · 2024-11-23T07:04:10Z

can you resolve conflict?

ccat3z · 2024-11-25T02:51:51Z

can you resolve conflict?

Rebased to latest main.

zhztheplayer · 2024-11-25T06:53:49Z

We pass a target spill size to Velox API so usually the spill call stops when enough memory space is reclaimed. So a portion of the operators can be omitted in the procedure.

Will it still call the shuffle writer's spill anyway?

yes exactly

FelixYBW · 2024-11-26T01:20:17Z

cpp/core/shuffle/Utils.cc

+};
+
+void MmapFileStream::advance(int64_t length) {
+  static auto pageSize = static_cast<size_t>(arrow::internal::GetPageSize());


A single page is probably too small a granularity here. Can you use the spark.shuffle.file.buffer config instead?

github-actions · 2024-11-27T10:17:26Z