[VL] Task killed by Yarn #6947

Open
FelixYBW opened this issue Aug 20, 2024 · 13 comments
Labels
bug, tracker, triage

Comments

@FelixYBW
Contributor

Backend

VL (Velox)

Bug description

One common issue with Gluten is that tasks get killed by Yarn. Currently some memory allocations in Gluten, such as those made by std::vector, aren't tracked by Spark's memory management; they are counted into the executor overhead memory instead. Some of these allocations can be large, so we should create a proxy allocator for such memory allocation.
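
A rough sketch of what such a proxy allocator could look like (TrackedAllocator and reportToSpark below are illustrative names, not existing Gluten APIs): an STL-compatible allocator that reports every allocation before forwarding to malloc, so containers like std::vector no longer bypass the accounting.

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical hook that would forward the byte delta to Spark's memory
// manager (e.g. through JNI); here it only discards the value.
void reportToSpark(long bytes) { (void)bytes; }

// STL-compatible proxy allocator: every allocation made through it is
// reported, then forwarded to malloc/free.
template <typename T>
struct TrackedAllocator {
  using value_type = T;
  TrackedAllocator() = default;
  template <typename U>
  TrackedAllocator(const TrackedAllocator<U>&) {}

  T* allocate(std::size_t n) {
    reportToSpark(static_cast<long>(n * sizeof(T)));
    if (auto* p = static_cast<T*>(std::malloc(n * sizeof(T)))) {
      return p;
    }
    throw std::bad_alloc();
  }

  void deallocate(T* p, std::size_t n) {
    reportToSpark(-static_cast<long>(n * sizeof(T)));
    std::free(p);
  }
};

template <typename T, typename U>
bool operator==(const TrackedAllocator<T>&, const TrackedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TrackedAllocator<T>&, const TrackedAllocator<U>&) { return false; }

// Usage: the vector's backing buffer is now visible to the tracker.
std::vector<int, TrackedAllocator<int>> trackedVec(1024);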

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

FelixYBW added the bug and triage labels on Aug 20, 2024
zhztheplayer changed the title from "[VL] task killed by Yarn" to "[VL] Task killed by Yarn" on Aug 21, 2024
@Yohahaha
Contributor

We may need to provide the ability to proxy malloc/free in Gluten and inject this proxy into Velox's MemoryAllocator.

@FelixYBW
Contributor Author

Related to #6960. The global memory allocation isn't tracked either, and it has no limit now. It's most likely the root cause.

Thanks @zhztheplayer and @marin-ma

@FelixYBW
Contributor Author

PR6988 added a limit on global memory allocation: 0.75 × the overhead memory. But Velox has no control over global memory allocation, so with PR6988 the workaround is to set a large overhead memory; otherwise you will see the "Exceeded memory allocator limit" exception, though the task is no longer killed by Yarn.
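
For example (the value is illustrative only; tune it per workload):

--conf spark.executor.memoryOverhead=6g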

@FelixYBW
Contributor Author

FelixYBW commented Sep 20, 2024

FelixYBW@fbfe0f3

Here is a simple way to detect this: it outputs any memory allocation larger than 1 MB that doesn't come from a memory pool. Disable jemalloc, set executor cores to 1, and you will get output like:

allocated 67108864
libvelox.so(+0x15b4ff5) [0x7f3301cf2ff5]
libvelox.so(+0x15b51eb) [0x7f3301cf31eb]
libgluten.so(_Znwm+0x19) [0x7f3307be9ea9]
libvelox.so(_ZNSt6vectorISt10shared_ptrIN8facebook5velox4exec15WindowPartitionEESaIS5_EE17_M_realloc_insertIJS5_EEEvN9__gnu_cxx17__normal_iteratorIPS5_S7_EEDpOT_+0x80) [0x7f3305130180]
libvelox.so(_ZN8facebook5velox4exec24RowsStreamingWindowBuild18addPartitionInputsEb+0x1cf) [0x7f330512f18f]
libvelox.so(_ZN8facebook5velox4exec24RowsStreamingWindowBuild8addInputESt10shared_ptrINS0_9RowVectorEE+0x419) [0x7f330512f799]
libvelox.so(_ZN8facebook5velox4exec6Window8addInputESt10shared_ptrINS0_9RowVectorEE+0x53) [0x7f3304bda5b3]
libvelox.so(_ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE+0xfb9) [0x7f3304b5e0f9]
libvelox.so(_ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE+0xbb) [0x7f3304b5f63b]
libvelox.so(_ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE+0x384) [0x7f3303124514]
libvelox.so(_ZN6gluten24WholeStageResultIterator4nextEv+0x78) [0x7f3301ce78f8]
libgluten.so(Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext+0x139) [0x7f33071c99c9]

In this example, the root cause is

windowPartitions_.push_back(std::make_shared<WindowPartition>(
       pool_, data_.get(), inversedInputChannels_, sortKeyInfo_));

The solution is to use memory::StlAllocator<char*> with std::vector and std::allocate_shared.
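
Roughly, the fix could look like this (sketch only; the exact StlAllocator constructor may differ, check the Velox API):

// Sketch: make both the vector's buffer and the shared_ptr control block
// come from the operator's memory pool via memory::StlAllocator.
using PartitionPtr = std::shared_ptr<WindowPartition>;
std::vector<PartitionPtr, memory::StlAllocator<PartitionPtr>> windowPartitions_{
    memory::StlAllocator<PartitionPtr>(*pool_)};

windowPartitions_.push_back(std::allocate_shared<WindowPartition>(
    memory::StlAllocator<WindowPartition>(*pool_),
    pool_, data_.get(), inversedInputChannels_, sortKeyInfo_));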

@FelixYBW
Contributor Author

The other places:
https://github.com/FelixYBW/velox/blob/97f6d25518f3fea264e8382ac66a8d2479ff98f0/velox/exec/SortBuffer.cpp#L277

  auto spillRows = std::vector<char*>(
      sortedRows_.begin() + numOutputRows_, sortedRows_.end());

https://github.com/FelixYBW/velox/blob/97f6d25518f3fea264e8382ac66a8d2479ff98f0/velox/exec/SortBuffer.cpp#L114

sortedRows_.resize(numInputRows_);
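
A possible fix for the spillRows copy, in the same spirit (sketch only):

// Sketch: give the temporary vector a pool-backed allocator so the copied
// row pointers are charged to the query memory pool instead of global malloc.
std::vector<char*, memory::StlAllocator<char*>> spillRows(
    sortedRows_.begin() + numOutputRows_,
    sortedRows_.end(),
    memory::StlAllocator<char*>(*pool_));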

@FelixYBW
Contributor Author

and here:
https://github.com/FelixYBW/velox/blob/a322893d54e87acbe6a0f84c88d9dbbbd5441a5f/velox/exec/SpillFile.cpp#L145

  IOBufOutputStream out(
      *pool_, nullptr, std::max<int64_t>(64 * 1024, batch_->size()));

It's a temporary buffer and is released after the flush.

@FelixYBW
Contributor Author

Another known BKM: the malloc implementation also matters. To boost performance, jemalloc holds on to some memory, which is counted into the overhead memory by Yarn as well.

@FelixYBW
Contributor Author

FelixYBW commented Sep 25, 2024

https://github.com/FelixYBW/velox/blob/97f6d25518f3fea264e8382ac66a8d2479ff98f0/velox/exec/Spiller.cpp#L419

Looks like timsort needs multiple memory allocations. In this case each allocation is ~3 MB:

allocate 3964904 0x55d7c6b55880
libvelox.so(+0x15b6105) [0x7f8acf3e7105]
libvelox.so(+0x15b643b) [0x7f8acf3e743b]
libgluten.so(_Znwm+0x19) [0x7f8ad52e0ea9]
libvelox.so(_ZNSt6vectorIPcSaIS0_EE13_M_assign_auxISt13move_iteratorIN9__gnu_cxx17__normal_iteratorIPS0_S_IS0_N8facebook5velox6memory12StlAllocatorIS0_EEEEEEEEvT_SG_St20forward_iterator_tag+0x5e) [0x7f8ad07fed9e]
libvelox.so(+0x29c6ac7) [0x7f8ad07f7ac7]
libvelox.so(_ZN8facebook5velox4exec7Spiller12ensureSortedERNS2_8SpillRunE+0x2685) [0x7f8ad07faed5]
libvelox.so(_ZN8facebook5velox4exec7Spiller10writeSpillEi+0x60) [0x7f8ad07fb540]
libvelox.so(+0x29ca72b) [0x7f8ad07fb72b]
libvelox.so(_ZNSt17_Function_handlerIFSt10unique_ptrIN8facebook5velox4exec7Spiller11SpillStatusESt14default_deleteIS5_EEvEZNS2_6memory28createAsyncMemoryReclaimTaskIS5_EESt10shared_ptrINS2_11AsyncSourceIT_EEESt8functionIFS0_ISE_S6_ISE_EEvEEEUlvE_E9_M_invokeERKSt9_Any_data+0x58) [0x7f8ad07fb938]
libvelox.so(_ZN8facebook5velox11AsyncSourceINS0_4exec7Spiller11SpillStatusEE4moveEv+0x226) [0x7f8ad07fe916]
libvelox.so(_ZN8facebook5velox4exec7Spiller8runSpillEb+0x46c) [0x7f8ad07f5fbc]
libvelox.so(_ZN8facebook5velox4exec7Spiller5spillEPKNS1_20RowContainerIteratorE+0x88) [0x7f8ad07f68b8]
libvelox.so(_ZN8facebook5velox4exec10SortBuffer10spillInputEv+0x40) [0x7f8ad07e1d70]
libvelox.so(_ZN8facebook5velox4exec7OrderBy7reclaimEmRNS0_6memory15MemoryReclaimer5StatsE+0xf8) [0x7f8ad2293828]
libvelox.so(+0x2986fdd) [0x7f8ad07b7fdd]
libvelox.so(_ZN8facebook5velox6memory15MemoryReclaimer3runERKSt8functionIFlvEERNS2_5StatsE+0x59) [0x7f8acf5ac699]
libvelox.so(_ZN8facebook5velox4exec8Operator15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE+0x27d) [0x7f8ad07bc7dd]
libvelox.so(_ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE+0x57a) [0x7f8acf5ac21a]
libvelox.so(_ZN8facebook5velox4exec23ParallelMemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS3_15MemoryReclaimer5StatsE+0xd05) [0x7f8ad07b7845]
libvelox.so(_ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE+0x57a) [0x7f8acf5ac21a]
libvelox.so(_ZN8facebook5velox4exec4Task15MemoryReclaimer11reclaimTaskERKSt10shared_ptrIS2_EmmRNS0_6memory15MemoryReclaimer5StatsE+0xed) [0x7f8ad081759d]
libvelox.so(_ZN8facebook5velox4exec4Task15MemoryReclaimer7reclaimEPNS0_6memory10MemoryPoolEmmRNS4_15MemoryReclaimer5StatsE+0x189) [0x7f8ad0817c19]
libvelox.so(_ZN8facebook5velox6memory15MemoryReclaimer7reclaimEPNS1_10MemoryPoolEmmRNS2_5StatsE+0x57a) [0x7f8acf5ac21a]
libvelox.so(_ZN6gluten20ListenableArbitrator14shrinkCapacityEmbb+0xc8) [0x7f8acf3fd938]
libvelox.so(_ZN6gluten24WholeStageResultIterator14spillFixedSizeEl+0x2fa) [0x7f8acf3dbfca]
libgluten.so(Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeSpill+0x125) [0x7f8ad48c0e75]

@FelixYBW
Contributor Author

FelixYBW commented Sep 26, 2024

window related fix1: facebookincubator/velox#11077

sort related fix: facebookincubator/velox#11129

@FelixYBW
Contributor Author

FelixYBW commented Sep 29, 2024

@FelixYBW
Contributor Author

A basic design in Velox is that before spill, memory is allocated in the Spark memory pool (offheap size). During spill, memory is allocated in the global memory pool (overhead).

@FelixYBW
Contributor Author

These three configurations have a big impact on the offheap and overhead memory usage:

spark.gluten.sql.columnar.backend.velox.spillWriteBufferSize
spark.gluten.sql.columnar.backend.velox.MaxSpillRunRows
spark.gluten.sql.columnar.backend.velox.maxSpillFileSize

spillWriteBufferSize controls the buffer size when spill writes data to disk. It looks like it also controls the read buffer size when spilled data is fetched back. Each file must have one buffer allocated in offheap memory, so if the size is too large it will report an OOM error triggered by getOutput.

MaxSpillRunRows controls the batch size of a spill run. The bigger the number, the more overhead memory is allocated, because during spill all memory allocation is overhead memory. The smaller the number, the more spill files.

maxSpillFileSize controls the size of each spill file. The smaller the number, the more spill files.
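
For example (the values below are illustrative only; tune them per workload):

--conf spark.gluten.sql.columnar.backend.velox.spillWriteBufferSize=1048576
--conf spark.gluten.sql.columnar.backend.velox.MaxSpillRunRows=3000000
--conf spark.gluten.sql.columnar.backend.velox.maxSpillFileSize=1073741824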

#8025
