[DOC] Multi-Threaded shuffle documentation is not accurate on the read side #9512
Labels: documentation (Improvements or additions to documentation), shuffle (things that impact the shuffle plugin)
The Multi-Threaded shuffle documentation says:
Unfortunately, that isn't true for the read side. The write side of the shuffle does follow this, since Spark has different shuffle writer algorithms (bypass merge and merge sort). The read side is a single implementation in Spark, so there is no "bypass merge" reader or "merge sort" reader; it is just the reader. The documentation should state this, as it is currently incorrect.
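To make that asymmetry concrete, here is a short, self-contained Scala sketch. It is an illustration only, not Spark's actual code: the handle and writer names loosely mirror Spark's sort-based shuffle (bypass merge vs. merge sort), and the single `readerFor` path stands in for the one reader implementation described above.

```scala
// Simplified sketch (NOT Spark's actual code) of the write/read asymmetry:
// the sort-based shuffle picks between several writer algorithms, but every
// read goes through a single reader implementation.
object ShuffleDispatchSketch {
  sealed trait ShuffleHandle
  case object BypassMergeSortHandle extends ShuffleHandle // small partition counts
  case object BaseSortHandle        extends ShuffleHandle // general merge-sort path

  // Write side: the handle type selects the writer algorithm.
  def writerFor(handle: ShuffleHandle): String = handle match {
    case BypassMergeSortHandle => "bypass-merge writer"
    case BaseSortHandle        => "merge-sort writer"
  }

  // Read side: there is only one reader, regardless of how the data was written.
  def readerFor(handle: ShuffleHandle): String = "the shuffle reader"

  def main(args: Array[String]): Unit =
    Seq(BypassMergeSortHandle, BaseSortHandle).foreach { h =>
      println(s"$h -> write via ${writerFor(h)}, read via ${readerFor(h)}")
    }
}
```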
Note that we recently reduced `spark.rapids.shuffle.multiThreaded.maxBytesInFlight` from 2GB to 128MB because of memory constraints (#9153), and this is an ideal knob to control the number of bytes we allow to be in flight in the decompression/decode threads. Another option, to disable only the MT reader side entirely, is to set `spark.rapids.shuffle.multiThreaded.reader.threads=0`. This is another tool if a user is having issues at shuffle read time only.
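As a quick illustration, here is a minimal Scala sketch of setting these two knobs when building a SparkSession. It assumes the spark-rapids plugin and its multi-threaded shuffle manager are already configured for the cluster (not shown), and the literal value format is an assumption; check the current config docs for defaults and accepted units.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the spark-rapids plugin and its multi-threaded
// shuffle manager are already set up for this cluster (configuration not shown).
val spark = SparkSession.builder()
  .appName("mt-shuffle-read-tuning")
  // Cap the bytes allowed in flight in the MT reader's decompression/decode
  // threads. 134217728 bytes = 128MB, the reduced default mentioned above;
  // the accepted value format is an assumption, verify against the docs.
  .config("spark.rapids.shuffle.multiThreaded.maxBytesInFlight", "134217728")
  // Alternatively, disable only the multi-threaded read side entirely by
  // giving the reader zero threads, as described above:
  // .config("spark.rapids.shuffle.multiThreaded.reader.threads", "0")
  .getOrCreate()
```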