-
What is your question? I'm curious and want to test whether spark-rapids benefits Spark ETL. The ETL job used is one of our heaviest.
After each step a dataframe.persist(StorageLevel.DISK_ONLY) is called, just to be on the safe side. In production we use AWS EMR 5 with 40 * r5.4xlarge nodes (each with 16 vCPU and 128 GB RAM; the equivalent of 160 xlarge in total), and the job above takes ~10 minutes.

To compare with spark-rapids I set up an EMR 6 cluster with 8 * g4dn.2xlarge nodes (8 vCPU, 32 GB RAM, 1 T4 GPU each; the equivalent of 16 xlarge in total, 1/10 the size of the production cluster) plus a driver, with df1 downsampled to 15% of the total (roughly 1/6) to save some time, and the job takes 15 minutes. Now let's do the approximate math: 1/6 of the data set on a cluster 1/10 the size, and the job takes 1.5x the time, so at a glance it's almost a draw. However, when I increased df1 from 1/6 to 1/3 of the total (with the number of Spark shuffle partitions and the salt-join factor also doubled), the ETL job didn't finish in 30 minutes; in fact it didn't finish after 1 hour, and failed tasks started to show up in the Spark web UI. Obviously OOM got in the way and further tuning/debugging is required. But even if we assume the ETL performance of spark-rapids scales ideally and linearly as more g4dn nodes are added to the cluster, it's still not a win over the r5.4xlarge cluster. That's not impressive to me, if not disappointing, especially since spark-rapids requires a more complicated Spark/YARN configuration. And on reflection it makes intuitive sense, since it's always easier to add more RAM than more VRAM (multi-GPU aside).

To wrap up, my real question, I guess, is: am I missing something, is this ETL job a suitable case, and where does the strength of spark-rapids ETL actually lie? And for that, is more advanced infrastructure like NVLink and an InfiniBand network needed? My Spark configuration is pasted below, if it helps. Thanks in advance.
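For reference, here is a minimal sketch of the salted-join-plus-persist pattern described above. The DataFrame names, the join key, and the salt column are assumptions for illustration; the real job is not shown in this post:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// Salt the skewed (large) side with a random value, replicate the other side
// once per salt value, join on (key, salt), then drop the salt column again.
def saltedJoin(large: DataFrame, small: DataFrame,
               key: String, saltFactor: Int): DataFrame = {
  val saltedLarge = large.withColumn("salt", (rand() * saltFactor).cast("int"))
  val saltedSmall = small.withColumn("salt",
    explode(array((0 until saltFactor).map(lit): _*)))

  saltedLarge
    .join(saltedSmall, Seq(key, "salt"))
    .drop("salt")
    .persist(StorageLevel.DISK_ONLY) // persisted after each step, as in the job above
}
```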
-
Some extra info, if it helps:
-
So there are several issues here, and we are working on all of them in one form or another.

The reason for the initial slowness comes down to moving data between the GPU and the CPU. In Spark 3.0 there is no way to accelerate dataframe.persist, so all we can do is convert the data back to rows and let the existing Spark code handle it. In Spark 3.1 we were able to work with the Spark community to make this part of Spark pluggable, and we are in the process of implementing code for that. But because of the API requirements this will not be available until after Spark 3.1 is released.

The UDF has a similar issue in that we do not have a generic compiler from Java to CUDA. This is a very difficult problem. If your UDF is simple we might be able to translate it into Catalyst operations, which we have experimental support for: https://github.com/NVIDIA/spark-rapids/tree/main/udf-compiler. If it is complex then all we can do right now is convert the data back into rows and let the CPU do the processing, which I suspect is what is happening.

We have found that moving data between the GPU and the CPU is often very expensive, not because of the data movement itself, which we can typically do at about 13 GB/sec, but because of cache coherency problems on the CPU. All caches hate it when you stride through memory, and walking columnar data in a row-major way really destroys the cache. In practice, with enough columns, I have seen this make pulling data back to the CPU take longer than it took to parse the same data from Parquet. We are also working on some optimizations for this (rapidsai/cudf#6578), but it may take a while before we have it fully in place.

The final issue is likely your skewed join. Without looking at your query it is hard for me to know exactly what is happening, but I suspect that your skewed join is causing the GPU to run out of memory. We currently do not support out-of-core joins. It is something we have on the roadmap, but we have been working on decimal support and nested type support first. Even though you are using a salted join to avoid skew, I still suspect this. We currently only support a hash join, which means the build-side table has to fit entirely in GPU memory along with the answer. Because this is fundamentally very different from the sort-merge join Spark uses by default, it is likely that we are not dividing the query as efficiently as we could or should.

@rlu-aa I really appreciate that you took the time to try this out. We are still trying really hard to get as much coverage for different operators as possible. I would encourage you to follow our progress and try again once Spark 3.1 is released.
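To make the UDF point above concrete, here is a hedged sketch (the column name and the constant are made up) of the kind of simple Scala UDF the experimental udf-compiler might be able to translate, next to the equivalent built-in expression that is already plain Catalyst and never needs translation in the first place:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// A row-at-a-time Scala UDF: opaque JVM code, so without translation the
// plugin has to convert columns back to rows and run it on the CPU.
val addTaxUdf = udf((price: Double) => price * 1.08)

def withTaxUdf(df: DataFrame): DataFrame =
  df.withColumn("price_with_tax", addTaxUdf(col("price")))

// The same logic written with built-in functions is a normal Catalyst
// expression, which the plugin can keep entirely on the GPU.
def withTaxExpr(df: DataFrame): DataFrame =
  df.withColumn("price_with_tax", col("price") * lit(1.08))
```

Where a UDF can be rewritten with built-in functions like this, that is generally a more reliable way to keep the plan on the GPU than relying on the experimental compiler.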
-
Thanks @revans2 for the prompt reply; that's a bit disappointing but understandable. Nevertheless, spark-rapids looks promising and I'll keep an eye on it.
-
@rlu-aa Recently we released the 22.04 GA release, which brings a lot of new features and improvements. For UDFs, if it is a Scala UDF it might be possible to convert it to a Catalyst expression by enabling the experimental UDF compiler setting. I would suggest you try the latest version of the Spark RAPIDS plugin and let us know if any help is needed.
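As an illustration only (the config keys below are taken from the spark-rapids documentation as I understand it and should be verified against the release you actually deploy), enabling the plugin and the experimental Scala-UDF-to-Catalyst translation from a Scala application might look like this:

```scala
import org.apache.spark.sql.SparkSession

// Sketch under the assumption that these keys match your plugin version;
// check the spark-rapids configuration docs before relying on them.
val spark = SparkSession.builder()
  .appName("etl-with-spark-rapids")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  // Opt in to the experimental Scala UDF -> Catalyst translation.
  .config("spark.rapids.sql.udfCompiler.enabled", "true")
  .getOrCreate()
```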