-
What is your question? I'm curious and want to test whether spark-rapids benefits Spark ETL. The ETL job used is one of our heaviest.
After each step a dataframe.persist(StorageLevel.DISK_ONLY) is called, just to be on the safe side. In production we use AWS EMR 5 with 40 * r5.4xlarge nodes (each with 16 vCPU and 128 GB RAM; the equivalent of 160 xlarge in total), and the job above takes ~10 minutes.

To compare with spark-rapids I set up an EMR 6 cluster with 8 * g4dn.2xlarge nodes (8 vCPU, 32 GB RAM, 1 T4 GPU each; the equivalent of 16 xlarge in total, 1/10 the size of the production cluster) plus a driver, with df1 downsampled to 15% of the total (roughly 1/6) to save some time, and the job takes 15 minutes. Now let's do the approximate math: 1/6 of the data set on a cluster 1/10 the size, and the job takes 1.5x the time, so at a glance it's almost a draw. However, when I increased df1 from 1/6 to 1/3 of the total (with the number of Spark shuffle partitions and the salt-join factor also doubled), the ETL job didn't finish in 30 minutes; in fact it didn't finish after 1 hour, and failed tasks started to show up in the Spark web UI. Obviously OOM got in the way and further tuning/debugging is required. But even if we assume the ETL performance of spark-rapids scales ideally and linearly as more g4dn nodes are added to the cluster, it's still not a win over the r5.4xlarge cluster. That's not impressive to me, if not disappointing, especially since spark-rapids requires a more complicated Spark/YARN configuration. And on reflection it makes intuitive sense, since it's always easier to add more RAM than more VRAM (multi-GPU aside).

To wrap up, my real question, I guess, is: am I missing something, is this ETL job a suitable case, and where does the strength of spark-rapids ETL actually lie? And for that, is more advanced infrastructure like NVLink and an InfiniBand network needed? My Spark configuration is pasted below, if it helps. Thanks in advance.
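For reference, here is a minimal sketch of the salted-join-plus-persist pattern described above. The DataFrame names, the join key, and the salt column are assumptions for illustration; the real job is not shown in this post:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// Salt the skewed (large) side with a random value, replicate the other side
// once per salt value, join on (key, salt), then drop the salt column again.
def saltedJoin(large: DataFrame, small: DataFrame,
               key: String, saltFactor: Int): DataFrame = {
  val saltedLarge = large.withColumn("salt", (rand() * saltFactor).cast("int"))
  val saltedSmall = small.withColumn("salt",
    explode(array((0 until saltFactor).map(lit): _*)))

  saltedLarge
    .join(saltedSmall, Seq(key, "salt"))
    .drop("salt")
    .persist(StorageLevel.DISK_ONLY) // persisted after each step, as in the job above
}
```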
-
Some extra info, if it helps:
-
So there are several issues here, and we are working on all of them in one form or another.

The reason for the initial slowness comes down to moving data between the GPU and the CPU. In Spark 3.0 there is no way to accelerate dataframe.persist, so all we can do is convert the data back to rows and let the existing Spark code handle it. In Spark 3.1 we were able to work with the Spark community to make this part of Spark pluggable, and we are in the process of implementing code for that. But because of the API requirements this will not be available until after Spark 3.1 is released.

The UDF has a similar issue in that we do not have a generic compiler from Java to CUDA. This is a very difficult problem. If your UDF is simple we might be able to translate it into Catalyst operations, which we have experimental support for: https://github.com/NVIDIA/spark-rapids/tree/main/udf-compiler. If it is complex then all we can do right now is convert the data back into rows and let the CPU do the processing, which I suspect is what is happening.

We have found that moving data between the GPU and the CPU is often very expensive, not because of the data movement itself, which we can typically do at about 13 GB/sec, but because of cache coherency problems on the CPU. All caches hate it when you stride through memory, and walking columnar data in a row-major way really destroys the cache. In practice, with enough columns, I have seen this make pulling data back to the CPU take longer than it took to parse the same data from Parquet. We are also working on some optimizations for this (rapidsai/cudf#6578), but it may take a while before we have it fully in place.

The final issue is likely your skewed join. Without looking at your query it is hard for me to know exactly what is happening, but I suspect that your skewed join is causing the GPU to run out of memory. We currently do not support out-of-core joins. It is something we have on the roadmap, but we have been working on decimal support and nested type support first. Even though you are using a salted join to avoid skew, I still suspect this. We currently only support a hash join, which means the build-side table has to fit entirely in GPU memory along with the answer. Because this is fundamentally very different from the sort-merge join Spark uses by default, it is likely that we are not dividing the query as efficiently as we could or should.

@rlu-aa I really appreciate that you took the time to try this out. We are still trying really hard to get as much coverage for different operators as possible. I would encourage you to follow our progress and try again once Spark 3.1 is released.
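To make the UDF point above concrete, here is a hedged sketch (the column name and the constant are made up) of the kind of simple Scala UDF the experimental udf-compiler might be able to translate, next to the equivalent built-in expression that is already plain Catalyst and never needs translation in the first place:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// A row-at-a-time Scala UDF: opaque JVM code, so without translation the
// plugin has to convert columns back to rows and run it on the CPU.
val addTaxUdf = udf((price: Double) => price * 1.08)

def withTaxUdf(df: DataFrame): DataFrame =
  df.withColumn("price_with_tax", addTaxUdf(col("price")))

// The same logic written with built-in functions is a normal Catalyst
// expression, which the plugin can keep entirely on the GPU.
def withTaxExpr(df: DataFrame): DataFrame =
  df.withColumn("price_with_tax", col("price") * lit(1.08))
```

Where a UDF can be rewritten with built-in functions like this, that is generally a more reliable way to keep the plan on the GPU than relying on the experimental compiler.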
-
Thanks @revans2 for the prompt reply; that's a bit disappointing but understandable. Nevertheless, spark-rapids looks promising and I'll keep an eye on it.
-
@rlu-aa Recently we released the 22.04 GA release, which brings a lot of new features and improvements. For UDFs, if it is a Scala UDF it might be possible to convert it to a Catalyst expression by enabling the experimental UDF compiler setting. I would suggest you try the latest version of the Spark RAPIDS plugin and let us know if any help is needed.
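As an illustration only (the config keys below are taken from the spark-rapids documentation as I understand it and should be verified against the release you actually deploy), enabling the plugin and the experimental Scala-UDF-to-Catalyst translation from a Scala application might look like this:

```scala
import org.apache.spark.sql.SparkSession

// Sketch under the assumption that these keys match your plugin version;
// check the spark-rapids configuration docs before relying on them.
val spark = SparkSession.builder()
  .appName("etl-with-spark-rapids")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true")
  // Opt in to the experimental Scala UDF -> Catalyst translation.
  .config("spark.rapids.sql.udfCompiler.enabled", "true")
  .getOrCreate()
```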