-
I'm getting this in the SQL explain plan. What exactly does it mean? There's a join in the query against a Vertica DB, is that it?
Replies: 7 comments
-
It means that this small part of your query is converting an RDD of rows, produced on the CPU by that data source, into the columnar format the GPU operators work with.
If you are okay with that solution I will file a follow-on issue to do that.
-
@revans2 I am not sure I understand what you suggest. To add this to a list of things that won't run on the GPU, so it would be more visible in the SQL plan? One more note though: it's a lookup table (we actually have quite a few like these on Vertica for some reason, while the real data is in Parquet files), so maybe it could be read once into memory and kept there, since it's small? Otherwise, if there's a join with that table in a low-level part of the query, wouldn't it pause the GPU every time it fetches data from the JDBC data source, and hurt performance?
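Something like the sketch below is what I mean by "read it once and keep it there". It's only a rough illustration; the connection details, table names, path, and join key are all placeholders, not our actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("vertica-lookup-cache").getOrCreate()

// Read the small lookup table once over JDBC (connection details are placeholders).
val lookup = spark.read
  .format("jdbc")
  .option("url", "jdbc:vertica://vertica-host:5433/mydb") // hypothetical URL
  .option("dbtable", "public.lookup_table")               // hypothetical table name
  .option("user", "someuser")                             // hypothetical credentials
  .option("password", "somepassword")
  .load()
  .cache()                                                // keep it in executor memory after the first read

// Force the one-time read so later joins reuse the cached copy instead of
// going back to Vertica every time.
lookup.count()

// Hint Spark to broadcast the small table so the join side with the real data
// never waits on a re-fetch; the fact-table path and join key are placeholders.
val fact   = spark.read.parquet("s3a://some-bucket/path/to/fact")
val joined = fact.join(broadcast(lookup), Seq("key"))
joined.show()
```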
-
Okay, I'll get into some architecture here to try and explain things. Reading data into Spark usually involves a few operations. Note that the order of these operations and the machine they run on can change based on what the input format is.

For file formats like Parquet and ORC stored in a blob store like S3, we can only really accelerate the data decoding. Using the metadata to figure out exactly what data to read from S3 is done on the CPU, because it does not fit well with what the GPU is good at. Also, transferring the raw bytes currently has to go through the HDFS API, which does not offer any way to move the data more quickly to the GPU, so that stays on the CPU too.

There is a second group of input formats, like JDBC and the Vertica connector. They hide all of the predicate push down, data transfer, and most of the decoding steps. The only step that is exposed appears to be the RowDataSourceScanExec. What is more, even if we dug into the details of how the JDBC or Vertica connectors work, fundamentally they are sending a query to another group of servers to do all of the processing, so without working very closely with the database vendor there is really no hope for us to accelerate anything using the GPU. With all of that in mind we really have two choices.
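To make the "only the scan step is exposed" point concrete, here is a rough sketch of a plain JDBC read plus `explain()`. The connection details, table, and column names are placeholders, not from your setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-scan-plan").getOrCreate()

// A JDBC source hides predicate push down, transfer, and decoding inside the
// connector; from Spark's point of view the whole thing is one scan operator.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:vertica://vertica-host:5433/mydb") // hypothetical connection
  .option("dbtable", "public.some_table")                 // hypothetical table
  .option("user", "someuser")
  .option("password", "somepassword")
  .load()
  .filter("id > 100") // simple predicates like this are pushed down into the database

// The physical plan shows a single row-based scan for the JDBC source,
// which is the RowDataSourceScanExec you are seeing in your explain output.
df.explain(true)
```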
-
I see @revans2, thanks a lot for the detailed explanation :) Again, thanks a lot!
-
For HDFS-like APIs, yes, we can speed up the data transfer with something like that. NVIDIA has GPU Direct Storage, but Magnum IO is not common enough in the big data space to make it worth spending much time on right now. For JDBC/Vertica we would need to work with them to provide some kind of way to export the data through an RDMA transfer, and to have support for decoding whatever their data format is on the GPU. Even then it might not speed things up noticeably. If the data is small, like for a broadcast, then it is not likely to be contributing much to the total run time of the query. You should look at how long the broadcast tasks took to complete. My guess is that they are very small compared to the other parts of the processing.
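If you want a quick number instead of digging through the Spark UI, something like this gives a rough comparison. It assumes you are in spark-shell where `spark` is already defined, and that the lookup and fact tables are registered as views; the names are placeholders:

```scala
// spark.time prints the wall-clock time for the block it wraps.

spark.time {
  // Materialize just the small JDBC-backed lookup table.
  spark.table("vertica_lookup").count()
}

spark.time {
  // Run the full query that joins the lookup against the Parquet data.
  spark.sql("SELECT * FROM fact JOIN vertica_lookup USING (key)").count()
}
```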
-
@revans2 Yes, you are correct. The Vertica tables are on the order of a few thousand rows, as far as I could see.
-
In the example you showed, the data would be sent to the CPU, and then we transfer it to the GPU right after the RowDataSourceScanExec.
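If you want to confirm that in your own plan, a rough way to check is below. It assumes `df` is the DataFrame for the query that reads the Vertica table, and the operator names it looks for are an assumption that can vary by Spark and plugin version:

```scala
// Print just the scan line and the row-to-columnar transition line from the
// executed physical plan; the transition should sit directly above the scan.
val planText = df.queryExecution.executedPlan.toString
planText
  .split("\n")
  .filter(line => line.contains("RowDataSourceScan") || line.contains("RowToColumnar"))
  .foreach(println)
```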