-
I'm getting this in the SQL explain plan. What exactly does it mean? There's a join in the query against a Vertica DB, is that it?
Replies: 7 comments
-
It means that this small part of your query is converting an RDD of rows, produced on the CPU by that data source, into the columnar format the GPU operators work with.
If you are okay with that solution I will file a follow-on issue to do that.
-
@revans2 I am not sure I understand what you suggest. To add this to a list of things that won't run on the GPU, so it would be more visible in the SQL plan? One more note though: it's a lookup table (we actually have quite a few like these on Vertica for some reason, while the real data is in Parquet files), so maybe it could be read once into memory and kept there, since it's small? Otherwise, if there's a join with that table in a low-level part of the query, wouldn't it pause the GPU every time it fetches data from the JDBC data source, and hurt performance?
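Something like the sketch below is what I mean by "read it once and keep it there". It's only a rough illustration; the connection details, table names, path, and join key are all placeholders, not our actual setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("vertica-lookup-cache").getOrCreate()

// Read the small lookup table once over JDBC (connection details are placeholders).
val lookup = spark.read
  .format("jdbc")
  .option("url", "jdbc:vertica://vertica-host:5433/mydb") // hypothetical URL
  .option("dbtable", "public.lookup_table")               // hypothetical table name
  .option("user", "someuser")                             // hypothetical credentials
  .option("password", "somepassword")
  .load()
  .cache()                                                // keep it in executor memory after the first read

// Force the one-time read so later joins reuse the cached copy instead of
// going back to Vertica every time.
lookup.count()

// Hint Spark to broadcast the small table so the join side with the real data
// never waits on a re-fetch; the fact-table path and join key are placeholders.
val fact   = spark.read.parquet("s3a://some-bucket/path/to/fact")
val joined = fact.join(broadcast(lookup), Seq("key"))
joined.show()
```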
-
Okay, I'll get into some architecture here to try and explain things. Reading data into Spark usually involves a few operations. Note that the order of these operations and the machine they run on can change based on what the input format is.

For file formats like Parquet and ORC stored in a blob store like S3, we can only really accelerate the data decoding. Using the metadata to figure out exactly what data to read from S3 is done on the CPU, because it does not fit well with what the GPU is good at. Also, transferring the raw bytes currently has to go through the HDFS API, which does not offer any way to move the data more quickly to the GPU, so that stays on the CPU too.

There is a second group of input formats, like JDBC and the Vertica connector. They hide all of the predicate push down, data transfer, and most of the decoding steps. The only step that is exposed appears to be the RowDataSourceScanExec. What is more, even if we dug into the details of how the JDBC or Vertica connectors work, fundamentally they are sending a query to another group of servers to do all of the processing, so without working very closely with the database vendor there is really no hope for us to accelerate anything using the GPU. With all of that in mind we really have two choices.
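To make the "only the scan step is exposed" point concrete, here is a rough sketch of a plain JDBC read plus `explain()`. The connection details, table, and column names are placeholders, not from your setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-scan-plan").getOrCreate()

// A JDBC source hides predicate push down, transfer, and decoding inside the
// connector; from Spark's point of view the whole thing is one scan operator.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:vertica://vertica-host:5433/mydb") // hypothetical connection
  .option("dbtable", "public.some_table")                 // hypothetical table
  .option("user", "someuser")
  .option("password", "somepassword")
  .load()
  .filter("id > 100") // simple predicates like this are pushed down into the database

// The physical plan shows a single row-based scan for the JDBC source,
// which is the RowDataSourceScanExec you are seeing in your explain output.
df.explain(true)
```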
-
I see @revans2, thanks a lot for the detailed explanation :) Again, thanks a lot!
-
For HDFS-like APIs, yes, we can speed up the data transfer with something like that. NVIDIA has GPU Direct Storage, but Magnum IO is not common enough in the big data space to make it worth spending much time on right now. For JDBC/Vertica we would need to work with them to provide some kind of way to export the data through an RDMA transfer, and to have support for decoding whatever their data format is on the GPU. Even then it might not speed things up noticeably. If the data is small, like for a broadcast, then it is not likely to be contributing much to the total run time of the query. You should look at how long the broadcast tasks took to complete. My guess is that they are very small compared to the other parts of the processing.
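If you want a quick number instead of digging through the Spark UI, something like this gives a rough comparison. It assumes you are in spark-shell where `spark` is already defined, and that the lookup and fact tables are registered as views; the names are placeholders:

```scala
// spark.time prints the wall-clock time for the block it wraps.

spark.time {
  // Materialize just the small JDBC-backed lookup table.
  spark.table("vertica_lookup").count()
}

spark.time {
  // Run the full query that joins the lookup against the Parquet data.
  spark.sql("SELECT * FROM fact JOIN vertica_lookup USING (key)").count()
}
```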
-
@revans2 Yes, you are correct. The Vertica tables are on the order of a few thousand rows, as far as I could see.
-
In the example you showed, the data would be sent to the CPU, and then we transfer it to the GPU right after the RowDataSourceScanExec.
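If you want to confirm that in your own plan, a rough way to check is below. It assumes `df` is the DataFrame for the query that reads the Vertica table, and the operator names it looks for are an assumption that can vary by Spark and plugin version:

```scala
// Print just the scan line and the row-to-columnar transition line from the
// executed physical plan; the transition should sit directly above the scan.
val planText = df.queryExecution.executedPlan.toString
planText
  .split("\n")
  .filter(line => line.contains("RowDataSourceScan") || line.contains("RowToColumnar"))
  .foreach(println)
```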