[FEA] replace GpuProjectExec.project API with one that returns an Iterator #7258
Labels
feature request
New feature or request
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
Is your feature request related to a problem? Please describe.
CUDF places a number of limitations on columns of data, and the hardware that we are running on imposes limitations of its own. #7253 is an attempt to avoid running out of memory for the operations that SparkPlan exec nodes perform. It does this by trying to estimate how much memory is going to be needed to process the data. But most SparkPlan exec nodes use Expressions in one form or another to help implement how they process the data. The problem with Expressions is that they are so varied that in many cases it is impossible to guess how much memory is going to be needed before we actually process the data. In many cases it is also impossible to know whether we are going to exceed the limits that CUDF places on column vectors, even when plenty of GPU memory is available.
As such, we want to change the way we process expressions so that the processing stays under a budget instead of trying to request more data.
The first step in doing this is to remove the old API, which takes a single batch as input and produces a single batch of output. Instead we should move towards an API that returns an Iterator of output batches and is based off of the tiered project code that already exists.
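To make the difference between the two API shapes concrete, here is a minimal sketch in plain Scala. It is not the spark-rapids code: `Batch`, `Expr`, `BudgetedProject`, `projectIter`, and the `budgetBytes` parameter are all hypothetical stand-ins, and it assumes every output value is a fixed 8 bytes so that the output size can be computed up front. Real Expressions are exactly the case where that assumption does not hold, so treat this only as an illustration of the iterator contract.

```scala
// Hypothetical stand-in for a columnar batch; not a spark-rapids or cuDF type.
final case class Batch(columns: Vector[Vector[Long]]) {
  def numRows: Int = if (columns.isEmpty) 0 else columns.head.length
  def slice(from: Int, until: Int): Batch =
    Batch(columns.map(_.slice(from, until)))
}

object BudgetedProject {
  // Hypothetical expression: computes one output column from an input batch.
  type Expr = Batch => Vector[Long]

  // Assumption for this sketch only: every output value is a fixed 8 bytes.
  private val BytesPerValue = 8L

  // Old shape: one batch in, exactly one batch out.
  def project(input: Batch, exprs: Seq[Expr]): Batch =
    Batch(exprs.map(e => e(input)).toVector)

  // New shape: one batch in, an Iterator of batches out. Each output batch is
  // sized to fit within budgetBytes under the fixed-width assumption above.
  // A single row is always emitted, even if that row alone exceeds the budget.
  def projectIter(input: Batch, exprs: Seq[Expr], budgetBytes: Long): Iterator[Batch] = {
    val bytesPerRow = math.max(1L, exprs.length * BytesPerValue)
    val rowsPerChunk =
      math.max(1L, math.min(budgetBytes / bytesPerRow, input.numRows.toLong)).toInt
    // Chunks are produced lazily, so only one output batch is built at a time.
    Iterator.range(0, input.numRows, rowsPerChunk).map { start =>
      val end = math.min(start + rowsPerChunk, input.numRows)
      project(input.slice(start, end), exprs)
    }
  }
}
```

A caller that previously consumed a single result batch would instead drain the iterator, for example `BudgetedProject.projectIter(batch, exprs, budget).foreach(consume)`, where `consume` is whatever the exec node does with each output batch. Call sites that cannot easily handle more than one output batch would be the problematic locations to track.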
At a minimum we need to implement enough of #7253 before this so that we can get the budget from the GpuMemoryLeaseManager. In the short term this is not going to actually change the results at all. It is just going to change the API and let us find the different locations where we might run into problems trying to do this.
In the short term those problematic locations can be left for follow-on issues, but we need to make sure that we file those issues and figure out a plan for how to support them.