[FEA] Create a new Expression execution framework #7260
Labels
feature request
New feature or request
reliability
Features to improve reliability or bugs that severly impact the reliability of the plugin
Is your feature request related to a problem? Please describe.
After #7258 is at least partly done, and we have a few initial expressions from #7259 we should start to work on a framework to execute these expressions. This will likely involve a lot of back and forth with #7259 to refine things. But it is there to at least get a starting point.
The ideal for a framework like this would be to reduce the total memory usage, but that is kind of a hard problem to solve, so for the time being it will just be to keep track of the memory usage and avoid running out of memory. We may find ways to optimize memory usage in the future. The execution API from #7258 should be used. We should have a config to turn this off if we run into issues. It should dedupe expressions like tiered project does. I hope that the tiered project is stable enough that we can just have the config decide if we want to do a regular tiered project or use the memory aware version. In the memory aware version instead of executing an expression tree all at once, it should split it up. Anything under a Stateful modifying expression, including those expressions, will need to be executed as an entire group for now. For expressions that have not been updated yet, they will need to run as a part of a tree, but we can probably rewrite them with new children that are bound to specific inputs to be able to isolate them. For all other expressions we will want to try and keep everything under budget.
This would mean that we subtract the input batch size from the current budget. Starting at the leaf expressions and working towards the root we should ask each expression if they can predict up front what the size will be, and if so ask them for that size (based on the number of input rows). I think in the short term it would be best to find a subtree that can execute within the remaining budget and run it, then repeat. But if you just want to make things work we can treat them just like predictable expressions. For expressions that can only predict what they need, we need to execute these in isolation. We would need to pass their children to them and ask them to either execute it within the budget or return a response that indicates that they could not. If we see that they cannot run within budget, then we will split all pending results in half (based on the number of rows), making one half of it spillable, and the other half we will keep for execution. If we are down to one row, then we need to fail because we cannot spit it. If we think this is going to be really rare, we could also throw away all of the intermediate results and just split the input batch and redo all of the computation. I don’t like this as a long term solution, but it could be a good way to get started quickly.
If we don't do the long term solution, then we should file follow on issues, or have a good reason why we are doing it differently.
The text was updated successfully, but these errors were encountered: