Following from a discussion in Slack: the way MemoryDataset decides which of the three copy modes to use is very straightforward, but in my opinion it has not kept pace in the four years since its original implementation.
If no copy_mode parameter is passed, it uses the function _infer_copy_mode.
For pandas dataframes and (I think) Spark dataframes this is the correct logic, but it is too restrictive in making everything else deepcopy by default.
This is a problem when using Ibis Tables, which are not serializable, as discussed in #2374 (comment).
The current workaround is simple: add the dataset to the catalog and manually set it to be a MemoryDataset with copy_mode: assign. But this rapidly becomes cumbersome when passing tables between many nodes, and soon becomes an unmaintainable hell.
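Concretely, the workaround means an entry like this for every intermediate table (the dataset name here is illustrative):

```yaml
# catalog.yml -- repeated for every intermediate Ibis table
my_intermediate_table:
  type: MemoryDataset
  copy_mode: assign
```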
I would like to be able to influence the logic in _infer_copy_mode by either:
- supplying a "mapping of default copy mode for datatype" (@deepyaman), or
- overriding the default MemoryDataset with a subclass
@noklam correctly pointed out that this last one could be done using dataset factories, like so:
```yaml
# catalog.yml
{_catch_all_}:
  type: my.own.MemoryDataset  # with different _infer_copy_mode()
  # OR
  copy_mode: "assign"

# OR

{table_name#_ibis}:  # haven't used factories yet myself, quoting the syntax from memory
  type: MemoryDataset
  copy_mode: "assign"
```
But we all agreed this wasn't the most ergonomic way to do it, as it forces you to repeat the setup in every project.
I would welcome your thoughts on this or any solutions I may have missed (some sort of hook?)
The possibility of extending _infer_copy_mode in the core code to be compatible with more packages was also raised.
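One way the "mapping of default copy mode per datatype" idea could look is a user-extensible registry keyed by fully qualified type name. This is entirely hypothetical: neither the registry nor this function exists in Kedro today.

```python
from typing import Any, Optional

# Hypothetical: a user-extensible mapping from fully qualified type name
# to copy mode. Nothing here is part of Kedro's actual API.
COPY_MODE_REGISTRY: dict = {
    "pandas.core.frame.DataFrame": "copy",
    "ibis.expr.types.relations.Table": "assign",
}


def infer_copy_mode(data: Any, registry: Optional[dict] = None) -> str:
    """Look up the copy mode by the object's fully qualified type name."""
    if registry is None:
        registry = COPY_MODE_REGISTRY
    cls = type(data)
    return registry.get(f"{cls.__module__}.{cls.__qualname__}", "deepcopy")
```

Projects (or plugins) could then register extra types without subclassing MemoryDataset at all.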
I'd also highlight that the line intended to detect whether we're talking about a Spark dataframe is probably quite fragile, and possibly a relic of assumptions that made sense a few years ago but not anymore...
@deepyaman I just realized that subclassing wouldn't be as easy an option as I thought: _infer_copy_mode is a function, not a method, and it's called within the _load and _save methods, so you would need to override both of them, unless you preemptively set the copy_mode in the __init__ of the subclass.
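A sketch of that last workaround, using a minimal stand-in for MemoryDataset (the real class lives in kedro.io; this stand-in only mirrors the relevant behavior, namely that _load/_save fall back to _infer_copy_mode when copy_mode is None):

```python
class MemoryDataset:
    """Minimal stand-in for kedro.io.MemoryDataset: when copy_mode is None,
    _load/_save would fall back to the module-level _infer_copy_mode()."""

    def __init__(self, copy_mode=None):
        self._copy_mode = copy_mode


class AssignMemoryDataset(MemoryDataset):
    """Pins copy_mode up front, so _infer_copy_mode is never consulted
    and _load/_save need no overriding."""

    def __init__(self, copy_mode=None):
        super().__init__(copy_mode=copy_mode or "assign")
```

Because the mode is fixed in __init__, the subclass never hits the None branch that triggers inference, which is why only the constructor needs overriding.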