Improve MemoryDataset copy_mode selection #3551

inigohidalgo · 2024-01-23T18:21:39Z

Description

Following from discussion in Slack

The way MemoryDataset decides which of the 3 copy modes to use is very straightforward, but in my opinion is lacking 4 years after its original implementation.

If no parameter is passed, it uses the function _infer_copy_mode

In the case of Pandas dataframes and (I think) Spark dataframes, this is the correct logic, but is too restrictive in its logic to make everything else deepcopy by default.

This is a problem when using Ibis Tables, which are not serializable, as discussed here #2374 (comment)

The solution is simple: to add the dataset to the catalog and manually set it to be a MemoryDataset with copy_mode: assign, but this rapidly becomes cumbersome when passing tables between many nodes, and becomes an unmaintainable hell.

I would like to be able to influence the logic in _infer_copy_mode by either

supplying a "mapping of default copy mode for datatype" (@deepyaman)
overriding the default memorydataset with a subclass

@noklam correctly pointed out this last one could be done using dataset factories like so

# catalog.yml

{_catch_all_}:
  type: my.own.MemoryDataset # with different _infer_copy_mode()
OR
  copy_mode: "assign"

OR

{table_name#_ibis}: # haven't used factories yet myself, quoting the syntax from memory
  type: MemoryDataset
  copy_mode: "assign"

But we all agreed this wasn't the most ergonomic way to do it, as it forces you to do it in every project.

I would welcome your thoughts on this or any solutions I may have missed (some sort of hook?)

the possibility of extending _infer_copy_mode to be compatible with more packages as part of the core code was also raised

The text was updated successfully, but these errors were encountered:

datajoely · 2024-01-23T18:28:38Z

I'd also highlight this line intended detect if we're talking about a Spark dataframe or not is probably quite fragile and is possibly a relic of assumptions that made sense a few years ago but not anymore...

kedro/kedro/io/memory_dataset.py

Line 106 in e187dca

elif type(data).__name__ == "DataFrame":

inigohidalgo · 2024-01-23T18:46:59Z

kedro/kedro/io/memory_dataset.py

Lines 60 to 70 in e187dca

    
           def _load(self) -> Any: 
        
               if self._data is _EMPTY: 
        
                   raise DatasetError("Data for MemoryDataset has not been saved yet.") 
        
               copy_mode = self._copy_mode or _infer_copy_mode(self._data) 
        
               data = _copy_with_mode(self._data, copy_mode=copy_mode) 
        
               return data 
        
           def _save(self, data: Any) -> None: 
        
               copy_mode = self._copy_mode or _infer_copy_mode(data) 
        
               self._data = _copy_with_mode(data, copy_mode=copy_mode)

@deepyaman I just realized that subclassing wouldn't be as easy an option as I thought as _infer_copy_mode is a function, not a method, and it's called within the _load and _save methods, so would need to override both of them, unless you preemptively set the copy_mode in the __init__ of the subclass

inigohidalgo added the Issue: Feature Request New feature or improvement to existing feature label Jan 23, 2024

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report #3582

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve MemoryDataset copy_mode selection #3551

Improve MemoryDataset copy_mode selection #3551

inigohidalgo commented Jan 23, 2024 •

edited

Loading

datajoely commented Jan 23, 2024

inigohidalgo commented Jan 23, 2024

Improve MemoryDataset copy_mode selection #3551

Improve MemoryDataset copy_mode selection #3551

Comments

inigohidalgo commented Jan 23, 2024 • edited Loading

Description

datajoely commented Jan 23, 2024

inigohidalgo commented Jan 23, 2024

inigohidalgo commented Jan 23, 2024 •

edited

Loading