Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MemoryDataset copy_mode selection #3551

Open
inigohidalgo opened this issue Jan 23, 2024 · 2 comments
Open

Improve MemoryDataset copy_mode selection #3551

inigohidalgo opened this issue Jan 23, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@inigohidalgo
Copy link
Contributor

inigohidalgo commented Jan 23, 2024

Description

Following from discussion in Slack

The way MemoryDataset decides which of the 3 copy modes to use is very straightforward, but in my opinion is lacking 4 years after its original implementation.

If no parameter is passed, it uses the function _infer_copy_mode

In the case of Pandas dataframes and (I think) Spark dataframes, this is the correct logic, but is too restrictive in its logic to make everything else deepcopy by default.

This is a problem when using Ibis Tables, which are not serializable, as discussed here #2374 (comment)

The solution is simple: to add the dataset to the catalog and manually set it to be a MemoryDataset with copy_mode: assign, but this rapidly becomes cumbersome when passing tables between many nodes, and becomes an unmaintainable hell.

I would like to be able to influence the logic in _infer_copy_mode by either

  • supplying a "mapping of default copy mode for datatype" (@deepyaman)
  • overriding the default memorydataset with a subclass

@noklam correctly pointed out this last one could be done using dataset factories like so

# catalog.yml

{_catch_all_}:
  type: my.own.MemoryDataset # with different _infer_copy_mode()
OR
  copy_mode: "assign"

OR

{table_name#_ibis}: # haven't used factories yet myself, quoting the syntax from memory
  type: MemoryDataset
  copy_mode: "assign"

But we all agreed this wasn't the most ergonomic way to do it, as it forces you to do it in every project.

I would welcome your thoughts on this or any solutions I may have missed (some sort of hook?)

  • the possibility of extending _infer_copy_mode to be compatible with more packages as part of the core code was also raised
@inigohidalgo inigohidalgo added the Issue: Feature Request New feature or improvement to existing feature label Jan 23, 2024
@datajoely
Copy link
Contributor

I'd also highlight this line intended detect if we're talking about a Spark dataframe or not is probably quite fragile and is possibly a relic of assumptions that made sense a few years ago but not anymore...

elif type(data).__name__ == "DataFrame":

@inigohidalgo
Copy link
Contributor Author

def _load(self) -> Any:
if self._data is _EMPTY:
raise DatasetError("Data for MemoryDataset has not been saved yet.")
copy_mode = self._copy_mode or _infer_copy_mode(self._data)
data = _copy_with_mode(self._data, copy_mode=copy_mode)
return data
def _save(self, data: Any) -> None:
copy_mode = self._copy_mode or _infer_copy_mode(data)
self._data = _copy_with_mode(data, copy_mode=copy_mode)

@deepyaman I just realized that subclassing wouldn't be as easy an option as I thought as _infer_copy_mode is a function, not a method, and it's called within the _load and _save methods, so would need to override both of them, unless you preemptively set the copy_mode in the __init__ of the subclass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

2 participants