Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

circularity in AbstractDataContainer definitions #57

Closed
CarloLucibello opened this issue Feb 20, 2022 · 3 comments · Fixed by #59
Closed

circularity in AbstractDataContainer definitions #57

CarloLucibello opened this issue Feb 20, 2022 · 3 comments · Fixed by #59

Comments

@CarloLucibello
Copy link
Member

CarloLucibello commented Feb 20, 2022

We currently have the following definitions for AbstractDataContainer where getindex falls back to getobs

abstract type AbstractDataContainer end

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > length(x)) ? nothing : (x[state], state + 1)
Base.lastindex(x::AbstractDataContainer) = length(x)

and on the other end we have the generic fallback for getobs

getobs(x, i) = getindex(x, i)
numobs(x) = length(x)

I find this circularity a bit confusing and think it should be avoided. I suggest we change AbstractDataContainer to

abstract type AbstractDataContainer end

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > numobs(x)) ? nothing : (getobs(x, state), state + 1)
Base.lastindex(x::AbstractDataContainer) = numobs(x)

Then types inheriting from AbstractDataContainer:

  • Can implement getindex if they want both the "indexing" interface and the "observable" interface.
  • Implement just getobs if for some reason they don't want to expose an indexing interface
  • Implement both getobs and getindex if the two interfaces serve different purposes (e.g. as with arrays)

As an addendum, let me remark that with the getobs(x, i) = getindex(x, i) fallback we are basically saying that we consider a Dataset any type implementing getindex, which is something that maybe we should document more. Should defining getindex be the recommended way for defining custom dataset types (even if not subtyping AbstractDataContainer)?

@darsnack
related to JuliaML/MLDatasets.jl#96

@darsnack
Copy link
Member

darsnack commented Feb 20, 2022

I agree its very confusing to have these definitions. The way I make sense of it is that we have an equivalence: getobs <==> getindex. For types that don't implement anything in MLUtils, the fallback getobs(data, idx) = data[idx] ensures the equivalence. For types that do implement our interface, we automatically define getindex(data, i) = getobs(data, i) to reduce boilerplate (and ensure the equivalence holds). So, it is circular but a given type should only ever take one half of the circle path.

The main reason that I added these definitions to AbstractDataContainer is because I became very frustrated using MLDataPattern.jl in the REPL to inspect my data. No one wants to type out getobs(data, i), it's natural habit to just do data[i]. It was annoying to me that this randomly failed for certain containers in MLDataPattern.jl.

All this being said, I really like this proposal:

Should defining getindex be the recommended way for defining custom dataset types

This will ensure that any type has indexing and length working when it behaves like a vector. Thanks to the fallbacks, it should work with MLUtils.jl. And any type that has multiple dimensions can additionally specify a custom getobs. Seems like the cleanest solution to me.

For a lot of cases, this also means being able to work with the package without adding an MLUtils.jl dep.

@CarloLucibello
Copy link
Member Author

CarloLucibello commented Feb 21, 2022

So we can remove these lines?

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)

@darsnack
Copy link
Member

Yeah let's do it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants