Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Purity assumption of map #63

Closed
oxinabox opened this issue Apr 1, 2021 · 7 comments · Fixed by #71
Closed

Purity assumption of map #63

oxinabox opened this issue Apr 1, 2021 · 7 comments · Fixed by #71

Comments

@oxinabox
Copy link
Contributor

oxinabox commented Apr 1, 2021

Originally posted by @bkamins in #44 (comment)

Also the problem is that function passed to map does not have to be pure, in which case you should apply it to each element of the array.
Related is: #36.

By the functional definition of map it kinda should be pure.
But that isn't what julia has, so it's probably not great to assume that.
It might be nice though if there was a pure_map (in DataAPI maybe?) that is documented to be assuming that the function is pure.
And that falls back to map if not overloaded (e.g. by PooledArray), or possibly even for large arrays to a memorized version of map (could even go so far as to do a little tuning step to workout how large)

@nalimilan
Copy link
Member

I think the gains are such that it's worth assuming that map is pure for PooledArray.

@shashi
Copy link
Collaborator

shashi commented Aug 12, 2021

Yeah it should be clearly documented by DataFrames, and an alternative function or a collect should be suggested if side effects are required.

@bkamins
Copy link
Member

bkamins commented Aug 12, 2021

Yeah it should be clearly documented by DataFrames

I am not sure what you mean here. DataFrames.jl is not aware of PooledArrays.jl in that part of code. It is using generic map.

I would say the question is the following. The docstring of map in Julia Base states:

Transform collection c by applying f to each element.

Which explicitly promises to apply f to each element of c.

The comment by @nalimilan in JuliaData/DataFrames.jl#2837 (comment) has the following context:

  1. DataFrames.jl uses map assuming the contract specified above (apply f to each element of c)
  2. PooledArrays.jl does not follow this contract
  3. Users of DataFrames.jl feel confused as they are not even calling map (it is called without them knowing this happens)
  4. My fix in more careful test of ByRow for PooledArray DataFrames.jl#2837 avoids using map in DataFrames.jl, but - if I understand @nalimilan correctly - he would prefer to first establish how map would be handled in PooledArrays.jl as if we decide to stop assuming f is pure then I do not need to change DataFrames.jl and we can keep using map there
  5. In general I would say that users of PooledArrays.jl will also be confused by what map currently do - if we keep current behavior we should add a docstring for map in PooledArrays.jl so that users are not confused (and then in DataFrames.jl we should implement the change I proposed in more careful test of ByRow for PooledArray DataFrames.jl#2837).

@quinnj
Copy link
Member

quinnj commented Aug 12, 2021

It does seem like we'd be better off making map on PooledArray do the more natural thing; issues have come up several times now.

@nalimilan
Copy link
Member

That's too bad for performance, but I have to admit that this issue keeps being raised... How about having a keyword argument pure=false to opt-in to the fast method?

Another, probably too clever solution: call f on the first entries, and as soon as a duplicate value is encountered, check whether the value that f returns for it is equal to the one that f returned for the previous call on the same value. If that's the case, assume purity. If not, proceed calling f on all remaining elements. That would be correct as long as f is deterministic. The only case where it could give incorrect results is when returning a random number that happens to be equal for the two calls by mere chance.

@bkamins
Copy link
Member

bkamins commented Aug 13, 2021

check whether the value that f returns for it is equal to the one that f returned

I feel is too complex.

How about having a keyword argument pure=false to opt-in to the fast method?

pure::Bool=false looks good to me. If we agree on this I can implement it (as I want to resolve JuliaData/DataFrames.jl#2837 soon for 1.3 release of DataFrames.jl).

@quinnj
Copy link
Member

quinnj commented Aug 13, 2021

Yeah, pure::Bool=false seems good to me too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants