Purity assumption of map #63

oxinabox · 2021-04-01T17:28:40Z

Originally posted by @bkamins in #44 (comment)

Also, the problem is that the function passed to map does not have to be pure, in which case you should apply it to each element of the array.
Related is: #36.

By the functional definition of map it kinda should be pure.
But that isn't what Julia has, so it's probably not great to assume that.
It might be nice, though, if there were a pure_map (in DataAPI maybe?) that is documented as assuming the function is pure.
It would fall back to map if not overloaded (e.g. by PooledArray), or possibly, for large arrays, to a memoized version of map (it could even go so far as to do a little tuning step to work out how large).
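
For illustration, the memoized fallback suggested above could look roughly like the following Python sketch. The name pure_map and its behavior are hypothetical (it is not an existing DataAPI or PooledArrays.jl function), and the sketch assumes that f is pure and that the elements are hashable.

```python
def pure_map(f, xs):
    """Map f over xs, calling f only once per distinct value.

    Hypothetical helper: only correct if f is pure and elements are hashable.
    """
    cache = {}  # value -> f(value), filled on first encounter
    out = []
    for x in xs:
        if x not in cache:
            cache[x] = f(x)
        out.append(cache[x])
    return out


calls = []

def shout(s):
    calls.append(s)  # record every real call so the saving is visible
    return s.upper()

result = pure_map(shout, ["a", "b", "a", "a", "b"])
```

Here shout is called only for the two distinct inputs, while result still has one entry per element. With a function that has side effects, those side effects would also happen only twice, which is the behavior this issue is about.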

nalimilan · 2021-04-06T21:08:35Z

I think the gains are such that it's worth assuming that map is pure for PooledArray.

shashi · 2021-08-12T15:27:31Z

Yeah it should be clearly documented by DataFrames, and an alternative function or a collect should be suggested if side effects are required.

bkamins · 2021-08-12T17:02:20Z

Yeah it should be clearly documented by DataFrames

I am not sure what you mean here. DataFrames.jl is not aware of PooledArrays.jl in that part of code. It is using generic map.

I would say the question is the following. The docstring of map in Julia Base states:

Transform collection c by applying f to each element.

Which explicitly promises to apply f to each element of c.

The comment by @nalimilan in JuliaData/DataFrames.jl#2837 (comment) has the following context:

DataFrames.jl uses map assuming the contract specified above (apply f to each element of c)
PooledArrays.jl does not follow this contract
Users of DataFrames.jl feel confused as they are not even calling map (it is called without them knowing this happens)
My fix in "more careful test of ByRow for PooledArray" (JuliaData/DataFrames.jl#2837) avoids using map in DataFrames.jl, but, if I understand @nalimilan correctly, he would prefer to first establish how map should be handled in PooledArrays.jl: if we decide to stop assuming f is pure, then I do not need to change DataFrames.jl and we can keep using map there.
In general I would say that users of PooledArrays.jl will also be confused by what map currently does. If we keep the current behavior, we should add a docstring for map in PooledArrays.jl so that users are not confused (and then in DataFrames.jl we should implement the change I proposed in JuliaData/DataFrames.jl#2837).

quinnj · 2021-08-12T21:21:14Z

It does seem like we'd be better off making map on PooledArray do the more natural thing; issues have come up several times now.

nalimilan · 2021-08-13T15:08:00Z

That's too bad for performance, but I have to admit that this issue keeps being raised... How about having a keyword argument pure=false to opt-in to the fast method?

Another, probably too clever solution: call f on the first entries, and as soon as a duplicate value is encountered, check whether the value that f returns for it is equal to the one that f returned for the previous call on the same value. If that's the case, assume purity. If not, proceed calling f on all remaining elements. That would be correct as long as f is deterministic. The only case where it could give incorrect results is when returning a random number that happens to be equal for the two calls by mere chance.
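
A minimal Python sketch of this probing idea, for illustration only: probing_map and make_counter are hypothetical names, the elements are assumed to be hashable, and this is not how PooledArrays.jl is implemented.

```python
def probing_map(f, xs):
    """Call f until the first repeated input, then decide whether to reuse results."""
    cache = {}      # input value -> most recent result of f
    out = []
    mode = "probe"  # becomes "pure" or "impure" at the first repeated input
    for x in xs:
        if mode == "pure" and x in cache:
            out.append(cache[x])  # purity assumed: reuse the stored result
            continue
        y = f(x)
        if mode == "probe" and x in cache:
            # First duplicate: compare the new result with the earlier one.
            mode = "pure" if y == cache[x] else "impure"
        cache[x] = y
        out.append(y)
    return out


def make_counter():
    """Return an impure function: its result depends on how often it was called."""
    n = 0
    def tick(_x):
        nonlocal n
        n += 1
        return n
    return tick


pure_calls = []

def times_ten(x):
    pure_calls.append(x)
    return x * 10

pure_result = probing_map(times_ten, [1, 2, 1, 1, 2, 3])
impure_result = probing_map(make_counter(), [1, 2, 1, 1, 2])
```

In this sketch times_ten is called four times for six elements (the probe costs one extra call on the first duplicate), while the counter is detected as impure at the first repeated input and is then called for every element.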

bkamins · 2021-08-13T17:47:18Z

check whether the value that f returns for it is equal to the one that f returned

I feel this is too complex.

How about having a keyword argument pure=false to opt-in to the fast method?

pure::Bool=false looks good to me. If we agree on this I can implement it (as I want to resolve JuliaData/DataFrames.jl#2837 soon for 1.3 release of DataFrames.jl).

quinnj · 2021-08-13T17:59:55Z

Yeah, pure::Bool=false seems good to me too.
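
To make the agreed semantics concrete, here is a toy Python model of a pooled container whose map takes a pure flag. The PooledList class is an assumption made for illustration and does not mirror the internals or return types of PooledArrays.jl; only the pure keyword and its false default come from the discussion above.

```python
class PooledList:
    """Toy pooled container: distinct values plus one reference per element."""

    def __init__(self, values):
        self.pool = []  # distinct values, in first-seen order
        self.refs = []  # index into pool for each element
        index = {}
        for v in values:
            if v not in index:
                index[v] = len(self.pool)
                self.pool.append(v)
            self.refs.append(index[v])

    def map(self, f, pure=False):
        if pure:
            # Fast path: call f once per distinct value, then expand.
            mapped = [f(v) for v in self.pool]
            return [mapped[r] for r in self.refs]
        # Default: call f on every element, as the generic map contract promises.
        return [f(self.pool[r]) for r in self.refs]


seen = []

def tag(s):
    seen.append(s)  # side effect used to count calls
    return s + "!"

p = PooledList(["x", "y", "x"])
slow = p.map(tag)
n_slow = len(seen)
seen.clear()
fast = p.map(tag, pure=True)
n_fast = len(seen)
```

Both calls return the same values for a pure function, but the default path calls tag three times and the pure=True path only twice, so code relying on side effects keeps working unless the caller explicitly opts in to the fast method.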

quinnj mentioned this issue Apr 13, 2021

Do not pool values by default? JuliaData/CSV.jl#822

Closed

nalimilan mentioned this issue Aug 12, 2021

more careful test of ByRow for PooledArray JuliaData/DataFrames.jl#2837

Merged

bkamins mentioned this issue Aug 13, 2021

add pure kwarg to map #71

Merged

bkamins closed this as completed in #71 Sep 1, 2021

nalimilan mentioned this issue Apr 12, 2022

Add a keyword argument to disable multithreading JuliaData/DataFrames.jl#3030

Merged
