-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add pure kwarg to map #71
Conversation
That's indeed more work, but I think it's worth the increased complexity. Making a temporary copy doesn't sound acceptable to me, especially given that the temporary vector will likely take much more memory than the final |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Yes, this is also what we do in DataFrames.jl. I will do it then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good to me; couple of test fixes I suggested
@nalimilan - so how do you think we should go about this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot about it. Looks mostly good now.
@quinnj Are you fine with the strategy we use to choose the reference type?
src/PooledArrays.jl
Outdated
@@ -139,6 +140,8 @@ _widen(::Type{Int32}) = Int64 | |||
Freshly allocate `PooledArray` using the given array as a source where each | |||
element will be referenced as an integer of the given type. | |||
|
|||
`PooledArray` constructor is not type stable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about explaining why that's the case, making the link with the next paragraph? And isn't the constructor type stable when reftype
is specified?
BTW, maybe we should mark the constructor as @inline
to ensure that when the default value of keyword arguments is used, the return type is inferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added inline. I have rewritten the docstring. Indeed when reftype
is passed the constructor is type stable.
@@ -500,3 +500,56 @@ end | |||
pa2 = repeat(pa1, inner = (2, 1)) | |||
@test pa2 == [1 2; 1 2; 3 4; 3 4] | |||
end | |||
|
|||
@testset "map pure tests" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a few @inferred
checks where applicable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add the @inferred
tests elsewhere, as here nothing is inferred since now inference produces the Union
of two concrete types, see:
julia> x = PooledArray(fill(1, 200), signed=true, compress=true);
julia> @inferred map(Int, x)
ERROR: return type PooledVector{Int64, Int8, Vector{Int8}} does not match inferred return type Union{PooledVector{Int64,
Int64, Vector{Int64}}, PooledVector{Int64, Int8, Vector{Int8}}}
julia> @inferred map(Int, x, pure=true)
ERROR: return type PooledVector{Int64, Int8, Vector{Int8}} does not match inferred return type Union{PooledVector{Int64,
Int64, Vector{Int64}}, PooledVector{Int64, Int8, Vector{Int8}}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added @inferred
with Union
for map
tests
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/PooledArrays.jl
Outdated
if lbl != zero(I) | ||
labels[i] = lbl | ||
else | ||
if nlabels == typemax(I) || Ti !== T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the Ti !== T
check here be Ti <: T
to account for Ti == Int64
and T == Union{Int64, Missing}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
src/PooledArrays.jl
Outdated
else | ||
if nlabels == typemax(I) || Ti !== T | ||
I2 = nlabels == typemax(I) ? Int : I | ||
T2 = Ti isa T ? T : Base.promote_typejoin(T, Ti) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Ti isa T
is correct here; it should be Ti <: T
or vi isa T
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I switched to vi isa T
everywhere. Thank you for spotting
typeof(similar(x.refs, Int, ntuple(i -> 0, ndims(x.refs))))}} where {R, N, RA} | ||
pure && return _map_pure(f, x) | ||
length(x) == 0 && return PooledArray([f(v) for v in x]) | ||
v1 = f(x[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj and @nalimilan - just to double check. Are we sure that PooledArray
always uses 1-based indexing and even if it is multidimensional it can use a linear index?
I recall @nalimilan recently giving some comment that potentially a non-standard refs
field could be used. On the other hand eachindex(IndexLinear(), ::PooledArray)
falls back to the default implementation:
eachindex(::IndexLinear, A::AbstractArray) = (@inline; oneto(length(A)))
eachindex(::IndexLinear, A::AbstractVector) = (@inline; axes1(A))
I will have a look into it later if you do not have an immediate answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in theory any AbstractArray
type can be used as refs
, but in practice we probably haven't checked that things work for non-1-based indexing, and the fallbacks you show assume 1-based indexing, right? So I'd say it's OK to assume 1 for now and fix that later if somebody complains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - then I have added an appropriate check in the inner constructor.
test/runtests.jl
Outdated
for signed in (true, false), compress in (true, false) | ||
@test_throws ErrorException @inferred PooledArray([1, 2, 3]) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking this is a bit extreme: would it be a problem if inference managed to get this right? :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The objective is different - the moment the inference will get it right we will know it.
OTOH - advanced users reading the tests will know that we have a problem here (BTW - I have forgotten to add kwargs in the call I will fix this)
test/runtests.jl
Outdated
x = PooledArray(fill(1, len), signed=true, compress=true); | ||
VERSION >= v"1.6" && @inferred PooledVector{Int, Int, Vector{Int}} map(identity, x) | ||
end | ||
VERSION >= v"1.6" && include("map_inference.jl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you avoid this by using @static
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will try
FWIW I've made a PR to make |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
suggestions applied |
Some benchmarks before merging. This PR:
and before this PR:
so all looks relatively good. |
Thank you! I will now make a release |
Thanks. Why is |
I think the reason is:
that most likely leads to boxing. I have not checked (doing |
I have checked. It seems to be type stable, but just this loop:
is expensive (@quinnj + @nalimilan => of course double checking would be welcome). |
I confirm the function appears to be type-stable, but there are nonetheless lots of allocations: - function _map_pure(f, x::PooledArray)
80000080 ks = collect(keys(x.invpool))
40000080 vs = collect(values(x.invpool))
0 ks1 = map(f, ks)
16 uks = Set(ks1)
0 if length(uks) < length(ks1)
- # this means some keys have repeated
0 newinvpool = Dict{eltype(ks1), eltype(vs)}()
0 translate = Dict{eltype(vs), eltype(vs)}()
- i = 1
799991968 for (k, k1) in zip(ks, ks1)
0 if haskey(newinvpool, k1)
159983616 translate[x.invpool[k]] = newinvpool[k1]
- else
0 newinvpool[k1] = i
16 translate[x.invpool[k]] = i
1119999952 i+=1
- end
- end
0 refarray = map(x->translate[x], x.refs)
- else
0 newinvpool = Dict(zip(ks1, vs))
0 refarray = copy(x.refs)
- end
16 return PooledArray(RefArray(refarray), newinvpool)
- end I assume this is due to filling the Also, we call |
Yes - I confirm it allocates a lot. I have not analyzed the code. But looking at it now:
If @shashi will not be able to work on it, I can make a PR fixing the issues. |
Fixes #63
I was thinking if it is possible to make
pure=false
branch faster, but it seems to be hard in general as we are not sure about types of values returned byf
.