Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pure kwarg to map #71

Merged
merged 45 commits into from
Sep 1, 2021
Merged
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
6056598
add pure kwarg to map
bkamins Aug 13, 2021
62dcdb8
Update Project.toml
bkamins Aug 13, 2021
6745eef
Update runtests.jl
bkamins Aug 13, 2021
2de3717
Apply suggestions from code review
bkamins Aug 13, 2021
a3de2c3
Apply suggestions from code review
bkamins Aug 13, 2021
6be6bde
Update src/PooledArrays.jl
bkamins Aug 13, 2021
065629d
more efficient map for not-pure case
bkamins Aug 13, 2021
73ff1ed
Apply suggestions from code review
bkamins Aug 13, 2021
819bddf
Update src/PooledArrays.jl
bkamins Aug 13, 2021
4aef8d9
Update src/PooledArrays.jl
bkamins Aug 13, 2021
b3dc12c
Update test/runtests.jl
quinnj Aug 14, 2021
eb5f5ee
Update test/runtests.jl
quinnj Aug 14, 2021
9845443
Apply suggestions from code review
bkamins Aug 14, 2021
ef7b491
Update src/PooledArrays.jl
bkamins Aug 14, 2021
de11e9e
Update src/PooledArrays.jl
bkamins Aug 14, 2021
5d80486
Apply suggestions from code review
bkamins Aug 14, 2021
14bc92b
Update test/runtests.jl
bkamins Aug 14, 2021
75aacf9
Update test/runtests.jl
bkamins Aug 14, 2021
ef02b27
fix promotion
bkamins Aug 14, 2021
ebc4d0a
fix type chceck
bkamins Aug 14, 2021
f1b60b1
Apply suggestions from code review
bkamins Aug 14, 2021
d9a7e4a
Update src/PooledArrays.jl
bkamins Aug 14, 2021
60fdbfc
Update PooledArrays.jl
bkamins Aug 14, 2021
1716888
Apply suggestions from code review
bkamins Aug 14, 2021
8aab260
small changes
bkamins Aug 17, 2021
a6c6e65
Merge branch 'bkamins-patch-2' of https://github.com/JuliaData/Pooled…
bkamins Aug 17, 2021
a59e22a
remove type assertion
bkamins Aug 17, 2021
2f80224
Apply suggestions from code review
bkamins Aug 18, 2021
3a0b7e8
Update src/PooledArrays.jl
bkamins Aug 18, 2021
3a97346
Update src/PooledArrays.jl
bkamins Aug 18, 2021
72a6089
fix signature
bkamins Aug 18, 2021
09c2f20
Apply suggestions from code review
bkamins Aug 20, 2021
ade7029
Apply suggestions from code review
bkamins Aug 20, 2021
6acf2e8
Update src/PooledArrays.jl
bkamins Aug 20, 2021
a059bef
Update src/PooledArrays.jl
bkamins Aug 29, 2021
6a4bfa5
apply suggestions from code review
bkamins Aug 29, 2021
950d914
Apply suggestions from code review
bkamins Aug 29, 2021
5c13102
amend tests
bkamins Aug 29, 2021
4204992
fix Ti issues
bkamins Aug 31, 2021
aba380c
add Base.require_one_based_indexing test in inner constructor
bkamins Aug 31, 2021
60efc6f
fix offest checks for old Julia versions
bkamins Aug 31, 2021
b98ee8f
another fix to Julia 1.0
bkamins Aug 31, 2021
a146fa6
fix typo
bkamins Aug 31, 2021
97fe088
Apply suggestions from code review
bkamins Aug 31, 2021
0b21c63
apply review comments
bkamins Aug 31, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name = "PooledArrays"
uuid = "2dfb63ee-cc39-5dd5-95bd-886bf059d720"
version = "1.2.1"
version = "1.3.0"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Expand Down
86 changes: 74 additions & 12 deletions src/PooledArrays.jl
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ const PooledArrOrSub = Union{SubArray{T, N, <:PooledArray{T, R}},
##############################################################################

# Echo inner constructor as an outer constructor
PooledArray(refs::RefArray{RA}, invpool::Dict{T,R}, pool::Vector{T}=_invert(invpool),
@inline PooledArray(refs::RefArray{RA}, invpool::Dict{T,R}, pool::Vector{T}=_invert(invpool),
refcount::Threads.Atomic{Int}=Threads.Atomic{Int}(1)) where {T,R,RA<:AbstractArray{R}} =
PooledArray{T,R,ndims(RA),RA}(refs, invpool, pool, refcount)

Expand All @@ -89,7 +89,7 @@ function _our_copy(x::SubArray{<:Any, 0})
return y
end

function PooledArray(d::PooledArrOrSub)
@inline function PooledArray(d::PooledArrOrSub)
Threads.atomic_add!(refcount(d), 1)
return PooledArray(RefArray(_our_copy(DataAPI.refarray(d))),
DataAPI.invrefpool(d), DataAPI.refpool(d), refcount(d))
Expand Down Expand Up @@ -131,6 +131,7 @@ _widen(::Type{UInt32}) = UInt64
_widen(::Type{Int8}) = Int16
_widen(::Type{Int16}) = Int32
_widen(::Type{Int32}) = Int64

# Constructor from array, invpool, and ref type

"""
Expand All @@ -139,7 +140,8 @@ _widen(::Type{Int32}) = Int64
Freshly allocate `PooledArray` using the given array as a source where each
element will be referenced as an integer of the given type.

If `reftype` is not specified, Boolean keyword arguments `signed` and `compress`
If `reftype` is not specified then `PooledArray` constructor is not type stable.
In this case boolean keyword arguments `signed` and `compress`
bkamins marked this conversation as resolved.
Show resolved Hide resolved
determine the type of integer references. By default (`signed=false`), unsigned integers
are used, as they have a greater range.
However, the Arrow standard at https://arrow.apache.org/, as implemented in
Expand All @@ -162,7 +164,7 @@ if all values already exist in the pool.
"""
PooledArray

function PooledArray{T}(d::AbstractArray, r::Type{R}) where {T,R<:Integer}
@inline function PooledArray{T}(d::AbstractArray, r::Type{R}) where {T,R<:Integer}
refs, invpool, pool = _label(d, T, R)

if length(invpool) > typemax(R)
Expand All @@ -173,19 +175,19 @@ function PooledArray{T}(d::AbstractArray, r::Type{R}) where {T,R<:Integer}
return PooledArray(RefArray(refs::Vector{R}), invpool::Dict{T,R}, pool)
end

function PooledArray{T}(d::AbstractArray; signed::Bool=false, compress::Bool=false) where {T}
@inline function PooledArray{T}(d::AbstractArray; signed::Bool=false, compress::Bool=false) where {T}
R = signed ? (compress ? Int8 : DEFAULT_SIGNED_REF_TYPE) : (compress ? UInt8 : DEFAULT_POOLED_REF_TYPE)
refs, invpool, pool = _label(d, T, R)
return PooledArray(RefArray(refs), invpool, pool)
end

PooledArray(d::AbstractArray{T}, r::Type) where {T} = PooledArray{T}(d, r)
PooledArray(d::AbstractArray{T}; signed::Bool=false, compress::Bool=false) where {T} =
@inline PooledArray(d::AbstractArray{T}, r::Type) where {T} = PooledArray{T}(d, r)
@inline PooledArray(d::AbstractArray{T}; signed::Bool=false, compress::Bool=false) where {T} =
PooledArray{T}(d, signed=signed, compress=compress)

# Construct an empty PooledVector of a specific type
PooledArray(t::Type) = PooledArray(Array(t,0))
PooledArray(t::Type, r::Type) = PooledArray(Array(t,0), r)
@inline PooledArray(t::Type) = PooledArray(Array(t,0))
@inline PooledArray(t::Type, r::Type) = PooledArray(Array(t,0), r)

##############################################################################
##
Expand Down Expand Up @@ -304,7 +306,67 @@ Base.findall(pdv::PooledVector{Bool}) = findall(convert(Vector{Bool}, pdv))
##
##############################################################################

function Base.map(f, x::PooledArray{T,R}) where {T,R<:Integer}
"""
map(f, x::PooledArray; pure::Bool=false)

Transform `PooledArray` `x` by applying `f` to each element.

If `pure=true` then `f` is applied to each element of pool of `x`
exactly once (even if some elements in pool are not present it `x`).
This will typically be much faster when the proportion of unique values
in `x` is small.
bkamins marked this conversation as resolved.
Show resolved Hide resolved

bkamins marked this conversation as resolved.
Show resolved Hide resolved
If `pure=false`, the returned array will use the same reference type
as `x`, or `Int` if the number of unique values in the result is too large
to fit in that type.
"""
function Base.map(f, x::PooledArray{<:Any, R, N, RA}; pure::Bool=false)::Union{PooledArray{<:Any, R, N, RA},
PooledArray{<:Any, Int, N,
typeof(similar(x.refs, Int, ntuple(i -> 0, ndims(x.refs))))}} where {R, N, RA}
pure && return _map_pure(f, x)
length(x) == 0 && return PooledArray([f(v) for v in x])
v1 = f(x[1])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinnj and @nalimilan - just to double check. Are we sure that PooledArray always uses 1-based indexing and even if it is multidimensional it can use a linear index?

I recall @nalimilan recently giving some comment that potentially a non-standard refs field could be used. On the other hand eachindex(IndexLinear(), ::PooledArray) falls back to the default implementation:

eachindex(::IndexLinear, A::AbstractArray) = (@inline; oneto(length(A)))
eachindex(::IndexLinear, A::AbstractVector) = (@inline; axes1(A))

I will have a look into it later if you do not have an immediate answer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in theory any AbstractArray type can be used as refs, but in practice we probably haven't checked that things work for non-1-based indexing, and the fallbacks you show assume 1-based indexing, right? So I'd say it's OK to assume 1 for now and fix that later if somebody complains.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - then I have added an appropriate check in the inner constructor.

invpool = Dict(v1 => one(eltype(x.refs)))
pool = [v1]
labels = similar(x.refs)
labels[1] = 1
nlabels = 1
return _map_notpure(f, x, 2, invpool, pool, labels, nlabels)
end

function _map_notpure(f, xs::PooledArray, start,
invpool::Dict{T,I}, pool::Vector{T},
labels::AbstractArray{I}, nlabels::Int) where {T, I<:Integer}
for i in start:length(xs)
vi = f(xs[i])
Ti = typeof(vi)
lbl = get(invpool, vi, zero(I))
if lbl != zero(I)
labels[i] = lbl
else
if nlabels == typemax(I) || Ti !== T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Ti !== T check here be Ti <: T to account for Ti == Int64 and T == Union{Int64, Missing}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

I2 = nlabels == typemax(I) ? Int : I
T2 = Ti isa T ? T : Base.promote_typejoin(T, Ti)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Ti isa T is correct here; it should be Ti <: T or vi isa T, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I switched to vi isa T everywhere. Thank you for spotting

nlabels += 1
invpool2 = convert(Dict{T2, I2}, invpool)
invpool2[vi] = nlabels
pool2 = convert(Vector{T2}, pool)
bkamins marked this conversation as resolved.
Show resolved Hide resolved
push!(pool2, vi)
labels2 = convert(AbstractArray{I2}, labels)
labels2[i] = nlabels
return _map_notpure(f, xs, i + 1, invpool2, pool2,
labels2, nlabels)
end
nlabels += 1
labels[i] = nlabels
invpool[vi] = nlabels
push!(pool, vi)
end
end
return PooledArray(RefArray(labels), invpool, pool)
end

function _map_pure(f, x::PooledArray)
ks = collect(keys(x.invpool))
vs = collect(values(x.invpool))
ks1 = map(f, ks)
Expand Down Expand Up @@ -601,14 +663,14 @@ _perm(o::F, z::V) where {F, V} = Base.Order.Perm{F, V}(o, z)

Base.Order.Perm(o::Base.Order.ForwardOrdering, y::PooledArray) = _perm(o, fast_sortable(y))

function Base.repeat(x::PooledArray, m::Integer...)
function Base.repeat(x::PooledArray, m::Integer...)
Threads.atomic_add!(x.refcount, 1)
PooledArray(RefArray(repeat(x.refs, m...)), x.invpool, x.pool, x.refcount)
end

function Base.repeat(x::PooledArray; inner = nothing, outer = nothing)
Threads.atomic_add!(x.refcount, 1)
PooledArray(RefArray(repeat(x.refs; inner = inner, outer = outer)),
PooledArray(RefArray(repeat(x.refs; inner = inner, outer = outer)),
x.invpool, x.pool, x.refcount)
end

Expand Down
4 changes: 4 additions & 0 deletions test/map_inference.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
for signed in (true, false), compress in (true, false), len in (1, 100, 1000)
x = PooledArray(fill(1, len), signed=true, compress=true);
bkamins marked this conversation as resolved.
Show resolved Hide resolved
@inferred PooledVector{Int, Int, Vector{Int}} map(identity, x)
end
62 changes: 62 additions & 0 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,13 @@ end
@test PooledArrays.fast_sortable(v3) == PooledArray([1, 3, 2, 4])
@test isbitstype(eltype(PooledArrays.fast_sortable(v3)))
Base.Order.Perm(Base.Order.Forward, v3).data == PooledArray([1, 3, 2, 4])

for T in (Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64)
@inferred PooledArray([1, 2, 3], T)
end
for signed in (true, false), compress in (true, false)
@test_throws ErrorException @inferred PooledArray([1, 2, 3])
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking this is a bit extreme: would it be a problem if inference managed to get this right? :-D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The objective is different - the moment the inference will get it right we will know it.
OTOH - advanced users reading the tests will know that we have a problem here (BTW - I have forgotten to add kwargs in the call I will fix this)

end

@testset "pool non-copying constructor and copy tests" begin
Expand Down Expand Up @@ -500,3 +507,58 @@ end
pa2 = repeat(pa1, inner = (2, 1))
@test pa2 == [1 2; 1 2; 3 4; 3 4]
end

@testset "map pure tests" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few @inferred checks where applicable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add the @inferred tests elsewhere, as here nothing is inferred since now inference produces the Union of two concrete types, see:

julia> x = PooledArray(fill(1, 200), signed=true, compress=true);

julia> @inferred map(Int, x)
ERROR: return type PooledVector{Int64, Int8, Vector{Int8}} does not match inferred return type Union{PooledVector{Int64, 
Int64, Vector{Int64}}, PooledVector{Int64, Int8, Vector{Int8}}}

julia> @inferred map(Int, x, pure=true)
ERROR: return type PooledVector{Int64, Int8, Vector{Int8}} does not match inferred return type Union{PooledVector{Int64, 
Int64, Vector{Int64}}, PooledVector{Int64, Int8, Vector{Int8}}}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added @inferred with Union for map tests

x = PooledArray([1, 2, 3])
x[3] = 1
y = map(-, x, pure=true)
@test refpool(y) == [-1, -2, -3]
@test y == [-1, -2, -1]

y = map(-, x)
@test refpool(y) == [-1, -2]
@test y == [-1, -2, -1]

function f()
i = Ref(0)
return x -> (i[] -= 1; i[])
end

# the order is strange as we iterate invpool which is a Dict
# and it depends on the version of Julia
y = map(f(), x, pure=true)
d = Dict(Set(1:3) .=> -1:-1:-3)
@test refpool(y) == [d[i] for i in 1:3]
@test y == [d[v] for v in x]

y = map(f(), x)
@test refpool(y) == [-1, -2, -3]
@test y == [-1, -2, -3]
bkamins marked this conversation as resolved.
Show resolved Hide resolved

x = PooledArray([1, missing, 2])
y = map(identity, x)
@test isequal(y, [1, missing, 2])
@test typeof(y) === PooledVector{Union{Missing, Int}, UInt32, Vector{UInt32}}

x = PooledArray([1, missing, 2], signed=true, compress=true)
y = map(identity, x)
@test isequal(y, [1, missing, 2])
@test typeof(y) === PooledVector{Union{Missing, Int}, Int8, Vector{Int8}}

x = PooledArray(fill(1, 200), signed=true, compress=true)
y = map(f(), x)
@test y == -1:-1:-200
@test typeof(y) === PooledVector{Int, Int, Vector{Int}}

x = PooledArray(reshape(fill(1, 200), 2, :), signed=true, compress=true)
y = map(f(), x)
@test y == reshape(-1:-1:-200, 2, :)
@test typeof(y) === PooledMatrix{Int, Int, Matrix{Int}}

x = PooledArray(fill("a"), signed=true, compress=true)
y = map(f(), x)
@test y == fill(-1)
@test typeof(y) === PooledArray{Int, Int8, 0, Array{Int8, 0}}

VERSION >= v"1.6" && include("map_inference.jl")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you avoid this by using @static?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I will try

end