Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setindex, push and convenience constructor #12

Merged
merged 8 commits into from
Mar 11, 2018

Conversation

andreasnoack
Copy link
Member

@andreasnoack andreasnoack commented Jan 10, 2018

I'm now sure if you want this package to be minimal or if you are okay with adding some functionality like this. If you are okay with adding stuff, I'll finish up the PR. I think this could be useful as an internal replacement for the now deprecated PooledDataArrays in Gadfly but then some extra functionality similar to the methods defined here are needed.

@timholy
Copy link
Member

timholy commented Jan 10, 2018

Interesting. On one hand I worry this violates some of the basic assumptions of AbstractArrays (getting and setting values should be fast), OTOH this does seem useful.

On balance I'm leaning in favor, but would be happy if other JuliaArrays contributors chimed in. Out of curiosity, how big is the values array in most of your use cases?

@andreasnoack
Copy link
Member Author

andreasnoack commented Jan 10, 2018

On one hand I worry this violates some of the basic assumptions of AbstractArrays

When working with sparse and distributed arrays I learned to take the basic assumptions of AbstractArrays (for which setindex! is much more costly) with a grain of salt but I'm sympathetic to keeping things simple.

Out of curiosity, how big is the values array in most of your use cases?

My own use cases for these kinds of arrays have typically been statistics, i.e. what is called a factor in R and there the number of values has been very small. Typically smaller than ten. However, the current motivation was to bring Gadfly up to date which required a replacement for PooledDataArrays. When running the test suite using PooledArrays, I noticed that UInt32s were used which means that there must be quite a few values in that array since the default is to use UInt8.

@timholy
Copy link
Member

timholy commented Jan 10, 2018

I asked because I wondered if an inverse-mapping Dict might be better. But at less than 10 I suspect linear search would be hard to beat. We could also add another type that does this via a Dict (or static perfect-hashing algorithm), if the need arises.

Unless others object, I'm personally willing to take this. Thanks for stepping forward.

@codecov-io
Copy link

codecov-io commented Jan 19, 2018

Codecov Report

Merging #12 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master    #12   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           1      1           
  Lines          16     38   +22     
=====================================
+ Hits           16     38   +22
Impacted Files Coverage Δ
src/IndirectArrays.jl 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 135a274...e32c8bb. Read the comment docs.

@andreasnoack
Copy link
Member Author

I've added some tests so I think this is ready. Some of these methods can probably be optimized but for most applications, I only think a fairly fast getindex is required.

@andreasnoack
Copy link
Member Author

@timholy Bump. This is blocking a Gadfly update so it would be great to get merged soon. Please let me know if it needs changes.

@bjarthur
Copy link
Contributor

if folks object to this, am i correct in thinking that Gadfly could import and extend IndirectArrays instead?

@andreasnoack
Copy link
Member Author

Bump

@va-andrew
Copy link

bump

Copy link
Contributor

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that @timholy said he was OK on the principle, I'm going to (ab)use my powers and merge this unless somebody objects. It's been blocking the port of Gadfly to DataFrames 0.11 for too long.

@inline function Base.setindex!(A::IndirectArray, x, i::Int)
@boundscheck checkbounds(A.index, i)
idx = findfirst(A.values, x)
if idx == 0 || idx == nothing # findfird changed in 0.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"findfirst". Could also use Compat.findfirst now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I've updated the PR.

@@ -26,10 +28,19 @@ end
Base.@propagate_inbounds IndirectArray(index::AbstractArray{<:Integer,N}, values::AbstractVector{T}) where {T,N} =
IndirectArray{T,N,typeof(index),typeof(values)}(index, values)

function (::Type{IndirectArray{T}})(A::AbstractArray, values::AbstractVector = unique(A)) where {T}
# Use map! to make sure that index is an Array
index = map!(t -> findfirst(values, t), Array{T}(size(A)...), A)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this. Much better to use indexin(A, values) here AFAICT. Not sure what should happen if some values of A do not appear in values: is throwing an error OK?

For maximal performance, it would even make sense to use values = nothing by default, and compute unique values using the same dict as the one created by indexin.

@nalimilan nalimilan merged commit b915ef0 into JuliaArrays:master Mar 11, 2018
@nalimilan
Copy link
Contributor

I've tagged a release: JuliaLang/METADATA.jl#13782

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants