-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always keep missing in pool #65
Draft
bkamins
wants to merge
3
commits into
main
Choose a base branch
from
bkamins-patch-1
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of think we should just keep this
T
; and letUnion{Missing, T}
be provided as theT
. That way if you don't havemissing
values, you're not penalized. It seems like to sosimillar_missing
, we could provide a function that just re-wraps therefs
,pool
,invpool
, etc in a newPooledArray
struct w/Union{T, Missing}
, right?But maybe you're right that it would be too expensive to have to change
pool::Vector{T}
andinvpool::Dict{T, R}
to supportmissing
as well in thesimilar_missing
operation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to avoid having to pay the following cost:
(it would also reduce the memory burden of making this operation)
With this PR it would be avoided.
My original preference was different, i.e. to never allow
Missing
in therefpool
andinvrefpool
but have0
to be a ref that is interpreted asmissing
without making a lookup, but @nalimilan thought it would be overly complex. That is why I have made a draft here to discuss the design.This would have a consequence similar to what we have in CategoricalArrays.jl:
(so magically
#undef
becomesmissing
which is probably undesirable)There is no rush to make a decision here. Let us think and decide what is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah neither solution is simple. Maybe it would be possible to make
convert
faster for dicts when the target eltype is a supertype of the source eltype?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this has been an open issue in Julia Base for a long time. Who do you think is best to ask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea. I guess we (you :-p) would have to come up with at least a draft PR to get people's attention. But this may be quite easy to do.
convert
calls this constructor when the origin type is the same as the destination:https://github.com/JuliaLang/julia/blob/be6887fa9ffa3d08315c93f1551c41b44dd3d720/base/dict.jl#L92-L95
I wonder whether what's needed is just to call
convert
on the keys and value when the types don't match. What do you think?Assuming that works, do you think it would be an acceptable replacement for this PR or would it still not be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we loose COW benefits. However, since we are not fully convinced what is best I would park this PR for now, so that @quinnj can finish CSV.jl update without this functionality (which is more pressing). We can go back to this PR and decide what to do later (especially that if we change the internals probably 2.0 release would be needed).
And maybe in the mean time what @nalimilan proposes to do in Julia Base would be implemented (I cannot do it as I do not know how to do it unfortunately as I know too little about internal design related to JuliaLang/julia#26681)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid the best person to implement no-copy
convert
forArray{Union{T, Missing}}
is also @quinnj! :-DThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj - if you do not have time for this (which I assume is the case) - can you please point me to the part of the sources that would have to be changed? (I assume it is somewhere in the C codes)
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the relevant code is in https://github.com/JuliaLang/julia/blob/master/src/array.c. In the jl_array_typetagdata function, it returns the "selector bytes" for an isbits-union array. This is where we lengthen the requested array size to account for selectory bytes. We also have to ensure the selector bytes are always initialized to zero, since otherwise you could get selector bytes pointing to something not just garbage, but totally wrong.
So the operation we'd need is
convert(::Type{Vector{Union{T, S}}}, x::Vector{T}) where {T, S}
, right? And we're ok if the two arrays potentially share data? So it seems we would add a check on the julia-side for ifT
isisbitstype
andS
isisbitstype
, and if so, call a new array.c routine.This is where it gets a little hairy though, because we have
array_resize_buffer
, but that takes an existing array to resize it. We'd need to use some of the same logic in that routine and make a newjl_convert_to_isbitsunion
method that:isbitstype
Union
eltype is alsoisbitstype
a->flags.isshared
(means we can't resize it; like an mmap buffer)jl_ptr_to_array_1d
with our realloc'd pointer? that will take care of making the new array and using the realloc'd data pointerThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I was afraid it would not be that simple. What we need is only
convert(::Type{Vector{Union{T, Missing}}}, x::Vector{T}) where {T}
butT
may be any type (also non-bits, e.g.String
). So I guess it would require separate paths forT
being bits type and being non-bits type.However, this in general shows the following problem. Assume that we have
x
andy
sharing the same data. Where eltype ofx
isT
and ofy
isUnion{Missing, T}
. What will happen tox
if we doy[1] = missing
? My intuition that this operation is super problematic - right? (i.e. would lead to an undefined state ofx[1]
)So the conclusion is that these arrays may not share data. Therefore we have an easier task of adding an internal constructor in PooledArrays.jl for
Dict
type that instead of:would copy an internal structure of source
Dict
and only do promotion of eltype ofkeys
field in the source dict.So we should have a function like:
and the performance would be:
the benefit would be ~3x speedup.