Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: unstack receives kwarg fillvalue #2828

Merged
merged 11 commits into from
Sep 8, 2021

Conversation

pstorozenko
Copy link
Contributor

Closes #2698

I thought about checking every column separately for promote_type, but it would be a breaking change since right now value column with eltype T is always converted to Vector{Union{Missing, T}}. We may think about changing it in another PR.

@bkamins
Copy link
Member

bkamins commented Aug 2, 2021

but it would be a breaking change

Why do you think so? I would assume that if fillvalue is missing which is the default and old behavior in your PR we would get the same (so this means that no old code would be broken - the result would stay the same).

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Aug 2, 2021

I thought about checking every column separately

Ah - after reviewing I see what you mean. What you have implemented is OK I think. One can run identity.(df) to do type narrowing later if needed.

However, it is important to test the additional scenarios I have mentioned I think. Especially for CategoricalVector check how levels behave in different combinations of inputs and in cases when original CategoricalVector is ordered/unordered.

Other than that - the PR looks good.

@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Aug 2, 2021
@bkamins bkamins added this to the 1.3 milestone Aug 2, 2021
@bkamins
Copy link
Member

bkamins commented Aug 25, 2021

@pstorozenko - merge conflicts require fixing + have you had time to add the tests I have suggested?

@bkamins
Copy link
Member

bkamins commented Sep 1, 2021

@pstorozenko bump

@pstorozenko
Copy link
Contributor Author

Sorry for not answering.
There are a few things I found when trying to create tests for Categorical Vector:

  1. When dealing with vectors that are subtypes of CategoricalArrays.AbstractCategoricalArray we should create unstacked_val with fillvalue value and levels from valuecol + fillvalue.
  2. When fillvalue is missing then it is not added to levels (that's how CategoricalArrays work AFAIK).

@bkamins
Copy link
Member

bkamins commented Sep 4, 2021

Point 2 is correct.

Point 1 - there is a corner case when fillvalue is not allowed by CategoricalArray. As discussed - I will update the PR to check for corner cases. Thank you!

Comment on lines 428 to 431
unstacked_val = [fill!(similar(valuecol,
promote_type(eltype(valuecol), typeof(fillvalue)),
Nrow),
fillvalue) for _ in 1:Ncol]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - this definition is problematic for categorical columns the problem is that:

julia> using CategoricalArrays

julia> valuecol = categorical(["a"])
1-element CategoricalArray{String,1,UInt32}:
 "a"

julia> fillvalue = ""
""

julia> Nrow = 1
1

julia> similar(valuecol, promote_type(eltype(valuecol), typeof(fillvalue)), Nrow)
1-element Vector{String}:
 #undef

julia> promote_type(eltype(valuecol), typeof(fillvalue))
String

Is there a generic idiom that would (without introducing dependency on CategoricalArrays.jl) allow us to take the union of types and if it is OK (like it should be in the example above) produce CategoricalArray? Or maybe you think we should change the definition of similar in CategoricalArrays.jl?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I have thought about it. Given we currently allow:

julia> x = categorical([1,2,3])
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> CategoricalValue(3, x)
CategoricalValue{Int64, UInt32} 3

the behavior we have now should be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK. If we wanted to change this we could adapt promote_type in CategoricalArrays, but it's orthogonal to this PR. There would certainly be advantages in promoting to CategoricalValue, though it could also create problems (e.g. when concatenating Vector{Int} and CategoricalArray{Union{String, Missing}}the resulting type would beAny` but CategoricalArrays doesn't handle that well).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I think there would be more problems than benefits. We only need an easy way to create a "stand alone" CategoricalValue, as now it is CategoricalValue(scalar, categorical([scalar])) which is quite inconvenient.

@bkamins bkamins requested a review from nalimilan September 5, 2021 09:02
test/reshape.jl Outdated
@test dfu.Var2 ≅ [2, 0]
@test typeof(dfu.Var2) <: CategoricalVector{Int}
@test levels(dfu.Var1) == levels(dfu.Var2) == 0:3
dfu = unstack(df, :variable, :value, fillvalue=CategoricalValue("0", categorical(["0"])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - this test in particular is something we should think of. My judgement is that it is OK to produce such an union.

src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
Comment on lines 428 to 431
unstacked_val = [fill!(similar(valuecol,
promote_type(eltype(valuecol), typeof(fillvalue)),
Nrow),
fillvalue) for _ in 1:Ncol]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK. If we wanted to change this we could adapt promote_type in CategoricalArrays, but it's orthogonal to this PR. There would certainly be advantages in promoting to CategoricalValue, though it could also create problems (e.g. when concatenating Vector{Int} and CategoricalArray{Union{String, Missing}}the resulting type would beAny` but CategoricalArrays doesn't handle that well).

NEWS.md Outdated
@@ -22,6 +22,12 @@
(notably `PooledArray` and `CategoricalArray`) or when they contained only
integers in a small range.
([#2812](https://github.com/JuliaData/DataFrames.jl/pull/2812))
* the `unstack` function receives new keyword argument `fillvalue`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should call this just fill? Or is that name useful in other contexts as a Boolean argument?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - this name is specific only to this function. I will rename it to fill.

bkamins and others added 2 commits September 5, 2021 23:00
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member

bkamins commented Sep 5, 2021

OK - I have renamed fillvalue to fill. The only potential problem would be if we used the fill function from Julia Base in the body of the function, but we do not use it. We only use fill! so all is OK.

@bkamins
Copy link
Member

bkamins commented Sep 7, 2021

@nalimilan - any additional comments here?

src/abstractdataframe/reshape.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit d8add19 into JuliaData:main Sep 8, 2021
@bkamins
Copy link
Member

bkamins commented Sep 8, 2021

Thank you!

@pstorozenko pstorozenko deleted the pps/unstack_fillvalue branch September 8, 2021 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

control fill value for missing cells in unstack
3 participants