Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcasting methods over CategoricalString has changed #199

Closed
ablaom opened this issue Jun 10, 2019 · 17 comments
Closed

Broadcasting methods over CategoricalString has changed #199

ablaom opened this issue Jun 10, 2019 · 17 comments

Comments

@ablaom
Copy link

ablaom commented Jun 10, 2019

A recent change in CategoricalArrays has broken my code at MLJ. The change seems strange to me and I wonder if it is intentional (I hope not :-)):

Starting with:

using CategoricalArrays
lev(cat_element) = cat_element.pool.levels

julia> v = categorical(["a", "b", "c"])
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"

julia> lev(first(v))
3-element Array{String,1}:
 "a"
 "b"
 "c"

Under 0.5.2:

julia> V = [v, v, v]
3-element Array{CategoricalArray{String,1,UInt32,String,CategoricalString{UInt32},Union{}},1}:
 ["a", "b", "c"]
 ["a", "b", "c"]
 ["a", "b", "c"]

julia> lev(first.(V)[1])
3-element Array{String,1}:
 "a"
 "b"
 "c"

... which makes sense to me. But now, under 0.5.4, we have:

julia> V = [v, v, v]
3-element Array{CategoricalArray{String,1,UInt32,String,CategoricalString{UInt32},Union{}},1}:
 ["a", "b", "c"]
 ["a", "b", "c"]
 ["a", "b", "c"]

julia> lev(first.(V)[1])
1-element Array{String,1}:
 "a"

... which is causing me problems.

@fkiraly
Copy link

fkiraly commented Jun 10, 2019

Agreed - looks like a nonsensical/problematic, and thus hopefully unintended change to me - a "good" categorical type should always remember/encode the potential category values, rather than the set of actual category values found within the data.

That is, the set of levels should remain stable under sub-setting.

Otherwise you're conducting data summarization where you should be doing data type specification.
It's as if you would be changing the type of a float to FloatBetween{min,max} every time you add or remove a data point to the array.

I see you've raised this already in CategoricalArrays.jl issue tracker, so fingers crossed.

@nalimilan
Copy link
Member

Mmm, that's really a tricky case. This used to work because first.(V) returned a Vector{CategoricalString{UInt32}}, which is a type that's not supposed to ever be used (since it's very inefficient). We fixed that to return a CategoricalVector{String} instead. But that means CategoricalString objects are no longer stored as-is: they are "interpreted" when storing them in the CategoricalVector.

In the end, the issue boils down to this: what do you expect the following code to return?

v1 = categorical(["a", "b", "c"])
v2 = categorical(["a", "a", "a"])
v2[1] = v1[1]
levels(v2)

Currently, levels are just ["a"]: we only use the value of v1[1], not its potential levels. We could change assignment so that we always add all levels from the RHS to the LHS. That would certainly have some advantages, notably that copyto! could preserve levels and be equivalent to calling setindex! repeatedly (#172). But that could also have surprising or problematic effects: in particular, what should happen if the LHS is ordered and the RHS is unordered or with a conflicting order? The LHS would have to be marked as unordered, which can be annoying.

Until we sort this out, can you explain your actual use case? I suspect there are simpler ways of achieving the same result, which don't involve all these broadcasting subtleties.

@ablaom
Copy link
Author

ablaom commented Jun 13, 2019

Thanks for that.

because first.(V) returned a Vector{CategoricalString{UInt32}}, which is a type that's not supposed to ever be used

I think its useful to allow vectors of categorical elements. In a general toolbox like MLJ, where usability is at least as important as efficiency, we often focus on elements rather than how the elements are wrapped, to avoid a multiplicity of case distinctions. In any case, CategoricalString is a public object in your design, so there is nothing to stop users constructing vectors of them, intentionally or unintentionally, and if they do so, broadcasting will have perverse consequences, it seems to me.

There are several use cases we have and I have not isolated all the places where our code has broken. Here's one: The predict method of a probabilistic classifier returns a vector of (custom) Distribution objects, one for each input pattern. A model API interface may implement a predict_mode method for "point" predictions but a fallback just broadcasts mode over the vector. Now mode(d) for an individual distribution d returns a single CategoricalString or CategoricalValue pointing to the pool of the target used for training (an essential requirement for us). However, since the change in CategoricalArrays, broadcasting mode is returning a vector of categorical elements each having a mutated singleton pool.

@ablaom ablaom closed this as completed Jun 13, 2019
@ablaom ablaom reopened this Jun 13, 2019
@nalimilan
Copy link
Member

The problem is that in many situations we really need broadcasting to return a CategoricalArray, nor an Array{CategoricalString{UInt32}}. We made this change so that df[col] .= v with v::CategoricalString creates a CategoricalArray column filled with v.

I wonder whether we could restrict the new behavior to cases where one broadcasts over a CategoricalArray or a CategoricalValue, and not another kind of array as in your example.

Anyway, for now couldn't you use a comprehension instead of broadcasting to work around the issue?

@ablaom
Copy link
Author

ablaom commented Jun 13, 2019

You mean as in [first(x) for x in V]? That is giving me the same issue:

julia> [first(x) for x in V]
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "a"
 "a"

julia> ans[1].pool
CategoricalPool{String,UInt32}(["a"])

(continuing example from above)

@nalimilan
Copy link
Member

Any[first(x) for x in V] should give you want you want.

@ablaom
Copy link
Author

ablaom commented Jun 13, 2019

Yes, thank you!

@nalimilan
Copy link
Member

Cool. So do you think this affects many places in your code? Any chance you could show me a diff? I'm trying to assess what would be the most appropriate behavior to be both convenient most of the time but also flexible enough to cover all use cases.

@bkamins
Copy link
Member

bkamins commented Jun 14, 2019

Hi. My thinking is the following. We have two options:

Option 1

CategoricalValue is a valid object type on its own. A value of this type comes from some space of categorical values. You can check if two CategoricalValues come from the same space by checking if they have the same pool.

In this interpretation a value has levels. Manipulating levels of value manipulates levels of the space from which it comes. We can easily compare CategoricalValues if their pool allows it.

Under this interpretation you should be able to store CategoricalValues in any container without losing the reference of the space they come from. And you should be able to create CategoricalValue on its own (without a wrapping collection). This also means that creating a new CategoricalArray from some old CategoricalValues should be explicit (as it is essentially a conversion of the space from which CategoricalValue comes from).

Option 2

It is the collection (array) that is categorical. Then essentially we have a PooledArray on steroids (providing some more functionality of managing levels). Getting a value from a CategoricalArray should then produce a normal unwrapped value. The only challenge in this case is comparing categorical values, which would require some non-standard function allowing the user to pass two values along with the categorical space with respect to which they should be compared.

@nalimilan
Copy link
Member

I think we clearly don't want to support just option 2, or it would be impossible to retrieve possible levels, nor to know that a variable is categorical, from a single value. And of course as you note comparison would be impossible. Also we couldn't support CategoricalValue{Int} as it would be treated as a continuous variable in models.

Option 1 is already mostly implemented AFAICT: we don't prevent creating Array{CategoricalValue{T}} objects at all. The question is just, when should we generate a CategoricalArray rather than a Array{CategoricalValue{T}} (knowing that there are ways around that default behavior).

To illustrate the problems with Array{CategoricalValue{T}}, let us imagine replacing CategoricalArray{T} with Array{CategoricalValue{T}} (disregarding performance issues): each value could have a different pool, and when calling levels on the array we would extract the pools for all values and merge them. That way, when all values have the same pool, everything would behave like CategoricalArray currently; but values with different pools could also be stored without losing their specific levels. Though this model could also be confusing: if you extracted a value and accessed its levels, they might not reflect all possible values in the array. Fundamentally, this limitation is due to the fact that CategoricalArray represents a variable, carrying information which is necessarily common to all values in the array. We need the CategoricalArray type to ensure the desirable property that a single CategoricalValue contains information about all other possible values.

So in the end we should we generate a CategoricalArray when we expect that users are generating a single variable (levels should be combined). Otherwise, if we think that values are coming from different variables, we should return an Array{CategoricalValue{T}}. Can we assume that the former corresponds to map/broadcast on CategoricalArray inputs, and the latter to other cases? That's what I'm wondering.

@bkamins
Copy link
Member

bkamins commented Jun 18, 2019

I agree that option 1 is more sensible 😄.

I would say that we should create CategoricalArray if we transform CategoricalArray as a whole (this is what you call variable I guess). Natural candidates are broadcasting and map as you say. This way people get what they want most of the time.

If I understand you correctly - this is the same what you say - right?

If we agree on this then usages Tables.allocatecolumn probably should be thought over again.

@ablaom
Copy link
Author

ablaom commented Aug 15, 2019

The basic problem with the status quo is that any kind of function that generates CategoricalValue (or String) is likely to have unexpected behaviour when broadcasted or mapped or used in comprehension. This is true no matter what the arguments of this function are. Take a zero-argument function as an example:

julia> box = categorical(['a', 'b', 'c', 'd'])
julia> choose() = rand(box)
julia> levels([choose() for i in 1:2])
2-element Array{CategoricalValue{Char,UInt32},1}:
 'b'
 'd'

What happened to levels a and c?

Order, in addition to levels, is lost:

julia> isordered([choose() for i in 1:2])
false

@nalimilan In answer to your query, the impact for MLJ of the new behaviour is extensive; I have given up patching the problem and just added a restriction on CategoricalArrays. The problem is that the unexpected behaviour extends to what the user does - ie this is not just an internal implementation problem. See the use case example below.

The following options should work for me, if you are going to insist on attempting to pack CategoricalValues returned by map, broadcast and comprehension into CategorialArrays:

  • Test that all CategoricalValues have == pools; otherwise throw an error (my preference)
  • Test that all CategoricalValues have == pools; otherwise return in ordinary Array

Here == means same .index and same .order

The problem with merging the pools (assuming they are compatible) is, as I understand it, that you either have to make copies of the elements (to bring their pools into line) or mutate them. In the first case a mutation of a pool somewhere else doesn't propagate to the copies (unexpected) ; in the second case the mutation will effect every CategoricalValue pointing to the same pool (unexpected).

MLJ use-case example

setup

using Pkg
Pkg.activate("junk")
Pkg.add(PackageSpec(name="MLJBase", rev="broadcast"))
Pkg.add("CategoricalArrays")
Pkg.add("StatsBase")

julia> lev(cat_element) = cat_element.pool.levels
lev (generic function with 1 method)

In an MLJ classification problem there is a categorical target to predict, and in training this target is given to us (along with the inputs features omitted here):

julia> y = categorical(["yes", "no", "yes", "yes", "maybe"])
5-element CategoricalArray{String,1,UInt32}:
 "yes"  
 "no"   
 "yes"  
 "yes"  
 "maybe"

A probabilistic classifier is trained using this data and, for each new input pattern, makes a probabilistic prediction, which in MLJ is a Distribution object. Let's synthesise an example:

julia> yes = y[1]
CategoricalString{UInt32} "yes"

julia> no = y[2]
CategoricalString{UInt32} "no"

julia> d = UnivariateFinite(Dict(yes=>0.7, no=>0.3))
"UnivariateFinite(no=>0.3, yes=>0.7)"

The most likely outcome under this probabilistic prediction is:

julia> ŷ = mode(d) # single prediction
CategoricalString{UInt32} "yes"

and this retains all the target levels (essential):

julia> lev(ŷ)
3-element Array{String,1}:
 "maybe"
 "no"   
 "yes"  

However, more commonly the user will request a vector of probabilistic predictions - one for each input pattern - something like:

julia> [d, d, d, d, d]
5-element Array{UnivariateFinite{String,UInt32,Float64},1}:
"UnivariateFinite(yes=>0.7, no=>0.3)"
"UnivariateFinite(yes=>0.7, no=>0.3)"
"UnivariateFinite(yes=>0.7, no=>0.3)"
"UnivariateFinite(yes=>0.7, no=>0.3)"
"UnivariateFinite(yes=>0.7, no=>0.3)"


But now, to get the point-predictions, the user will do:

julia> ŷ = mode.([d, d, d, d, d])
5-element CategoricalArray{String,1,UInt32}:
 "yes"
 "yes"
 "yes"
 "yes"
 "yes"

But the levels have disappeared:

julia> lev(ŷ[1])
1-element Array{String,1}:
 "yes"

@nalimilan
Copy link
Member

Thanks for the details. I think the best behavior for the code you present is what we have agreed on with @bkamins on Slack: have map return a CategoricalArray, but ensure that setindex!(x::CategoricalArray, v::CategoricalValue, ...) merges levels from v into x. In the most frequent case where all values v have the same levels, this will simply mean that levels from v will be added to x.

This is relatively easy to implement. What will be harder is making this efficient. But this seems doable by keeping a global table of pools indicating subset relations (in which case levels don't need to be changed). Efficiency can be implemented later, as long as the behavior is correct.

The problem with merging the pools (assuming they are compatible) is, as I understand it, that you either have to make copies of the elements (to bring their pools into line) or mutate them. In the first case a mutation of a pool somewhere else doesn't propagate to the copies (unexpected) ; in the second case the mutation will effect every CategoricalValue pointing to the same pool (unexpected).

Well, even if we don't merge the pools, storing values in a CategoricalArray breaks the relationship with the original pool. We really need to do that, or it wouldn't be possible to do e.g. y = copy(x); x[1] = y[1]. I don't think that's terrible: it even makes things simpler, by ensuring that pools cannot be accidentally shared between arrays.

@ablaom
Copy link
Author

ablaom commented Aug 22, 2019

@nalimilan Many thanks for the update!

Sound like you have a plan that will accommodate us. Looking forward to the implementation.

I'm still nervous about the mutability of pools in CategoricalArrays.jl. For our use case, I would prefer immutable categorical elements with immutable pools (and ordered being a type parameter). If users start messing with the levels or their order in the ML context, then I think the results are going to be difficult to manage.

Has there been any interest or discussion about immutable categorical types elsewhere?

@xiaodaigh
Copy link

Can't finish my Julia course without MLJ supporting DataFrames.jl which is pending this! Happy to contribute some documentation afterwards :)

I want to teach Julia but don't want my material to be outdate the day I finish teaching. Thanks people

@nalimilan
Copy link
Member

Can this be closed now that #211 has been merged?

@ablaom
Copy link
Author

ablaom commented Sep 26, 2019

Fine with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants