Broadcasting methods over CategoricalString has changed #199
Comments
Agreed - looks like a nonsensical/problematic, and thus hopefully unintended, change to me - a "good" categorical type should always remember/encode the potential category values, rather than the set of actual category values found within the data. That is, the set of levels should remain stable under sub-setting. Otherwise you're conducting data summarization where you should be doing data type specification. I see you've raised this already in the CategoricalArrays.jl issue tracker, so fingers crossed.
Mmm, that's really a tricky case. This used to work because In the end, the issue boils down to this: what do you expect the following code to return?
v1 = categorical(["a", "b", "c"])
v2 = categorical(["a", "a", "a"])
v2[1] = v1[1]
levels(v2)
Currently, levels are just Until we sort this out, can you explain your actual use case? I suspect there are simpler ways of achieving the same result, which don't involve all these broadcasting subtleties.
Thanks for that.
I think it's useful to allow vectors of categorical elements. In a general toolbox like MLJ, where usability is at least as important as efficiency, we often focus on elements rather than how the elements are wrapped, to avoid a multiplicity of case distinctions. In any case, there are several use cases we have and I have not isolated all the places where our code has broken. Here's one: The
The problem is that in many situations we really need broadcasting to return a I wonder whether we could restrict the new behavior to cases where one broadcasts over a Anyway, for now couldn't you use a comprehension instead of broadcasting to work around the issue?
You mean as in
julia> [first(x) for x in V]
3-element CategoricalArray{String,1,UInt32}:
"a"
"a"
"a"
julia> ans[1].pool
CategoricalPool{String,UInt32}(["a"])
(continuing example from above)
Yes, thank you!
Cool. So do you think this affects many places in your code? Any chance you could show me a diff? I'm trying to assess what would be the most appropriate behavior to be both convenient most of the time but also flexible enough to cover all use cases.
Hi. My thinking is the following. We have two options:
Option 1
In this interpretation a value has levels. Manipulating levels of value manipulates levels of the space from which it comes. We can easily compare Under this interpretation you should be able to store
Option 2
It is the collection (array) that is categorical. Then essentially we have a
I think we clearly don't want to support just option 2, or it would be impossible to retrieve possible levels, nor to know that a variable is categorical, from a single value. And of course as you note comparison would be impossible. Also we couldn't support Option 1 is already mostly implemented AFAICT: we don't prevent creating To illustrate the problems with So in the end we should we generate a
I agree that option 1 is more sensible 😄. I would say that we should create If I understand you correctly - this is the same as what you say - right? If we agree on this then usages
The basic problem with the status quo is that any kind of function that generates CategoricalValue (or String) is likely to have unexpected behaviour when broadcast, mapped, or used in a comprehension. This is true no matter what the arguments of this function are. Take a zero-argument function as an example:
julia> box = categorical(['a', 'b', 'c', 'd'])
julia> choose() = rand(box)
julia> levels([choose() for i in 1:2])
2-element Array{CategoricalValue{Char,UInt32},1}:
'b'
'd'
What happened to levels Order, in addition to levels, is lost:
julia> isordered([choose() for i in 1:2])
false
@nalimilan In answer to your query, the impact for MLJ of the new behaviour is extensive; I have given up patching the problem and just added a restriction on CategoricalArrays. The problem is that the unexpected behaviour extends to what the user does, i.e. this is not just an internal implementation problem. See the use case example below. The following options should work for me, if you are going to insist on attempting to pack CategoricalValues returned by map, broadcast and comprehension into CategoricalArrays:
Here == means same .index and same .order
The problem with merging the pools (assuming they are compatible) is, as I understand it, that you either have to make copies of the elements (to bring their pools into line) or mutate them. In the first case a mutation of a pool somewhere else doesn't propagate to the copies (unexpected); in the second case the mutation will affect every CategoricalValue pointing to the same pool (unexpected).
MLJ use-case example setup
using Pkg
Pkg.activate("junk")
Pkg.add(PackageSpec(name="MLJBase", rev="broadcast"))
Pkg.add("CategoricalArrays")
Pkg.add("StatsBase")
julia> lev(cat_element) = cat_element.pool.levels
lev (generic function with 1 method)
In an MLJ classification problem there is a categorical target to predict, and in training this target is given to us (along with the input features, omitted here):
julia> y = categorical(["yes", "no", "yes", "yes", "maybe"])
5-element CategoricalArray{String,1,UInt32}:
"yes"
"no"
"yes"
"yes"
"maybe"
A probabilistic classifier is trained using this data and, for each new input pattern, makes a probabilistic prediction, which in MLJ is a
julia> yes = y[1]
CategoricalString{UInt32} "yes"
julia> no = y[2]
CategoricalString{UInt32} "no"
julia> d = UnivariateFinite(Dict(yes=>0.7, no=>0.3))
"UnivariateFinite(no=>0.3, yes=>0.7)"
The most likely outcome under this probabilistic prediction is:
julia> ŷ = mode(d) # single prediction
CategoricalString{UInt32} "yes"
and this retains all the target levels (essential):
julia> lev(ŷ)
3-element Array{String,1}:
"maybe"
"no"
"yes"
However, more commonly the user will request a vector of probabilistic predictions - one for each input pattern - something like:
julia> [d, d, d, d, d]
But the levels have disappeared:
julia> lev(ŷ[1])
1-element Array{String,1}:
"yes"
Thanks for the details. I think the best behavior for the code you present is what we have agreed on with @bkamins on Slack: have This is relatively easy to implement. What will be harder is making this efficient. But this seems doable by keeping a global table of pools indicating subset relations (in which case levels don't need to be changed). Efficiency can be implemented later, as long as the behavior is correct.
Well, even if we don't merge the pools, storing values in a
@nalimilan Many thanks for the update! Sounds like you have a plan that will accommodate us. Looking forward to the implementation. I'm still nervous about the mutability of pools in CategoricalArrays.jl. For our use case, I would prefer immutable categorical elements with immutable pools (and Has there been any interest or discussion about immutable categorical types elsewhere?
Can't finish my Julia course without MLJ supporting DataFrames.jl, which is pending this! Happy to contribute some documentation afterwards :) I want to teach Julia but don't want my material to be outdated the day I finish teaching. Thanks people
Can this be closed now that #211 has been merged?
Fine with me.
A recent change in CategoricalArrays has broken my code at MLJ. The change seems strange to me and I wonder if it is intentional (I hope not :-)):
Starting with:
Under 0.5.2:
... which makes sense to me. But now, under 0.5.4, we have:
... which is causing me problems.