RFC: Support fitting arbitrary StatisticalModels with DataFrames #571

Merged: 3 commits, Mar 31, 2014


simonster
Contributor

This PR (and a minor change to StatsBase) would allow any statistical model package that implements the StatisticalModel/RegressionModel interfaces from StatsBase and accepts a design matrix and response vector to work with DataFrames without depending on DataFrames and without any additional code. It turns calls to StatsBase.fit(::Type{T<:StatisticalModel}, f::Formula, df::AbstractDataFrame, ...) into calls to StatsBase.fit(::Type{T<:StatisticalModel}, X::Matrix, y::Vector), and wraps the returned model so that all of the generic StatisticalModel/RegressionModel methods from StatsBase work on it. Additionally, it alters the CoefTable returned by coeftable(::UnderlyingModel) to add the coefficient names. This could be extended to wrap confint to return DataFrames, predict to accept DataFrames, etc.
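
A minimal sketch of the wrapping idea described above, written in current Julia syntax and without StatsBase or DataFrames. Everything here is a stand-in and an assumption for illustration: the `StatisticalModel` type is redeclared locally, `ToyOLS` and `TableModel` are hypothetical names, and a `NamedTuple` of columns plus a list of predictor symbols replaces the real `Formula`/`AbstractDataFrame` pair. It is not the code in this PR.

```julia
using LinearAlgebra

# Stand-in for StatsBase.StatisticalModel (assumption: redeclared locally).
abstract type StatisticalModel end

# Hypothetical model package code: it only knows about a matrix and a vector.
struct ToyOLS <: StatisticalModel
    beta::Vector{Float64}
end
fit(::Type{ToyOLS}, X::Matrix{Float64}, y::Vector{Float64}) = ToyOLS(X \ y)
coef(m::ToyOLS) = m.beta

# Hypothetical wrapper: holds the underlying model plus coefficient names.
struct TableModel{M<:StatisticalModel}
    model::M
    coefnames::Vector{String}
end
coef(m::TableModel) = coef(m.model)      # generic method forwarded to the model
coefnames(m::TableModel) = m.coefnames

# Table-aware entry point: build X and y, call the matrix method, wrap the result.
function fit(::Type{T}, response::Symbol, predictors::Vector{Symbol},
             table::NamedTuple, args...; kwargs...) where {T<:StatisticalModel}
    y = Vector{Float64}(table[response])
    X = hcat(ones(length(y)), (Vector{Float64}(table[p]) for p in predictors)...)
    cn = vcat("(Intercept)", String.(predictors))
    return TableModel(fit(T, X, y, args...; kwargs...), cn)
end

table = (x = [1.0, 2.0, 3.0, 4.0], y = [5.0, 8.0, 11.0, 14.0])
m = fit(ToyOLS, :y, [:x], table)
```

The model type never sees the table; only the generic `fit` method does, which is what lets a model package stay free of a table dependency.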

Demo, with a hacked up StatsBase and GLM that has no DataFrames dependency:

julia> using GLM, DataFrames

julia> dobson = DataFrame(Counts = [18.,17,15,20,10,20,25,13,12],
                          Outcome = gl(3,1,9),
                          Treatment = gl(3,3))
9x3 DataFrame
|-------|--------|---------|-----------|
| Row # | Counts | Outcome | Treatment |
| 1     | 18.0   | 1       | 1         |
| 2     | 17.0   | 2       | 1         |
| 3     | 15.0   | 3       | 1         |
| 4     | 20.0   | 1       | 2         |
| 5     | 10.0   | 2       | 2         |
| 6     | 20.0   | 3       | 2         |
| 7     | 25.0   | 1       | 3         |
| 8     | 13.0   | 2       | 3         |
| 9     | 12.0   | 3       | 3         |

julia> gm1 = fit(GlmMod, Counts ~ Outcome + Treatment, dobson, Poisson())
DataFrameRegressionModel{GlmMod,Float64}:

Coefficients:
                   Estimate Std.Error     z value Pr(>|z|)
(Intercept)         3.04452  0.170899     17.8148  < eps()
Outcome - 2       -0.454255  0.202171    -2.24689   0.0246
Outcome - 3       -0.292987  0.192742     -1.5201   0.1285
Treatment - 2   2.62621e-16       0.2  1.3131e-15      1.0
Treatment - 3  -5.44239e-18       0.2 -2.7212e-17      1.0

julia> coef(gm1)
5-element Array{Float64,1}:
  3.04452    
 -0.454255   
 -0.292987   
  2.62621e-16
 -5.44239e-18

julia> stderr(gm1)
5-element Array{Float64,1}:
 0.170899
 0.202171
 0.192742
 0.2     
 0.2

The main downside to this approach is that methods that were defined on UnderlyingModel but not StatisticalModel cannot be called directly on the returned model. One presently needs to access dfmodel.model to get the underlying model and call the method there. One option I'm considering is to call methodswith(UnderlyingModel) and dynamically wrap any methods the first time a model is constructed. At least in 0.3, I think that should work, although type inference may not be optimal.
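
To make the forwarding problem concrete, here is a rough sketch of a delegation macro in current Julia syntax. It is an assumption-laden illustration, not the `@delegate` macro used in this PR: `Inner`, `Wrapped`, and `nparams` are hypothetical names, and the macro takes an explicit list of functions rather than discovering them with `methodswith`.

```julia
# Stand-in for StatsBase.StatisticalModel (assumption: redeclared locally).
abstract type StatisticalModel end

# Hypothetical underlying model with one generic and one model-specific method.
struct Inner <: StatisticalModel
    beta::Vector{Float64}
end
coef(m::Inner) = m.beta
nparams(m::Inner) = length(m.beta)   # defined on the model, not on the interface

# Hypothetical wrapper type, playing the role of the returned DataFrame model.
struct Wrapped{M<:StatisticalModel}
    model::M
end

# For each listed function, define a method on T that forwards to the given field.
macro delegate(T, field, fnames...)
    defs = map(fnames) do f
        :( $f(w::$T, args...; kwargs...) =
               $f(getfield(w, $(QuoteNode(field))), args...; kwargs...) )
    end
    return esc(Expr(:block, defs..., nothing))
end

@delegate Wrapped model coef nparams

w = Wrapped(Inner([1.0, 2.0]))
```

With the forwarding methods in place, `nparams(w)` works without reaching into `w.model`. Any function not listed in the macro call still requires going through the field.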

Related to the decoupling of GLM and DataFrames, mentioned by @lindahua in JuliaStats/Roadmap.jl#11. cc @johnmyleswhite (yes, I borrowed your @delegate macro) and @dmbates

@johnmyleswhite
Contributor

This seems really promising. Let me give it a proper read through tomorrow morning.

As always, thanks for doing so much work on this!

@simonster
Contributor Author

Another thing to keep in mind: adding fit to StatsBase would conflict with fit in Distributions, but since Distributions already depends on StatsBase, I think we could just import StatsBase.fit.
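
A small sketch of why importing resolves the clash, in current Julia syntax. The module names `BaseStats` and `Dists` and the type `Normalish` are hypothetical stand-ins for StatsBase, Distributions, and a distribution type. The point is only that a downstream module which imports the name adds methods to the same generic function instead of creating a second, conflicting `fit`.

```julia
module BaseStats                 # stand-in for StatsBase
export fit
function fit end                 # generic function declared with no methods
end

module Dists                     # stand-in for Distributions
import ..BaseStats: fit          # extend the existing function, do not shadow it
export fit, Normalish

struct Normalish
    mu::Float64
end

# Adds a method to BaseStats.fit because the name was imported above.
fit(::Type{Normalish}, x::Vector{Float64}) = Normalish(sum(x) / length(x))
end

using .BaseStats, .Dists         # both export the same binding, so no conflict

d = fit(Normalish, [1.0, 2.0, 3.0])
```

Had `Dists` defined `fit` without the import, the two exported names would refer to different functions and an unqualified `fit` would be ambiguous for anyone using both modules.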

args...; kwargs...)
mf = ModelFrame(f, df)
mm = ModelMatrix(mf)
y = model_response(mf)

This raises the question of how we should handle unsupervised methods that won't take a y input. R does this often with formulas that have a . on the left-hand side. Not sure we need that, but seems worth thinking about.
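
One possible shape for this, sketched in current Julia syntax under stated assumptions: a missing response is signalled with `nothing`, and the table-aware method then calls a one-argument matrix method. `ColumnMeans`, the `NamedTuple` table, and the `response`/`predictors` arguments are hypothetical and stand in for a real model type and a one-sided formula; nothing here is part of this PR.

```julia
# Stand-in for StatsBase.StatisticalModel (assumption: redeclared locally).
abstract type StatisticalModel end

# Hypothetical unsupervised "model": it just stores the column means of X.
struct ColumnMeans <: StatisticalModel
    means::Vector{Float64}
end
fit(::Type{ColumnMeans}, X::Matrix{Float64}) =
    ColumnMeans(vec(sum(X; dims=1)) ./ size(X, 1))

# Hypothetical table-aware entry point; `response === nothing` means "no y".
function fit(::Type{T}, response::Union{Symbol,Nothing},
             predictors::Vector{Symbol}, table::NamedTuple) where {T<:StatisticalModel}
    X = hcat((Vector{Float64}(table[p]) for p in predictors)...)
    response === nothing && return fit(T, X)          # unsupervised path
    return fit(T, X, Vector{Float64}(table[response]))  # supervised path
end

table = (a = [1.0, 3.0], b = [2.0, 6.0])
cm = fit(ColumnMeans, nothing, [:a, :b], table)
```

Whether the absence of a response is better expressed in the formula syntax or by a separate method is exactly the open question raised in the comment.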

@johnmyleswhite
Contributor

This all looks good to me. I'm still really shaky on the distinction between RegressionModel and StatisticalModel. We shouldn't hold this up because of that question, but we'll want to return here if our understanding of those ideas changes.

@johnmyleswhite
Contributor

Also, I agree that we should place fit in StatsBase and then import it into Distributions.

@lindahua
Contributor

Agreed: fit and other common names should go in StatsBase.

@simonster
Contributor Author

Any more thoughts on this, or shall I go ahead and merge the relevant PRs?

@johnmyleswhite
Contributor

Let's merge this, then we can fix the fit exporting issue. I think it's easiest to export everything from StatsBase so that we lay claim to those names.
