-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Support fitting arbitrary StatisticalModels with DataFrames #571
Conversation
This seems really promising. Let me give it a proper read through tomorrow morning. As always, thanks for doing so much work on this! |
Another thing to keep in mind: adding |
args...; kwargs...) | ||
mf = ModelFrame(f, df) | ||
mm = ModelMatrix(mf) | ||
y = model_response(mf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This raises the question of how we should handle unsupervised methods that won't take a y
input. R does this often with formulas that have a .
on the left-hand side. Not sure we need that, but seems worth thinking about.
This all looks good to me. I'm still really shaky on the distinction between RegressionModel and StatisticalModel. We shouldn't hold this up because of that question, but we'll want to return here if our understanding of those ideas changes. |
Also, I agree that we should place |
Agree to put |
Any more thoughts on this, or shall I go ahead and merge the relevant PRs? |
Let's merge this, then we can fix the |
This PR (and a minor change to StatsBase) would allow any statistical model package that implements the StatisticalModel/RegressionModel interfaces from StatsBase and accepts a design matrix and response vector to work with DataFrames without depending on DataFrames and without any additional code. It turns calls to
StatsBase.fit(::Type{T<:StatisticalModel}, f::Formula, df::AbstractDataFrame, ...)
into calls toStatsBase.fit(::Type{T<:StatisticalModel}, X::Matrix, y::Vector)
, and wraps the returned model so that all of the generic StatisticalModel/RegressionModel methods from StatsBase work on it. Additionally, it alters the CoefTable returned bycoeftable(::UnderlyingModel)
to add the coefficient names. This could be extended to wrapconfint
to return DataFrames,predict
to accept DataFrames, etc.Demo, with a hacked up StatsBase and GLM that has no DataFrames dependency:
The main downside to this approach is that methods that were defined on
UnderlyingModel
but notStatisticalModel
cannot be called directly on the returned model. One presently needs to accessdfmodel.model
to get the underlying model and call the method there. One option I'm considering is to callmethodswith(UnderlyingModel)
and dynamically wrap any methods the first time a model is constructed. At least in 0.3, I think that should work, although type inference may not be optimal.Related to the decoupling of GLM and DataFrames, mentioned by @lindahua in JuliaStats/Roadmap.jl#11. cc @johnmyleswhite (yes, I borrowed your
@delegate
macro) and @dmbates