Unify the efforts for Regression/GLM #14
Comments
I like the design/choices of abstraction!
I very much like this idea.
Alright, now I see that it is an issue in the repository. I might suggest putting MixedModels within this framework too. Linear mixed models, generalized linear mixed models and nonlinear mixed models are all in the regression model family.
Great initiative! I agree with this abstraction. But your argument holds in general for all regularized empirical risk minimization approaches. Is it necessary to restrict the base package to regression?
I am totally fine with using … cc: @gusl
@BigCrunsh In my mind, the term … If you don't mind, we can just use …
Sounds good.
@BigCrunsh: I have added you as one of the owners of JuliaStats, so you have the privileges to move packages here.
Generally, I support the idea of more standardized APIs and unification of our many regression packages. Some more specific comments below.

It would be great if our API supported fitting multiple dependent variables in some way, either explicitly or by offering a …

L1 solvers are often used to fit many models spanning the entire regularization path because 1) fitting the entire regularization path is often not much more computationally expensive than fitting a single model (especially for LARS, which has to fit the preceding part of the regularization path anyway) and 2) the regularization parameters are typically selected by cross-validation, so knowledge of the entire regularization path is useful. We should thus have a standardized API for holding the regularization path and performing cross-validation. Perhaps we should support the same for ridge, although the standard Cholesky algorithm doesn't benefit as much from fitting the entire regularization path, and generalized cross-validation is often used in place of ordinary cross-validation.

As far as a high-level interface for fitting models goes, as of JuliaData/DataFrames.jl#571, you can fit any model that defines …
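Regarding the regularization-path point above, here is a hypothetical sketch of what a shared path + cross-validation API could look like; `RegPath`, `fit_path`, and `cv_mse` are made-up names for illustration, not an existing package API.

```julia
# Hypothetical sketch only -- none of these names exist in any package yet.
# A "path fit" returns one coefficient vector per regularization parameter;
# LARS, a glmnet wrapper, or coordinate descent could all produce such a path.
struct RegPath
    lambdas::Vector{Float64}   # regularization parameters along the path
    betas::Matrix{Float64}     # one column of coefficients per lambda
end

# Cross-validation then only needs to score each model along the path
# on held-out data, regardless of which solver produced the path.
function cv_mse(path::RegPath, Xval::Matrix{Float64}, yval::Vector{Float64})
    [sum(abs2, yval .- Xval * path.betas[:, k]) / length(yval)
     for k in eachindex(path.lambdas)]
end
```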
I have nothing to add except that this is my favorite issue in a long time. (Besides the "Can" issue.)
I'm hoping to receive comments / edits on the wiki, and that this document …

-- Gustavo Lacerda
Sorry for the late reply; I am on holidays for the next couple of weeks, hence the delay. This is a great initiative, and I favor the proposed abstraction and unification for regression models. More generally, I favor the unification of model specification across packages, as discussed in …
@gusl thanks for creating the wiki. I am not completely sure that there can be a common interface that works for all statistical models. Generally, generative Bayesian networks, discriminative models, Markov random fields, time series, stochastic processes -- most of these can be called statistical models. I can't imagine one interface that can fit them all. For example, a Bayesian network may involve multiple variables, not just x and y, that are related to each other in a complicated way, while a time series model needs to be updated over time. I think it is more pragmatic to consider interface designs for individual families of models. Within this restricted context, many of your proposals do make a lot of sense.

This issue, in particular, focuses on a common family of problems -- regression analysis. Generally, regression analysis aims to find relations between dependent variables (also known as responses) and independent variables (e.g. features/attributes). A typical classification problem can be considered as a special case of a regression problem that tries to find relations between the features and the class labels. From a mathematical standpoint, a regression problem can be formalized in two ways:

- as a (regularized) empirical risk minimization problem, where we seek parameters that minimize the sum of per-sample losses plus a regularization term; or
- as a probabilistic model of the conditional distribution of the response given the covariates, whose parameters are estimated by maximum likelihood (or MAP).
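A minimal illustration of the first formulation, assuming generic `loss` and `penalty` functions supplied by the model level (a sketch for discussion, not any package's API):

```julia
using LinearAlgebra

# Regularized empirical risk of a linear predictor: mean per-sample loss plus penalty.
function empirical_risk(loss, penalty, beta, X, y, lambda)
    n = size(X, 1)
    risk = sum(loss(dot(X[i, :], beta), y[i]) for i in 1:n) / n
    return risk + lambda * penalty(beta)
end

# Plugging in squared loss and an L2 penalty recovers ridge regression;
# swapping the loss/penalty changes the model without touching the solver.
sqloss(u, y) = 0.5 * (u - y)^2
l2(beta) = 0.5 * sum(abs2, beta)
```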
A generalized linear model is a special case of the regression analysis problem as outlined above, where the dependent variable is assumed to follow an exponential-family distribution whose mean is related to a linear function of the independent variables through a link function. A generalized linear model can be estimated in two ways: (1) cast it to a regularized risk minimization problem; or (2) use algorithms dedicated to GLMs.

Conceptually, all these things can be divided into three levels:

- the solver level: algorithms that solve standardized numerical problems expressed purely in terms of matrices and vectors;
- the model level: formulations of statistical models (losses, regularizers, likelihoods) that are turned into such numerical problems;
- the semantics level: the user-facing layer that attaches meaning to the data, e.g. data frames, formulas, and variable names.
A major principle in modern software engineering is separation of concerns. This principle also applies here. I can imagine that different groups of developers (of course these groups may overlap in reality) may focus on different levels:

- optimization specialists working at the solver level, implementing efficient algorithms for standardized numerical problems;
- statisticians / machine learning developers working at the model level, expressing models as such problems;
- developers of high-level packages working at the semantics level, providing the user-facing API (data frames, formulas, summaries).
Particularly, people who implement solvers or machine learning algorithms should not be concerned about things like data frames, etc. It is the responsibility of the higher-level packages to convert data frames into a problem in standardized form (involving only numerical matrices and vectors). I hope this further clarifies my thoughts.
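A minimal sketch of how the three levels could map onto types, purely to illustrate the separation of concerns; all names here are hypothetical:

```julia
# Hypothetical sketch -- illustrative names only, not a real package.

# Solver level: works on plain numerical problems, never sees a data frame.
abstract type Solver end
struct GradientDescentSolver <: Solver
    stepsize::Float64
    maxiter::Int
end

# Model level: a statistical model expressed as a standardized numerical problem.
struct RiskMinProblem{L,R}
    X::Matrix{Float64}     # design matrix
    y::Vector{Float64}     # responses
    loss::L                # e.g. squared loss, logistic loss
    penalty::R             # e.g. L1, L2
    lambda::Float64
end

# Semantics level (a higher-level package): converts a data frame + formula into
# X and y, builds a RiskMinProblem, calls a solver, and maps the coefficients
# back to variable names for the user.
```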
To @Scidom: the model level of this formulation (as outlined above) can be seen as a factor in a probabilistic graphical model, and thus can be incorporated in a larger probabilistic framework.
My experience with developing Distributions and Graphs is that interfaces may change a lot relative to what was planned originally. It would be useful to start building up a package and make changes as necessary as we move forward. We can update the wiki as the API matures. As to how we may proceed, I think the next step would be to start working on the regression codebase (starting from the solver level). @BigCrunsh: would you please move RegERMs.jl to JuliaStats?
@lindahua is right about getting started. Look to JuliaOpt for inspiration that it can work, although it was a smaller set of developers. We have a solver level (i.e. a package for each solver wrapper), a generic interface level to all solvers that defines a canonical form (MathProgBase.jl), and then currently one modeling interface (JuMP.jl, although CVX.jl will join it soon).
I looked at the code in RegERMs.jl. We probably need to enrich that system through further discussion. However, I think it is already a good starting point.
This breakdown into the solver, model and semantics levels is very good. It might be a bit premature, but I find that making the names of things line up with the concepts can be very helpful to get everyone on the same page conceptually. (This is why I'm so picky about naming.) Perhaps there should be packages named …
@StefanKarpinski These names would be useful as abstract types. This whole thing involves close interaction between these types, hence it would make sense to put this type hierarchy in a foundational package, together with a clear document about how they work. Other packages can extend those or build on top of them. Originally, I proposed to have a package named RegressionBase.jl for this purpose.
The issue with graphical models is that 'fit' can mean many different things for the same StatisticalModel, e.g. a specific graphical model such as an A -- B -- C Ising model. (I'm introducing an extra level, between Model and Solver.) My idea is that 'fit' should still be used, with extra arguments that … e.g. given an instance of the A--B--C Ising model with a free parameter for …

> while a time series model need to be updated over time.
> I think it is more pragmatic to consider interface designs for individual families of models.

g is said to be the link function if E[Y_i] = g(X_i beta). If g :: Real -> …
@lindahua: I already moved RegERMs.jl to JuliaStats.
@gusl you touched on various matters in your last message. As far as passing a user-defined proposal to the Metropolis-Hastings sampler is concerned, I have thought about it over the last month and have a clear idea of how to do it. In fact I have pretty much completed coding it, and once finished, I will merge this generalisation into the MCMC package.
P.S. In fact the structure of MCMC will undergo several already-planned changes and refactorings, though this goes beyond the scope of the present thread.
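For readers unfamiliar with the idea, here is a generic sketch of a Metropolis-Hastings step that accepts a user-defined proposal; it is a plain illustration of the concept, not the interface of the MCMC package.

```julia
# Generic Metropolis-Hastings step with a user-supplied proposal.
# `logtarget(x)` is the log density of the target (up to a constant),
# `propose(x)` draws a candidate, and `logq(from, to)` is the log proposal density.
function mh_step(logtarget, propose, logq, x)
    xstar = propose(x)
    loga = logtarget(xstar) - logtarget(x) +
           logq(xstar, x) - logq(x, xstar)   # Hastings correction for asymmetric proposals
    return log(rand()) < loga ? xstar : x
end

# Example: a Gaussian random-walk proposal with step size 0.5 (symmetric,
# so the correction term cancels out).
rw_propose(x) = x + 0.5 * randn()
rw_logq(from, to) = -(to - from)^2 / (2 * 0.5^2)
```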
I like the idea of the solver, model, and semantics levels. I agree with @lindahua and @IainNZ: let's get started, perhaps by revising the interfaces in RegERMs.jl. Just one thing, which is probably too early, but sooner or later there will be a large zoo of solvers, and at some point it might be useful to have some benchmarking to derive default choices depending on the number of examples, dimensions, sparsity, ...
Thanks @BigCrunsh. Let's keep the high-level discussions (those that affect the reorganization of packages) here. Detailed API design of regression problems should go to JuliaStats/RegERMs.jl#3, as @BigCrunsh suggested.
A general ensemble package would be great to have under the JuliaStats umbrella. @svs14 has done a lot of work on the Orchestra.jl package, which provides heterogeneous ensemble methods and has its own API. I don't know the details, but it might be a good starting place if the API can be made consistent with …
I have OnlineLearning.jl, which fits GLMs (linear regression, logistic regression, and quantile regression for now), optionally with L1 and/or L2 regularization, using SGD. Standard SGD and some variants (ADAGRAD, ADADELTA, and averaged SGD) are implemented. I also started on linear SVMs, but the implementation is not done. I'll keep an eye on JuliaStats/RegERMs.jl#3 and can update the API when that's more fleshed out.
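As a point of reference for the discussion, here is a generic sketch of the kind of SGD update such a package performs for an L2-regularized linear model; it is not OnlineLearning.jl's actual code or API.

```julia
using LinearAlgebra

# One pass of plain SGD over (X, y); `gradloss(u, y)` is the derivative of the
# per-sample loss with respect to the linear predictor u = dot(x, beta).
function sgd!(beta, X, y, gradloss; eta = 0.01, lambda = 0.0)
    for i in 1:size(X, 1)
        x = X[i, :]
        u = dot(x, beta)
        g = gradloss(u, y[i]) .* x .+ lambda .* beta   # loss gradient + L2 term
        beta .-= eta .* g
    end
    return beta
end

# Example: logistic regression with labels y in {-1, +1},
# loss(u, y) = log(1 + exp(-y * u)), whose derivative in u is:
logistic_grad(u, y) = -y / (1 + exp(y * u))
```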
@lendle: Feel free to do that in that framework 😉
I'd be happy to get a clean version of the newly proposed L0 EM algorithm into the proper format once the regularized regression design/interfaces have been set. For a spike on the L0 EM algorithm see: https://github.com/robertfeldt/FeldtLib.jl/blob/master/spikes/L0EM_regularized_regression.jl cc: @lindahua
What happened to this project? Are there any new developments? The idea is really great.
Check this out https://github.com/Evizero/SupervisedLearning.jl |
Regression (e.g. linear regression, logistic regression, Poisson regression, etc.) is a very important topic in machine learning. Many problems can be formulated in the form of (regularized) regression.

Regression is closely related to generalized linear models. A major portion of regression problems can be considered as estimation of generalized linear models (GLMs). In other words, estimating a GLM can be cast as a regression problem where the loss function is the negative log-likelihood.
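A small worked example of this point, under the usual GLM assumptions: for logistic and Poisson regression, the per-observation negative log-likelihood is exactly a loss function of the linear predictor u, so maximum likelihood estimation is empirical risk minimization with that loss.

```julia
# Logistic regression, y in {0, 1}, linear predictor u = dot(x, beta):
#   -log p(y | u) = log(1 + exp(u)) - y * u
nll_logistic(u, y) = log1p(exp(u)) - y * u

# Poisson regression with log link, mean exp(u):
#   -log p(y | u) = exp(u) - y * u + log(y!)   (last term does not depend on beta)
nll_poisson(u, y) = exp(u) - y * u
```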
There have been a few Julia packages in this domain. Just to name a few, we already have:

- GLM.jl
- GLMNet.jl
- Regression.jl
- RegERMs.jl
- SGD.jl
- LARS.jl

and probably some others that I am missing.
The functionality provided by these packages overlaps substantially, yet the packages do not work with each other.
Unifying these efforts towards a great framework for regression/GLM would definitely make Julia a much more appealing option for machine learning. I am opening this thread to initiate the discussion.
Below is a proposal about how we may proceed:
Front-end and back-end should be decoupled. To me, a regression framework consists of four basic aspects: …

The front-end modules should provide functions to help users turn their data and domain-specific knowledge into optimization problems, while the back-end should focus on solving the given problems. These two parts require different skills (the former is mainly concerned with user API design, while the latter is mainly about efficient optimization algorithms).
I propose the following way to reorganize the packages:

- RegressionBase.jl: provides types to represent regression problems and models. This package should also provide other facilities to express a regression problem, e.g. loss functions, regularizers, etc. It can also provide some classical/basic algorithms to solve a regression problem. (This may more or less adopt what RegERM is doing; a rough sketch of what such an interface could look like is given after this list.)
- GLMNet.jl (depends on RegressionBase.jl): wraps the external glmnet library to provide efficient solvers for certain regression problems. The part that depends on DataFrames should be separated out.
- SGD.jl, LARS.jl, etc. should also depend on RegressionBase.jl and provide different kinds of solvers. Note that GLMNet, SGD, and LARS should accept the same problem types and have consistent interfaces; they just implement different algorithms.
- Regression.jl: a meta package that includes RegressionBase.jl and a curated set of solver packages (e.g. GLMNet, SGD, etc.).
- GLMBase.jl (depends on Distributions.jl and Regression.jl): provides types to represent generalized linear models and relevant machinery such as link functions. This package can take advantage of Regression.jl for model estimation.
- GLM.jl (depends on GLMBase.jl and DataFrames.jl): provides a high-level UI for end users to perform data analysis. (The user interface can remain the same as it is.)

Your suggestions and opinions are really appreciated.
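To make the proposed split concrete, here is a minimal, purely hypothetical sketch of the kind of shared interface a package like RegressionBase.jl could define; none of these types or functions exist yet, and the names are placeholders for discussion.

```julia
using LinearAlgebra

# Hypothetical sketch -- for discussion only.
abstract type Loss end
abstract type Regularizer end
abstract type RegressionSolver end

struct SquaredLoss <: Loss end
struct L2Reg <: Regularizer
    lambda::Float64
end

# The standardized problem type that all solver packages (glmnet wrappers, SGD,
# LARS, ...) would accept: purely numerical, no DataFrames involved.
struct RegressionProblem{L<:Loss,R<:Regularizer}
    X::Matrix{Float64}
    y::Vector{Float64}
    loss::L
    reg::R
end

# Each solver package implements `solve` for the problem types it can handle.
struct CholeskySolver <: RegressionSolver end

function solve(p::RegressionProblem{SquaredLoss,L2Reg}, ::CholeskySolver)
    X, y, lambda = p.X, p.y, p.reg.lambda
    return (X'X + lambda * I) \ (X'y)   # ridge regression in closed form
end
```

Higher-level packages (GLMBase, GLM) would then only construct problems and pick solvers, never touching the numerical code.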
The first question that we need to answer is whether we should introduce RegressionBase.jl (which would borrow part of the stuff from RegERM). If there's no objection, I can set up this package and then we can discuss interface designs from there. We can then proceed with the adjustment of other packages.
cc: @dmbates @johnmyleswhite @simonster @BigCrunsh @Scidom @StefanKarpinski