Machine Learning Roadmap #11
Comments
I agree with all of this. I've got a lot of prototype SGD code already. I like the idea of meta-packages. If we're going to have Classification.jl, maybe Regression.jl should be a similar meta-package?
I'm not an expert in this area, but I've been interested for a while and am willing to help.
@johnmyleswhite: Will you please move Clustering, SVM, and DimensionalityReduction over to JuliaStats? These are very basic for machine learning. I recently got some time to work on them. For regression, when there are several quite different techniques implemented, it will make sense to create a meta-package.
I transferred Clustering and SVM over. I'm going to announce that I'm moving DimensionalityReduction over, then we can go ahead and make the move tomorrow.
Also, I think it is important to separate packages that provide core algorithms from those integrated with DataFrames. We may consider providing tools so that DataFrames work nicely with machine learning algorithms. However, I think core machine learning packages should not depend on DataFrames -- which are not used as frequently in machine learning.
I agree completely. I would very strongly prefer that we implement integration with DataFrames in the following way throughout all packages:
This makes it easy to work with pure numerical data without any dependencies on DataFrames, while making it easy for people working with DataFrames to take advantage of the core ML algorithms by efficiently translating DataFrames into matrices.
The only hiccup with what I just described is deciding where the interfaces that mix DataFrames + ML should live. Arguably there should be one big package that does all of this by wrapping the other ML packages with a DataFrames interface.
@johnmyleswhite are there issues with providing these in DataFrames.jl?
Providing what?
Sorry, I seem to have misread part of your comments. I agree with your suggestions.
Just that I am not sure whether we really need another meta-package to couple DataFrames and ML, if the tools provided in DataFrames are convenient enough.
You're right: we could encourage users to explicitly call the DataFrame -> Matrix conversion routines. That would simplify things considerably.
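For concreteness, the explicit conversion could look something like this (a sketch using the `ModelFrame`/`ModelMatrix` machinery that comes up later in this thread, with formula syntax of that era; the final `fit` call is a hypothetical core entry point):

```julia
using DataFrames

df = DataFrame(y = [1.0, 2.0, 3.0], x1 = [0.5, 1.5, 2.5], x2 = [1.0, 0.0, 1.0])

# Explicit DataFrame -> Matrix conversion before touching any core ML code.
mf = ModelFrame(y ~ x1 + x2, df)   # interpret the formula against the DataFrame
X  = ModelMatrix(mf).m             # dense Float64 design matrix
y  = model_response(mf)            # numeric response vector

# A core ML package only ever sees X and y, never the DataFrame:
# fit(SomeModel, X, y)
```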
The two main difficulties with this approach:
For GLM, my thinking is to have two packages:
So this is basically your idea of having a higher-level package that relies on core ML packages + DataFrames to provide useful tools for analyzing data frames.
On my phone right now, but weren't there some CART/Random Forest packages, if not in METADATA then at least mentioned on the mailing list?
Decision trees, by their nature, can work on heterogeneous data (each observation may be composed of variables of different kinds). For such methods, an implementation based on DataFrames makes sense. There do exist a large number of machine learning methods (e.g. PCA, SVM, LASSO, K-means, etc.) that are designed to work with real vectors/matrices. Heterogeneous data need to be converted to numerical arrays before such methods can be applied. Packages that provide such methodologies are encouraged to be independent of DataFrames.
You're right: there's a DecisionTree package. To me, working with factors is actually a really strong argument for pushing a representation of categorical data into an earlier layer of our infrastructure like StatsBase. But we're actively debating ways to do this in JuliaStats/DataArrays.jl/issues/73. If we could avoid some of the issues @simonster raised in his issue, I think it would be a big help to move the representation of categorical data closer to Julia's Base. Also worth keeping in mind that nominal data is often worked with using dummy variables, which do fit in the matrix representation. If DecisionTree.jl needs DataFrames.jl, I fully agree with Dahua: that's not a problem. But if it only needs a simpler abstraction, pushing things towards that simpler abstraction seems desirable.
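As a concrete illustration of the dummy-variable point, converting a nominal column into indicator columns is all it takes to make it matrix-friendly (a minimal sketch; `dummy_encode` is a made-up name, not any package's API):

```julia
# Turn a categorical vector into a 0/1 indicator matrix so that
# matrix-based methods (PCA, SVM, ...) can consume it.
function dummy_encode(v::AbstractVector)
    levels = unique(v)
    index  = Dict(l => j for (j, l) in enumerate(levels))
    X = zeros(length(v), length(levels))
    for (i, x) in enumerate(v)
        X[i, index[x]] = 1.0
    end
    return X, levels
end

X, levels = dummy_encode(["red", "blue", "red", "green"])
# X is 4x3: row 1 has a 1.0 in the "red" column, row 2 in "blue", etc.
```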
There are some cases where … As far as the model fitting interface for DataFrames goes, it would be cool if we could get this to work on top of `fit(::Type{MyModelType}, X::AbstractMatrix, y::AbstractVector, args...)`, and DataFrames could implement:

```julia
function fit{T<:StatisticalModel}(::Type{T}, f::Formula, df::DataFrame, args...)
    mf = ModelFrame(f, df)
    DFStatisticalModel(mf, fit(T, ModelMatrix(mf).m, model_response(mf), args...))
end
```

or similar.
This sounds a lot like the discussion we had in JuliaLinearAlgebra/IterativeSolvers.jl#2 a little while ago.
@simonster GLM can use a sparse model matrix, but I think you'll have to define your own subtype of …
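For reference, a sparse design matrix in Julia is just the standard sparse array type; for example, a sparse one-hot design can be built directly (nothing GLM-specific here, just the data structure under discussion):

```julia
using SparseArrays

# Five observations of a 3-level factor, with one 1.0 per row in the
# column of that observation's level.
rows = collect(1:5)
cols = [1, 2, 1, 3, 2]
X = sparse(rows, cols, ones(5))   # 5x3 SparseMatrixCSC
```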
It would be great if, as part of the roadmap, we could also plan to put some large datasets in place, so that the community can work on optimizing performance and designing APIs accordingly. Having RDatasets is so useful, and something that makes large public datasets easily available for people to work with will greatly help this effort.
@ViralBShah Good point. Datasets are important. I think we already have an MNIST package, and we can definitely have more. We just need to be cautious about the licenses that come with the datasets.
There are surprisingly few large data sets that are publicly available. I'd guess that the easiest way to generate "large" data is to do n-grams on something like the 20 Newsgroups data set. Classifying one of the newsgroups against all the others is a simple enough binary classification problem, and we can scale it to arbitrarily high size (in terms of features) by working with 2-grams, 3-grams, etc. Other useful examples might be processing the old Audioscrobbler data (http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) or something similar.
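As a rough sketch of that n-gram blow-up (the helper below is made up for illustration, not an existing package's API):

```julia
# Expand a tokenized document into word n-grams; every distinct n-gram
# becomes a feature, so the feature space grows as n does.
ngrams(tokens::Vector{<:AbstractString}, n::Int) =
    [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]

tokens = String.(split("the quick brown fox jumps"))
ngrams(tokens, 2)   # ["the quick", "quick brown", "brown fox", "fox jumps"]
```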
We also have CommonCrawl.jl. The point about the datasets is not so much to distribute them as Julia packages, but to have easy APIs to access them, load them, and work with them. Often, I find that the pain of figuring out all the plumbing is enough to discourage people, and making the plumbing easy could get a lot more people to contribute.
Perhaps not too big, but there are also the Netflix and MovieLens datasets - which could be made easier to access.
The Netflix data set is illegal to distribute.
This adds a method for fitting a GLM by explicitly specifying the design matrix and response vectors. The resulting GlmMod object has empty ModelFrame and formula fields, and I've changed the few functions that reference these fields to first check if they are defined. Eventually it is probably a good idea to follow @lindahua's suggestion from JuliaStats/Roadmap.jl#11 and split out functionality that depends on DataFrames into a separate package, but most of these changes will be necessary for that as well. I have also added a method for fitting a GLM on a new response vector using the same design matrix. Closes JuliaStats#54
It sounds like there probably would be enough interest for a dedicated JuliaDeepLearning organization. It would have some requirements for interoperating with classical subcomponents that exist in JuliaStats. If there were a Julia equivalent of scikit-learn it would probably go in JuliaStats, but a Julia equivalent of Theano or CGT could go in a new JuliaDeepLearning org. At a bare minimum, start by moving Mocha there, and figure out the best next steps from there?
I am assuming you - @pluskid - should have a pretty good picture of the state of the Julia deep learning community, so my guess is you probably have the most educated idea of what needs to be done to move it forward. We all know that deep learning is pretty much the most active subfield of ML right now, so I think it would be a good investment to make the Julia part more official.

The question is whether there is enough of a community to maintain the packages if there is no explicit owner any more. MLBase is a good example of a package that I don't touch (even though it would make sense to add some code to it), simply because it takes a week to get a version tagging request replied to. Basically, I don't think organizations are automatically a good idea; especially not if the author is actively maintaining his/her packages.

As a side note, I agree with @tbreloff and think a general JuliaLearn/JuliaML org would make more sense than moving the deep learning packages into JuliaStats; especially given the MLBase situation. To be frank, I don't think the JuliaStats community has the resources at the moment to maintain ML packages. I don't want to step on anyone's toes here. All the JuliaStats members that I had contact with were very nice and very helpful. I just think that they are busy with other things (such as Nullable Arrays) these days and don't have enough time to spend on Machine Learning.
Hi @pluskid, thanks for starting this discussion. I'm currently working on a DL library in Julia which closely follows the design of Torch7, but makes use of Julia's features. It's not on GitHub yet and progress is unfortunately slow because it's a side project; my research is (still) in Theano. A friend of mine is doing a similar thing, so there definitely is interest. I also believe there's interest in the DL community at large, because Theano is suboptimal for RNNs and people generally don't like Lua. I agree that the current state of GPU array operations makes this task more painful than it ought to be, and a lot of work on this could probably be shared across DL packages. PS: CGT looks promising, but it is not a successor of Theano.
I have OnlineAI.jl, which extends OnlineStats.jl into neural nets and reservoir computing. I don't think it's appropriate for inclusion in a new organization, but there are pieces which overlap with other packages, and I think it would be great to have a unifying initiative for an MLBase that can support many different approaches to learning from data. My experience with many learning frameworks is that they tend to focus heavily on image classification and other similar (static) problems. I would really like to see something like the OnlineStats interface, which allows for both static (image classification, deep learning, etc) and dynamic (video analysis, time series, reinforcement learning, etc) modeling, allowing for analyzing both large distributed datasets and streaming data. Some of this exists already, and I hope we can create a best of breed base package to supply overlapping functionality.
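To make the static-vs-streaming distinction concrete, here is a minimal sketch of an online update in the spirit described above (a toy type; this mirrors the updating style of OnlineStats rather than reproducing its exact API):

```julia
# A running mean updated one observation at a time, so the same model
# can consume a fixed dataset or an endless stream without storing history.
mutable struct RunningMean
    n::Int
    mu::Float64
end
RunningMean() = RunningMean(0, 0.0)

function fit!(m::RunningMean, x::Real)
    m.n += 1
    m.mu += (x - m.mu) / m.n   # incremental mean update
    return m
end

m = RunningMean()
for x in [1.0, 2.0, 3.0]
    fit!(m, x)
end
m.mu   # 2.0
```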
Just wanted to make sure you were aware of another SVM package in Julia called SALSA.jl: https://github.com/jumutc/SALSA.jl @jumutc
I agree. Actually, I think MLBase is more of an "MLTools" in the sense that it provides design-agnostic functionality. We should maybe think of collaborating on a common MLBase or MLAbstractions that does impose some design decisions, such as function names. I know that I will sooner or later reach a point where I need to factor out a common base package for my stuff. I don't know much about OnlineStats.jl, but I was thinking of something a little more high-level and really lightweight that evolves as we go along. Not everything falls under online learning, and probably not everything can be boxed into the same kind of framework. Avoiding name collisions and settling on function names would be a good first step.
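The name-collision point can be made concrete with a small sketch (module and type names are hypothetical): a tiny base package owns the generic function names, and each ML package adds methods to those shared generics instead of defining its own:

```julia
# The shared base package defines the names but no methods:
module MLNames
    function fit end
    function predict end
end

# Package A extends the shared generic...
struct ModelA end
MLNames.fit(::Type{ModelA}, X, y) = ModelA()

# ...and so does package B. Loading both packages causes no collision,
# because there is only one `fit` function, now with two methods.
struct ModelB end
MLNames.fit(::Type{ModelB}, X, y) = ModelB()
```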
@pluskid If you create an organization, I'll be glad to join. Recently I added cuRAND.jl to JuliaGPU to support stochastic algorithms, and I am currently in the process of designing a common library for unified CPU/GPU array programming - something similar to Theano/Torch7 (we should probably start a separate discussion about it). So if you are looking for people ready to contribute, add me to the list.
I'd be interested. I am currently working on a Theano alternative model.
I think there is a lot of value in a consistent API, and I'm ready to put in some effort to make this roadmap a reality. For the last few weeks I've been working on a very similar process with Plots.jl... putting a complex-but-lightweight interface into the plotting world. I think the approach should be very similar for the ML community. I propose that we create an organization JuliaLearn, and that we create a repo LearnBase.jl which will be home to both the design discussions and an implementation of what I describe (or something similar):
This methodology has been (in my opinion) incredibly powerful for Plots.jl. I have a simple, flexible API which can still access functionality from very different underlying packages, while requiring no cooperation from existing package authors. It requires a little extra work up front to support a new backend package, but that is a much smaller effort than re-writing that package against a new interface, or starting a new package from scratch. A user can make an API call which initially calls a python-wrapped library, but is then later replaced by a better Julia implementation, with no change to their code. There are two really important advantages to the approach that I described:
I am willing to take the lead on this effort, if you'll let me. With a few 👍 I will form the org and get this started.
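For the record, the backend mechanism being proposed can be sketched in a few lines (all names are hypothetical; this mirrors the Plots.jl idea rather than any existing ML package):

```julia
# One user-facing verb that routes to whichever backend is active.
abstract type LearnBackend end
struct WrappedLibBackend <: LearnBackend end
struct PureJuliaBackend  <: LearnBackend end

const ACTIVE_BACKEND = Ref{LearnBackend}(WrappedLibBackend())
backend!(b::LearnBackend) = (ACTIVE_BACKEND[] = b)

learn(X, y) = _learn(ACTIVE_BACKEND[], X, y)
_learn(::WrappedLibBackend, X, y) = "fit via a wrapped external library"
_learn(::PureJuliaBackend,  X, y) = "fit via a pure-Julia implementation"

# User code is unchanged when a better Julia backend arrives:
learn(rand(10, 2), rand(10))     # routed to the wrapped library today
backend!(PureJuliaBackend())
learn(rand(10, 2), rand(10))     # same call, new implementation
```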
The first three points are already implemented. StatsBase defines StatisticalModel and RegressionModel types along with methods for them. DataFrames defines a fit method that takes a subtype of StatisticalModel, a formula, and a DataFrame, converts the DataFrame into a design matrix according to the formula, and calls the corresponding matrix-based fit method.

I'm not sure there's a need for separate organizations for statistics and machine learning. It may make sense to have a separate organization for deep learning, since it's substantially different from both stats and classical ML. But much of what is currently in JuliaStats could qualify as either statistics or ML. At least at this time I think it's better to do this work in a single organization.
To be clear, I see immense value in using as much of the current stats framework as is reasonable. I think StatsBase would be one of the few required dependencies, and things like …

As to whether LearnBase.jl should live in JuliaStats or JuliaLearn (or some other name the community agrees on), I can see both sides. Pros for JuliaStats:
Pros for JuliaLearn:
I could be convinced either way...
@Evizero Thank you for the comments! I'm starting to agree with you about the concerns of hosting projects under organizations. But I'm very glad to see that there are quite a few people interested in, or already working on, a Theano/Torch-like system in Julia. We might consider creating an umbrella organization to host wiki pages pointing to those related projects and maybe host general discussions about deep learning libraries in Julia. May I ask, @lucasb-eyer, @dfdx, @denizyuret: when your new projects start to take shape, could you come back and comment here? I think at that stage we could consider creating such a repo. Having a wiki page summarizing the different possible choices of deep learning libraries in Julia would be at least very helpful for new users.
@simonster I don't think that is true. I have been playing with the idea of using …

@tbreloff I like the way you think, but I would really like to keep it much simpler and more realistic for now. I wouldn't go the Plots.jl route with the backends. For now we should just dictate the interface and class hierarchy; otherwise it is going to get ugly at one point or another. There should just be enough stuff in there that it would be reasonable to expect new ML packages to follow it. I think the two main goals should be:
I would also like to move my class-encoding code there (it builds on MLBase's labelmap). Since it influences both our current efforts, I'd suggest we just establish the package and get as many ML people in the loop as we can so that people can provide feedback. Since this package is a group effort, it would make sense to me if it lived in an org. We can always move the package to JuliaStats later if it makes more sense, but for now let's just make some progress while we're motivated.

EDIT: And to address the potential question of why not put this into MLBase: it doesn't even define the function name …
If you don't define … I'm sympathetic to the concern that MLBase is not being sufficiently actively maintained, but it also looks like that PR failed its own tests.
I get that, but that doesn't sound like a good solution
Yes, but I think this does need to be a group outcome. Since it is a problem that some people (which includes me) are currently actively concerned with, I think it is a good time to brainstorm about this
It's the not-even-replied-to part that bugs me, in the sense that anything non-trivial gets no reaction. I don't blame anyone who loses interest in contributing to Julia (or just a specific package) if no one even takes the time to acknowledge the attempted contribution. I am not pointing fingers here. It's no one's fault. In fact, it's pretty cool that MLBase even exists to begin with. I think the StatsBase community is doing a tremendous job. But I do think it is a problem that needs to be addressed. I just think that, given that a few people are currently very interested in actively working on and improving Julia's ML aspects, we should talk and address such problems that are crippling (for lack of a better word) to the progress of the ML ecosystem. But long story short, @tbreloff and I have started the discussion in LearnBase and we will try to code up a good solution. Anyone who is interested in the discussion or in providing feedback is very welcome
FWIW, I think the best way to move forward is to punt on the abstraction layer problem for now (since we don't all agree on it and reaching group consensus is always extremely difficult) -- and instead focus on just nailing certain specific models. Simon's done amazing work to get regularized linear regression working well in pure Julia. It would be great to have similarly nice tools for things like kNN. I suspect it's easier to get people to collaborate (or at least offer useful feedback to one another) if everyone is coordinating on a single purely technical problem (e.g. how to make nearest neighbor search fast) that doesn't require people to come to consensus about purely aesthetic considerations.
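To illustrate the kind of purely technical problem meant here: the naive baseline that a fast nearest-neighbor implementation has to beat is only a few lines (a sketch; real packages use KD-trees or similar structures to avoid the linear scan):

```julia
using LinearAlgebra

# Brute-force k-nearest neighbors: O(n) distances per query, then a sort.
function knn(X::AbstractMatrix, q::AbstractVector, k::Int)
    dists = [norm(X[:, j] - q) for j in 1:size(X, 2)]
    return sortperm(dists)[1:k]   # indices of the k closest columns
end

X = [0.0 1.0 5.0;
     0.0 1.0 5.0]                 # three 2-D points stored as columns
knn(X, [0.9, 1.1], 2)             # -> [2, 1]
```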
+1 John.
Hmm, maybe I have gone a little off track. I didn't know about the JuliaCon 2016 plans and I am very happy to hear about them (or at least the consideration). But the two points I stated before still make sense to me
I don't think settling on function names and defining them in a single place to avoid collisions is too far out there. I'm not talking about hypothetical issues here. These are things that currently concern me in my efforts for SVMs. Some coordination, even if it's just for exchanging ideas, is at least educational. I want to at least try and fail rather than not attempt at all.
Let's leave it at this for now: It looks like @tbreloff and I will put our heads together and try to coordinate at least both of our current ML efforts in a meaningful way. Hopefully the outcome will be useful to others as well.
Hi, I am interested in the current state of the ML ecosystem in Julia. By reading this (and other) issue(s) and having a look at the mentioned packages, it seems to me that:

- Hard work is going on in JuliaML, but Learn.jl will not be ready for use any time soon
- Orchestra.jl and SupervisedLearning.jl are not maintained (I assume Learn.jl will fill their place in the future)
- ScikitLearn.jl is maintained and works well but is not very actively improved/developed. (As it currently stands, it is more of an interface to the original code rather than a reimplementation in Julia.)

Are my impressions correct? If so, I assume people are not using Julia for day-to-day ML experiments like they use, for example, Python+scikit-learn? Or is there perhaps an ML 'workbench' package I missed?
Please check out https://github.com/denizyuret/Knet.jl
@ValdarT I think most people using Julia for "day-to-day ML" either use very specific packages for their use case (e.g. Boltzmann.jl, BayesNets.jl, GaussianProcesses.jl, Mocha.jl) or implement their own methods. I imagine the folks in the JuliaML organization are the most likely to come up with a good/cohesive Julia framework for all the different ML methods out there, but that's a pretty tough job.
Wow, JuliaML looks pretty great but also pretty ambitious. It has a much larger scope than scikit-learn and tensorflow combined... Is there any documentation on the "learn" package or a simple intro somewhere?
Discussion of the JuliaML organization should take place in their roadmap: https://github.com/JuliaML/Roadmap.jl/issues. The focus of JuliaStats is more classical statistics, as the more ML-oriented packages in this organization are unmaintained (e.g. SVM and RegERMs).
Locking and closing this issue so that discussion can continue in the right place: https://github.com/JuliaML/Roadmap.jl
Currently, the development of machine learning tools is spread across several different packages with little coordination. Consequently, some efforts are duplicated, while some important aspects remain lacking.
Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:
Generalized Linear Models
Current efforts: GLMNet, GLM, Regression
Support Vector Machines
Current efforts: SVM, LIBSVM
DimensionalityReduction
Current efforts: DimensionalityReduction
Non-negative Matrix Factorization
This may be categorized into dimensionality reduction. However, NNMF in itself has a plethora of methodologies, and thus deserves a separate package.
Classification
There are many techniques for classification. It may be useful to have multiple packages for the respective techniques (e.g. GLM, SVM, kNN), and a meta-package Classification.jl to incorporate them all.
Clustering
Current efforts: Clustering.jl
Many machine learning applications also require some supporting functionalities, such as performance evaluation, data preprocessing, etc. These can all go into MLBase.
Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)
This is a huge field in itself, and may be discussed separately.
cc: @johnmyleswhite @dmbates @simonster @ViralBShah
Edit
I created an NMF.jl package, which is dedicated to non-negative matrix factorization.
Also, a detailed plan for DimensionalityReduction is outlined here.