-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-1212, Part II] Support sparse data in MLlib #245
Conversation
mark createModel protected mark predictPoint protected
update some ml algorithms to use Vector (cont.)
Merged build triggered. |
Merged build started. |
Merged build finished. |
One or more automated tests failed |
Build triggered. |
Build started. |
Build finished. |
One or more automated tests failed |
def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) | ||
|
||
/** | ||
* Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100 characters limit.
val margin: Double = -1.0 * brzWeights.dot(brzData) | ||
val gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label | ||
|
||
brzAxpy(gradientMultiplier, brzData, gradientAddTo.toBreeze) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are too many toBreeze
s when using the Vector
trait. How about using implicit
to eliminate them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
breeze uses implicits a lot. Scala do not look for second degree implicit conversions.
Build triggered. One or more automated tests failed |
Build started. One or more automated tests failed |
Build finished. One or more automated tests failed |
One or more automated tests failed |
@yinxusen This is WIP. I will let you know when it is ready for review. |
Jenkins, retest this please. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
* Binary label parser, which outputs 1.0 (positive) if the value is greater than 0.5, | ||
* or 0.0 (negative) otherwise. | ||
*/ | ||
val binaryLabelParser: String => Double = label => if (label.toDouble > 0.5) 1.0 else 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using String => Double
as the type of these, we should create a trait LabelParser
and have some implementations of it so that it becomes friendly to call from Java.
@mengxr I made some comments on the libSVM stuff but I'd also be okay fixing them alter if that's more convenient for you, because a lot of other MLlib patches depend on this. |
I've merged this in now, Xiangrui sent me an IM that he will handle the libSVM input stuff later. |
Thanks @mateiz ! |
Right now, there is still use of jblas in addition to breeze in the code base and in APIs. In theory there are no more API changes before 1.0 now. This seems pretty important to get sorted before the API freezes though, since it theoretically can't change before 2.0 then. Or is this like a lot of stuff in MLlib going to be marked as not-stable-yet? |
@srowen I don't think jblas's DoubleMatrix is exposed in public APIs. But if there are, yes, we should clean them before v1.0. We will mark some APIs developer/experimental in the v1.0 release. |
@mengxr On closer inspection, yes almost all the uses in an API are actually not public, or in a test method. There are a few places where they turn up, I think, like I would wholeheartedly agree with you reserving the right to change all of these APIs before 2.x by marking them experimental, which makes it a non-issue. I am almost certain some of the good stuff coming over the next year will want at least a few API changes. |
This is a patch to address @mateiz 's comment in #245 MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I make a trait for LabelParser and provide two implementations: binary and multiclass. Author: Xiangrui Meng <meng@databricks.com> Closes #345 from mengxr/label-parser and squashes the following commits: ac44409 [Xiangrui Meng] use singleton objects for label parsers 3b1a7c6 [Xiangrui Meng] add tests for label parsers c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton 11c94e0 [Xiangrui Meng] add return types 7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
In PR apache#117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes: 1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`. 2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure. 3. Mark 'createModel' and 'predictPoint' protected because they are not for end users. 4. Add libSVMFile to MLContext. 5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`). 6. Gradient computation no longer creates temp vectors. 7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training. TODO: 1. ~~Use axpy when possible.~~ 2. ~~Optimize Naive Bayes.~~ Author: Xiangrui Meng <meng@databricks.com> Closes apache#245 from mengxr/vector and squashes the following commits: eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector] 11999c7 [Xiangrui Meng] Merge branch 'master' into vector f7da54b [Xiangrui Meng] add minSplits to libSVMFile da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning 493f26f [Xiangrui Meng] Merge branch 'master' into vector 7c1bc01 [Xiangrui Meng] add a TODO to NB b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM 4addc50 [Xiangrui Meng] merge master 4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests d088552 [Xiangrui Meng] use static constructor for MLContext 6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically 3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data 0f8759b [Xiangrui Meng] minor updates to NB b11659c [Xiangrui Meng] style update 78c4671 [Xiangrui Meng] add libSVMFile to MLContext f0fe616 [Xiangrui Meng] add a test for sparse linear regression 44733e1 [Xiangrui Meng] use in-place gradient computation e981396 [Xiangrui Meng] use axpy in Updater db808a1 [Xiangrui Meng] update JavaLR example befa592 [Xiangrui Meng] passed scala/java tests 75c83a4 [Xiangrui Meng] passed test compile 1859701 [Xiangrui Meng] passed compile 834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.) 135ab72 [Xiangrui Meng] merge glm 0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used 3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
This is a patch to address @mateiz 's comment in apache#245 MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I make a trait for LabelParser and provide two implementations: binary and multiclass. Author: Xiangrui Meng <meng@databricks.com> Closes apache#345 from mengxr/label-parser and squashes the following commits: ac44409 [Xiangrui Meng] use singleton objects for label parsers 3b1a7c6 [Xiangrui Meng] add tests for label parsers c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton 11c94e0 [Xiangrui Meng] add return types 7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
Add Rd files for sampleByKey() of [SPARKR-163] and sumRDD() of [SPARKR-92]
This PR pulls in recent changes in SparkR-pkg, including cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField. Author: cafreeman <cfreeman@alteryx.com> Author: Davies Liu <davies@databricks.com> Author: Zongheng Yang <zongheng.y@gmail.com> Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com> Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Sun Rui <rui.sun@intel.com> Closes #5436 from davies/R3 and squashes the following commits: c2b09be [Davies Liu] SQLTypes -> schema a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3 168b7fe [Davies Liu] sort generics b1fe460 [Davies Liu] fix conflict in README.md e74c04e [Davies Liu] fix schema.R 4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5 41f8184 [Davies Liu] rm man ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3 1bdcb63 [Zongheng Yang] Updates to README.md. 5a553e7 [cafreeman] Use object attribute instead of argument 71372d9 [cafreeman] Update docs and examples 8526d2e [cafreeman] Remove `tojson` functions 6ef5f2d [cafreeman] Fix spacing 7741d66 [cafreeman] Rename the SQL DataType function 141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream 9387402 [Davies Liu] fix style 40199eb [Shivaram Venkataraman] Move except into sorted position 07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD. 7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey ed66c81 [cafreeman] Update `subtract` to work with `generics.R` f3ba785 [cafreeman] Fixed duplicate export 275deb4 [cafreeman] Update `NAMESPACE` and tests 1a3b63d [cafreeman] new version of `CreateDF` 836c4bf [cafreeman] Update `createDataFrame` and `toDF` be5d5c1 [cafreeman] refactor schema functions 40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5 20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4 c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master 136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats cd66603 [cafreeman] new line at EOF 8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep 7dd81b7 [cafreeman] Documentation 0e2a94f [cafreeman] Define functions for schema and fields
Changing the autocommit behaviour of the JDBC connection tests has brought a number of issues. To fix that, went through all the tests and cleaned it up to use autocommit=true everywhere. Author: Juliusz Sompolski <julek@databricks.com> Closes apache#245 from juliuszsompolski/SC-5621-fixup.
In PR #117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other
Array[Double]
usage byVector
in generalized linear models (GLMs) and Naive Bayes. Major changes:LabeledPoint
becomesLabeledPoint(Double, Vector)
.RDD[Array[Double]]
now acceptRDD[Vector]
. We cannot support both in an elegant way because of type erasure.NaiveBayesModel
).TODO:
Use axpy when possible.Optimize Naive Bayes.