[MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. #18035

yanboliang · 2017-05-19T11:41:47Z

What changes were proposed in this pull request?

Join the coefficients with the intercept in the SparkR linear SVM summary.

How was this patch tested?

Existing tests.

yanboliang

@felixcheung I'd propose renaming spark.svmLinear to spark.svm, since svm is widely used by R users through the e1071 package, and if we support non-linear models in the future (although with low probability) we can reuse this SparkR API. It would be like spark.gbt, which can call two ML algorithms through a single SparkR API. What do you think?

yanboliang · 2017-05-19T12:03:33Z

mllib/src/main/scala/org/apache/spark/ml/r/LinearSVCWrapper.scala

@@ -38,9 +38,17 @@ private[r] class LinearSVCWrapper private (
  private val svcModel: LinearSVCModel =
    pipeline.stages(1).asInstanceOf[LinearSVCModel]

-  lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
+  lazy val rFeatures: Array[String] = if (svcModel.getFitIntercept) {
+    Array("(Intercept)") ++ features


In R we stack the intercept with the other feature names; see spark.glm, spark.logit, and spark.survreg for reference.
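As a rough illustration of the stacking described above, here is a minimal stand-alone sketch. It does not depend on Spark and is not the wrapper's actual code: the object name `InterceptStacking`, the helper signatures, and the sample feature names and values are all assumptions made for this example. Only the `"(Intercept)"` label and the `Array("(Intercept)") ++ features` pattern come from the diff.

```scala
// Hypothetical stand-alone sketch; names and values are illustrative only.
object InterceptStacking {

  // Prepend "(Intercept)" to the feature names when an intercept was fitted.
  def rFeatures(features: Array[String], fitIntercept: Boolean): Array[String] =
    if (fitIntercept) Array("(Intercept)") ++ features else features

  // Prepend the intercept value to the coefficients in the same position,
  // so that names and values stay aligned index by index.
  def rCoefficients(
      coefficients: Array[Double],
      intercept: Double,
      fitIntercept: Boolean): Array[Double] =
    if (fitIntercept) Array(intercept) ++ coefficients else coefficients

  def main(args: Array[String]): Unit = {
    // Made-up inputs, not results from a real model.
    val names = rFeatures(Array("x1", "x2"), fitIntercept = true)
    val values = rCoefficients(Array(0.5, -1.25), intercept = 2.0, fitIntercept = true)
    // Print one "name <tab> value" row per entry, as a single-column summary would.
    names.zip(values).foreach { case (n, v) => println(s"$n\t$v") }
  }
}
```

Keeping the two arrays the same length and in the same order is what lets the R side bind them into a single-column coefficient matrix with the names as row labels.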

yanboliang · 2017-05-19T12:04:53Z

R/pkg/R/mllib_classification.R

            numClasses <- callJMethod(jobj, "numClasses")
            numFeatures <- callJMethod(jobj, "numFeatures")
-            if (nCol == 1) {


ML LinearSVC only supports binary classification and will not support multiclass classification in the near future, so we can simplify here.

Why not label, intercept? I think it is common in R to include what goes into the model (although in many cases the model summary just includes the formula).

@felixcheung The change here makes the coefficients matrix have only one column, named Estimate. I suspect the original code was modeled on spark.logit, which supports multiclass classification, so it has multiple columns and each column is named after the corresponding label. For binary classification, the coefficients are not bound to any label, so we use Estimate as the column name, as R does. LinearSVC will not support multiclass classification in the future, so I simplified it here.
The following are the summary outputs for binomial and multinomial logistic regression in SparkR:
Binomial logistic regression model:

Multinomial logistic regression model:

SparkQA · 2017-05-19T12:44:47Z

Test build #77094 has finished for PR 18035 at commit 1ed3ba0.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-19T17:13:17Z

Test build #77097 has finished for PR 18035 at commit 39317c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-19T18:31:29Z

Are you targeting these changes for 2.2.0? I ask because we are changing the API and the returned results here.

felixcheung · 2017-05-19T18:34:37Z

> I'd propose to rename spark.svmLinear to spark.svm

I see your point, but svmLinear is also a popular name in the caret package. My concern is that we start out too general and end up having to find a strange name later because svm is taken, or run into parameter name conflicts and so on.

Also, from various threads it seems really unlikely that we will implement a non-linear form of SVM, as you said :)

felixcheung · 2017-05-19T18:27:22Z

R/pkg/R/mllib_classification.R

@@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu
            new("LinearSVCModel", jobj = jobj)
          })

-#  Predicted values based on an LinearSVCModel model
+#  Predicted values based on a linear SVM model.


I think these are intentional; we also have # Predicted values based on an LogisticRegressionModel model.
They are prefixed by # and do not appear in the generated docs; they are only for developers.

There are a couple of these starting with #.

felixcheung · 2017-05-19T18:29:49Z

R/pkg/R/mllib_classification.R

            numClasses <- callJMethod(jobj, "numClasses")
            numFeatures <- callJMethod(jobj, "numFeatures")
-            if (nCol == 1) {


Why not label, intercept? I think it is common in R to include what goes into the model (although in many cases the model summary just includes the formula).

yanboliang · 2017-05-22T10:33:39Z

@felixcheung Thanks for your comments. I'm targeting this for 2.2, since it may be a breaking change. With respect to the name issue, I still prefer renaming to spark.svm. Many R packages implement the same functions, but we should follow the most authoritative or most frequently used ones. For SVM, I think e1071 is the one we should refer to: if you check the Google search results for "r svm", every item on the first page is e1071::svm. As for your concern about potential name conflicts, I think we can prevent them by providing parameters such as classification/regression, kernel function, loss function, etc. However, I'm still open to hearing your thoughts.

SparkQA · 2017-05-22T11:02:48Z

Test build #77179 has finished for PR 18035 at commit 207d674.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-22T11:06:58Z

Test build #77181 has finished for PR 18035 at commit 3c14d15.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-05-22T17:34:25Z

@yanboliang I appreciate you discussing this with me; it is important to sort this out now. Normally I wouldn't mind either way, but in this case I feel fairly strongly about not making this name change, for two main reasons:

First, the work has been done by a contributor. I feel we are at some level undoing his work by making this change now, after it is merged, instead of providing valuable, timely feedback during the review process.
Second, being concise is important. I understand the popularity of the search term. Aside from future supportability, naming conflicts, etc., I think we chose to name it LinearSVC in Scala because that concisely describes what it does and supports. We could have named it SVM, but we didn't, so I'm not sure we should name it svm for R. We also didn't call boosted trees gbm, which is hugely popular, but instead gbt. Also, as you are aware, we get a lot of feedback and requests to add new ML algorithm support in Spark. I think it is very important to set expectations here, so that people do not search for and find svm only to discover it doesn't do what they think it should. Unless you think we will go beyond linear and support polynomial kernels, etc., at some point? But I think you agree that is rather unlikely.

Anyway, what do you think?

bdwyer2 · 2017-05-22T23:03:42Z

R/pkg/R/mllib_classification.R

 #'
-#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package
+#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package.
+#' Currently only supports binary classification model with linear kernal.


Do you mean kernel instead of kernal?

yanboliang · 2017-05-23T01:59:59Z

@felixcheung On the name issue, I'm OK with keeping it as it is; thanks for the clarification. What about the other changes in this PR?

SparkQA · 2017-05-23T03:02:36Z

Test build #77216 has finished for PR 18035 at commit 5d9afe0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

thanks! LGTM

felixcheung · 2017-05-23T06:35:24Z

Let's ignore the intermittent AppVeyor error, since it passed before the simple typo changes.

yanboliang · 2017-05-23T08:15:40Z

Merged into master and branch-2.2. Thanks for reviewing.

…ar SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18035 from yanboliang/svm-r. (cherry picked from commit ad09e4c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

…ar SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#18035 from yanboliang/svm-r.

Code reorg and cleanup for SparkR linear SVM.

1ed3ba0

Update test case.

39317c1

yanboliang changed the title ~~[MINOR][SPARKR][ML] Fix coefficients issue and code cleanup for SparkR linear SVM.~~ [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. May 19, 2017

felixcheung reviewed May 19, 2017

Update docs.

3c14d15

yanboliang force-pushed the svm-r branch from 207d674 to 3c14d15 Compare May 22, 2017 10:00

bdwyer2 reviewed May 22, 2017

Fix typo.

5d9afe0

felixcheung approved these changes May 23, 2017

asfgit closed this in ad09e4c May 23, 2017

yanboliang deleted the svm-r branch May 23, 2017 08:19
