Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. #18035

Closed
wants to merge 4 commits into from

Conversation

yanboliang
Copy link
Contributor

@yanboliang yanboliang commented May 19, 2017

What changes were proposed in this pull request?

Joint coefficients with intercept for SparkR linear SVM summary.

How was this patch tested?

Existing tests.

Copy link
Contributor Author

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung I'd propose to rename spark.svmLinear to spark.svm, since svm is widely used for R users by e1071 package and we may support non linear model in the future (although with low probability), we can reuse this SparkR API. It would be like spark.gbt which can call two ML algorithms with the single SparkR API. What do you think of it?

@@ -38,9 +38,17 @@ private[r] class LinearSVCWrapper private (
private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]

lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
lazy val rFeatures: Array[String] = if (svcModel.getFitIntercept) {
Array("(Intercept)") ++ features
Copy link
Contributor Author

@yanboliang yanboliang May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In R we stack intercept with other feature names, you can refer spark.glm, spark.logit, spark.survreg.

numClasses <- callJMethod(jobj, "numClasses")
numFeatures <- callJMethod(jobj, "numFeatures")
if (nCol == 1) {
Copy link
Contributor Author

@yanboliang yanboliang May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML LinearSVC only supports binary classification, and will not support multiple classification in the near future, so we can simplify here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not label, intercept? i think they are common in R to include what goes into the model (although in many cases it just include the formula in the model summary)

Copy link
Contributor Author

@yanboliang yanboliang May 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung The change here is to make coefficients matrix has only one column named Estimate. I speculate the original code referred to spark.logit which supports multiple classification, so it should have multiple columns and each columns' name should be corresponding label. For binary classification, the coefficients are not bind to any labels, so we use Estimate as the column name like what R does. LinearSVC will not support multiple classification in the future, so I simplified it at here.
The followings are summary outputs for binomial and multinomial logistic regression in SparkR:
Binomial logistic regression model:
image
Multinomial logistic regression model:
image

@SparkQA
Copy link

SparkQA commented May 19, 2017

Test build #77094 has finished for PR 18035 at commit 1ed3ba0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [MINOR][SPARKR][ML] Fix coefficients issue and code cleanup for SparkR linear SVM. [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. May 19, 2017
@SparkQA
Copy link

SparkQA commented May 19, 2017

Test build #77097 has finished for PR 18035 at commit 39317c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

are you targeting these changes for 2.2.0 - since we are making API/return results changes here

@felixcheung
Copy link
Member

felixcheung commented May 19, 2017

I'd propose to rename spark.svmLinear to spark.svm

I see your point but svmLinear is also a popular name in the caret package. My concern would be coming at too general and end up having to find a strange name later because svm is taken, or having parameter name conflicts and so on.

Also from various threads it seems really really unlikely that we will implement non-linear form of svm like you said :)

@@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu
new("LinearSVCModel", jobj = jobj)
})

# Predicted values based on an LinearSVCModel model
# Predicted values based on a linear SVM model.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are intentional - we have # Predicted values based on an LogisticRegressionModel model
they are prefix by # and not in generated doc - only for developers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a couple of these starting with #

numClasses <- callJMethod(jobj, "numClasses")
numFeatures <- callJMethod(jobj, "numFeatures")
if (nCol == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not label, intercept? i think they are common in R to include what goes into the model (although in many cases it just include the formula in the model summary)

@yanboliang
Copy link
Contributor Author

@felixcheung Thanks for your comments. I'm targeting this for 2.2, in case for breaking change. With respect to the name issue, I'm still more prefer to rename to spark.svm. There are lots of R packages which implement same functions, but we should follow the most authoritative or frequently-used packages. For SVM, I think e1071 is the one we should refer, you can check the search result of r svm in google, all items in the first page are e1071::svm. To your concern about potential name conflicts, I think we can prevent it by providing parameters such as classification/regression, kernel function, loss function, etc. However, I'm still open to hear your thoughts.

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77179 has finished for PR 18035 at commit 207d674.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77181 has finished for PR 18035 at commit 3c14d15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

@yanboliang Appreciate discussing this matter with me, and it is important to sort this out now. Normally I wouldn't mind either way; but in this case I kinda feel strongly about not making this name change for 2 main reasons:

  • first, the work has been done by a contributor. I feel we are at some level undoing his work by making this change now after his work is merged, instead of providing valuable timely feedback during the review process
  • second, being concise is important. I understand the popularly of the search term. Aside from future supportability, naming conflicts etc, I think we choose to name it LinearSVC in Scala because it concisely describes what it does and supports. We could have named it SVM but we didn't? So I'm not sure we should name it svm for R. We also didn't call boosted tree gbm which is hugely popular, but instead gbt. Also, as you are aware, we get a lot of feedback and requests on adding new ML algorithm support in Spark. I think it is very important to set expectation in this case so that people does not search and find svm but it doesn't do what people thinks it should do? Unless you think we will go beyond linear and support polynomial etc. at some point? But I think you agree that is rather unlikely.

Anyway, what do you think?

#'
#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package
#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package.
#' Currently only supports binary classification model with linear kernal.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean kernel instead of kernal?

@yanboliang
Copy link
Contributor Author

yanboliang commented May 23, 2017

@felixcheung For the name issue, I'm OK to keep as it is, thanks for your clarification. What about other changes in this PR?

@SparkQA
Copy link

SparkQA commented May 23, 2017

Test build #77216 has finished for PR 18035 at commit 5d9afe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! LGTM

@felixcheung
Copy link
Member

let's ignore the appveyor intermitted error - since it passed before simple typo changes

@yanboliang
Copy link
Contributor Author

Merged into master and branch-2.2. Thanks for reviewing.

asfgit pushed a commit that referenced this pull request May 23, 2017
…ar SVM summary.

## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18035 from yanboliang/svm-r.

(cherry picked from commit ad09e4c)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
@asfgit asfgit closed this in ad09e4c May 23, 2017
@yanboliang yanboliang deleted the svm-r branch May 23, 2017 08:19
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…ar SVM summary.

## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#18035 from yanboliang/svm-r.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants