[MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. #18035

Closed
wants to merge 4 commits
Changes from 2 commits
40 changes: 16 additions & 24 deletions R/pkg/R/mllib_classification.R
@@ -46,15 +46,16 @@ setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj"
#' @note NaiveBayesModel since 2.0.0
setClass("NaiveBayesModel", representation(jobj = "jobj"))

#' linear SVM Model
#' Linear SVM Model
#'
#' Fits an linear SVM model against a SparkDataFrame. It is a binary classifier, similar to svm in glmnet package
#' Fits a linear SVM model against a SparkDataFrame, similar to svm in e1071 package.
#' Currently only supports binary classification model with linear kernal.
Review comment: Do you mean kernel instead of kernal?

#' Users can print, make predictions on the produced model and save the model to the input path.
#'
#' @param data SparkDataFrame for training.
#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
#' operators are supported, including '~', '.', ':', '+', and '-'.
#' @param regParam The regularization parameter.
#' @param regParam The regularization parameter. Only supports L2 regularization currently.
#' @param maxIter Maximum iteration number.
#' @param tol Convergence tolerance of iterations.
#' @param standardization Whether to standardize the training features before fitting the model. The coefficients
@@ -111,10 +112,10 @@ setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = "formu
new("LinearSVCModel", jobj = jobj)
})

# Predicted values based on an LinearSVCModel model
# Predicted values based on a linear SVM model.
Review comment (Member): I think these are intentional. We have "# Predicted values based on an LogisticRegressionModel model"; these lines are prefixed with # and are not in the generated doc, only for developers.

Review comment (Member): There are a couple of these starting with #.

#' @param newData a SparkDataFrame for testing.
#' @return \code{predict} returns the predicted values based on an LinearSVCModel.
#' @return \code{predict} returns the predicted values based on a linear SVM model.
#' @rdname spark.svmLinear
#' @aliases predict,LinearSVCModel,SparkDataFrame-method
#' @export
@@ -124,39 +125,30 @@ setMethod("predict", signature(object = "LinearSVCModel"),
predict_internal(object, newData)
})

# Get the summary of an LinearSVCModel
# Get the summary of a linear SVM model.

#' @param object an LinearSVCModel fitted by \code{spark.svmLinear}.
#' @param object a linear SVM model fitted by \code{spark.svmLinear}.
#' @return \code{summary} returns summary information of the fitted model, which is a list.
#' The list includes \code{coefficients} (coefficients of the fitted model),
#' \code{intercept} (intercept of the fitted model), \code{numClasses} (number of classes),
#' \code{numFeatures} (number of features).
#' \code{numClasses} (number of classes), \code{numFeatures} (number of features).
#' @rdname spark.svmLinear
#' @aliases summary,LinearSVCModel-method
#' @export
#' @note summary(LinearSVCModel) since 2.2.0
setMethod("summary", signature(object = "LinearSVCModel"),
function(object) {
jobj <- object@jobj
features <- callJMethod(jobj, "features")
labels <- callJMethod(jobj, "labels")
coefficients <- callJMethod(jobj, "coefficients")
nCol <- length(coefficients) / length(features)
coefficients <- matrix(unlist(coefficients), ncol = nCol)
intercept <- callJMethod(jobj, "intercept")
features <- callJMethod(jobj, "rFeatures")
coefficients <- callJMethod(jobj, "rCoefficients")
coefficients <- as.matrix(unlist(coefficients))
colnames(coefficients) <- c("Estimate")
rownames(coefficients) <- unlist(features)
numClasses <- callJMethod(jobj, "numClasses")
numFeatures <- callJMethod(jobj, "numFeatures")
if (nCol == 1) {
Review comment (yanboliang, May 19, 2017): ML LinearSVC only supports binary classification, and will not support multiclass classification in the near future, so we can simplify here.

Review comment (Member): Why not label, intercept? I think it is common in R to include what goes into the model (although in many cases the model summary just includes the formula).

Review comment (yanboliang, May 22, 2017): @felixcheung The change here makes the coefficients matrix have only one column, named Estimate. I speculate the original code referred to spark.logit, which supports multiclass classification, so it needed multiple columns, each named for the corresponding label. For binary classification the coefficients are not bound to any labels, so we use Estimate as the column name, as R does. LinearSVC will not support multiclass classification in the future, so I simplified it here.
The following are summary outputs for binomial and multinomial logistic regression in SparkR:
Binomial logistic regression model: [screenshot]
Multinomial logistic regression model: [screenshot]
colnames(coefficients) <- c("Estimate")
} else {
colnames(coefficients) <- unlist(labels)
}
rownames(coefficients) <- unlist(features)
list(coefficients = coefficients, intercept = intercept,
numClasses = numClasses, numFeatures = numFeatures)
list(coefficients = coefficients, numClasses = numClasses, numFeatures = numFeatures)
})
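After this change, summary() returns the intercept stacked as the first row of the coefficients matrix instead of as a separate intercept entry. A minimal Python sketch of the resulting shape, assuming plain lists stand in for the JVM-backed rFeatures/rCoefficients values:

```python
def build_svm_summary(r_features, r_coefficients, num_classes, num_features):
    """Sketch of the simplified SparkR summary(LinearSVCModel): a single
    'Estimate' column, row-labelled by feature name, with the intercept
    already stacked first by the backend wrapper."""
    coefficients = {name: {"Estimate": value}
                    for name, value in zip(r_features, r_coefficients)}
    return {"coefficients": coefficients,
            "numClasses": num_classes,
            "numFeatures": num_features}

# Hypothetical values for illustration only.
s = build_svm_summary(["(Intercept)", "x1", "x2"], [-0.06, 0.5, -1.2], 2, 2)
```

The "(Intercept)" row comes first only because the wrapper prepends it; the R code itself just binds names to values in order.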

# Save fitted LinearSVCModel to the input path
# Save fitted linear SVM model to the input path.

#' @param path The directory where the model is saved.
#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
3 changes: 1 addition & 2 deletions R/pkg/inst/tests/testthat/test_mllib_classification.R
@@ -38,9 +38,8 @@ test_that("spark.svmLinear", {
expect_true(class(summary$coefficients[, 1]) == "numeric")

coefs <- summary$coefficients[, "Estimate"]
expected_coefs <- c(-0.1563083, -0.460648, 0.2276626, 1.055085)
expected_coefs <- c(-0.06004978, -0.1563083, -0.460648, 0.2276626, 1.055085)
expect_true(all(abs(coefs - expected_coefs) < 0.1))
expect_equal(summary$intercept, -0.06004978, tolerance = 1e-2)

# Test prediction with string label
prediction <- predict(model, training)
12 changes: 10 additions & 2 deletions mllib/src/main/scala/org/apache/spark/ml/r/LinearSVCWrapper.scala
@@ -38,9 +38,17 @@ private[r] class LinearSVCWrapper private (
private val svcModel: LinearSVCModel =
pipeline.stages(1).asInstanceOf[LinearSVCModel]

lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
lazy val rFeatures: Array[String] = if (svcModel.getFitIntercept) {
Array("(Intercept)") ++ features
Review comment (yanboliang, May 19, 2017): In R we stack the intercept with the other feature names; you can refer to spark.glm, spark.logit, and spark.survreg.
} else {
features
}

lazy val intercept: Double = svcModel.intercept
lazy val rCoefficients: Array[Double] = if (svcModel.getFitIntercept) {
Array(svcModel.intercept) ++ svcModel.coefficients.toArray
} else {
svcModel.coefficients.toArray
}

lazy val numClasses: Int = svcModel.numClasses

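The wrapper's stacking logic is simple: when fitIntercept is set, the intercept is prepended to both the feature-name array and the coefficient array so the two stay aligned row by row. A hedged Python transliteration of the two Scala vals above (names and shapes only; the real values come from the fitted LinearSVCModel):

```python
def r_features(features, fit_intercept):
    # Mirrors LinearSVCWrapper.rFeatures: "(Intercept)" leads the names.
    return (["(Intercept)"] + list(features)) if fit_intercept else list(features)

def r_coefficients(coefficients, intercept, fit_intercept):
    # Mirrors LinearSVCWrapper.rCoefficients: the intercept leads the values.
    return ([intercept] + list(coefficients)) if fit_intercept else list(coefficients)

# Hypothetical fitted values for illustration only.
names = r_features(["x1", "x2"], True)
values = r_coefficients([0.5, -1.2], 0.1, True)
```

Because both arrays are extended in the same branch, the R side can safely zip them into the one-column coefficients matrix without any index bookkeeping.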