[SPARK-21622][ML][SparkR] Support offset in SparkR GLM #18831

Closed
wants to merge 3 commits into from


actuaryzhang
Contributor

What changes were proposed in this pull request?

Support offset in SparkR GLM #16699

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80194 has finished for PR 18831 at commit 6ec068e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80213 has finished for PR 18831 at commit 6ec068e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -125,7 +127,7 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
 #' @seealso \link{glm}, \link{read.ml}
 setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
           function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL,
-                   regParam = 0.0, var.power = 0.0, link.power = 1.0 - var.power,
+                   offsetCol = NULL, regParam = 0.0, var.power = 0.0, link.power = 1.0 - var.power,
Member

I'd avoid adding a param in the middle of the signature; it breaks existing code that passes arguments by position.
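To illustrate the concern, here is a minimal sketch using hypothetical stand-in functions (`fit_old`, `fit_mid`, and `fit_end` are not the real `spark.glm`; their signatures are shortened for illustration). It shows how an existing positional call silently changes meaning when a parameter is inserted in the middle, and keeps its meaning when the new parameter is appended at the end.

```r
# Hypothetical stand-ins with shortened signatures; not the real spark.glm.
fit_old <- function(data, formula, weightCol = NULL, regParam = 0.0) {
  list(weightCol = weightCol, regParam = regParam)
}

# New parameter inserted in the middle of the signature.
fit_mid <- function(data, formula, weightCol = NULL, offsetCol = NULL, regParam = 0.0) {
  list(weightCol = weightCol, offsetCol = offsetCol, regParam = regParam)
}

# New parameter appended at the end of the signature.
fit_end <- function(data, formula, weightCol = NULL, regParam = 0.0, offsetCol = NULL) {
  list(weightCol = weightCol, offsetCol = offsetCol, regParam = regParam)
}

# An existing caller that passes weightCol and regParam by position.
old_call <- fit_old(NULL, y ~ x, "w", 0.1)
mid_call <- fit_mid(NULL, y ~ x, "w", 0.1)  # 0.1 is now bound to offsetCol
end_call <- fit_end(NULL, y ~ x, "w", 0.1)  # 0.1 is still bound to regParam

str(mid_call)
str(end_call)
```

Callers that name their arguments (`regParam = 0.1`) are unaffected either way, so the risk applies only to positional calls.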

  offsetCol <- NULL
} else if (!is.null(offsetCol)) {
  offsetCol <- as.character(offsetCol)
}
Member

perhaps

if (!is.null(offsetCol)) {
  offsetCol <- as.character(offsetCol)
  if (nchar(offsetCol) == 0) {
    offsetCol <- NULL
  }
}

Not sure if you want to cover other cases where offsetCol cannot be coerced, e.g. NA.
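A minimal sketch of one way to cover the NA case, assuming the goal is to treat anything that is not a single non-empty string as "no offset". `normalize_offset_col` is a hypothetical helper, not part of SparkR. The explicit `is.na` check matters because, in current R versions, `nchar(as.character(NA))` is `NA`, and an `if` condition cannot be `NA`.

```r
# Hypothetical helper (not part of SparkR): returns a single non-empty
# column name, or NULL when no usable offset column was given.
normalize_offset_col <- function(offsetCol) {
  if (is.null(offsetCol)) {
    return(NULL)
  }
  offsetCol <- as.character(offsetCol)
  # Assumption: zero-length input, multiple names, NA, and "" all mean "unset".
  if (length(offsetCol) != 1 || is.na(offsetCol) || nchar(offsetCol) == 0) {
    return(NULL)
  }
  offsetCol
}

normalize_offset_col("Petal_Length")
normalize_offset_col("")
normalize_offset_col(NA)
```

Silently mapping invalid input to `NULL` is a design choice made here for brevity; raising an error with `stop()` for NA or multi-element input would be an equally reasonable alternative.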

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80218 has finished for PR 18831 at commit dc8ccbc.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80219 has finished for PR 18831 at commit 3c4ebf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

Thanks for your comments, Felix.
Addressed all issues.
@yanboliang Could you take a quick look?

stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
                           family = poisson(), offsetCol = "Petal_Length"))
rStats <- suppressWarnings(summary(glm(Sepal.Width ~ Sepal.Length + Species,
                                       data = iris, family = poisson(), offset = iris$Petal.Length)))
Member

That's interesting. Perhaps we should accept a Column in addition to a column name too.

Contributor Author

Then do you want to make the change for weight as well?

Member

Probably across everything in ml.
Let's discuss this in a new JIRA.

Contributor

@yanboliang yanboliang Aug 5, 2017

I vote to keep the name as it is, because it's the column name of the offset rather than the offset itself. weightCol is the same. We would like to keep the SparkR MLlib wrappers' argument names consistent with R only when applicable. I'm OK with creating a new JIRA to discuss it. Thanks.
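The naming distinction can be sketched in base R, with no Spark required. Base R's `glm` takes the offset as a vector of values, while an `offsetCol`-style argument takes the name of a column to look up. The data below is simulated for illustration only, and `glm_offset_col` is a hypothetical wrapper, not the SparkR implementation.

```r
# Simulated data for illustration only: counts with a per-row exposure.
set.seed(1)
df <- data.frame(x = rnorm(50), exposure = runif(50, 1, 10))
df$y <- rpois(50, lambda = df$exposure * exp(0.3 * df$x))
df$log_exposure <- log(df$exposure)

# Base R style: the offset argument is the vector of offset values itself.
fit_vec <- glm(y ~ x, family = poisson(), data = df, offset = df$log_exposure)

# Hypothetical wrapper in the style of offsetCol: the argument is a column
# name, and the values are looked up in the data.
glm_offset_col <- function(formula, data, family, offsetCol = NULL) {
  args <- list(formula = formula, family = family, data = data)
  if (!is.null(offsetCol)) {
    args$offset <- data[[offsetCol]]
  }
  # do.call embeds the offset values in the call, which sidesteps glm's
  # non-standard evaluation of the offset argument.
  do.call(glm, args)
}

fit_col <- glm_offset_col(y ~ x, df, poisson(), offsetCol = "log_exposure")

# Both fits describe the same model, so the coefficients should agree.
all.equal(coef(fit_vec), coef(fit_col))
```

Under this reading, `offset` and `offsetCol` are two ways of pointing at the same values, which is why the argument name signals which form is expected.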

Contributor

@yanboliang yanboliang left a comment

LGTM

@felixcheung
Member

felixcheung commented Aug 5, 2017 via email

@yanboliang
Contributor

@felixcheung Sorry for the misunderstanding. I agree we can support df$myoffset as well; the requirement makes sense for R users. Let's create a separate JIRA to track it and make the same change for other similar arguments, such as weightCol. Thanks.

@actuaryzhang
Contributor Author

actuaryzhang commented Aug 6, 2017

Thanks both of you for the comments. Yes, I think it's best to keep this PR on offset and we can address the other improvements later.

@felixcheung
Member

merged to master

@asfgit asfgit closed this in 55aa4da Aug 6, 2017
@actuaryzhang actuaryzhang deleted the sparkROffset branch August 7, 2017 00:09

4 participants