
[SPARK-11215][ML] Add multiple columns support to StringIndexer #20146

Closed
wants to merge 28 commits into apache:master from viirya:SPARK-11215

Conversation


@viirya viirya commented Jan 4, 2018

What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

  1. Supports encoding multiple columns.
  2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as the `stringOrderType` param in `StringIndexer`, the order of strings with equal frequency was undefined. After this change, strings with equal frequency are further sorted alphabetically.

How was this patch tested?

Added tests.
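For illustration, a minimal sketch of the new multi-column usage (the DataFrame `df` and the column names are placeholders, not taken from this PR's tests):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Index two string columns in a single fit/transform pass; under
// frequencyDesc/frequencyAsc, frequency ties are now broken alphabetically.
val indexer = new StringIndexer()
  .setInputCols(Array("city", "state"))
  .setOutputCols(Array("cityIndex", "stateIndex"))
  .setStringOrderType("frequencyDesc")

val model = indexer.fit(df)        // df: an existing DataFrame with the two columns
val indexed = model.transform(df)
```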


viirya commented Jan 4, 2018

Based on @WeichenXu123's original code, I did some clean-up and refactoring.

@@ -313,7 +313,7 @@ test_that("spark.mlp", {
   # Test predict method
   mlpTestDF <- df
   mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
-  expect_equal(head(mlpPredictions$prediction, 6), c("1.0", "0.0", "0.0", "0.0", "0.0", "0.0"))
+  expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
Member Author:

This is due to the change in how we sort string labels with the same frequency under the frequencyDesc/frequencyAsc settings.
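Conceptually, the new ordering can be sketched like this (a standalone illustration, not the PR's exact implementation):

```scala
// frequencyDesc with a deterministic tie-break: sort by descending count
// first, then alphabetically by label.
val counts = Seq("b" -> 3L, "a" -> 3L, "c" -> 5L)
val labels = counts.sortBy { case (label, count) => (-count, label) }.map(_._1)
// labels == Seq("c", "a", "b"): "a" and "b" tie on count, so "a" comes first
```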

Member Author:

I checked the predictions: all 0.0 -> 1.0, and all 1.0 -> 0.0.

@@ -348,12 +348,12 @@ test_that("spark.mlp", {

   # Test random seed
   # default seed
-  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
+  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
Member Author:

It seems maxIter = 10 is not stable, so I increased it to 100 to stabilize the predictions.

Member:

Can you check whether the run time increases significantly? This was an issue before; see SPARK-21693.

Member Author:

> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.780564 secs
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.728089 secs

Member:

Ah, @viirya, would you mind checking it after setting spark.sparkr.use.daemon to false too?

Member Author:

Ran it again with the config:

> sparkR.conf("spark.sparkr.use.daemon")                                                                         
$spark.sparkr.use.daemon
[1] "false"
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.704288 secs
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.135418 secs

@@ -313,7 +313,7 @@ test_that("spark.mlp", {
# Test predict method
Member Author:

Actually, I think we may remove this "Test predict method" test. It seems to me that with tol = 0.5 the prediction may not be very meaningful.

@felixcheung (Member) left a comment:

As discussed before in #19621 (comment), it's failing now as:

[00:54:27] Failed -------------------------------------------------------------------------
[00:54:27] 1. Failure: spark.glm summary (@test_mllib_regression.R#116) -------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 2. Failure: spark.glm summary (@test_mllib_regression.R#117) -------------------
[00:54:27] all(...) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 3. Failure: spark.glm summary (@test_mllib_regression.R#182) -------------------
[00:54:27] all(abs(rStats$coefficients - stats$coefficients) < 0.001) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 4. Failure: glm summary (@test_mllib_regression.R#311) -------------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 5. Failure: glm summary (@test_mllib_regression.R#312) -------------------------
[00:54:27] all(...) isn't true.
[00:54:27] 

@SparkQA

SparkQA commented Jan 4, 2018

Test build #85656 has finished for PR 20146 at commit bb990f1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 6, 2018

It seems to me we can't set the string indexer order for R's glm.

A workaround is to encode Species manually first, then let R's glm and spark.glm fit on the encoded Species column instead of the original Species.

I checked the coefficients produced this way: the coefficients from R's glm and spark.glm are the same. But since Species is converted to numeric, the coefficients no longer include Species_versicolor and Species_virginica.

> encodedSpecies <- factor(iris$Species, levels=c("setosa", "versicolor", "virginica"))
> iris$encodedSpecies <- as.numeric(encodedSpecies)
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + encodedSpecies))                             
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = iris))
> stats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.105427e-15
Sepal_Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023848e-12
> rStats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.079930e-15
Sepal.Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023755e-12
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE


viirya commented Jan 6, 2018

Another workaround is to add some rows to the iris dataset so that the three values in the Species column no longer have equal frequency.

For example, we add three more rows to iris. The frequency of versicolor becomes 52, virginica 51, and setosa 50. This makes RFormula index them as "versicolor" -> 0, "virginica" -> 1, "setosa" -> 2, matching the encoding of R's glm.

> iris
...
151          5.9         3.0          5.1         1.8  virginica              3
152          5.7         2.8          4.1         1.3 versicolor              2
153          5.7         2.8          4.1         1.3 versicolor              2
> training <- suppressWarnings(createDataFrame(iris))                                                          
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))                                  
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))                                    
> stats$coefficients                                                                                             
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
> rStats$coefficients                                                                                            
                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE

@felixcheung @WeichenXu123 What do you think?

@felixcheung (Member) left a comment:

Possibly this #20146 (comment), but why not consider what was suggested earlier: use a dataset that doesn't have duplicated values?


viirya commented Jan 6, 2018

Ah, I didn't see that suggestion; using a dataset without duplicated values sounds good to me. I will look for a proper dataset. Or do you have one to suggest already?


viirya commented Jan 6, 2018

Hmm, I reconsidered this #20146 (review). Even if we use a dataset without duplicate values, if the string indexing order of R's glm differs from the index used by RFormula, we still can't get the same results, because it looks like R's glm doesn't order by frequency or alphabet.

For example, I've tried the dataset Puromycin:

> training <- suppressWarnings(createDataFrame(Puromycin))                                                       
> stats <- summary(spark.glm(training, conc ~ rate + state))
> rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
> rStats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
rate            0.006642461 0.001022196  6.498228 2.464757e-06
stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
> stats$coefficients
                  Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
rate           0.006642461 0.001022196  6.498228 2.464757e-06
state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01

You can see that because the string indexing of the state column still differs between R's glm and RFormula, we can't get the same results.

A workaround is to use a dataset that doesn't need string indexing. What do you think? @felixcheung


felixcheung commented Jan 6, 2018 via email


viirya commented Jan 7, 2018

OK. I used a handcrafted tiny dataset to replace iris in the failing test.

@SparkQA

SparkQA commented Jan 7, 2018

Test build #85765 has finished for PR 20146 at commit 26cc94b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2018

also cc @jkbradley @MLnick for review.

@felixcheung (Member) left a comment:

Can you double-check that the changed values match the results from R's glm? (Looks like there are 2.)

@@ -126,15 +134,15 @@ test_that("spark.glm summary", {

   out <- capture.output(print(stats))
   expect_match(out[2], "Deviance Residuals:")
-  expect_true(any(grepl("AIC: 59.22", out)))
+  expect_true(any(grepl("AIC: 35.84", out)))
Member Author:

R glm's AIC: 35.839:

> out <- capture.output(print(rStats))
> out
 [1] ""                                                                        
 [2] "Call:"                                                                   
 [3] "glm(formula = Sepal.Width ~ Sepal.Length + Species, data = dataset)"     
 [4] ""                                                                        
 [5] "Deviance Residuals: "                                                    
 [6] "      1        2        3        4        5        6        7        8  "
 [7] " 0.0000  -1.4932   1.5491   0.5411  -0.8581  -1.2228  -0.5969   2.0809  "
 [8] ""                                                                        
 [9] "Coefficients:"                                                           
[10] "                  Estimate Std. Error t value Pr(>|t|)"                  
[11] "(Intercept)         1.7150     2.0492   0.837    0.450"                  
[12] "Sepal.Length        0.1925     0.5566   0.346    0.747"                  
[13] "Speciesversicolor   1.7894     1.9240   0.930    0.405"                  
[14] "Speciesvirginica    1.2613     2.0735   0.608    0.576"                  
[15] ""                                                                        
[16] "(Dispersion parameter for gaussian family taken to be 2.960032)"         
[17] ""                                                                        
[18] "    Null deviance: 14.719  on 7  degrees of freedom"                     
[19] "Residual deviance: 11.840  on 4  degrees of freedom"                     
[20] "AIC: 35.839"                                                             
[21] ""                                                                        
[22] "Number of Fisher Scoring iterations: 2"                                  
[23] ""                                                                        

   baseSummary <- summary(baseModel)
-  expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
+  expect_true(abs(baseSummary$deviance - 11.84013) < 1e-4)
Member Author:

R glm:

> baseSummary <- summary(stats::glm(Sepal.Width ~ Sepal.Length + Species, data = dataset))
> baseSummary$deviance
[1] 11.84013

Spark glm:

> baseSummary <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
> baseSummary$deviance
[1] 11.84013


viirya commented Jan 8, 2018

@felixcheung Double-checked in the comments above. Thanks.

@felixcheung (Member):

ok SGTM

@felixcheung (Member) left a comment:

I reviewed the R pieces, LGTM, someone else should review.

If we are changing behavior, perhaps it's better to include this in the 2.3.0 release.


viirya commented Jan 10, 2018

ping @jkbradley @MLnick Should we make this into 2.3.0?

@WeichenXu123 (Contributor) left a comment:

Overall good. I left some minor comments. Thanks!

assert(inOutCols2._2 === Array("out1", "out2"))

intercept[IllegalArgumentException] {
new StringIndexer().setInputCol("in").setOutputCols(Array("out1", "out2")).getInOutCols()
Contributor:

It seems better to call stringIndexer.fit and check the exception thrown.

Member Author:

Ok.

@@ -249,6 +249,16 @@ object ParamValidators {
def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = { (value: Array[T]) =>
value.length > lowerBound
}

/** Check if more than one param in a set of exclusive params are set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
Contributor:

@hhbyyh Is this the common helper method you wanted for checking single/multiple in/out columns?

@@ -331,4 +357,51 @@ class StringIndexerSuite
val dfWithIndex = model.transform(dfNoBristol)
assert(dfWithIndex.filter($"CITYIndexed" === 1.0).count == 1)
}

test("StringIndexer multiple input columns") {
Contributor:

Is there a test case for frequencyAsc/Desc verifying that, when frequencies are equal, it falls back to sorting alphabetically?

Member Author:

Good point. I added one test for this.
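Sketched roughly, such a test could look like this (the data, column names, and expected label order are illustrative, not the actual test added):

```scala
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._  // assumes an active SparkSession named `spark`

// "c" appears three times -> index 0; "a" and "b" appear twice each, so
// they tie on frequency and should be ordered alphabetically: a -> 1, b -> 2.
val df = Seq("a", "b", "b", "a", "c", "c", "c").toDF("label")
val model = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
  .setStringOrderType("frequencyDesc")
  .fit(df)
assert(model.labels.sameElements(Array("c", "a", "b")))
```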

@WeichenXu123 (Contributor):

@viirya I discussed with @jkbradley offline. We're now busy fixing some issues in ML structured streaming support that look bad after the code freeze (e.g. #20238), and we may not be able to merge this into 2.3.


viirya commented Jan 16, 2018

@WeichenXu123 Ok. Thanks. Then let's revisit this after 2.3.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86161 has finished for PR 20146 at commit 540c364.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** Check if more than one param in a set of exclusive params are set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
if (params.filter(paramName => model.hasParam(paramName) &&
Member:

Why is this checking to see if the Param belongs to the Model? If this method is called with irrelevant Params, shouldn't it throw an error?

Member Author:

The purpose of this method is to check whether more than one Param is set among a set of exclusive Params within a Model. Is it useful to include an irrelevant Param among the exclusive Params to check? Since we already know what Params the model has, it sounds like we would be checking an irrelevant Param that we already know doesn't exist.
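For concreteness, here is a sketch of what such a helper could look like, based on the signature shown in the diff (an illustration, not the merged implementation):

```scala
import org.apache.spark.ml.param.Params

/** Sketch: among the named exclusive params, at most one may be set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
  val explicitlySet = params.filter { name =>
    model.hasParam(name) && model.isSet(model.getParam(name))
  }
  if (explicitlySet.length > 1) {
    throw new IllegalArgumentException(
      s"${explicitlySet.mkString(", ")} are exclusive, but more than one is set.")
  }
}
```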


felixcheung commented Jan 6, 2019 via email


viirya commented Jan 7, 2019

> I am curious why the need to change the doctests in the DT classification one?

I looked at it. The change came from the original PR #19621; looking at the discussion there, I didn't find related info.

I reverted the change and tested it locally. It seems the change is not necessary. I'll revert it first and see if @WeichenXu123 has more comments on this.


viirya commented Jan 7, 2019

Thanks @felixcheung

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100851 has finished for PR 20146 at commit 0137d67.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100852 has finished for PR 20146 at commit 7a5be12.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

retest this please.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100863 has finished for PR 20146 at commit 7a5be12.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

retest this please.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100870 has finished for PR 20146 at commit 7a5be12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

@holdenk Thanks for continuing to review this change. The last comments are all addressed now. Please take a look if you have time.

docs/ml-guide.md Outdated
`stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
strings is undefined. Since Spark 3.0, the strings with equal frequency are further
sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
columns. Because of this change, `StringIndexerModel`'s public constructor `def this(uid: String, labels: Array[String])`
Member:

Can we just keep the constructor? That seems like an easy way to avoid a behavior change, even if we can break things in 3.0.

Member Author:

I think it is very rare to construct a StringIndexerModel manually by calling this constructor, so it was removed. Do you think we should keep it? How about keeping it but marking it deprecated?

Member:

*shrug* It seems more common to transform one column than many. You mean that the constructor is rarely used vs. the setters? I agree. If so, I could see deprecating all constructors that fit this pattern for 3.0, across other classes too. It just seemed pretty easy to keep the constructor right here, especially as it wasn't deprecated.

Member Author:

I will revert it.

private[feature] def getSortFunc(
ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
if (ascending) {
(a: (String, Long), b: (String, Long)) => {
Member:

I think the indent is off by one space here. Can you write something like ((strA, freqA), (strB, freqB)) => here for clarity? You could also write the comparison once and then wrap it in something that flips it when ascending is false, but no big deal.

Member Author:

I can only write it as a pattern-matching anonymous function, { case (param1, param2) => ... }, because the Scala compiler complains: "Tuples cannot be directly destructured in method or function parameters".

extends Model[StringIndexerModel] with StringIndexerBase with MLWritable {

import StringIndexerModel._

@deprecated("`this(labels: Array[String])` is deprecated and will be removed in 3.1.0. " +
Member:

I think it's still valid to index one column; I don't see a need to deprecate or remove this. It can just call the other constructor.

@viirya (Member Author), Jan 8, 2019:

I will revert this.

while (i < n) {
map.update(labels(i), i)
i += 1
def this(uid: String, labels: Array[String]) = this(uid, Array(labels))
Member Author:

Because labels was added in 1.5.0, I set this constructor's Since tag to 1.5.0 too.

@SparkQA

SparkQA commented Jan 12, 2019

Test build #101129 has finished for PR 20146 at commit b33556b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 14, 2019

@holdenk @srowen Thanks for your review! I think I addressed your comments now. Please take a look when you have time.

@srowen (Member) left a comment:

Looking pretty good, still a few comments

private[feature] def getSortFunc(
ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
if (ascending) {
{ case ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>
Member:

Nit: the indentation is still off below. I think you can omit the outer braces and the case, as this isn't a partial function:

if (ascending) {
  ((strA, ...), (..., ...)) =>
    if (...) {

Member Author:

If I write it as suggested, the compiler complains:

/spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:283: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.
[error]       Either create a single parameter accepting the Tuple2,                                                                                  
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }                                                    
[error]       ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>                                                                           
[error]        ^              
[error] /spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:290: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.                                                                
[error]       Either create a single parameter accepting the Tuple2,                                                                                  
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }
[error]       ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>                                                                           
[error]        ^                                                                                                                                      
[error] two errors found                

Member Author:

Looks like the outer braces can be omitted, but the case can't.
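For reference, the form under discussion compiles roughly like this (a sketch mirroring the thread, not the exact merged code):

```scala
// Tuples cannot be destructured in plain function parameters, so a
// pattern-matching anonymous function (braces + `case`) is required.
val ascendingSort: ((String, Long), (String, Long)) => Boolean = {
  case ((strA, freqA), (strB, freqB)) =>
    if (freqA == freqB) strA < strB else freqA < freqB
}
```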

"instead.", "3.0.0")
@Since("1.5.0")
def labels: Array[String] = {
require(labelsArray.length == 1, "This StringIndexerModel is fitted by multi-columns, " +
Member:

Nit: the description needs some rewording: "This StringIndexerModel is fit on multiple columns. Call labelsArray instead."

Member Author:

Ok. Reworded.

// [SPARK-11215][ML] Add multiple columns support to StringIndexer
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexerModel.validateAndTransformSchema"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
Member:

Is this exclusion still needed? Looks like you kept the existing constructors.

Member Author:

Oh, right. Can be removed.

@SparkQA

SparkQA commented Jan 24, 2019

Test build #101643 has finished for PR 20146 at commit b33556b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101656 has finished for PR 20146 at commit 867e001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 25, 2019

@srowen Thanks for continuing the review! I think I've addressed your latest comments. Please take another look when you have time.

@srowen (Member) left a comment:

Seems pretty OK to me. I'll see if anyone else has comments.


srowen commented Jan 29, 2019

Merged to master

@srowen srowen closed this in 3310789 Jan 29, 2019

viirya commented Jan 29, 2019

Thanks all reviewers!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This takes over apache#19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.

## How was this patch tested?

Added tests.

Closes apache#20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@viirya viirya deleted the SPARK-11215 branch December 27, 2023 18:36