
[SPARK-11215][ML] Add multiple columns support to StringIndexer #20146

Closed
wants to merge 28 commits into apache:master from viirya:SPARK-11215

Conversation


@viirya viirya commented Jan 4, 2018

What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

  1. Supports encoding multiple columns.
  2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as the `stringOrderType` param in `StringIndexer`, the order of strings with equal frequency was undefined. After this change, strings with equal frequency are further sorted alphabetically.

How was this patch tested?

Added tests.
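For illustration, a minimal sketch of the new multi-column usage (the DataFrame `df` and the column names are placeholders, not taken from this PR's tests):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Index two string columns in a single fit/transform pass; under
// frequencyDesc/frequencyAsc, frequency ties are now broken alphabetically.
val indexer = new StringIndexer()
  .setInputCols(Array("city", "state"))
  .setOutputCols(Array("cityIndex", "stateIndex"))
  .setStringOrderType("frequencyDesc")

val model = indexer.fit(df)        // df: an existing DataFrame with the two columns
val indexed = model.transform(df)
```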


viirya commented Jan 4, 2018

Based on @WeichenXu123's original code, I did some clean-up and refactoring.

@@ -313,7 +313,7 @@ test_that("spark.mlp", {
   # Test predict method
   mlpTestDF <- df
   mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
-  expect_equal(head(mlpPredictions$prediction, 6), c("1.0", "0.0", "0.0", "0.0", "0.0", "0.0"))
+  expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
Member Author:

This is due to the change in how we sort string labels with the same frequency under the frequencyDesc/frequencyAsc settings.
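Conceptually, the new ordering can be sketched like this (a standalone illustration, not the PR's exact implementation):

```scala
// frequencyDesc with a deterministic tie-break: sort by descending count
// first, then alphabetically by label.
val counts = Seq("b" -> 3L, "a" -> 3L, "c" -> 5L)
val labels = counts.sortBy { case (label, count) => (-count, label) }.map(_._1)
// labels == Seq("c", "a", "b"): "a" and "b" tie on count, so "a" comes first
```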

Member Author:

I checked the predictions: all 0.0 -> 1.0, and all 1.0 -> 0.0.

@@ -348,12 +348,12 @@ test_that("spark.mlp", {

   # Test random seed
   # default seed
-  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
+  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
Member Author:

It seems maxIter = 10 is not stable, so I increased it to 100 to stabilize the predictions.

Member:

Can you check whether the run time increases significantly? This was an issue before; see SPARK-21693.

Member Author:

> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.780564 secs
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.728089 secs

Member:

Ah, @viirya, would you mind checking it after setting spark.sparkr.use.daemon to false too?

Member Author:

Ran it again with the config:

> sparkR.conf("spark.sparkr.use.daemon")                                                                         
$spark.sparkr.use.daemon
[1] "false"
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.704288 secs
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.135418 secs

@@ -313,7 +313,7 @@ test_that("spark.mlp", {
# Test predict method
Member Author:

Actually, I think we may remove this "Test predict method" test. It seems to me that with tol = 0.5 the prediction may not be very meaningful.

@felixcheung (Member) left a comment:

As discussed before in #19621 (comment), it's failing now as:

[00:54:27] Failed -------------------------------------------------------------------------
[00:54:27] 1. Failure: spark.glm summary (@test_mllib_regression.R#116) -------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 2. Failure: spark.glm summary (@test_mllib_regression.R#117) -------------------
[00:54:27] all(...) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 3. Failure: spark.glm summary (@test_mllib_regression.R#182) -------------------
[00:54:27] all(abs(rStats$coefficients - stats$coefficients) < 0.001) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 4. Failure: glm summary (@test_mllib_regression.R#311) -------------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27] 
[00:54:27] 
[00:54:27] 5. Failure: glm summary (@test_mllib_regression.R#312) -------------------------
[00:54:27] all(...) isn't true.
[00:54:27] 

@SparkQA

SparkQA commented Jan 4, 2018

Test build #85656 has finished for PR 20146 at commit bb990f1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 6, 2018

It seems to me we can't set the string indexer order for R's glm.

A workaround is to encode Species manually first, then let R's glm and spark.glm fit on the encoded Species column instead of the original Species.

I checked the coefficients produced this way: the coefficients from R's glm and spark.glm are the same. But since Species is converted to numeric, the coefficients no longer include Species_versicolor and Species_virginica.

> encodedSpecies <- factor(iris$Species, levels=c("setosa", "versicolor", "virginica"))
> iris$encodedSpecies <- as.numeric(encodedSpecies)
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + encodedSpecies))                             
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = iris))
> stats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.105427e-15
Sepal_Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023848e-12
> rStats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.079930e-15
Sepal.Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023755e-12
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE


viirya commented Jan 6, 2018

Another workaround is to add some rows to the iris dataset so that the three values in the Species column no longer have equal frequency.

For example, we add three more rows to iris. The frequency of versicolor becomes 52, virginica 51, and setosa 50. This makes RFormula index them as "versicolor" -> 0, "virginica" -> 1, "setosa" -> 2, matching the encoding of R's glm.

> iris
...
151          5.9         3.0          5.1         1.8  virginica              3
152          5.7         2.8          4.1         1.3 versicolor              2
153          5.7         2.8          4.1         1.3 versicolor              2
> training <- suppressWarnings(createDataFrame(iris))                                                          
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))                                  
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))                                    
> stats$coefficients                                                                                             
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
> rStats$coefficients                                                                                            
                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE

@felixcheung @WeichenXu123 What do you think?

@felixcheung (Member) left a comment:

Possibly this #20146 (comment), but why not consider what was suggested earlier: use a dataset that doesn't have duplicated values?


viirya commented Jan 6, 2018

Ah, I didn't see that suggestion; using a dataset without duplicated values sounds good to me. I will look for a proper dataset. Or do you have one to suggest already?


viirya commented Jan 6, 2018

Hmm, I reconsidered this #20146 (review). Even if we use a dataset without duplicate values, if the string indexing order of R's glm differs from the index used by RFormula, we still can't get the same results, because it looks like R's glm doesn't order by frequency or alphabet.

For example, I've tried the dataset Puromycin:

> training <- suppressWarnings(createDataFrame(Puromycin))                                                       
> stats <- summary(spark.glm(training, conc ~ rate + state))
> rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
> rStats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
rate            0.006642461 0.001022196  6.498228 2.464757e-06
stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
> stats$coefficients
                  Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
rate           0.006642461 0.001022196  6.498228 2.464757e-06
state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01

You can see that because the string indexing of the state column still differs between R's glm and RFormula, we can't get the same results.

A workaround is to use a dataset that doesn't need string indexing. What do you think? @felixcheung


felixcheung commented Jan 6, 2018 via email


viirya commented Jan 7, 2018

OK. I used a handcrafted tiny dataset to replace iris in the failing test.

@SparkQA

SparkQA commented Jan 7, 2018

Test build #85765 has finished for PR 20146 at commit 26cc94b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2018

also cc @jkbradley @MLnick for review.

@felixcheung (Member) left a comment:

Can you double-check that the changed values match the results from R's glm? (Looks like there are 2.)

@@ -126,15 +134,15 @@ test_that("spark.glm summary", {

   out <- capture.output(print(stats))
   expect_match(out[2], "Deviance Residuals:")
-  expect_true(any(grepl("AIC: 59.22", out)))
+  expect_true(any(grepl("AIC: 35.84", out)))
Member Author:

R glm's AIC: 35.839:

> out <- capture.output(print(rStats))
> out
 [1] ""                                                                        
 [2] "Call:"                                                                   
 [3] "glm(formula = Sepal.Width ~ Sepal.Length + Species, data = dataset)"     
 [4] ""                                                                        
 [5] "Deviance Residuals: "                                                    
 [6] "      1        2        3        4        5        6        7        8  "
 [7] " 0.0000  -1.4932   1.5491   0.5411  -0.8581  -1.2228  -0.5969   2.0809  "
 [8] ""                                                                        
 [9] "Coefficients:"                                                           
[10] "                  Estimate Std. Error t value Pr(>|t|)"                  
[11] "(Intercept)         1.7150     2.0492   0.837    0.450"                  
[12] "Sepal.Length        0.1925     0.5566   0.346    0.747"                  
[13] "Speciesversicolor   1.7894     1.9240   0.930    0.405"                  
[14] "Speciesvirginica    1.2613     2.0735   0.608    0.576"                  
[15] ""                                                                        
[16] "(Dispersion parameter for gaussian family taken to be 2.960032)"         
[17] ""                                                                        
[18] "    Null deviance: 14.719  on 7  degrees of freedom"                     
[19] "Residual deviance: 11.840  on 4  degrees of freedom"                     
[20] "AIC: 35.839"                                                             
[21] ""                                                                        
[22] "Number of Fisher Scoring iterations: 2"                                  
[23] ""                                                                        

   baseSummary <- summary(baseModel)
-  expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
+  expect_true(abs(baseSummary$deviance - 11.84013) < 1e-4)
Member Author:

R glm:

> baseSummary <- summary(stats::glm(Sepal.Width ~ Sepal.Length + Species, data = dataset))
> baseSummary$deviance
[1] 11.84013

Spark glm:

> baseSummary <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
> baseSummary$deviance
[1] 11.84013


viirya commented Jan 8, 2018

@felixcheung Double-checked in the comments above. Thanks.

@felixcheung (Member):

ok SGTM

@felixcheung (Member) left a comment:

I reviewed the R pieces, LGTM, someone else should review.

If we are changing behavior, perhaps it's better to include this in the 2.3.0 release.


viirya commented Jan 10, 2018

ping @jkbradley @MLnick Should we make this into 2.3.0?

@WeichenXu123 (Contributor) left a comment:

Overall good. I left some minor comments. Thanks!

assert(inOutCols2._2 === Array("out1", "out2"))

intercept[IllegalArgumentException] {
new StringIndexer().setInputCol("in").setOutputCols(Array("out1", "out2")).getInOutCols()
Contributor:

It seems better to call stringIndexer.fit and check the exception thrown.

Member Author:

Ok.

@@ -249,6 +249,16 @@ object ParamValidators {
def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = { (value: Array[T]) =>
value.length > lowerBound
}

/** Check if more than one param in a set of exclusive params are set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
Contributor:

@hhbyyh Is this the common helper method you wanted for checking single/multiple in/out columns?

@@ -331,4 +357,51 @@ class StringIndexerSuite
val dfWithIndex = model.transform(dfNoBristol)
assert(dfWithIndex.filter($"CITYIndexed" === 1.0).count == 1)
}

test("StringIndexer multiple input columns") {
Contributor:

Is there a test case for frequencyAsc/Desc verifying that, when frequencies are equal, it falls back to sorting alphabetically?

Member Author:

Good point. I added one test for this.
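Sketched roughly, such a test could look like this (the data, column names, and expected label order are illustrative, not the actual test added):

```scala
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._  // assumes an active SparkSession named `spark`

// "c" appears three times -> index 0; "a" and "b" appear twice each, so
// they tie on frequency and should be ordered alphabetically: a -> 1, b -> 2.
val df = Seq("a", "b", "b", "a", "c", "c", "c").toDF("label")
val model = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
  .setStringOrderType("frequencyDesc")
  .fit(df)
assert(model.labels.sameElements(Array("c", "a", "b")))
```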

@WeichenXu123 (Contributor):

@viirya I discussed with @jkbradley offline. We're now busy fixing some issues in ML structured streaming support that look bad after the code freeze (e.g. #20238), and we may not be able to merge this into 2.3.


viirya commented Jan 16, 2018

@WeichenXu123 Ok. Thanks. Then let's revisit this after 2.3.

@SparkQA

SparkQA commented Jan 16, 2018

Test build #86161 has finished for PR 20146 at commit 540c364.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** Check if more than one param in a set of exclusive params are set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
if (params.filter(paramName => model.hasParam(paramName) &&
Member:

Why is this checking to see if the Param belongs to the Model? If this method is called with irrelevant Params, shouldn't it throw an error?

Member Author:

The purpose of this method is to check whether more than one Param is set among a set of exclusive Params within a Model. Is it useful to include an irrelevant Param among the exclusive Params to check? Since we already know what Params the model has, it sounds like we would be checking an irrelevant Param that we already know doesn't exist.
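For concreteness, here is a sketch of what such a helper could look like, based on the signature shown in the diff (an illustration, not the merged implementation):

```scala
import org.apache.spark.ml.param.Params

/** Sketch: among the named exclusive params, at most one may be set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
  val explicitlySet = params.filter { name =>
    model.hasParam(name) && model.isSet(model.getParam(name))
  }
  if (explicitlySet.length > 1) {
    throw new IllegalArgumentException(
      s"${explicitlySet.mkString(", ")} are exclusive, but more than one is set.")
  }
}
```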


felixcheung commented Jan 6, 2019 via email


viirya commented Jan 7, 2019

> I am curious why the need to change the doctests in the DT classification one?

I looked at it. The change came from the original PR #19621; looking at the discussion there, I didn't find related info.

I reverted the change and tested it locally. It seems the change is not necessary. I'll revert it first and see if @WeichenXu123 has more comments on this.


viirya commented Jan 7, 2019

Thanks @felixcheung

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100851 has finished for PR 20146 at commit 0137d67.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100852 has finished for PR 20146 at commit 7a5be12.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

retest this please.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100863 has finished for PR 20146 at commit 7a5be12.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

retest this please.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100870 has finished for PR 20146 at commit 7a5be12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 7, 2019

@holdenk Thanks for continuing to review this change. The last comments are all addressed now. Please take a look if you have time.

docs/ml-guide.md Outdated
`stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
strings is undefined. Since Spark 3.0, the strings with equal frequency are further
sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
columns. Because of this change, `StringIndexerModel`'s public constructor `def this(uid: String, labels: Array[String])`
Member:

Can we just keep the constructor? That seems like an easy way to avoid a behavior change, even if we can break things in 3.0.

Member Author:

I think it is very rare to construct a StringIndexerModel manually by calling this constructor, so it was removed. Do you think we should keep it? How about keeping it but marking it deprecated?

Member:

*shrug* It seems more common to transform one column than many. You mean that the constructor is rarely used vs. the setters? I agree. If so, I could see deprecating all constructors that fit this pattern for 3.0, across other classes too. It just seemed pretty easy to keep the constructor right here, especially as it wasn't deprecated.

Member Author:

I will revert it.

private[feature] def getSortFunc(
ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
if (ascending) {
(a: (String, Long), b: (String, Long)) => {
Member:

I think the indent is off by one space here. Can you write something like ((strA, freqA), (strB, freqB)) => here for clarity? You could also write the comparison once and then wrap it in something that flips it when ascending is false, but no big deal.

Member Author:

I can only write it as a pattern-matching anonymous function, { case (param1, param2) => ... }, because the Scala compiler complains: "Tuples cannot be directly destructured in method or function parameters".

extends Model[StringIndexerModel] with StringIndexerBase with MLWritable {

import StringIndexerModel._

@deprecated("`this(labels: Array[String])` is deprecated and will be removed in 3.1.0. " +
Member:

I think it's still valid to index one column; I don't see a need to deprecate or remove this. It can just call the other constructor.

@viirya (Member Author), Jan 8, 2019:

I will revert this.

while (i < n) {
map.update(labels(i), i)
i += 1
def this(uid: String, labels: Array[String]) = this(uid, Array(labels))
Member Author:

Because labels was added in 1.5.0, I set this constructor's Since tag to 1.5.0 too.

@SparkQA

SparkQA commented Jan 12, 2019

Test build #101129 has finished for PR 20146 at commit b33556b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 14, 2019

@holdenk @srowen Thanks for your review! I think I addressed your comments now. Please take a look when you have time.

@srowen (Member) left a comment:

Looking pretty good, still a few comments

private[feature] def getSortFunc(
ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
if (ascending) {
{ case ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>
Member:

Nit: the indentation is still off below. I think you can omit the outer braces and the case, as this isn't a partial function:

if (ascending) {
  ((strA, ...), (..., ...)) =>
    if (...) {

Member Author:

If I write it as suggested, the compiler complains:

/spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:283: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.
[error]       Either create a single parameter accepting the Tuple2,                                                                                  
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }                                                    
[error]       ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>                                                                           
[error]        ^              
[error] /spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:290: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.                                                                
[error]       Either create a single parameter accepting the Tuple2,                                                                                  
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }
[error]       ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>                                                                           
[error]        ^                                                                                                                                      
[error] two errors found                

Member Author:

Looks like the outer braces can be omitted, but the case can't.
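For reference, the form under discussion compiles roughly like this (a sketch mirroring the thread, not the exact merged code):

```scala
// Tuples cannot be destructured in plain function parameters, so a
// pattern-matching anonymous function (braces + `case`) is required.
val ascendingSort: ((String, Long), (String, Long)) => Boolean = {
  case ((strA, freqA), (strB, freqB)) =>
    if (freqA == freqB) strA < strB else freqA < freqB
}
```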

"instead.", "3.0.0")
@Since("1.5.0")
def labels: Array[String] = {
require(labelsArray.length == 1, "This StringIndexerModel is fitted by multi-columns, " +
Member:

Nit: the description needs some rewording: "This StringIndexerModel is fit on multiple columns. Call labelsArray instead."

Member Author:

Ok. Reworded.

// [SPARK-11215][ML] Add multiple columns support to StringIndexer
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexerModel.validateAndTransformSchema"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
Member:

Is this exclusion still needed? Looks like you kept the existing constructors.

Member Author:

Oh, right. Can be removed.

@SparkQA

SparkQA commented Jan 24, 2019

Test build #101643 has finished for PR 20146 at commit b33556b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2019

Test build #101656 has finished for PR 20146 at commit 867e001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jan 25, 2019

@srowen Thanks for continuing the review! I think I've addressed your latest comments. Please take another look when you have time.

@srowen (Member) left a comment:

Seems pretty OK to me. I'll see if anyone else has comments.


srowen commented Jan 29, 2019

Merged to master

@srowen srowen closed this in 3310789 Jan 29, 2019

viirya commented Jan 29, 2019

Thanks all reviewers!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This takes over apache#19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.

## How was this patch tested?

Added tests.

Closes apache#20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@viirya viirya deleted the SPARK-11215 branch December 27, 2023 18:36