[SPARK-11215][ML] Add multiple columns support to StringIndexer #20146
Conversation
Based on @WeichenXu123's original code, I did some clean-up and refactoring.
```diff
@@ -313,7 +313,7 @@ test_that("spark.mlp", {
   # Test predict method
   mlpTestDF <- df
   mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
-  expect_equal(head(mlpPredictions$prediction, 6), c("1.0", "0.0", "0.0", "0.0", "0.0", "0.0"))
+  expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
```
This is due to the change in how we sort string labels with the same frequency under the frequencyDesc/Asc setting.
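To illustrate the new ordering, here is a minimal sketch using plain Scala collections (illustrative data only, not the Spark source): with `frequencyDesc`, labels with equal counts now tie-break alphabetically.

```scala
// Sort by frequency descending, breaking ties alphabetically.
val counts = Seq(("b", 2L), ("a", 2L), ("c", 5L))
val ordered = counts.sortWith { case ((strA, freqA), (strB, freqB)) =>
  if (freqA == freqB) strA < strB else freqA > freqB
}
// ordered == Seq(("c", 5L), ("a", 2L), ("b", 2L)): "a" now precedes "b" on the tie
```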
I checked the predictions. All `0.0` -> `1.0`, and all `1.0` -> `0.0`.
```diff
@@ -348,12 +348,12 @@ test_that("spark.mlp", {

   # Test random seed
   # default seed
-  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
+  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
```
Seems `maxIter = 10` is not stable. I increased it to 100 to stabilize the predictions.
Can you check if the run time increases significantly? This was an issue before; see SPARK-21693.
```r
> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.780564 secs

> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.728089 secs
```
Ahh, @viirya, would you mind checking it after setting `spark.sparkr.use.daemon` to `false` too?
Ran it again with the config:
```r
> sparkR.conf("spark.sparkr.use.daemon")
$spark.sparkr.use.daemon
[1] "false"

> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 10)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 1.704288 secs

> start.time <- Sys.time()
> model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3), maxIter = 100)
> end.time <- Sys.time()
> time.taken <- end.time - start.time
> time.taken
Time difference of 5.135418 secs
```
```diff
@@ -313,7 +313,7 @@ test_that("spark.mlp", {
   # Test predict method
```
Actually I think we may remove this "Test predict method" test. It seems to me that, with `tol = 0.5`, the prediction may not be very meaningful.
As discussed before in #19621 (comment), it's failing now as:
```
[00:54:27] Failed -------------------------------------------------------------------------
[00:54:27] 1. Failure: spark.glm summary (@test_mllib_regression.R#116) -------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27]
[00:54:27] 2. Failure: spark.glm summary (@test_mllib_regression.R#117) -------------------
[00:54:27] all(...) isn't true.
[00:54:27]
[00:54:27] 3. Failure: spark.glm summary (@test_mllib_regression.R#182) -------------------
[00:54:27] all(abs(rStats$coefficients - stats$coefficients) < 0.001) isn't true.
[00:54:27]
[00:54:27] 4. Failure: glm summary (@test_mllib_regression.R#311) -------------------------
[00:54:27] all(abs(rCoefs - coefs) < 1e-04) isn't true.
[00:54:27]
[00:54:27] 5. Failure: glm summary (@test_mllib_regression.R#312) -------------------------
[00:54:27] all(...) isn't true.
```
Test build #85656 has finished for PR 20146 at commit
Seems to me we can't set the string indexer order for R glm. A workaround is to encode Species manually first, then let R glm and spark.glm fit the encoded Species column instead of the original Species. I checked the coefficients produced this way: the coefficients from R glm and spark.glm are the same. But as Species is converted to numeric, the coefficients don't include Species_versicolor and Species_virginica now.

```r
> encodedSpecies <- factor(iris$Species, levels=c("setosa", "versicolor", "virginica"))
> iris$encodedSpecies <- as.numeric(encodedSpecies)
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + encodedSpecies))
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = iris))
> stats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.105427e-15
Sepal_Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023848e-12
> rStats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.079930e-15
Sepal.Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023755e-12
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE
```
Another workaround: we can add some rows to the iris dataset so the three values in the Species column no longer have equal frequency. For example, we add three more rows to iris:

```r
> iris
...
151          5.9         3.0          5.1         1.8  virginica 3
152          5.7         2.8          4.1         1.3 versicolor 2
153          5.7         2.8          4.1         1.3 versicolor 2
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
> stats$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
> rStats$coefficients
                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE
```

@felixcheung @WeichenXu123 What do you think?
Possibly this: #20146 (comment). But why not consider what was suggested earlier: use a dataset that doesn't have duplicated values?
Ah, I didn't see that suggestion; using a dataset without duplicated values sounds good to me. I will look for a proper dataset. Or do you have a suggestion already?
Hmm, I reconsidered this: #20146 (review). Even if we use a dataset without duplicate values, if the string indexer order from R glm is different from the index used by RFormula, we still can't get the same results, because it looks like R glm doesn't follow frequency/alphabet ordering. For example, I've tried the dataset Puromycin:

```r
> training <- suppressWarnings(createDataFrame(Puromycin))
> stats <- summary(spark.glm(training, conc ~ rate + state))
> rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
> rStats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
rate            0.006642461 0.001022196  6.498228 2.464757e-06
stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
> stats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
rate           0.006642461 0.001022196  6.498228 2.464757e-06
state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01
```

You can see that because the string index of the state column is still different between R glm and RFormula, we can't get the same results. A workaround is to use a dataset which doesn't need string indexing. What do you think? @felixcheung
I think all datasets with a string column get indexed, as far as I recall?
Picking an existing R dataset is just a convenience; we can also make up a few lines of data if that works out better.
Although, as a separate note, the difference in sort order is potentially something we should document, especially if it goes beyond glm, for example in SQL functions too.
Ok. I used a handcrafted tiny dataset to replace iris in the failing test.
Test build #85765 has finished for PR 20146 at commit
Also cc @jkbradley @MLnick for review.
Can you double-check that the changed values match the result from R's glm? (Looks like there are 2.)
```diff
@@ -126,15 +134,15 @@ test_that("spark.glm summary", {

   out <- capture.output(print(stats))
   expect_match(out[2], "Deviance Residuals:")
-  expect_true(any(grepl("AIC: 59.22", out)))
+  expect_true(any(grepl("AIC: 35.84", out)))
```
R glm's AIC is 35.839:
```r
> out <- capture.output(print(rStats))
> out
 [1] ""
 [2] "Call:"
 [3] "glm(formula = Sepal.Width ~ Sepal.Length + Species, data = dataset)"
 [4] ""
 [5] "Deviance Residuals: "
 [6] "      1        2        3        4        5        6        7        8  "
 [7] " 0.0000  -1.4932   1.5491   0.5411  -0.8581  -1.2228  -0.5969   2.0809  "
 [8] ""
 [9] "Coefficients:"
[10] "                  Estimate Std. Error t value Pr(>|t|)"
[11] "(Intercept)         1.7150     2.0492   0.837    0.450"
[12] "Sepal.Length        0.1925     0.5566   0.346    0.747"
[13] "Speciesversicolor   1.7894     1.9240   0.930    0.405"
[14] "Speciesvirginica    1.2613     2.0735   0.608    0.576"
[15] ""
[16] "(Dispersion parameter for gaussian family taken to be 2.960032)"
[17] ""
[18] "    Null deviance: 14.719  on 7  degrees of freedom"
[19] "Residual deviance: 11.840  on 4  degrees of freedom"
[20] "AIC: 35.839"
[21] ""
[22] "Number of Fisher Scoring iterations: 2"
[23] ""
```
```diff
   baseSummary <- summary(baseModel)
-  expect_true(abs(baseSummary$deviance - 12.19313) < 1e-4)
+  expect_true(abs(baseSummary$deviance - 11.84013) < 1e-4)
```
R glm:

```r
> baseSummary <- summary(stats::glm(Sepal.Width ~ Sepal.Length + Species, data = dataset))
> baseSummary$deviance
[1] 11.84013
```

Spark glm:

```r
> baseSummary <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
> baseSummary$deviance
[1] 11.84013
```
@felixcheung Double-checked in the comments. Thanks.
ok SGTM
I reviewed the R pieces, LGTM; someone else should review the rest.
If we are changing behavior, perhaps it's better to include this in the 2.3.0 release.
Ping @jkbradley @MLnick: should we make this into 2.3.0?
Overall good. I left some minor comments. Thanks!
```scala
assert(inOutCols2._2 === Array("out1", "out2"))

intercept[IllegalArgumentException] {
  new StringIndexer().setInputCol("in").setOutputCols(Array("out1", "out2")).getInOutCols()
```
It seems better to call `stringIndexer.fit` and check the exception thrown.
Ok.
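A sketch of what the reworked test might look like (hypothetical data; the actual test body isn't shown in this thread):

```scala
// Trigger the column-count validation through fit() on a real DataFrame
// instead of calling the private getInOutCols() directly.
// Assumes a SparkSession `spark` with spark.implicits._ imported.
val df = Seq("a", "b", "a").toDF("in")
intercept[IllegalArgumentException] {
  new StringIndexer()
    .setInputCol("in")
    .setOutputCols(Array("out1", "out2"))
    .fit(df)
}
```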
```diff
@@ -249,6 +249,16 @@ object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = { (value: Array[T]) =>
     value.length > lowerBound
   }

+  /** Check if more than one param in a set of exclusive params are set. */
+  def checkExclusiveParams(model: Params, params: String*): Unit = {
```
@hhbyyh Is this the common helper method you wanted for checking single/multiple in/out columns?
```diff
@@ -331,4 +357,51 @@ class StringIndexerSuite
     val dfWithIndex = model.transform(dfNoBristol)
     assert(dfWithIndex.filter($"CITYIndexed" === 1.0).count == 1)
   }

+  test("StringIndexer multiple input columns") {
```
Is there a test case for `frequencyAsc/Desc` checking that, when frequencies are equal, it does a secondary sort alphabetically?
Good point. I added one test for this.
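Roughly what such a test could look like (a sketch with made-up data, not the test that was actually added):

```scala
// "a" and "b" each appear twice, "c" once. Under frequencyDesc the
// alphabetical tie-break should order "a" before "b".
// Assumes a SparkSession `spark` with spark.implicits._ imported.
val df = Seq("b", "a", "b", "a", "c").toDF("in")
val model = new StringIndexer()
  .setInputCol("in")
  .setOutputCol("out")
  .setStringOrderType("frequencyDesc")
  .fit(df)
assert(model.labels === Array("a", "b", "c"))
```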
@viirya
@WeichenXu123 Ok. Thanks. Then let's revisit this after 2.3.
Test build #86161 has finished for PR 20146 at commit
```scala
/** Check if more than one param in a set of exclusive params are set. */
def checkExclusiveParams(model: Params, params: String*): Unit = {
  if (params.filter(paramName => model.hasParam(paramName) &&
```
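The diff excerpt above cuts off mid-condition; here is a hedged guess at how the helper might continue (only the first line comes from the thread, the rest of the body is assumed):

```scala
// Count how many of the named exclusive params are both defined on the
// model and explicitly set; more than one set param is an error.
def checkExclusiveParams(model: Params, params: String*): Unit = {
  if (params.filter(paramName => model.hasParam(paramName) &&
      model.isSet(model.getParam(paramName))).size > 1) {
    throw new IllegalArgumentException(
      s"${params.mkString(", ")} are exclusive, but more than one are set.")
  }
}
```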
Why is this checking to see if the Param belongs to the Model? If this method is called with irrelevant Params, shouldn't it throw an error?
The purpose of this method is to check whether more than one Param is set among some exclusive Params within a Model. Is it useful to include an irrelevant Param among the exclusive Params to check? As we already know which Params the model has, it sounds like we would be checking an irrelevant Param that we already know doesn't exist.
FYI, I'm happy to defer to Holden and DB on this.
I looked at it. The change is from the original PR #19621. Looking at the discussion there, I didn't find related info. I reverted this change and tested it locally; seems the change is not necessary. I'll revert it first and see if @WeichenXu123 has more comments on this.
Thanks @felixcheung
Test build #100851 has finished for PR 20146 at commit
Test build #100852 has finished for PR 20146 at commit
retest this please.
Test build #100863 has finished for PR 20146 at commit
retest this please.
Test build #100870 has finished for PR 20146 at commit
@holdenk Thanks for continuing to review this change. The last comments are all addressed now. Please take a look when you have time.
docs/ml-guide.md (Outdated)
```
`stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
strings is undefined. Since Spark 3.0, the strings with equal frequency are further
sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
columns. Because of this change, `StringIndexerModel`'s public constructor `def this(uid: String, labels: Array[String])`
```
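For reference, a sketch of the two constructor shapes this migration note is discussing (labels are hypothetical):

```scala
// Single-column constructor, present since 1.5.0.
val single = new StringIndexerModel("uid", Array("a", "b"))
// Multi-column constructor taking labelsArray, added by this PR.
val multi = new StringIndexerModel("uid", Array(Array("a", "b"), Array("x", "y")))
```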
Can we just keep the constructor? That seems like an easy way to avoid a behavior change, even if we can break things in 3.0.
Because I think it is very rare to construct a `StringIndexerModel` manually by calling this constructor, I removed it. Do you think we should keep it? How about keeping it and marking it deprecated?
*shrug* It seems more common to transform one column than many. You mean that the constructor is rarely used vs. the setters? I agree. I could see deprecating all constructors that fit this pattern for 3.0, across other classes too, if so. It just seemed pretty easy to keep the constructor right here, especially as it wasn't deprecated.
I will revert it.
```scala
private[feature] def getSortFunc(
    ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
  if (ascending) {
    (a: (String, Long), b: (String, Long)) => {
```
I think the indent is off one space here. Can you write something like `((strA, freqA), (strB, freqB)) =>` here for clarity? You could also write this comparison once and then wrap it in something that flips the comparison if `ascending` is `false`, but no big deal.
I can only write it as a pattern-matching anonymous function, `{ case (param1, param2) => ... }`, because the Scala compiler complains: `Tuples cannot be directly destructured in method or function parameters.`
```scala
extends Model[StringIndexerModel] with StringIndexerBase with MLWritable {

  import StringIndexerModel._

  @deprecated("`this(labels: Array[String])` is deprecated and will be removed in 3.1.0. " +
```
I think it's still valid to index one column; I don't see a need to deprecate or remove this. It can just call the other constructor.
I will revert this.
```scala
  while (i < n) {
    map.update(labels(i), i)
    i += 1

def this(uid: String, labels: Array[String]) = this(uid, Array(labels))
```
Because `labels` was added in 1.5.0, I wrote this constructor's `Since` tag as 1.5.0 too.
Test build #101129 has finished for PR 20146 at commit
Looking pretty good, still a few comments
```scala
private[feature] def getSortFunc(
    ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
  if (ascending) {
    { case ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>
```
Nit: the indentation is still off below. I think that you can omit the outer braces and `case`, as this isn't a partial function:

```scala
if (ascending) {
  ((strA, ...), (..., ...)) =>
    if (...) {
```
If I write it as suggested, the compiler complains:

```
/spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:283: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.
[error]       Either create a single parameter accepting the Tuple2,
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }`
[error]     ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>
[error]      ^
[error] /spark/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala:290: not a legal formal parameter.
[error] Note: Tuples cannot be directly destructured in method or function parameters.
[error]       Either create a single parameter accepting the Tuple2,
[error]       or consider a pattern matching anonymous function: `{ case (param1, param2) => ... }`
[error]     ((strA: String, freqA: Long), (strB: String, freqB: Long)) =>
[error]      ^
[error] two errors found
```
Looks like the outer braces can be omitted, but the `case` can't.
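Putting that together, the sort function plausibly ends up in this shape (a sketch based on the discussion, not a verbatim copy of the merged code):

```scala
// Outer braces around the function literal are dropped; the `case` stays
// because tuples can't be destructured in plain function parameters.
private[feature] def getSortFunc(
    ascending: Boolean): ((String, Long), (String, Long)) => Boolean = {
  if (ascending) {
    case ((strA, freqA), (strB, freqB)) =>
      if (freqA == freqB) strA < strB else freqA < freqB
  } else {
    case ((strA, freqA), (strB, freqB)) =>
      if (freqA == freqB) strA < strB else freqA > freqB
  }
}
```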
"instead.", "3.0.0") | ||
@Since("1.5.0") | ||
def labels: Array[String] = { | ||
require(labelsArray.length == 1, "This StringIndexerModel is fitted by multi-columns, " + |
Nit: the description needs some rewording: "This StringIndexerModel is fit on multiple columns. Call `labelsArray` instead."
Ok. Reworded.
project/MimaExcludes.scala (Outdated)
```scala
// [SPARK-11215][ML] Add multiple columns support to StringIndexer
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.feature.StringIndexerModel.validateAndTransformSchema"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
```
Is this exclusion still needed? Looks like you kept the existing constructors.
Oh, right. Can be removed.
Test build #101643 has finished for PR 20146 at commit
Test build #101656 has finished for PR 20146 at commit
@srowen Thanks for continuing the review! I think I addressed your latest comments. Please take a look. Thanks.
Seems pretty OK to me. I'll see if anyone else has comments.
Merged to master
Thanks, all reviewers!
What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as the `stringOrderType` param in `StringIndexer`, in case of equal frequency the order of strings was undefined. After this change, strings with equal frequency are further sorted alphabetically.

How was this patch tested?

Added tests.

Closes #20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
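To round off, a small usage sketch of the multi-column API this PR adds (made-up data):

```scala
// One StringIndexer now indexes several input columns in a single pass.
// Assumes a SparkSession `spark` with spark.implicits._ imported.
val df = Seq(("a", "x"), ("b", "y"), ("a", "y")).toDF("c1", "c2")
val indexed = new StringIndexer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1Idx", "c2Idx"))
  .fit(df)
  .transform(df)
indexed.show()
```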