[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc #13894

krishnakalyan3 · 2016-06-24T14:14:46Z

What changes were proposed in this pull request?

Updated ML pipeline Cross Validation Scaladoc & PyDoc.

How was this patch tested?

Documentation update


krishnakalyan3 · 2016-06-26T12:31:59Z

cc @holdenk

holdenk · 2016-06-28T22:42:24Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {

 /**
 * :: Experimental ::
- * Model from k-fold cross validation.
+ * Pipelines facilitate model selection by making it easy to tune an entire 


I'm not sure if this is the best place to mention that - it seems like it doesn't belong in the CrossValidatorModel scaladoc.

krishnakalyan3 · 2016-06-30T23:41:27Z

Updated the doc based on the reviews. Thanks for the review comments @holdenk and @MLnick.

MLnick · 2016-07-01T10:36:53Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

+ * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
+ * partitioned folds which are used as separate training and test datasets e.g., with k=3 folds,
+ * CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of
+ * the data for training and 1/3 for testing. Each fold is used in the testing set exactly once.


"used in the testing set" -> "used as the test set"

krishnakalyan3 · 2016-07-02T12:40:44Z

@holdenk @MLnick is the current update okay?

MLnick · 2016-07-04T13:23:06Z

ok to test

SparkQA · 2016-07-04T14:13:20Z

Test build #61726 has finished for PR 13894 at commit 7a3a4fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-07-04T14:18:50Z

LGTM. @holdenk ?

MLnick · 2016-07-05T08:32:21Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {

 /**
 * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly


Sorry about this, but looking at it again, I think we can clean up the description a bit more:

CrossValidator performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, ...

+1, it's good to mention why it's splitting the dataset and what it's for.

I like the improved description, but can we please keep the phrase "k-fold cross validation"? It's a very common phrase and will be useful for people using keyword search. Thanks!
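
For readers following the k=3 wording discussed above, here is a minimal plain-Python sketch of what "non-overlapping randomly partitioned folds" means. It is an illustration only, not Spark's implementation; the function name `k_fold_pairs`, the `seed` parameter, and the round-robin dealing of rows are all assumptions made for this example.

```python
import random


def k_fold_pairs(rows, k, seed=42):
    """Yield (training, test) pairs built from k non-overlapping random folds.

    Hypothetical helper for illustration; Spark's CrossValidator does its own
    splitting and is not guaranteed to produce equal-sized folds.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal rows round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data is every fold except the one held out for testing.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, test


# With k=3 and 9 rows, each pair uses 6 rows for training and 3 for testing.
pairs = list(k_fold_pairs(range(9), k=3))
```

Each fold appears as the test set exactly once across the three pairs, which is the property the revised Scaladoc sentence describes.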

SparkQA · 2016-07-05T14:47:19Z

Test build #61754 has finished for PR 13894 at commit ffc9576.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-07-06T20:51:15Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {

 /**
 * :: Experimental ::
- * Model from k-fold cross validation.
+ * CrossValidatorModel contains the model that achieved the highest average cross-validation


I think we could maybe make this a bit clearer. Maybe something like:

"CrossValidatorModel contains the model with the highest average cross validation metric across folds and uses this model to transform input data. CrossValidatorModel also tracks the metrics for each param map evaluated."

This way it's clear that it contains both the best model and the metrics, and that it uses the best model if asked to predict/transform any data. What do you think?
(Of course, if we change it here, we would want the corresponding change to occur in Python.)
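
As a rough illustration of the behaviour the suggested wording describes, the sketch below averages a metric across folds for each param map and picks the best one. It is plain Python under stated assumptions, not Spark code: `select_best`, `fold_metrics`, and `larger_is_better` are hypothetical names, and the metric values are made up for the example.

```python
from statistics import mean


def select_best(param_maps, fold_metrics, larger_is_better=True):
    """Return (best_param_map, avg_metrics).

    fold_metrics[i][j] is the metric for param_maps[j] evaluated on fold i.
    Hypothetical helper; names and layout are assumptions for illustration.
    """
    # Average each param map's metric over all folds.
    avg_metrics = [
        mean(fold[j] for fold in fold_metrics) for j in range(len(param_maps))
    ]
    # Some metrics (e.g. error rates) are better when smaller.
    pick = max if larger_is_better else min
    best_index = pick(range(len(param_maps)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics


# Two candidate param maps evaluated on three folds (illustrative numbers).
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
fold_metrics = [[0.80, 0.90], [0.70, 0.85], [0.90, 0.95]]
best, avg_metrics = select_best(param_maps, fold_metrics)
```

Keeping `avg_metrics` alongside the winner mirrors the point of the comment: the model object is useful both for the best model and for the per-param-map metrics.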

holdenk · 2016-07-06T20:52:41Z

Really close. I'm just wondering if we can make the docs for the model a bit clearer, since it does a bit more than just contain the best model, but it looks pretty good. Thanks for taking this on @krishnakalyan3, and sorry for all the suggestions.

MLnick · 2016-07-07T11:46:57Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {

 /**
 * :: Experimental ::
- * K-fold cross validation.
+ * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
+ * partitioned folds as separate training and test datasets e.g., with k=3 folds,


I think we can bring back the "folds, which are used as ..." part

krishnakalyan3 · 2016-07-07T17:41:33Z

@holdenk @MLnick sorry for so many changes. Newbie here. Please let me know if the current state is okay.

SparkQA · 2016-07-07T18:43:51Z

Test build #61921 has finished for PR 13894 at commit 9fb0562.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

krishnakalyan3 · 2016-07-12T19:23:35Z

cc @holdenk @MLnick @jkbradley. Does the current state look good?

SparkQA · 2016-07-12T20:09:05Z

Test build #62185 has finished for PR 13894 at commit 013a1db.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-12T20:29:17Z

Test build #62187 has finished for PR 13894 at commit f9725cc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-16T20:49:31Z

Test build #62412 has finished for PR 13894 at commit 853d6fd.

This patch fails Python style tests.
This patch does not merge cleanly.
This patch adds no public classes.

MLnick · 2016-07-18T11:42:12Z

@krishnakalyan3 I think the merge conflicts still need to be resolved, and also the Python style issue. Subject to those, this LGTM now.

holdenk · 2016-07-19T18:09:38Z

python/pyspark/ml/tuning.py

+    CrossValidatorModel contains the model with the highest average cross-validation
+    metric across folds and uses this model to transform input data. CrossValidatorModel
+    also tracks the metrics for each param map evaluated.
+    


You've got some extra whitespace here.

SparkQA · 2016-07-20T11:06:43Z

Test build #62591 has finished for PR 13894 at commit f40210c.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

MLnick · 2016-07-20T11:09:24Z

@krishnakalyan3 still merge conflicts - could you rebase to current master?

SparkQA · 2016-07-20T21:58:05Z

Test build #62631 has finished for PR 13894 at commit e78d311.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

krishnakalyan3 · 2016-07-21T06:58:26Z

cc @MLnick @holdenk

MLnick · 2016-07-27T12:46:12Z

jenkins retest this please

SparkQA · 2016-07-27T13:33:55Z

Test build #62921 has finished for PR 13894 at commit e78d311.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-07-27T13:39:12Z

Merged to master. Thanks!

krishnakalyan3 · 2016-07-27T13:40:31Z

@MLnick @holdenk @jkbradley thanks for the reviews.


Improve ML pipeline Cross Validation Scaladoc & PyDoc

a562851

krishnakalyan3 added 10 commits July 20, 2016 22:39

CrossValidatorModel improve

852c695

fixed whitespace issue

0952b6a

Updated doc based on review

c5c2e9a

correct minor error

36a45e5

fix grammertical error

ea30664

cleanup description

6b02138

improveing description

e919214

Change CrossValidator to K-fold

450b2c5

clean CrossValidatorModel description

10d78a5

fix conflict

e78d311

krishnakalyan3 force-pushed the kfold-cv branch from f40210c to e78d311 Compare July 20, 2016 21:07

asfgit closed this in 7e8279f Jul 27, 2016

krishnakalyan3 deleted the kfold-cv branch July 27, 2016 13:40
