Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc #13894

Closed
wants to merge 11 commits into from

Conversation

krishnakalyan3
Copy link
Member

@krishnakalyan3 krishnakalyan3 commented Jun 24, 2016

What changes were proposed in this pull request?

Updated ML pipeline Cross Validation Scaladoc & PyDoc.

How was this patch tested?

Documentation update

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@krishnakalyan3
Copy link
Member Author

cc @holdenk

@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {

/**
* :: Experimental ::
* Model from k-fold cross validation.
* Pipelines facilitate model selection by making it easy to tune an entire
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the best place to mention that - it seems like it doesn't belong in the CrossValidatorModel scaladoc.

@krishnakalyan3
Copy link
Member Author

Updated the doc based on the reviews. Thanks for the review comments @holdenk and @MLnick.

* CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
* partitioned folds which are used as separate training and test datasets e.g., with k=3 folds,
* CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of
* the data for training and 1/3 for testing. Each fold is used in the testing set exactly once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"used in the testing set" -> "used as the test set"

@krishnakalyan3
Copy link
Member Author

@holdenk @MLnick is the current update okay?

@MLnick
Copy link
Contributor

MLnick commented Jul 4, 2016

ok to test

@SparkQA
Copy link

SparkQA commented Jul 4, 2016

Test build #61726 has finished for PR 13894 at commit 7a3a4fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jul 4, 2016

LGTM. @holdenk ?

@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {

/**
* :: Experimental ::
* K-fold cross validation.
* CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about this, but looking at it again I think we can clean up the description a bit more:

CrossValidator performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - is good to mention why its splitting the dataset/what its for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the improved description, but can we please keep the phrase "k-fold cross validation?" It's a very common phrase and will be useful for people using keyword search. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 5, 2016

Test build #61754 has finished for PR 13894 at commit ffc9576.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {

/**
* :: Experimental ::
* Model from k-fold cross validation.
* CrossValidatorModel contains the model that achieved the highest average cross-validation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could maybe make this a bit clearer. Maybe something like:

"CrossValidatorModel contains the model with the highest average cross validation metric across folds and uses this model to transform input data. CrossValidatorModel also tracks the metrics for each param map evaluated."

This way its clear that it contains both the best model and the metrics, as well as uses the best model if asked to predict/transform any data. What do you think?
(Of course if we change it here we would want to have the corresponding change occur in Python).

@holdenk
Copy link
Contributor

holdenk commented Jul 6, 2016

Really close, just wondering if we can maybe make the docs for the model a bit clearer since it does a bit more than just contain the best model but it looks pretty good. Thanks for taking this on @krishnakalyan3 and sorry for all the suggestions.

@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {

/**
* :: Experimental ::
* K-fold cross validation.
* CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
* partitioned folds as separate training and test datasets e.g., with k=3 folds,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can bring back the "folds, which are used as ..." part

@krishnakalyan3
Copy link
Member Author

@holdenk @MLnick sorry for so many changes. Newbie here. Please let me know if the current state is okay?.

@SparkQA
Copy link

SparkQA commented Jul 7, 2016

Test build #61921 has finished for PR 13894 at commit 9fb0562.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@krishnakalyan3
Copy link
Member Author

cc @holdenk @MLnick @jkbradley. Does the current state look good?.

@SparkQA
Copy link

SparkQA commented Jul 12, 2016

Test build #62185 has finished for PR 13894 at commit 013a1db.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 12, 2016

Test build #62187 has finished for PR 13894 at commit f9725cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2016

Test build #62412 has finished for PR 13894 at commit 853d6fd.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jul 18, 2016

@krishnakalyan3 think merge conflicts still need to be resolved - also the Python style issue. Subject to those this LGTM now.

CrossValidatorModel contains the model with the highest average cross-validation
metric across folds and uses this model to transform input data. CrossValidatorModel
also tracks the metrics for each param map evaluated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got some extra whitespace here.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62591 has finished for PR 13894 at commit f40210c.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jul 20, 2016

@krishnakalyan3 still merge conflicts - could you rebase to current master?

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62631 has finished for PR 13894 at commit e78d311.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@krishnakalyan3
Copy link
Member Author

cc @MLnick @holdenk

@MLnick
Copy link
Contributor

MLnick commented Jul 27, 2016

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jul 27, 2016

Test build #62921 has finished for PR 13894 at commit e78d311.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jul 27, 2016

Merged to master. Thanks!

@asfgit asfgit closed this in 7e8279f Jul 27, 2016
@krishnakalyan3
Copy link
Member Author

@MLnick @holdenk @jkbradley thanks for the reviews.

@krishnakalyan3 krishnakalyan3 deleted the kfold-cv branch July 27, 2016 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants