[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc #13894
cc @holdenk
@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {
/**
 * :: Experimental ::
 * Model from k-fold cross validation.
 * Pipelines facilitate model selection by making it easy to tune an entire
I'm not sure if this is the best place to mention that - it seems like it doesn't belong in the CrossValidatorModel scaladoc.
 * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
 * partitioned folds which are used as separate training and test datasets e.g., with k=3 folds,
 * CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of
 * the data for training and 1/3 for testing. Each fold is used in the testing set exactly once.
"used in the testing set" -> "used as the test set"
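To make the k=3 example in the Scaladoc concrete, here is a minimal pure-Python sketch of the splitting it describes. It is not Spark's implementation: the function name `k_fold_pairs`, the `seed` parameter, and the round-robin fold assignment are assumptions made only for illustration.

```python
import random


def k_fold_pairs(rows, k, seed=0):
    """Return k (training, test) pairs built from non-overlapping random folds.

    Hypothetical helper for illustration; not part of Spark.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows round-robin so the folds are disjoint
    # and as close to equal in size as possible.
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the one held out for testing.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((training, test))
    return pairs


pairs = k_fold_pairs(range(9), k=3)
for training, test in pairs:
    print(len(training), len(test))
```

With 9 rows and k=3, each pair holds 6 training rows (2/3 of the data) and 3 test rows (1/3), and every row appears in a test set exactly once.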
ok to test
Test build #61726 has finished for PR 13894 at commit
LGTM. @holdenk?
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {
/**
 * :: Experimental ::
 * K-fold cross validation.
 * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
sorry about this, but looking at it again I think we can clean up the description a bit more:
CrossValidator performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, ...
+1 - it's good to mention why it's splitting the dataset and what it's for.
I like the improved description, but can we please keep the phrase "k-fold cross validation"? It's a very common phrase and will be useful for people using keyword search. Thanks!
Test build #61754 has finished for PR 13894 at commit
@@ -191,7 +194,8 @@ object CrossValidator extends MLReadable[CrossValidator] {
/**
 * :: Experimental ::
 * Model from k-fold cross validation.
 * CrossValidatorModel contains the model that achieved the highest average cross-validation
I think we could maybe make this a bit clearer. Maybe something like:
"CrossValidatorModel contains the model with the highest average cross validation metric across folds and uses this model to transform input data. CrossValidatorModel also tracks the metrics for each param map evaluated."
This way it's clear that it contains both the best model and the metrics, and that it uses the best model if asked to predict/transform any data. What do you think?
(Of course if we change it here we would want to have the corresponding change occur in Python).
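As a rough illustration of the behaviour that wording describes (pick the param map with the highest average metric across folds, and keep the averaged metric for every param map evaluated), here is a pure-Python sketch. It is not Spark's code: `select_best`, `larger_is_better`, and the sample metric values are hypothetical and exist only for this example.

```python
def select_best(metrics_per_param_map, larger_is_better=True):
    """Average each param map's per-fold metrics and pick the best one.

    Hypothetical helper for illustration; not part of Spark.
    Returns (index of the best param map, list of average metrics).
    """
    avg_metrics = [sum(m) / len(m) for m in metrics_per_param_map]
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return best_index, avg_metrics


# Made-up metrics: one row per param map, one column per fold (k=3).
fold_metrics = [
    [0.5, 0.75, 1.0],
    [0.75, 0.875, 1.0],
    [0.25, 0.5, 0.75],
]
best_index, avg_metrics = select_best(fold_metrics)
print(best_index, avg_metrics)
```

Here the second param map (index 1) wins with an average of 0.875, and `avg_metrics` keeps one averaged value per param map, which mirrors the "tracks the metrics for each param map evaluated" part of the proposed description.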
Really close. I'm just wondering if we can make the docs for the model a bit clearer, since it does a bit more than just contain the best model, but it looks pretty good. Thanks for taking this on, @krishnakalyan3, and sorry for all the suggestions.
@@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {
/**
 * :: Experimental ::
 * K-fold cross validation.
 * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly
 * partitioned folds as separate training and test datasets e.g., with k=3 folds,
I think we can bring back the "folds, which are used as ..." part
Test build #61921 has finished for PR 13894 at commit
cc @holdenk @MLnick @jkbradley. Does the current state look good?
Test build #62185 has finished for PR 13894 at commit
Test build #62187 has finished for PR 13894 at commit
Test build #62412 has finished for PR 13894 at commit
@krishnakalyan3 I think the merge conflicts still need to be resolved, as does the Python style issue. Subject to those, this LGTM now.
CrossValidatorModel contains the model with the highest average cross-validation
metric across folds and uses this model to transform input data. CrossValidatorModel
also tracks the metrics for each param map evaluated.
You've got some extra whitespace here.
Test build #62591 has finished for PR 13894 at commit
@krishnakalyan3 still merge conflicts - could you rebase to current master?
Test build #62631 has finished for PR 13894 at commit
jenkins retest this please
Test build #62921 has finished for PR 13894 at commit
Merged to master. Thanks!
@MLnick @holdenk @jkbradley thanks for the reviews. |
What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.
How was this patch tested?
Documentation update