[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc #13894

Closed
wants to merge 11 commits into from
@@ -55,7 +55,11 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {
}

/**
* K-fold cross validation.
* K-fold cross validation performs model selection by splitting the dataset into a set of
* non-overlapping, randomly partitioned folds which are used as separate training and test
* datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test)
* dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold
* is used as the test set exactly once.
*/
@Since("1.2.0")
class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
@@ -188,7 +192,9 @@ object CrossValidator extends MLReadable[CrossValidator] {
}

/**
* Model from k-fold cross validation.
* CrossValidatorModel contains the model with the highest average cross-validation
* metric across folds and uses this model to transform input data. CrossValidatorModel
* also tracks the metrics for each param map evaluated.
*
* @param bestModel The best model selected from k-fold cross validation.
* @param avgMetrics Average cross-validation metrics for each paramMap in
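To make the k=3 example in the CrossValidator doc comment concrete, here is a minimal pure-Python sketch of how non-overlapping, randomly partitioned folds yield (training, test) pairs. It is an illustration only, not Spark's implementation; the helper name `k_fold_pairs` and its `seed` parameter are hypothetical, and exact fold sizes are an assumption of this sketch.

```python
import random


def k_fold_pairs(rows, k, seed=42):
    """Return k (training, test) pairs built from non-overlapping random folds.

    Hypothetical helper for illustration; not part of Spark.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random partitioning
    # Fold i takes every k-th element starting at offset i, so folds never overlap.
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]  # each fold is the test set exactly once
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((training, test))
    return pairs


pairs = k_fold_pairs(range(9), k=3)
for training, test in pairs:
    # With 9 rows and k=3: 6 rows (2/3) for training, 3 rows (1/3) for testing.
    print(len(training), len(test))
```

With 9 rows and k=3 this prints `6 3` three times, matching the 2/3 and 1/3 proportions described in the doc comment.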
13 changes: 11 additions & 2 deletions python/pyspark/ml/tuning.py
@@ -143,7 +143,13 @@ def getEvaluator(self):

class CrossValidator(Estimator, ValidatorParams):
"""
K-fold cross validation.

K-fold cross validation performs model selection by splitting the dataset into a set of
non-overlapping, randomly partitioned folds which are used as separate training and test
datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test)
dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold
is used as the test set exactly once.


>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.evaluation import BinaryClassificationEvaluator
@@ -260,7 +266,10 @@ def copy(self, extra=None):

class CrossValidatorModel(Model, ValidatorParams):
"""
Model from k-fold cross validation.

CrossValidatorModel contains the model with the highest average cross-validation
metric across folds and uses this model to transform input data. CrossValidatorModel
also tracks the metrics for each param map evaluated.

.. versionadded:: 1.4.0
"""
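The CrossValidatorModel description (best model chosen by average metric across folds, with the per-param-map metrics retained) can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the code in tuning.py: `select_best`, `larger_is_better`, and the metric values are all hypothetical. The `larger_is_better` flag is included on the assumption that some evaluator metrics (an error measure, for instance) are better when lower.

```python
def select_best(metrics_per_param_map, larger_is_better=True):
    """Average each param map's per-fold metrics and pick the best index.

    Hypothetical helper for illustration; not part of Spark.
    """
    avg_metrics = [sum(m) / len(m) for m in metrics_per_param_map]
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return best_index, avg_metrics


# Made-up metrics: one inner list per param map, one value per fold (k=3).
fold_metrics = [
    [0.25, 0.5, 0.75],  # param map 0
    [0.5, 0.75, 1.0],   # param map 1
]
best_index, avg_metrics = select_best(fold_metrics)
print(best_index, avg_metrics)
```

Here `avg_metrics` plays the role of the per-param-map averages the model tracks, and `best_index` identifies the param map whose model would be used to transform input data.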