Commit
Update docs for QuantileDiscretizer
yanboliang committed Nov 29, 2016
1 parent eae0d2c commit 019e5af
Showing 3 changed files with 11 additions and 7 deletions.
4 changes: 3 additions & 1 deletion docs/ml-features.md
@@ -1188,7 +1188,9 @@ categorical features. The number of bins is set by the `numBuckets` parameter. I
  that the number of buckets used will be smaller than this value, for example, if there are too few
  distinct values of the input to create enough distinct quantiles.

- NaN values: Note also that QuantileDiscretizer
+ NaN values:
+ NaN values will be removed from the column when `QuantileDiscretizer` fitting. This will produce
+ a `Bucketizer` model for making prediction and transformation. During the transformation, `Bucketizer`
  will raise an error when it finds NaN values in the dataset, but the user can also choose to either
  keep or remove NaN values within the dataset by setting `handleInvalid`. If the user chooses to keep
  NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets
@@ -82,11 +82,11 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
  * invalid values), error (throw an error), or keep (keep invalid values in a special additional
  * bucket).
  * Default: "error"
- * TODO: Reuse handleInvalid in HasHandleInvalid.
  * @group param
  */
+ // TODO: SPARK-18619 Make Bucketizer inherit from HasHandleInvalid.
  @Since("2.1.0")
- val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle" +
+ val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
  "invalid entries. Options are skip (filter out rows with invalid values), " +
  "error (throw an error), or keep (keep invalid values in a special additional bucket).",
  ParamValidators.inArray(Bucketizer.supportedHandleInvalids))
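
The three `handleInvalid` options named in this parameter doc can be modeled in plain Scala. This is an illustrative sketch only, not Spark's implementation: `HandleInvalidModel`, `bucketOf`, and `transform` are hypothetical names, and the sketch assumes each bucket covers [lower, upper) except the last, which also includes its upper bound.

```scala
// Plain-Scala model of skip / error / keep; hypothetical names, not Spark code.
object HandleInvalidModel {

  // Index of the bucket that contains x, given sorted split points.
  // n + 1 splits define n buckets; the last bucket includes its upper bound.
  def bucketOf(splits: Array[Double], x: Double): Int = {
    require(x >= splits.head && x <= splits.last, s"$x is outside the splits")
    if (x == splits.last) splits.length - 2
    else splits.lastIndexWhere(_ <= x)
  }

  def transform(splits: Array[Double], values: Seq[Double], handleInvalid: String): Seq[Int] = {
    val numBuckets = splits.length - 1
    handleInvalid match {
      // skip: rows with NaN are filtered out before bucketing
      case "skip" => values.filterNot(_.isNaN).map(bucketOf(splits, _))
      // keep: NaN goes into one additional bucket after the regular ones
      case "keep" => values.map(x => if (x.isNaN) numBuckets else bucketOf(splits, x))
      // error: any NaN aborts the whole transformation
      case "error" =>
        if (values.exists(_.isNaN)) {
          throw new IllegalArgumentException("NaN found; set handleInvalid to skip or keep")
        }
        values.map(bucketOf(splits, _))
      case other =>
        throw new IllegalArgumentException(s"unsupported handleInvalid: $other")
    }
  }
}
```

With five split points (four buckets), this model assigns regular values to indices 0 through 3 and, under `keep`, NaN to index 4.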
@@ -70,10 +70,10 @@ private[feature] trait QuantileDiscretizerBase extends Params
  * invalid values), error (throw an error), or keep (keep invalid values in a special additional
  * bucket).
  * Default: "error"
- * TODO: Reuse handleInvalid in HasHandleInvalid.
  * @group param
  */
- val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle" +
+ // TODO: SPARK-18619 Make QuantileDiscretizer inherit from HasHandleInvalid.
+ val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
  "invalid entries. Options are skip (filter out rows with invalid values), " +
  "error (throw an error), or keep (keep invalid values in a special additional bucket).",
  ParamValidators.inArray(Bucketizer.supportedHandleInvalids))
@@ -90,8 +90,10 @@ private[feature] trait QuantileDiscretizerBase extends Params
  * possible that the number of buckets used will be smaller than this value, for example, if there
  * are too few distinct values of the input to create enough distinct quantiles.
  *
- * NaN handling: Note also that
- * QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user can
+ * NaN handling:
+ * NaN values will be removed from the column when `QuantileDiscretizer` fitting. This will produce
+ * a `Bucketizer` model for making prediction and transformation. During the transformation,
+ * `Bucketizer` will raise an error when it finds NaN values in the dataset, but the user can
  * also choose to either keep or remove NaN values within the dataset by setting `handleInvalid`.
  * If the user chooses to keep NaN values, they will be handled specially and placed into their own
  * bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3],
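
For readers of this commit, the behavior described in the updated docs might be exercised roughly as follows. This is a minimal sketch, assuming Spark 2.1.0 or later on the classpath and a local `SparkSession`; the object name, the column names `value` and `bucket`, and the sample data are made up for illustration.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

// Hypothetical driver object; not part of this commit.
object HandleInvalidSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("handleInvalid-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data with a single NaN entry.
    val df = Seq(0.1, 0.4, 1.2, 1.5, Double.NaN).toDF("value")

    val discretizer = new QuantileDiscretizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setNumBuckets(4)
      .setHandleInvalid("keep") // alternatives: "skip", or "error" (the default)

    // Fitting produces a Bucketizer model, as the updated docs describe.
    val bucketizer = discretizer.fit(df)

    // With "keep", NaN rows are assigned to one additional bucket
    // after the regular ones instead of causing a failure.
    bucketizer.transform(df).show()

    spark.stop()
  }
}
```

Switching the setting to `skip` drops the NaN row from the output, while leaving it at `error` makes the transformation fail when it meets the NaN value.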
