-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs #16009
Conversation
Test build #69152 has finished for PR 16009 at commit
|
Test build #69157 has finished for PR 16009 at commit
|
@@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends Params | |||
* | |||
* @group param | |||
*/ | |||
@Since("1.6.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are the@since
removed, btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we don't add since
tag to variables and functions in traits, since they may be inherited by new child classes later on and the tag is incorrect for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens in several other places though, are we going to remove them all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, theoretically we should do that, but I'm not very confidence whether this change is appropriate. If we meet an agreement on how to deal with this issue, we can address other places in this PR or follow-up work. cc @jkbradley
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I think it's OK to have Since annotations in the trait since it is private and should never be used beyond ChiSqSelector and ChiSqSelectorModel. That seems pretty safe to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's safe for this case. However, I found lots of other traits which is also safe enough to add since tag but did not add. I reverted this part of change in this PR to make it catch another RC of 2.1, and I think we should unify them in a separate work. Thanks.
Test build #69184 has finished for PR 16009 at commit
|
Test build #69185 has finished for PR 16009 at commit
|
@@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends Params | |||
* | |||
* @group param | |||
*/ | |||
@Since("1.6.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens in several other places though, are we going to remove them all?
@@ -70,17 +70,16 @@ private[feature] trait QuantileDiscretizerBase extends Params | |||
* invalid values), error (throw an error), or keep (keep invalid values in a special additional | |||
* bucket). | |||
* Default: "error" | |||
* TODO: Reuse handleInvalid in HasHandleInvalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a note to add "HasHandleInvalid" as a shared param? I was a bit confused by it at first. Maybe we can reference a jira number for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, updated.
Test build #69316 has finished for PR 16009 at commit
|
@@ -1188,7 +1188,9 @@ categorical features. The number of bins is set by the `numBuckets` parameter. I | |||
that the number of buckets used will be smaller than this value, for example, if there are too few | |||
distinct values of the input to create enough distinct quantiles. | |||
|
|||
NaN values: Note also that QuantileDiscretizer | |||
NaN values: | |||
NaN values will be removed from the column when `QuantileDiscretizer` fitting. This will produce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"from the column when" -> "from the column during"
"model for making prediction and transformation" -> "model for making predictions"
Test build #69381 has finished for PR 16009 at commit
|
NaN values: Note also that QuantileDiscretizer | ||
NaN values: | ||
NaN values will be removed from the column during `QuantileDiscretizer` fitting. This will produce | ||
a `Bucketizer` model for making predictions. During the transformation, `Bucketizer` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, what does this mean exactly? When I train a QuantileDiscretizer to handle NaN values, I get back a Bucketizer that also handles them. The bucketizer does not raise an error when encountering NaNs. That seems contradictory to this statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuantileDiscretizer always drops NaNs during fitting, so it will not throw an error for a dataset with NaNs even if handleInvalid = "error"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying that!
LGTM |
## What changes were proposed in this pull request? API review for 2.1, except ```LSH``` related classes which are still under development. ## How was this patch tested? Only doc changes, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16009 from yanboliang/spark-18318. (cherry picked from commit 60022bf) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
## What changes were proposed in this pull request? API review for 2.1, except ```LSH``` related classes which are still under development. ## How was this patch tested? Only doc changes, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16009 from yanboliang/spark-18318.
## What changes were proposed in this pull request? API review for 2.1, except ```LSH``` related classes which are still under development. ## How was this patch tested? Only doc changes, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16009 from yanboliang/spark-18318.
What changes were proposed in this pull request?
API review for 2.1, except
LSH
related classes which are still under development.How was this patch tested?
Only doc changes, no new tests.