Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs #16009

Closed
wants to merge 8 commits into from

Conversation

yanboliang
Copy link
Contributor

What changes were proposed in this pull request?

API review for 2.1, except LSH related classes which are still under development.

How was this patch tested?

Only doc changes, no new tests.

@SparkQA
Copy link

SparkQA commented Nov 25, 2016

Test build #69152 has finished for PR 16009 at commit 8c36940.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 25, 2016

Test build #69157 has finished for PR 16009 at commit f494a47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends Params
*
* @group param
*/
@Since("1.6.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are the@since removed, btw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we don't add since tag to variables and functions in traits, since they may be inherited by new child classes later on and the tag is incorrect for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens in several other places though, are we going to remove them all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, theoretically we should do that, but I'm not very confidence whether this change is appropriate. If we meet an agreement on how to deal with this issue, we can address other places in this PR or follow-up work. cc @jkbradley

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I think it's OK to have Since annotations in the trait since it is private and should never be used beyond ChiSqSelector and ChiSqSelectorModel. That seems pretty safe to me.

Copy link
Contributor Author

@yanboliang yanboliang Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's safe for this case. However, I found lots of other traits which is also safe enough to add since tag but did not add. I reverted this part of change in this PR to make it catch another RC of 2.1, and I think we should unify them in a separate work. Thanks.

@SparkQA
Copy link

SparkQA commented Nov 26, 2016

Test build #69184 has finished for PR 16009 at commit 0b1cced.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2016

Test build #69185 has finished for PR 16009 at commit eae0d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends Params
*
* @group param
*/
@Since("1.6.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens in several other places though, are we going to remove them all?

@@ -70,17 +70,16 @@ private[feature] trait QuantileDiscretizerBase extends Params
* invalid values), error (throw an error), or keep (keep invalid values in a special additional
* bucket).
* Default: "error"
* TODO: Reuse handleInvalid in HasHandleInvalid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a note to add "HasHandleInvalid" as a shared param? I was a bit confused by it at first. Maybe we can reference a jira number for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, updated.

@SparkQA
Copy link

SparkQA commented Nov 29, 2016

Test build #69316 has finished for PR 16009 at commit 019e5af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1188,7 +1188,9 @@ categorical features. The number of bins is set by the `numBuckets` parameter. I
that the number of buckets used will be smaller than this value, for example, if there are too few
distinct values of the input to create enough distinct quantiles.

NaN values: Note also that QuantileDiscretizer
NaN values:
NaN values will be removed from the column when `QuantileDiscretizer` fitting. This will produce
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"from the column when" -> "from the column during"
"model for making prediction and transformation" -> "model for making predictions"

@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69381 has finished for PR 16009 at commit 27b07ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

NaN values: Note also that QuantileDiscretizer
NaN values:
NaN values will be removed from the column during `QuantileDiscretizer` fitting. This will produce
a `Bucketizer` model for making predictions. During the transformation, `Bucketizer`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, what does this mean exactly? When I train a QuantileDiscretizer to handle NaN values, I get back a Bucketizer that also handles them. The bucketizer does not raise an error when encountering NaNs. That seems contradictory to this statement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QuantileDiscretizer always drops NaNs during fitting, so it will not throw an error for a dataset with NaNs even if handleInvalid = "error"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying that!

@jkbradley
Copy link
Member

LGTM
Merging with master and branch-2.1
Thank you!

asfgit pushed a commit that referenced this pull request Nov 30, 2016
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16009 from yanboliang/spark-18318.

(cherry picked from commit 60022bf)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 60022bf Nov 30, 2016
@yanboliang yanboliang deleted the spark-18318 branch December 1, 2016 04:24
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#16009 from yanboliang/spark-18318.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#16009 from yanboliang/spark-18318.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants