-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix so not preparing data twice when calling model selector fit method #251
Conversation
Codecov Report
@@ Coverage Diff @@
## master #251 +/- ##
==========================================
- Coverage 86.53% 82.59% -3.95%
==========================================
Files 314 314
Lines 10297 10301 +4
Branches 331 537 +206
==========================================
- Hits 8911 8508 -403
- Misses 1386 1793 +407
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #251 +/- ##
==========================================
- Coverage 86.56% 86.55% -0.02%
==========================================
Files 314 314
Lines 10294 10298 +4
Branches 339 342 +3
==========================================
+ Hits 8911 8913 +2
- Misses 1383 1385 +2
Continue to review full report at Codecov.
|
* @param data | ||
* @return Parameters set in examining data | ||
*/ | ||
def examine(data: Dataset[Row]): Option[SplitterSummary] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is designed to have a side-effects for prepare whereas examine sounds read-only. Can we call it something like setupPrepare
or prePrepare
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about assessForPrepare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same read-only connotation, I think your current is preValidationPrepare
is good
@@ -113,6 +113,7 @@ E <: Estimator[_] with OpPipelineStage2[RealNN, OPVector, Prediction]] | |||
protected[op] def findBestEstimator(data: Dataset[_], dag: StagesDAG, persistEveryKStages: Int = 0) | |||
(implicit spark: SparkSession): Unit = { | |||
|
|||
splitter.map(_.examine(data.select(labelColName).toDF())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we don't use the result of transformation let us do splitter.foreach(_.examine(...))
@leahmcguire please update PR title |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. modulo rename followup work
/** | ||
* Split into a training set and a test set and balance the training set | ||
* Function to use examine the data set to set parameters for preparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Search/replace examine
, sorry for the post-rename work.
Don't realize the bug, but it's been a while. I remember something like the first call of Splitter would "examine", but doesn't seem to be the case |
@mweilsalesforce yes that it what I remember as well but it is not the case now - it is hard to unravel the history since this got moved to the public repo about the time this code went in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, one comment
*/ | ||
def validationPrepare(data: Dataset[Row]): Dataset[Row] = { | ||
|
||
if (summary.isEmpty) throw new RuntimeException("Cannot call prepare until examine has been called") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
examine
was renamed to preValidationPrepare
? so please update the error message.
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
The prepare method on the spliters had unclear use - it was used both to measure the properties of the data and to resample data for cross validation / training split. This meant that the fit method of the model selectors had a bug where it would rebalance data before cross validation as well as during cross validation leading to potential label leakage. This was not generally a problem as within a workflow we called the findBestEstimator method rather than fit - but it would lead to bad results for anyone using the model selectors as a stand alone spark stage.
Describe the proposed solution
made the base splitter class have two methods which are called independently. one to measure the data and decide how to clean or rebalance it and one to actually do the cleaning or rebalancing
Describe alternatives you've considered
make prepare only estimate once - however this could lead to unexpected results